# 1. Introduction

Digital video surveillance is prevalent in our daily life. Large numbers of monitoring cameras are installed in public and private places, such as government buildings, military bases, and car parks. To obtain high quality surveillance, video denoising techniques have been well studied in the field of image processing. Apart from denoising itself, these techniques can be used to increase compression efficiency, reduce transmission bandwidth, and improve the effectiveness of further processes, such as feature extraction, object detection, and pattern classification.

Even though video and image denoising can be considered different research topics, some basic image denoising ideas and algorithms are borrowed for video denoising, such as the Gaussian filter, the bilateral filter [1-2], domain transformation [3-5], similar-block matching [4-6, 28-29], and sparse representations [30-32]. Compared to a single image, video provides additional information from nearby frames, which can lead to better denoising results. Moreover, with the emergence of multi-resolution tools such as the wavelet transform [7-8], video denoising methods that operate in the transform domain have been proposed continually [9-13]. Zlokolica et al. [9] introduced new wavelet-based motion reliability measures and performed motion estimation and adaptive recursive temporal filtering in a closed loop, followed by an intra-frame spatially adaptive filter. Rahman et al. [10] proposed a joint probability density function to model the video wavelet coefficients of any two neighboring frames, and then applied this statistical model for denoising. Jovanov et al. [11] reused motion estimation resources from a video-coding module for video denoising. They proposed a novel motion field-filtering step and a novel recursive temporal filter with the reliability of the estimated motion field appropriately defined. Yu et al. [12] integrated both spatial filtering and recursive temporal filtering into the 3-D wavelet domain and effectively exploited spatial and temporal redundancies. Maggioni et al. [13] exploited the temporal and nonlocal correlation of the video and constructed 3-D spatiotemporal volumes separately by tracking blocks along trajectories defined by motion vectors. Jin et al. [33] proposed a multi-resolution motion analysis method in the wavelet domain. In [34], the change was estimated in the 3D SCT domain. Lian et al. [35] used vector estimation of wavelet coefficients. In addition, other video denoising methods, such as one based on low-rank matrix completion [14], have also achieved good results.

Video denoising technology has made great progress over the previous decades. However, most existing methods cannot obtain satisfactory results when dealing with heavily noisy video sequences captured in low-light environments. Handling such sequences is urgently needed in many fields, especially security monitoring, where a camera is mounted at a stable position with a fixed angle, so that the captured video sequences have relatively unchanged backgrounds. In practical applications, the characteristics of both still and moving objects must be clearly visible in the video sequences. This requirement is easily satisfied during the day. At night, however, statistical noise caused by low illumination seriously degrades the video sequences.

In this paper, a novel video denoising method based on the Kalman filter is proposed. Taking advantage of the strong spatiotemporal correlations of neighboring frames, motion estimation based on intensity and structure tensor [15-17] is performed by comparing the current noisy frame with previous denoised frames. Then, based on the motion estimation results, the current noisy frame is processed in the temporal domain using the Kalman filter [18]. During the filtering process, different positions of the noisy frame receive different filtering strengths according to the motion estimation results. Moving positions receive weak filtering so that their motion characteristics are preserved, whereas still positions receive strong filtering to reduce noise. Simultaneously, the noisy frame is also processed in the spatial domain using the Wiener filter [19]. Finally, the two denoised frames from Kalman and Wiener filtering are weighted to obtain the result. The still region is obtained largely from Kalman filtering, while the moving region is obtained largely from Wiener filtering. Experimental results show that the proposed method compares favorably with competing video denoising methods, particularly at high noise levels.

The remainder of the paper is organized as follows. Section 2 describes our proposed video denoising method. Section 3 introduces the criteria used for quantitative quality evaluation of the denoising results. Section 4 presents the experiments and discusses the results. Finally, Section 5 concludes this article.

# 2. Proposed Denoising Method

Fig. 1 illustrates the diagram of our proposed video denoising method. The denoising of the current noisy frame involves not only the frame itself, but also a series of previously denoised frames. Motion estimation is performed based on intensity and structure tensor between the current noisy frame and the previous denoised frames. Then, the estimation results guide the Kalman filtering of the current noisy frame. In this operation, the final denoised frame from Kalman filtering is needed. Simultaneously, Wiener spatial filtering is also performed on the current noisy frame. Thus, after processing, two denoised frames are obtained: one from Kalman filtering and the other from Wiener filtering. Finally, the two denoised frames are weighted to obtain the result.

**Fig. 1.** Diagram of the proposed video denoising method

## 2.1 Motion Estimation based on Intensity and Structure Tensor

To take advantage of the strong correlations between adjacent frames, intensity and structure tensor based motion estimation is performed by comparing the current noisy frame with previous denoised frames.

### 2.1.1 Intensity based Motion Estimation

To suppress the influence of noise, a strong filter is first used to pre-process the noisy images. Prefiltering is frequently used in denoising algorithms, such as VBM3D [4]. Considering both the algorithm complexity and the noise-suppressing ability, we employ a Gaussian filter with a large kernel size. Then, the intensity distance can be calculated as follows.

In the above equation, k is the temporal index of the frame. In particular, i is the current frame’s index, namely, k = …,i-2,i-1,i,i+1,i+2,… . pk is the pixel value at a given position of frame k. In particular, pi is the pixel value of the current frame. Kρ1 is the Gaussian filter kernel with standard deviation ρ1. dI(k,i) is the intensity distance between frame k and frame i.

Fig. 2(a1) and (a2) are the past and current frames with additive white Gaussian noise of σ=50. Before calculating the intensity distance, the two frames are prefiltered with a 10×10 Gaussian filter whose ρ1=5, and the results are shown in Fig. 2(b1) and (b2). The choice of the filter kernel follows the noise level: the larger the noise, the larger the kernel size. Then, the intensity distance is calculated from these two prefiltered frames, and the result is shown in Fig. 2(b3).

**Fig. 2.** Intensity based motion estimation. (a1) and (a2) are the past and current frames with additive white Gaussian noise (σ=50). (b1) and (b2) are the prefiltered results of (a1) and (a2) with a 10×10 Gaussian filter whose ρ1=5. (b3) is the intensity distance of (b1) and (b2).
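The prefilter-then-compare step can be sketched as follows. This is only an illustration under stated assumptions: the distance is taken here as the absolute difference of the two prefiltered frames, the kernel has an odd number of taps (11 rather than 10), and the helper names `gaussian_blur` and `intensity_distance` are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2*radius + 1 taps."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma, radius):
    """Separable Gaussian blur with edge replication at the borders."""
    k = gaussian_kernel_1d(sigma, radius)
    padded = np.pad(np.asarray(img, dtype=np.float64), radius, mode="edge")
    # Horizontal pass over every row, then vertical pass over every column.
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def intensity_distance(past, current, sigma=5.0, radius=5):
    """Assumed form: |K * p_k - K * p_i| evaluated per pixel."""
    return np.abs(gaussian_blur(current, sigma, radius)
                  - gaussian_blur(past, sigma, radius))

# Toy example: a bright block appears in an otherwise static frame.
past = np.zeros((32, 32))
current = past.copy()
current[10:20, 10:20] = 100.0
d_i = intensity_distance(past, current)
```

In this toy case the distance map is zero far from the block and largest near its center, which is the behavior a motion map needs.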

### 2.1.2 Structure tensor based Motion Estimation

Although the strong prefilter effectively suppresses large-scale noise, it also destroys the edges of the motion area. Some detail variations are damaged or even lost. Weickert et al. [15-17] proposed the structure tensor as a tool for analyzing image structure, extracting geometric features, etc. In this paper, the simple linear structure tensor is used to analyze the image. This simple linear structure tensor is defined as

$$J_{\rho_2}(p) = K_{\rho_2} * \left(\nabla p_{\sigma'} \otimes \nabla p_{\sigma'}\right) = K_{\rho_2} * \begin{pmatrix} I_x^2(p_{\sigma'}) & I_x(p_{\sigma'})\,I_y(p_{\sigma'}) \\ I_x(p_{\sigma'})\,I_y(p_{\sigma'}) & I_y^2(p_{\sigma'}) \end{pmatrix}$$

In the above equation, ∇ is the image gradient operator, and pσ' is the Gaussian filtered image of input p with Gaussian standard deviation σ'. In addition, ⊗ is the tensor product. Ix(pσ') and Iy(pσ') are the image gradients in the x and y directions. Moreover, * is the convolution of the Gaussian filter Kρ2, with standard deviation ρ2, and the tensor product. Generally, ρ2 > σ'. The Gaussian filter with σ' applied before the gradient operation and the filter Kρ2 together play the role of the strong prefilter. The Gaussian filter Kρ2 isotropically integrates the structure tensor information of the local neighborhood, and the result is thus called the “linear structure tensor.”

Jρ2 contains the geometric structure information of the image. By orthogonally decomposing Jρ2, we obtain the eigenvalues λ1 and λ2 (λ1 ≥ λ2) and their corresponding eigenvectors. The eigenvectors reflect the directions of the image structures, and the eigenvalues describe the strength of the structure along these directions. The eigenvector corresponding to the maximum eigenvalue λ1 indicates the direction of maximum gradient contrast, i.e., the normal direction. The eigenvector corresponding to λ2 indicates the tangential direction.

Different image structures can be described using different eigenvalues. Usually, λ1+λ2 is used to reflect the strength of the structure. Fig. 3(1) and (2) show the maps of the structure strength extracted from the noisy frames in Fig. 2(a1) and (a2), respectively.

**Fig. 3.** Structure tensor based motion estimation. (1) and (2) are the maps of the structure strength λ1+λ2 extracted from the noisy frames in Fig. 2(a1) and (a2). (3) is the Log-Euclidean metric distance of (1) and (2).
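A minimal sketch of the linear structure tensor and its strength map is given below. It assumes central-difference gradients (`np.gradient`), a kernel radius of three standard deviations, and illustrative values for σ' and ρ2; the function names are hypothetical. Because the tensor is a symmetric 2×2 matrix, λ1+λ2 equals its trace, and the eigenvalues have a closed form.

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur (kernel radius of 3*sigma is an assumption)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    p = np.pad(np.asarray(img, dtype=np.float64), radius, mode="edge")
    p = np.apply_along_axis(np.convolve, 1, p, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, p, k, mode="valid")

def structure_tensor(img, sigma_pre=1.0, rho=2.0):
    """Return the three distinct entries of the smoothed tensor per pixel."""
    smoothed = blur(img, sigma_pre)
    gy, gx = np.gradient(smoothed)  # axis 0 is y (rows), axis 1 is x (columns)
    return blur(gx * gx, rho), blur(gx * gy, rho), blur(gy * gy, rho)

def eigenvalues(jxx, jxy, jyy):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, lam1 >= lam2."""
    half_trace = 0.5 * (jxx + jyy)
    root = np.sqrt(0.25 * (jxx - jyy) ** 2 + jxy ** 2)
    return half_trace + root, half_trace - root

# Toy example: a vertical step edge in the middle of the frame.
img = np.zeros((32, 32))
img[:, 16:] = 100.0
jxx, jxy, jyy = structure_tensor(img)
lam1, lam2 = eigenvalues(jxx, jxy, jyy)
strength = jxx + jyy  # lambda1 + lambda2 = trace
```

On the step edge, the strength is concentrated along the edge and vanishes in the flat areas, while λ2 stays near zero because the structure has a single dominant orientation.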

When motion occurs, variation in the structure tensor is unavoidable, so the structure tensor can be used to detect motion. To do so, a distance between structure tensors must be measured. Given that the structure tensor resides in a non-Euclidean space, we use a Riemannian metric called the Log-Euclidean metric [20], which is simple and fast to compute. The metric is computed as

$$d_{ST}(i) = \sqrt{\mathrm{Trace}\left(\left(\log\left(J_{\rho_2}(p_{current})\right) - \log\left(J_{\rho_2}(p_{past,i})\right)\right)^2\right)}$$

In the above equation, Trace(·) is the trace of the matrix, and log(·) is the structure tensor logarithmic operator defined in [20]. In addition, Jρ2(pcurrent) represents the structure tensor of the current noisy frame, and Jρ2(ppast,i) represents the structure tensor of the i-th previous denoised frame. Fig. 3(3) shows the Log-Euclidean metric distance of Figs. 3(1) and (2).
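For a single pixel, the Log-Euclidean distance between two 2×2 structure tensors can be sketched as below. The matrix logarithm is computed through an eigendecomposition; the clamping constant `eps`, which guards against zero eigenvalues in flat regions, is an assumption of this sketch and is not taken from [20].

```python
import numpy as np

def spd_log(m, eps=1e-12):
    """Matrix logarithm of a symmetric positive (semi-)definite matrix."""
    w, v = np.linalg.eigh(m)
    w = np.maximum(w, eps)           # assumed safeguard for flat regions
    return (v * np.log(w)) @ v.T     # V diag(log w) V^T

def log_euclidean_distance(a, b):
    """sqrt(Trace((log a - log b)^2)) for two symmetric tensors."""
    d = spd_log(a) - spd_log(b)
    return float(np.sqrt(np.trace(d @ d)))

identity = np.eye(2)
scaled = np.e * np.eye(2)            # log(scaled) is the identity matrix
dist = log_euclidean_distance(identity, scaled)
```

Here `log(identity)` is the zero matrix and `log(scaled)` is the identity, so the distance is the Frobenius norm of the 2×2 identity, i.e. √2.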

Structure tensor based motion estimation is a good supplement for intensity based motion estimation. The intensity and structure tensor combined motion estimation is shown in Fig. 4. The combination follows:

**Fig. 4.** Intensity and structure tensor combined change segmentation

where α and β are weighted parameters. In Fig. 4, α=0.1 and β=1.

## 2.2 Motion Estimation based Kalman Filtering in Temporal Domain

The discrete Kalman filter [18] provides an efficient recursive solution to the least-squares estimation problem.

Each filtering step consists of two consecutive stages, namely, prediction and updating.

The prediction equations are defined as

$$x_k^- = A_k x_{k-1}^+ + B_k u_k$$

and

$$p_k^- = A_k p_{k-1}^+ A_k^T + Q_{k-1}$$

where the superscripts “−” and “+” denote “before” and “after” each measurement, respectively. Moreover, $x_{k-1}^+$ represents the estimated state matrix and $p_{k-1}^+$ the state covariance matrix of the last state; $x_k^-$ and $p_k^-$ represent the a priori estimates of the state matrix and the state covariance matrix for the current state, respectively; and $A_k$ represents the state transition matrix that relates the present state to the previous one. Matrix $B_k$ relates the control input $u_k$ to the current state, and $Q_{k-1}$ represents the covariance matrix of the process noise.

In our proposed method, we attempt to estimate the current frame based on the last one. Thus, the state matrix in the equations is the frame matrix. Furthermore, no control input is available; hence, $u_k = 0$. The a priori estimate for the current state is assumed to be the same as the estimate of the previous state, so $A_k$ is an identity matrix. Then, the following equations can be obtained.

$$x_k^- = x_{k-1}^+$$

$$p_k^- = p_{k-1}^+ + Q_{k-1}$$

Motion in the video sequences introduces the process noise. Thus, for any pixel (x,y) of the current noisy frame,

which keeps the covariance of motion region larger than that of the still region.

The updating equations are defined as

$$Kg_k = p_k^- H_k^T \left(H_k p_k^- H_k^T + R_k\right)^{-1}$$

$$x_k^+ = x_k^- + Kg_k \left(z_k - H_k x_k^-\right)$$

$$p_k^+ = \left(I - Kg_k H_k\right) p_k^-$$

where $Kg_k$ is the blending factor that minimizes the a posteriori error covariance, known as the Kalman gain. Variables $x_k^-$ and $p_k^-$ are the a priori estimates calculated in the prediction stage. Matrix $H_k$ describes the relationship between the measurement vector $z_k$ and the state vector. $R_k$ is the covariance matrix of the measurement noise, and $p_k^+$ is the a posteriori estimate of the state covariance matrix for the current state.

In our proposed method, the current noisy and denoised frames are described by $z_k$ and $x_k^+$, respectively. $H_k$ is the identity matrix. The measurement noise represents the noise in the video sequences. Thus, the following equations can be obtained.

$$Kg_k = p_k^- \left(p_k^- + R_k\right)^{-1}$$

$$x_k^+ = x_k^- + Kg_k \left(z_k - x_k^-\right)$$

$$p_k^+ = \left(I - Kg_k\right) p_k^-$$
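One temporal filtering step with an identity transition, no control input, and an identity measurement matrix can be sketched per pixel as follows. The sketch treats each pixel as an independent scalar state, and the process-noise map `q` is simply passed in: how it is derived from the motion estimate is not reproduced here, so the values used below are hypothetical.

```python
import numpy as np

def kalman_temporal_step(x_prev, p_prev, z, q, r):
    """One per-pixel Kalman step with A_k = I, u_k = 0, H_k = I.

    x_prev, p_prev: previous denoised frame and its variance map
    z: current noisy frame; q: process-noise map; r: measurement-noise variance
    """
    # Prediction: the previous estimate is carried over, uncertainty grows by q.
    x_prior = x_prev
    p_prior = p_prev + q
    # Update: blend prediction and measurement with the Kalman gain.
    gain = p_prior / (p_prior + r)
    x_post = x_prior + gain * (z - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_post, p_post, gain

x_prev = np.zeros((2, 2))
p_prev = np.ones((2, 2))
z = np.full((2, 2), 10.0)
# Hypothetical process noise: top row is still, bottom row is moving.
q = np.array([[0.0, 0.0], [3.0, 3.0]])
x_post, p_post, gain = kalman_temporal_step(x_prev, p_prev, z, q, r=1.0)
```

With these numbers the still pixels get a gain of 0.5 and move halfway toward the measurement, while the moving pixels get a gain of 0.8 and follow the noisy measurement more closely, which matches the weak filtering of moving regions described in the text.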

After Kalman filtering, a denoised frame can be obtained. In this frame, the still region is denoised well. However, the moving region still has much noise because the Kalman filter keeps the information of this region intact. Therefore, the noise in the moving region must still be reduced. Reducing the noise in the moving region of denoised frame from Kalman filtering is complicated. Thus, the Wiener filter [19] is applied on the entire current noisy frame. In this case, both the still and moving regions are denoised. Then, by weighting the two denoised frames using Kalman and Wiener filtering, an integrated denoised frame can be obtained. In the denoised frame, the still region is obtained by using Kalman filtering, and the moving region is obtained by using Wiener filtering.

## 2.3 Spatial-Temporal Weighting

After Kalman and Wiener filtering, two denoised frames are obtained. In the frame from Kalman filtering, the still regions are well denoised, but the moving regions remain noisy. In the frame from Wiener filtering, the moving regions are denoised to some extent. Thus, we integrate the two denoised frames by weighting them based on the motion estimation results. The weight is based on the Gaussian distribution: for any pixel at position (x,y), its weight value, wc(x,y), can be calculated as follows.

In the above equation, dIST,x,y is the corresponding motion estimation value in the position (x,y), and σc is used to control the degree of attenuation. The larger the value of motion estimation is, the smaller the weight will be. Thus, the motion and still regions can be further distinguished effectively.

The weighted denoised frame can be calculated as follows.

Here, Wc represents the weight matrix calculated using Equation (16). XKalman and XWiener represent the denoised frame matrices through Kalman filtering and Wiener filtering, respectively. Xc is simply the desired weighted frame matrix. After obtaining the weighted average, both the motion and still regions of the weighted frame have been denoised.
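The weighting step can be sketched as below. Two assumptions are made: the weight takes the Gaussian form exp(−d²/(2σc²)), and the two frames are combined as a per-pixel convex combination in which the weight multiplies the Kalman result. Both are consistent with the description (large motion gives a small weight, still regions come from Kalman filtering) but are not confirmed forms; the function names are hypothetical.

```python
import numpy as np

def gaussian_weight(motion, sigma_c):
    """Assumed weight: close to 1 where motion is small, decaying with motion."""
    return np.exp(-(motion ** 2) / (2.0 * sigma_c ** 2))

def blend(x_kalman, x_wiener, motion, sigma_c):
    """Per-pixel convex combination of the two denoised frames."""
    w = gaussian_weight(motion, sigma_c)
    return w * x_kalman + (1.0 - w) * x_wiener

x_kalman = np.full((1, 3), 10.0)
x_wiener = np.full((1, 3), 20.0)
motion = np.array([[0.0, 1.0, 100.0]])   # still, intermediate, strongly moving
x_c = blend(x_kalman, x_wiener, motion, sigma_c=1.0)
```

The still pixel takes the Kalman value, the strongly moving pixel takes the Wiener value, and the intermediate pixel lies between the two.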

## 2.4 Complexity Analysis

We assume that the size of each frame (total number of pixels) is N. The proposed method includes three steps: motion estimation, Kalman filtering, and Wiener filtering.

First, motion estimation consists of an intensity based part and a structure tensor based part. In intensity based motion estimation, the size of the Gaussian convolution kernel is assumed to be r×r. If the convolution is separated into a vertical and a horizontal pass, the time complexity is O(Nr). In our method, however, the size of the Gaussian kernel is fixed, such as 5×5, 10×10, or 15×15, and does not grow with the frame size, so the time complexity of Gaussian filtering is O(N). Calculating the intensity distance afterwards is also O(N). The total time complexity of intensity based motion estimation is therefore O(N). In structure tensor based motion estimation, the sizes of the Gaussian and gradient convolution kernels likewise do not grow with the frame size, so Gaussian filtering and the gradient operator are each O(N). Calculating the structure tensor distance is also O(N), so the total time complexity of structure tensor based motion estimation is O(N). Consequently, the total time complexity of motion estimation is O(N).

After motion estimation, Kalman filtering and Wiener filtering are performed, each with time complexity O(N). Overall, the time complexity of the proposed method is O(N), i.e., linear in the number of pixels.
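As a rough worked example of the separable-convolution argument, consider a 288×352 frame (the size used in Section 4) and a 10×10 kernel, counting one multiply-add per kernel tap per pixel and ignoring border handling:

$$N = 288 \times 352 = 101{,}376$$

$$\text{direct 2-D: } N r^2 = 101{,}376 \times 100 = 10{,}137{,}600 \qquad \text{separable: } 2Nr = 101{,}376 \times 20 = 2{,}027{,}520$$

Both counts scale linearly in N for fixed r; separation only reduces the constant factor, here by r/2 = 5.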

# 3. Denoising Validation Criteria

To provide quantitative quality evaluations of the denoising results, we employed two objective criteria, namely, PSNR and SSIM [21-23]. PSNR is defined as

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right)$$

where L is the dynamic range of the image (for 8 bits/pixel images, L = 255). MSE is the mean squared error between the original and distorted images. SSIM is first calculated within local windows using

$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}$$

where x and y are the image patches extracted from the local window of the original and distorted images, respectively. μx, σ²x, and σxy are the mean, variance, and cross-correlation computed within the local window, respectively, and C1 and C2 are small positive constants that stabilize the division. The overall SSIM score of a video frame is computed as the average of the local SSIM scores. PSNR is the most widely used quality measure in the existing literature, but it has been criticized for not correlating well with human visual perception [24]. SSIM is believed to be a better indicator of perceived image quality [24], and it also supplies a quality map that indicates the variation of image quality over space. The final PSNR and SSIM results for a denoised video sequence are computed as the frame average over the full sequence.
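The PSNR computation, 10·log10(L²/MSE), can be sketched in a few lines. The function name `psnr` and the choice to return infinity for identical frames are assumptions of this sketch.

```python
import numpy as np

def psnr(reference, distorted, dynamic_range=255.0):
    """PSNR in dB between two frames of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    dis = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dis) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: PSNR is unbounded
    return float(10.0 * np.log10(dynamic_range ** 2 / mse))

reference = np.zeros((4, 4))
offset = reference + 25.5            # MSE = 25.5^2, so L^2 / MSE = 100
value = psnr(reference, offset)      # 10 * log10(100) = 20 dB
```

For a sequence, this per-frame value would then be averaged over all frames, as described above.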

# 4. Experiments and Results

To evaluate the performance of the proposed method, we compared it with state-of-the-art video denoising algorithms, namely ST-GSM [3] and VBM3D [4]. The original codes of these two algorithms can be downloaded online [25-26]. In addition, we report the experimental results of using the Kalman filter and the Wiener filter separately.

The standard test videos can be downloaded from the video sequence database [27]. Two types of videos are available in the database, namely, those with stationary backgrounds and those with moving backgrounds. Given that our method targets videos with a stationary background, we chose four videos of the former type for our experiment: Salesman, Paris, Akiyo, and Hall. The size of each video is 288×352, and the duration is 300 frames. The experiment was conducted on the luminance channel of the video. The noisy video sequences are simulated by adding independent white Gaussian noise with a given variance σ² to each frame.
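The noise simulation can be sketched as follows. The sketch assumes floating-point frames and does not clip the result to [0, 255], since the text does not say whether clipping was applied; the fixed seed and the constant test frame are illustrative only.

```python
import numpy as np

def add_gaussian_noise(frame, sigma, rng):
    """Add independent zero-mean white Gaussian noise with std sigma."""
    clean = np.asarray(frame, dtype=np.float64)
    return clean + rng.normal(0.0, sigma, size=clean.shape)

rng = np.random.default_rng(0)            # fixed seed for reproducibility
clean = np.full((288, 352), 128.0)        # stand-in for one luminance frame
noisy = add_gaussian_noise(clean, sigma=50.0, rng=rng)
measured_sigma = float(np.std(noisy - clean))
```

Applying the function to every frame with a fresh draw from the same generator yields noise that is independent across frames.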

Table 1 shows the PSNR and SSIM results of ST-GSM, VBM3D, Kalman-only, Wiener-only, and our proposed method for the four video sequences at five noise levels. As seen from the table, neither the Kalman-only nor the Wiener-only method obtained good denoising results. When the noise level was relatively low, the proposed method worked well but still lagged behind ST-GSM and VBM3D. However, when the noise level was high, the proposed method performed better than ST-GSM and VBM3D for most test sequences. In particular, the SSIM of our proposed method was better than that of the other two algorithms.

**Table 1.** PSNR and SSIM Comparisons of Video Denoising Algorithms for Four Video Sequences at Five Noise Levels

Fig. 5 demonstrates the visual effects of the above five video denoising algorithms. Specifically, Frame 100 was extracted from the Akiyo sequence together with a noisy version of the same frame. The denoised frames were obtained by using the five video denoising algorithms. The Kalman-only method and our proposed method are clearly effective at suppressing background noise. However, the Kalman-only method fails to remove the noise in the moving region, such as the woman’s head in the frame, while our method suppresses the noise in the moving region to some extent. This finding is further verified by examining the SSIM quality maps of the corresponding frames. The results show that our proposed method is effective for heavily noisy video sequences and can achieve state-of-the-art denoising performance.

**Fig. 5.** Denoising results of frame 100 in the Akiyo sequence corrupted with noise with a standard deviation σ = 100. (a1) to (a7): Frames in the original, noisy, ST-GSM [3], VBM3D [4], Kalman-only, Wiener-only, and our proposed method denoised sequences. (b2) to (b7): Corresponding SSIM quality maps (brighter areas indicate larger SSIM values).

# 5. Conclusion

This paper presented a video denoising method based on the Kalman filter for heavily noisy video signals. The method was applied to the restoration of noisy video sequences with added white Gaussian noise. Motion estimation was performed using intensity and structure tensor, comparing the current noisy frame with previous denoised frames. Then, the Kalman and Wiener filters were applied to the current noisy frame. Finally, the denoised frames from the two filters were weighted to obtain the result. The experimental comparisons with state-of-the-art algorithms show that the proposed method achieved competitive results for heavily noisy video sequences with a fixed background in terms of both subjective and objective evaluations.