DOI QR코드

DOI QR Code

A Robust Video Fingerprinting Algorithm Based on Centroid of Spatio-temporal Gradient Orientations

  • Sun, Ziqiang (Communication & Information Security Lab, Shenzhen Graduate School, Peking University) ;
  • Zhu, Yuesheng (Communication & Information Security Lab, Shenzhen Graduate School, Peking University) ;
  • Liu, Xiyao (Communication & Information Security Lab, Shenzhen Graduate School, Peking University) ;
  • Zhang, Liming (Faculty of Science and Technology, University of Macau)
  • Received : 2013.07.09
  • Accepted : 2013.11.10
  • Published : 2013.11.30

Abstract

Video fingerprints generated from global features are usually vulnerable against general geometric transformations. In this paper, a novel video fingerprinting algorithm is proposed, in which a new spatio-temporal gradient is designed to represent the spatial and temporal information for each frame, and a new partition scheme, based on concentric circle and rings, is developed to resist the attacks efficiently. The centroids of spatio-temporal gradient orientations (CSTGO) within the circle and rings are then calculated to generate a robust fingerprint. Our experiments with different attacks have demonstrated that the proposed approach outperforms the state-of-the-art methods in terms of robustness and discrimination.

Keywords

1. Introduction

With the prompt development of the internet and multimedia technology, videos can be easily accessed and distributed. However, this convenience creates copyright control issues, for example, pirated copies of videos can be easily made and propagated over the internet. Video fingerprinting or content-based copy detection [1] provides an efficient way to identify the video features, and is widely used in multimedia industry [2] such as video database management, content searching, and content tracking. Fig. 1 shows the common structure of a video fingerprinting system, which consists three parts: fingerprint extraction, database storage and fingerprint matching.

Fig. 1.Overall structure of a complete fingerprinting system

A good video fingerprinting algorithm must satisfy following properties:

Discrimination: Video fingerprinting algorithm should generate distinct fingerprints for different video contents, so that these fingerprints can distinguish two different media contents.

Robustness: Fingerprints of a video should remain invariant even if the video suffers several perceptual- preserving distortions like frame rotation or brightness change.

Efficiency: Video fingerprints should have low computational and storage cost to ensure fast retrieval. Efficiency is valuable in industry application.

In general, video fingerprinting algorithms can be categorized into two classes. In the first class, video fingerprints are extracted based on local features. For example, the video fingerprints are constructed by using Harris interest points [3] in literature [4], or formed by using the trajectories of the interest points in literature [5]. In literature [6], the video fingerprints are generated based on SIFT features [7]. One of main drawbacks of these algorithms is that the number and the dimension of the extracted features are usually very large, which incurs large computational and storage cost. In the second class, video fingerprints are obtained based on global video features. Ordinal measure can be used for copy detection in [8]-[10], and the video fingerprint weights based on visual saliency are computed for ordinal measure in literature [11]. In literature [12], the gradient of each pixel is obtained by calculating luminance differences along both horizontal and vertical directions and the fingerprints are generated from the centroid of these luminance gradient orientations (CGO). Although the computation cost of these algorithms is low compared to the algorithms of the first class, they are vulnerable to logo insertion and some geometric transforms like frame rotation or mirroring.

In this paper, a novel video fingerprinting algorithm is proposed to enhance the robustness of video fingerprints, by using the centroid of spatio-temporal gradient orientations (CSTGO). The key points of this paper is: (1) In our spatio-temporal gradient, the spatial derivative in eight directions is designed to avoid geometric limitation of traditional horizontal and vertical derivatives, while the temporal derivative is used to enhance the resistance to logo insertion of our fingerprints. (2) A novel partitioning scheme based on concentric circle and rings is also developed to further improve the robustness of the fingerprints. Our analysis and experimental results show that the combination of the spatial information in each frame and temporal information between two consecutive frames can enhance the robustness against attacks like global change as well as immunize against rotation, frame mirroring, and logo insertion.

The rest of the paper is organized as follows: in Section II the proposed fingerprinting algorithm and the design consideration are described in details, the experiment results and analysis are shown in Section III and the conclusions are given in Section IV.

 

2. The Proposed Fingerprinting Algorithm

In the traditional CGO based algorithm [12] only four neighbors in the horizontal and vertical direction are considered when the luminance difference is calculated for each pixel. In fact, when a video frame suffers the attacks like frame rotation or mirroring, the horizontal and vertical neighbors of each pixel would be changed, while as long as the attacks preserve the perceptual quality, its eight neighbors still revolve around it, only their relative position is changed as shown in Fig. 2.

Fig. 2.Spatial relationship of a pixel’s eight neighbors (a) Original spatial relationship (Each sub-block represent a pixel) (b) After a 90 degrees rotation of (a) around the central pixel (c) After vertical mirroring of (a)

In order to keep the spatial information intact and get a good rotation invariance, in our algorithm the luminance differences at each central pixel f(x, y, k) in frame k are calculated by using its 8-neighborhood values and the maximal absolute value of these differences is used as the spatial luminance difference, and denoted as spatial derivative Gs:

Fig. 3.Block diagram of proposed CSTGO algorithm

On the other hand, the temporal information can provide additional discriminative power for video fingerprints [10]. However the CGO based approach [12] does not consider temporal relationship between frames. It is noted that the differences between adjacent frames are usually small, and sometimes most of them are closed to zero when the change of scenes is little. This characteristic can be used to improve the robustness of fingerprints when logo insertion occurs in our algorithm. In this case, the differences of the logo region between two consecutive frames would be equal to zero. Here the derivative between two adjacent frames is denoted as temporal derivative Gt:

The block diagram of the proposed CSTGO based algorithm is shown in Fig. 3. The pre-processing performs basic operations including frame rate conversion, grayscale conversion and frame resizing [12]. Each frame is then partitioned into a circle and N-1 concentric rings, and the radius of the circle and rings is n×r (where n = 1,2, …, N, r is the radius of the innermost circle) as shown in Fig. 4, where the frame size is X×Y, and N×r is equal to Y/2 or X/2. If frames are rotated around the centers, the relationship between each pixel and its associated circle or ring would not change. So the robustness of video fingerprint can be enhanced when frames suffer rotation attack, compared with traditional block-based partition schemes [8]-[12]. In addition, assume that in the most common cases, the visual contents are distributed throughout the concentric rings and the importance of these contents decrease with the increasing of their distance to the frame center. Thus, if the logos placed or inserted in the regions outside the largest ring as shown in Fig. 4, these regions will be discarded during the feature calculation and the generated fingerprints can resist to the logo insertion without losing important visual content. Moreover, rather than the concentric circle-based partition scheme in [13], in our partition scheme the closer to the central of the frame, the more concentric rings contains, thus different weights are obtained in different regions depended on the importance of visual contents.

Fig. 4.Frame partition into concentric rings (N = 8)

The procedure of the CSTGO calculation is as follows:

(1) For each pixel’s value f(x, y, k) at location (x, y) in frame k, the Euclidean distance Dist(x, y, k) between the pixel and the frame center is defined as:

where X and Y is the width and height of frame k, respectively.

(2) For each f(x, y, k), if (n-1) ×r ≤ Dist(x, y, k) ≤ n×r (1≤n≤N) is satisfied, then the gradient magnitude r(x, y, k) and orientation θ(x, y, k) is given respectively by:

where Gs and Gt is calculated by (1) and (2), respectively.

(3) The centroid of gradient cn,k is calculated as follows:

(4) Since each frame contains a circle and N-1 concentric rings, the fingerprint vector Ck is obtained finally as follows:

and N is the dimension of the fingerprints.

In the fingerprint matching stage, each fingerprint vector in a fingerprint sequence of K consecutive frames is first normalized by the means and standard deviation of the sequence as in [12]. Then the similarity between two video clips is measured by the Euclidean distance d as follows:

where P = {pn,k | n = 1,…, N, k = 1,…, K} and Q = {qn,k | n = 1,…, N, k = 1,…, K} are normalized fingerprint sequences extracted from two video clips, respectively. Given a threshold T, if d(P,Q) ≤ T, the two video clips are considered as similar.

The performance of the algorithm can be measured by the false positive rate Pfp and the false negative rate Pfn, Pfp is a probability which treats different videos as same one, while Pfn is a probability which treats the same videos as dissimilar. Usually, Pfn is hard to obtain directly, thus, an evaluation method in [12] is used to estimate the Pfp and Pfn here. In fact, since the fingerprints are extracted independently from video clips, the distance d is subject to a normal distribution by the central limit theorem if NK is sufficiently large. Let dintra denote the intra distance between original video fingerprint sequence and its distortion version, and dinner denote the inner distance of fingerprint sequences between two different clips, then the distributions of these two distances can be obtained as follows:

where μintra, σintra, μinner and σinner are their means and standard deviations respectively. By maximum likelihood estimation, given the distance values di (1 ≤ i ≤ M) of M fingerprint sequences, their means and variances can be estimated as follows:

The false positive rate Pfp can be written as follows:

and the threshold T can be obtained by:

where erfcinv(x) is the inverse function of complementary error function erfc(x).

Given a set of different Pfp, various thresholds T can be determined by equ.13 and equ.14, and then Pfn can be measured through the experiments using equ.15:

where Nfn is the number of similar clip pairs whose d(P, Q) ≥ T, and the Ns is the true number of similar video clips.

 

3. Experiment Results

Our experiments were conducted on a database which contains more than 200 movies, which are collected from Internet. The total length of these videos is about 300 hours. In our experiment, each video clip is down-sampled to 10 fps (frames per second), and resized to 320 × 320. The fingerprint dimension N is 8, and the clip length is 10 seconds, i.e., K = 100.

3.1 Robustness Test

Fig. 5.Comparison between original video frame and the frames with various video attacks. (a) Original Frame, (b) Resizing to CIF, (c) Aspect ratio change (from 4:3 to 16:9), (d) Gaussian blurring with radius 1 pixel, (e) Global change in brightness (+25%), (f) Gamma correction (γ = 1.3), (g) Gaussian noise addtion, (h) Logo overlay in the upper-left by (9% and 16%), (i) Logo overlay in the middle by (9% and 16%), (j) Crop (14%), (k) Horizontal shift (10 pixel), (l) Frames mirroring vertically, (m) Rotation (1, 13, 20, and 40 degrees).

Fig. 6.Empirical inner and intra distance distributions for CSTGO and CGO algorithms. (a) The inner distance distributions of the two algorithms. (b) The intra and inner distance distributions of CSTGO.

In this test, the videos suffer twelve types of attacks. All attacks, together with the original frame, are illustrated in Fig. 5.

Global attack (Type 1 to Type 6)

(1) Resizing to CIF,

(2) Aspect ratio change (from 4:3 to 16:9),

(3) Gaussian blurring with radius 1 pixel,

(4) Global change in brightness (+25%),

(5) Gamma correction (γ = 1.3),

(6) Gaussian noise addition,

Logo overlay (Type 7 and Type 8)

(7) Logo overlay in the upper-left by (9% and 16%),

(8) Logo overlay in the middle by (9% and 16%),

Geometric attack (Type 9 to Type 12)

(9) Crop (14%),

(10) Horizontal shift (10 pixel),

(11) Frames mirroring vertically,

(12) Rotation (1, 13, 20, and 40 degrees).

First, the empirical inner distance distributions of the CSTGO based algorithm and CGO based algorithm [12] are compared in Fig. 6 (a). 45082260 pairs of different fingerprint sequences are used to analyze the distributions. The mean and the variance of these inner distances in the CSTGO based algorithm are 1.9974 and 0.0127287, while these in the CGO based algorithm are 1.95693 and 0.120112, respectively. The variance of inner distances in the CSTGO based algorithm is smaller than that in the CGO based algorithm. It means that under a certain threshold T, the CSTGO based algorithm can achieve a lower Pfp than the CGO based algorithm. For example, if T is fixed to 0.5, then the expected Pfp in the CGO based algorithm is 1.3122×10-5, while that in the CSTGO based algorithm is 1.6759×10-40, which is much smaller than that in CGO. Hence, the proposed algorithm has a better discriminative feature than the CGO based algorithm does. Fig. 6 (b) shows the inner distance distribution and three different intra distance distributions in CSTGO. These intra distributions are calculated under three attacks: (1) Type 8, logo overlay in the middle of the frame by 16%, (2) Type 9, crop the frame, (3) Type 12, rotate the frame 40 degrees. As shown in Fig. 6 (b), the distances between the inner distance distribution and those three intra distance distributions are very large, so that the CSTGO based algorithm has a better robustness than the CGO based algorithm.

In Fig. 7 - Fig. 11, about 130 hours of videos, together with their twelve attacked versions are used. Because the Pfp is usually hard to obtain in practice, here the Pfp is set from 10-60 to 1, and the corresponding threshold T is determined by equ.14. Then the Pfn is obtained in equ.15, where Nfn, Ns is measured though the experiments, respectively. In these experiments, the robustness performances of the proposed CSTGO are compared with the weighted hashing (WH) method in [11] and CGO [12] under different attacks.

Fig. 7.The robustness performances of CSTGO compared with WH [11] and CGO [12] under global attacks. (a) Resizing to CIF, (b) Aspect ratio change (from 4:3 to 16:9), (c) Gaussian blurring with radius 1 pixel, (d) Global change in brightness (+25%), (e) Gamma correction (γ = 1.3), (f) Gaussian noise addition.

Fig. 8.The robustness performances of CSTGO compared with WH [11] and CGO [12] under logo insertions. (a) Logo overlay in the upper-left by 9%, (b) Logo overlay in the middle by 9%.

Fig. 9.The robustness performances of CSTGO under different logo insertions.

Fig. 10.The robustness performances of CSTGO compared with WH [11] and CGO [12] under geometric attacks. (a) Frame crop (14%), (b)Horizontal shift (10 pixel), (c) Rotation 1 degrees, (d)Frames mirroring.

Fig. 11.The robustness performances of CSTGO under different rotational degrees

Fig. 7 gives the robustness comparisons of above three algorithms under different global attacks. Since the proposed CSTGO uses the maximum of luminance differences as spatial derivative, brightness change and noise addition may influence the performance of the proposed CSTGO. However, a lower Pfn is achieved in the proposed algorithm than other two algorithms under a given Pfp, which means our algorithm is more robust against the global attacks than WH and CGO.

Fig. 8 shows the comparison of these three algorithms under different logo insertions. Our CSTGO achieves better performances than CGO and WH. Usually the logo is inserted in the marginal region. For example, when the logo is inserted in the upper-left corner of original frame, with the feature of temporal derivative and partition scheme, a low Pfn is achieved when Pfp is small because most of the influenced area where logo is inserted would be discarded, so the impact on the fingerprints is weak, as shown in Fig. 8 (a). Furthermore, when the logo is inserted in the middle region of the frame, the main visual content would be changed, so do video fingerprints under this attack. However, Fig. 8 (b) demonstrates that our method is still more robust to this attack than WH and CGO. In addition, Fig. 9 gives the robustness of proposed CSTGO under logo insertion attacks with different logo insertion positions and different logo size. We can see that our method can resist to this attack even when the inserted logo overlays 16% middle area of the original frame.

The comparisons of these three algorithms under different geometric attacks are shown in Fig. 10. The figure demonstrates that our proposed CSTGO is the most robust algorithm among these three methods when videos suffer geometric transforms. Traditional block-based partition scheme like WH [11] and CGO [12] are fragile to some geometric attacks since these attacks may change the order of frame blocks. For example, Fig. 10 (d) illustrates that WH and CGO cannot resist to frame mirroring attack. Fig. 11 shows that with the feature of rotation-invariant spatial derivative and partition scheme in our CSTGO based approach, the generated fingerprints are immune to rotation even when frames are rotated up to 40 degrees.

3.2 Scalability Test

In this test, the scalability of the proposed algorithm is examined. Since the dimension of fingerprinting vector of each frame is 8, the k-d-tree [14] is used as the indexing structure in our database. In the matching stage, each fingerprint vector in the fingerprint sequence is used to retrieval its most similar fingerprint vector in the database via k-d-tree, and the potential video clips would be found in the database. The distances between the query clip and the clips found in the database are then calculated, and the clip with the smallest distance is selected as the retrieval result finally.

In the test, 18047 query clips were constructed, and the total length of these video clips is about 50 hours. In order to test the scalability, the size of the database varies from 50 hours to 300 hours. Fig. 12 shows that the precision of the retrieval result is able to maintain at 100% even if the size of database increases. It demonstrates that the proposed algorithm has good scalability.

Fig. 12.The scalability performances of CSTGO

 

4. Conclusion

In this paper, a new algorithm based on CSTGO is proposed for video fingerprint. The algorithm includes an improved spatial derivative, a new defined spatio-temporal gradient to represent the spatial and temporal information for each frame, and a new partition scheme based on concentric rings. Our investigation indicates that the feature of temporal derivative can be used to resist logo insertion and the combination of the spatial and temporal information in each frame can enhance the discrimination and robustness of the video fingerprints. Our experimental results have demonstrated that the proposed CSTGO based algorithm performs better that the state-of-the-art algorithms in terms of discrimination and robustness when the videos suffer the typical attacks, such as, global change, geometric attacks, and logo insertion. Our fingerprints are also robust against geometric attacks such as rotation and frame mirroring which traditional global feature based fingerprints cannot resist. The proposed CSTGO also has good scalability. The future work is to evaluate our CSTGO algorithm on some open video databases.

References

  1. S. Lian, N. Nikolaos, and H. T. Sencar, "Content-based video copy detection - a survey," Intelligent Multimedia Analysis for Security Applications, vol. 282, pp. 253-273, 2010. https://doi.org/10.1007/978-3-642-11756-5_12
  2. Jian Lu, "Video fingerprinting for copy identification: from research to industry applications," in Proc. of SPIE - Media Forensics and Security XI, vol. 7254, 2009.
  3. C. Harris, M. Stevens, "A combined corner and edge detector," in Proc. of 4th Alvey Vision Conference, pp. 153-158, 1988.
  4. A. Joly, O. Buisson, C. Frelicot, "Content-based copy retrieval using distortion-based probabilistic similarity search," IEEE Transactions on Multimedia, vol. 9 (2), pp.293-306, 2007. https://doi.org/10.1109/TMM.2006.886278
  5. J. Law-To, O. Buisson, V. Gouet-Brunet, N. Boujemaa, "Robust voting algorithm based on labels of behavior for video copy detection," in Proc. of ACM Multimedia, pp. 835-844, 2006.
  6. Z. Liu, T. Liu, and B. Shahraray, "At&t research at TRECVID 2009 content-based copy detection," in Proc. of TRECVID, 2009.
  7. D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  8. R. Mohan, "Video sequence matching," in Proc. of Int. Conf. on Acoustics, Speech and Signal Processing, vol. 6, pp.3697-3700, 1998.
  9. A. Hampapur, R. Bolle, "Comparison of sequence matching techniques for video copy detection," in Proc. of Conf. on Storage and Retrieval for Media Databases, pp.194-201, 2002.
  10. C. Kim and B. Vasudev, "Spatiotemporal sequence matching for efficient video copy detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 127-132, January, 2005. https://doi.org/10.1109/TCSVT.2004.836751
  11. J. Sun, J. Wang, J. Zhang, X. Nie and J, Liu, "Video hashing algorithm with weighted matching based on visual saliency," IEEE Signal Processing letters, vol. 19, no. 6, pp. 328-331, June, 2012. https://doi.org/10.1109/LSP.2012.2192271
  12. S. Lee and C. Yoo, "Robust video fingerprinting for content-based video identification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 7, pp. 983-988, July, 2008. https://doi.org/10.1109/TCSVT.2008.920739
  13. R. Radhakrishnan and C. Bauer, "Video Fingerprinting Based on Moment Invariants Capturing Appearance and Motion," in Proc. of Int. Conf. on Multimedia and Expo, pp.1532-1535, 2009.
  14. J. T. Robinson, "The k-d-b-tree: A search structure for large multidimensional dynamic indexing," in Proc. of ACM SIGMOD Int. Conf. Management Data, pp. 10-18, 1981.

Cited by

  1. A near-duplicate 3D video detection algorithm by using hypercomplex representations vol.76, pp.1, 2013, https://doi.org/10.1007/s11042-015-3086-3