Human Action Recognition Using Pyramid Histograms of Oriented Gradients and Collaborative Multi-task Learning

  • Gao, Zan (Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology)
  • Zhang, Hua (Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology)
  • Liu, An-An (School of Electronic Information Engineering, Tianjin University)
  • Xue, Yan-Bing (Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology)
  • Xu, Guang-Ping (Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology)
  • Received : 2013.11.06
  • Accepted : 2014.01.09
  • Published : 2014.02.27

Abstract

This paper proposes a human action recognition method that uses pyramid histograms of oriented gradients and collaborative multi-task learning. First, we accumulate global activities and construct a motion history image (MHI) for the RGB and depth channels separately, encoding the dynamics of an action in each modality. Different action descriptors are then extracted from the depth and RGB MHIs to represent the global textural and structural characteristics of the actions. Specifically, the average value in hierarchical blocks, GIST, and pyramid histograms of oriented gradients descriptors are employed to represent human motion. To demonstrate the superiority of the proposed method, we evaluate the descriptors with KNN, SVM with linear and RBF kernels, SRC, and CRC models on the DHA dataset, a well-known dataset for human action recognition. Large-scale experiments show that our descriptors are robust, stable, and efficient, and that they outperform state-of-the-art methods. In addition, we investigate combinations of these descriptors on the DHA dataset and observe that combined descriptors perform much better than any single descriptor. With multimodal features, we also propose a collaborative multi-task learning method for model learning and inference based on transfer learning theory. The main contributions lie in four aspects: 1) the proposed encoding scheme can filter out the stationary parts of the human body and reduce noise interference; 2) different kinds of features and models are assessed, and neighboring gradient information and pyramid layers prove very helpful for representing these actions; 3) the proposed model can fuse features from different modalities regardless of sensor type, value range, and feature dimension; 4) the latent common knowledge shared among different modalities can be discovered by transfer learning to boost performance.
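
As a rough illustration of the MHI encoding mentioned in the abstract, the sketch below uses the standard temporal-template update of Bobick and Davis: a pixel that moves is reset to a maximum value tau, and every other pixel decays by one per step. It is a minimal sketch, not the authors' implementation. The function name `motion_history_image`, the frame-differencing motion test, and the parameters `tau` and `diff_threshold` are assumptions made for illustration; the abstract does not specify them. The same function would be applied once to the grayscale RGB stream and once to the depth stream.

```python
import numpy as np


def motion_history_image(frames, tau=None, diff_threshold=30.0):
    """Build a normalised motion history image from grayscale frames.

    frames: array-like of shape (T, H, W) with T >= 2.
    tau, diff_threshold: illustrative parameters (assumptions, not
    values taken from the paper).
    """
    frames = np.asarray(frames, dtype=np.float32)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames of shape (T, H, W)")
    if tau is None:
        # Assumption: the history spans every frame difference in the clip.
        tau = frames.shape[0] - 1
    tau = float(tau)

    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Simple frame differencing stands in for the motion detector.
        moving = np.abs(curr - prev) > diff_threshold
        # Moving pixels are reset to tau; the rest decay by one, floored at 0.
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
    # Scale to [0, 1] so RGB and depth MHIs share the same value range.
    return (mhi / tau).astype(np.float32)


if __name__ == "__main__":
    # Toy clip: row 0 holds a bright dot moving right, row 1 never changes.
    clip = np.zeros((4, 2, 4), dtype=np.float32)
    clip[:, 1, :] = 100.0
    for t in range(4):
        clip[t, 0, t] = 255.0
    print(motion_history_image(clip))
```

In the toy clip, the static row stays at zero while the moving dot leaves a trail that is brightest where motion was most recent. This matches the first contribution listed above, that the encoding filters out stationary parts of the scene.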

Acknowledgement

Supported by: National Science Foundation of China
