Spatial-temporal Ensemble Method for Action Recognition

  • Received : 2020.07.08
  • Accepted : 2020.08.13
  • Published : 2020.11.30

Abstract

As deep learning has matured and spread to various fields, applications that recognize human behavior are gradually shifting from single images to video, which adds a time axis. However, unlike a 2D CNN applied to a single image, a 3D CNN applied to video requires far more computation and many more parameters because of the added time axis, so improving accuracy in action recognition is harder than in single-image tasks. To address this problem, we investigate and analyze various techniques that improve the performance of 3D CNN-based recognition without additional training time or additional parameters. We propose a temporal ensemble that exploits the time axis, which exists only in video, and an ensemble over the input frames. By combining these techniques, we improve accuracy by up to 7.1% over the existing performance. We also reveal the trade-off between computational cost and accuracy.
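The idea of a temporal ensemble at inference time can be illustrated with a minimal sketch. The abstract does not specify how clips are sampled or how predictions are fused, so the code below assumes evenly spaced clips and simple averaging of softmax scores. The names `clip_starts`, `temporal_ensemble`, and `toy_model` are hypothetical and do not come from the paper; `toy_model` stands in for a trained 3D CNN.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clip_starts(num_frames: int, clip_len: int, num_clips: int) -> List[int]:
    """Evenly spaced clip start indices along the time axis (assumed sampling scheme)."""
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")
    last = num_frames - clip_len  # latest valid start index
    if num_clips == 1:
        return [last // 2]  # single centered clip
    return [round(i * last / (num_clips - 1)) for i in range(num_clips)]


def temporal_ensemble(
    frames: Sequence,
    model: Callable[[Sequence], Sequence[float]],
    clip_len: int,
    num_clips: int,
) -> List[float]:
    """Average the per-clip class probabilities of one video.

    No retraining and no extra parameters are needed, but each
    additional clip costs one more forward pass of the model.
    """
    starts = clip_starts(len(frames), clip_len, num_clips)
    probs = [softmax(model(frames[s:s + clip_len])) for s in starts]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


def toy_model(clip: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a 3D CNN: two class scores from the clip mean."""
    mean = sum(clip) / len(clip)
    return [mean / 10.0, 1.0]


video = [float(t) for t in range(64)]  # 64 dummy "frames"
scores = temporal_ensemble(video, toy_model, clip_len=16, num_clips=4)
```

Under these assumptions, raising `num_clips` increases inference cost linearly while leaving the trained model untouched, which is one concrete form of the trade-off between computational cost and accuracy that the abstract mentions.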
