- Volume 15 Issue 4
Spatial-temporal Ensemble Method for Action Recognition
행동 인식을 위한 시공간 앙상블 기법
- Seo, Minseok (Hanbat National University) ;
- Lee, Sangwoo (Hanbat National University) ;
- Choi, Dong-Geol (Hanbat National University)
- Received : 2020.07.08
- Accepted : 2020.08.13
- Published : 2020.11.30
As deep learning has matured and spread to various fields, applications are gradually shifting from single images to videos, which add a temporal axis, in order to recognize human behavior. However, unlike a 2D CNN operating on a single image, a 3D CNN operating on video incurs a large increase in computation and parameters because of the added temporal axis, so improving accuracy in action recognition is harder than in single-image tasks. To address this problem, we investigate and analyze various techniques that improve the performance of 3D CNN-based recognition without additional training time or additional parameters. We propose a temporal ensemble that uses the time axis, which exists only in videos, and an ensemble over the input frames. With a combination of these techniques, we achieve an accuracy improvement of up to 7.1% over the existing performance. We also reveal the trade-off between computational cost and accuracy.
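The abstract does not specify how the two ensembles are built, so the following is only a minimal sketch of one common test-time reading of the idea: several clips are sampled along the time axis, each clip is optionally paired with a horizontally flipped copy as a stand-in for the ensemble over the input frame, and the class probabilities are averaged. The names `sample_clips`, `spatio_temporal_ensemble`, and `toy_model`, the clip length, the number of clips, and the choice of flipping are all assumptions for illustration, not the authors' configuration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sample_clips(video, clip_len, num_clips):
    """Cut `num_clips` clips of `clip_len` frames at evenly spaced offsets.

    `video` is assumed to have shape (T, H, W, C).
    """
    total = video.shape[0]
    if total < clip_len:
        raise ValueError("video is shorter than one clip")
    starts = np.linspace(0, total - clip_len, num_clips).round().astype(int)
    return [video[s:s + clip_len] for s in starts]


def spatio_temporal_ensemble(video, model, clip_len=16, num_clips=4, use_flip=True):
    """Average class probabilities over temporal clips and spatial views.

    `model` is a hypothetical callable mapping one clip (clip_len, H, W, C)
    to a 1-D array of class logits. No retraining or extra parameters are
    involved; only the number of forward passes grows.
    """
    probs = []
    for clip in sample_clips(video, clip_len, num_clips):
        views = [clip]
        if use_flip:
            # Horizontal flip along the width axis (assumed spatial view).
            views.append(clip[:, :, ::-1, :])
        for view in views:
            probs.append(softmax(model(view)))
    return np.mean(probs, axis=0)


def toy_model(clip):
    """Placeholder for a trained 3D CNN: returns three fake class logits."""
    return np.array([clip.mean(), clip[..., 0].mean(), clip.std()])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((64, 8, 8, 3))  # 64 random frames of 8x8 RGB
    print(spatio_temporal_ensemble(video, toy_model))
```

In this sketch the cost is `num_clips` forward passes, doubled when flipping is enabled, which is one way to picture the trade-off between computational cost and accuracy mentioned in the abstract.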