DOI QR코드

DOI QR Code

Three-stream network with context convolution module for human-object interaction detection

  • Siadari, Thomhert S. (ICT Major of ETRI School, University of Science and Technology) ;
  • Han, Mikyong (City and Transportation ICT Research Department, Electronics and Telecommunications Research Institute) ;
  • Yoon, Hyunjin (ICT Major of ETRI School, University of Science and Technology)
  • Received : 2019.04.29
  • Accepted : 2019.08.14
  • Published : 2020.04.03

Abstract

Human-object interaction (HOI) detection is a popular computer vision task that detects interactions between humans and objects. This task can be useful in many applications that require a deeper understanding of semantic scenes. Current HOI detection networks typically consist of a feature extractor followed by detection layers comprising small filters (eg, 1 × 1 or 3 × 3). Although small filters can capture local spatial features with a few parameters, they fail to capture larger context information relevant for recognizing interactions between humans and distant objects owing to their small receptive regions. Hence, we herein propose a three-stream HOI detection network that employs a context convolution module (CCM) in each stream branch. The CCM can capture larger contexts from input feature maps by adopting combinations of large separable convolution layers and residual-based convolution layers without increasing the number of parameters by using fewer large separable filters. We evaluate our HOI detection method using two benchmark datasets, V-COCO and HICO-DET, and demonstrate its state-of-the-art performance.

Acknowledgement

Grant : Development and Demonstration of Smart City Service over 5G Network

Supported by : MSIT

References

  1. A. Gupta, A. Kembhavi, and L. S. Davis, Observing human-object interactions: Using spatial and functional compatibility for recognition, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009), no. 10, 1775-1789. https://doi.org/10.1109/TPAMI.2009.83
  2. V. Delaitre, I. Laptev, and J. Sivic, Recognizing human actions in still images: a study of bag-of-features and part-based representations, in Proc. BMVC 2010-21st British Mach. Vision Conf., 2010, pp. 97:1-11.
  3. B. Yao and L. Fei-Fei, Modeling mutual context of object and human pose in human-object interaction activities, in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recogn., San Francisco, CA, USA, June 2010, pp.17-24.
  4. C. Y. Chen and K. Grauman, Predicting the location of interactees in novel human-object interactions, Asian conference on computer vision, Springer, Cham, Switzerland, 2014, pp. 351-367.
  5. S. Gupta and J. Malik, Visual semantic role labeling, arXiv preprint arXiv:1505.04474, 2015.
  6. L. Wang and D. Sng, Deep learning algorithms with applications to video analytics for a smart city: a survey, arXiv preprint arXiv:1512.03131, 2015.
  7. J. W. Choi, D. Moon, and J. H. Yoo, Robust multi-person tracking for real-time intelligent video surveillance, ETRI J. 37 (2015), no. 3, 551-561. https://doi.org/10.4218/etrij.15.0114.0629
  8. J. Moon et al., Extensible hierarchical method of detecting interactive actions for video understanding, ETRI J. 39 (2017), no. 4, 502-513. https://doi.org/10.4218/etrij.17.0116.0054
  9. K. Yun et al., Vision-based garbage dumping action detection for real-world surveillance platform, ETRI J. 41 (2019), no. 4, 494-505. https://doi.org/10.4218/etrij.2018-0520
  10. Y. Licheng et al., Visual madlibs: fill in the blank image generation and question answering, arXiv preprint arXiv:1506.00278, 2015.
  11. G. Gkioxari et al., Detecting and recognizing human-object interactions, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Salt Lake City, UT, USA, June 2018, pp. 8359-8367.
  12. Y. W. Chao et al., Learning to detect human-object interactions, in Proc. IEEE Winter Conf. Applicat. Comput. Vision, Lake Tahoe, NV, USA, Mar. 2018, pp. 381-389.
  13. L. Shen et al., Scaling human-object interaction recognition through zero-shot learning, in Proc. IEEE Winter Conf. Applicat. Comput. Vision, Lake Tahoe, NV, USA, Mar. 2018, pp. 1568-1576.
  14. C. Gao, Y. Zou, and J. B. Huang, iCAN: Instance-centric attention network for human-object interaction detection, British Machine Vision Conference, 2018.
  15. C. Peng et al., Large kernel matters-improve semantic segmentation by global convolutional network, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, 2017, pp. 4353-4361.
  16. M. A. Sadeghi and A. Farhadi, Recognition using visual phrases, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Providence, RI, USA, 2011, pp. 1745-1752.
  17. L. Cewu et al., Visual relationship detection with language priors, European Conference on Computer Vision, Springer, Cham, Switzerland, 2016, pp. 852-869.
  18. M. Yatskar, L. Zettlemoyer, and A. Farhadi, Situation recognition: Visual semantic role labeling for image understanding, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Las Vegas, NV, USA, 2016, pp. 5534-5542.
  19. B. Dai, Y. Zhang, and D. Lin, Detecting visual relationships with deep relational networks, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, 2017, pp. 3076-3086.
  20. H. Zhang et al., Visual translation embedding network for visual relation detection, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, 2017, pp. 5532-5540.
  21. H. Ronghang et al., Modeling relationships in referential expressions with compositional modular networks, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, 2017, pp. 1115-1124.
  22. J. Peyre et al., Weakly-supervised learning of visual relations, in Proc. IEEE Int. Conf. Comput. Vision, Venice, Italy, 2017, pp. 5179-5188.
  23. A. Kolesnikov, C. H. Lampert, and V. Ferrari. Detecting visual relationships using box attention, arXiv preprint arXiv:1807.02136, 2018.
  24. M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, Feedforward semantic segmentation with zoom-out features, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Boston, MA, USA, 2015, pp. 3376-3385.
  25. W. Liu, A. Rabinovich, and A. C. Berg, Parsenet: Looking wider to see better, arXiv preprint arXiv:1506.04579, 2015.
  26. F. Yu and V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint arXiv:1511.07122, 2015.
  27. K. He et al., Deep residual learning for image recognition, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Las Vegas, NV, USA, June 2016, pp. 770-778.
  28. R. Girshick et al., Detectron, https://github.com/facebookresearch/detectron, 2018.
  29. T. Y. Lin et al., Microsoft COCO: Common objects in context, in Proc. Computer Vision-ECCV, Zurich, Switzerland, Sept. 2014, pp. 740-755.
  30. Y. W. Chao et al., HICO: A benchmark for recognizing human-object interactions in images, in Proc. IEEE Int. Conf. Comput. Vision, Santiago, Chile, 2015, pp. 1017-1025.
  31. T. Y. Lin et al., Feature pyramid networks for object detection, in Proc. IEEE Conf. Comput. Vision Pattern Recogn., Honolulu, HI, USA, July 2017, pp. 2117-2125.
  32. S. Qi et al., Learning human-object interactions by graph parsing neural networks, in Proc. Eur. Conf. Comput. Vision (ECCV), 2018, pp. 401-417.
  33. X. Bingjie et al., Interact as you intend: Intention-driven human- object interaction detection, CoRR abs/1808.09796, 2018.