Active Vision from Image-Text Multimodal System Learning
  • Journal title : Journal of KIISE
  • Volume 43, Issue 7, 2016, pp. 795-800
  • Publisher : Korean Institute of Information Scientists and Engineers
  • DOI : 10.5626/JOK.2016.43.7.795
 Title & Authors
Active Vision from Image-Text Multimodal System Learning
Kim, Jin-Hwa; Zhang, Byoung-Tak
 
 Abstract
In image classification, recent CNNs rival human performance; however, they remain limited in more general recognition tasks. Here we deal with indoor images, which contain too much information to be processed directly and therefore require information reduction before recognition. To reduce the amount of data to be processed, variational inference or variational Bayesian methods are typically suggested for object detection, but these methods suffer from the difficulty of marginalizing over the given space. In this study, we propose an image-text integrated recognition system that uses active vision based on Spatial Transformer Networks. The system attempts to efficiently sample a partial region of a given image conditioned on the given language information. Our experimental results demonstrate a significant improvement over traditional approaches. We also discuss a qualitative analysis of the sampled images, the model's characteristics, and its limitations.
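As a rough illustration of the sampling mechanism described above, the sketch below shows how a Spatial Transformer can take a differentiable crop ("glimpse") of a partial image region whose scale and location are predicted from a fused image-text code. This is a minimal PyTorch sketch, not the authors' implementation; the layer sizes, the GRU text encoder, and the restriction of the affine transform to zoom and translation are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedSTN(nn.Module):
    """Minimal sketch: crop a partial region of an image with a Spatial
    Transformer whose affine parameters are predicted from a fused
    image-text code (all layer sizes are illustrative assumptions)."""

    def __init__(self, vocab_size=1000, embed_dim=64, out_size=32):
        super().__init__()
        self.out_size = out_size
        # Coarse encoder of the full (downsampled) image.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())            # -> (N, 512)
        # Encoder of the accompanying text (word indices).
        self.txt_emb = nn.Embedding(vocab_size, embed_dim)
        self.txt_gru = nn.GRU(embed_dim, 64, batch_first=True)
        # Localization head: scale (sx, sy) and translation (tx, ty).
        self.loc = nn.Linear(512 + 64, 4)
        nn.init.zeros_(self.loc.weight)
        self.loc.bias.data = torch.tensor([1.0, 1.0, 0.0, 0.0])  # start at "full image"

    def forward(self, image, text_ids):
        h_img = self.img_enc(image)                           # (N, 512)
        _, h_txt = self.txt_gru(self.txt_emb(text_ids))       # (1, N, 64)
        fused = torch.cat([h_img, h_txt.squeeze(0)], dim=1)   # (N, 576)
        sx, sy, tx, ty = self.loc(fused).unbind(dim=1)
        zeros = torch.zeros_like(sx)
        # 2x3 affine matrix restricted to zoom + shift (no rotation/shear).
        theta = torch.stack([
            torch.stack([sx, zeros, tx], dim=1),
            torch.stack([zeros, sy, ty], dim=1)], dim=1)      # (N, 2, 3)
        grid = F.affine_grid(
            theta, (image.size(0), 3, self.out_size, self.out_size),
            align_corners=False)
        # Differentiable sampling of the attended region ("glimpse").
        return F.grid_sample(image, grid, align_corners=False)

# Toy usage: one 64x64 RGB image and a 5-token text query (random data).
model = TextConditionedSTN()
glimpse = model(torch.rand(1, 3, 64, 64), torch.randint(0, 1000, (1, 5)))
print(glimpse.shape)  # torch.Size([1, 3, 32, 32])

Restricting the affine parameters to scale and translation and initializing the localization head at the identity transform is a common way to keep grid sampling well behaved early in training; the model in the paper may parameterize the transform differently.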
 Keywords
visual attention; active vision; object recognition; deep learning
 Language
Korean
 References
1.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. of the IEEE, Vol. 86, pp. 2278-2323, 1998.

2.
P. Simard, B. Victorri, Y. LeCun, and J. Denker, "Tangent prop - a formalism for specifying selected invariances in an adaptive network," Proc. of the Advances in Neural Information Processing Systems, pp. 895-903, 1992.

3.
V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, "Recurrent Models of Visual Attention," Proc. of the Advances in Neural Information Processing Systems 27, pp. 2204-2212, 2014.

4.
Q. Wang, J. Zhang, S. Song, and Z. Zhang, "Attentional Neural Network: Feature Selection Using Cognitive Feedback," Proc. of the Advances in Neural Information Processing Systems, pp. 1-9, 2014.

5.
J. Ba, V. Mnih, and K. Kavukcuoglu, "Multiple Object Recognition with Visual Attention," arXiv preprint arXiv:1412.7755, pp. 1-10, 2014.

6.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," Proc. of the Computer Vision and Pattern Recognition, pp. 580-587, 2014.

7.
A. Karpathy and L. Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions," Proc. of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pp. 3128-3137, 2015.

8.
K. Xu, A. Courville, R. S. Zemel, and Y. Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," Proc. of the 32nd International Conference on Machine Learning, 2015.

9.
M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial Transformer Networks," Proc. of the Advances in Neural Information Processing Systems 28, pp. 2008-2016, 2015.

10.
J. Ba, R. Grosse, R. Salakhutdinov, and B. Frey, "Learning Wake-Sleep Recurrent Attention Models," Proc. of the Advances in Neural Information Processing Systems 28, pp. 2575-2583, 2015.

11.
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal Deep Learning," Proc. of the 28th International Conference on Machine Learning, pp. 689-696, 2011.

12.
N. Srivastava and R. R. Salakhutdinov, "Multimodal Learning with Deep Boltzmann Machines," Proc. of the Advances in Neural Information Processing Systems 25, pp. 2222-2230, 2012.

13.
R. Kiros, R. Zemel, and R. Salakhutdinov, "Multimodal Neural Language Models," Proc. of the 31st International Conference on Machine Learning, 2014.

14.
K. Sohn, W. Shang, and H. Lee, "Improved Multimodal Deep Learning with Variation of Information," Proc. of the Advances in Neural Information Processing Systems 27, pp. 2141-2149, 2014.

15.
K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Proc. of the International Conference on Learning Representations, 2015.

16.
M. Malinowski and M. Fritz, "A multi-world approach to question answering about real-world scenes based on uncertain input," Proc. of the Advances in Neural Information Processing Systems 27, pp. 1682-1690, 2014.

17.
S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," Proc. of the 32nd International Conference on Machine Learning, 2015.

18.
V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines," Proc. of the 27th International Conference on Machine Learning, pp. 807-814, 2010.