Audio-Visual Fusion for Sound Source Localization and Improved Attention

  • Lee, Byoung-Gi (Center for Cognitive Robotics Research, Korea Institute of Science and Technology) ;
  • Choi, Jong-Suk (Center for Cognitive Robotics Research, Korea Institute of Science and Technology) ;
  • Yoon, Sang-Suk (Center for Intelligent Robotics, Korea Institute of Science and Technology) ;
  • Choi, Mun-Taek (Center for Intelligent Robotics, Korea Institute of Science and Technology) ;
  • Kim, Mun-Sang (Center for Intelligent Robotics, Korea Institute of Science and Technology) ;
  • Kim, Dai-Jin (Dept. Computer Science and Engineering, Postech)
  • Received : 2010.12.10
  • Accepted : 2011.04.13
  • Published : 2011.07.01

Abstract

Service robots are equipped with various sensors such as vision cameras, sonar sensors, laser scanners, and microphones. Although each sensor has its own function, some of them can be combined to perform more complicated tasks. Audio-visual fusion is a typical and powerful combination of audio and video sensors, because audio information is complementary to visual information and vice versa. Human beings likewise depend mainly on visual and auditory information in daily life. In this paper, we present two studies that use audio-visual fusion: one on enhancing the performance of sound source localization, and the other on improving robot attention through sound source localization and face detection.

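The complementarity described in the abstract can be illustrated by a minimal fusion step: a sound-source bearing from the microphones and a face bearing from the camera are merged, weighted by each modality's confidence. This is a hypothetical sketch of the general idea, not the method used in the paper; the function name, confidence values, and weighting scheme are illustrative assumptions.

```python
import math

def fuse_bearings(audio_deg, audio_conf, face_deg, face_conf):
    """Confidence-weighted circular average of two azimuth estimates (degrees).

    Illustrative only: combines an audio direction-of-arrival estimate with
    a face-detection bearing, weighting each by its confidence.
    """
    # Work with unit vectors so angles near the +/-180 degree wrap fuse correctly.
    x = (audio_conf * math.cos(math.radians(audio_deg))
         + face_conf * math.cos(math.radians(face_deg)))
    y = (audio_conf * math.sin(math.radians(audio_deg))
         + face_conf * math.sin(math.radians(face_deg)))
    return math.degrees(math.atan2(y, x))

# A confident face detection pulls the fused estimate toward the visual bearing.
fused = fuse_bearings(audio_deg=30.0, audio_conf=0.4, face_deg=20.0, face_conf=0.8)
```

Because the face detector is given twice the confidence of the audio estimate here, the fused bearing lands closer to the visual direction, which is the intuition behind using vision to refine sound localization.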

References

  1. Nakadai, K., Hidai, K., Okuno, H.G. and Kitano, H., 2001, "Real-Time Multiple Speaker Tracking by Multi-Modal Integration for Mobile Robots," in Proc. Eurospeech 2001, pp. 1193-1196.
  2. Lim, Y. and Choi, J., 2009, "Speaker Selection and Tracking in a Cluttered Environment with Audio and Visual Information," IEEE Trans. Consumer Electronics, Vol. 55(3), pp. 1581-1589. https://doi.org/10.1109/TCE.2009.5278030
  3. Hornstein, J., Lopes, M., Santos-Victor, J. and Lacerda, F., 2006, "Sound Localization for Humanoid Robots - Building Audio-Motor Maps based on the HRTF," in Proc. IEEE/RSJ IROS 2006, pp. 1170-1176.
  4. Chan, V., 2009, "Audio-Visual Sensor Fusion for Object Localization," INE NewsLetter, 8 June.
  5. Zabih, R. and Woodfill, J., 1994, "Non-parametric Local Transforms for Computing Visual Correspondence," in Proc. the 3rd European Conference on Computer Vision, pp. 151-158.
  6. Fröba, B. and Ernst, A., 2004, "Face Detection with the Modified Census Transform," in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, pp. 91-96.
  7. Jun, B.-J. and Kim, D., 2007, "Robust Real-time Face Detection Using Face Certainty Map," in Proc. ICB 2007, pp. 29-38.
  8. Haas, H., 1972, "The Influence of a Single Echo on the Audibility of Speech," Journal of the Audio Engineering Society, Vol. 20, pp. 146-159.
  9. Lee, B.-G., Choi, J.-S., Kim, D. and Kim, M., 2010, "Verification of Sound Source Localization in Reverberation Room and its Real Time Adaptation Using Visual Information," in Proc. ARSO 2010, pp. 176-181.

Cited by

  1. Interaction Intent Analysis of Multiple Persons using Nonverbal Behavior Features vol.19, pp.8, 2013, https://doi.org/10.5302/J.ICROS.2013.13.1893