Investigating an Automatic Method for Summarizing and Presenting a Video Speech Using Acoustic Features

A Study on the Automatic Extraction and Presentation of Video Speech Summaries Using Acoustic Features

  • Received : 2012.11.21
  • Accepted : 2012.12.13
  • Published : 2012.12.30

Abstract

Two fundamental aspects of speech summary generation are the extraction of key speech content and the style of presentation of the extracted synopses. We first investigated whether the acoustic features of speaking rate, pitch pattern, and intensity are equally important and, if not, which of them can be modeled most effectively to compute the significance of segments for lecture summarization. We found that intensity (the difference between the maximum and minimum dB within a segment) is the most effective factor for speech summarization. We then evaluated this intensity-based method against a keyword-based method, comparing both the quality of the summaries each produces and the similarity of the weight values the two methods assign to segments. Finally, we investigated how speech summaries should be presented to viewers. In sum, we propose a way to efficiently extract key segments from a speech video using acoustic features and to present the extracted segments to viewers.

Two important aspects of generating a speech summary are extracting the key content from the speech and presenting the extracted content effectively. For the automatic generation of speech summaries of lecture materials, this study analyzed whether summaries can be generated using three acoustic features of speech that can be applied even when no transcript is available, namely speaking rate, pitch (voice height), and intensity (loudness), and examined which of these factors can be used most efficiently. The analysis identified intensity (the difference between the maximum dB and the minimum dB) as the most effective factor. To examine the efficiency and characteristics of this intensity-based approach, we compared it with a transcript keyword-based approach in terms of summary quality, and analyzed the relationship between the weights the two methods assign to each segment (sentence). We then analyzed, from the users' perspective, the characteristics of presenting the extracted key speech segments in audio or text form, and thereby proposed a method for efficiently extracting and presenting speech summaries based on acoustic features.
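The intensity-based scoring described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: each segment (sentence) is scored by its dynamic range, the difference between its maximum and minimum intensity in dB, and the highest-scoring segments are extracted as the summary. In practice the segment boundaries and per-frame dB values would come from a phonetics tool such as Praat; the values below are made up.

```python
# Hypothetical sketch of intensity-based extractive speech summarization:
# score each segment by (max dB - min dB) and keep the top-k segments.

def intensity_score(frames_db):
    """Score a segment by its dynamic range: max dB minus min dB."""
    return max(frames_db) - min(frames_db)

def summarize(segments, k):
    """Return the indices of the k segments with the largest dB range,
    restored to their original (temporal) order."""
    ranked = sorted(range(len(segments)),
                    key=lambda i: intensity_score(segments[i]),
                    reverse=True)
    return sorted(ranked[:k])

# Illustrative per-segment intensity contours (dB per analysis frame).
segments = [
    [62.0, 64.5, 63.0],        # range  2.5
    [55.0, 71.0, 60.0, 58.0],  # range 16.0  <- emphatic segment
    [66.0, 67.5, 66.5],        # range  1.5
    [50.0, 68.0, 52.0],        # range 18.0  <- emphatic segment
]
print(summarize(segments, 2))  # -> [1, 3]
```

Returning the selected indices in temporal order keeps the extracted summary coherent when the segments are played back or displayed in sequence.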

Acknowledgement

Supported by: National Research Foundation of Korea (한국연구재단)

References

  1. Kim, Hyun-Hee (2011). A study on the interactive effect of spoken words and imagery not synchronized in multimedia surrogates for video gisting. Journal of the Korean Society for Library and Information Science, 45(2), 97-118. http://dx.doi.org/10.4275/KSLIS.2011.45.2.097
  2. Chung, Young Mee (2007). Information retrieval research. Seoul: Gumi Trading Publisher.
  3. Boersma, P., & Weenink, D. (2006). Praat: Doing phonetics by computer. Retrieved from http://www.praat.org/
  4. Cawkell, A. (1995). A guide to image processing and picture management. Aldershot, Hampshire: Gower Publishing Ltd.
  5. Chen, B., & Lin, S. (2012). A risk-aware modeling framework for speech summarization. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 211-222. http://dx.doi.org/10.1109/TASL.2011.2159596
  6. Ding, W., Marchionini, G., & Soergel, D. (1999). Multimodal surrogates for video browsing. Proceedings of the Fourth ACM conference on Digital Libraries, 85-93.
  7. Fujii, Y., Yamamoto, K., Kitaoka, N., & Nakagawa, S. (2008). Class lecture summarization taking into account consecutiveness of important sentences. Proceedings of Interspeech, 2438-2441.
  8. Furui, S., Kikuchi, T., Shinnaka, Y., & Hori, C. (2004). Speech-to-text and speech-to-speech summarization of spontaneous speech. IEEE Transactions on Speech Audio Process, 12(4), 401-408. http://dx.doi.org/10.1109/TSA.2004.828699
  9. Hirschberg, J., & Nakatani, C. (1996). A prosodic analysis of discourse segments in direction-given monologues. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 286-293.
  10. Lin, S., Chen, B., & Wang, H. (2009). A comparative study of probabilistic ranking models for Chinese spoken document summarization. ACM Transactions on Asian Language Information Processing, 8(1), 1-23. http://dx.doi.org/10.1145/1482343.1482346
  11. Liu, Y., & Hakkani-Tur, D. (2011). Speech summarization. In G. Tur & R. De Mori (Eds.), Spoken language understanding: Systems for extracting semantic information from speech (pp. 357-392). Chichester, UK: John Wiley & Sons, Ltd.
  12. Maskey, S. (2008). Automatic broadcast news speech summarization. Unpublished doctoral dissertation, Columbia University.
  13. Maskey, S., & Hirschberg, J. (2005). Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization. Proceedings of Interspeech, 621-624.
  14. Maskey, S., & Hirschberg, J. (2006). Summarizing speech without text using Hidden Markov Models. Proceedings of the Human Language Technology Conference of the NAACL (Companion Volume: Short Papers), Association for Computational Linguistics, 89-92. Retrieved from http://acl.ldc.upenn.edu/N/N06/N06-2023.pdf
  15. Marchionini, G., Song, Y., & Farrell, R. (2009). Multimedia surrogates for video gisting: Toward combining spoken words and imagery. Information Processing and Management, 45(6), 615-630. http://dx.doi.org/10.1016/j.ipm.2009.05.007
  16. Murray, G., Renals, S., & Carletta, J. (2005). Extractive summarization of meeting recordings. Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH), 593-596. Retrieved from http://www.cstr.ed.ac.uk/downloads/publications/2005/murray-eurospeech05.pdf
  17. Turner, J. (1994). Determining the subject content of still and moving documents for storage and retrieval: An experimental investigation. Unpublished doctoral dissertation, University of Toronto.
  18. Turney, P. (2000). Learning algorithms for keyphrase extraction. Information Retrieval, 2(4), 303-336. https://doi.org/10.1023/A:1009976227802
  19. van Houten, Y., Oltmans, E., & van Setten, M. (2000). Video browsing and summarization (Rep. No. TI/RS/2000/63). Enschede: Telematica Instituut. Retrieved from https://doc.telin.nl/dscgi/ds.py/Get/File-12409/
  20. Wang, D., & Narayanan, S. (2007). An acoustic measure for word prominence in spontaneous speech. IEEE Transactions on Audio, Speech, and Language Processing, 15(2), 690-701. http://dx.doi.org/10.1109/TASL.2006.881703
  21. Xie, S., Hakkani-Tur, D., Favre, B., & Liu, Y. (2009). Integrating prosodic features in extractive meeting summarization. Proceedings of the 11th Biannual IEEE Workshop on Automatic Speech Recognition and Understanding, 387-391. Retrieved from http://www.hlt.utdallas.edu/~shasha/papers/ASRU2009_xie.pdf
  22. Zhang, J., & Fung, P. (2007). Speech summarization without lexical features for Mandarin broadcast news. Proceedings of NAACL HLT (Companion Volume), 213-216.
  23. Zhang, Z., & Fung, P. (2012). Active learning with semi-automatic annotation for extractive speech summarization. ACM Transactions on Speech and Language Processing, 8(4), 1-25. http://dx.doi.org/10.1145/2093153.2093155
  24. Zhang, J., Chan, H., & Fung, P. (2007). Improving lecture speech summarization using rhetorical information. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 195-200.
  25. Zhang, J., Chan, H., Fung, P., & Cao, L. (2007). A comparative study on speech summarization of broadcast news and lecture speech. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2781-2784.
  26. Zhu, X., Penn, G., & Rudzicz, F. (2009). Summarizing multiple spoken documents: Finding evidence from untranscribed audio. Proceedings of ACL/AFNLP, 549-557. Retrieved from http://www.aclweb.org/anthology-new/P/P09/P09-1062.pdf