Building a Korean conversational speech database in the emergency medical domain

  • Kim, Sunhee (Department of French Language Education, Seoul National University) ;
  • Lee, Jooyoung (Department of Linguistics, Seoul National University) ;
  • Choi, Seo Gyeong (Department of English Language and Literature, Seoul National University) ;
  • Ji, Seunghun (Department of Linguistics, Seoul National University) ;
  • Kang, Jeemin (Department of English Language and Literature, Seoul National University) ;
  • Kim, Jongin (Interdisciplinary Program in Cognitive Science, Seoul National University) ;
  • Kim, Dohee (Department of Foreign Language Education, Seoul National University) ;
  • Kim, Boryong (Department of French Language Education, Seoul National University) ;
  • Cho, Eungi (Department of French Language Education, Seoul National University) ;
  • Kim, Hojeong (Department of French Language Education, Seoul National University) ;
  • Jang, Jeongmin (Department of French Language Education, Seoul National University) ;
  • Kim, Jun Hyung (Department of Electronic Engineering, Sogang University) ;
  • Ku, Bon Hyeok (Department of Electronic Engineering, Sogang University) ;
  • Park, Hyung-Min (Department of Electronic Engineering, Sogang University) ;
  • Chung, Minhwa (Department of Linguistics, Seoul National University)
  • Received : 2020.11.14
  • Accepted : 2020.12.15
  • Published : 2020.12.31

Abstract

This paper describes a method of building a Korean conversational speech database in the emergency medical domain and proposes an annotation scheme for the collected data aimed at improving speech recognition performance. A total of 166 conversations, amounting to 8 hours and 35 minutes, were collected; all speech was recorded at 16-bit resolution with a 16 kHz sampling rate. Orthography, pronunciation, dialect, noise, and medical information were manually transcribed using Praat. Baseline speech recognition experiments were then conducted on a subset of the collected and annotated data to illustrate the problems specific to speech recognition in the emergency medical domain and to suggest future research directions. The Korean conversational speech data presented in this paper are first-stage data in the emergency medical domain and are expected to serve as training data for developing conversational systems for emergency medical applications.
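
To make the stated format and annotation layers concrete, the following is a minimal Python sketch, not the authors' published tooling: it checks that a recording matches the 16-bit / 16 kHz specification using the standard-library wave module and lists the labeled intervals of a Praat TextGrid using the third-party textgrid package. The file names, and the assumption that each annotation layer (orthography, pronunciation, dialect, noise, medical information) is stored as a separate TextGrid interval tier, are hypothetical.

```python
# Minimal sketch (not the corpus's actual tooling): validate the recording
# format described in the paper and inspect Praat TextGrid annotation tiers.
# Assumes the third-party `textgrid` package (pip install textgrid); file
# names and tier layout below are hypothetical.
import wave

import textgrid  # parser for Praat TextGrid files


EXPECTED_RATE = 16_000  # 16 kHz sampling rate, as specified in the paper
EXPECTED_WIDTH = 2      # 2 bytes per sample = 16-bit resolution


def check_wav_format(path: str) -> None:
    """Raise if a recording deviates from the 16-bit / 16 kHz specification."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError(f"{path}: rate {wav.getframerate()} != {EXPECTED_RATE}")
        if wav.getsampwidth() != EXPECTED_WIDTH:
            raise ValueError(f"{path}: width {wav.getsampwidth()} bytes != {EXPECTED_WIDTH}")


def list_annotation_tiers(path: str) -> None:
    """Print each interval tier (e.g., orthography, pronunciation, dialect,
    noise, medical information) and its non-empty labeled intervals."""
    tg = textgrid.TextGrid.fromFile(path)
    for tier in tg:
        print(f"Tier: {tier.name}")
        if not isinstance(tier, textgrid.IntervalTier):
            continue  # only interval tiers carry time-aligned transcriptions
        for interval in tier:
            if interval.mark.strip():  # skip unlabeled stretches
                print(f"  {interval.minTime:.2f}-{interval.maxTime:.2f}s: {interval.mark}")


if __name__ == "__main__":
    check_wav_format("conversation_001.wav")            # hypothetical file name
    list_annotation_tiers("conversation_001.TextGrid")  # hypothetical file name
```

Keeping the transcription layers in separate tiers of a single TextGrid, as sketched here, allows each layer (for example, noise labels versus orthography) to be extracted independently when preparing speech recognition training data.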
