Development of articulatory estimation model using deep neural network

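  • 유희조 (Department of Psychology, Korea University)
  • 양형원 (Department of English Language and Literature, Korea University)
  • 강재구 (Department of English Language and Literature, Korea University)
  • 조영선 (Department of English Language and Literature, Korea University)
  • 황성하 (Department of English Language and Literature, Korea University)
  • 홍연정 (Department of English Language and Literature, Korea University)
  • 조예진 (Department of English Language and Literature, Korea University)
  • 김서현 (Department of English Language and Literature, Korea University)
  • 남호성 (Korea University)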
  • Received : 2016.05.30
  • Accepted : 2016.09.20
  • Published : 2016.09.30

Abstract

Despite its importance, speech inversion (acoustic-to-articulatory mapping) is a non-trivial problem because the mapping is highly non-linear and non-unique. This study compared the performance of a Deep Neural Network (DNN) with that of a traditional Artificial Neural Network (ANN) on this problem. The models were trained on the Wisconsin X-ray Microbeam Database, with the acoustic signal as input and the articulatory pellet information as output. The performance of the ANN deteriorated as the number of hidden layers increased. In contrast, the DNN showed lower and more stable RMS error even with up to 10 hidden layers, suggesting that a DNN can learn the acoustic-to-articulatory inversion mapping more efficiently than an ANN.
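The abstract compares the models by RMS error between predicted and measured articulatory pellet positions. As a minimal sketch of that metric, the function below pools squared differences over all frames and coordinates. This pooling is an assumption: the abstract does not say whether the error is computed per pellet, per speaker, or after normalization. The name `rms_error` and the toy values are hypothetical and are not taken from the database.

```python
import math


def rms_error(predicted, measured):
    """Root-mean-square error pooled over all frames and coordinates.

    Both arguments are lists of frames; each frame is a list of pellet
    coordinates. Pooling over everything is an assumption, not the
    paper's stated procedure.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured have different frame counts")
    squared_sum = 0.0
    count = 0
    for pred_frame, true_frame in zip(predicted, measured):
        if len(pred_frame) != len(true_frame):
            raise ValueError("frames have different coordinate counts")
        for p, t in zip(pred_frame, true_frame):
            squared_sum += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no data to evaluate")
    return math.sqrt(squared_sum / count)


# Hypothetical toy data: 2 frames x 2 coordinates (not real measurements).
measured = [[0.0, 0.0], [1.0, 1.0]]
predicted = [[3.0, 4.0], [1.0, 1.0]]
print(rms_error(predicted, measured))  # 2.5
```

With a metric like this, models of different depths can be compared on the same held-out frames, where a lower value means the predicted positions are closer to the measured ones.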
