Model adaptation employing DNN-based estimation of noise corruption function for noise-robust speech recognition


  • 윤기무 (Dept. of Computer Engineering, Incheon National University) ;
  • 김우일 (Dept. of Computer Engineering, Incheon National University)
  • Received : 2018.11.09
  • Accepted : 2019.01.23
  • Published : 2019.01.31

Abstract

This paper proposes an acoustic model adaptation method for effective speech recognition in noisy environments. In the proposed algorithm, the noise corruption function is estimated employing a DNN (Deep Neural Network), and the function is applied to the model parameter estimation. The experimental results using the Aurora 2.0 framework and database demonstrate that the proposed model adaptation method is more effective in both known and unknown noisy environments compared to the conventional methods. In particular, the experiments in the unknown environments show an average relative improvement of 15.87 % in WER (Word Error Rate).

This paper proposes an acoustic model adaptation method employing DNN (Deep Neural Network)-based estimation of the noise corruption function for effective speech recognition in noisy environments. A DNN that takes clean speech and noise information as input and produces the feature vectors of the corrupted speech as output is trained to estimate the noise corruption function, which has a nonlinear relationship with its inputs. The estimated corruption function is then applied to the mean vectors of the acoustic model to generate an acoustic model adapted to the noisy environment. In recognition experiments on the Aurora 2.0 data, the proposed model adaptation method outperforms, on average, the conventional front-end and model adaptation methods in both matched and mismatched noise environments. In particular, it achieves a relative improvement of 15.87 % in average error rate in the mismatched noise environments.
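The pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimension, the single hidden layer, the layer sizes, and the random (untrained) weights are all assumptions standing in for the trained corruption-function DNN; in the paper, the network is trained on pairs of clean/noise inputs and corrupted-speech feature targets before being applied to the Gaussian mean vectors of the acoustic model.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 13  # assumed cepstral feature dimension (illustrative)


def init_mlp(in_dim, hidden, out_dim):
    """Random weights standing in for a trained corruption-function DNN."""
    return {
        "W1": rng.standard_normal((in_dim, hidden)) * 0.1,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, out_dim)) * 0.1,
        "b2": np.zeros(out_dim),
    }


def corruption_fn(mlp, clean_feat, noise_feat):
    """Predict corrupted-speech features from clean speech and noise.

    The tanh hidden layer captures the nonlinear relationship between
    clean speech, noise, and the resulting corrupted features.
    """
    x = np.concatenate([clean_feat, noise_feat])
    h = np.tanh(x @ mlp["W1"] + mlp["b1"])
    return h @ mlp["W2"] + mlp["b2"]


def adapt_means(mlp, clean_means, noise_feat):
    """Apply the estimated corruption function to each Gaussian mean
    vector of the clean-trained acoustic model, yielding noise-adapted
    means."""
    return np.stack([corruption_fn(mlp, mu, noise_feat) for mu in clean_means])


# Toy usage: 4 Gaussian mean vectors and one noise estimate.
mlp = init_mlp(2 * DIM, 64, DIM)
clean_means = rng.standard_normal((4, DIM))
noise_feat = rng.standard_normal(DIM)
adapted = adapt_means(mlp, clean_means, noise_feat)
print(adapted.shape)  # (4, 13)
```

The key design point, as described in the abstract, is that adaptation operates on the model parameters (the mean vectors) rather than on the incoming features, so the same estimated corruption function adapts the whole acoustic model once per noise condition.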

Keywords


Fig. 1. Configuration of the deep neural network employed for estimation of noise corruption function in the proposed method.

Table 1. Recognition performance in “known” noisy environments as average over all SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB (WER, %).


Table 2. Recognition performance in “unknown” noisy environments as average over all SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB (WER, %).


References

  1. S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. on Acoustics, Speech, and Signal Processing, 27, 113-120 (1979). https://doi.org/10.1109/TASSP.1979.1163209
  2. Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. on Acoustics, Speech, and Signal Processing, 32, 1109-1121 (1984). https://doi.org/10.1109/TASSP.1984.1164453
  3. P. J. Moreno, B. Raj, and R. M. Stern, "Data-driven environmental compensation for speech recognition: a unified approach," Speech Communication, 24, 267-285 (1998). https://doi.org/10.1016/S0167-6393(98)00025-9
  4. W. Kim and J. H. L. Hansen, "Feature compensation in the cepstral domain employing model combination," Speech Communication, 51, 83-96 (2009). https://doi.org/10.1016/j.specom.2008.06.004
  5. C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density HMMs," Computer Speech and Language, 9, 171-185 (1995). https://doi.org/10.1006/csla.1995.0010
  6. M. J. F. Gales and S. J. Young, "Robust continuous speech recognition using parallel model combination," IEEE Trans. on Speech and Audio Processing, 4, 352-359 (1996). https://doi.org/10.1109/89.536929
  7. J. Du, L.-R. Dai, and Q. Huo, "Synthesized stereo mapping via deep neural networks for noisy speech recognition," ICASSP 2014, 1764-1768 (2014).
  8. K. Han, Y. He, D. Bagchi, E. Fosler-Lussier, and D. L. Wang, "Deep neural network based spectral feature mapping for robust speech recognition," Interspeech 2015, 2484-2488 (2015).
  9. H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," ISCA ITRW ASR2000 (2000).