A study on Gaussian mixture model deep neural network hybrid-based feature compensation for robust speech recognition in noisy environments

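  • 윤기무 (Department of Computer Science and Engineering, Incheon National University) ;
  • 김우일 (Department of Computer Science and Engineering, Incheon National University)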
  • Received : 2018.09.27
  • Accepted : 2018.11.22
  • Published : 2018.11.30

Abstract

This paper proposes a GMM (Gaussian Mixture Model)-DNN (Deep Neural Network) hybrid-based feature compensation method for effective speech recognition in noisy environments. In the proposed algorithm, the posterior probability required by the conventional GMM-based feature compensation method is calculated using a DNN. Experimental results using the Aurora 2.0 framework and database demonstrate that the proposed GMM-DNN hybrid-based feature compensation method is more effective on average than the GMM-based method in both Known and Unknown noisy environments. In particular, the experiments in the Unknown environments show a 9.13 % relative improvement in average WER (Word Error Rate) and considerable improvements in lower SNR (Signal-to-Noise Ratio) conditions such as 0 and 5 dB.
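The abstract states only that the DNN supplies the posterior probability that the GMM-based compensation needs. As a rough illustration of where that posterior enters, the sketch below assumes one common form of GMM-based compensation, in which the clean feature is estimated as the noisy feature minus a posterior-weighted sum of per-component bias vectors. That form, the names `gmm_posteriors`, `compensate`, `biases`, and `dnn_logits`, and all toy values are assumptions for illustration and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gmm_posteriors(y, weights, means, variances):
    """Conventional route: p(k | y) from a diagonal-covariance GMM."""
    log_joint = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for yi, mi, vi in zip(y, mu, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (yi - mi) ** 2 / vi)
        log_joint.append(ll)
    # Normalizing the log joint scores gives the component posteriors.
    return softmax(log_joint)

def compensate(y, posteriors, biases):
    """Assumed estimator: x_hat = y - sum_k p(k | y) * r_k."""
    return [yi - sum(p * r[d] for p, r in zip(posteriors, biases))
            for d, yi in enumerate(y)]

# Hypothetical toy values: a 2-dimensional feature and 2 mixture components.
y = [1.0, 2.0]
weights = [0.5, 0.5]
means = [[0.0, 0.0], [2.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
biases = [[0.5, 0.5], [1.0, -1.0]]   # assumed per-component bias vectors r_k

# GMM-based compensation: posteriors come from the GMM itself.
post_gmm = gmm_posteriors(y, weights, means, variances)
x_hat_gmm = compensate(y, post_gmm, biases)

# Hybrid variant: same estimator, but posteriors come from a DNN output
# layer. dnn_logits is a stand-in here; no actual network is evaluated.
dnn_logits = [0.0, 0.0]
post_dnn = softmax(dnn_logits)
x_hat_dnn = compensate(y, post_dnn, biases)
```

Under this assumption the only difference between the two variants is the source of `posteriors`; the compensation step itself is unchanged.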

Fig. 1. Recognition performance in “Known” noisy environments at different SNRs as average over all environments: Subway, Babble, Car and Exhibition (WER, %).

Fig. 2. Recognition performance in “Unknown” noisy environments at different SNRs as average over all environments: Factory, Babble2, Car2 and Music (WER, %).
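The relative improvement quoted in the abstract is a relative reduction in WER, here computed on the average over SNR conditions. A minimal sketch of that arithmetic follows; the per-SNR WER values are placeholders chosen for illustration and are not results from the paper, and the function names are hypothetical.

```python
def relative_improvement(baseline_wer, proposed_wer):
    """Relative WER reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer

def mean(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Placeholder WERs (%) keyed by SNR in dB. NOT results from the paper.
baseline = {0: 40.0, 5: 20.0, 10: 10.0, 15: 6.0, 20: 4.0}
proposed = {0: 36.0, 5: 18.0, 10: 9.0, 15: 5.4, 20: 3.6}

# Average WER over all SNRs first, then take the relative reduction.
avg_gain = relative_improvement(mean(baseline.values()), mean(proposed.values()))

# Per-SNR gains show where the improvement is concentrated.
gain_by_snr = {snr: relative_improvement(baseline[snr], proposed[snr])
               for snr in baseline}
```

Averaging before taking the ratio weights high-WER (low-SNR) conditions more heavily than averaging the per-SNR percentages would.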

Table 1. Recognition performance in “Known” noisy environments as average over all SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB (WER, %).

Table 2. Recognition performance in “Unknown” noisy environments as average over all SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB (WER, %).

Acknowledgement

Supported by: Incheon National University
