A study on Gaussian mixture model deep neural network hybrid-based feature compensation for robust speech recognition in noisy environments

Gaussian mixture model deep neural network hybrid-based feature compensation for effective speech recognition in noisy environments (original Korean title, translated)

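  • 윤기무 (Department of Computer Science and Engineering, Incheon National University) ;
  • 김우일 (Department of Computer Science and Engineering, Incheon National University)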
  • Received : 2018.09.27
  • Accepted : 2018.11.22
  • Published : 2018.11.30

Abstract

This paper proposes a GMM (Gaussian Mixture Model)-DNN (Deep Neural Network) hybrid-based feature compensation method for effective speech recognition in noisy environments. In the proposed algorithm, the posterior probability used by the conventional GMM-based feature compensation method is calculated using a DNN. Experimental results using the Aurora 2.0 framework and database demonstrate that the proposed GMM-DNN hybrid-based feature compensation method is more effective than the GMM-based method in both Known and Unknown noisy environments. In particular, the experiments in the Unknown environments show a 9.13 % relative improvement in average WER (Word Error Rate) and considerable improvements in lower SNR (Signal-to-Noise Ratio) conditions such as 0 and 5 dB.
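The abstract does not give the compensation equations, so the following is only a minimal sketch of the idea it describes. It assumes a common posterior-weighted bias-subtraction form, x_hat = y - sum_k p(k|y) r_k, where r_k is a per-component bias; the function names, the bias vectors, and the toy DNN logits are all hypothetical. The only difference between the two variants is where the posterior p(k|y) comes from: the GMM itself, or a softmax over DNN outputs.

```python
import math

def gmm_posteriors(y, weights, means, variances):
    """p(k | y) for a diagonal-covariance GMM, computed in the log domain."""
    log_joint = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for y_d, mu_d, var_d in zip(y, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * var_d) + (y_d - mu_d) ** 2 / var_d)
        log_joint.append(ll)
    peak = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(v - peak) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def softmax(logits):
    """Turn raw scores (here standing in for DNN outputs) into posteriors."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compensate(y, posteriors, biases):
    """Assumed compensation rule: x_hat = y - sum_k p(k|y) * r_k."""
    x_hat = list(y)
    for p, r in zip(posteriors, biases):
        for d, r_d in enumerate(r):
            x_hat[d] -= p * r_d
    return x_hat

# Toy two-component example; every number below is a placeholder.
y = [1.0, 2.0]                          # noisy feature vector
weights = [0.5, 0.5]
means = [[0.0, 0.0], [1.0, 2.0]]        # noisy-speech GMM means
variances = [[1.0, 1.0], [1.0, 1.0]]
biases = [[0.5, 0.5], [0.1, 0.2]]       # hypothetical per-component biases

post_gmm = gmm_posteriors(y, weights, means, variances)
x_gmm = compensate(y, post_gmm, biases)      # conventional: GMM posterior

dnn_logits = [0.0, 2.0]                      # stand-in for a trained DNN's output
post_dnn = softmax(dnn_logits)
x_hybrid = compensate(y, post_dnn, biases)   # hybrid: DNN posterior
```

Keeping `compensate` independent of the posterior source makes the comparison explicit: swapping `post_gmm` for `post_dnn` changes nothing else in the pipeline.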

Fig. 1. Recognition performance in “Known” noisy environments at different SNRs, averaged over all environments: Subway, Babble, Car, and Exhibition (WER, %).

Fig. 2. Recognition performance in “Unknown” noisy environments at different SNRs, averaged over all environments: Factory, Babble2, Car2, and Music (WER, %).

Table 1. Recognition performance in “Known” noisy environments, averaged over all SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB (WER, %).

Table 2. Recognition performance in “Unknown” noisy environments, averaged over all SNRs: 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB (WER, %).

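For readers unfamiliar with the metric, the sketch below shows how an SNR-averaged WER and a relative improvement of the kind quoted in the abstract are typically computed. The function names and all WER values are hypothetical placeholders for illustration; they are not the paper's results.

```python
def average_wer(wer_by_snr):
    """Mean WER (%) over the SNR conditions given as {snr_db: wer_percent}."""
    return sum(wer_by_snr.values()) / len(wer_by_snr)

def relative_improvement(baseline_wer, proposed_wer):
    """Relative WER reduction in percent: 100 * (baseline - proposed) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer

# Placeholder WERs (%) at 0, 5, 10, 15, and 20 dB SNR; NOT taken from the paper.
baseline = {0: 40.0, 5: 20.0, 10: 10.0, 15: 6.0, 20: 4.0}
proposed = {0: 34.0, 5: 18.0, 10: 9.5, 15: 5.8, 20: 3.9}

gain = relative_improvement(average_wer(baseline), average_wer(proposed))
```

Note that a relative improvement is measured against the baseline WER, so the same absolute reduction counts for more when the baseline error is already low.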

Acknowledgement

Supported by: Incheon National University
