
Rank-weighted reconstruction feature for a robust deep neural network-based acoustic model

  • Chung, Hoon (SW Contents Research Laboratory, Electronics and Telecommunications Research Institute)
  • Park, Jeon Gue (SW Contents Research Laboratory, Electronics and Telecommunications Research Institute)
  • Jung, Ho-Young (SW Contents Research Laboratory, Electronics and Telecommunications Research Institute)
  • Received : 2018.05.11
  • Accepted : 2018.09.05
  • Published : 2019.04.07

Abstract

In this paper, we propose a rank-weighted reconstruction feature to improve the robustness of a feed-forward deep neural network (FFDNN)-based acoustic model. In an FFDNN-based acoustic model, the input feature is constructed by vectorizing a submatrix created by slicing the feature vectors of the frames within a context window. In this type of feature construction, choosing an appropriate context window size is important because it determines the amount of trivial or discriminative information, such as redundancy or temporal context, in the input features. However, it is questionable whether a single parameter can sufficiently control this quantity of information. We therefore investigate input feature construction from the perspectives of rank and nullity, and propose a rank-weighted reconstruction feature that retains the speech information components while reducing the trivial components. The proposed method was evaluated on the TIMIT phone recognition and Wall Street Journal (WSJ) tasks: it reduced the phone error rate on TIMIT from 18.4% to 18.0%, and the word error rate on WSJ from 4.70% to 4.43%.
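To make the construction concrete, below is a minimal Python sketch of the two steps the abstract describes: slicing a context-window submatrix out of the frame sequence, and reconstructing that submatrix from its SVD with rank-dependent weights before vectorizing it into the DNN input. The window half-width (w = 5), the 40-dimensional filterbank features, and the energy-proportional weighting are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def context_window_feature(frames: np.ndarray, t: int, w: int = 5) -> np.ndarray:
    """Slice the 2w+1 frames centered on frame t into a (2w+1, d) submatrix.
    `frames` is a (T, d) array of per-frame feature vectors (e.g., filterbanks);
    edge frames are repeated at the utterance boundaries."""
    T, _ = frames.shape
    idx = np.clip(np.arange(t - w, t + w + 1), 0, T - 1)
    return frames[idx]

def rank_weighted_reconstruction(X: np.ndarray) -> np.ndarray:
    """Reconstruct X from its SVD, down-weighting each component by its share
    of the total singular-value energy, so the leading (speech-dominant)
    components are largely retained while the trailing (presumably trivial)
    components are attenuated.  This energy-proportional weighting is an
    illustrative assumption, not necessarily the paper's exact scheme."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    weights = s / s.sum()            # energy share of each ranked component
    return (U * (s * weights)) @ Vt  # equivalent to U @ diag(s * weights) @ Vt

# Usage: build one DNN input vector from a (T, 40) filterbank sequence.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 40)).astype(np.float32)  # stand-in features
X = context_window_feature(frames, t=100, w=5)               # (11, 40) submatrix
x_input = rank_weighted_reconstruction(X).ravel()            # 440-dim input vector
```

Any monotonically decreasing weight over the ranked singular values fits the same template; the paper's specific weighting function should be taken from the full text.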

