Hyperparameter experiments on end-to-end automatic speech recognition

  • Yang, Hyungwon (Department of English Language and Literature, Korea University)
  • Nam, Hosung (Department of English Language and Literature, Korea University)
  • Received : 2021.01.31
  • Accepted : 2021.03.16
  • Published : 2021.03.31

Abstract

End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduction of the self-attention network, Transformer. However, because of the long training time and the large number of hyperparameters, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of the hyperparameters in the Transformer network to answer two questions: which hyperparameters play a critical role in task performance, and which in training speed. The Transformer network under study consists of encoder and decoder networks combined with Connectionist Temporal Classification (CTC). We trained the model on Wall Street Journal (WSJ) SI-284 and tested it on dev93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and a range of values was used for each in the experiments. The results show that the "num blocks" and "linear units" hyperparameters in the encoder and decoder networks reduce the Word Error Rate (WER) significantly, and the performance gain is more prominent when they are altered in the encoder network. Training duration also increased linearly as the values of "num blocks" and "linear units" grew. Based on the experimental results, we combined the optimal values of the individual hyperparameters and reduced the WER to 2.9/1.9 on dev93 and eval92, respectively.
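To make the two headline hyperparameters concrete, the following is a minimal PyTorch sketch, not ESPnet's actual configuration API: "num blocks" corresponds to the number of stacked Transformer layers, and "linear units" to the width of each layer's position-wise feed-forward sublayer. All specific values below (depth, width, model dimension, head count) are illustrative assumptions, not the paper's tuned settings.

```python
import torch
import torch.nn as nn

# Hypothetical example values, not the paper's tuned settings.
num_blocks = 12      # "num blocks": how many Transformer layers are stacked
linear_units = 2048  # "linear units": feed-forward sublayer width
d_model = 256        # attention/embedding dimension (assumed)
nhead = 4            # number of attention heads (assumed)

# dim_feedforward plays the role of "linear units";
# num_layers plays the role of "num blocks".
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                   dim_feedforward=linear_units)
encoder = nn.TransformerEncoder(layer, num_layers=num_blocks)

# Toy forward pass over 100 acoustic frames for a batch of 8 utterances,
# using PyTorch's default (seq_len, batch, feature) layout.
feats = torch.randn(100, 8, d_model)
print(encoder(feats).shape)  # torch.Size([100, 8, 256])
```

Note that parameter count and per-step compute grow roughly linearly in both knobs, which is consistent with the abstract's observation that training duration increases linearly with "num blocks" and "linear units".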

Keywords

References

  1. Chang, X., Zhang, W., Qian, Y., Le Roux, J., & Watanabe, S. (2020, May). End-to-end multi-speaker speech recognition with transformer. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6134-6138). Barcelona, Spain.
  2. Gale, W. A., & Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3), 217-237. https://doi.org/10.1080/09296179508590051
  3. Graves, A., Fernandez, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). Pittsburgh, PA.
  4. James, F. (2000). Modified Kneser-Ney smoothing of n-gram models (RIACS Technical Report 00.07). Mountain View, CA: Research Institute for Advanced Computer Science. Retrieved from https://www.researchgate.net/profile/Frankie-James/publication/255479295_Modified_Kneser-Ney_Smoothing_of_n-gram_Models/links/54d156750cf28959aa7adc08/Modified-Kneser-Ney-Smoothingof-n-gram-Models.pdf
  5. Kim, S., Bae, S., & Won, C. (2020). KoSpeech: open-source toolkit for end-to-end Korean speech recognition. arXiv. Retrieved from https://arxiv.org/abs/2009.03092
  6. Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv. Retrieved from https://arxiv.org/abs/1412.6980
  7. Koutsoukas, A., Monaghan, K. J., Li, X., & Huan, J. (2017). Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. Journal of Cheminformatics, 9(1), 1-13. https://doi.org/10.1186/s13321-016-0187-6
  8. Lakomkin, E., Zamani, M. A., Weber, C., Magg, S., & Wermter, S. (2019, May). Incorporating end-to-end speech recognition models for sentiment analysis. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA) (pp. 7976-7982). Montreal, QC.
  9. LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (2012). Efficient backprop. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural networks: tricks of the trade (2nd ed., Vol. 7700, pp. 9-48). Berlin, Germany: Springer.
  10. Miao, H., Cheng, G., Gao, C., Zhang, P., & Yan, Y. (2020, May). Transformer-based online CTC/attention end-to-end speech recognition architecture. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6084-6088). Barcelona, Spain.
  11. Karita, S., Soplin, N. E. Y., Watanabe, S., Delcroix, M., Ogawa, A., & Nakatani, T. (2019, September). Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. Proceedings of the Interspeech 2019. Graz, Austria.
  12. Okewu, E., Adewole, P., & Sennaike, O. (2019, July). Experimental comparison of stochastic optimizers in deep learning. Proceedings of the International Conference on Computational Science and Its Applications (pp. 704-715). Saint Petersburg, Russia.
  13. Popel, M., & Bojar, O. (2018). Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1), 43-70. https://doi.org/10.2478/pralin-2018-0002
  14. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. Retrieved from https://arxiv.org/abs/1706.03762
  15. Wang, C., Wu, Y., Du, Y., Li, J., Liu, S., Lu, L., Ren, S., ... Zhou, M. (2019). Semantic mask for transformer based end-to-end speech recognition. arXiv. Retrieved from https://arxiv.org/abs/1912.03010
  16. Watanabe, S., Boyer, F., Chang, X., Guo, P., Hayashi, T., Higuchi, Y., Hori, T., ... Zhang, W. (2020). The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans. arXiv. Retrieved from https://arxiv.org/abs/2012.13006
  17. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., ... Ochiai, T. (2018). ESPnet: end-to-end speech processing toolkit. arXiv. Retrieved from https://arxiv.org/abs/1804.00015
  18. Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017). Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240-1253. https://doi.org/10.1109/JSTSP.2017.2763455
  19. Wei, C., Yu, Z., & Fong, S. (2018, February). How to build a chatbot: chatbot framework and its capabilities. Proceedings of the 2018 10th International Conference on Machine Learning and Computing (pp. 369-373). Macau, China.
  20. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., & Hsieh, C. J. (2019). Large batch optimization for deep learning: training BERT in 76 minutes. arXiv. Retrieved from https://arxiv.org/abs/1904.00962