Hyperparameter experiments on end-to-end automatic speech recognition

  • Yang, Hyungwon (Department of English Language and Literature, Korea University)
  • Nam, Hosung (Department of English Language and Literature, Korea University)
  • Received : 2021.01.31
  • Accepted : 2021.03.16
  • Published : 2021.03.31

Abstract

End-to-end (E2E) automatic speech recognition (ASR) has achieved promising performance gains with the introduction of the self-attention network, Transformer. However, because of the long training time and the large number of hyperparameters, finding the optimal hyperparameter set is computationally expensive. This paper investigates the impact of the hyperparameters in the Transformer network to answer two questions: which hyperparameters play a critical role in task performance, and which in training speed. The Transformer network under study consists of encoder and decoder networks combined with Connectionist Temporal Classification (CTC). We trained the model on Wall Street Journal (WSJ) SI-284 and tested it on dev93 and eval92. Seventeen hyperparameters were selected from the ESPnet training configuration, and a range of values was used for each in the experiments. The results show that the "num blocks" and "linear units" hyperparameters in the encoder and decoder networks reduce the Word Error Rate (WER) significantly, and the performance gain is more prominent when they are altered in the encoder network. Training duration also increased linearly as the values of "num blocks" and "linear units" grew. Based on the experimental results, we combined the optimal values of the individual hyperparameters and reduced the WER to 2.9/1.9 on dev93 and eval92, respectively.
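To make the two headline hyperparameters concrete, the following is a minimal PyTorch sketch, not ESPnet's actual configuration API: "num blocks" corresponds to the number of stacked Transformer layers, and "linear units" to the width of each layer's position-wise feed-forward sublayer. All specific values below (depth, width, model dimension, head count) are illustrative assumptions, not the paper's tuned settings.

```python
import torch
import torch.nn as nn

# Hypothetical example values, not the paper's tuned settings.
num_blocks = 12      # "num blocks": how many Transformer layers are stacked
linear_units = 2048  # "linear units": feed-forward sublayer width
d_model = 256        # attention/embedding dimension (assumed)
nhead = 4            # number of attention heads (assumed)

# dim_feedforward plays the role of "linear units";
# num_layers plays the role of "num blocks".
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                   dim_feedforward=linear_units)
encoder = nn.TransformerEncoder(layer, num_layers=num_blocks)

# Toy forward pass over 100 acoustic frames for a batch of 8 utterances,
# using PyTorch's default (seq_len, batch, feature) layout.
feats = torch.randn(100, 8, d_model)
print(encoder(feats).shape)  # torch.Size([100, 8, 256])
```

Note that parameter count and per-step compute grow roughly linearly in both knobs, which is consistent with the abstract's observation that training duration increases linearly with "num blocks" and "linear units".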

Keywords

References

  1. Chang, X., Zhang, W., Qian, Y., Le Roux, J., & Watanabe, S. (2020, May). End-to-end multi-speaker speech recognition with transformer. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6134-6138). Barcelona, Spain.
  2. Gale, W. A., & Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3), 217-237. https://doi.org/10.1080/09296179508590051
  3. Graves, A., Fernandez, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd International Conference on Machine Learning (pp. 369-376). Pittsburgh, PA.
  4. James, F. (2000). Modified Kneser-Ney smoothing of n-gram models (RIACS Technical Report 00.07). Mountain View, CA: Research Institute for Advanced Computer Science. Retrieved from https://www.researchgate.net/profile/Frankie-James/publication/255479295_Modified_Kneser-Ney_Smoothing_of_n-gram_Models/links/54d156750cf28959aa7adc08/Modified-Kneser-Ney-Smoothingof-n-gram-Models.pdf
  5. Kim, S., Bae, S., & Won, C. (2020). KoSpeech: open-source toolkit for end-to-end Korean speech recognition. arXiv. Retrieved from https://arxiv.org/abs/2009.03092
  6. Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv. Retrieved from https://arxiv.org/abs/1412.6980
  7. Koutsoukas, A., Monaghan, K. J., Li, X., & Huan, J. (2017). Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. Journal of Cheminformatics, 9(1), 1-13. https://doi.org/10.1186/s13321-016-0187-6
  8. Lakomkin, E., Zamani, M. A., Weber, C., Magg, S., & Wermter, S. (2019, May). Incorporating end-to-end speech recognition models for sentiment analysis. Proceedings of the 2019 International Conference on Robotics and Automation (ICRA) (pp. 7976-7982). Montreal, QC.
  9. LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (2012). Efficient backprop. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural networks: tricks of the trade (2nd ed., Vol. 7700, pp. 9-48). Berlin, Germany: Springer.
  10. Miao, H., Cheng, G., Gao, C., Zhang, P., & Yan, Y. (2020, May). Transformer-based online CTC/attention end-to-end speech recognition architecture. Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6084-6088). Barcelona, Spain.
  11. Karita, S., Soplin, N. E. Y., Watanabe, S., Delcroix, M., Ogawa, A., & Nakatani, T. (2019, September). Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. Proceedings of the Interspeech 2019. Graz, Austria.
  12. Okewu, E., Adewole, P., & Sennaike, O. (2019, July). Experimental comparison of stochastic optimizers in deep learning. Proceedings of the International Conference on Computational Science and Its Applications (pp. 704-715). Saint Petersburg, Russia.
  13. Popel, M., & Bojar, O. (2018). Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1), 43-70. https://doi.org/10.2478/pralin-2018-0002
  14. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. Retrieved from https://arxiv.org/abs/1706.03762
  15. Wang, C., Wu, Y., Du, Y., Li, J., Liu, S., Lu, L., Ren, S., ... Zhou, M. (2019). Semantic mask for transformer based end-to-end speech recognition. arXiv. Retrieved from https://arxiv.org/abs/1912.03010
  16. Watanabe, S., Boyer, F., Chang, X., Guo, P., Hayashi, T., Higuchi, Y., Hori, T., ... Zhang, W. (2020). The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans. arXiv. Retrieved from https://arxiv.org/abs/2012.13006
  17. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., ... Ochiai, T. (2018). ESPnet: end-to-end speech processing toolkit. arXiv. Retrieved from https://arxiv.org/abs/1804.00015
  18. Watanabe, S., Hori, T., Kim, S., Hershey, J. R., & Hayashi, T. (2017). Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1240-1253. https://doi.org/10.1109/JSTSP.2017.2763455
  19. Wei, C., Yu, Z., & Fong, S. (2018, February). How to build a chatbot: chatbot framework and its capabilities. Proceedings of the 2018 10th International Conference on Machine Learning and Computing (pp. 369-373). Macau, China.
  20. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., & Hsieh, C. J. (2019). Large batch optimization for deep learning: training BERT in 76 minutes. arXiv. Retrieved from https://arxiv.org/abs/1904.00962