Joint streaming model for backchannel prediction and automatic speech recognition

  • Yong-Seok Choi (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Jeong-Uk Bang (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute) ;
  • Seung Hi Kim (Integrated Intelligence Research Section, Electronics and Telecommunications Research Institute)
  • Received : 2023.08.27
  • Accepted : 2023.12.20
  • Published : 2024.02.20

Abstract

In human conversations, listeners often produce brief backchannels such as "uh-huh" or "yeah." Timely backchannels are crucial for mutual understanding and for building trust between conversational partners. In human-machine conversation systems, users can engage in natural conversations when a conversational agent generates backchannels like a human listener. We propose a method that simultaneously predicts backchannels and recognizes speech in real time. We use a streaming transformer and adopt multitask learning for concurrent backchannel prediction and speech recognition. The experimental results demonstrate the superior performance of our method compared with previous works, while maintaining speech recognition performance comparable to that of a single-task model. Owing to the extremely imbalanced training data distribution, the single-task backchannel prediction model fails to predict any of the backchannel categories, whereas the proposed multitask approach substantially enhances backchannel prediction performance. Notably, in the streaming prediction scenario, backchannel prediction performance improves by up to 18.7% compared with existing methods.
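The full model details are not reproduced on this page, but the approach summarized above (a shared streaming transformer encoder trained with a multitask objective for speech recognition and backchannel prediction) can be illustrated with a minimal PyTorch sketch. Everything below, including the module sizes, the CTC-based ASR head, the three backchannel classes, and the 0.7/0.3 loss weighting, is an illustrative assumption rather than the authors' implementation.

import torch.nn as nn
import torch.nn.functional as F

class JointBackchannelASR(nn.Module):
    """Shared encoder with two task heads (illustrative sketch, not the paper's code)."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=6,
                 vocab_size=2000, n_bc_classes=3):
        super().__init__()
        self.frontend = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=1024,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Two heads share the encoder: frame-wise CTC logits for ASR and a
        # block-level classifier for backchannel categories (e.g., no-BC vs. BC types).
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.bc_head = nn.Linear(d_model, n_bc_classes)

    def forward(self, feats, attn_mask=None):
        # attn_mask would restrict attention to past and current blocks to
        # emulate blockwise streaming; it is omitted here for brevity.
        h = self.encoder(self.frontend(feats), mask=attn_mask)
        asr_logits = self.ctc_head(h)            # (batch, time, vocab)
        bc_logits = self.bc_head(h.mean(dim=1))  # (batch, n_bc_classes)
        return asr_logits, bc_logits

def multitask_loss(asr_logits, bc_logits, tokens, token_lens, feat_lens,
                   bc_labels, asr_weight=0.7):
    # CTC expects (time, batch, vocab) log-probabilities.
    log_probs = asr_logits.log_softmax(-1).transpose(0, 1)
    asr_loss = F.ctc_loss(log_probs, tokens, feat_lens, token_lens,
                          blank=0, zero_infinity=True)
    # Class weights could be added here to counter the highly imbalanced
    # backchannel label distribution noted in the abstract.
    bc_loss = F.cross_entropy(bc_logits, bc_labels)
    return asr_weight * asr_loss + (1.0 - asr_weight) * bc_loss

In a blockwise streaming setup, the encoder would process fixed-size feature chunks with limited right context, and the backchannel head would emit a decision at each block boundary so that a response can be generated in real time while recognition continues.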

Acknowledgement

We would like to thank the Institute of Information & Communications Technology Planning & Evaluation (IITP) for the grant funded by the Korean Government (MSIT) (no. 2022-0-00608, Development of artificial intelligence technology of multimodal interaction for empathetic and social conversations with humans).
