Fig. 1. General pipeline of the teacher-student learning framework for far-field speaker verification.
Fig. 2. Illustration of the RWCNN-residual model (offline phase; Jung et al., Interspeech 2018 [9]) and the overall speaker verification pipeline (online phase). The three numbers next to each convolution denote the kernel length, the stride, and the number of kernels, respectively.
Table 1. EER of the baseline and the proposed teacher-student based systems (near-field / far-field evaluation). ‘ts’ denotes teacher-student learning, ‘teacher init’ refers to initializing the student network with the trained teacher network, and ‘student w near’ refers to also using near-field utterances for student training.
References
- M. Brandstein and D. Ward, Microphone arrays: signal processing techniques and applications (Springer Science & Business Media, Heidelberg, 2013), pp. 39-60.
- J. Sohn, N. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, 6, 1-3 (1999).
- J. Li, R. Zhao, Z. Chen, C. Liu, X. Xiao, G. Ye, and Y. Gong, "Developing far-field speaker system via teacher-student learning," Proc. ICASSP, 5699-5703 (2018).
- M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Batch-normalized joint training for DNN-based distant speech recognition," Proc. SLT workshop, 28-34 (2016).
- J. Li, R. Zhao, J. Huang, and Y. Gong, "Learning small-size DNN with output-distribution-based criteria," Proc. Interspeech, 1910-1914 (2014).
- J. Jung, H. Heo, Y. Yang, H. Shim, and H. Yu, "A complete end-to-end speaker verification system using deep neural networks: from raw signals to verification result," Proc. ICASSP, 5349-5353 (2018).
- K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," Proc. ECCV, 630-645 (2016).
- S. Ioffe and C. Szegedy, "Batch normalization: accelerating deep network training by reducing internal covariate shift," Proc. ICML, 448-456 (2015).
- J. Jung, H. Heo, Y. Yang, H. Shim, and H. Yu, "Avoiding speaker overfitting in end-to-end DNNs using raw waveform for text-independent speaker verification," Proc. Interspeech, 3583-3587 (2018).