Performance Comparison of Deep Feature Based Speaker Verification Systems
  • Journal title : Phonetics and Speech Sciences
  • Volume 7, Issue 4,  2015, pp.9-16
  • Publisher : The Korean Society of Speech Sciences
  • DOI : 10.13064/KSSS.2015.7.4.009
 Title & Authors
Performance Comparison of Deep Feature Based Speaker Verification Systems
Kim, Dae Hyun; Seong, Woo Kyeong; Kim, Hong Kook
 Abstract
In this paper, several experiments are performed with deep neural network (DNN) based features to compare the performance of speaker verification (SV) systems. To this end, input features for a DNN, such as mel-frequency cepstral coefficients (MFCCs), linear-frequency cepstral coefficients (LFCCs), and perceptual linear prediction (PLP) coefficients, are first compared in terms of SV performance. After that, the effects of the DNN training method and the structure of the DNN hidden layers on SV performance are investigated for each type of feature. The performance of an SV system is then evaluated using an i-vector or probabilistic linear discriminant analysis (PLDA) scoring method. The SV experiments show that a tandem feature, formed by combining a DNN bottleneck feature with an MFCC feature, gives the best performance when the DNNs are configured with rectangular hidden layers and trained with a supervised training method.
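The tandem feature described above can be sketched as follows: frames are forwarded through a DNN up to a narrow bottleneck layer, and the per-frame bottleneck activations are concatenated with the per-frame MFCC vectors. This is a minimal illustrative sketch only; the layer sizes, ReLU activations, and random weights below are assumptions for demonstration, not the paper's trained network or its supervised training procedure.

```python
import numpy as np

def extract_bottleneck(frames, weights, biases):
    """Forward frames through hypothetical DNN hidden layers up to the
    bottleneck layer. ReLU activations and the given weights are
    illustrative assumptions, not the paper's trained parameters."""
    h = frames
    for W, b in zip(weights, biases):
        h = np.maximum(0.0, h @ W + b)  # ReLU hidden activation (assumption)
    return h

def tandem_feature(mfcc, bottleneck):
    """Concatenate per-frame MFCC and bottleneck vectors along the
    feature dimension to form the tandem feature."""
    return np.concatenate([mfcc, bottleneck], axis=1)

# Toy example: 100 frames of 13-dim MFCC, two 64-unit "rectangular"
# hidden layers, and a 40-dim bottleneck layer (all sizes hypothetical).
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))
dims = [13, 64, 64, 40]
weights = [rng.standard_normal((i, o)) * 0.1 for i, o in zip(dims, dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]

bn = extract_bottleneck(mfcc, weights, biases)
feat = tandem_feature(mfcc, bn)
print(feat.shape)  # (100, 53): 13 MFCC dims + 40 bottleneck dims per frame
```

In a full SV pipeline, these tandem features would then feed i-vector extraction and PLDA scoring, as evaluated in the paper.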
 Keywords
speaker verification; deep neural network; tandem feature
 Language
Korean
 References
1.
Kinnunen, T. & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, Vol. 52, No. 1, 12-40.

2.
Reynolds, D. A., Quatieri, T. F. & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, Vol. 10, No. 1, 19-41.

3.
Kenny, P., Boulianne, G., Ouellet, P. & Dumouchel, P. (2007). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, 1435-1447.

4.
Matrouf, D., Scheffer, N., Fauve, B. G. & Bonastre, J. F. (2007). A straightforward and efficient implementation of the factor analysis model for speaker verification. In Proceedings of Interspeech, Antwerp, Belgium, 1242-1245.

5.
Dehak, N., Dehak, R., Glass, J. R., Reynolds, D. A. & Kenny, P. (2010). Cosine similarity scoring without score normalization techniques. In Proceedings of Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 71-75.

6.
Fu, T., Qian, Y., Liu, Y. & Yu, K. (2014). Tandem deep features for text-dependent speaker verification. In Proceedings of Interspeech, Singapore, 1327-1331.

7.
Yu, D. & Seltzer, M. L. (2011). Improved bottleneck features using pretrained deep neural networks. In Proceedings of Interspeech, Florence, Italy, 237-240.

8.
Zhang, Y., Chuangsuwanich, E., & Glass, J. (2014). Extracting deep neural network bottleneck features using low-rank matrix factorization. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 185-189.

9.
Liu, Y., Fu, T., Fan, Y., Qian, Y., & Yu, K. (2014). Speaker verification with deep features. In Proceedings of International Joint Conference on Neural Networks (IJCNN), Beijing, China, 747-753.

10.
Kanagasundaram, A. (2014). Speaker verification using I-vector features. Ph.D. Dissertation, Queensland University of Technology.

11.
Kenny, P., Boulianne, G. & Dumouchel, P. (2005). Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3, 345-354.

12.
Bishop, C. M. (2007). Pattern Recognition and Machine Learning (Information Science and Statistics), Springer.

13.
Prince, S. J. & Elder, J. H. (2007). Probabilistic linear discriminant analysis for inferences about identity. In Proceedings of IEEE International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, 1-8.

14.
Lee, K. A., Larcher, A., You, C. H., Ma, B. & Li, H. (2013). Multi-session PLDA scoring of i-vector for partially open-set speaker detection. In Proceedings of Interspeech, Lyon, France, 3651-3655.

15.
Kenny, P. (2010). Bayesian speaker verification with heavy-tailed priors. In Proceedings of Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, paper no. 014.

16.
Sainath, T. N., Kingsbury, B. & Ramabhadran, B. (2012). Auto-encoder bottleneck features using deep belief networks. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 4153-4156.

17.
Larcher, A., Bonastre, J. F., Fauve, B. G., Lee, K. A., Levy, C., Li, H. & Parfait, J. Y. (2013). ALIZE 3.0-open source toolkit for state-of-the-art speaker recognition. In Proceedings of Interspeech, Lyon, France, 2768-2772.

18.
Bonastre, J. F., Wils, F. & Meignier, S. (2005). ALIZE, a free toolkit for speaker recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Philadelphia, PA, 737-740.

19.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N. & Veselý, K. (2011). The Kaldi speech recognition toolkit. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Honolulu, HI, 1-4.

20.
Brummer, N. & De Villiers, E. (2010). The speaker partitioning problem. In Proceedings of Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 194-201.

21.
Greenberg, C. S., Stanford, V. M., Martin, A. F., Yadagiri, M., Doddington, G. R., Godfrey, J. J. & Hernandez-Cordero, J. (2013). The 2012 NIST speaker recognition evaluation. In Proceedings of Interspeech, Lyon, France, 1971-1975.