Improved speech emotion recognition using histogram equalization and data augmentation techniques


  • Received : 2017.01.24
  • Accepted : 2017.06.21
  • Published : 2017.06.30

Abstract

We propose a new method to reduce emotion recognition errors caused by variation in speaker characteristics and speech rate. First, to reduce variation in speaker characteristics, we use the histogram equalization (HE) algorithm to map the features of each test speaker onto the distribution of the pooled training data. Second, to handle variation in speech rate, we augment the training data with speech generated at various speech rates. In experiments on EMO-DB, KRN-DB, and eNTERFACE-DB, the proposed method improves weighted accuracy by 34.7%, 23.7%, and 28.1% relative, respectively.
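The two steps in the abstract lend themselves to a short illustration. The Python sketch below pairs a rank-based CDF-matching form of histogram equalization (applied per feature dimension) with speech-rate augmentation via the SoX `tempo` effect, which performs WSOLA-style time-scale modification. The function names, the per-dimension HE variant, and the rate set are illustrative assumptions, not the paper's exact configuration.

    import subprocess
    import numpy as np

    def histogram_equalize(test_feats, train_feats):
        """Map each test-speaker feature dimension onto the empirical
        distribution of the pooled training data (rank-based CDF matching,
        one common realization of histogram equalization).
        Both arrays have shape (num_frames, num_dims)."""
        equalized = np.empty_like(test_feats)
        for d in range(test_feats.shape[1]):
            ref = np.sort(train_feats[:, d])               # training CDF support
            ranks = np.argsort(np.argsort(test_feats[:, d]))
            quantiles = (ranks + 0.5) / len(ranks)         # empirical test CDF values
            idx = np.clip((quantiles * len(ref)).astype(int), 0, len(ref) - 1)
            equalized[:, d] = ref[idx]                     # invert the training CDF
        return equalized

    def augment_speech_rate(wav_in, rates=(0.8, 0.9, 1.1, 1.2)):
        """Write rate-modified copies of a training utterance with SoX's
        `tempo` effect (WSOLA-based: changes rate without shifting pitch).
        The rate set here is an assumption for illustration."""
        for r in rates:
            wav_out = wav_in.replace(".wav", "_tempo%.1f.wav" % r)
            subprocess.run(["sox", wav_in, wav_out, "tempo", str(r)], check=True)

In this sketch, rate augmentation operates on the training waveforms before feature extraction, while equalization adjusts the extracted test features before classification, matching the roles the abstract assigns to each technique.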

