A Weighted Feature Voting Approach for Robust and Real-Time Voice Activity Detection

  • Received: 2010.03.16
  • Reviewed: 2010.09.07
  • Published: 2011.02.28

Abstract

This paper presents a robust, real-time voice activity detection (VAD) approach that is easy to understand and implement. The approach combines several short-term speech/nonspeech discriminating features in a voting paradigm to achieve reliable performance across different environments. It builds on a recently proposed method that uses the spectral peak valley difference (SPVD) as a feature for silence detection, and its main contribution is to apply a set of additional features alongside SPVD to improve VAD robustness. A weighted voting scheme takes the discriminative power of the employed features into account. Experiments show that the proposed approach is more robust than the baseline in several respects, including channel distortion and threshold selection. It is also compared with several other VAD techniques to further confirm these results. With the proposed weighted voting approach, the average VAD performance rises to 89.29% across 5 noise types and 8 SNR levels. This is 13.79% higher than the approach based on SPVD alone and 2.25% higher than the unweighted voting scheme.
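
The weighted voting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature names other than SPVD, the thresholds, the weights, the decision ratio, and the convention that a higher score means "more speech-like" are all assumptions made for illustration. Each feature casts a speech/nonspeech vote for a frame, and the votes are summed using per-feature weights that stand in for discriminative power.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FeatureRule:
    """One voter: a frame votes 'speech' when its score exceeds the threshold."""
    threshold: float
    weight: float  # hypothetical stand-in for the feature's discriminative power


def weighted_vote(scores: Dict[str, float],
                  rules: Dict[str, FeatureRule],
                  decision_ratio: float = 0.5) -> bool:
    """Return True (speech) when the summed weight of the features voting
    'speech' exceeds decision_ratio times the total weight."""
    total = sum(rule.weight for rule in rules.values())
    speech = sum(rule.weight for name, rule in rules.items()
                 if scores[name] > rule.threshold)
    return speech > decision_ratio * total


# Illustrative values only; none of these numbers come from the paper.
# Scores are assumed to be scaled so that higher means more speech-like.
RULES = {
    "spvd": FeatureRule(threshold=0.30, weight=3.0),
    "energy": FeatureRule(threshold=0.10, weight=1.5),
    "feature_c": FeatureRule(threshold=0.50, weight=1.0),  # hypothetical third feature
}

# Equal weights reduce the scheme to plain majority voting.
UNWEIGHTED = {name: FeatureRule(rule.threshold, 1.0) for name, rule in RULES.items()}

frame_a = {"spvd": 0.45, "energy": 0.05, "feature_c": 0.20}  # only SPVD votes speech
frame_b = {"spvd": 0.10, "energy": 0.25, "feature_c": 0.20}  # only energy votes speech

print(weighted_vote(frame_a, RULES))       # weighted decision for frame_a
print(weighted_vote(frame_b, RULES))       # weighted decision for frame_b
print(weighted_vote(frame_a, UNWEIGHTED))  # majority-vote decision for frame_a
```

In this sketch a single heavily weighted feature can carry a frame that a plain majority vote would reject, which is the kind of difference the weighted and unweighted schemes in the abstract would exhibit. How the paper actually derives its weights is not stated in the abstract.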
