Vector Quantizer Based Speaker Normalization for Continuous Speech Recognition

Vector Quantizer Based Speaker Normalization for Continuous Speech Recognizers

  • 신옥근 (Division of IT Engineering, Korea Maritime University)
  • Published: 2004.11.01

Abstract

This paper proposes a vector quantizer based speaker normalization method for continuous speech recognition (CSR) systems that uses no acoustic information such as formants. The method, an improvement on a previously reported speaker normalization scheme for a simple digit recognizer, builds a canonical codebook by training it iteratively, starting from a relatively small size and enlarging the codebook after each iteration. Once the codebook is established, each speaker's warp factor is estimated by exhaustively comparing warped versions of the speaker's utterances with the codebook. Two sets of phones are used to estimate the warp factors: one consisting of vowels only, and the other of all the phonemes. A piecewise linear warping function corresponding to the estimated warp factor is applied to the power spectrum of each utterance, and feature vectors extracted from the warped spectrum are used to train and test the speech recognizer. The effectiveness of the method is investigated in a set of recognition experiments using the TIMIT corpus and the HTK speech recognition toolkit. The results show an improvement in recognition rate comparable to that of the formant based warping method.
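The abstract does not give the exact form of the piecewise linear warping function, so the following is only a minimal sketch of one common two-segment variant. The breakpoint ratio (`break_ratio`), the function names, and the convention that the warped spectrum at frequency f takes the value of the original spectrum at the warped frequency are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_max, break_ratio=0.8):
    """Map frequencies (Hz) through a two-segment piecewise linear warp.

    Below the breakpoint f0 the axis is scaled by alpha. Above it, a second
    linear segment joins (f0, alpha * f0) to (f_max, f_max) so that the upper
    band edge stays fixed. break_ratio is an illustrative assumption.
    """
    freqs = np.asarray(freqs, dtype=float)
    f0 = break_ratio * f_max
    if alpha <= 0 or alpha * f0 >= f_max:
        raise ValueError("alpha is out of range for this breakpoint")
    upper_slope = (f_max - alpha * f0) / (f_max - f0)
    return np.where(freqs <= f0,
                    alpha * freqs,
                    alpha * f0 + upper_slope * (freqs - f0))

def warp_power_spectrum(power, alpha, f_max, break_ratio=0.8):
    """Resample a power spectrum along the warped frequency axis.

    Bin k of the output takes the (linearly interpolated) value of the input
    spectrum at the warped frequency of bin k. This direction is an assumed
    convention; the opposite one is obtained by using 1 / alpha.
    """
    power = np.asarray(power, dtype=float)
    bin_freqs = np.linspace(0.0, f_max, power.size)
    source_freqs = piecewise_linear_warp(bin_freqs, alpha, f_max, break_ratio)
    return np.interp(source_freqs, bin_freqs, power)
```

With `alpha = 1.0` both segments have unit slope, so the warp reduces to the identity and the spectrum is returned unchanged.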

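We propose a vector quantizer based speaker normalization method for continuous speech recognition (CSR) that does not use acoustic information such as formants. The method improves on a previously proposed speaker normalization method for a simple digit recognizer: a normalized codebook is obtained by repeatedly training the vector quantizer while increasing the codebook size, and this codebook is then used to estimate the warp factors of the test speakers. Two phone sets were used in the experiments for codebook generation and warp factor estimation, one consisting of vowels only and one containing all phonemes, both consonants and vowels. The feature vectors used to train and test the recognizer were warped with a piecewise linear warping function corresponding to the estimated warp factor. Phone recognition experiments with the TIMIT corpus and the HTK toolkit showed that the proposed method performs comparably to the formant based warping method.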
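The exhaustive warp factor search described above can be sketched as a grid search that keeps the warp factor whose warped features fit the codebook best. The abstract does not state the distortion measure or the search grid, so the mean squared Euclidean distance to the nearest codeword and the caller-supplied grid of candidates are assumptions. `extract_features` is a hypothetical callback that returns the feature vectors of a speaker's utterances warped with a given factor.

```python
import numpy as np

def mean_quantization_distortion(features, codebook):
    """Average squared Euclidean distance from each frame to its nearest codeword."""
    features = np.asarray(features, dtype=float)   # shape (T, D)
    codebook = np.asarray(codebook, dtype=float)   # shape (K, D)
    diffs = features[:, None, :] - codebook[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)          # shape (T, K)
    return float(np.mean(np.min(sq_dists, axis=1)))

def estimate_warp_factor(extract_features, codebook, alphas):
    """Try every candidate warp factor and return the best one.

    extract_features(alpha) is a hypothetical, caller-supplied function that
    returns a (T, D) array of features computed from the speaker's speech
    warped by alpha. Returns the chosen factor and all distortions.
    """
    distortions = [mean_quantization_distortion(extract_features(a), codebook)
                   for a in alphas]
    best = int(np.argmin(distortions))
    return float(alphas[best]), distortions
```

A caller might pass a grid such as `np.linspace(0.88, 1.12, 13)`; this range is a commonly used choice in frequency warping work and is not taken from the paper.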

References

  1. P. Zhan and A. Waibel, 'Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition', Language Technologies Institute Technical Report CMU-LTI-97-150, Carnegie Mellon University, May 1997
  2. L. Lee and R. C. Rose, 'A Frequency Warping Approach to Speaker Normalization', IEEE Trans. on Speech and Audio Processing, 6(1), 49-60, Jan. 1998 https://doi.org/10.1109/89.650310
  3. 신옥근, 'Quantization Based Speaker Normalization for DHMM Speech Recognition System', The Journal of the Acoustical Society of Korea, 22(4), 299-307, 2003
  4. S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev and P. Woodland, The HTK Book, ver. 3, Microsoft Corp., 2000
  5. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett and N. L. Dahlgren, DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus, CD-ROM, NIST, 1993
  6. S. Umesh, L. Cohen and D. Nelson, 'Frequency Warping and the Mel Scale', IEEE Signal Processing Letters, 9(3), 104-107, March 2002 https://doi.org/10.1109/97.995829
  7. S. Molau, S. Kanthak and H. Ney, 'Efficient Vocal Tract Normalization in Automatic Speech Recognition', Proc. ESSV, 209-216, Sept. 2000
  8. E. Eide and H. Gish, 'A Parametric Approach to Vocal Tract Length Normalization', Proc. ICASSP'96, 346-349, 1996
  9. J. Högberg, 'Prediction of formant frequencies from linear combinations of filterbank and cepstral coefficients', Speech, Music and Hearing Quarterly Progress and Status Report, 33, 41-49, Institutionen för tal, musik och hörsel, 1997
  10. Y. Linde, A. Buzo and R. M. Gray, 'An algorithm for vector quantizer design', IEEE Transactions on Communications, 28(1), 84-95, 1980 https://doi.org/10.1109/TCOM.1980.1094577
  11. M. A. Bacchiani, Speech Recognition System Design Based On Automatically Derived Units, Ph.D. Thesis, Boston University, 1999