Development of a Korean Speech Recognition Platform (ECHOS)


  • Published: 2005.11.01

Abstract

We introduce a Korean speech recognition platform (ECHOS) developed for education and research purposes. ECHOS lowers the entry barrier to speech recognition research and can be used as a reference engine by providing elementary speech recognition modules. It has a simple object-oriented architecture, implemented in the C++ language with the standard template library (STL). The input of ECHOS is digital speech data sampled at 8 or 16 kHz. Its output is the 1-best recognition result, N-best recognition results, and a word graph. The recognition engine is composed of MFCC/PLP feature extraction, HMM-based acoustic modeling, n-gram language modeling, and finite state network (FSN)- and lexical tree-based search algorithms. It can handle various tasks from isolated word recognition to large vocabulary continuous speech recognition. We compare the performance of ECHOS and the hidden Markov model toolkit (HTK) for validation. In an FSN-based task, ECHOS shows word accuracy similar to HTK, while the recognition time is doubled because of the object-oriented implementation. For an 8000-word continuous speech recognition task, using a lexical tree search algorithm different from the one used in HTK, ECHOS increases the word error rate by 40% relative but reduces the recognition time to half.



References

  1. HTK Home page. http://htk.eng.cam.ac.uk
  2. CMU Sphinx: Open Source Speech Recognition. http://www.speech.cs.cmu.edu/sphinx/Sphinx.html
  3. Automatic Speech Recognition: Software. http://www.isip.msstate.edu/projects/speech/software/
  4. Multipurpose Large Vocabulary Continuous Speech Recognition Engine Julius. http://www.ar.media.kyoto-u.ac.jp/members/ian/doc
  5. ezCSR. http://speech.chungbuk.ac.kr/~owkwon/srhome/index.html
  6. O.-W. Kwon, H.-R. Kim, C. D. Yoo, B.-W. Kim, and Y.-J. Lee, 'Design of a Korean speech recognition platform,' Malsori (말소리), 51, 2004
  7. Standard Template Library Programmer's Guide. http://www.sgi.com/tech/stl/
  8. R. Miller, Practical UML: A Hands-On Introduction for Developers
  9. L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, (Prentice-Hall, 1993)
  10. F. Jelinek, Statistical Methods for Speech Recognition (Language, Speech, and Communication), (MIT Press, 1999)
  11. S.B. Davis and P. Mermelstein, 'Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,' IEEE Trans. ASSP, 28, 357-366, Aug. 1980 https://doi.org/10.1109/TASSP.1980.1163420
  12. H. Hermansky, 'Perceptual linear predictive (PLP) analysis of speech,' Journal of the Acoustical Society of America, 87 (4), 1738-1752, 1990 https://doi.org/10.1121/1.399423
  13. Aurora, Distributed Speech Recognition. http://portal.etsi.org/stq/kta/DSR/dsr.asp
  14. X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, 648-650, (Prentice Hall, 2001)
  15. M.K. Ravishankar, Efficient Algorithms for Speech Recognition, (PhD Thesis, CMU, 1996)