Speech synthesis using acoustic Doppler signal

Speech synthesis using ultrasonic Doppler signals

  • Lee, Ki-Seung (Department of Electronic Engineering, Konkuk University)
  • Received : 2015.10.08
  • Accepted : 2015.12.10
  • Published : 2016.03.31

Abstract

In this paper, a method for synthesizing speech from 40 kHz ultrasonic signals reflected from the articulatory muscles is introduced and its performance is evaluated. When ultrasound is radiated toward the articulating face, Doppler effects caused by movements of the lips, jaw, and chin are observed: the received signals contain components whose frequencies differ from that of the transmitted signal. These acoustic Doppler signals (ADS) were used to estimate the speech parameters in this study. Prior to speech synthesis, a quantitative correlation analysis between the ADS and the speech signals was carried out for each frequency bin, and the results supported the feasibility of ADS-based speech synthesis. ADS-to-speech transformation was achieved with conversion rules based on a joint Gaussian mixture model. Experimental results from 5 subjects showed that filter bank energies and LPC (Linear Predictive Coefficient) cepstrum coefficients are the optimal features for the ADS and the speech, respectively. In a subjective evaluation in which the synthesized speech was generated using excitation sources extracted from the original speech signals, the ADS-to-speech conversion method yielded an average recognition rate of 72.2 %.

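This paper introduces a method for synthesizing speech by radiating a 40 kHz ultrasonic signal toward the area around the mouth and using the reflected ultrasonic signal, and evaluates its performance. When ultrasound is radiated toward the area around the mouth during speech, displacements caused by movements of the lips, jaw, and cheeks produce a Doppler effect, so that Doppler frequencies different from the original frequency component are observed in the reflected signal. In this paper, these Doppler frequencies are used to estimate the speech parameters. Prior to speech synthesis, the correlation between the ultrasonic Doppler signal and the speech signal was analyzed for each frequency, and from this the feasibility of synthesizing speech from ultrasonic Doppler signals was examined. The conversion used feature variables that reflect both the static and dynamic characteristics of the ultrasonic Doppler signal, which were converted into speech parameters using a joint Gaussian mixture technique. In speech synthesis experiments with 5 subjects, the best conversion performance was obtained when filter bank energies were used as the features of the ultrasonic signal and LPC (Linear Predictive Coefficient) cepstrum coefficients as the speech features. When synthesized speech was generated using excitation signals extracted from the speech signals and then listened to, an average recognition rate of 72.2 % was confirmed.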
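
Two ideas in the abstract can be illustrated with a minimal sketch: the Doppler shift of a 40 kHz tone reflected from a moving surface, and feature conversion with a joint Gaussian mixture model. This is not the paper's implementation. The function names, the speed of sound (343 m/s), the example velocity, and the reduction to scalar features are all assumptions made for illustration. The paper's features are multi-dimensional, and its trained model parameters are not given here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed value)


def doppler_shift(carrier_hz, velocity_mps, c=SPEED_OF_SOUND):
    """Approximate shift of a tone reflected from a surface moving toward
    the sensor at velocity_mps; valid when velocity_mps is much smaller than c."""
    return 2.0 * velocity_mps * carrier_hz / c


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_convert(x, components):
    """Estimate E[y | x] under a joint GMM over scalar (x, y).

    Each component is a dict with hypothetical keys:
    w (weight), mean_x, mean_y, var_x, cov_xy.
    """
    # Weighted likelihood of the source feature x under each component
    likelihoods = [
        comp["w"] * gaussian_pdf(x, comp["mean_x"], comp["var_x"])
        for comp in components
    ]
    total = sum(likelihoods)
    estimate = 0.0
    for lik, comp in zip(likelihoods, components):
        posterior = lik / total  # p(component | x)
        # Conditional mean of y given x within this component
        cond_mean = comp["mean_y"] + comp["cov_xy"] / comp["var_x"] * (x - comp["mean_x"])
        estimate += posterior * cond_mean
    return estimate


if __name__ == "__main__":
    # A surface moving at 0.1 m/s (illustrative value) shifts a 40 kHz
    # carrier by about 23.3 Hz.
    print(doppler_shift(40000.0, 0.1))

    # Toy two-component model with made-up parameters
    toy = [
        {"w": 0.5, "mean_x": -1.0, "mean_y": -1.0, "var_x": 1.0, "cov_xy": 0.5},
        {"w": 0.5, "mean_x": 1.0, "mean_y": 1.0, "var_x": 1.0, "cov_xy": 0.5},
    ]
    print(gmm_convert(0.3, toy))
```

In the multi-dimensional case, the ratio `cov_xy / var_x` becomes the matrix product of the cross-covariance and the inverse source covariance, and the scalar density becomes a multivariate Gaussian. The structure of the estimate, a posterior-weighted sum of per-component conditional means, stays the same.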
