
Singing Voice Synthesis Using HMM Based TTS and MusicXML


  • Received : 2015.04.09
  • Accepted : 2015.05.15
  • Published : 2015.05.30

Abstract

Singing voice synthesis is the generation of a song by computer from its lyrics and musical notes. Hidden Markov models (HMMs) have proved to be the models of choice for text-to-speech synthesis, and they have also been applied to singing voice synthesis; however, training HMMs for singing requires a large singing voice database. Moreover, commercially available singing voice synthesis systems use piano-roll music notation, whereas adopting the easier-to-read standard music notation would make them better suited to singing-learning applications. To overcome these problems, we train context-dependent HMMs on a speech database and use them for singing voice synthesis. Pitch and duration control methods modify the parameters of the speech-trained HMMs so that they can serve as synthesis units for the singing voice. This work describes a singing voice synthesis system that uses a MusicXML-based music score editor as the front-end interface for entering the notes and lyrics to be synthesized, and an HMM-based text-to-speech synthesis system as the back-end synthesizer. A perceptual test demonstrates the feasibility of the proposed system.
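To make the pitch and duration control concrete, the Python sketch below (a minimal illustration under standard music-theory assumptions, not the paper's implementation; all names are hypothetical) derives per-note targets from a score: a note's MIDI number maps to a fundamental-frequency (F0) target via equal temperament, and its length in quarter notes maps to a duration target given the tempo. A system of this kind would then shift the speech-trained HMMs' pitch parameters toward the F0 target and scale their state durations toward the duration target.

    # Minimal sketch of deriving per-note synthesis targets from a score;
    # function and variable names are hypothetical, not taken from the paper.

    def midi_to_f0(midi_note: int) -> float:
        # Equal-temperament mapping: A4 = MIDI 69 = 440 Hz.
        return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

    def note_duration_sec(quarter_notes: float, tempo_bpm: float) -> float:
        # Note length in quarter notes and tempo in beats per minute
        # give the target duration in seconds.
        return quarter_notes * 60.0 / tempo_bpm

    # Example: a C5 half note at 120 BPM.
    f0 = midi_to_f0(72)                # ~523.25 Hz pitch target
    dur = note_duration_sec(2.0, 120)  # 1.0 s duration target
    print(f"F0 target = {f0:.2f} Hz, duration target = {dur:.2f} s")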


