References
- Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166. https://doi.org/10.1109/72.279181
- Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127. https://doi.org/10.1561/2200000006
- Chung, J., Gulcehre, C., Cho, K. H., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Retrieved from https://arxiv.org/abs/1412.3555
- CSTR [The Center for Speech Technology Research]. (2014). Festival: The Festival speech synthesis system (version 2.4) [Computer program]. Retrieved from http://www.cstr.ed.ac.uk/projects/festival/
- CSTR [The Center for Speech Technology Research]. (2018a). Ossian: A Python-based tool for automatically building speech synthesis front ends [Computer program]. Retrieved from https://github.com/CSTR-Edinburgh/Ossian/
- CSTR [The Center for Speech Technology Research]. (2018b). The Merlin toolkit [Computer program]. Retrieved from https://github.com/CSTR-Edinburgh/merlin/tree/master/egs/build_your_own_voice/
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the International Conference on Acoustics, Speech, Signal Processing (pp. 373-376).
- Imai, S., & Kobayashi, T. (2017). SPTK: Speech signal processing toolkit (version 3.11) [Computer program]. Retrieved from http://sp-tk.sourceforge.net/
- Kawahara, H., Masuda-Katsuse, I., & de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3-4), 187-207. https://doi.org/10.1016/S0167-6393(98)00085-5
- Kubichek, R. (1993). Mel-cepstral distance measure for objective speech quality assessment. Proceedings of the IEEE Pacific Rim Conference on Communications Computers and Signal Processing (pp. 125-128). Victoria, BC, Canada.
- Ling, Z. H., Kang, S. Y., Zen, H., Senior, A., Schuster, M., Qian, X. J., Meng, H. M., & Deng, L. (2015). Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 32(3), 35-52. https://doi.org/10.1109/MSP.2014.2359987
- Luo, Z., Takiguchi, T., & Ariki, Y. (2016). Emotional voice conversion using deep neural networks with MCC and F0 features. Proceedings of the IEEE/ACIS 15th International Conference on Computer and Information Science (pp. 1-5). Okayama, Japan.
- Merritt, T., Latorre, J., & King, S. (2015). Attributing modelling errors in HMM synthesis by stepping gradually from natural to modelled speech. Proceedings of the International Conference on Acoustics, Speech, Signal Processing (pp. 4220-4224). Brisbane, Australia.
- Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99.D(7), 1877-1884. https://doi.org/10.1587/transinf.2015EDP7457
- Najafabadi, M., Villanustre, F., Khoshgoftaar, T., Seliya, N., Wald, R., & Muharemagic, E. (2015). Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1), 1-21.
- Nitech [Nagoya Institute of Technology]. (2015). HTS: HMM/DNN-based speech synthesis system (version 2.3) [Computer program]. Retrieved from http://hts.sp.nitech.ac.jp/
- Riedi, M. (1995). A neural-network-based model of segmental duration for speech synthesis. Proceedings of Eurospeech 1995 (pp. 599-602).
- Schuster, M., & Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681. https://doi.org/10.1109/78.650093
- Tokuda, K., Kobayashi, T., & Imai, S. (1995). Speech parameter generation from HMM using dynamic features. Proceedings of the 1995 International Conference on Acoustics, Speech, Signal Processing (pp. 660-663). Detroit, MI.
- Weijters, T., & Thole, J. (1993). Speech synthesis with artificial neural networks. Proceedings of the International Conference on Neural Networks (pp. 1764-1769). San Diego, CA.
- Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin, & D. E. Rumelhart (Eds.), Back-propagation: Theory, architectures and applications (pp. 433-486). Hillsdale, NJ: Lawrence Erlbaum Associates.
- Wu, Z., Watts, O., & King, S. (2016). Merlin: An open source neural network speech synthesis system. Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 202-207).
- Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proceedings of Eurospeech 1999 (pp. 2347-2350).
- Yu, K., Zen, H., Mairesse, F., & Young, S. (2011). Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis. Speech Communication, 53(6), 914-923. https://doi.org/10.1016/j.specom.2011.03.003
- Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039-1064. https://doi.org/10.1016/j.specom.2009.04.004
- Zen, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. Proceedings of the 2013 IEEE International Conference on Acoustics, Speech, Signal Processing (pp. 7962-7966). Vancouver, BC.