Design of Speech Enhancement U-Net for Embedded Computing


  • Received : 2020.07.28
  • Accepted : 2020.08.31
  • Published : 2020.10.31

Abstract

In this paper, we propose wav-U-Net to improve speech enhancement in heavily noisy environments, built on three principal techniques. First, as input data we use 128 modified Mel-scale filter banks, which reduce the computational burden compared with 512 frequency bins. The Mel scale mimics the non-linear frequency perception of the human ear, being more discriminative at lower frequencies and less discriminative at higher frequencies; because our proposed network focuses on speech signals, this makes it a suitable feature when both performance and computing power are considered. Second, we add a simple ResNet as pre-processing, which helps the proposed network produce clean estimated speech signals and suppress high-frequency noise. Finally, the proposed U-Net model shows significant performance regardless of the type of noise. In particular, despite using a single channel, we confirmed that it copes well with non-stationary noises whose frequency properties change dynamically, and that it can estimate speech signals from noisy speech even in extremely noisy environments where the noise is much louder than the speech (below 0 dB SNR). The performance of our proposed wav-U-Net improved by about 200% in SDR and 460% in NSDR compared with the conventional wav-U-Net of Jansson et al. We also confirmed that our wav-U-Net with 128 modified Mel-scale filter banks processes input about 2.7 times faster than the common wav-U-Net that takes 512 frequency bins as input.
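The front end and the evaluation metrics mentioned above can be made concrete with a short sketch. The Python snippet below illustrates (a) mapping a noisy waveform onto 128 Mel-scale filter banks instead of keeping all linear frequency bins, and (b) SDR and NSDR in the single-source projection form of Vincent et al. [15]. The sampling rate, FFT size, and hop length are assumptions not stated in the abstract, and standard librosa Mel filters stand in for the paper's "modified" filter banks; this is an illustrative sketch, not the authors' implementation.

    # Sketch of a 128-band Mel front end and SDR/NSDR evaluation.
    # SR, N_FFT, and HOP are assumed values; the paper's "modified"
    # Mel filter banks are approximated by standard librosa filters.
    import numpy as np
    import librosa

    SR = 16000     # assumed sampling rate
    N_FFT = 1024   # 1024-point FFT -> 513 linear bins (cf. the 512 bins in the abstract)
    HOP = 256      # assumed hop length
    N_MELS = 128   # 128 Mel filter banks, as in the abstract

    def mel_features(wav: np.ndarray) -> np.ndarray:
        """Map a noisy waveform to a 128-band Mel magnitude spectrogram."""
        spec = np.abs(librosa.stft(wav, n_fft=N_FFT, hop_length=HOP))    # (513, T)
        mel_fb = librosa.filters.mel(sr=SR, n_fft=N_FFT, n_mels=N_MELS)  # (128, 513)
        return mel_fb @ spec                                             # (128, T)

    def sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
        """Signal-to-distortion ratio in dB (projection of the estimate onto the reference)."""
        s_target = (np.dot(estimate, reference) / np.dot(reference, reference)) * reference
        distortion = estimate - s_target
        return 10.0 * np.log10(np.sum(s_target ** 2) / np.sum(distortion ** 2))

    def nsdr(estimate: np.ndarray, reference: np.ndarray, mixture: np.ndarray) -> float:
        """Normalized SDR: improvement of the estimate over the unprocessed noisy mixture."""
        return sdr(estimate, reference) - sdr(mixture, reference)

The Mel mapping shrinks the input from roughly 512 frequency rows to 128, which is where the reduction in computation and the reported speed-up of the network's front end would come from under these assumptions.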

References

  1. J.-M. Valin, J. Rouat, F. Michaud, "Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter," In IROS 2004, Sendai, Japan, pp. 2123-2128, 2004.
  2. R. Takeda, S. Yamamoto, K. Komatani, T. Ogata, H. G. Okuno, "Missing-Feature based Speech Recognition for Two Simultaneous Speech Signals Separated by ICA with a Pair of Humanoid Ears," In IROS 2006, Beijing, China, pp. 878-885, 2006.
  3. Y. Ephraim, D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, pp. 1109-1121, 1984. https://doi.org/10.1109/TASSP.1984.1164453
  4. H.-D. Kim, S.-S. Ahn, K. Kim, J. Choi, "Single Channel Particular Voice Activity Detection for Monitoring the Violence Situations," In 2013 IEEE RO-MAN, pp. 412-417, 2013.
  5. A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, T. Weyde, "Singing Voice Separation with Deep U-Net Convolutional Networks," In ISMIR 2017, Suzhou, China, pp. 23-27, 2017.
  6. D. Stoller, S. Ewert, S. Dixon, "Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation," In ICASSP 2018, Calgary, Canada, pp. 2391-2395, 2018.
  7. D. O'Shaughnessy, "Speech Communication: Human and Machine," Addison-Wesley, New York, p. 150, 1987.
  8. O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, J. Matas, "DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks," In IEEE/CVF CVPR 2018, Salt Lake City, UT, USA, pp. 8183-8192, 2018.
  9. O. Ronneberger, P. Fischer, T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," In MICCAI 2015, Springer, Vol. 9351, pp. 234-241, 2015.
  10. W. Wang, K. Yu, J. Hugonot, P. Fua, M. Salzmann, "Recurrent U-Net for Resource-Constrained Segmentation," In ICCV 2019, Seoul, South Korea, pp. 2142-2151, 2019.
  11. Z. Rafii, A. Liutkus, F.-R. Stöter, S.-I. Mimilakis, R. Bittner, "The MUSDB18 Corpus for Music Separation," 2017.
  12. A. Liutkus, D. Fitzgerald, Z. Rafii, "Scalable Audio Separation with Light Kernel Additive Modelling," In ICASSP 2015, Brisbane, Australia, pp. 76-80, 2015.
  13. C.-L. Hsu, J. R. Jang, "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 2, pp. 310-319, 2010. https://doi.org/10.1109/TASL.2009.2026503
  14. C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, J. Gehrke, "The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework," 2020.
  15. E. Vincent, R. Gribonval, C. Févotte, "Performance Measurement in Blind Audio Source Separation," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 4, pp. 1462-1469, 2006. https://doi.org/10.1109/TSA.2005.858005
  16. E. Vincent, S. Araki, P. Bofill, "The 2008 Signal Separation Evaluation Campaign: A Community-Based Approach to Large-Scale Evaluation," In ICA 2009, Paraty, Brazil, pp. 734-741, 2009.