
Performance Improvement of Mean-Teacher Models in Audio Event Detection Using Derivative Features

  • Jin-Yeol Kwak (Dept. of Electrical, Electronic and Convergence System Engineering, Keimyung University) ;
  • Yong-Joo Chung (Dept. of Electronic Engineering, Keimyung University)
  • Received : 2021.03.07
  • Accepted : 2021.06.17
  • Published : 2021.06.30

Abstract

Recently, mean-teacher models based on convolutional recurrent neural networks (CRNNs) have become widely used in audio event detection. The mean-teacher model is an architecture consisting of two parallel CRNNs that can be trained effectively on weakly-labeled and unlabeled audio data by applying a consistency criterion to the outputs of the two networks. In this study, we attempted to improve the performance of the mean-teacher model by using additional derivative features of the log-mel spectrum. In audio event detection experiments using the training and test data from Task 4 of the DCASE 2018/2019 Challenges, the mean-teacher model with the proposed derivative features achieved up to an 8.1% relative reduction in the error rate (ER) compared with the baseline.
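The derivative features mentioned above can be illustrated with a minimal sketch. The code below assumes a standard regression-based delta computation over time frames, as commonly used in speech front-ends; the window width and feature shapes are illustrative, not taken from the paper.

```python
import numpy as np

def delta(feat: np.ndarray, n: int = 2) -> np.ndarray:
    """Regression-based derivative of features along the time axis.

    feat: array of shape (num_frames, num_bands), e.g. a log-mel spectrogram.
    n: half-width of the regression window.
    """
    # Normalizing denominator of the standard delta regression formula.
    denom = 2 * sum(i * i for i in range(1, n + 1))
    # Pad the time axis by repeating edge frames so every frame has a full window.
    padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
    out = np.zeros_like(feat, dtype=float)
    for t in range(feat.shape[0]):
        for i in range(1, n + 1):
            out[t] += i * (padded[t + n + i] - padded[t + n - i])
    return out / denom

# Example: stack a log-mel spectrogram with its first- and second-order
# derivatives as the network input (shapes here are hypothetical).
logmel = np.random.randn(500, 64)                     # (frames, mel bands)
d1 = delta(logmel)                                    # first-order derivative
d2 = delta(d1)                                        # second-order derivative
features = np.concatenate([logmel, d1, d2], axis=-1)  # (500, 192)
```

In this formulation the derivative at each frame is a weighted difference of neighboring frames, so it captures the local temporal slope of each mel band, which is the kind of dynamic information the static log-mel spectrum alone does not expose.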

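The two core ingredients of mean-teacher training described in the abstract can also be sketched numerically: the teacher's weights track the student's through an exponential moving average (EMA), and a consistency loss penalizes disagreement between the two networks' outputs. The parameter names and values below are hypothetical stand-ins for the CRNN weights, not the paper's actual model.

```python
import numpy as np

def ema_update(teacher: dict, student: dict, alpha: float = 0.999) -> dict:
    """Update teacher weights as an exponential moving average of the student's."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}

def consistency_loss(student_out: np.ndarray, teacher_out: np.ndarray) -> float:
    """Mean-squared error between student and teacher predictions,
    usable on unlabeled data since no ground-truth label is needed."""
    return float(np.mean((student_out - teacher_out) ** 2))

# Toy weight dictionaries standing in for the two parallel CRNNs.
student = {"w": np.array([0.5, -1.0]), "b": np.array([0.1])}
teacher = {k: v.copy() for k, v in student.items()}

# One training step: the student is updated by gradient descent on the
# classification + consistency losses (gradient step shown is hypothetical),
# then the teacher tracks the student through the EMA.
student["w"] = student["w"] - 0.01 * np.array([0.2, -0.3])
teacher = ema_update(teacher, student)
```

Because only the consistency term requires the two outputs to agree, weakly-labeled and unlabeled clips can still contribute to training, which is what makes this semi-supervised setup attractive for the DCASE Task 4 data.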