Convolutional Neural Network based Audio Event Classification

  • Lim, Minkyu (Dept. of Computer Science and Engineering, Sogang University)
  • Lee, Donghyun (Dept. of Computer Science and Engineering, Sogang University)
  • Park, Hosung (Dept. of Computer Science and Engineering, Sogang University)
  • Kang, Yoseb (Dept. of Computer Science and Engineering, Sogang University)
  • Oh, Junseok (Dept. of Computer Science and Engineering, Sogang University)
  • Park, Jeong-Sik (Dept. of English Linguistics & Language Technology, Hankuk University of Foreign Studies)
  • Jang, Gil-Jin (School of Electronics Engineering, Kyungpook National University)
  • Kim, Ji-Hwan (Dept. of Computer Science and Engineering, Sogang University)
  • Received: 2017.05.29
  • Accepted: 2018.02.13
  • Published: 2018.06.30

Abstract

This paper proposes an audio event classification method based on convolutional neural networks (CNNs). CNNs are well suited to distinguishing complex shapes in images, so the proposed system presents audio features to the CNN as an input image. Mel-scale filter bank features are extracted from each frame and concatenated over 40 consecutive frames, and the resulting block of frames is treated as one input image. The output layer of the CNN generates a probability for each audio event (e.g., dog bark, siren, forest). The event probabilities for all images in an audio segment are accumulated, and the event with the highest accumulated probability is taken as the classification result. The proposed method classified thirty audio events with an accuracy of 81.5% on the UrbanSound8K, BBC Sound FX, DCASE2016, and FREESOUND datasets.
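The framing and decision steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the CNN itself is omitted and replaced by a hand-written probability table, and the function names, the hop between images, the number of mel bands, and the three-class setup are all assumptions made for illustration. Only the 40-frame context and the accumulate-then-pick-the-maximum rule come from the abstract.

```python
import numpy as np

CONTEXT_FRAMES = 40  # from the abstract: 40 consecutive frames form one image


def frames_to_images(features, context=CONTEXT_FRAMES, hop=CONTEXT_FRAMES):
    """Slice a (num_frames, num_bands) feature matrix into (context, num_bands) images.

    `hop` (the step between successive images) is an assumption; the abstract
    does not say whether the images overlap.
    """
    num_frames, num_bands = features.shape
    starts = range(0, num_frames - context + 1, hop)
    images = [features[s:s + context] for s in starts]
    if not images:
        # Segment shorter than one context window: no images can be formed.
        return np.empty((0, context, num_bands))
    return np.stack(images)


def classify_segment(image_probs):
    """Sum per-image class probabilities and return the class with the highest total."""
    accumulated = image_probs.sum(axis=0)
    return int(np.argmax(accumulated)), accumulated


# Hypothetical stand-in for mel filter bank features: 200 frames, 40 bands (band count assumed).
rng = np.random.default_rng(0)
features = rng.random((200, 40))
images = frames_to_images(features)  # five non-overlapping 40 x 40 images

# Hypothetical CNN outputs: one row of class probabilities per image (3 classes for brevity).
image_probs = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
])
label, totals = classify_segment(image_probs)
print(images.shape, label, totals)
```

In this toy table the first and fourth images favor class 0, but class 1 has the largest accumulated probability over the whole segment and is selected. Summing over all images in a segment lets a few misclassified images be outvoted by the rest.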
