Study on data augmentation methods for deep neural network-based audio tagging


  • 김범준 (Department of Computer Science, Yonsei University) ;
  • 문현기 (Department of Electrical and Electronic Engineering, Yonsei University) ;
  • 박성욱 (Department of Electronic Engineering, Gangneung-Wonju National University) ;
  • 박영철 (Division of Computer and Telecommunications Engineering, Yonsei University)
  • Received : 2018.09.14
  • Accepted : 2018.11.21
  • Published : 2018.11.30

Abstract

In this paper, we present a study on data augmentation methods for DNN (Deep Neural Network)-based audio tagging. In this system, an audio signal is converted into a mel-spectrogram and used as the input to the DNN for audio tagging. To cope with the limited amount of training data, we augment the training samples using time stretching, pitch shifting, dynamic range compression, and block mixing. Through audio tagging simulations, we derive optimal parameters and combinations for these augmentation methods.
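The four augmentations named in the abstract are waveform-level transforms. Time stretching and pitch shifting are typically delegated to library routines such as `librosa.effects.time_stretch` and `librosa.effects.pitch_shift`, while dynamic range compression and block mixing are simple enough to sketch directly. The following is a minimal NumPy illustration of the latter two; the threshold, ratio, and mixing weight are illustrative values, not the parameters reported in Table 2 of the paper.

```python
import numpy as np

def drc(x, threshold_db=-20.0, ratio=4.0):
    """Hard-knee dynamic range compression applied sample-wise.
    Levels above threshold_db are attenuated by the given ratio;
    levels below it pass through unchanged."""
    eps = 1e-10  # avoid log of zero on silent samples
    level_db = 20.0 * np.log10(np.abs(x) + eps)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)
    return x * 10.0 ** (gain_db / 20.0)

def block_mix(x, y, alpha=0.5):
    """Mix two clips by a weighted sum over their common length."""
    n = min(len(x), len(y))
    return alpha * x[:n] + (1.0 - alpha) * y[:n]
```

In a tagging setting, a clip produced by `block_mix` would typically be assigned the union of the source clips' labels, so the mix yields both a new input and a new multi-label target.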


Fig. 1. Block diagram of the DNN structure.


Fig. 2. Example of a DRC curve.


Fig. 3. Block diagram of the overall structure.


Fig. 4. Performance according to parameters of time stretching and pitch shifting. (a) Time stretching, (b) Pitch shifting.


Fig. 5. Performance according to DRC curve and block mixing method. (a) Dynamic range compression, (b) Block mixing.

Table 1. Distribution of weakly labeled data for each class.


Table 2. Parameters of block mixing and dynamic range compression.


Table 3. Performance per data augmentation method and its parameters.


Acknowledgement

Supported by: Institute for Information & Communications Technology Promotion (IITP)
