Study on data augmentation methods for deep neural network-based audio tagging


  • 김범준 (Department of Computer Science, Yonsei University) ;
  • 문현기 (School of Electrical and Electronic Engineering, Yonsei University) ;
  • 박성욱 (Department of Electronic Engineering, Gangneung-Wonju National University) ;
  • 박영철 (Computer and Telecommunications Engineering Division, Yonsei University)
  • Received : 2018.09.14
  • Accepted : 2018.11.21
  • Published : 2018.11.30


In this paper, we present a study on data augmentation methods for DNN (Deep Neural Network)-based audio tagging. In this system, an audio signal is converted into a mel-spectrogram and used as the input to the DNN for audio tagging. To mitigate the problems caused by a small amount of training data, we augment the training samples using time stretching, pitch shifting, dynamic range compression, and block mixing. Through audio tagging simulations, we determine the optimal parameters and combinations of these augmentation methods.
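The four augmentation methods named in the abstract can be sketched as below. This is an illustrative NumPy sketch, not the authors' implementation: the parameter values (compression threshold and ratio, mixing weight) are placeholders, the paper's actual settings are given in Tables 2 and 3, and a production-quality time stretch or pitch shift would use a phase vocoder (e.g., in librosa) rather than plain resampling.

```python
import numpy as np

def dynamic_range_compression(x, threshold_db=-20.0, ratio=4.0):
    """Apply a static DRC curve sample-wise (simplified: no attack/release).

    Samples whose level exceeds threshold_db are attenuated so that each
    `ratio` dB of input above the threshold yields 1 dB of output above it.
    """
    eps = 1e-10  # avoid log10(0)
    level_db = 20.0 * np.log10(np.abs(x) + eps)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)  # 0 dB below threshold
    return x * 10.0 ** (gain_db / 20.0)

def block_mixing(x1, x2, w=0.5):
    """Mix two training clips sample-wise; the tag set of the mixed clip
    is the union of the two clips' tags."""
    n = min(len(x1), len(x2))
    return w * x1[:n] + (1.0 - w) * x2[:n]

def time_stretch_naive(x, rate):
    """Resampling-based stretch (rate > 1 shortens the clip).

    Note: plain resampling also shifts pitch; a phase vocoder keeps pitch
    fixed. Conversely, pitch shifting can be built from resampling plus a
    pitch-preserving time stretch that restores the original duration.
    """
    n_out = int(len(x) / rate)
    idx = np.linspace(0.0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)
```

Each transform maps a waveform to a new waveform, so the augmented clips can be converted to mel-spectrograms with the same front-end as the original training data.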



Fig. 1. Block diagram of the DNN structure.


Fig. 2. Example of DRC curve.


Fig. 3. Block diagram for overall structure.


Fig. 4. Performance according to parameters of time stretching and pitch shifting. (a) Time stretching, (b) Pitch shifting.


Fig. 5. Performance according to DRC curve and block mixing method. (a) Dynamic range compression, (b) Block mixing.

Table 1. Distribution of weakly labeled data for each class.


Table 2. Parameters of block mixing and dynamic range compression.


Table 3. Performance per data augmentation method and its parameters.



Supported by: Institute for Information & Communications Technology Promotion (IITP)


  1. E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Multimedia, 3, 27-36 (1996).
  2. D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: an IEEE AASP challenge," Proc. IEEE WASPAA, 1-4 (2013).
  3. P. Cano, M. Koppenberger, and N. Wack, "Content-based music audio recommendation," Proc. 13th ACM Int. Conf. Multimedia, 211-212 (2005).
  4. P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley, "CHiME-home: A dataset for sound source recognition in a domestic environment," Proc. IEEE WASPAA (2015).
  5. J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Process. Lett., 24, 279-283 (2017).
  6. S. Mun, S. Park, D. K. Han, and H. Ko, "Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane," Proc. DCASE, 93-97 (2017).
  7. R. Serizel, N. Turpault, H. Eghbal-Zadeh, and A. P. Shah, "Large-scale weakly labeled semi-supervised sound event detection in domestic environments," arXiv preprint arXiv:1807.10501 (2018).
  8. M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., 45, 2673-2681 (1997).
  9. G. E. Dahl, T. N. Sainath, and G. E. Hinton, "Improving DNNs for LVCSR using rectified linear units and dropout," Proc. IEEE ICASSP, 8609-8613 (2013).
  10. M. Hilsamer and S. Herzog, "A statistical approach to automated offline dynamic processing in the audio mastering process," Proc. DAFx, 35-40 (2014).
  11. Dolby Laboratories, "Standards and practices for authoring Dolby Digital and Dolby E bitstreams," Dolby Laboratories, Inc. (2002).
  12. J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio set: An ontology and human-labeled dataset for audio events," Proc. IEEE ICASSP, 776-780 (2017).
  13. S. M. Beitzel, On Understanding and Classifying Web Queries (Ph.D. thesis, Illinois Institute of Technology, Chicago, IL, 2006).