Data augmentation in voice spoofing problem

(A study on improving the performance of voice spoofing attack detection models using data augmentation)

  • Choi, Hyo-Jung (Department of Applied Statistics, Chung-Ang University) ;
  • Kwak, Il-Youp (Department of Applied Statistics, Chung-Ang University)
  • Received : 2021.01.11
  • Accepted : 2021.02.06
  • Published : 2021.06.30

Abstract

ASVspoof 2017 addresses the detection of replay attacks, aiming to distinguish genuine human voices from spoofed ones. A spoofed voice here is a recording of the original voice replayed through different types of microphones and speakers. Data augmentation has been studied extensively for image data, and several studies have applied it to audio. However, few attempts have been made to augment data for voice replay attacks, so this paper explores how audio modification through data augmentation techniques affects replay attack detection. A total of seven augmentation techniques were applied; among them, dynamic value change (DVC) and pitch shifting improved performance. DVC and pitch shifting each reduced the baseline model's EER by about 8%, and DVC in particular produced noticeable accuracy gains in some of the 57 replay configurations. The largest gain was in RC53, where DVC improved the baseline model's accuracy by about 45%, correctly identifying high-end recording and playback devices that had previously been difficult to detect. Based on this study, we find that the DVC and pitch data augmentation techniques help improve performance on the voice spoofing detection problem.
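The abstract does not spell out how the two helpful augmentations are implemented, but they can be sketched roughly in NumPy. The pitch shift below is a crude resampling approximation (a proper implementation, e.g. a phase vocoder, would preserve the clip's duration), and `dynamic_value_change` is only an illustrative guess at what a DVC-style transform could look like, modeled as a slowly varying random gain envelope; the function names and parameters here are ours, not the paper's.

```python
import numpy as np

def pitch_shift_crude(y, n_steps, bins_per_octave=12):
    """Crude pitch shift by linear-interpolation resampling.

    Note: unlike a phase-vocoder pitch shift, this also changes the
    clip's duration; it is only an illustration of the idea.
    """
    rate = 2.0 ** (n_steps / bins_per_octave)
    n_out = int(round(len(y) / rate))
    old_idx = np.arange(len(y))
    new_idx = np.linspace(0, len(y) - 1, n_out)
    return np.interp(new_idx, old_idx, y)

def dynamic_value_change(y, max_db=3.0, n_knots=6, rng=None):
    """Apply a slowly varying random gain envelope (an illustrative
    guess at a DVC-style augmentation, not the paper's definition)."""
    rng = np.random.default_rng() if rng is None else rng
    knots_db = rng.uniform(-max_db, max_db, n_knots)      # gain (dB) at knot points
    env_db = np.interp(np.arange(len(y)),
                       np.linspace(0, len(y) - 1, n_knots), knots_db)
    return y * 10.0 ** (env_db / 20.0)

# Example: augment one second of a 440 Hz tone sampled at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
shifted = pitch_shift_crude(tone, n_steps=2)     # up two semitones, shorter clip
wobbled = dynamic_value_change(tone, rng=np.random.default_rng(0))
```

In practice a library such as librosa or muda would be used for these transforms; the point of the sketch is only that both augmentations are cheap, waveform-level operations applied before feature extraction.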

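Performance above is reported as the equal error rate (EER), the operating point at which the false-acceptance rate (spoofed accepted as genuine) equals the false-rejection rate (genuine rejected as spoofed). A minimal NumPy sketch of estimating EER from detection scores, assuming the common convention that a higher score means "more likely genuine":

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate EER by sweeping thresholds over the observed scores.

    labels: 1 = genuine, 0 = spoofed; higher score = more genuine.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    genuine = scores[labels == 1]
    spoof = scores[labels == 0]
    thresholds = np.sort(scores)
    # False rejection: genuine scores below threshold.
    frr = np.array([(genuine < th).mean() for th in thresholds])
    # False acceptance: spoofed scores at or above threshold.
    far = np.array([(spoof >= th).mean() for th in thresholds])
    i = np.argmin(np.abs(far - frr))          # closest crossing point
    return (far[i] + frr[i]) / 2.0

# Toy example: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
genuine_scores = rng.normal(2.0, 1.0, 500)
spoof_scores = rng.normal(-2.0, 1.0, 500)
scores = np.concatenate([genuine_scores, spoof_scores])
labels = np.concatenate([np.ones(500, int), np.zeros(500, int)])
eer = equal_error_rate(scores, labels)
```

An "8% improvement of the base model EER" in the abstract is relative: e.g. a baseline EER of 10.0% dropping to roughly 9.2%.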

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. 2020R1C1C1A01013020).
