
Comparative study of data augmentation methods for fake audio detection

  • KwanYeol Park (Department of Applied Statistics, Chung-Ang University)
  • Il-Youp Kwak (Department of Applied Statistics, Chung-Ang University)
  • Received: 2022.11.07
  • Reviewed: 2022.12.13
  • Published: 2023.04.30

Abstract

Data augmentation is an effective way to mitigate overfitting, as it lets a model see the training dataset from a variety of perspectives. Beyond basic image augmentations such as rotation, cropping, and horizontal and vertical flips, occlusion-based methods such as Cutmix and Cutout have been proposed. These occlusion-based augmentations can also be applied to models built on speech data once the 1D speech signal has been converted into a 2D spectrogram; SpecAugment, in particular, is an occlusion-based augmentation designed for speech spectrograms. In this study, we compare data augmentation techniques that can be used for fake audio detection. Using the ASVspoof2017 and ASVspoof2019 datasets, released for challenges on detecting fake audio, we converted the speech signals into 2D spectrograms, applied the occlusion-based augmentations Cutout, Cutmix, and SpecAugment to the training data, and trained an LCNN model, a lightweight variant of the CNN. All three augmentation techniques generally improved model performance, although, depending on the method, performance sometimes degraded or remained unchanged. Cutmix performed best on ASVspoof2017, Mixup on ASVspoof2019 LA, and SpecAugment on ASVspoof2019 PA. For SpecAugment, increasing the number of masks also helped improve performance. In conclusion, the most suitable augmentation technique differs depending on the situation and the data.
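To make the comparison concrete, the following is a minimal NumPy sketch of the spectrogram-level augmentations named above (SpecAugment, Cutout, Cutmix, and Mixup). The array shapes, mask widths, and Beta-distribution parameters are illustrative assumptions, not the settings used in the paper's experiments.

```python
# Minimal sketches of occlusion-based spectrogram augmentations.
# Hyperparameters (F, T, mask sizes, alpha) are illustrative assumptions.
import numpy as np

def spec_augment(spec, num_freq_masks=2, num_time_masks=2, F=8, T=20, rng=np.random):
    """SpecAugment-style masking: zero out random frequency bands and time frames."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(num_freq_masks):
        f = rng.randint(0, F + 1)                  # width of the frequency mask
        f0 = rng.randint(0, max(1, n_freq - f))    # starting frequency bin
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):
        t = rng.randint(0, T + 1)                  # width of the time mask
        t0 = rng.randint(0, max(1, n_time - t))    # starting time frame
        spec[:, t0:t0 + t] = 0.0
    return spec

def cutout(spec, mask_h=16, mask_w=16, rng=np.random):
    """Cutout: zero out one random rectangular patch of the spectrogram."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f0 = rng.randint(0, max(1, n_freq - mask_h))
    t0 = rng.randint(0, max(1, n_time - mask_w))
    spec[f0:f0 + mask_h, t0:t0 + mask_w] = 0.0
    return spec

def cutmix(spec_a, spec_b, label_a, label_b, rng=np.random):
    """Cutmix: paste a random patch from spec_b into spec_a; mix labels by patch area."""
    spec = spec_a.copy()
    n_freq, n_time = spec.shape
    lam = rng.beta(1.0, 1.0)                       # sampled area ratio kept from spec_a
    mask_h = int(n_freq * np.sqrt(1.0 - lam))
    mask_w = int(n_time * np.sqrt(1.0 - lam))
    f0 = rng.randint(0, max(1, n_freq - mask_h + 1))
    t0 = rng.randint(0, max(1, n_time - mask_w + 1))
    spec[f0:f0 + mask_h, t0:t0 + mask_w] = spec_b[f0:f0 + mask_h, t0:t0 + mask_w]
    lam = 1.0 - (mask_h * mask_w) / (n_freq * n_time)   # actual area ratio
    return spec, lam * label_a + (1.0 - lam) * label_b

def mixup(spec_a, spec_b, label_a, label_b, alpha=0.4, rng=np.random):
    """Mixup: convex combination of two spectrograms and their labels."""
    lam = rng.beta(alpha, alpha)
    return lam * spec_a + (1.0 - lam) * spec_b, lam * label_a + (1.0 - lam) * label_b

# Example usage on stand-in (frequency x time) log-spectrograms:
if __name__ == "__main__":
    spec_a = np.random.rand(80, 400)
    spec_b = np.random.rand(80, 400)
    augmented = spec_augment(cutout(spec_a))
    mixed_spec, mixed_label = cutmix(spec_a, spec_b, label_a=1.0, label_b=0.0)
```

In the experiments described above, such augmented spectrograms serve as training inputs to the LCNN classifier; the sketch covers only the augmentation step itself.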

Keywords

Funding

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT) in 2023 (No. RS-2023-00208284).

References

  1. Abdel-Hamid O, Mohamed AR, Jiang H, Deng L, Penn G, and Yu D (2014). Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22, 1533-1545. https://doi.org/10.1109/TASLP.2014.2339736
  2. Brown JC (1991). Calculation of a constant Q spectral transform, The Journal of the Acoustical Society of America, 89, 425-434. https://doi.org/10.1121/1.400476
  3. Chapelle O, Weston J, Bottou L, and Vapnik V (2000). Vicinal risk minimization, Advances in Neural Information Processing Systems, 13, Cambridge MA, USA.
  4. Cheng X, Xu M, and Zheng TF (2019). Replay detection using CQT-based modified group delay feature and ResNeWt network in ASVspoof 2019. In Proceedings of 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 540-545.
  5. Choi HJ and Kwak IY (2021). Data augmentation in voice spoofing problem, The Korean Journal of Applied Statistics, 34, 449-460.
  6. Delgado H, Todisco M, Sahidullah M, Evans N, Kinnunen T, Lee KA, and Yamagishi J (2017). ASVspoof 2017 Version 2.0: Meta-data analysis and baseline enhancement, Odyssey 2018-The Speaker and Language Recognition Workshop.
  7. DeVries T and Taylor GW (2017). Improved regularization of convolutional neural networks with Cutout, Available from: arXiv preprint arXiv
  8. Dua M, Jain C, and Kumar S (2021). LSTM and CNN based ensemble approach for spoof detection task in automatic speaker verification systems, Journal of Ambient Intelligence and Humanized Computing, 13, 1985-2000. https://doi.org/10.1007/s12652-021-02960-0
  9. Fong R and Vedaldi A (2019). Occlusions for effective data augmentation in image classification. In Proceedings of 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 4158-4166.
  10. Goodfellow I, Warde-Farley D, Mirza M, et al. (2013). Maxout networks, In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, Georgia, USA, 1319-1327.
  11. Haut JM, Paoletti ME, Plaza J, Plaza A, and Li J (2019). Hyperspectral image classification using random occlusion data augmentation, IEEE Geoscience and Remote Sensing Letters, 16, 1751-1755. https://doi.org/10.1109/LGRS.2019.2909495
  12. Hsu CY, Lin LE, and Lin CH (2021). Age and gender recognition with random occluded data augmentation on facial images, Multimedia Tools and Applications, 80, 11631-11653. https://doi.org/10.1007/s11042-020-10141-y
  13. Ioffe S and Szegedy C (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning, 37, 448-456.
  14. Yang J, Das RK, and Li H (2018). Extended constant-Q cepstral coefficients for detection of spoofing attacks. In Proceedings of 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 1024-1029.
  15. Ke Y, Hoiem D, and Sukthankar R (2005). Computer vision for music identification. In Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 597-604.
  16. Kim G, Han DK, and Ko H (2021). Specmix: A mixed sample data augmentation method for training with time-frequency domain features, Available from: arXiv preprint arXiv:2108.03020
  17. Kinnunen T, Delgado H, Evans N, et al. (2020). Tandem assessment of spoofing countermeasures and automatic speaker verification: Fundamentals, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2195-2210. https://doi.org/10.1109/TASLP.2020.3009494
  18. Krizhevsky A, Sutskever I, and Hinton GE (2012). Imagenet classification with deep convolutional neural networks, Communications of the ACM, 60, 84-90. https://doi.org/10.1145/3065386
  19. Lavrentyeva G, Novoselov S, Malykh E, Kozlov A, Kudashev O, and Shchemelinin V (2017). Audio replay attack detection with deep learning frameworks, Interspeech 2017, 82-86.
  20. Lavrentyeva G, Novoselov S, Tseren A, Volkova M, Gorlanov A, and Kozlov A (2019). STC antispoofing systems for the ASVspoof2019 challenge, Interspeech 2019, 1033-1037.
  21. Madhu A and Kumaraswamy S (2019). Data augmentation using generative adversarial network for environmental sound classification. In Proceedings of 27th IEEE European Signal Processing Conference (EUSIPCO), A Coruna, Spain, 1-5.
  22. Nam H, Kim SH, and Park YH (2022). FilterAugment: An acoustic environmental data augmentation method. In Proceedings of ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 4308-4312.
  23. Nagarsheth P, Khoury E, Patil K, and Garland M (2017). Replay attack detection using DNN for channel discrimination, Interspeech 2017, 97-101.
  24. Park DS, Chan W, Zhang Y, Chiu C-C, Zoph B, Cubuk ED, and Le QV (2019). SpecAugment: A simple data augmentation method for automatic speech recognition, Available from: arXiv preprint arXiv:1904.08779
  25. Shim HJ, Jung JW, Kim JH, and Yu HJ (2022). Attentive max feature map and joint training for acoustic scene classification. In Proceedings of ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 1036-1040.
  26. Singh KK, Yu H, Sarmasi A, Pradeep G, and Lee YJ (2018). Hide-and-Seek: A data augmentation technique for weakly-supervised localization and beyond, Available from: arXiv preprint arXiv:1811.02545
  27. Sukthankar R, Ke Y, and Hoiem D (2006). Semantic learning for audio applications: A computer vision approach. In Proceedings of 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), New York, NY, USA, 112-112.
  28. Tomilov A, Svishchev A, Volkova M, Chirkovskiy A, Kondratev A, and Lavrentyeva G (2021). STC antispoofing systems for the ASVspoof2021 challenge. In Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge, 61-67.
  29. Wei S, Zou S, and Liao F (2020). A comparison on data augmentation methods based on deep learning for audio classification, Journal of Physics: Conference Series, 1453, 012085.
  30. Witkowski M, Kacprzak S, Zelasko P, Kowalczyk K, and Galka J (2017). Audio replay attack detection using high-frequency features, Interspeech 2017, 27-31.
  31. Wu X, He R, Sun Z, and Tan T (2018). A light CNN for deep face representation with noisy labels, IEEE Transactions on Information Forensics and Security, 13, 2884-2896. https://doi.org/10.1109/TIFS.2018.2833032
  32. Wu Z, Kinnunen T, Evans N, Yamagishi J, Hanilci C, Sahidullah Md, and Sizov A (2015). ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge, Sixteenth Annual Conference of the International Speech Communication Association, 2037-2041.
  33. Yun S, Han D, Chun S, Oh SJ, Yoo Y, and Choe J (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 6023-6032.
  34. Zhang C, Yu C, and Hansen JH (2017). An investigation of deep-learning frameworks for speaker verification antispoofing, IEEE Journal of Selected Topics in Signal Processing, 11, 684-694. https://doi.org/10.1109/JSTSP.2016.2647199
  35. Zhang H, Cisse M, Dauphin YN, and Lopez-Paz D (2017). Mixup: Beyond empirical risk minimization, Available from: arXiv preprint arXiv
  36. Zhong Z, Zheng L, Kang G, Li S, and Yang Y (2020). Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 13001-13008.