Ensemble Machine Learning Model Based YouTube Spam Comment Detection

  • Jeong, Min Chul (Department of Digital Media, Ajou University)
  • Lee, Jihyeon (Department of English Language and Literature, Ajou University)
  • Oh, Hayoung (Global Convergence, Sungkyunkwan University)
  • Received: 2019.11.13
  • Accepted: 2019.11.19
  • Published: 2020.05.31

Abstract

This paper proposes a technique for identifying spam comments on YouTube, a platform that has recently seen tremendous growth. Because YouTube content can be monetized through advertising, spammers have appeared who promote their own channels or videos in the comments of popular videos, or who leave comments unrelated to the video. YouTube operates its own spam-blocking system, but some spam comments still get through. We therefore reviewed related studies on YouTube spam comment detection and conducted classification experiments on comment data from popular music videos (Psy, Katy Perry, LMFAO, Eminem, and Shakira) using six machine learning techniques (decision tree, logistic regression, Bernoulli Naive Bayes, random forest, support vector machine with a linear kernel, and support vector machine with a Gaussian kernel) and an ensemble model that combines them.

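The setup described in the abstract, six classifiers plus an ensemble that combines them, could be sketched as follows. This is a minimal illustration rather than the authors' implementation: the abstract does not say how the models are combined or how comments are turned into features, so scikit-learn, hard (majority) voting, and binary bag-of-words features are all assumptions here, and the toy comments and labels are hypothetical and not taken from the paper's dataset.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (1 = spam, 0 = not spam); not from the paper's dataset.
comments = [
    "check out my channel and subscribe please",
    "free gift cards click this link now",
    "subscribe to my channel for free music",
    "visit my page to win money fast",
    "this song never gets old",
    "love the beat and the dance moves",
    "best music video of the year",
    "her voice is amazing in this one",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The six techniques named in the abstract; hyperparameters are assumptions.
base_models = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("bnb", BernoulliNB()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm_linear", SVC(kernel="linear")),
    ("svm_rbf", SVC(kernel="rbf")),  # Gaussian (RBF) kernel
]

# Assumed combination rule: each model casts one vote per comment and the
# majority label wins (hard voting).
model = make_pipeline(
    CountVectorizer(binary=True),  # binary bag-of-words: word present or absent
    VotingClassifier(estimators=base_models, voting="hard"),
)
model.fit(comments, labels)

new_comments = [
    "subscribe to my channel for free gift cards",
    "this song is amazing",
]
predictions = [int(p) for p in model.predict(new_comments)]
print(predictions)
```

With an even number of base models a tie is possible; scikit-learn's hard voting resolves ties in favor of the lower class label in ascending sort order, which here means "not spam". Switching to `voting="soft"` would require probability estimates from every model (for example `SVC(probability=True)`).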