Harmful Document Classification Using the Harmful Word Filtering and SVM

A Harmful Document Classification System Using Harmful Word Filtering and SVM

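  • 이원휘 (Department of Computer Engineering, Chonbuk National University)
  • 정성종 (Division of Electronics and Information Engineering, Chonbuk National University)
  • 안동언 (Division of Electronics and Information Engineering, Chonbuk National University)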
  • Published : 2009.02.28

Abstract

As the World Wide Web has become widespread, web pages now deliver an enormous volume of information. Despite this convenience, the uncontrolled flood of information also creates many problems. Pornographic, violent, and other harmful material is freely available to young people, whom society must protect, and to other users who lack judgment or self-control, and this is creating serious social problems. Various methods have been proposed and studied to address the problem. This paper proposes and implements a system that protects young internet users from harmful content. To classify content as harmful or harmless effectively, the system uses a two-step classification process consisting of harmful word filtering and SVM learning-based filtering. The system achieved an average precision of 92.1%.

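As the web has become commonplace, people both obtain and provide the information they want through it. The web offers convenience as a venue for supplying and acquiring diverse information, but it also carries several problems, such as information overload and the indiscriminate spread of harmful information. Various methods for classifying harmful web documents are currently being studied and used. However, the shortcomings of each method have kept them from producing breakthrough results. This paper proposes a method for blocking harmful web documents as a means of protecting users who need social protection from harmful information. It aims to raise classification precision through a two-step classification process that uses keyword filtering and the SVM algorithm.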
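The two-step process described in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the word list `HARMFUL_WORDS`, the toy training set `TRAIN`, the `high` threshold, the rule that step 1 settles only the clear cases and passes ambiguous documents to step 2, and the small hinge-loss linear SVM trained by subgradient descent (the paper's own features, kernel, and training setup are not given in the abstract).

```python
# Hypothetical harmful-word dictionary; the paper's actual list is not given.
HARMFUL_WORDS = {"xxx", "porn", "adult"}

# Tiny made-up training set: +1 = harmful, -1 = harmless.
TRAIN = [
    ("xxx adult video free", 1),
    ("free adult xxx pics", 1),
    ("adult education class schedule", -1),
    ("free health class for adult learners", -1),
]


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real preprocessing)."""
    return text.lower().split()


def bag_of_words(tokens):
    """Map each token to its count in the document."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts


def harmful_ratio(tokens):
    """Fraction of tokens that appear in the harmful-word dictionary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in HARMFUL_WORDS) / len(tokens)


def svm_score(w, b, x):
    """Linear decision value w.x + b for a bag-of-words vector x."""
    return sum(w.get(t, 0.0) * c for t, c in x.items()) + b


def train_linear_svm(data, epochs=100, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for text, y in data:
            x = bag_of_words(tokenize(text))
            margin = y * svm_score(w, b, x)
            for t in w:  # shrink weights (gradient of the L2 penalty)
                w[t] *= 1.0 - lr * lam
            if margin < 1.0:  # hinge loss is active: move toward label y
                for t, c in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * c
                b += lr * y
    return w, b


def classify(text, w, b, high=0.5):
    """Step 1: harmful word filter. Step 2: SVM for ambiguous documents."""
    tokens = tokenize(text)
    ratio = harmful_ratio(tokens)
    if ratio == 0.0:   # no harmful words at all: treated as harmless
        return "harmless"
    if ratio >= high:  # dominated by harmful words: treated as harmful
        return "harmful"
    score = svm_score(w, b, bag_of_words(tokens))
    return "harmful" if score > 0 else "harmless"


w, b = train_linear_svm(TRAIN)
for doc in ("weather forecast for today", "adult education class schedule"):
    print(doc, "->", classify(doc, w, b))
```

In this sketch the word filter alone would flag "adult education class schedule" because it contains a listed word; handing such borderline documents to a learned classifier is one plausible reason to combine the two steps.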
