A Spam Filter System Based on Maximum Entropy Model Using Co-training with Spamminess Features and URL Features

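  • 공미경 (Department of Computer Engineering, Chonbuk National University) ;
  • 이경순 (Division of Electronics and Information Engineering / Center for Advanced Image and Information Technology, Chonbuk National University)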
  • Published : 2008.02.29

Abstract

This paper presents a spam filter system based on the maximum entropy model that uses co-training with spamminess features and URL features. Spamminess features are the emphasis patterns and abnormal patterns that spammers put into spam messages to convey their intent and to evade spam filters. Because spammers use URLs to point to further details, and alter the URL format so that it is not caught by a blacklist, both normal and abnormal URLs can be key features for detecting spam messages. Co-training treats the spamminess features and the URL features as two feature sets that are independent of each other during training, so the filter system can learn from each of them independently. Experimental results on the TREC spam test collection show that the proposed approach improves accuracy by 9.1% over the base system and by 6.9% over the Bogofilter system. Analysis of the results shows that the proposed spamminess features and URL features are helpful. An experiment with co-training further shows that the two feature sets are useful: the number of training documents is reduced while the accuracy stays close to that of batch learning.
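As a rough illustration of the two feature sets described above, the following Python sketch extracts a few spamminess features and URL features from a message body. The regular expressions, the feature names, and the functions `spamminess_features` and `url_features` are assumptions made for this example; the abstract does not give the paper's actual feature definitions.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns chosen for illustration only.
URL = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
REPEATED_PUNCT = re.compile(r"[!?$*]{2,}")                      # "!!!", "$$$"
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{4,}\b")                    # "FREE"
OBFUSCATED_WORD = re.compile(r"\b[A-Za-z]+[0-9@$]+[A-Za-z]+\b")  # "V1agra"
IP_HOST = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def spamminess_features(text):
    """Return emphasis / abnormal-spelling features found outside URLs."""
    body = URL.sub(" ", text)  # keep URL characters out of the word patterns
    feats = set()
    if REPEATED_PUNCT.search(body):
        feats.add("spam:repeated_punct")
    if ALL_CAPS_WORD.search(body):
        feats.add("spam:all_caps")
    if OBFUSCATED_WORD.search(body):
        feats.add("spam:obfuscated")
    return feats


def url_features(text):
    """Return one feature per URL host, plus flags for abnormal URL forms."""
    feats = set()
    for raw in URL.findall(text):
        try:
            parts = urlsplit(raw)
        except ValueError:  # malformed URL, e.g. an unbalanced IPv6 bracket
            continue
        host = parts.hostname or ""
        feats.add("url:host=" + host)
        if IP_HOST.fullmatch(host):
            feats.add("url:ip_host")      # raw IP address instead of a name
        if "@" in parts.netloc:
            feats.add("url:userinfo")     # "trusted.name@real-host" trick
    return feats


spam_text = "FREE V1agra!!! Visit http://192.168.0.1/offer now"
ham_text = "Meeting at noon, agenda at http://www.example.com/agenda"
spam_view = (spamminess_features(spam_text), url_features(spam_text))
ham_view = (spamminess_features(ham_text), url_features(ham_text))
```

Each message is thus described by two separate feature sets, which is the form of input that co-training needs.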

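This paper proposes a spam filter system based on the maximum entropy model that uses co-training with the spamminess features and URL features found in spam mail. Spamminess features are the emphasis patterns that spammers deliberately insert into spam mail and the words they deform abnormally in order to pass through filter systems. In addition to spamminess features, repeatedly occurring URLs and abnormal URLs are used as features, because both the normal URLs that appear in mail and the abnormal URLs altered to evade filter systems can help to screen out spam mail. The spamminess features and URL features are also used for co-training. Co-training is a semi-supervised learning method that uses the two feature sets independently during training, and it has the advantage that it can make use of documents whose labels are unknown. The experiments confirm that using spamminess features and URLs improves the performance of the spam filter system, and that co-training with the two feature sets reduces the number of training documents required while achieving accuracy close to that of batch learning.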
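The co-training procedure described in the abstract can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the two-class maximum entropy model is written as a bias-free logistic model over binary features and fitted with plain stochastic gradient updates; a message is added to the training set whenever one view labels it with probability at least `confident` and the other view does not confidently disagree. The function names, the threshold, the feature names, and the toy data are all hypothetical.

```python
import math


def prob_spam(weights, feats):
    """P(spam | feats) under a two-class maximum entropy (logistic) model."""
    z = sum(weights.get(f, 0.0) for f in feats)
    z = max(-30.0, min(30.0, z))  # guard math.exp against overflow
    return 1.0 / (1.0 + math.exp(-z))


def train_maxent(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood.

    examples: list of (feature_set, label) with label 1 = spam, 0 = ham.
    """
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            err = label - prob_spam(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * err
    return weights


def co_train(labeled, unlabeled, rounds=5, confident=0.9):
    """labeled: (view_a, view_b, label) triples; unlabeled: (view_a, view_b) pairs."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # One classifier per view, each trained only on its own features.
        model_a = train_maxent([(a, y) for a, _, y in labeled])
        model_b = train_maxent([(b, y) for _, b, y in labeled])
        votes = {}
        for i, views in enumerate(pool):
            for model, feats in zip((model_a, model_b), views):
                p = prob_spam(model, feats)
                if p >= confident or p <= 1.0 - confident:
                    votes.setdefault(i, set()).add(1 if p >= 0.5 else 0)
        # Keep only messages whose confident votes do not conflict.
        agreed = {i: v.pop() for i, v in votes.items() if len(v) == 1}
        if not agreed:
            break
        labeled += [(pool[i][0], pool[i][1], y) for i, y in agreed.items()]
        pool = [x for i, x in enumerate(pool) if i not in agreed]
    model_a = train_maxent([(a, y) for a, _, y in labeled])
    model_b = train_maxent([(b, y) for _, b, y in labeled])
    return model_a, model_b, labeled


# Toy data: view A = spamminess features, view B = URL features.
labeled = [
    ({"spam:repeated_punct"}, {"url:ip_host"}, 1),
    ({"spam:none"}, {"url:host=example.com"}, 0),
]
unlabeled = [
    ({"spam:repeated_punct"}, {"url:userinfo"}),  # view A can label this one
    ({"spam:obfuscated"}, {"url:userinfo"}),      # view B can, one round later
]
baseline_a = train_maxent([(a, y) for a, _, y in labeled])
model_a, model_b, grown = co_train(labeled, unlabeled)
```

In this toy run the spamminess view labels the first unlabeled message, which teaches the URL view a new feature; the URL view then labels the second message, which in turn teaches the spamminess view a feature it never saw in the labeled data. This is the sense in which unlabeled documents reduce the number of labeled training documents needed.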
