A Comparative Study of Feature Selection Methods for Korean Web Documents Clustering

Kim Young-Gi

doi:10.4275/KSLIS.2005.39.1.045

Journal of the Korean Society for Library and Information Science (한국문헌정보학회지)

Volume 39 Issue 1 / Pages 45-58 / 2005 / 1225-598X (pISSN)

Korean Society For Library And Information Science (한국문헌정보학회)


Korean title: 한글 웹 문서 클러스터링 성능향상을 위한 자질선정 기법 비교 연구 (A Comparative Study of Feature Selection Methods for Improving the Clustering Performance of Korean Web Documents)

Kim Young-Gi (Department of Library and Information Science, College of Liberal Arts, Kyungsung University)

Published : 2005.03.01

https://doi.org/10.4275/KSLIS.2005.39.1.045


Abstract

This paper compares feature selection methods for clustering Korean web documents. First, we examined how term features and co-links of web documents affect clustering performance: we clustered web documents by natural-language term features, by co-links, and by both combined, and compared the resulting clusters with the categories to which the documents were originally assigned. Next, we selected term features for each category from training documents using the chi-square statistic ($\chi^2$), Information Gain (IG), and Mutual Information (MI), and applied these features to a separate set of experimental documents. In addition, we propose a new method, Max Feature Selection, which selects the terms that have the maximum value for a category in each experimental document, applies the $\chi^2$ (or MI or IG) value of each term in place of its term frequency in the document, and then clusters the documents. In the results, $\chi^2$ performed better than IG or MI, but the difference was slight. When Max Feature Selection was applied, however, clustering performance improved notably. Max Feature Selection is a simple but effective means of reducing the feature space and performs strongly for Korean web document clustering.
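The abstract names the three term-scoring measures without stating them. For orientation, the block below gives the definitions commonly used in the text categorization literature; whether the paper uses exactly these forms is an assumption, and the contingency counts $A$, $B$, $C$, $D$, $N$ and the category index $m$ are notation introduced here, not taken from the paper.

```latex
% Notation (assumed, not from the paper):
% A = training documents in category c that contain term t
% B = training documents outside c that contain t
% C = training documents in c that do not contain t
% D = training documents outside c that do not contain t
% N = A + B + C + D, and c_1, ..., c_m are the categories
\[
\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}
\]
\[
\mathrm{MI}(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)}
  \approx \log \frac{A \cdot N}{(A + C)(A + B)}
\]
\[
\mathrm{IG}(t) = -\sum_{i=1}^{m} P(c_i) \log P(c_i)
  + P(t) \sum_{i=1}^{m} P(c_i \mid t) \log P(c_i \mid t)
  + P(\bar{t}) \sum_{i=1}^{m} P(c_i \mid \bar{t}) \log P(c_i \mid \bar{t})
\]
```

Note that $\chi^2$ and MI score a term against a single category, whereas IG as written scores a term over all categories at once.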

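This study compares feature selection methods for clustering Korean web documents. Two corpora were used: the experimental documents for clustering were collected from the natural sciences category of Naver, and the training documents for feature selection were collected from the same category of Yahoo Korea. First, the experimental documents were clustered using term features, co-links, and a combination of the two, and the resulting performance was compared. Next, term features were selected from the training documents using the chi-square statistic ($\chi^2$), information gain (IG), and mutual information (MI), and applied to the experimental documents to compare clustering performance. In addition, a "maximum-value feature selection" method (Max Feature Selection), which selects only the terms that have the maximum value for each category as the features representing that category, was introduced experimentally and applied. In the experiments, the clustering accuracy for Korean web documents by feature type was 72.3% for natural-language terms, 74.3% for co-links, 74.8% for the combination of terms and links, 79.6% for $\chi^2$, and 83.8% for Max $\chi^2$. Among the traditional feature selection methods, $\chi^2$ performed slightly better, but no large difference was found. When Max Feature Selection was applied, however, clustering performance improved substantially. The Max Feature Selection method proposed in this paper is a simple yet effective means of reducing the feature space of web documents and of clustering Korean web documents.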
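The abstract describes Max Feature Selection only briefly, so the following Python sketch shows one possible reading of it, not the author's implementation. It assumes that each term is scored against every category with the standard chi-square statistic on training documents, that a term is kept only for the category where its score is highest, and that this score then replaces term frequency as the term's weight in an experimental document. The function names, the toy categories, and the toy terms are all hypothetical.

```python
from collections import defaultdict


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 term/category table.

    a: docs in the category with the term, b: docs outside it with the term,
    c: docs in the category without the term, d: docs outside it without the term.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def score_terms(train_docs):
    """train_docs: list of (category, set_of_terms). Returns {term: {category: score}}."""
    n_docs = len(train_docs)
    cat_size = defaultdict(int)
    term_cat = defaultdict(lambda: defaultdict(int))
    for cat, terms in train_docs:
        cat_size[cat] += 1
        for term in terms:
            term_cat[term][cat] += 1
    scores = {}
    for term, per_cat in term_cat.items():
        doc_freq = sum(per_cat.values())
        scores[term] = {}
        for cat, size in cat_size.items():
            a = per_cat.get(cat, 0)
            b = doc_freq - a
            c = size - a
            d = n_docs - size - b
            scores[term][cat] = chi_square(a, b, c, d)
    return scores


def max_feature_selection(scores):
    """Assumed reading: keep each term only for its highest-scoring category."""
    return {term: max(per_cat.items(), key=lambda kv: kv[1])
            for term, per_cat in scores.items()}


def weight_document(terms, selected):
    """Use the selected score instead of term frequency; drop unknown terms."""
    return {term: selected[term][1] for term in terms if term in selected}


# Hypothetical toy training set with three categories.
train = [
    ("physics", {"quantum", "energy"}),
    ("physics", {"quantum", "particle"}),
    ("biology", {"cell", "energy"}),
    ("biology", {"cell", "gene"}),
    ("chemistry", {"molecule", "energy"}),
    ("chemistry", {"molecule", "bond"}),
]
selected = max_feature_selection(score_terms(train))
weights = weight_document({"quantum", "cell", "laser"}, selected)
```

In this toy data, "quantum" occurs only in the physics documents and is therefore assigned to that category, while "laser" never appears in training and is dropped from the weighted document. The resulting weight vectors would then be passed to whatever clustering algorithm is in use.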


