An Effective Incremental Text Clustering Method for the Large Document Database

Korean title: An Efficient Incremental Document Clustering Technique for Large Document Databases

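  • 강동혁 (Affiliated Research Institute, 네트빌 Co., Ltd.) ;
  • 주길홍 (Department of Computer Science, Graduate School, Yonsei University) ;
  • 이원석 (Department of Computer Science, Yonsei University)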
  • Published : 2003.02.01

Abstract

With the growth of the Internet and of computing, the amount of information available online is increasing rapidly, and most of it is managed in document form. Methods for managing large document collections effectively are therefore needed. Document clustering groups a set of documents by subject according to the similarity among them. It can thus be used for browsing and searching documents, and it can increase search accuracy. This paper proposes an efficient incremental clustering method for a document set that grows gradually. The incremental document clustering algorithm assigns a set of new documents to the existing clusters that were identified in advance. In addition, to improve clustering accuracy, a method for removing stop words is proposed, and term weights are calculated with the proposed TF$\times$NIDF function.

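With the advance of computers and the rapid growth of the Internet, the volume of information has increased explosively. Most of this information is managed in document form, so research is needed on methods for effectively managing and searching the large amount of information represented as documents. Document clustering groups related documents based on inter-document similarity and consolidates them by subject; it can automatically classify large document collections and increase the accuracy of search. This paper proposes an incremental document clustering algorithm for environments in which the set of documents to be clustered changes gradually as new documents are added or existing documents are deleted. When new documents are added, the incremental document clustering algorithm does not re-cluster the entire collection; instead, it actively changes the structure of the clusters that have already been generated, which provides high efficiency. In addition, to increase the accuracy of document clustering, the paper proposes an algorithm that identifies and removes stop words with a statistical technique, and it proposes the TF$\times$NIDF formula, a modification of the TF$\times$IDF formula, for computing accurate term weights in document clustering.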
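
The idea of assigning newly arriving documents to existing clusters without re-clustering the whole collection can be illustrated with a minimal sketch. It is not the paper's algorithm: the TF$\times$NIDF formula is not given in this abstract, so the sketch substitutes a standard smoothed TF$\times$IDF weight, and the class name `IncrementalClusterer`, the `threshold` parameter, the cosine-similarity assignment rule, and the fixed stop-word set are all assumptions made for illustration. It handles additions only, not deletions or cluster restructuring.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Standard smoothed TF x IDF weights (a stand-in, NOT the paper's TF x NIDF)."""
    tf = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in tf.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical helper: single-pass assignment of documents to clusters."""

    def __init__(self, threshold=0.3, stop_words=frozenset()):
        self.threshold = threshold    # assumed similarity cut-off
        self.stop_words = stop_words  # assumed to be supplied by the caller
        self.doc_freq = Counter()     # term -> number of documents containing it
        self.n_docs = 0
        self.clusters = []            # each cluster: {"sum": vector, "size": int}

    def add(self, tokens):
        """Assign one tokenized document; return the index of its cluster."""
        tokens = [t for t in tokens if t not in self.stop_words]
        self.n_docs += 1
        self.doc_freq.update(set(tokens))
        vec = tfidf_vector(tokens, self.doc_freq, self.n_docs)

        # Find the most similar existing cluster (compared against its summed vector).
        best, best_sim = None, 0.0
        for idx, cluster in enumerate(self.clusters):
            sim = cosine(vec, cluster["sum"])
            if sim > best_sim:
                best, best_sim = idx, sim

        # No cluster is similar enough: start a new one.
        if best is None or best_sim < self.threshold:
            self.clusters.append({"sum": dict(vec), "size": 1})
            return len(self.clusters) - 1

        # Otherwise fold the document into the chosen cluster only.
        cluster = self.clusters[best]
        for term, weight in vec.items():
            cluster["sum"][term] = cluster["sum"].get(term, 0.0) + weight
        cluster["size"] += 1
        return best
```

Each call to `add` touches only the existing cluster summaries, so the cost of inserting a document depends on the number of clusters rather than on the number of documents already processed.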
