Document Clustering Methods using Hierarchy of Document Contents

A Document Comparison Method Using a Hierarchy of Document Contents

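  • 황명권 (School of Computer Engineering, Chosun University) ;
  • 배용근 (School of Computer Engineering, Chosun University) ;
  • 김판구 (School of Computer Engineering, Chosun University)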
  • Published : 2006.12.30

Abstract

The web continues to accumulate an abundance of information. Text-based documents in particular are a format that people use easily and frequently. Consequently, much research has been carried out on retrieving text documents with methods based on probability, statistics, vector similarity, Bayesian models, and so on. These approaches, however, could not consider both the subject and the semantics of documents. To overcome this limitation, we propose a document similarity method for semantically retrieving the documents that users want; it is the core method of document clustering. The method first expresses the content of a document as a semantic hierarchy and gives weight to the important hierarchy domains of the document. With this, the similarity between documents can be measured using both the domain weights and the coincidence of concepts in the domain hierarchies.
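To make the idea of combining domain weights with concept coincidence more concrete, the following is a minimal illustrative sketch and not the formula used in the paper, which the abstract does not state. It assumes each document has already been reduced to a mapping from domain names to the sets of concepts found under them. The names `domain_weights` and `similarity`, the choice of a domain's share of concepts as its weight, and the use of Jaccard overlap for concept coincidence are all hypothetical choices made for illustration.

```python
from typing import Dict, Set

# Assumed representation: domain name -> concepts of the document under that domain.
DomainHierarchy = Dict[str, Set[str]]


def domain_weights(doc: DomainHierarchy) -> Dict[str, float]:
    """Weight each domain by its share of the document's concepts (an assumption)."""
    total = sum(len(concepts) for concepts in doc.values())
    if total == 0:
        return {domain: 0.0 for domain in doc}
    return {domain: len(concepts) / total for domain, concepts in doc.items()}


def similarity(a: DomainHierarchy, b: DomainHierarchy) -> float:
    """Combine domain weight and concept coincidence over the shared domains.

    For every domain present in both documents, the Jaccard overlap of the
    concept sets is scaled by the smaller of the two domain weights.
    This scoring rule is hypothetical, not the paper's definition.
    """
    weights_a, weights_b = domain_weights(a), domain_weights(b)
    score = 0.0
    for domain in a.keys() & b.keys():
        union = a[domain] | b[domain]
        if not union:
            continue  # no concepts on either side, nothing to compare
        overlap = len(a[domain] & b[domain]) / len(union)
        score += min(weights_a[domain], weights_b[domain]) * overlap
    return score


if __name__ == "__main__":
    doc_a = {"sport": {"football", "goal", "league"}, "finance": {"bank"}}
    doc_b = {"sport": {"football", "goal", "player"}, "travel": {"hotel"}}
    print(similarity(doc_a, doc_b))
```

Under these assumptions, two documents score highly only when they share heavily weighted domains and also agree on the concepts inside those domains, which is the behaviour the abstract describes in words.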
