Text Categorization Using TextRank Algorithm

TextRank 알고리즘을 이용한 문서 범주화

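  • 배원식 (Department of Computer Engineering, Changwon National University) ;
  • 차정원 (Department of Computer Engineering, Changwon National University)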
  • Published : 2010.01.15

Abstract

We describe a new method for text categorization using the TextRank algorithm. Text categorization is the problem of assigning one or more predefined categories to a text document. TextRank is a graph-based ranking algorithm. If we treat each word as a vertex and the co-occurrence of two adjacent words as an edge, we can build a graph from a document. We then use TextRank to find the important words in that graph and construct features, each of which is a pair consisting of an important word and a word adjacent to it. We evaluate four classifiers: SVM, the naïve Bayes classifier, the maximum entropy model, and the k-NN classifier. The experiments use the non-cross-posted version of the 20 Newsgroups data set. Performance improved with every classifier, which suggests that TextRank is a promising tool for text categorization.
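
For reference, the ranking step can be written with the score recurrence from the original TextRank paper (Mihalcea and Tarau, 2004). The abstract does not state whether the edges are weighted or which damping factor was used, so the unweighted form and the symbol $d$ (commonly set to 0.85) are assumptions here, not details confirmed by this paper:

$$
S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{S(V_j)}{\lvert \mathrm{Out}(V_j) \rvert}
$$

In an undirected word graph, $\mathrm{In}(V)$ and $\mathrm{Out}(V)$ are both the set of words adjacent to $V$, so a word scores highly when it co-occurs with many words that are themselves highly ranked.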

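This paper describes a text categorization method that uses the TextRank algorithm. TextRank is a graph-based ranking algorithm. Treating each word that appears in a document as a node and creating edges from the co-occurrence of words yields a graph for the document. We used TextRank to select high-importance words from this graph and paired each of them with an adjacent word to form a single feature for text classification. We hypothesized that co-occurrence features (pairs of adjacent words) would be good features for text classification because they make the meaning of a single word clearer. As classifiers we used a support vector machine, a Bayesian classifier, a maximum entropy model, and a k-NN classifier. In experiments on the 20 Newsgroups document collection, the proposed method improved classification performance with every classifier.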
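
The feature construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: whitespace tokenization, the damping factor of 0.85, the fixed iteration count, the `top_k` cutoff, and the function names `build_graph`, `textrank`, and `pair_features` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(tokens):
    """Undirected co-occurrence graph: an edge joins each pair of adjacent words."""
    graph = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops from repeated words
            graph[left].add(right)
            graph[right].add(left)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Unweighted TextRank scores; damping and iterations are assumed values."""
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(graph[n]) for n in graph[word])
            for word in graph
        }
    return scores


def pair_features(tokens, top_k=2):
    """Return the top_k ranked words and every adjacent pair containing one."""
    scores = textrank(build_graph(tokens))
    # Break score ties alphabetically so the selection is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    important = set(ranked[:top_k])
    features = {
        (left, right)
        for left, right in zip(tokens, tokens[1:])
        if left in important or right in important
    }
    return important, features


if __name__ == "__main__":
    doc = "space shuttle launch space station orbit space agency".split()
    important, features = pair_features(doc, top_k=1)
    print(sorted(important))
    print(sorted(features))
```

Each resulting word pair would then serve as one feature for a classifier such as SVM or k-NN. How many important words to keep per document is a tuning choice that the abstract does not specify.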
