Latent Keyphrase Extraction Using LDA Model

  • Cho, Taemin (Department of Electrical and Computer Engineering, Sungkyunkwan University)
  • Lee, Jee-Hyong (Department of Electrical and Computer Engineering, Sungkyunkwan University)
  • Received: 2014.09.14
  • Accepted: 2015.03.24
  • Published: 2015.04.25

Abstract

As the number of documents continues to grow, automatic keyphrase extraction has become an important problem. Most previous work, however, extracts keyphrases only from words that appear in a document and therefore overlooks latent keyphrases, which do not appear in the document itself. Latent keyphrases can nevertheless play an important role in text summarization and information retrieval because they capture meaningful concepts or content of a document. They account for more than one fourth of all keyphrases in real-world datasets, and they are useful for short texts such as SNS posts, which rarely contain explicit keyphrases. In this paper, we propose an approach that selects candidate keyphrases from the keyphrases of neighbor documents similar to the given document and evaluates the importance of each candidate using the individual words that make it up. Experimental results show that latent keyphrases can be extracted at a reasonable level.

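With the development of Internet media, the volume of online documents is growing rapidly, and research on automatically finding keywords usable in areas such as document summarization and information retrieval is actively under way. Existing keyword extraction studies, however, target only keywords that appear in a document and therefore cannot extract latent keywords that do not appear in it. Latent keywords account for more than one fourth of the keywords in real-world data and, although they do not appear in the document, they imply its important concepts or content, so they can play an important role in document summarization and information retrieval. They are especially useful for documents such as SNS posts, which are too short for keywords to appear explicitly. To extract latent keywords, this paper proposes a method that selects the keywords of documents similar to a given document as candidate keywords and evaluates the importance of each candidate using the individual words that compose it. Experiments show that the proposed method can extract latent keywords at a reasonable level.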
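
The two steps described in the abstract (collecting candidates from similar documents, then scoring each candidate through its individual words) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity measure or the scoring formula, so cosine similarity over topic distributions and the word score sum_t P(t|d) * P(w|t) are assumptions made here, and the function names, the toy corpus, and the hand-written topic probabilities (which would normally come from a trained LDA model) are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def latent_candidates(doc_text, doc_topics, corpus, k=2):
    """Keyphrases of the k most similar documents that do NOT occur in doc_text."""
    neighbors = sorted(corpus, key=lambda d: cosine(doc_topics, d["topics"]),
                       reverse=True)[:k]
    text = doc_text.lower()
    seen, candidates = set(), []
    for neighbor in neighbors:
        for phrase in neighbor["keyphrases"]:
            # A phrase found in the text is explicit, not latent, so skip it.
            if phrase not in seen and phrase.lower() not in text:
                seen.add(phrase)
                candidates.append(phrase)
    return candidates

def phrase_score(phrase, doc_topics, topic_word):
    """Mean over the phrase's words of sum_t P(t|d) * P(w|t) (assumed scoring)."""
    words = phrase.lower().split()
    if not words:
        return 0.0
    scores = [sum(p_t * topic_word[t].get(w, 0.0) for t, p_t in enumerate(doc_topics))
              for w in words]
    return sum(scores) / len(scores)

def rank_latent_keyphrases(doc_text, doc_topics, corpus, topic_word, k=2):
    """Rank latent candidates by their word-based score, best first."""
    candidates = latent_candidates(doc_text, doc_topics, corpus, k)
    return sorted(candidates,
                  key=lambda p: phrase_score(p, doc_topics, topic_word),
                  reverse=True)

# Toy, hand-written values standing in for a trained LDA model (hypothetical).
topic_word = [
    {"topic": 0.3, "model": 0.3, "inference": 0.2, "bayesian": 0.2},  # topic 0
    {"image": 0.4, "segmentation": 0.3, "pixel": 0.3},                # topic 1
]
corpus = [
    {"topics": [0.9, 0.1], "keyphrases": ["topic model", "bayesian inference"]},
    {"topics": [0.8, 0.2], "keyphrases": ["topic model", "gibbs sampling"]},
    {"topics": [0.1, 0.9], "keyphrases": ["image segmentation"]},
]
doc_text = "We infer hidden themes of short posts with a bayesian inference procedure."
doc_topics = [0.85, 0.15]

ranked = rank_latent_keyphrases(doc_text, doc_topics, corpus, topic_word, k=2)
print(ranked)  # ['topic model', 'gibbs sampling']
```

In this toy run, "bayesian inference" is dropped because it already appears in the text, and "image segmentation" never becomes a candidate because its document is not among the two nearest neighbors. Scoring a candidate through its words lets a phrase that never occurs in the document still receive a nonzero score when its words are probable under the document's topics.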

Cited by

  1. "Latent Keyphrase Extraction Using Deep Belief Networks," vol. 15, no. 3, 2015. https://doi.org/10.5391/IJFIS.2015.15.3.153