KR-WordRank : An Unsupervised Korean Word Extraction Method Based on WordRank
 Title & Authors
Kim, Hyun-Joong; Cho, Sungzoon; Kang, Pilsung
 Abstract
A word is the smallest unit of text analysis, and most text-mining algorithms rest on the premise that the words in a given document can be recognized perfectly. In practice, newly coined words, spelling and spacing errors, and domain adaptation problems make it difficult to recognize words correctly. To make matters worse, obtaining enough training data to cover every situation is both unrealistic and inefficient. An automatic word extraction method that requires no training process is therefore badly needed. WordRank, the most widely used unsupervised word extraction algorithm for Chinese and Japanese, performs poorly on Korean because the structure of the language differs. In this paper, we first discuss why WordRank performs poorly on Korean and then propose a customized WordRank algorithm for Korean, named KR-WordRank, which takes the linguistic characteristics of Korean into account and is more robust to noise in text documents. Experimental results show that KR-WordRank performs significantly better than the original WordRank on Korean. In addition, the proposed algorithm not only extracts proper words but also identifies candidate keywords for effective document summarization.
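The abstract does not spell out the algorithm itself. As a rough illustration of the graph-ranking idea that WordRank-style methods build on, the Python sketch below scores candidate substrings by running a plain PageRank iteration over a graph that links substrings of neighbouring space-delimited tokens. It is not the authors' KR-WordRank: the function names, the restriction to leading substrings, the `max_len` cutoff, and the damping value are all assumptions made for this example.

```python
from collections import defaultdict


def prefixes(token, max_len=4):
    """Return the leading substrings of a token, up to max_len characters."""
    return [token[:i] for i in range(1, min(len(token), max_len) + 1)]


def build_graph(sentences, max_len=4):
    """Link every prefix of a token to every prefix of the next token."""
    graph = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for left, right in zip(tokens, tokens[1:]):
            for a in prefixes(left, max_len):
                for b in prefixes(right, max_len):
                    # Undirected edge between the two candidate substrings.
                    graph[a].add(b)
                    graph[b].add(a)
    return graph


def rank(graph, damping=0.85, iterations=30):
    """Score nodes with PageRank computed by synchronous power iteration."""
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # The new dict is built entirely from the previous scores.
        scores = {
            node: (1 - damping) / n
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores


# Toy corpus (hypothetical); real input would be a large set of documents.
sentences = ["데이터 분석 방법", "데이터 분석 결과", "텍스트 분석 방법"]
scores = rank(build_graph(sentences))
top = sorted(scores, key=scores.get, reverse=True)[:5]
```

In this toy corpus, substrings that neighbour many different contexts, such as "분석", collect more score than substrings seen in a single context. A method tailored to Korean would additionally have to decide which substrings are word stems and which are attached particles or noise, which this sketch does not attempt.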
 Keywords
Word Extraction; Keyword Extraction; Text Mining; Unsupervised Learning; WordRank
 Language
Korean