Probabilistic Segmentation and Tagging of Unknown Words
  • Journal title : Journal of KIISE
  • Volume 43, Issue 4,  2016, pp.430-436
  • Publisher : Korean Institute of Information Scientists and Engineers
  • DOI : 10.5626/JOK.2016.43.4.430
 Title & Authors
Probabilistic Segmentation and Tagging of Unknown Words
Kim, Bogyum; Lee, Jae Sung
Processing unknown words, such as proper nouns and newly coined words, is important for a morphological analyzer that handles documents from various domains. This study proposes a segmentation and tagging method for unknown Korean words within the 3-step probabilistic morphological analysis model. To guess unknown words, the method uses the rich set of suffixes that attach to open-class words such as general nouns and proper nouns. Suffix patterns are learned from a morpheme-tagged corpus, and their probabilities are calculated so that unknown open-class words can be segmented and tagged in the probabilistic morphological analysis model. Experimental results showed that unknown word processing performance improved greatly on documents containing many unregistered words.
unknown word processing; word segmentation; open word class processing; probabilistic morphological analysis
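The idea in the abstract, learning suffix patterns from a morpheme-tagged corpus and using their probabilities to split and tag an unknown word, can be illustrated with a minimal sketch. This is not the authors' 3-step model: it is a plain relative-frequency toy, and the corpus, the Sejong-style tag names (`NNG`, `NNP`, `JKB`, `JKS`), and the function names `learn_suffix_patterns` and `guess_unknown` are all assumptions made for illustration.

```python
from collections import Counter

# Assumed open-class tags (Sejong-style): general noun and proper noun.
OPEN_TAGS = {"NNG", "NNP"}

# Hypothetical morpheme-tagged corpus: each word is a list of (morpheme, tag).
TOY_CORPUS = [
    [("학교", "NNG"), ("에서", "JKB")],
    [("서울", "NNP"), ("에서", "JKB")],
    [("부산", "NNP"), ("에서", "JKB")],
    [("친구", "NNG"), ("가", "JKS")],
    [("가", "VV"), ("다", "EF")],  # skipped: does not start with an open-class morpheme
    [("서울", "NNP")],             # bare noun: contributes the empty suffix
]


def learn_suffix_patterns(corpus):
    """Estimate P(suffix surface, suffix tags, stem tag) by relative frequency."""
    counts = Counter()
    for word in corpus:
        (_, stem_tag), rest = word[0], word[1:]
        if stem_tag not in OPEN_TAGS:
            continue
        surface = "".join(morph for morph, _ in rest)
        tags = tuple(tag for _, tag in rest)
        counts[(surface, tags, stem_tag)] += 1
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


def guess_unknown(word, patterns):
    """Return candidate (prob, stem, stem_tag, suffix, suffix_tags), best first."""
    candidates = []
    for (surface, tags, stem_tag), prob in patterns.items():
        # The stem must be non-empty, so the suffix must be shorter than the word.
        if len(surface) < len(word) and word.endswith(surface):
            stem = word[: len(word) - len(surface)]
            candidates.append((prob, stem, stem_tag, surface, tags))
    return sorted(candidates, key=lambda c: c[0], reverse=True)


patterns = learn_suffix_patterns(TOY_CORPUS)
for candidate in guess_unknown("평창에서", patterns):
    print(candidate)
```

In this toy setup the unseen word `평창에서` is split most plausibly into a proper-noun stem `평창` plus the suffix `에서`, because that suffix pattern follows proper nouns most often in the toy corpus. A real system would additionally need smoothing for unseen patterns and a model of the stem itself, which this sketch omits.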