Probabilistic Segmentation and Tagging of Unknown Words

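  • Bogyum Kim (Department of Digital Informatics and Convergence, Chungbuk National University) ;
  • Jae Sung Lee (Department of Software, Chungbuk National University)
  • Received : 2015.10.06
  • Reviewed : 2016.02.02
  • Published : 2016.04.15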

Abstract

Handling unknown words, such as proper nouns and newly coined words, is essential for a morphological analyzer that processes documents from various domains. This study proposes a method for segmenting and tagging unknown Korean words within the 3-step probabilistic morphological analysis model. To guess unknown open-class words, the method analyzes the rich set of suffixes that attach to open-class words such as general nouns and proper nouns. The suffix patterns are learned automatically from a morpheme-tagged corpus, and the segmentation and tagging probabilities of unknown open-class words are computed in a form that fits the probabilistic morphological analysis model. In experiments, the proposed method greatly improved unknown-word processing on documents containing many unregistered words.
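The general idea of learning suffix patterns from a tagged corpus and using them to guess an unknown open-class stem can be illustrated with a minimal sketch. This is not the authors' model: the toy corpus, the choice of `NNG`/`NNP` as the open-class tags, the function names `learn_suffix_patterns` and `guess_unknown`, and the use of a plain joint relative frequency in place of the paper's probability estimates are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumed open-class tags (Sejong-style general noun and proper noun).
OPEN_TAGS = {"NNG", "NNP"}

# Hypothetical toy corpus: each word is a list of (morpheme, tag) pairs.
CORPUS = [
    [("서울", "NNP"), ("에서", "JKB")],
    [("부산", "NNP"), ("에서", "JKB")],
    [("학교", "NNG"), ("에서", "JKB")],
    [("철수", "NNP"), ("는", "JX")],
    [("학교", "NNG"), ("는", "JX")],
    [("나무", "NNG"), ("를", "JKO")],
]


def learn_suffix_patterns(corpus):
    """Count which suffix strings follow an open-class first morpheme.

    Simplification: the suffix surface form is taken to be the plain
    concatenation of the trailing morphemes (no contraction handling).
    """
    counts = defaultdict(Counter)  # suffix string -> Counter of stem tags
    total = 0
    for word in corpus:
        (_, stem_tag), rest = word[0], word[1:]
        if stem_tag in OPEN_TAGS and rest:
            suffix = "".join(morph for morph, _ in rest)
            counts[suffix][stem_tag] += 1
            total += 1
    return counts, total


def guess_unknown(word, model):
    """Try every split point and score (stem, tag, suffix) candidates
    by the joint relative frequency of (suffix, tag) in the corpus."""
    counts, total = model
    candidates = []
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        for tag, c in counts.get(suffix, {}).items():
            candidates.append((stem, tag, suffix, c / total))
    return sorted(candidates, key=lambda cand: -cand[3])


model = learn_suffix_patterns(CORPUS)
best = guess_unknown("제주에서", model)[0]
print(best[:3])
```

In this toy setting the unseen stem 제주 is separated from the known suffix 에서, and the proper-noun reading ranks first only because 에서 follows `NNP` more often than `NNG` in the sample data. A realistic system would also need smoothing for unseen patterns and would combine these scores with the rest of the analysis model.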
