Phase-based Model Using Web Documents for Korean Unknown Word Recognition

Park, So-Young

doi:10.6109/JKIICE.2009.13.9.1898

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지)

Volume 13, Issue 9 / Pages 1898-1904 / 2009 / 2234-4772 (pISSN) / 2288-4165 (eISSN)

The Korea Institute of Information and Communication Engineering (한국정보통신학회)

웹문서를 이용한 단계별 한국어 미등록어 인식 모델

Park, So-Young (Division of Digital Media, Sangmyung University)

Published: 2009.09.30

https://doi.org/10.6109/JKIICE.2009.13.9.1898

Abstract

Real-world documents such as newspapers and blogs contain newly coined words such as "Wikipedia". However, most existing information processing systems cannot handle these words, because their dictionaries are built from materials collected during system development. In this paper, we propose a model that automatically recognizes Korean unknown words, that is, words absent from a previously constructed dictionary. The proposed model consists of three phases: an unknown noun recognition phase based on full text analysis, an unknown verb recognition phase based on web document frequency, and an unknown noun recognition phase based on web document frequency. Through full text analysis, the model accurately recognizes unknown words that occur repeatedly in a document. By using web documents, it also broadly covers unknown words that occur only once in the document. In addition, the model recognizes both Korean unknown verbs, whose syllables can change from the base form through inflection, and Korean unknown nouns, whose syllables remain unchanged in every eojeol. Experimental results show that the proposed model improves precision by 1.01% and recall by 8.50% compared with a previous model.

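Real documents such as newspapers and blogs contain new words that did not previously exist, such as Wikipedia (위키백과). However, because most information processing technologies build their dictionaries from materials secured at the time of system development, they cannot respond quickly to such new words. This paper therefore proposes a model that automatically recognizes Korean unknown words that are not registered in the dictionary. The proposed model consists of an unknown noun recognition phase based on full text analysis, an unknown predicate recognition phase based on web occurrence frequency, and an unknown noun recognition phase based on web occurrence frequency. Through full text analysis, the proposed model can accurately recognize unknown words that appear several times in a document. On the basis of web documents, it can also broadly recognize unknown words that appear only once in a document. Furthermore, it can recognize not only unknown nouns, whose base form appears unchanged in an eojeol, but also unknown predicates, whose base form may appear in an inflected form. In experiments, the proposed approach improved precision by 1.01% and recall by 8.50% over an existing unknown word recognition method.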
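The three phases named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's rules, thresholds, or resources, so everything concrete here is an assumption made for illustration: the particle list `NOUN_PARTICLES`, the ending table `VERB_ENDINGS`, the "at least two distinct eojeols" criterion, the `min_df` threshold, and the `web_df` callable that stands in for a web document frequency lookup.

```python
from collections import defaultdict
from typing import Callable, List, Optional, Set

# Toy resources (hypothetical, not taken from the paper).
NOUN_PARTICLES = ["에서", "를", "을", "가", "는", "의"]
VERB_ENDINGS = {"했다": "하다", "한다": "하다", "하는": "하다"}  # inflected ending -> base ending


def noun_candidates(eojeol: str) -> List[str]:
    """Return the eojeol itself plus each stem left after stripping one particle."""
    stems = [eojeol]
    for particle in NOUN_PARTICLES:
        if eojeol.endswith(particle) and len(eojeol) > len(particle):
            stems.append(eojeol[: -len(particle)])
    return stems


def verb_base(eojeol: str) -> Optional[str]:
    """Rebuild a base form by swapping a known inflected ending, or return None."""
    for ending, base_ending in VERB_ENDINGS.items():
        if eojeol.endswith(ending) and len(eojeol) > len(ending):
            return eojeol[: -len(ending)] + base_ending
    return None


def recognize(eojeols: List[str], dictionary: Set[str],
              web_df: Callable[[str], int], min_df: int = 100) -> dict:
    # Keep only eojeols that the existing dictionary cannot explain.
    unknown = [e for e in eojeols
               if not any(c in dictionary for c in noun_candidates(e))
               and verb_base(e) not in dictionary]

    # Phase 1: full text analysis. A stem shared by at least two distinct
    # eojeols in the same document is accepted as an unknown noun.
    support = defaultdict(set)
    for e in unknown:
        for stem in noun_candidates(e):
            support[stem].add(e)
    nouns = {stem for stem, forms in support.items() if len(forms) >= 2}
    remaining = [e for e in unknown
                 if not any(c in nouns for c in noun_candidates(e))]

    # Phase 2: unknown verbs. The reconstructed base form must be frequent on the web.
    verbs, rest = set(), []
    for e in remaining:
        base = verb_base(e)
        if base is not None and web_df(base) >= min_df:
            verbs.add(base)
        else:
            rest.append(e)

    # Phase 3: unknown nouns seen only once. Pick the most web-frequent candidate.
    for e in rest:
        best = max(noun_candidates(e), key=web_df)
        if web_df(best) >= min_df:
            nouns.add(best)
    return {"nouns": nouns, "verbs": verbs}


# Stand-in for a real web search count (hypothetical numbers).
FAKE_WEB_DF = {"구글링하다": 500, "트위터": 900}
result = recognize(["위키백과를", "위키백과는", "구글링했다", "트위터에서", "친구가"],
                   {"친구"}, lambda w: FAKE_WEB_DF.get(w, 0))
print(sorted(result["nouns"]), sorted(result["verbs"]))
```

On this toy input, 위키백과 is found in phase 1 because it appears in two eojeols, 구글링하다 in phase 2 through its reconstructed base form, and 트위터 in phase 3 through the stubbed web frequency. A real system would need a morphological analyzer and an actual web frequency source in place of these stand-ins.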
