Coreference Resolution for Korean Using Random Forests

Jeong, Seok-Won; Choi, MaengSik; Kim, HarkSoo

doi:10.3745/KTSDE.2016.5.11.535

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 5, Issue 11, Pages 535-540, 2016
pISSN 2287-5905, eISSN 2734-0503

Korea Information Processing Society (한국정보처리학회)

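Seok-Won Jeong (Major in Computer and Communications Engineering, Kangwon National University);
MaengSik Choi (Major in Computer and Communications Engineering, Kangwon National University);
HarkSoo Kim (Major in Computer and Communications Engineering, Kangwon National University)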

Received : 2016.10.04
Accepted : 2016.10.13
Published : 2016.11.30

https://doi.org/10.3745/KTSDE.2016.5.11.535

Abstract

Coreference resolution identifies the mentions in a document and groups those that refer to the same entity. It is an essential step for natural language processing applications such as information extraction, event tracking, and question answering. Various coreference resolution models based on machine learning (ML) have recently been proposed. As is well known, these ML-based models require large amounts of training data that are manually annotated with coreference tags. Unfortunately, no usable open data for training such models is available for Korean. We therefore propose an efficient coreference resolution model that needs less training data than other ML-based models. The proposed model identifies coreferent mentions using random forests based on sieve-guided features. In experiments on baseball news articles, the proposed model achieved a CoNLL F1-score of 0.6678, higher than that of the other ML-based models.
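
To make the idea of sieve-guided features concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each sieve-style rule is turned into one binary feature for a pair of mentions, that a classifier decides from that feature vector whether the pair corefers, and that positive pairs are merged into clusters. The `Mention` fields, the four rules in `SIEVES`, and the `stand_in_classifier` function are all hypothetical; the abstract does not list the actual features, and in the proposed model the decision would come from a trained random forest rather than the fixed rule used here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Mention:
    text: str      # surface string of the mention
    head: str      # head word of the mention
    sentence: int  # index of the sentence containing the mention


# Hypothetical sieve-style rules; each becomes one binary feature.
def exact_match(a: Mention, b: Mention) -> bool:
    return a.text == b.text


def head_match(a: Mention, b: Mention) -> bool:
    return a.head == b.head


def substring_match(a: Mention, b: Mention) -> bool:
    return a.text in b.text or b.text in a.text


def same_sentence(a: Mention, b: Mention) -> bool:
    return a.sentence == b.sentence


SIEVES = [exact_match, head_match, substring_match, same_sentence]


def pair_features(a: Mention, b: Mention) -> List[int]:
    """Encode a mention pair as one 0/1 feature per sieve rule."""
    return [int(rule(a, b)) for rule in SIEVES]


def stand_in_classifier(features: List[int]) -> bool:
    """Placeholder for a trained random forest's prediction (assumption):
    here a pair corefers if the exact-match or head-match feature fires."""
    return features[0] == 1 or features[1] == 1


def cluster(mentions: List[Mention],
            is_coreferent: Callable[[List[int]], bool]) -> List[List[int]]:
    """Merge every mention pair judged coreferent, using union-find."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if is_coreferent(pair_features(mentions[i], mentions[j])):
            parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(mentions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster(mentions, stand_in_classifier)` returns lists of mention indices, one list per entity. Replacing `stand_in_classifier` with the prediction function of a random forest trained on the same feature vectors would give a pairwise model of the kind the abstract describes, under the assumptions stated above.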

