Named Entity and Event Annotation Tool for Cultural Heritage Information Corpus Construction

문화유산정보 말뭉치 구축을 위한 개체명 및 이벤트 부착 도구

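  • 최지예 (Division of Digital Media, Sangmyung University) ;
  • 김명근 (Division of Digital Media, Sangmyung University) ;
  • 박소영 (Department of Game Mobile Contents, Sangmyung University)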
  • Received : 2012.07.09
  • Accepted : 2012.08.09
  • Published : 2012.09.30


In this paper, we propose a named entity and event annotation tool for constructing a cultural heritage information corpus. With the proposed tool, the annotator tags named entities and events centered on time, location, person, and event, which are useful for cultural heritage information management. To make annotation easier, the tool automatically records the position of each named entity or event, such as its line number or word (eojeol) number. When the annotator selects one of the annotated named entities or events, the tool displays the corresponding string in bold italic in the raw text, so that the annotator can easily check whether it was annotated correctly. To reduce the annotator's manual work, the tool uses patterns that recognize named entities automatically. Because almost no training corpus is available, the tool learns only simple rule patterns. To avoid error propagation, these patterns are extracted directly from the named-entity-annotated corpus, without any additional analysis step. Experimental results show that the proposed tool reduces the annotator's manual workload by more than half.
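The automatic position recording and bold-italic display described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Annotation` record, the `locate` and `highlight` functions, the `LOCATION` label, and the `***...***` marker are all hypothetical names chosen for illustration. The sketch also assumes that a "word" is a whitespace-delimited token and that an annotated string matches whole tokens exactly.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    label: str  # hypothetical label name, e.g. "PERSON", "LOCATION", "TIME", "EVENT"
    text: str   # the annotated string
    line: int   # 1-based line number in the raw text
    word: int   # 1-based index of the first matching word on that line


def locate(raw_text: str, target: str, label: str) -> list[Annotation]:
    """Record the line and word number of every exact occurrence of `target`."""
    target_words = target.split()
    found: list[Annotation] = []
    if not target_words:
        return found
    n = len(target_words)
    for line_no, line in enumerate(raw_text.splitlines(), start=1):
        words = line.split()
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target_words:
                found.append(Annotation(label, target, line_no, i + 1))
    return found


def highlight(raw_text: str, ann: Annotation) -> str:
    """Return the raw text with the annotated words wrapped in ***...***.

    Whitespace on the affected line is normalized to single spaces.
    """
    lines = raw_text.splitlines()
    words = lines[ann.line - 1].split()
    n = len(ann.text.split())
    start = ann.word - 1
    words[start:start + n] = ["***" + " ".join(words[start:start + n]) + "***"]
    lines[ann.line - 1] = " ".join(words)
    return "\n".join(lines)


raw = "The pagoda stands in Gyeongju .\nVisitors reach Gyeongju by train ."
annotations = locate(raw, "Gyeongju", "LOCATION")
for ann in annotations:
    print(ann.label, ann.line, ann.word)
print(highlight(raw, annotations[0]))
```

Because the positions are computed from the raw text rather than typed by hand, the annotator only has to supply the string and its label. Korean eojeols often carry attached particles, so a real tool would likely need a looser match than the exact token comparison assumed here.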


