Restoring Omitted Sentence Constituents in Encyclopedia Documents Using Structural SVM

  • Hwang, Min-Kook (Computer & Telecommunications Engineering Division, Yonsei University) ;
  • Kim, Youngtae (Computer & Telecommunications Engineering Division, Yonsei University) ;
  • Ra, Dongyul (Computer & Telecommunications Engineering Division, Yonsei University) ;
  • Lim, Soojong (SW.Content Research Lab., Electronics and Telecommunications Research Institute) ;
  • Kim, Hyunki (SW.Content Research Lab., Electronics and Telecommunications Research Institute)
  • Received : 2015.03.06
  • Accepted : 2015.06.02
  • Published : 2015.06.30

Abstract

Omission of noun phrases filling obligatory cases is a common phenomenon in Korean and Japanese sentences that is not observed in English. In encyclopedia texts, an argument of a predicate is omitted even more readily when it can be filled with a noun phrase co-referential with the title. The omitted noun phrase is called a zero anaphor or zero pronoun. Encyclopedias like Wikipedia are a major source of information extraction for intelligent application systems such as information retrieval and question answering systems, but the omission of noun phrases degrades the quality of the extracted information. This paper deals with the problem of developing a system that can restore omitted noun phrases in encyclopedia documents. The problem our system addresses is closely related to zero anaphora resolution, one of the important problems in natural language processing. A noun phrase in the text that can be used for restoration is called an antecedent; an antecedent must be co-referential with the zero anaphor. Whereas in zero anaphora resolution the candidate antecedents are only noun phrases in the same text, in our problem the title is also a candidate. The first stage of our system detects the zero anaphor. The second stage carries out the antecedent search over the candidates. If the antecedent search fails, the third stage attempts to use the title as the antecedent. The main characteristic of our system is its use of a structural SVM for finding the antecedent. The noun phrases in the text that appear before the position of the zero anaphor comprise the search space. The main technique in previously proposed methods is to perform binary classification on all the noun phrases in the search space and to select as the antecedent the noun phrase classified positive with the highest confidence.
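
The baseline just described, scoring every noun phrase that precedes the zero anaphor with a binary classifier and keeping the positive prediction with the highest confidence, can be sketched as follows. The scoring function and the toy scores are hypothetical stand-ins for a trained classifier, not the authors' actual features:

```python
def pick_antecedent(candidates, score):
    """Baseline antecedent search: binary classification over the search
    space, keeping the candidate classified positive with the highest
    confidence. Returns None when no candidate is classified positive,
    i.e., when the antecedent search fails."""
    best, best_score = None, 0.0
    for np_ in candidates:   # noun phrases preceding the zero anaphor
        s = score(np_)       # classifier confidence; > 0 means "antecedent"
        if s > best_score:
            best, best_score = np_, s
    return best

# Toy scores standing in for a trained classifier's confidences.
toy_scores = {"Seoul": 0.9, "the city": 0.4, "1394": -0.2}
print(pick_antecedent(["Seoul", "the city", "1394"], toy_scores.get))  # → Seoul
```

A `None` result here corresponds to the failure case in which the third stage would fall back to the title.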
In this paper, however, we propose viewing antecedent search as the problem of assigning antecedent-indicator labels to a sequence of noun phrases; in other words, sequence labeling is employed for antecedent search in the text. We are the first to suggest this idea. To perform sequence labeling, we use a structural SVM that receives a sequence of noun phrases as input and returns a sequence of labels as output. Each output label takes one of two values: one indicating that the corresponding noun phrase is the antecedent and the other indicating that it is not. The structural SVM we used is based on a modified Pegasos algorithm, which applies a subgradient descent methodology to the optimization problem. To train and test our system, we selected a set of Wikipedia texts and constructed an annotated corpus that provides gold-standard answers such as zero anaphors and their possible antecedents. Training examples prepared from the annotated corpus are used to train the SVMs and test the system. For zero anaphor detection, sentences are parsed by a syntactic analyzer and omitted subject or object cases are identified; the performance of our system therefore depends on that of the syntactic analyzer, which is a limitation of our system. When an antecedent is not found in the text, our system tries to use the title to restore the zero anaphor, based on binary classification with a regular SVM. Experiments showed that our system achieves F1 = 68.58%, which means that a state-of-the-art system can be developed with our technique. We expect that future work enabling the system to utilize semantic information can lead to a significant performance improvement.
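
The Pegasos optimizer mentioned above is, in its original binary-SVM form, simple enough to sketch in full. The following is the plain primal subgradient solver, not the authors' modified structural variant, and the toy data are invented purely for illustration:

```python
import random

def pegasos_train(examples, lam=0.01, epochs=200, seed=0):
    """Minimal Pegasos: stochastic subgradient descent for a linear SVM.

    examples: list of (x, y) with x a feature vector (list) and y in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    t = 0
    for _ in range(epochs):
        for _ in range(len(examples)):
            t += 1
            x, y = rng.choice(examples)           # pick one example at random
            eta = 1.0 / (lam * t)                 # step size 1/(lambda * t)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink weights (subgradient of the L2 regularizer) ...
            w = [(1.0 - eta * lam) * wi for wi in w]
            # ... and, on a margin violation, step toward the example.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# Toy linearly separable data: the label is the sign of the first coordinate.
data = [([1.0, 0.2], 1), ([0.8, -0.1], 1), ([-1.0, 0.3], -1), ([-0.7, -0.2], -1)]
w = pegasos_train(data)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in data]
```

In the structural setting the same subgradient scheme is applied, but the update is driven by the highest-scoring label sequence for a whole noun-phrase sequence rather than by a single binary example.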
