Pivot Discrimination Approach for Paraphrase Extraction from Bilingual Corpus

  • Park, Esther (Dept. of Computer and Radio Communications Engineering, Korea University) ;
  • Lee, Hyoung-Gyu (Dept. of Computer and Radio Communications Engineering, Korea University) ;
  • Kim, Min-Jeong (Dept. of Computer and Radio Communications Engineering, Korea University) ;
  • Rim, Hae-Chang (Dept. of Computer and Radio Communications Engineering, Korea University)
  • Received : 2011.01.24
  • Accepted : 2011.03.07
  • Published : 2011.03.30

Abstract

Paraphrasing is the act of restating a text in other words without altering its meaning. Paraphrases are useful in many areas of natural language processing; in particular, they can be incorporated into machine translation to improve both coverage and translation quality. Recent approaches to paraphrase extraction use bilingual parallel corpora, which consist of aligned sentence pairs. In these approaches, paraphrases are identified from the word alignment result through pivot phrases: phrases in one language that are aligned to two or more phrases in the other language. However, word alignment is itself a difficult task and produces many errors, and these errors can lead to the selection of incorrect pivot phrases. In this study, we propose a paraphrase extraction method that discriminates good pivot phrases from bad ones. Each pivot phrase is weighted according to its reliability, which is scored using lexical and part-of-speech information. Experimental results show that the proposed method achieves higher precision and recall in paraphrase extraction than the baseline. We also show that the extracted paraphrases can increase the coverage of Korean-English machine translation.
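
To make the idea of weighting pivot phrases concrete, the following is a minimal sketch and not the scoring used in the paper. It assumes the common pivot formulation in which a paraphrase score is accumulated over shared pivots from relative alignment frequencies, and it takes the pivot reliability as an input. The function name extract_paraphrases, the pivot_weight table, and the toy phrases and counts are all hypothetical; how reliability is actually derived from lexical and part-of-speech information is not specified in the abstract.

```python
from collections import defaultdict


def extract_paraphrases(phrase_pairs, pivot_weight):
    """Score paraphrase candidates that share a pivot phrase.

    phrase_pairs: iterable of (phrase, pivot, count) taken from word alignment.
    pivot_weight: dict mapping pivot -> reliability in [0, 1].
                  These scores are an assumed input (hypothetical values).
    """
    by_phrase = defaultdict(dict)  # phrase -> {pivot: count}
    by_pivot = defaultdict(dict)   # pivot  -> {phrase: count}
    for phrase, pivot, count in phrase_pairs:
        by_phrase[phrase][pivot] = by_phrase[phrase].get(pivot, 0) + count
        by_pivot[pivot][phrase] = by_pivot[pivot].get(phrase, 0) + count

    scores = defaultdict(float)
    for e1, pivots in by_phrase.items():
        total_e1 = sum(pivots.values())
        for f, c1 in pivots.items():
            # Assumption: pivots without a score are treated as fully reliable.
            w = pivot_weight.get(f, 1.0)
            total_f = sum(by_pivot[f].values())
            for e2, c2 in by_pivot[f].items():
                if e2 != e1:
                    # weight * p(pivot | e1) * p(e2 | pivot)
                    scores[(e1, e2)] += w * (c1 / total_e1) * (c2 / total_f)
    return dict(scores)


# Toy alignment counts: "f_noisy" stands for a pivot produced by alignment errors.
pairs = [
    ("under control", "f_good", 3),
    ("in check", "f_good", 1),
    ("under control", "f_noisy", 1),
    ("unrelated", "f_noisy", 3),
]
weights = {"f_good": 1.0, "f_noisy": 0.2}
scores = extract_paraphrases(pairs, weights)
```

In this toy data both candidates for "under control" would receive the same score if every pivot were trusted equally. Down-weighting the unreliable pivot ranks "in check" above "unrelated", which is the effect that discriminating between pivots is meant to achieve.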

Acknowledgement

Supported by: National Research Foundation of Korea