Automatic Extraction of Alternative Words using Parallel Corpus

병렬말뭉치를 이용한 대체어 자동 추출 방법 (A Method for Automatically Extracting Alternative Words Using a Parallel Corpus)

  • Received : 2010.08.12
  • Accepted : 2010.10.22
  • Published : 2010.12.15

Abstract

In information retrieval, describing the same object with different surface forms degrades system performance. To address this problem, we propose a method that automatically extracts alternative word lists: using the Korean/English title pairs of patent information as a parallel corpus, we extract a set of translation words for each word and use that set as the word's features. Because the resulting alternative word lists contain many association words that are not true alternatives, we also propose a filtering method that removes them using association word sets extracted from the Korean titles. Evaluation results show that the proposed method outperforms existing alternative word extraction methods.
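The approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation; several details are assumptions made only for illustration: titles are tokenized by whitespace, a word's translation features are simply the English words that co-occur with it in paired titles, similarity is cosine similarity, and two Korean words count as association words if they appear in the same Korean title. The function names (`translation_features`, `association_words`, `alternative_words`) and the `threshold` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def translation_features(pairs):
    """Map each Korean word to counts of English words in its paired titles.

    Assumption: whitespace tokenization and title-level co-occurrence stand in
    for whatever alignment the paper actually uses.
    """
    feats = defaultdict(Counter)
    for ko_title, en_title in pairs:
        en_words = en_title.lower().split()
        for ko_word in set(ko_title.split()):
            feats[ko_word].update(en_words)
    return dict(feats)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def association_words(ko_titles):
    """Assumption: words sharing a Korean title are association words."""
    assoc = defaultdict(set)
    for title in ko_titles:
        words = set(title.split())
        for word in words:
            assoc[word] |= words - {word}
    return dict(assoc)


def alternative_words(target, pairs, threshold=0.5, use_filter=True):
    """Rank candidate alternatives for `target`, optionally filtering
    out association words."""
    feats = translation_features(pairs)
    assoc = association_words([ko for ko, _ in pairs]) if use_filter else {}
    target_vec = feats.get(target, Counter())
    blocked = assoc.get(target, set())
    scored = [
        (word, cosine(target_vec, vec))
        for word, vec in feats.items()
        if word != target and word not in blocked
    ]
    kept = [(word, score) for word, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: -item[1])


# Toy parallel corpus of (Korean title, English title) pairs, invented here.
pairs = [
    ("휴대폰 케이스", "mobile phone case"),
    ("핸드폰 케이스", "mobile phone case"),
    ("휴대폰 충전기", "mobile phone charger"),
    ("핸드폰 충전기", "mobile phone charger"),
]
print(alternative_words("휴대폰", pairs))
```

In this toy corpus, 케이스 and 충전기 share many English translation words with 휴대폰, so they score highly on translation features alone. The co-occurrence filter removes them because they appear alongside 휴대폰 in the same Korean titles, which is the kind of association-word noise the abstract describes.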
