Automatically Extracting Unknown Translations Using Phrase Alignment

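  • Jae-Hoon Kim (Department of Computer Engineering, Korea Maritime University) ;
  • Seong-Il Yang (Language Processing Research Team, Electronics and Telecommunications Research Institute)
  • Published: 2007.06.30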

Abstract

This paper proposes a model for automatically extracting unknown translations (translation equivalents not yet registered in the translation dictionary) using alignment techniques, and implements an extraction system based on that model. The model is a kind of phrase-alignment model that combines three components: a phrase-boundary model, a language model, and a translation model. The system has three stages: construction of parallel corpora, alignment of Korean and English words, and extraction of unknown translations. To evaluate the system, we built a reference corpus of roughly 2,200 parallel sentences containing about 1,500 unknown translations and ran several experiments on it. The experiments indicate that the proposed model is very useful for extracting unknown translations. For a more objective evaluation, a larger evaluation corpus and higher-quality parallel corpora still need to be built, and further research on improving the extraction model is needed.
