Building a Korean-English Parallel Corpus by Measuring Sentence Similarities Using Sequential Matching of Language Resources and Topic Modeling
  • Journal title : Journal of KIISE
  • Volume 42, Issue 7,  2015, pp.901-909
  • Publisher : Korean Institute of Information Scientists and Engineers
  • DOI : 10.5626/JOK.2015.42.7.901
 Title & Authors
Building a Korean-English Parallel Corpus by Measuring Sentence Similarities Using Sequential Matching of Language Resources and Topic Modeling
Cheon, JuRyong; Ko, YoungJoong
 
 Abstract
In this paper, we propose a method for building a Korean-English parallel corpus from Wikipedia by finding similar sentence pairs based on language resources and topic modeling. We first apply language resources (a Wiki-dictionary, numbers, and the Daum online dictionary) to match words sequentially. The Wiki-dictionary is constructed from Wikipedia titles, and, to take advantage of Wikipedia, we use the translation probabilities in the Wiki-dictionary for word matching. In addition, we improve the accuracy of the sentence-similarity measure by using word distributions obtained from topic modeling. In the experiments, a previous study achieved an F1-score of 48.4% using only language resources combined linearly, and 51.6% when topic modeling over the entire word distribution was additionally considered. Our sequential matching method, which adds translation probabilities to the language resources, achieved an F1-score of 58.3%, 9.9% higher than the previous study. When sequential matching of language resources was combined with topic modeling that considers only important word distributions, the proposed system achieved 59.1%, 7.5% higher than the previous study.
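The abstract describes two ingredients: sequential word matching over language resources with translation probabilities, and weighting matches by word distributions from a topic model. The following is a minimal Python sketch of that general idea only; the resources (wiki_dict, topic_weight), the scoring rule, and all values are hypothetical placeholders, not the authors' implementation.

```python
# Sketch (not the paper's code): score a Korean-English sentence pair by
# sequentially matching Korean tokens against English tokens via a toy
# Wiki-dictionary with translation probabilities, weighting each match by a
# hypothetical topic-model importance weight for the English word.

# Hypothetical Wiki-dictionary: Korean word -> {English word: translation probability}
wiki_dict = {
    "위키백과": {"wikipedia": 0.9, "wiki": 0.1},
    "말뭉치": {"corpus": 0.8, "corpora": 0.2},
}

# Hypothetical topic-model word weights (e.g., derived from an LDA topic distribution);
# "important" words receive larger weights.
topic_weight = {"wikipedia": 0.7, "corpus": 0.6}


def match_score(ko_tokens, en_tokens):
    """Sequentially match Korean tokens against English tokens and
    accumulate a similarity score for the sentence pair."""
    score, matched = 0.0, 0
    for ko in ko_tokens:
        translations = wiki_dict.get(ko)
        if not translations:
            continue  # number matching / online-dictionary lookup would be tried here
        for en in en_tokens:
            if en in translations:
                # translation probability times topic-model importance weight
                score += translations[en] * topic_weight.get(en, 0.1)
                matched += 1
                break  # sequential matching: move on to the next Korean token
    return score / max(matched, 1)


print(match_score(["위키백과", "말뭉치"], ["a", "corpus", "from", "wikipedia"]))
```

In this toy setup, sentence pairs whose high-probability translations also carry large topic weights score higher, which is the intuition behind combining translation probabilities with topic-model word distributions.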
 Keywords
parallel sentence; Wikipedia; comparable corpus; topic model; translation probability
 Language
Korean
 References
1.
Wolfgang Teubert, "Comparable or parallel corpora?," International Journal of Lexicography, Vol. 9, No. 3, pp. 238-264, 1996.

2.
Sunghyun Kim, Seon Yang, and Youngjoong Ko, "Extracting Korean-English Parallel Sentences from Wikipedia," Journal of KIISE: Software and Applications, pp. 580-585, 2014.

3.
Dragos Stefan Munteanu and Daniel Marcu, "Improving machine translation performance by exploiting non-parallel corpora," Computational Linguistics, Vol. 31, No. 4, pp. 477-504, 2005.

4.
Tao Tao and ChengXiang Zhai, "Mining comparable bilingual text corpora for cross-language information integration," Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005), pp. 691-696, 2005.

5.
Jessica C. Ramirez and Yuji Matsumoto, "A Rule-Based Approach For Aligning Japanese-Spanish Sentences From A Comparable Corpora," arXiv preprint arXiv:1211.4488, 2012.

6.
Masao Utiyama and Hitoshi Isahara, "Reliable measures for aligning Japanese-English news articles and sentences," Proc. of ACL '03, pp. 72-79, 2003.

7.
Sisay Fissaha Adafre and Maarten de Rijke, "Finding similar sentences across multiple languages in Wikipedia," Proc. of ACL '06, pp. 62-69, 2006.

8.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.

9.
Zede Zhu, Miao Li, Lei Chen and Zhenxin Yang, "Building Comparable Corpora Based on Bilingual LDA Model," Proc. of ACL '13, pp. 278-282, 2013.

10.
Ferhan Ture and Jimmy Lin, "Why not grab a free lunch?: Mining large corpora for parallel sentences to improve translation modeling," Proc. of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2012), pp. 626-630, 2012.

11.
Mallet toolkit, [Online]. Available: http://mallet.cs.umass.edu/download.php

12.
GIZA++ statistical translation models toolkit, [Online]. Available: http://code.google.com/p/giza-pp/