Performance Improvement by Cluster Analysis in Korean-English and Japanese-English Cross-Language Information Retrieval

  • Kyung-Soon Lee (Division of Electronics and Information Engineering, Chonbuk National University)
  • Published: 2004.04.01

Abstract

This paper presents a method to implicitly resolve ambiguities using dynamic incremental clustering in Korean-to-English and Japanese-to-English cross-language information retrieval (CLIR). The main objective of this paper is to show that document clusters can effectively resolve the ambiguities that increase greatly in translated queries, while also taking into account the context of all the terms in a document. In the framework we propose, a query in Korean or Japanese is first translated into English by looking up bilingual dictionaries, and documents are then retrieved for the translated query terms using either the vector space retrieval model or the probabilistic retrieval model. For the top-ranked retrieved documents, query-oriented document clusters are created incrementally, and the weight of each retrieved document is recalculated using the clusters. In experiments on the TREC test collection, our method achieved improvements of 39.41% and 36.79% for translated queries without ambiguity resolution in Korean-to-English CLIR, and of 17.89% and 30.46% in Japanese-to-English CLIR, with the vector space retrieval model and the probabilistic retrieval model, respectively. Compared with blind feedback in Korean-to-English CLIR, our method achieved a 12.30% improvement for all translation queries. These results indicate that cluster analysis helps to resolve ambiguity.

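This paper proposes a method that implicitly resolves ambiguity through incremental clustering in cross-language information retrieval. The aim of the study is to examine whether document clusters can serve both as document context and as a means of ambiguity resolution when ambiguity has greatly increased through query translation. In the proposed method, a Korean or Japanese query is translated into English using dictionaries, and documents are retrieved for the translated English query with the vector space retrieval model or the probabilistic retrieval model. Incremental clusters are then created dynamically in the rank order of the retrieved documents, and the cluster information is reflected in the query to re-rank the documents. In experiments on the TREC test collection, for queries without ambiguity resolution, the proposed method improved performance by 39.41% with the vector space retrieval model and by 36.79% with the probabilistic retrieval model in Korean-English cross-language information retrieval. In Japanese-English cross-language information retrieval, the improvements were 17.59% and 30.46%, respectively. Compared with the relevance feedback method, it improved performance by 12.30% with the probabilistic retrieval model when ambiguity was not resolved. These results show that cluster analysis helps resolve query ambiguity and thereby contributes to better retrieval performance.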
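
The abstract does not state the clustering criterion or the re-weighting formula, so the following is only a minimal sketch of the described pipeline under loud assumptions: a toy bilingual dictionary (`TOY_DICT`), TF-IDF cosine scores standing in for the vector space model, a single-pass threshold clustering over the top-ranked documents, and a linear mix (`alpha`) of each document's own score with its cluster's similarity to the query. Every function name, parameter, and data item here is hypothetical and is not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy dictionary: every translation is kept, so the
# ambiguity of the source term is carried into the English query.
TOY_DICT = {"은행": ["bank", "ginkgo"], "금리": ["interest", "rate"]}

def translate_query(terms, dictionary):
    """Dictionary lookup with no disambiguation (assumed behaviour)."""
    out = []
    for t in terms:
        out.extend(dictionary.get(t, []))
    return out

def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) and the IDF table."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in df.items()}
    vecs = [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]
    return vecs, idf

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def incremental_clusters(ranked_ids, vecs, threshold=0.2):
    """Visit documents in rank order; join the most similar cluster
    if its similarity reaches the threshold, else open a new one."""
    clusters = []
    for i in ranked_ids:
        best, best_sim = None, threshold
        for c in clusters:
            s = cosine(vecs[i], c["centroid"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"members": [i], "centroid": dict(vecs[i])})
        else:
            best["members"].append(i)
            # Summed vector serves as centroid; cosine ignores its scale.
            for w, v in vecs[i].items():
                best["centroid"][w] = best["centroid"].get(w, 0.0) + v
    return clusters

def rerank(query_terms, docs, top_k=10, alpha=0.5, threshold=0.2):
    """Re-score top-ranked documents using their cluster (assumed mix)."""
    vecs, idf = tfidf_vectors(docs)
    q = {w: c * idf[w] for w, c in Counter(query_terms).items() if w in idf}
    base = [cosine(q, v) for v in vecs]
    ranked = sorted(range(len(docs)), key=lambda i: -base[i])[:top_k]
    final = {}
    for c in incremental_clusters(ranked, vecs, threshold):
        c_sim = cosine(q, c["centroid"])
        for i in c["members"]:
            final[i] = alpha * base[i] + (1 - alpha) * c_sim
    return sorted(final, key=lambda i: -final[i]), final

# Made-up documents: three about finance, two about ginkgo trees.
DOCS = [s.split() for s in [
    "bank interest rate rises",
    "central bank cuts interest rate",
    "ginkgo tree leaves autumn",
    "bank loan interest rate policy",
    "ginkgo seeds tree garden",
]]
query = translate_query(["은행", "금리"], TOY_DICT)
order, scores = rerank(query, DOCS)
```

In this toy setting the wrong translation "ginkgo" still matches two documents, but those documents form a cluster that is only weakly related to the query as a whole, so the cluster term lowers their final scores relative to the finance cluster. The threshold and `alpha` values are arbitrary illustration choices.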
