• Title/Summary/Keyword: cross-language information retrieval


Effective Cross-Lingual Text Retrieval using a Fuzzy Knowledge Base (퍼지 지식베이스를 이용한 효과적인 다언어 문서 검색)

  • Choi, Myeong-Bok
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.8 no.1 / pp.53-62 / 2008
  • Cross-lingual text retrieval (CLTR) is information retrieval in which a user searches a set of documents written in one language using a query in another language. This paper proposes a CLTR system based on a fuzzy multilingual thesaurus to handle partial matching between terms of two different languages. The proposed system uses a fuzzy term matrix, defined in the paper, to perform retrieval effectively. In this matrix, all relation degrees between terms are inferred with the transitive closure algorithm, so that all implicit links between terms are reflected in retrieval processing. With this framework, the proposed CLTR system improves retrieval effectiveness because it can closely emulate a human expert's decision making in CLTR.

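
The transitive-closure step mentioned in the abstract above can be illustrated with a small sketch. The abstract does not say which fuzzy composition the paper uses; the max-min composition below is an assumption, and the terms and relation degrees are invented for illustration only.

```python
def maxmin_transitive_closure(matrix):
    """Return the max-min transitive closure of a square fuzzy relation.

    Assumption: max-min composition; the paper may use a different one.
    """
    n = len(matrix)
    closure = [row[:] for row in matrix]
    # Floyd-Warshall-style pass: a path through term k is only as strong
    # as its weakest link, and we keep the strongest path found so far.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = min(closure[i][k], closure[k][j])
                if via_k > closure[i][j]:
                    closure[i][j] = via_k
    return closure


# Hypothetical terms and relation degrees (not taken from the paper).
terms = ["car", "automobile", "자동차"]
relation = [
    [1.0, 0.9, 0.0],  # "car" has no explicit link to "자동차"
    [0.9, 1.0, 0.8],
    [0.0, 0.8, 1.0],
]
closure = maxmin_transitive_closure(relation)
print(closure[0][2])  # 0.8, the implicit link inferred via "automobile"
```

In this toy matrix the explicit degree between "car" and "자동차" is 0, but the closure infers 0.8 through "automobile", which is the kind of implicit link the abstract refers to.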

Korean-Chinese Person Name Translation for Cross Language Information Retrieval

  • Wang, Yu-Chun; Lee, Yi-Hsun; Lin, Chu-Cheng; Tsai, Richard Tzong-Han; Hsu, Wen-Lian
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.489-497 / 2007
  • Named entity translation plays an important role in many applications, such as information retrieval and machine translation. In this paper, we focus on translating person names, the most common type of named entity in Korean-Chinese cross-language information retrieval (KCIR). Unlike other languages, Chinese uses characters (ideographs), which makes person name translation difficult because one syllable may map to several Chinese characters. We propose an effective hybrid person name translation method to improve the performance of KCIR. First, we use Wikipedia as a translation tool based on the inter-language links between the Korean edition and the Chinese or English editions. Second, we use the Naver people search engine to find the Chinese or English translation of the query name. Third, we extract Korean-English transliteration pairs from Google snippets, and then search for the English-Chinese transliteration in the database of Taiwan's Central News Agency or in Google. The performance of KCIR using our method is over five times better than that of a dictionary-based system. The mean average precision is 0.3490 and the average recall is 0.7534. The method can handle Chinese, Japanese, and Korean as well as non-CJK person names when translating from Korean to Chinese. Hence, it substantially improves the performance of KCIR.

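
For readers unfamiliar with the mean average precision figure quoted in the abstract above, the following is a minimal sketch of the standard definition. The document IDs and relevance judgments are invented, and the paper's exact evaluation setup may differ.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))  # 0.6667
```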

A Method of Chinese and Thai Cross-Lingual Query Expansion Based on Comparable Corpus

  • Tang, Peili; Zhao, Jing; Yu, Zhengtao; Wang, Zhuo; Xian, Yantuan
    • Journal of Information Processing Systems / v.13 no.4 / pp.805-817 / 2017
  • Cross-lingual query expansion is usually based on relationships among monolingual words, whereas a bilingual comparable corpus contains relationships among bilingual words. This paper therefore proposes a query expansion method based on these bilingual relationships. First, word vectors that characterize the bilingual words are trained on a Chinese-Thai bilingual comparable corpus. Next, the correlation between Chinese query words and Thai words is computed from these word vectors, and Thai candidate expansion terms are selected by their correlation values. Then, multiple groups of Thai query expansion sentences are built from the Thai candidate expansion words, based on the Chinese query sentence. Finally, the optimal sentence is obtained using the Chinese-Thai query expansion method and used to perform the Thai query expansion. Experimental results show that the proposed cross-lingual query expansion method can effectively improve the accuracy of Chinese-Thai cross-language information retrieval.
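
The candidate-selection step in the abstract above can be sketched as follows. The abstract only says that a correlation is computed from the word vectors; cosine similarity is assumed here, and the vectors and word labels (`th_car`, `th_road`, `th_cat`) are hypothetical placeholders rather than trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expansion_candidates(query_vec, target_vectors, top_k=2):
    """Return the top_k target-language words most similar to the query word."""
    scored = [(cosine(query_vec, vec), word) for word, vec in target_vectors.items()]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]


# Toy 2-dimensional vectors in a shared bilingual space (assumed, not trained).
zh_query_vec = [1.0, 0.0]
thai_vectors = {
    "th_car": [0.9, 0.1],
    "th_road": [0.6, 0.8],
    "th_cat": [0.0, 1.0],
}
candidates = expansion_candidates(zh_query_vec, thai_vectors, top_k=2)
print(candidates)  # ['th_car', 'th_road']
```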

Performance Improvement by Cluster Analysis in Korean-English and Japanese-English Cross-Language Information Retrieval (한국어-영어/일본어-영어 교차언어정보검색에서 클러스터 분석을 통한 성능 향상)

  • Lee, Kyung-Soon
    • The KIPS Transactions: Part B / v.11B no.2 / pp.233-240 / 2004
  • This paper presents a method to implicitly resolve ambiguities using dynamic incremental clustering in Korean-to-English and Japanese-to-English cross-language information retrieval (CLIR). The main objective of this paper is to show that document clusters can effectively resolve the greatly increased ambiguities in translated queries while taking into account the context of all the terms in a document. In the proposed framework, a query in Korean or Japanese is first translated into English by looking up bilingual dictionaries, and documents are then retrieved for the translated query terms using the vector space retrieval model or the probabilistic retrieval model. For the top-ranked retrieved documents, query-oriented document clusters are created incrementally, and the weight of each retrieved document is recalculated using the clusters. In experiments on the TREC test collection, the method achieved 39.41% and 36.79% improvements over translated queries without ambiguity resolution in Korean-to-English CLIR, and 17.89% and 30.46% improvements in Japanese-to-English CLIR, with vector space retrieval and probabilistic retrieval, respectively. The method also achieved a 12.30% improvement for all translation queries compared with blind feedback in Korean-to-English CLIR. These results indicate that cluster analysis helps to resolve ambiguity.

The Construction of the Comparable Corpus Based on SGML (SGML 기반 비교 가능 코퍼스 구축)

  • 이창열; 김용순; 김성혁
    • Journal of the Korean Society for Information Management / v.15 no.3 / pp.7-26 / 1998
  • The large-scale documents in data repositories are used in diverse applications. In cross-language information retrieval, if the words of a query are polysemous, the system needs a multilingual corpus to translate them accurately into the target words. We constructed a financial comparable corpus, called KFCM (Korean Financial Corpus corresponding to MLCC Corpus), comparable to the MLCC Polylingual Documents, which cover six European languages. It was constructed independently under the DTD of the MLCC comparable corpus and can be used for cross-language information retrieval. This paper discusses the application and construction procedures of KFCM, which is public-domain data.


Knowledge-poor Term Translation using Common Base Axis with application to Korean-English Cross-Language Information Retrieval (과도한 지식을 요구하지 않는 공통기반축에 의한 용어 번역과 한영 교차정보검색에의 응용)

  • 최용석; 최기선
    • Korean Journal of Cognitive Science / v.14 no.1 / pp.29-40 / 2003
  • Cross-Language Information Retrieval (CLIR) handles documents in various languages with a query in a single language: a user who uses one language can retrieve documents in another language through a CLIR system. In CLIR, the query translation method is known to be more efficient. For better query translation performance, more resources are needed, such as dictionaries, ontologies, and parallel or comparable corpora, but these are usually not available. This paper proposes a new concept called the Common Base Axis, adapted to Korean-English query translation, and a new weighting method for dictionary-based query translation. The essential idea is that Korean and English words can be expressed in one vector space by the Common Base Axis, which is then used to calculate sense distance for query weighting. The experiments show that the Common Base Axis gives good performance without an ontology and is especially good for one-word query translation.


Korean-Japanese Cross Lingual Information Retrieval Based on Bi-gram Indexing (바이그램 색인에 기반한 한-일 교차언어검색)

  • Lee, Gyu-Chan; Kang, In-Su; Na, Seung-Hoon; Lee, Jong-Hyeok
    • Proceedings of the Korean Information Science Society Conference / 2005.07b / pp.448-450 / 2005
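  • Cross-language retrieval systems require a variety of language resources. Here we present a method for performing cross-language retrieval using only a Korean-Japanese bilingual dictionary and a bigram index of the Japanese documents. We propose a simple way to generate a list of Japanese translation candidates from a Korean natural-language query without the help of tools such as a morphological analyzer, and a way to weight the translation candidates so as to improve retrieval performance. We then evaluate and analyze the proposed methods through experiments.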

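
As a rough illustration of the bigram indexing described in the abstract above, the sketch below builds a character-bigram inverted index and looks up one translation candidate. The function names, the sample documents, and the choice to require every bigram to match are assumptions for illustration, not the authors' implementation; the weighting of translation candidates is not covered.

```python
def char_bigrams(text):
    """Split text into overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def build_bigram_index(docs):
    """Map each bigram to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in set(char_bigrams(text)):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index, translation):
    """Return documents containing every bigram of one translation candidate."""
    grams = char_bigrams(translation)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# Hypothetical Japanese documents and one dictionary translation candidate.
docs = {"d1": "情報検索システム", "d2": "機械翻訳"}
index = build_bigram_index(docs)
print(char_bigrams("情報検索"))   # ['情報', '報検', '検索']
print(search(index, "情報検索"))  # {'d1'}
```

Because the index is built on raw character pairs, neither the documents nor the translation candidates need word segmentation, which matches the abstract's point about avoiding a morphological analyzer.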

Query Context Information-Based Translation Models for Korean-Japanese Cross-Language Information Retrieval (한-일 교차언어검색에서의 질의 문맥 정보를 이용한 대역어 변환 확률 모델)

  • Lee, Gyu-Chan; Kang, In-Su; Na, Seung-Hoon; Lee, Jong-Hyeok
    • Annual Conference on Human and Language Technology / 2005.10a / pp.97-104 / 2005
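  • In cross-language retrieval, a translation step is essential to bring the query and the documents into the same language. Because of lexical ambiguity, this step produces multiple translation candidates for a single word, which can distort the user's information need and degrade retrieval performance. To address the lexical ambiguity problem, this paper presents an approach that resolves ambiguity by computing the probability of a translated query from the query's context information. It analyzes the strengths and weaknesses of the approach by comparing how performance changes with query length, degree of ambiguity, and the proportion of ambiguous words. It also suggests directions for future work to remedy the current weaknesses.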

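
One way to read "computing the probability of a translated query from the query's context" is sketched below: every combination of translation candidates is scored by the pairwise co-occurrence of its words and the scores are normalized. The scoring function, the floor value for unseen pairs, and the co-occurrence numbers are all assumptions; the abstract does not give the paper's actual model. English words stand in for Japanese translation candidates for readability.

```python
from itertools import combinations, product


def rank_translated_queries(candidates, cooccurrence, floor=0.01):
    """Rank all translated queries by a normalized co-occurrence score.

    candidates: one list of translation candidates per source query word.
    cooccurrence: hypothetical scores keyed by frozenset({word_a, word_b}).
    floor: assumed small score for word pairs never seen together.
    """
    scored = []
    for combo in product(*candidates):
        score = 1.0
        for a, b in combinations(combo, 2):
            score *= cooccurrence.get(frozenset((a, b)), floor)
        scored.append((combo, score))
    total = sum(score for _, score in scored)
    if total == 0:
        return [(combo, 0.0) for combo, _ in scored]
    ranked = [(combo, score / total) for combo, score in scored]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# An ambiguous first word with two candidates, an unambiguous second word.
candidates = [["bank", "shore"], ["money"]]
cooccurrence = {
    frozenset(("bank", "money")): 0.8,
    frozenset(("shore", "money")): 0.2,
}
ranked = rank_translated_queries(candidates, cooccurrence)
print(ranked[0][0])  # ('bank', 'money')
```

The number of combinations grows multiplicatively with the number of ambiguous words, which is one reason query length and degree of ambiguity are natural variables to study, as the abstract does.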

An Automated Classification System of Standard Industry and Occupation Codes by Using Information Retrieval Techniques (정보검색 기법을 이용한 산업/직업 코드 자동 분류 시스템)

  • Lim, Heui Seok
    • The Journal of Korean Association of Computer Education / v.7 no.4 / pp.51-60 / 2004
  • This paper proposes an automated coding system for Korean standard industry and occupation codes for the census, which greatly reduces the cost and labor of manual coding. The proposed system converts natural language responses on survey questionnaires into the corresponding numeric codes using information retrieval techniques and a document classification algorithm. The system was evaluated on 46,762 industry records and 36,286 occupation records using 10-fold cross-validation. In the experiments, the system showed 87.08% and 66.08% production rates when classifying industry records into level 2 and level 5 codes, respectively, and slightly lower performance on occupation code classification. We expect the system to be adequate as a semi-automated coding system that minimizes the manual coding task, or as a verification tool for manual coding results, though it has much room for improvement as a fully automated coding system.


Query Expansion Using Thesaurus for Korean to Chinese Cross-Language Text Retrieval (한.중 교차언어 검색에서 시소러스를 이용한 질의 확장)

  • Jin, Feng; Kang, In-Su; Lee, Jong-Hyeok
    • Proceedings of the Korean Information Science Society Conference / 2003.10a / pp.538-540 / 2003
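  • This paper describes effective query expansion for Korean-Chinese cross-language retrieval. Korean-Chinese cross-language retrieval searches Chinese documents with Korean queries; in this paper, Korean queries are converted into Chinese queries using a bilingual dictionary. For query expansion, we used the Chinese thesaurus Tongyici Cilin (同义词词林). Recall and precision were improved by comparing the mutual information between the synonyms and their neighboring words. Experiments confirmed that retrieval performance improved compared with translation using the dictionary alone.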

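
The mutual-information comparison mentioned in the abstract above might look roughly like the sketch below, which keeps a thesaurus synonym only if it co-occurs with a neighboring query word more often than chance. Pointwise mutual information, the threshold, and all counts are assumptions for illustration; the abstract does not specify the exact formula, and `syn_a`, `syn_b`, and `ctx` are placeholder words.

```python
import math


def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information (base 2) from raw co-occurrence counts."""
    if pair_count == 0:
        return float("-inf")
    return math.log2((pair_count * total) / (count_a * count_b))


def filter_synonyms(synonyms, context_word, counts, pair_counts, total, threshold=0.0):
    """Keep synonyms whose PMI with the context word exceeds the threshold."""
    kept = []
    for syn in synonyms:
        score = pmi(
            pair_counts.get((syn, context_word), 0),
            counts[syn],
            counts[context_word],
            total,
        )
        if score > threshold:
            kept.append(syn)
    return kept


# Hypothetical corpus statistics over 1000 text windows.
total = 1000
counts = {"syn_a": 50, "syn_b": 50, "ctx": 100}
pair_counts = {("syn_a", "ctx"): 20, ("syn_b", "ctx"): 2}
kept = filter_synonyms(["syn_a", "syn_b"], "ctx", counts, pair_counts, total)
print(kept)  # ['syn_a']
```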