Probabilistic Information Retrieval by Document Ranking Using Term Dependencies

• You, Hyun-Jo (Program in Data Science for Humanities, Seoul National University)
• Lee, Jung-Jin (Department of Statistics and Actuarial Science, Soongsil University)
• Accepted : 2019.09.24
• Published : 2019.10.31

Abstract

This paper proposes a probabilistic document ranking model that incorporates term dependencies. Document ranking is a fundamental information retrieval task: sorting the documents in a collection according to their relevance to a user query (Qin et al., Information Retrieval Journal, 13, 346-374, 2010). A probabilistic model computes the conditional probability that each document is relevant given the query. Most widely used models assume term independence because computing the joint probabilities of multiple terms is challenging, yet words in natural language texts are clearly highly correlated. In this paper, we assume a multinomial distribution model that calculates the relevance probability of a document while accounting for the dependency structure of words, and propose an information retrieval model that ranks documents by estimating this probability with the maximum entropy method. Ranking simulation experiments under various multinomial settings show better retrieval results than a model that assumes word independence, and so do document ranking experiments on the real-world LETOR OHSUMED dataset.
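The entropy-based estimation behind this approach can be illustrated with iterative proportional fitting (IPF), the classical Deming-Stephan procedure cited in the references: starting from a seed table that encodes the observed co-occurrence (dependency) structure of two terms, IPF rescales rows and columns until the table matches given marginal term probabilities, yielding the minimum cross-entropy joint distribution consistent with those marginals. The following is a minimal sketch, not the paper's actual estimator; the seed counts and marginal values are hypothetical.

```python
import numpy as np

def ipf(seed, row_marginal, col_marginal, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting (Deming and Stephan, 1940).

    Rescales `seed` so its row and column sums match the given
    marginals. The fixed point is the minimum cross-entropy joint
    distribution relative to the seed; with a uniform seed it is the
    maximum-entropy distribution under the marginal constraints.
    """
    p = seed.astype(float).copy()
    for _ in range(max_iter):
        p *= (row_marginal / p.sum(axis=1))[:, None]  # fit row sums
        p *= (col_marginal / p.sum(axis=0))[None, :]  # fit column sums
        if np.allclose(p.sum(axis=1), row_marginal, atol=tol):
            break
    return p

# Hypothetical 2x2 co-occurrence counts for terms A and B
# (rows: A absent/present, columns: B absent/present).
seed = np.array([[40.0, 10.0],
                 [5.0, 45.0]])
# Hypothetical target marginals: P(A=1) = 0.5, P(B=1) = 0.5.
joint = ipf(seed, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Because row and column scalings preserve the table's odds ratio, the fitted joint keeps the dependency structure of the seed while satisfying the marginal constraints, which is the sense in which the estimate departs from a term-independence model.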

References

1. Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known, Annals of Mathematical Statistics, 11, 427-444. https://doi.org/10.1214/aoms/1177731829
2. Fienberg, S. E. (1970). An iterative procedure for estimation in contingency tables, Annals of Mathematical Statistics, 41, 907-917. https://doi.org/10.1214/aoms/1177696968
3. Kantor, P. B. and Lee, J. J. (1998). Testing the maximum entropy principle for information retrieval, Journal of the American Society for Information Science, 49, 557-566. https://doi.org/10.1002/(SICI)1097-4571(19980501)49:6<557::AID-ASI7>3.0.CO;2-G
4. Lee, J. J. (2005). Discriminating analysis of binary data with multinomial distribution by using the iterative cross entropy minimization estimation, The Korean Communications in Statistics, 12, 125-137.
5. Lee, J. J. and Kantor, P. B. (1991). A study of probabilistic information retrieval systems in the case of inconsistent expert judgments, Journal of the American Society for Information Science, 42, 166-172. https://doi.org/10.1002/(SICI)1097-4571(199104)42:3<166::AID-ASI2>3.0.CO;2-A
6. Lee, J. J. and Park, H. K. (2010). Rule-based classification analysis using entropy distribution, Communications for Statistical Applications and Methods, 17, 527-540. https://doi.org/10.5351/CKSS.2010.17.4.527
7. Manning, C. D., Raghavan, P., and Schütze, H. (2012). An Introduction to Information Retrieval, Cambridge University Press. Online publication: https://doi.org/10.1017/CBO9780511809071
8. Min, J. (2017). Utilizing External Resources for Enriching Information Retrieval, Ph.D. Dissertation, Dublin City University. Available at http://doras.dcu.ie/21981/
9. Qin, T., Liu, T.-Y., Xu, J., and Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval Journal, 13, 346-374. https://doi.org/10.1007/s10791-009-9123-y
10. Robertson, S. E. (1977). The probability ranking principle in IR, Journal of Documentation, 33, 294-304. https://doi.org/10.1108/eb026647
11. Rüschendorf, L. (1995). Convergence of the iterative proportional fitting procedure, The Annals of Statistics, 23, 1160-1174. https://doi.org/10.1214/aos/1176324703
12. Sanderson, M. and Croft, W. B. (2012). The history of information retrieval research, Proceedings of the IEEE, 100, 1444-1451.