An Experimental Study on Topic Distillation Using Web Site Structure (웹 사이트 구조를 이용한 토픽 검색 연구)

  • Published: 2007.09.29

Abstract

This study proposes a topic distillation algorithm that retrieves web sites relevant to a query, following the definition of topic distillation given by TREC, and evaluates its performance experimentally. The algorithm first selects relevant web sites from the web pages retrieved for a query, then computes a relevance (topic) score for each selected site using its hierarchical structure. The experiments used the TREC .GOV test collection together with the queries and relevance judgments of the TREC-2004 topic distillation task. The algorithm returned at least two relevant sites among the top ten results, a relatively high level of performance. An in-depth analysis of the TREC-2004 list of relevant sites also showed that the definition of topic distillation was not always strictly applied when the relevant sites were selected. When the retrieved sites and sub-sites were re-evaluated against a revised list of relevant sites, the performance of the proposed algorithm improved markedly.
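The two-stage procedure in the abstract (select candidate sites from page-level results, then score each site using its structure) can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything specific here is an assumption made for illustration: a site is identified by its host name, hierarchy is approximated by URL path depth, and the names `site_of`, `depth_of`, `rank_sites`, `relevant_in_top_k`, and the `decay` parameter are hypothetical. This is not the authors' algorithm.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Identify a site by its host name (an assumption, not the paper's rule)."""
    return urlsplit(url).netloc.lower()


def depth_of(url):
    """Count non-empty path segments; the site root has depth 0."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def rank_sites(page_results, decay=0.5):
    """Aggregate page scores into site scores, discounting deeper pages.

    page_results: iterable of (url, page_score) pairs from any page retriever.
    decay: hypothetical discount applied once per directory level.
    """
    totals = defaultdict(float)
    for url, score in page_results:
        totals[site_of(url)] += score * (decay ** depth_of(url))
    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def relevant_in_top_k(ranked_sites, relevant_sites, k=10):
    """Count how many of the top-k ranked sites appear in the relevant set."""
    return sum(1 for site, _ in ranked_sites[:k] if site in relevant_sites)


# Toy page-level results; the URLs and scores are made up for illustration.
pages = [
    ("http://www.a.gov/", 2.0),
    ("http://www.a.gov/topics/water/index.html", 4.0),
    ("http://www.b.gov/reports/2003/summary.html", 8.0),
]
ranked = rank_sites(pages)
print(ranked)
print(relevant_in_top_k(ranked, {"www.a.gov"}))
```

In this toy run the site with a matching root page outranks the site whose only match is a deep page, which is the kind of effect a structure-aware site score is meant to capture. The count of relevant sites in the top ten corresponds to the measure reported in the abstract.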
