A Study on Layout Extraction from Internet Documents Through Xpath

Korean title (translated): A Study on a Method for Extracting the Layout of Internet Documents Using XPath

  • Han Kwang-Rok (Division of Computer Engineering, Hoseo University)
  • Sun Bok-Keun (Division of Computer Engineering, Hoseo University)
  • Published: 2005.08.01

Abstract

Most Internet documents today, including news articles, are generated from predefined templates. These templates, however, usually describe only the main content and offer no help in identifying page indexes, advertisements, header data, and similar regions, which makes them poorly suited to documents that serve as input for information retrieval. To process Internet documents across the various areas of information retrieval, such additional regions must be detected as well. This study therefore proposes a method that detects the layout of web pages by identifying the characteristics and structure of the block tags that affect layout and by calculating distances between web pages. In the experiment, extraction succeeded for 640 of 1,000 sample documents, a recall of 64%. The method is intended to reduce the cost and improve the efficiency of automatic web document processing when applied as a preprocessing step for information retrieval tasks such as data extraction and document summarization.
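The idea of comparing pages by the structure of their block tags can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of block tags, the XPath-like path format, and the use of Jaccard distance are all assumptions made for illustration, since the abstract does not specify them. The names `BLOCK_TAGS`, `block_paths`, and `layout_distance` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: an illustrative tag set; the paper's actual list of
# layout-affecting block tags is not given in the abstract.
BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "ol", "p", "form"}
# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockPathCollector(HTMLParser):
    """Collects XPath-like paths (e.g. /html/body/div) of block tags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.paths = set()  # distinct paths that end in a block tag

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in BLOCK_TAGS:
            self.paths.add("/" + "/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def block_paths(html):
    """Return the set of block-tag paths found in an HTML string."""
    collector = BlockPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def layout_distance(html_a, html_b):
    """Jaccard distance between two pages' block-tag path sets.

    Assumption: Jaccard distance stands in for the paper's page distance,
    which the abstract does not define. 0.0 means identical path sets.
    """
    a, b = block_paths(html_a), block_paths(html_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

Under these assumptions, pages produced from the same template share most block-tag paths and yield a small distance, while pages with different layouts yield a distance closer to 1. Such a distance could then be used to group pages by layout before the main content, advertisements, and indexes are separated.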

Keywords

Information Retrieval; Data Extraction; Layout; HTML; XML Technologies