PDFindexer: Distributed PDF Indexing system using MapReduce

  • Murtazaev, Aziz (Department of Computer Engineering, Ajou University) ;
  • Kihm, Jang-Su (Department of Computer Engineering, Ajou University) ;
  • Oh, Sangyoon (Department of Computer Engineering, Ajou University)
  • Received : 2011.11.20
  • Reviewed : 2012.02.13
  • Published : 2012.02.28

Abstract

Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo provide subsecond response times, which is made possible by efficient indexing of pages across the entire Web. The indexing process becomes challenging as the scale grows. Parallel techniques such as the MapReduce framework can assist in efficient large-scale indexing. In this paper we propose PDFindexer, a system for indexing scientific papers in PDF using the MapReduce programming model. Unlike Web search engines, our target domain is scientific papers, which have a pre-defined structure such as title, abstract, sections, and references. Our proposed system parses scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. We provide an overview of the system, its components, and the interactions among them. We discuss some issues related to the design of the system and the use of MapReduce in parsing and indexing large document collections.
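The idea of field-aware indexing with MapReduce can be illustrated with a minimal single-process sketch. It is not the PDFindexer implementation: the sample papers, the field names, and the functions `map_paper`, `shuffle`, `reduce_postings`, and `build_index` are all hypothetical, and the PDF parsing step is assumed to have already produced structured fields. In a real deployment the map and reduce functions would run on separate cluster nodes and the framework would perform the grouping.

```python
from collections import defaultdict

# Hypothetical parsed papers; the field names mirror the structure
# named in the abstract (title, abstract, ...).
PAPERS = {
    "paper-1": {"title": "Distributed indexing with MapReduce",
                "abstract": "Indexing scientific papers in a cluster"},
    "paper-2": {"title": "Parsing PDF structure",
                "abstract": "Recreating title and sections for indexing"},
}


def map_paper(doc_id, fields):
    """Map step: emit ((term, field), doc_id) for each word of each field."""
    for field, text in fields.items():
        for term in text.lower().split():
            yield (term, field), doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(key, doc_ids):
    """Reduce step: collapse the document ids for one key into a sorted posting list."""
    return key, sorted(set(doc_ids))


def build_index(papers):
    """Run map, shuffle, and reduce in sequence to build the inverted index."""
    pairs = (pair for doc_id, fields in papers.items()
             for pair in map_paper(doc_id, fields))
    return dict(reduce_postings(k, v) for k, v in shuffle(pairs).items())


index = build_index(PAPERS)
print(index[("indexing", "abstract")])
```

Keying the index on the pair (term, field) rather than on the term alone lets a query be restricted to one part of a paper, for example matching a word only in titles.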

Keywords