An Update-Efficient, Disk-Based Inverted Index Structure for Keyword Search on Data Streams
Park, Eun Ju; Lee, Ki Yong
As social networking services such as Twitter become increasingly popular, data streams have become widely prevalent. To search data accumulated from data streams efficiently, an index structure is essential. In this paper, we propose an update-efficient, disk-based inverted index structure for keyword search on data streams. When new data arrive on the data stream, the index must be updated to incorporate them. The traditional inverted index is very inefficient to update in terms of disk I/O, because all index data stored on disk must be read and written back each time the index is updated. To solve this problem, we divide the whole inverted index into a sequence of inverted indices of exponentially increasing size. When new data arrive, they are first inserted into the smallest index; later, the small indices are merged into the larger indices, which leads to a small amortized update cost per data item. Furthermore, when indices stored on disk are merged with each other, we minimize the disk I/O cost incurred by the merge operation, resulting in an even smaller update cost. Through various experiments, we compare the update efficiency of the proposed index structure with that of the previous one and show the performance advantage of the proposed structure in terms of update cost.
Inverted Index; Data Streams; Index Update; Keyword Search
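
The idea of splitting one inverted index into a sequence of indices of exponentially increasing size can be illustrated with a minimal sketch. This is not the authors' structure: it is the generic logarithmic (binary-counter) merge scheme, kept entirely in memory, and it does not model the disk I/O optimization of the merge step that the abstract mentions. The class name `GeometricInvertedIndex`, its methods, and the rule that level i holds exactly 2**i documents are all assumptions made for illustration.

```python
from collections import defaultdict


class GeometricInvertedIndex:
    """Toy stand-in for a sequence of inverted indices of doubling size.

    Assumption: level i is either empty or holds exactly 2**i documents.
    Real disk-resident indices are modeled here by plain dictionaries.
    """

    def __init__(self):
        # levels[i] is None or a tuple (doc_count, {term: [doc_id, ...]})
        self.levels = []

    @staticmethod
    def _merge(older, newer):
        """Merge two indices, keeping older postings ahead of newer ones."""
        merged = defaultdict(list)
        for _, postings in (older, newer):
            for term, doc_ids in postings.items():
                merged[term].extend(doc_ids)
        return older[0] + newer[0], dict(merged)

    def insert(self, doc_id, text):
        """Insert one document into the smallest index, cascading merges."""
        carry = (1, {term: [doc_id] for term in set(text.lower().split())})
        level = 0
        while True:
            if level == len(self.levels):
                self.levels.append(None)
            if self.levels[level] is None:
                self.levels[level] = carry
                return
            # The level is full: merge it with the carried index and push
            # the result up to the next, twice as large, level.
            carry = self._merge(self.levels[level], carry)
            self.levels[level] = None
            level += 1

    def search(self, term):
        """Return the ids of all documents containing the term."""
        hits = []
        for entry in self.levels:
            if entry is not None:
                hits.extend(entry[1].get(term.lower(), []))
        return sorted(hits)

    def level_sizes(self):
        """Number of documents currently stored at each level."""
        return [0 if entry is None else entry[0] for entry in self.levels]


if __name__ == "__main__":
    index = GeometricInvertedIndex()
    docs = ["stream index", "keyword search", "index update",
            "data stream", "index merge"]
    for doc_id, text in enumerate(docs):
        index.insert(doc_id, text)
    print(index.search("index"))
    print(index.level_sizes())
```

In this scheme a document is rewritten only when its level is merged upward, so after n insertions each document has taken part in at most about log2(n) merges. This is the sense in which the update cost per item is small in the amortized sense, in contrast to rewriting one monolithic index on every update. A query has to consult every non-empty level, which is the price paid for the cheaper updates.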