Performance Enhancement of a DVA-tree by the Independent Vector Approximation

Performance Enhancement of a Distributed Vector Approximation Tree by Independent Vector Approximation

  • Received : 2011.11.25
  • Accepted : 2012.03.13
  • Published : 2012.04.30

Abstract

Most distributed high-dimensional indexing structures provide reasonable search performance when the dataset is uniformly distributed. When the dataset is clustered or skewed, however, their search performance degrades compared with the uniform case. We propose a method for improving the k-nearest neighbor search performance of the distributed vector approximation-tree on strongly clustered or skewed datasets. The basic idea is to compute the volumes of the leaf nodes of the top-tree of a distributed vector approximation-tree and to assign a different number of bits to each of them, so that the vector approximation retains its ability to distinguish between vectors. In other words, more bits are assigned to high-density clusters. We conducted experiments on synthetic and real datasets to compare the search performance of the proposed scheme with that of the distributed hybrid spill-tree and the original distributed vector approximation-tree. The experimental results show that the proposed scheme consistently and significantly improves the search performance of the distributed vector approximation-tree on strongly clustered or skewed datasets.

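Most of the distributed high-dimensional indexes proposed so far show good search performance on uniformly distributed datasets, but their performance drops sharply on skewed or clustered datasets. This paper proposes a method for improving the k-nearest neighbor search performance of the distributed vector approximation tree on datasets with strongly clustered or skewed distributions. The basic idea is to compute the size of the data space covered by each leaf node of the top tree, which clusters the entire dataset, and to vary the number of bits used to approximate the feature vectors in that space so that the discriminating power of the vector approximation is preserved. That is, more bits are assigned to high-density clusters. We ran comparison experiments against the distributed hybrid spill-tree and the existing distributed vector approximation tree using synthetic and real-world data. The experimental results show that the search performance of the extended distributed vector approximation tree improves substantially on non-uniformly distributed datasets.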
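The basic idea in the abstract, assigning more bits to leaf nodes whose data space is small relative to the number of vectors it holds, can be illustrated with a minimal sketch. The abstract does not state the allocation rule used in the paper, so the rule below is an assumption made purely for illustration: each leaf starts from a hypothetical `base_bits` per dimension and gains one bit per dimension for every doubling of its per-dimension density relative to the average leaf. The names `allocate_bits`, `approximate`, `base_bits`, `min_bits`, and `max_bits` are likewise hypothetical. Volumes are handled in log space because products of many small side lengths underflow in high dimensions.

```python
import math


def allocate_bits(leaves, base_bits=4, min_bits=2, max_bits=8):
    """Pick a per-dimension bit count for each top-tree leaf (illustrative rule).

    leaves: list of (lower, upper, count) tuples, where lower/upper are the
    per-dimension bounds of the leaf's bounding box and count is the number
    of vectors stored in it.
    """
    dim = len(leaves[0][0])
    log_density = []
    for lower, upper, count in leaves:
        # log(volume) as a sum of log side lengths; tiny floor avoids log(0)
        log_vol = sum(math.log(max(hi - lo, 1e-12)) for lo, hi in zip(lower, upper))
        log_density.append(math.log(count) - log_vol)
    mean = sum(log_density) / len(log_density)

    bits = []
    for ld in log_density:
        # Assumed heuristic: a 2**dim increase in density halves the typical
        # spacing per dimension, so one extra bit per dimension is granted.
        extra = (ld - mean) / (dim * math.log(2))
        bits.append(min(max_bits, max(min_bits, round(base_bits + extra))))
    return bits


def approximate(vector, lower, upper, bits):
    """Quantize a vector to grid-cell indices inside one leaf's bounding box."""
    cells = 1 << bits
    code = []
    for x, lo, hi in zip(vector, lower, upper):
        width = (hi - lo) / cells
        idx = int((x - lo) / width) if width > 0 else 0
        code.append(min(cells - 1, max(0, idx)))  # clamp boundary values
    return code


# Two 2-D leaves with equal counts; the second covers a much smaller space.
leaves = [([0.0, 0.0], [1.0, 1.0], 100), ([0.0, 0.0], [0.25, 0.25], 100)]
bits = allocate_bits(leaves)
code = approximate([0.5, 0.99], leaves[0][0], leaves[0][1], bits[0])
```

In this sketch the denser leaf receives more bits per dimension than the sparse one, so its approximation grid is finer and nearby vectors are less likely to fall into the same cell.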
