An Improvement in K-NN Graph Construction using re-grouping with Locality Sensitive Hashing on MapReduce

Title & Authors
An Improvement in K-NN Graph Construction using re-grouping with Locality Sensitive Hashing on MapReduce
Lee, Inhoe; Oh, Hyesung; Kim, Hyoung-Joo

Abstract
Constructing a k-nearest-neighbor (k-NN) graph is an important operation in many web-related applications, including collaborative filtering and similarity search, as well as in data mining and machine learning more broadly. Despite its many elegant properties, brute-force k-NN graph construction has a computational complexity of $O(n^2)$, which is prohibitive for large-scale data sets. For this reason, locality sensitive hashing (LSH), which is efficient for high-dimensional sparse data, is increasingly run on MapReduce, a (key, value)-based distributed framework. Following a two-stage strategy, we use LSH to divide users into small subsets and then compute the similarity of every pair within each subset by brute force on MapReduce. The candidate-group generation stage is critical because the brute-force computation in the following stage runs over every pair in each group. Existing methods, however, do not prevent large candidate groups from forming. In this paper, we propose an efficient algorithm for approximate k-NN graph construction that regroups candidate groups. Experimental results show that our approach is more effective than existing methods in terms of graph accuracy and scan rate.
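The two-stage strategy described above can be illustrated with a minimal single-process sketch. It is not the authors' algorithm: the abstract does not say how regrouping is performed, so the sketch assumes that an oversized candidate group is simply re-hashed with a further MinHash function. The function names, the `max_size` threshold, the use of Jaccard similarity, and the definition of scan rate as the fraction of all pairs actually compared are likewise assumptions made for illustration. A practical LSH setup would normally use several hash tables, and on MapReduce the grouping step would correspond roughly to a map phase emitting (bucket key, user) pairs, with the pairwise comparison done in the reduce phase.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minhash(items, seed):
    """One MinHash value: the smallest seeded hash over the item set."""
    return min(
        (int.from_bytes(
            hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(), "big")
         for x in items),
        default=-1,  # empty item sets all share one bucket
    )


def candidate_groups(users, seeds, max_size):
    """Stage 1 (assumed regrouping rule): bucket users by MinHash and
    re-hash any bucket larger than max_size with the next seed."""
    pending = [(list(users), 0)]
    groups = []
    while pending:
        members, depth = pending.pop()
        # Accept a group if it is small enough or no seeds are left.
        if len(members) <= max_size or depth == len(seeds):
            groups.append(members)
            continue
        buckets = defaultdict(list)
        for u in members:
            buckets[minhash(users[u], seeds[depth])].append(u)
        for bucket in buckets.values():
            pending.append((bucket, depth + 1))
    return groups


def knn_graph(users, groups, k):
    """Stage 2: brute-force similarity inside each candidate group."""
    sims = defaultdict(dict)  # user -> {neighbor: similarity}
    comparisons = 0
    for members in groups:
        for u, v in combinations(members, 2):
            if v in sims[u]:  # pair already compared in another group
                continue
            s = jaccard(users[u], users[v])
            comparisons += 1
            sims[u][v] = s
            sims[v][u] = s
    graph = {
        u: sorted(sims[u], key=lambda v: (-sims[u][v], v))[:k] for u in users
    }
    n = len(users)
    # Assumed definition: compared pairs divided by all n(n-1)/2 pairs.
    scan_rate = comparisons / (n * (n - 1) / 2) if n > 1 else 0.0
    return graph, scan_rate
```

Calling `candidate_groups(users, seeds=[1, 2, 3], max_size=2)` on a dictionary that maps user IDs to item sets, and passing the result to `knn_graph`, returns the approximate graph together with the scan rate. Setting `max_size` to at least the number of users yields a single group and reduces the sketch to the exact brute-force method.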
Keywords
Big Data; MapReduce; k-NN Graph Construction; Locality Sensitive Hashing (LSH); MinHash
Language
Korean