K Nearest Neighbor Joins for Big Data Processing based on Spark

  • JIAQI, JI (Department of Computer Engineering, Wonkwang University)
  • Chung, Yeongjee (Department of Computer Engineering, Wonkwang University)
  • Received : 2017.05.22
  • Accepted : 2017.06.30
  • Published : 2017.09.30

Abstract

K Nearest Neighbor Join (KNN Join) is a simple yet effective method in machine learning, and it has long been widely used on small datasets. As data volumes grow, however, running it on a single machine becomes infeasible in real applications because of memory and time constraints. MapReduce, a popular batch processing model that runs on clusters of many computers, is now widely used for large-scale data processing. Hadoop is a framework that implements MapReduce, but Spark, a newer framework, can improve considerably on its performance. In this study, we present a KNN Join implementation based on Spark. Thanks to Spark's in-memory computation, we expect it to be faster and more efficient than a Hadoop-based implementation. In our experiments, we study how different factors influence running time and demonstrate the robustness and efficiency of our approach.
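
To make the operation concrete, the following is a minimal single-machine sketch of a KNN join: for every point in a dataset R it finds the k closest points in a dataset S. It is not the implementation described in the paper and it does not use the Spark API. It only mimics the map/reduce shape in plain Python, assuming a simple block-wise scheme in which each pair of blocks produces local candidates that are then merged per point. The names `split`, `map_block_pair`, `knn_join`, and `n_parts` are hypothetical and chosen for illustration.

```python
import heapq
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def split(points, n_parts):
    """Round-robin split of (id, vector) pairs into n_parts blocks."""
    return [points[i::n_parts] for i in range(n_parts)]


def map_block_pair(r_block, s_block, k):
    """'Map' step (illustrative): for one (R block, S block) pair,
    emit each r's k nearest neighbours found inside that S block."""
    for r_id, r_vec in r_block:
        candidates = [(dist(r_vec, s_vec), s_id) for s_id, s_vec in s_block]
        yield r_id, heapq.nsmallest(k, candidates)


def knn_join(R, S, k, n_parts=2):
    """Brute-force block KNN join: returns {r_id: [k nearest s_ids]}."""
    partial = {}
    # Every R block is compared with every S block, as a cluster
    # would do in parallel; here the pairs run sequentially.
    for r_block, s_block in product(split(R, n_parts), split(S, n_parts)):
        for r_id, local in map_block_pair(r_block, s_block, k):
            partial.setdefault(r_id, []).extend(local)
    # 'Reduce' step (illustrative): merge local candidates per r
    # and keep the k globally closest.
    return {
        r_id: [s_id for _, s_id in heapq.nsmallest(k, cands)]
        for r_id, cands in partial.items()
    }


R = [("r1", (0.0, 0.0)), ("r2", (10.0, 10.0))]
S = [("s1", (1.0, 0.0)), ("s2", (0.0, 2.0)),
     ("s3", (9.0, 9.0)), ("s4", (20.0, 20.0))]
result = knn_join(R, S, k=2)
```

In a cluster setting each block pair could be handled by a different worker, and keeping intermediate candidates in memory between the two steps is the kind of workload where an in-memory framework is expected to help. The sketch compares all block pairs, so it illustrates the definition of the join rather than an efficient partitioning strategy.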
