Using a Greedy Algorithm for the Improvement of a MapReduce, Theta join, M-Bucket-I Heuristic

• Journal title : Journal of KIISE
• Volume 43, Issue 2, 2016, pp. 229-236
• Publisher : Korean Institute of Information Scientists and Engineers
• DOI : 10.5626/JOK.2016.43.2.229
Authors
Kim, Wooyeol; Shim, Kyuseok

Abstract
Theta join is one of the essential query types in database systems. As the amount of data to be processed increases, processing theta joins on a single machine becomes impractical. Therefore, theta-join algorithms using distributed computing frameworks have been studied widely. Although one of the state-of-the-art theta-join algorithms uses the M-Bucket-I heuristic, it is hard to use in practice because the running time of the M-Bucket-I heuristic, which computes a mapping from a record to a reducer (i.e., a reducer mapping), is $O(n)$, where $n$ is the size of the input data. In this paper, we propose the MBI-I algorithm, which reduces the running time of the M-Bucket-I heuristic to $O(r_{max} \log n)$ and gives the same result as the M-Bucket-I heuristic. We also conducted several experiments to evaluate our algorithm and confirmed that it can improve the performance of a theta join by 10%.
Keywords
theta join; distributed computing; MapReduce; histogram
Language
Korean