RHadoop platform for K-Means clustering of big data
Shin, Ji Eun; Oh, Yoon Sik; Lim, Dong Hoon
RHadoop is a collection of R packages that allows users to manage and analyze data with Hadoop. In this paper, we implement the K-Means algorithm on the MapReduce framework with RHadoop so that the clustering method can be applied to large-scale data. The main idea is to apply a combiner to the map output, which reduces the amount of data the reducers must process. We show that our RHadoop K-Means implementation with a combiner is faster than the regular implementation without a combiner as the size of the data set increases. We also implement the Elbow method with MapReduce to find the optimal number of clusters for K-Means clustering on large data sets. On small data, our MapReduce implementation of the Elbow method and the classical kmeans() function in R give similar results.
Keywords: Big data; Hadoop; K-Means clustering; R; RHadoop
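The combiner idea and the Elbow method described in the abstract can be illustrated with a small, self-contained sketch. It is written in plain Python and only simulates the map, combine, and reduce phases in memory; it does not use RHadoop or Hadoop and is not the authors' implementation. The function names (`map_phase`, `combine`, `reduce_phase`, `kmeans_mr`, `total_wss`, `farthest_first`), the synthetic three-blob data, the four simulated input splits, and the farthest-first seeding are all assumptions made for illustration.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: sqdist(point, centers[j]))

def map_phase(split, centers):
    """Map: emit (cluster_id, (point, 1)) for every point in one input split."""
    return [(nearest(p, centers), (p, 1)) for p in split]

def combine(pairs):
    """Combiner: collapse (vector, count) pairs to one (sum, count) per cluster."""
    acc = {}
    for cid, (vec, n) in pairs:
        if cid in acc:
            s, c = acc[cid]
            acc[cid] = ([a + b for a, b in zip(s, vec)], c + n)
        else:
            acc[cid] = (list(vec), n)
    return list(acc.items())

def reduce_phase(pairs, centers):
    """Reduce: merge partial sums per cluster; an empty cluster keeps its old center."""
    merged = dict(combine(pairs))
    return [[x / merged[j][1] for x in merged[j][0]] if j in merged else list(old)
            for j, old in enumerate(centers)]

def kmeans_mr(splits, centers, iters=10, use_combiner=True):
    """Run `iters` simulated MapReduce rounds.

    Returns (centers, number of records sent to the reducer in the last round).
    """
    shuffled = 0
    for _ in range(iters):
        outputs = [map_phase(s, centers) for s in splits]
        if use_combiner:
            outputs = [combine(o) for o in outputs]  # runs once per mapper
        shuffle = [kv for o in outputs for kv in o]
        shuffled = len(shuffle)
        centers = reduce_phase(shuffle, centers)
    return centers, shuffled

def total_wss(splits, centers):
    """Total within-cluster sum of squares: per-split partial sums, then one total."""
    return sum(sum(min(sqdist(p, c) for c in centers) for p in s) for s in splits)

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch, not from the paper)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(list(far))
    return centers

# Synthetic data: three well-separated 2-D blobs, spread over four simulated splits.
random.seed(0)
blobs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
points = [[random.gauss(cx, 0.5), random.gauss(cy, 0.5)]
          for cx, cy in blobs for _ in range(200)]
random.shuffle(points)
splits = [points[i::4] for i in range(4)]

init = farthest_first(points, 3)
with_comb, n_with = kmeans_mr(splits, init, use_combiner=True)
without_comb, n_without = kmeans_mr(splits, init, use_combiner=False)
print("records sent to reducer:", n_with, "with combiner,", n_without, "without")

# Elbow method: fit K-Means for several k and look for the bend in the WSS curve.
wss_by_k = {}
for k in range(1, 7):
    fitted, _ = kmeans_mr(splits, farthest_first(points, k))
    wss_by_k[k] = total_wss(splits, fitted)
    print(k, round(wss_by_k[k], 1))
```

With the combiner, each simulated mapper sends at most one record per cluster to the reducer instead of one record per point, while the resulting centers agree with the uncombined run up to floating-point rounding. This works because the partial (sum, count) pairs can be merged in any order. The sketch only counts shuffled records and says nothing about running time on a real cluster.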