RHadoop platform for K-Means clustering of big data

Shin, Ji Eun; Oh, Yoon Sik; Lim, Dong Hoon

doi:10.7465/jkdi.2016.27.3.609

Journal of the Korean Data and Information Science Society

Volume 27, Issue 3, Pages 609-619, 2016, pISSN 1598-9402

The Korean Data and Information Science Society (한국데이터정보과학회)



Shin, Ji Eun (Department of Information and Statistics, Gyeongsang National University) ;
Oh, Yoon Sik (Division of Biological Sciences, Gyeongsang National University) ;
Lim, Dong Hoon (Department of Information and Statistics, Gyeongsang National University)


Received : 2016.03.07
Accepted : 2016.03.23
Published : 2016.05.31

https://doi.org/10.7465/jkdi.2016.27.3.609

Abstract

RHadoop is a collection of R packages that allow users to manage and analyze data with Hadoop. In this paper, we implement the K-Means algorithm on the MapReduce framework with RHadoop to make the clustering method applicable to large-scale data. The main idea is to introduce a combiner that operates on the map output, reducing the amount of data the reducers must process. We show that our K-Means algorithm using RHadoop with a combiner becomes faster than the regular algorithm without a combiner as the size of the data set increases. We also implement the Elbow method with MapReduce to find the optimal number of clusters for K-Means clustering on large data sets. On small data, our MapReduce implementation of the Elbow method and the classical kmeans() function in R gave similar results.
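
The role of the combiner can be illustrated with a minimal in-memory sketch of one K-Means iteration. This is not the paper's RHadoop code: it is plain Python, and the function names (`map_point`, `combine`, `reduce_cluster`, `kmeans_step`) and the toy data are assumptions made for illustration. Each mapper assigns points to the nearest centroid; the combiner collapses a mapper's output into one (coordinate sum, count) record per cluster, so fewer records reach the reducers.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))
    return nearest, (tuple(point), 1)


def combine(pairs):
    """Combiner: collapse one mapper's output into a single
    (coordinate sum, count) record per cluster."""
    partial = {}
    for key, (vec, n) in pairs:
        if key in partial:
            acc, cnt = partial[key]
            partial[key] = (tuple(a + v for a, v in zip(acc, vec)), cnt + n)
        else:
            partial[key] = (vec, n)
    return list(partial.items())


def reduce_cluster(records):
    """Reduce step: merge (sum, count) records into the new centroid."""
    dims = len(records[0][0])
    sums = [0.0] * dims
    count = 0
    for vec, n in records:
        for i in range(dims):
            sums[i] += vec[i]
        count += n
    return tuple(s / count for s in sums)


def kmeans_step(splits, centroids, use_combiner=True):
    """Run one iteration; return new centroids and the number of
    records that were passed from mappers to reducers."""
    shuffled = defaultdict(list)
    n_records = 0
    for split in splits:  # each split stands in for one mapper's input
        pairs = [map_point(p, centroids) for p in split]
        if use_combiner:
            pairs = combine(pairs)
        for key, value in pairs:
            shuffled[key].append(value)
            n_records += 1
    new_centroids = list(centroids)  # empty clusters keep their old centroid
    for key, records in shuffled.items():
        new_centroids[key] = reduce_cluster(records)
    return new_centroids, n_records


# Hypothetical toy data: two input splits, two initial centroids.
splits = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(2.0, 0.0), (2.0, 2.0), (10.0, 12.0)],
]
centroids = [(1.0, 1.0), (9.0, 9.0)]
with_combiner = kmeans_step(splits, centroids, use_combiner=True)
without_combiner = kmeans_step(splits, centroids, use_combiner=False)
```

Both calls return the same centroids; only the number of shuffled records differs. Because sums and counts can be merged in any order, the combiner does not change the result. With real data the saving grows with the number of points per split.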

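In this paper, to process and analyze large-scale data, we implement K-Means clustering on the RHadoop platform using both real and simulated data, and we compare processing speed with and without the MapReduce combiner. We also implement a method for determining the optimal number of clusters in K-Means clustering as a MapReduce program and apply it to real data. Finally, to demonstrate the scalability of the proposed RHadoop platform, we compare its processing speed on real data with that of the kmeans() function in R's base packages and the bigkmeans() function, which is available with the bigmemory package.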
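
As a rough illustration of the cluster-number selection described above, the following sketch computes the within-cluster sum of squares (WSS) for a range of k in a map/reduce style. It is an assumption-laden toy, not the authors' MapReduce program: the helper names (`nearest_sq_dist`, `wss_mapreduce`, `lloyd`, `elbow_curve`) are hypothetical, the centroids come from a plain in-memory K-Means, and only the WSS computation is split into a per-split partial sum followed by a final sum.

```python
import random
from math import dist


def nearest_sq_dist(point, centroids):
    """Map step: squared distance from a point to its nearest centroid."""
    return min(dist(point, c) ** 2 for c in centroids)


def wss_mapreduce(splits, centroids):
    """Each split emits one partial sum (map plus combine);
    the reducer adds the partial sums into the total WSS."""
    partials = [sum(nearest_sq_dist(p, centroids) for p in split) for split in splits]
    return sum(partials)


def lloyd(points, k, iters=20, seed=0):
    """Plain in-memory K-Means (Lloyd's algorithm), used here only
    to obtain centroids for each candidate k."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[j].append(p)
        centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def elbow_curve(splits, k_max, seed=0):
    """Return {k: WSS} for k = 1..k_max."""
    points = [p for split in splits for p in split]
    return {
        k: wss_mapreduce(splits, lloyd(points, k, seed=seed))
        for k in range(1, k_max + 1)
    }


# Hypothetical toy data: three well-separated groups in two splits.
splits = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    [(10.0, 11.0), (20.0, 0.0), (20.0, 1.0)],
]
curve = elbow_curve(splits, k_max=6)
```

Plotting `curve` against k and looking for the point where the decrease in WSS levels off gives the usual Elbow reading. How the bend is located (by eye or by a numeric rule) is left open here, since the abstract does not specify it.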



