Big data distributed processing system using RHadoop

  • Shin, Ji Eun (Department of Information Statistics, Gyeongsang National University) ;
  • Jung, Byung Ho (Department of Information Statistics, Gyeongsang National University) ;
  • Lim, Dong Hoon (Department of Information Statistics, Gyeongsang National University)
  • Received : 2015.07.02
  • Accepted : 2015.09.22
  • Published : 2015.09.30

Abstract

Traditional technologies can hardly store or analyze big data that grows exponentially; Hadoop is a technology that makes this possible. Recently, R has been used as an engine for big data analysis based on distributed processing with Hadoop. Using RHadoop, which integrates the R and Hadoop environments, we implemented parallel multiple regression analysis on real and simulated data of various sizes. To evaluate the proposed RHadoop platform, we compared its processing speed with that of the lm function in base R and of the biglm package available with bigmemory. The experiments showed that RHadoop ran faster as the number of data nodes increased, and that for large data it was faster than the other packages because parallel processing raises the number of map tasks as the data size grows.
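The abstract does not spell out how the regression is parallelized. A common way to fit a multiple regression in a map/reduce style, assumed here purely for illustration, is to have each map task compute the partial sums X'X and X'y on its own chunk of rows, let the reduce step add those partial sums, and then solve the normal equations once. The sketch below shows that idea in plain Python rather than in R with RHadoop; the names `map_chunk`, `reduce_sums`, `solve`, and `parallel_lm` are hypothetical and are not taken from the paper or from any RHadoop package.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: partial sums (X'X, X'y) for one non-empty chunk.

    Each row is (x_1, ..., x_k, y); an intercept column of 1s is prepended.
    """
    p = len(rows[0])  # k predictors + 1 intercept (y is split off below)
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for *xs, y in rows:
        x = [1.0, *xs]
        for i in range(p):
            xty[i] += x[i] * y
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def reduce_sums(a, b):
    """Reduce step: element-wise sum of two (X'X, X'y) pairs."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def parallel_lm(chunks):
    """Fit least-squares coefficients from chunked data (intercept first)."""
    xtx, xty = reduce(reduce_sums, map(map_chunk, chunks))
    return solve(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2, split into 3 chunks.
chunks = [
    [(0, 0, 1), (1, 0, 3)],
    [(0, 1, 4), (1, 1, 6)],
    [(2, 1, 8)],
]
beta = parallel_lm(chunks)
print([round(b, 6) for b in beta])
```

On this toy data the fitted coefficients come out close to 1, 2, and 3, and they do not depend on how the rows are split across chunks. That independence is what lets the map tasks run on separate data nodes: only the small p-by-p sums travel to the reducer, not the raw rows.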
