Comparison analysis of big data integration models

  • Jung, Byung Ho (Gyeongsangnamdo Provincial Government)
  • Lim, Dong Hoon (Department of Information and Statistics, Gyeongsang National University)
  • Received : 2017.06.13
  • Accepted : 2017.07.13
  • Published : 2017.07.31

Abstract

As big data becomes central to the fourth industrial revolution, the ability to process and analyze it is expected to shape companies' future competitiveness. RHadoop and RHIPE, which integrate R with the Hadoop environment, have each been studied extensively, but comparative studies of the two are rare. In this paper, we built RHadoop and RHIPE platforms for large-scale data on a distributed cluster and implemented machine learning algorithms for multiple regression and logistic regression estimation as MapReduce programs. We then compared the performance and scalability of the two implementations across various sample sizes of real and simulated data. The experiments showed that both RHadoop and RHIPE scale well and process large data sets efficiently on commodity hardware. RHIPE was generally faster than RHadoop, but it was harder to install and use.
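The idea of estimating multiple regression under MapReduce can be illustrated with a minimal sketch. It is not the authors' RHadoop or RHIPE code: it is plain Python with no Hadoop involved, and the function names (map_chunk, reduce_pair, solve, fit) and the toy data are assumptions made for illustration. It assumes the common decomposition in which each map task computes the partial sums X'X and X'y for its chunk, the reduce step adds them, and the normal equations are solved once at the end.

```python
from functools import reduce


def map_chunk(rows):
    """Map step: partial sums (X'X, X'y) for one chunk of (x, y) rows."""
    p = len(rows[0][0]) + 1  # +1 for the intercept column
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for x, y in rows:
        xi = [1.0] + list(x)
        for a in range(p):
            xty[a] += xi[a] * y
            for b in range(p):
                xtx[a][b] += xi[a] * xi[b]
    return xtx, xty


def reduce_pair(left, right):
    """Reduce step: add two partial (X'X, X'y) results element-wise."""
    xtx = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(left[0], right[0])]
    xty = [u + v for u, v in zip(left[1], right[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit(chunks):
    """Combine the per-chunk sums, then solve the normal equations once."""
    xtx, xty = reduce(reduce_pair, map(map_chunk, chunks))
    return solve(xtx, xty)


# Hypothetical toy data from y = 1 + 2*x1 + 3*x2, split into two chunks.
data = [((x1, x2), 1 + 2 * x1 + 3 * x2) for x1 in range(4) for x2 in range(3)]
chunks = [data[:5], data[5:]]
beta = fit(chunks)
```

On this noise-free toy data, beta recovers the coefficients 1, 2 and 3 up to floating-point rounding, however the rows are split into chunks. Only the small p-by-p sums leave each chunk, which is why this kind of estimator suits a distributed cluster. Logistic regression has no closed form, so a comparable sketch would repeat a map and reduce pass on each iteration to sum gradient terms.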
