Learning algorithms for big data logistic regression on RHIPE platform

  • Jung, Byung Ho (Department of Information and Statistics, Gyeongsang National University)
  • Lim, Dong Hoon (Department of Information and Statistics, Gyeongsang National University)
  • Received : 2016.05.03
  • Reviewed : 2016.07.22
  • Published : 2016.07.31

Abstract

Machine learning is becoming increasingly important in the big data era. Logistic regression is a classification method in machine learning and is widely used in fields including medicine, economics, marketing, and the social sciences. RHIPE, a platform that integrates R and Hadoop, has so far received little research attention because it is difficult to install and MapReduce programs are difficult to implement on it. In this paper, we present MapReduce implementations on RHIPE of two algorithms for estimating logistic regression on large data sets, the Gradient Descent algorithm and the Newton-Raphson algorithm, and compare their performance on real and simulated data. The Newton-Raphson algorithm does not require a learning rate, whereas the Gradient Descent algorithm requires one to be chosen manually. We choose the learning rate with a procedure that mixes grid search and binary search so that big data can be processed efficiently. In the performance experiments, the Gradient Descent algorithm depended heavily on the learning rate and failed to converge on some data, while the Newton-Raphson algorithm outperformed it on all the tested data.
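The paper's RHIPE code is not reproduced on this page. As a rough illustration of the two estimation algorithms named in the abstract, the sketch below shows how one iteration of each might be split into a map step and a reduce step. It is written in plain Python rather than R/RHIPE, and it is not the authors' implementation: the function names (`map_chunk`, `reduce_parts`, `solve`, `fit`), the toy data, the learning rate 0.05, and the iteration counts are all assumptions made for this example. Each chunk stands in for one block of distributed data. The map step computes that chunk's contribution to the gradient and Hessian of the negative log-likelihood, and the reduce step sums the contributions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def map_chunk(chunk, beta):
    """Map step: partial gradient and Hessian for one chunk of (x, y) rows."""
    p = len(beta)
    grad = [0.0] * p
    hess = [[0.0] * p for _ in range(p)]
    for x, y in chunk:
        mu = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        w = mu * (1.0 - mu)
        for i in range(p):
            grad[i] += (mu - y) * x[i]
            for j in range(p):
                hess[i][j] += w * x[i] * x[j]
    return grad, hess

def reduce_parts(parts):
    """Reduce step: element-wise sum of the per-chunk results."""
    p = len(parts[0][0])
    grad = [sum(g[i] for g, _ in parts) for i in range(p)]
    hess = [[sum(h[i][j] for _, h in parts) for j in range(p)] for i in range(p)]
    return grad, hess

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit(chunks, p, method, lr=None, iters=25):
    """Iterate map/reduce passes; 'newton' needs no learning rate, 'gd' does."""
    beta = [0.0] * p
    for _ in range(iters):
        grad, hess = reduce_parts([map_chunk(c, beta) for c in chunks])
        if method == "newton":
            step = solve(hess, grad)          # Newton-Raphson: H^{-1} g
        else:
            step = [lr * g for g in grad]     # Gradient Descent: lr * g
        beta = [b - s for b, s in zip(beta, step)]
    return beta

# Toy, non-separable data: each row is ([intercept, feature], label).
data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 2.0], 1),
        ([1.0, 3.0], 0), ([1.0, 4.0], 1), ([1.0, 5.0], 1)]
chunks = [data[:3], data[3:]]                 # two simulated data blocks

beta_nr = fit(chunks, 2, "newton")
beta_gd = fit(chunks, 2, "gd", lr=0.05, iters=5000)
print("Newton-Raphson:", beta_nr)
print("Gradient Descent:", beta_gd)
```

In this toy setting both methods reach essentially the same estimate, but Gradient Descent needs a hand-picked learning rate and far more passes over the data, which is the trade-off the abstract describes. The sketch does not attempt the paper's mixed grid search and binary search for the learning rate, since that procedure is not detailed here.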
