Comparison of Scala and R for Machine Learning in Spark

  • Woo-Seok Ryu (Dept. of Health Care Management, Catholic University of Pusan)
  • Received : 2022.12.11
  • Accepted : 2023.02.17
  • Published : 2023.02.28

Abstract

Data analysis methodology in the healthcare field is shifting from traditional statistics-oriented research to predictive research using machine learning. In this study, we survey various machine learning tools and compare several programming models that combine R and Spark in order to apply R, a statistical tool widely used in healthcare, to big data machine learning. In addition, we compare the performance of training a linear regression model with SparkR, which runs R in the Spark environment, against a model written in Scala, the native language of Spark. In the experiment, the training execution time with SparkR was about 10 to 20% longer than with Scala. Given this level of performance degradation, we confirm the usefulness of SparkR's distributed processing, in that R, an established statistical analysis tool, can be used as is.

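The Scala side of such a comparison can be sketched as follows. This is a minimal illustration, not the author's experimental code: the synthetic data, the column names (`x1`, `x2`, `label`), the local master setting, and the iteration count are all assumptions made for the example. It trains a linear regression model with Spark MLlib and measures only the time spent in `fit`.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object LinearRegressionTiming {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a cluster run would set the master elsewhere.
    val spark = SparkSession.builder()
      .appName("LinearRegressionTiming")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical synthetic data following y = 3*x1 + 2*x2 + 1.
    val raw = (1 to 1000).map { i =>
      val x1 = i.toDouble
      val x2 = (i % 7).toDouble
      (x1, x2, 3.0 * x1 + 2.0 * x2 + 1.0)
    }.toDF("x1", "x2", "label")

    // MLlib estimators expect the predictors packed into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
    val train = assembler.transform(raw).cache()
    train.count() // materialize the cache so data preparation is not timed

    val lr = new LinearRegression().setMaxIter(10)

    // Time the training step only.
    val start = System.nanoTime()
    val model = lr.fit(train)
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(f"Training time: $elapsedMs%.1f ms")
    println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")

    spark.stop()
  }
}
```

One plausible SparkR counterpart would fit the same model with `spark.glm` using a Gaussian family, timed in the same way; the page does not state which SparkR function the study used. Caching and materializing the training data before starting the timer keeps data loading out of the measurement, so the two languages are compared on the training step alone.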
Acknowledgement

This work was supported by a 2020 intramural research grant from Catholic University of Pusan.
