Performance Factor of Distributed Processing of Machine Learning using Spark

Ryu, Woo-Seok;

doi:10.13067/JKIECS.2021.16.1.19

The Journal of the Korea Institute of Electronic Communication Sciences

Vol. 16, No. 1 / pp. 19-24 / 2021 / pISSN 1975-8170

Korea Institute of Electronic Communication Sciences

Ryu, Woo-Seok (Dept. of Health Care Management, Catholic University of Pusan)

Received: 2020.11.23
Reviewed: 2021.02.17
Published: 2021.02.28

Abstract

This paper analyzes the performance factors of distributed machine learning with Apache Spark and, through experiments, identifies an execution environment for efficient distributed processing. First, the performance factors to consider when running machine learning on a distributed cluster are classified into cluster performance, data size, and the configuration of the Spark engine. Next, the performance of regression analysis using Spark MLlib on a Hadoop cluster is measured while varying the node configuration and the Spark Executor settings. The experiments show that the optimal number of executors depends on the number of data blocks, but that, depending on the cluster size, its maximum and minimum are bounded by the number of cores and the number of worker nodes, respectively.
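The bound described in the abstract can be illustrated with a small calculation. The following is a minimal sketch, not the paper's procedure: it assumes a simple rule that starts from one executor per data block and clamps the count between the number of worker nodes and the total number of cores. The function names, the 128 MB block size (the HDFS default), and the cluster figures in the usage loop are all hypothetical choices made for illustration; only `spark.executor.instances` and `spark.executor.cores` are actual Spark properties.

```python
# Illustrative sketch only: the clamp rule and every name below are
# assumptions for this example, not the method used in the paper.

HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # assumed block size (HDFS default of 128 MB)


def num_blocks(data_bytes, block_bytes=HDFS_BLOCK_BYTES):
    """Number of blocks the input occupies, rounded up, at least 1."""
    return max(1, (data_bytes + block_bytes - 1) // block_bytes)


def candidate_executors(data_bytes, worker_nodes, total_cores):
    """Start from one executor per data block, then clamp the count
    to the range [worker_nodes, total_cores]."""
    if not 1 <= worker_nodes <= total_cores:
        raise ValueError("expected 1 <= worker_nodes <= total_cores")
    return max(worker_nodes, min(num_blocks(data_bytes), total_cores))


def executor_settings(data_bytes, worker_nodes, total_cores):
    """Map the candidate count onto two standard Spark properties."""
    executors = candidate_executors(data_bytes, worker_nodes, total_cores)
    return {
        "spark.executor.instances": str(executors),
        # Spread the available cores evenly, keeping at least one per executor.
        "spark.executor.cores": str(max(1, total_cores // executors)),
    }


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical cluster: 3 worker nodes with 12 cores in total.
    for size in (100 * 1024 ** 2, 1 * GIB, 10 * GIB):
        print(size, executor_settings(size, worker_nodes=3, total_cores=12))
```

A value produced this way is only a starting point for tuning; the abstract reports measured behavior on one Hadoop cluster, so the best setting for another cluster would still need to be confirmed by measurement.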
