Design and Implementation of Distributed In-Memory DBMS-based Parallel K-Means as In-database Analytics Function

Kou, Heymo; Nam, Changmin; Lee, Woohyun; Lee, Yongjae; Kim, HyoungJoo

doi:10.5626/KTCP.2018.24.3.105

KIISE Transactions on Computing Practices (정보과학회 컴퓨팅의 실제 논문지)

Volume 24, Issue 3, Pages 105-112, 2018. pISSN 2383-6318, eISSN 2383-6326

Korean Institute of Information Scientists and Engineers (한국정보과학회)

분산 인 메모리 DBMS 기반 병렬 K-Means의 In-database 분석 함수로의 설계와 구현

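Kou, Heymo (Department of Computer Science and Engineering, Seoul National University);
Nam, Changmin (TmaxData, F2 Team);
Lee, Woohyun (TmaxData, F2 Team);
Lee, Yongjae (TmaxData, Data1 Research Lab);
Kim, HyoungJoo (Department of Computer Science and Engineering, Seoul National University)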

Received : 2017.05.25
Accepted : 2017.12.17
Published : 2018.03.15

https://doi.org/10.5626/KTCP.2018.24.3.105

Abstract

As data size increases, a single database is no longer enough to handle the current volume of tasks. Because data is partitioned and stored across multiple databases, analysis must also run in parallel to remain efficient. However, traditional analysis requires data to be transferred out of the database to the nodes where the analytic service runs, and the user must know both the database and the analytic framework. In this paper, we propose an efficient way to perform the K-means clustering algorithm inside a distributed column-based database and a relational database. We also suggest an efficient way to optimize the K-means algorithm within a relational database.

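As the volume of data grows, a single-node database is no longer sufficient to handle storage and processing at the same time. Data is therefore distributed and stored in distributed databases made up of multiple nodes, and analysis must likewise offer parallel execution for efficiency. The traditional approach moves data from the database to analysis nodes before analyzing it, which incurs network cost, and it also requires the user to be able to work with a separate analysis framework. This study proposes the design and implementation of the K-Means clustering algorithm as an In-database analytics function written in SQL, in a distributed database environment built on a relational database and a column-based database, together with a performance optimization method for the relational database.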
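To make the idea of running K-means inside the database concrete, the following is a minimal single-node sketch in which both the assignment step and the centroid update step are expressed as SQL statements. It is not the authors' implementation: it uses Python's built-in `sqlite3` module as a stand-in for the distributed in-memory DBMS, and the function name `kmeans_sql`, the table names (`points`, `centroids`, `distances`, `assignment`), and the two-dimensional schema are all assumptions made for illustration.

```python
import sqlite3


def kmeans_sql(points, init_centroids, iterations=10):
    """Illustrative K-means whose steps run as SQL (assumes iterations >= 1)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, x REAL, y REAL)")
    con.execute("CREATE TABLE centroids (cid INTEGER PRIMARY KEY, x REAL, y REAL)")
    con.executemany("INSERT INTO points (x, y) VALUES (?, ?)", points)
    con.executemany(
        "INSERT INTO centroids (cid, x, y) VALUES (?, ?, ?)",
        [(i, x, y) for i, (x, y) in enumerate(init_centroids)],
    )

    for _ in range(iterations):
        # Squared Euclidean distance from every point to every centroid.
        con.execute("DROP TABLE IF EXISTS distances")
        con.execute("""
            CREATE TABLE distances AS
            SELECT p.id AS id, c.cid AS cid,
                   (p.x - c.x) * (p.x - c.x) + (p.y - c.y) * (p.y - c.y) AS dist2
            FROM points p CROSS JOIN centroids c
        """)

        # Assignment step: keep the nearest centroid per point
        # (ties are broken by the smallest centroid id).
        con.execute("DROP TABLE IF EXISTS assignment")
        con.execute("""
            CREATE TABLE assignment AS
            SELECT d.id AS id, MIN(d.cid) AS cid
            FROM distances d
            JOIN (SELECT id, MIN(dist2) AS best FROM distances GROUP BY id) m
              ON m.id = d.id AND m.best = d.dist2
            GROUP BY d.id
        """)

        # Update step: move each non-empty cluster's centroid to the mean
        # of its assigned points; empty clusters keep their old position.
        con.execute("""
            UPDATE centroids
            SET x = (SELECT AVG(p.x) FROM points p
                     JOIN assignment a ON a.id = p.id
                     WHERE a.cid = centroids.cid),
                y = (SELECT AVG(p.y) FROM points p
                     JOIN assignment a ON a.id = p.id
                     WHERE a.cid = centroids.cid)
            WHERE cid IN (SELECT cid FROM assignment)
        """)

    centroids = con.execute("SELECT x, y FROM centroids ORDER BY cid").fetchall()
    assignment = dict(con.execute("SELECT id, cid FROM assignment ORDER BY id").fetchall())
    con.close()
    return centroids, assignment


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    final_centroids, labels = kmeans_sql(data, [(0.0, 0.0), (10.0, 10.0)], iterations=5)
    print(final_centroids)
    print(labels)
```

Because the data never leaves the database connection, only the final centroids and labels cross into the client. In a distributed setting, the per-cluster averages in the update step would presumably be computed as partial sums and counts on each node and then combined, but that part is outside this sketch.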

Acknowledgement

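Grant: Development of integrated data engineering technology combining large-scale transaction processing and real-time complex analytics

Supported by: Institute for Information & communications Technology Promotion (IITP)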

