Design and Implementation of Distributed In-Memory DBMS-based Parallel K-Means as In-database Analytics Function

  • Received : 2017.05.25
  • Accepted : 2017.12.17
  • Published : 2018.03.15

Abstract

As data volumes grow, a single-node database can no longer handle both storage and processing of the current workload. Data is therefore partitioned and stored across the multiple nodes of a distributed database, and analysis must likewise run in parallel to remain efficient. Traditional analysis, however, transfers data out of the database to the nodes where the analytic service runs, which incurs network cost and requires the user to know both the database and a separate analytic framework. In this paper, we propose the design and implementation of the K-means clustering algorithm as an in-database analytics function, written in SQL, for a distributed database environment built on a relational database and a column-based database. We also suggest a way to optimize the performance of K-means within the relational database.
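
To make the idea of in-database K-means concrete, the following is a minimal sketch in which both the assignment step and the centroid update step are plain SQL statements, so the points never leave the database. It is not the implementation described in this paper: SQLite (through Python's standard `sqlite3` module) stands in for the distributed in-memory DBMS, everything runs on a single node without partitioning or parallelism, and the table, view, and function names (`points`, `centroids`, `assignment`, `dist`, `kmeans_sql`) are hypothetical names chosen for illustration.

```python
import sqlite3


def kmeans_sql(points, init_centroids, max_iter=20):
    """Run K-means on 2-D points with each step expressed as SQL.

    Illustrative sketch only: single-node SQLite, hypothetical schema.
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE points     (pid INTEGER PRIMARY KEY, x REAL, y REAL);
        CREATE TABLE centroids  (cid INTEGER PRIMARY KEY, x REAL, y REAL);
        CREATE TABLE assignment (pid INTEGER PRIMARY KEY, cid INTEGER);
        -- Squared Euclidean distance from every point to every centroid.
        CREATE VIEW dist AS
            SELECT p.pid AS pid, c.cid AS cid,
                   (p.x - c.x) * (p.x - c.x)
                 + (p.y - c.y) * (p.y - c.y) AS d2
            FROM points p CROSS JOIN centroids c;
    """)
    con.executemany("INSERT INTO points (x, y) VALUES (?, ?)", points)
    con.executemany("INSERT INTO centroids (x, y) VALUES (?, ?)",
                    init_centroids)

    # Assignment step: keep, for each point, the centroid at minimum
    # distance; ties are broken by the lowest centroid id.
    assign_sql = """
        INSERT INTO assignment (pid, cid)
        SELECT d.pid, MIN(d.cid)
        FROM dist d
        JOIN (SELECT pid, MIN(d2) AS best FROM dist GROUP BY pid) m
          ON m.pid = d.pid AND m.best = d.d2
        GROUP BY d.pid
    """
    # Update step: move each centroid to the mean of its points.
    # COALESCE keeps the old position if a cluster received no points.
    update_sql = """
        UPDATE centroids SET
            x = COALESCE((SELECT AVG(p.x) FROM points p
                          JOIN assignment a ON a.pid = p.pid
                          WHERE a.cid = centroids.cid), x),
            y = COALESCE((SELECT AVG(p.y) FROM points p
                          JOIN assignment a ON a.pid = p.pid
                          WHERE a.cid = centroids.cid), y)
    """
    read_sql = "SELECT x, y FROM centroids ORDER BY cid"

    for _ in range(max_iter):
        before = con.execute(read_sql).fetchall()
        con.execute("DELETE FROM assignment")
        con.execute(assign_sql)
        con.execute(update_sql)
        # Stop once an iteration leaves every centroid unchanged.
        if con.execute(read_sql).fetchall() == before:
            break

    labels = [cid for (cid,) in
              con.execute("SELECT cid FROM assignment ORDER BY pid")]
    centroids = con.execute(read_sql).fetchall()
    con.close()
    return centroids, labels
```

For example, calling `kmeans_sql` with the eight points (0,0), (0,1), (1,0), (1,1), (10,10), (10,11), (11,10), (11,11) and the first two points as initial centroids converges to the centroids (0.5, 0.5) and (10.5, 10.5). Only the small `centroids` table is read back by the client in each iteration, which is the property that makes the in-database approach attractive when the point table is large.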

Acknowledgement

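Grant : Development of unified data engineering technology integrating large-scale transaction processing and real-time complex analytics

Supported by : Institute for Information & communications Technology Promotion (IITP)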
