Study of Efficient Algorithm for Deduplication of Complex Structure

  • Lee, Hyeopgeon (Dept. of Data Analysis, Seoul Ganseo Campus of Korea Polytechnics)
  • Kim, Young-Woon (Dept. of Software Engineering, Seoil University)
  • Kim, Ki-Young (Dept. of Software Engineering, Seoil University)
  • Received: 2021.01.26
  • Accepted: 2021.02.07
  • Published: 2021.02.28

Abstract

The volume of generated data is growing exponentially, and advances in information technology (IT) are making data structures increasingly complex. Big data analysts and engineers are therefore actively researching ways to reduce the volume of data to be analyzed so that big data can be processed and analyzed faster. Hadoop, a widely used big data platform, provides various processing and analysis functions, including reduction of the analysis targets through Hive, a Hadoop subproject. However, Hive consumes a vast amount of memory during data deduplication because it was implemented without considering data complexity. This paper therefore proposes an efficient algorithm for deduplicating data with complex structures. In the performance evaluation, the proposed algorithm reduced memory usage by approximately 79% and deduplication time by approximately 0.677% compared with Hive. Future work should evaluate performance on a large number of data nodes to verify the proposed algorithm under realistic conditions.
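The abstract does not describe the internals of the proposed algorithm, so the following is only an illustrative sketch of the general problem it addresses: deduplicating records with nested structure while keeping a small, fixed-size fingerprint per record in memory instead of the full record. It is not the authors' method and not how Hive works. The function names `fingerprint` and `deduplicate`, the use of canonical JSON, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
from typing import Any, Iterable, Iterator


def fingerprint(record: Any) -> bytes:
    """Return a fixed-size digest of a nested record (hypothetical helper).

    The record is serialized to canonical JSON with sorted keys, so two
    dictionaries with the same content but different key order produce
    the same digest. List order is preserved and therefore significant.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).digest()


def deduplicate(records: Iterable[Any]) -> Iterator[Any]:
    """Yield each distinct record once, in first-seen order.

    Only 32-byte digests are retained, so memory grows with the number
    of distinct records rather than with their size or nesting depth.
    """
    seen = set()
    for record in records:
        digest = fingerprint(record)
        if digest not in seen:
            seen.add(digest)
            yield record


if __name__ == "__main__":
    rows = [
        {"id": 1, "tags": ["a", "b"], "meta": {"x": 1, "y": 2}},
        {"meta": {"y": 2, "x": 1}, "tags": ["a", "b"], "id": 1},  # same content
        {"id": 2, "tags": ["b", "a"], "meta": {"x": 1, "y": 2}},
    ]
    for row in deduplicate(rows):
        print(row)
```

In this sketch the second row is dropped because it differs from the first only in key order. Relying on a digest trades a negligible collision probability for not holding full nested records in memory, which is one plausible way to approach the memory concern the abstract raises.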
