Compression Conversion and Storing of Large RDF datasets based on MapReduce

  • Kim, InA (Department of Computer Engineering, Chungnam National University) ;
  • Lee, Kyong-Ha (Korea Institute of Science and Technology Information) ;
  • Lee, Kyu-Chul (Department of Computer Engineering, Chungnam National University)
  • Received : 2022.03.07
  • Accepted : 2022.03.13
  • Published : 2022.04.30

Abstract

As demand for data-driven analysis has grown, knowledge graphs, the data being analyzed, have steadily increased in size; a knowledge graph extracted from web data now reaches about 82 billion edges. Many knowledge graphs are represented in the Resource Description Framework (RDF), a W3C standard for describing metadata about web resources. Because of the characteristics of RDF, existing RDF stores incur processing-time overhead when converting and storing large volumes of RDF data. To address this limitation, we propose a method that uses MapReduce to compress and convert large volumes of RDF data into integer IDs, and then vertically partitions and stores them. The proposed method achieved a performance improvement of up to 25.2 times over RDF-3X and up to 3.7 times over H2RDF+.
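The two ideas named in the abstract, replacing RDF terms with integer IDs and vertically partitioning the encoded triples, can be illustrated with a small single-process sketch. This is not the authors' implementation: the function names, the simplified N-Triples parsing, the sorted-order ID assignment, and the `ex.org` sample data are all assumptions made for illustration, and the map and reduce phases are only imitated in memory rather than run on a cluster.

```python
from collections import defaultdict


def parse_ntriple(line):
    """Split a simplified N-Triples line into (subject, predicate, object)."""
    s, p, o = line.rstrip(" .\n").split(" ", 2)
    return s, p, o


def build_dictionary(triples):
    """Imitate a map/reduce pass: emit every term, deduplicate, assign IDs.

    Assumption: IDs are assigned in sorted order starting at 1. A real
    distributed job would need a scheme for globally unique IDs.
    """
    terms = sorted({term for triple in triples for term in triple})
    return {term: i for i, term in enumerate(terms, start=1)}


def encode(triples, term_to_id):
    """Replace each term of each triple with its integer ID."""
    return [tuple(term_to_id[t] for t in triple) for triple in triples]


def vertical_partition(encoded):
    """Group encoded triples by predicate ID into (subject, object) tables."""
    tables = defaultdict(list)
    for s, p, o in encoded:
        tables[p].append((s, o))
    return dict(tables)


# Hypothetical sample data for illustration only.
lines = [
    '<http://ex.org/alice> <http://ex.org/knows> <http://ex.org/bob> .',
    '<http://ex.org/bob> <http://ex.org/knows> <http://ex.org/carol> .',
    '<http://ex.org/alice> <http://ex.org/name> "Alice" .',
]

triples = [parse_ntriple(line) for line in lines]
term_to_id = build_dictionary(triples)
encoded = encode(triples, term_to_id)
tables = vertical_partition(encoded)
```

Each entry of `tables` holds the (subject, object) pairs for one predicate, so a query that touches a single predicate only needs to read that table, and every stored value is a compact integer rather than a long IRI or literal.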

Acknowledgement

This work was supported in part by KISTI (K-21-L04-C03-S04) and by a National Research Council of Science & Technology (NST) grant funded by the Korea government (MSIT) (1711101951).
