An Optimized Iterative Semantic Compression Algorithm And Parallel Processing for Large Scale Data

  • Jin, Ran (School of Electronic and Computer, Zhejiang Wanli University) ;
  • Chen, Gang (College of Computer Science and Technology, Zhejiang University) ;
  • Tung, Anthony K.H. (School of Computing, National University of Singapore) ;
  • Shou, Lidan (College of Computer Science and Technology, Zhejiang University) ;
  • Ooi, Beng Chin (School of Computing, National University of Singapore)
  • Received : 2017.07.25
  • Accepted : 2018.02.10
  • Published : 2018.06.30

Abstract

As data volumes continue to grow and compression technology becomes more widely used, data reduction has both research value and practical significance. To address the shortcomings of existing semantic compression algorithms, this paper analyzes the ItCompress algorithm and designs a bidirectional order selection method based on interval partitioning, named the Optimized Iterative Semantic Compression Algorithm (Optimized ItCompress Algorithm). To further improve the speed of the algorithm, we propose a parallel optimization iterative semantic compression algorithm using GPU (POICAG) and an optimized iterative semantic compression algorithm using Spark (DOICAS). Extensive experiments on four kinds of datasets verify the efficiency of the proposed algorithms.
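The abstract does not spell out the encoding, so the following is only a minimal sketch of the representative-row idea commonly associated with ItCompress-style semantic compression, not the authors' optimized method. It assumes numeric attributes, a per-attribute error tolerance, and a fixed set of representative rows supplied by the caller; the function names `matches`, `encode`, and `decode` are hypothetical. The iterative refinement of representatives, the bidirectional order selection, and the GPU and Spark variants are not shown.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]
# (representative id, per-attribute match bitmap, values stored explicitly)
Encoded = Tuple[int, List[bool], List[float]]


def matches(row: Row, rep: Row, tol: Sequence[float]) -> List[bool]:
    """Flag, per attribute, whether row is within tolerance of the representative."""
    return [abs(a - b) <= t for a, b, t in zip(row, rep, tol)]


def encode(rows: Sequence[Row], reps: Sequence[Row], tol: Sequence[float]) -> List[Encoded]:
    """Assign each row to the representative that covers the most attributes."""
    encoded = []
    for row in rows:
        best_id = max(range(len(reps)), key=lambda i: sum(matches(row, reps[i], tol)))
        bitmap = matches(row, reps[best_id], tol)
        # Only attributes outside the tolerance are stored explicitly.
        outliers = [v for v, ok in zip(row, bitmap) if not ok]
        encoded.append((best_id, bitmap, outliers))
    return encoded


def decode(encoded: Sequence[Encoded], reps: Sequence[Row]) -> List[List[float]]:
    """Rebuild rows: matched attributes take the representative's value (lossy)."""
    rows = []
    for rep_id, bitmap, outliers in encoded:
        stored = iter(outliers)
        rows.append([reps[rep_id][j] if ok else next(stored) for j, ok in enumerate(bitmap)])
    return rows


# Illustrative data only (assumed, not from the paper).
reps = [[20.0, 100.0, 5.0], [50.0, 300.0, 9.0]]
tol = [2.0, 10.0, 0.0]
rows = [[21.0, 95.0, 5.0], [49.0, 250.0, 9.0], [20.0, 100.0, 7.0]]
compressed = encode(rows, reps, tol)
restored = decode(compressed, reps)
```

In this sketch a row that agrees with its representative on every attribute costs only an identifier and a bitmap, which is where the space saving comes from; the reconstruction is lossy within the stated tolerances.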

Acknowledgement

Supported by: National Natural Science Foundation of China, Ministry of Education of China, Ningbo Natural Science Foundation
