An Optimized Iterative Semantic Compression Algorithm And Parallel Processing for Large Scale Data

  • Jin, Ran (School of Electronic and Computer, Zhejiang Wanli University)
  • Chen, Gang (College of Computer Science and Technology, Zhejiang University)
  • Tung, Anthony K.H. (School of Computing, National University of Singapore)
  • Shou, Lidan (College of Computer Science and Technology, Zhejiang University)
  • Ooi, Beng Chin (School of Computing, National University of Singapore)
  • Received : 2017.07.25
  • Accepted : 2018.02.10
  • Published : 2018.06.30


With the continuous growth of data size and the widespread use of compression technology, data reduction has great research value and practical significance. To address the shortcomings of existing semantic compression algorithms, this paper analyzes the ItCompress algorithm and designs a bidirectional order selection method based on interval partitioning, named the Optimized Iterative Semantic Compression Algorithm (Optimized ItCompress Algorithm). To further improve the speed of the algorithm, we propose a parallel optimized iterative semantic compression algorithm using GPU (POICAG) and an optimized iterative semantic compression algorithm using Spark (DOICAS). Extensive experiments on four kinds of datasets verify the efficiency of the proposed algorithms.


Supported by : National Natural Science Foundation of China, Ministry of Education of China, Ningbo Natural Science Foundation

