An Adaptive Grid-based Clustering Algorithm over Multi-dimensional Data Streams

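  • Nam Hun Park (Department of Computer Science, Graduate School, Yonsei University) ;
  • Won Suk Lee (Department of Computer Science, Yonsei University)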
  • Published : 2007.12.31

Abstract

A data stream is a massive, unbounded sequence of data elements generated continuously at a rapid rate. For this reason, the memory used for data stream analysis must remain bounded even though new data elements keep arriving. To satisfy this requirement, data stream processing trades some accuracy for bounded memory by allowing a limited error in its analysis result. The old distribution statistics are diminished by a predefined decay rate as time goes by, so that the effect of obsolete information on the current clustering result is eliminated without physically maintaining any data element. This paper proposes a grid-based clustering algorithm for a data stream. Given a set of initial grid cells, the dense range of a grid cell is recursively partitioned into smaller cells in a top-down manner, based on the distribution statistics of data elements, until the smallest cell, called a unit cell, is identified. Since the dynamically partitioned grid cells maintain only the distribution statistics of data elements, the clusters of a data stream can be found effectively without storing the data elements themselves. Furthermore, the memory usage of the proposed algorithm adapts to the size of the confined memory space by flexibly resizing the unit cell. As a result, the confined memory space can be fully utilized to generate a clustering result that is as accurate as possible. The characteristics of the proposed algorithm are analyzed through a series of experiments.

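A data stream is defined as a massive, unbounded set of data generated continuously at a rapid rate. Because the available memory is finite while the data stream is unbounded, a bounded error in accuracy is tolerated to the extent needed to satisfy this constraint. In addition, to find the most recent clusters in a changing data stream, the weight of old information in the stream must be reduced without storing the data objects. This study presents a grid-based clustering method for data stream analysis. Starting from a given set of initial grid cells, the ranges in which data objects occur frequently are repeatedly partitioned into smaller grid cells until cells of the minimum size, the unit cells, are produced. Each grid cell stores only statistics on the distribution of its data objects, so clusters can be found efficiently without scanning the data objects, unlike conventional clustering methods. Furthermore, the size of the unit cell is adjusted according to the available memory so as to maximize clustering accuracy, which allows performance to adapt to the given memory space.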
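The two core ideas of the abstract, decayed statistics and top-down partitioning of dense cells down to a unit cell, can be illustrated with a minimal one-dimensional sketch. It is not the authors' algorithm: the names `Cell` and `DecayedGrid`, the midpoint bisection rule, the absolute `split_count` threshold, and the even division of a parent's count between its children are all assumptions made for illustration. The paper partitions on distribution statistics and handles multiple dimensions, and the memory-adaptive resizing of the unit cell is omitted here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    """A 1-D grid cell that keeps only a decayed count, never the data elements."""
    lo: float
    hi: float
    count: float = 0.0
    last_update: int = 0
    children: Optional[List["Cell"]] = None


class DecayedGrid:
    """Hypothetical sketch: a decayed-count grid with top-down bisection."""

    def __init__(self, lo: float, hi: float, unit_width: float,
                 split_count: float, decay: float = 0.99) -> None:
        self.root = Cell(lo, hi)
        self.unit_width = unit_width    # width of the smallest (unit) cell
        self.split_count = split_count  # assumed density threshold for splitting
        self.decay = decay              # per-element decay rate in (0, 1]
        self.t = 0                      # number of elements seen so far

    def decayed_count(self, cell: Cell) -> float:
        # Decay is applied lazily: only when a cell is read or updated.
        return cell.count * self.decay ** (self.t - cell.last_update)

    def insert(self, x: float) -> None:
        if not (self.root.lo <= x < self.root.hi):
            raise ValueError("x lies outside the grid range")
        self.t += 1
        cell = self.root
        while cell.children is not None:  # descend to the leaf covering x
            left, right = cell.children
            cell = left if x < left.hi else right
        cell.count = self.decayed_count(cell) + 1.0
        cell.last_update = self.t
        # Split a dense leaf that is still wider than a unit cell.
        if cell.count >= self.split_count and cell.hi - cell.lo > self.unit_width:
            mid = (cell.lo + cell.hi) / 2
            half = cell.count / 2  # assumption: the parent's mass is split evenly
            cell.children = [Cell(cell.lo, mid, half, self.t),
                             Cell(mid, cell.hi, half, self.t)]

    def leaves(self) -> List[Cell]:
        stack, out = [self.root], []
        while stack:
            c = stack.pop()
            if c.children is None:
                out.append(c)
            else:
                stack.extend(c.children)
        return sorted(out, key=lambda c: c.lo)

    def dense_unit_cells(self, min_count: float) -> List[Tuple[float, float]]:
        """Ranges of unit cells whose decayed count reaches min_count."""
        return [(c.lo, c.hi) for c in self.leaves()
                if c.hi - c.lo <= self.unit_width
                and self.decayed_count(c) >= min_count]
```

Feeding the same value repeatedly into a grid over [0, 16) with `unit_width=1` refines only the path toward that value, so the rest of the range stays coarse and the number of cells grows with the number of dense regions rather than with the number of elements. A cluster would then be read off as a group of adjacent dense unit cells.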
