Extended High Dimensional Clustering using Iterative Two Dimensional Projection Filtering

  • Published : 2001.10.01

Abstract

Large high-dimensional data sets contain a significant amount of noise because of their inherent sparsity, which makes effective high-dimensional clustering difficult. CLIP was developed as a clustering algorithm that accommodates these characteristics of high-dimensional data. CLIP incrementally applies one-dimensional projections on each axis and finds the product sets of the one-dimensional clusters obtained in each projection space. These product sets contain all high-dimensional clusters, but they may also contain noise. In this paper, we propose an extended CLIP algorithm that refines the product sets that contain clusters. We remove high-dimensional noise by iteratively applying two-dimensional projections to the product sets already found by CLIP. We evaluate the extended algorithm through a series of experiments on synthetic data sets, which demonstrate its effectiveness.
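The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed-width grid, the `width` and `min_count` parameters, the function names, and the toy data are all assumptions introduced here for illustration. The first step keeps points whose every one-dimensional projection falls in a dense bin (a simplified stand-in for the product sets), and the second step repeatedly discards points that fall in sparse cells of any two-dimensional projection.

```python
from collections import Counter
from itertools import combinations


def dense_bins_1d(points, axis, width, min_count):
    """Bin indices on one axis that hold at least min_count points."""
    counts = Counter(int(p[axis] // width) for p in points)
    return {b for b, c in counts.items() if c >= min_count}


def product_set(points, width, min_count):
    """Keep points whose projection on every axis lands in a dense 1-D bin."""
    dims = len(points[0])
    dense = [dense_bins_1d(points, a, width, min_count) for a in range(dims)]
    return [p for p in points
            if all(int(p[a] // width) in dense[a] for a in range(dims))]


def filter_2d(points, width, min_count):
    """Repeatedly drop points in sparse cells of any 2-D projection."""
    current = list(points)
    while current:
        dims = len(current[0])
        survivors = current
        for i, j in combinations(range(dims), 2):
            cells = Counter((int(p[i] // width), int(p[j] // width))
                            for p in survivors)
            survivors = [p for p in survivors
                         if cells[(int(p[i] // width), int(p[j] // width))]
                         >= min_count]
        if len(survivors) == len(current):
            break  # a full pass removed nothing, so the set is stable
        current = survivors
    return current


# Hypothetical toy data: two 3-D clusters plus one noise point whose
# coordinates each fall inside a dense 1-D bin.
cluster_a = [(0.1, 0.2, 0.3), (0.2, 0.3, 0.1), (0.3, 0.1, 0.2),
             (0.4, 0.4, 0.4), (0.5, 0.6, 0.7)]
cluster_b = [(5.1, 5.2, 5.3), (5.2, 5.3, 5.1), (5.3, 5.1, 5.2),
             (5.4, 5.4, 5.4), (5.5, 5.6, 5.7)]
noise = (0.5, 5.5, 0.5)
data = cluster_a + cluster_b + [noise]

candidates = product_set(data, width=1.0, min_count=3)
refined = filter_2d(candidates, width=1.0, min_count=3)
```

In this toy data the noise point passes the one-dimensional step, because each of its coordinates lies in a dense bin, but it is alone in its cell when projected onto the first two axes, so the two-dimensional step removes it.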
