An Effective Algorithm for Subdimensional Clustering of High Dimensional Data

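  • Jong Soo Park (박종수; School of Computer and Information Science, Sungshin Women's University)
  • Do Hyung Kim (김도형; School of Computer and Information Science, Sungshin Women's University)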
  • Published : 2003.06.01

Abstract

Finding clusters in high-dimensional data is a well-known and important problem in data mining, because cluster analysis is widely used in applications such as pattern recognition, data analysis, and market analysis. A framework called projected clustering was recently proposed for this problem: it first selects the subdimensions of each candidate cluster and then assigns each input point to the nearest cluster according to a distance function based on that cluster's chosen subdimensions. We propose a new algorithm for subdimensional clustering of high-dimensional data that has three major steps: (1) partitioning the input points into several candidate clusters, each with a proper number of points; (2) filtering out the clusters that cannot be useful in the next steps; and (3) merging the remaining clusters into the predefined number of clusters using a closeness function. Extensive experiments show that the proposed algorithm performs better than other existing clustering algorithms.
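
The projected-clustering idea described in the abstract (choose subdimensions per cluster, then assign each point to the nearest cluster using only those subdimensions) can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: the abstract does not specify its distance or closeness functions, so the sketch assumes the Manhattan segmental distance used in earlier projected-clustering work and a simple "smallest spread" rule for choosing dimensions. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def select_subdimensions(points: List[Point], num_dims: int) -> List[int]:
    """Assumed rule: keep the num_dims dimensions in which the points
    deviate least (on average) from their centroid."""
    c = centroid(points)
    spread = [
        sum(abs(p[d] - c[d]) for p in points) / len(points)
        for d in range(len(c))
    ]
    ranked = sorted(range(len(c)), key=lambda d: spread[d])
    return sorted(ranked[:num_dims])


def segmental_distance(p: Point, c: Point, dims: List[int]) -> float:
    """Manhattan distance restricted to dims, averaged over len(dims)."""
    return sum(abs(p[d] - c[d]) for d in dims) / len(dims)


def assign(point: Point, clusters: List[Tuple[List[float], List[int]]]) -> int:
    """Index of the cluster whose centroid is nearest to point,
    measured only on that cluster's own subdimensions."""
    return min(
        range(len(clusters)),
        key=lambda i: segmental_distance(point, clusters[i][0], clusters[i][1]),
    )


# Toy data: the first group is tight in dimensions 0 and 1,
# the second group is tight in dimensions 1 and 2.
group_a = [(1.0, 1.0, 0.0), (1.2, 0.9, 50.0), (0.9, 1.1, 100.0)]
group_b = [(0.0, 5.0, 5.0), (50.0, 5.1, 4.9), (100.0, 4.9, 5.1)]

clusters = [
    (centroid(g), select_subdimensions(g, 2)) for g in (group_a, group_b)
]
for q in [(1.1, 1.0, 80.0), (40.0, 5.0, 5.0)]:
    print(q, "-> cluster", assign(q, clusters))
```

Because each cluster is compared on its own subdimensions, a point that is far from a centroid in an irrelevant dimension (such as the third coordinate of the first query point) can still be assigned to that cluster.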
