A New Similarity Measure for Categorical Attribute-Based Clustering

  • 김민 (Center for Cognitive Robotics, Korea Institute of Science and Technology) ;
  • 전주혁 (Department of Computer Science, KAIST) ;
  • 우경구 (SW Advanced Research Lab, Samsung Advanced Institute of Technology, Samsung Electronics) ;
  • 김명호 (Department of Computer Science, KAIST)
  • Received : 2009.05.04
  • Reviewed : 2010.01.20
  • Published : 2010.04.15

Abstract

The problem of finding clusters in data arises in numerous applications such as pattern recognition, image analysis, and market analysis. Key factors that determine cluster quality include the similarity measure and the number of attributes. A similarity measure should be defined with respect to the characteristics of the data, yet most existing work is restricted to data described by numerical attributes. Data described by categorical attributes are also common in practice, but similarity between categorical values is hard to define because the values have no inherent order. Moreover, in high-dimensional spaces data points are sparsely distributed, so the distances to near and far points become almost indistinguishable and clustering results can be poor. Subspace clustering has been proposed to overcome this difficulty: it performs clustering while selecting, for each cluster, the subset of dimensions best suited to discovering that cluster. In this paper, we propose a new similarity measure for subspace clustering of high-dimensional data with categorical attributes. The measure is defined under the basic assumption that each cluster should carry specific information that distinguishes it from the other clusters, and it also reflects the dependencies among attributes. This study is meaningful in that no previous measure has captured both of these aspects. Experiments on real datasets show that the proposed method yields more accurate clusters than other clustering methods on both low-dimensional and high-dimensional data.
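The abstract describes the measure only at a high level, so the sketch below is an illustration rather than the paper's actual definition. It scores a candidate cluster on a chosen subspace by how strongly the cluster's categorical values concentrate relative to the whole dataset (an entropy-reduction term, echoing the idea that a cluster should carry distinguishing information), and adds pairwise mutual information between the subspace attributes as a rough stand-in for the attribute dependencies the abstract mentions. Every function name and the scoring formula itself are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(values):
    """Shannon entropy (bits) of a list of categorical values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two categorical columns."""
    joint = list(zip(xs, ys))
    return entropy(xs) + entropy(ys) - entropy(joint)

def cluster_score(data, member_idx, subspace):
    """Illustrative score, NOT the paper's measure: a cluster is
    'informative' on a subspace if its attribute values are concentrated
    (low entropy) compared to the full data, plus a bonus when the
    subspace attributes are mutually dependent.

    data:       list of tuples of categorical values (one tuple per object)
    member_idx: indices of the objects assigned to the cluster
    subspace:   attribute positions the cluster is evaluated on
    """
    gain = 0.0
    for a in subspace:
        col_all = [row[a] for row in data]
        col_in = [data[i][a] for i in member_idx]
        # Entropy reduction inside the cluster relative to the whole data:
        # a large reduction means the cluster pins down this attribute.
        gain += entropy(col_all) - entropy(col_in)
    # Reward subspaces whose attributes depend on each other overall,
    # a crude proxy for "attribute dependencies" in the abstract.
    dep = sum(
        mutual_information([r[a] for r in data], [r[b] for r in data])
        for a, b in combinations(subspace, 2)
    )
    return gain + dep

# Tiny usage example with 3 categorical attributes.
data = [
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("blue", "square", "small"),
    ("blue", "square", "large"),
]
# Objects 0 and 1 form a candidate cluster; attributes 0 and 1 are its subspace.
print(cluster_score(data, member_idx=[0, 1], subspace=[0, 1]))
```

On this toy data the cluster {0, 1} pins down both the color and shape attributes, so it scores 3.0, while a mixed pair such as {0, 2} scores only 1.0; this is merely meant to make the entropy-plus-dependency intuition concrete.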
