Cluster Analysis with Balancing Weight on Mixed-type Data

  • Chae, Seong-San (Department of Applied Statistics, Daejeon University) ;
  • Kim, Jong-Min (Division of Science and Mathematics, University of Minnesota) ;
  • Yang, Wan-Youn (Department of Applied Statistics, Kyungwon University)
  • Published : 2006.12.31

Abstract

We present a set of clustering algorithms whose distance formulation is extended, with an appropriate weight, to data that mix numeric and multiple binary attributes. Simple matching and Jaccard coefficients are used to measure the similarity between objects on the binary attributes, and these similarities are then converted to dissimilarities between the i-th and j-th objects. We demonstrate the performance of the clustering algorithms with a balancing weight under the different similarity measures. Our experiments show that the algorithms, when a proper weight is applied, achieve a competitive recovery level on data with mixed numeric and multiple binary attributes.
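The ingredients named in the abstract can be illustrated with a minimal Python sketch. The simple matching and Jaccard coefficients follow their standard definitions, and each similarity s is converted to a dissimilarity as 1 - s. The way the numeric and binary parts are combined is an assumption made here for illustration only: the abstract does not give the paper's weight formula, so the convex combination, the `weight` parameter, and the function names below are hypothetical.

```python
import numpy as np


def simple_matching(a, b):
    """Simple matching coefficient: share of attributes on which a and b agree."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return float(np.mean(a == b))


def jaccard(a, b):
    """Jaccard coefficient: joint presences divided by attributes present in either object."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.sum(a | b)
    if union == 0:
        # Convention chosen for this sketch: two all-zero objects count as identical.
        return 1.0
    return float(np.sum(a & b) / union)


def mixed_dissimilarity(x_num, y_num, x_bin, y_bin, weight=0.5,
                        similarity=simple_matching):
    """Combine a numeric distance with a binary dissimilarity.

    ASSUMPTION: the convex combination below is illustrative and is not
    taken from the paper, whose balancing weight is not stated in the abstract.
    """
    # Euclidean distance on the numeric attributes.
    d_num = float(np.linalg.norm(np.asarray(x_num, dtype=float)
                                 - np.asarray(y_num, dtype=float)))
    # Similarity on the binary attributes, converted to a dissimilarity.
    d_bin = 1.0 - similarity(x_bin, y_bin)
    return (1.0 - weight) * d_num + weight * d_bin


# Two objects i and j, each with two numeric and four binary attributes.
d_ij = mixed_dissimilarity([0.0, 0.0], [3.0, 4.0],
                           [1, 0, 1, 0], [1, 1, 0, 0],
                           weight=0.5, similarity=jaccard)
```

A matrix of such pairwise dissimilarities could then be passed to any clustering routine that accepts precomputed distances. In practice the numeric attributes would usually be standardized first, since otherwise their scale dominates the binary part regardless of the weight.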
