Cluster Analysis with Balancing Weight on Mixed-type Data

  • Chae, Seong-San (Department of Applied Statistics, Daejeon University)
  • Kim, Jong-Min (Division of Science and Mathematics, University of Minnesota)
  • Yang, Wan-Youn (Department of Applied Statistics, Kyungwon University)
  • Published: 2006.12.31


We present a set of clustering algorithms that apply a balancing weight in the distance formulation, extending it to data with mixed numeric and multiple binary attributes. The simple matching and Jaccard coefficients are used to measure similarity between objects on the binary attributes, and these similarities are converted to dissimilarities between the i-th and j-th objects. We demonstrate the performance of the clustering algorithms with a balancing weight under the different similarity measures. Our experiments show that clustering algorithms with a proper weight achieve a competitive recovery level when data with mixed numeric and multiple binary attributes are clustered.
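
To make the idea of a weighted mixed-type dissimilarity concrete, the following is a minimal sketch rather than the authors' formulation. The abstract does not state the exact formula, so several choices here are assumptions: similarities are converted to dissimilarities as 1 - s, the numeric part is a Euclidean distance, and the two parts are combined additively with a single hypothetical parameter `weight`. The function names are illustrative only.

```python
import math


def simple_matching(a, b):
    """Simple matching coefficient: fraction of binary attributes on which a and b agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard(a, b):
    """Jaccard coefficient: shared 1s over attributes where at least one object has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumed convention: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


def mixed_dissimilarity(num_i, num_j, bin_i, bin_j, weight=1.0, similarity=simple_matching):
    """Dissimilarity between objects i and j with numeric and binary parts.

    Assumption (not taken from the paper): the binary similarity s is
    converted to a dissimilarity as 1 - s, and `weight` scales that term
    before it is added to the Euclidean distance on the numeric attributes.
    """
    numeric_part = math.sqrt(sum((x - y) ** 2 for x, y in zip(num_i, num_j)))
    binary_part = 1.0 - similarity(bin_i, bin_j)
    return numeric_part + weight * binary_part
```

For example, `mixed_dissimilarity([0.0, 0.0], [3.0, 4.0], [1, 0, 1, 0], [1, 1, 0, 0], weight=2.0)` combines a numeric distance of 5 with a simple matching dissimilarity of 0.5. Passing `similarity=jaccard` swaps in the Jaccard coefficient, which ignores attributes where both objects are 0. A matrix of such pairwise values could then be passed to any clustering routine that accepts precomputed dissimilarities.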

