A Big Data Analysis by Between-Cluster Information using k-Modes Clustering Algorithm

Korean title (in translation): Big Data Analysis Based on Between-Cluster Correlation Information Using the k-Modes Partitioning Algorithm

  • Park, In-Kyoo (박인규) (Dept. of Computer and Game Engineering, College of Engineering, Joongbu University)
  • Received : 2015.09.21
  • Accepted : 2015.11.20
  • Published : 2015.11.28

Abstract

This paper addresses subspace clustering of categorical data for convergence and integration. Unlike numerical data, categorical data have no natural ordering, and they are often high-dimensional and sparse, so conventional evaluation measures for categorical data have inherent limitations. Hence, a conditional entropy measure is proposed to estimate more closely the cohesion (mutual similarity) among the categorical attributes within each cluster. We also propose a new objective function for optimizing the clustering, which minimizes within-cluster dispersion and enhances between-cluster separation. We performed experiments on five real-world datasets, comparing the proposed algorithm with four other algorithms on three evaluation metrics: accuracy, F-measure, and adjusted Rand index. In these experiments, the proposed algorithm outperformed the compared algorithms on the considered metrics.
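
The abstract names two building blocks without giving formulas: k-modes clustering of categorical data and a conditional entropy measure over attributes. The following is a minimal sketch of both in their textbook form, not the paper's proposed objective function or attribute weighting, which the abstract does not specify. It assumes plain k-modes with simple-matching dissimilarity and per-attribute mode updates; all function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def conditional_entropy(xs, ys):
    """Return H(Y | X) in bits for two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    marginal_x = Counter(xs)       # counts of x alone
    # H(Y|X) = -sum_{x,y} p(x,y) * log2( p(x,y) / p(x) )
    return -sum((c / n) * log2(c / marginal_x[x]) for (x, _), c in joint.items())


def matching_dissimilarity(a, b):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return sum(u != v for u, v in zip(a, b))


def column_modes(rows):
    """Most frequent category in each attribute (column) of the given rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, initial_modes, max_iter=20):
    """Plain k-modes (assumed baseline, not the paper's weighted variant)."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in rows
        ]
        if new_labels == labels:   # assignments stable, so stop
            break
        labels = new_labels
        # Update step: recompute each cluster's mode attribute by attribute.
        for j in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:            # keep the old mode if a cluster is empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy data: (colour, size, shape).
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("blue", "small", "square"),
]
labels, modes = k_modes(rows, initial_modes=[rows[0], rows[3]])
print(labels, modes)

# Conditional entropy of shape given colour inside cluster 0.
cluster0 = [r for r, lab in zip(rows, labels) if lab == 0]
print(conditional_entropy([r[0] for r in cluster0], [r[2] for r in cluster0]))
```

On this toy data the three "red" rows and the three "blue" rows end up in separate clusters. A low conditional entropy between two attributes inside a cluster indicates that one attribute largely determines the other there, which is the kind of within-cluster cohesion the abstract refers to; how the paper turns that into weights or an objective is not stated in the abstract.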
