Rule-Based Classification Analysis Using Entropy Distribution

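  • 이정진 (Department of Statistics and Actuarial Science, Soongsil University)
  • 박해기 (Department of Statistics and Actuarial Science, Soongsil University)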
  • Received : 2010.03
  • Accepted : 2010.06
  • Published : 2010.07.31

Abstract

Rule-based classification analysis is widely used in large-scale data mining because it is easy to interpret and its algorithms are simple. Existing approaches first find a set of rules and then classify new data by combining them, typically through a majority vote or a combination weighted by each rule's support. In this paper we propose combining rules through the multinomial distribution, applying a classification technique developed for binary data. The multinomial distribution is estimated with a modified iterative proportional fitting (IPF) algorithm, which finds the distribution that maximizes entropy subject to constraints given by the rules' supports. Simulation experiments show that this method is competitive with other well-known classification models when the two populations are similar.
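To make the estimation step concrete, the following is a minimal sketch of textbook iterative proportional fitting over the cells of a small binary table. It is not the modified IPF algorithm of the paper, whose details the abstract does not give. The function name `fit_max_entropy`, the representation of a rule as a predicate paired with a target support, and the numeric supports are all assumptions made for illustration.

```python
from itertools import product


def fit_max_entropy(n_vars, constraints, max_sweeps=500, tol=1e-10):
    """Fit a distribution over binary vectors by iterative proportional fitting.

    constraints is a list of (covers, target) pairs (a hypothetical format):
    covers(cell) returns True when a rule's condition holds for that cell,
    and target is the support the fitted distribution should give the rule.
    Assumes 0 < target < 1, that each rule covers some but not all cells,
    and that the constraints are mutually consistent.
    """
    cells = list(product((0, 1), repeat=n_vars))
    # The uniform distribution is the maximum entropy start with no constraints.
    p = {c: 1.0 / len(cells) for c in cells}
    for _ in range(max_sweeps):
        worst = 0.0
        for covers, target in constraints:
            mass = sum(p[c] for c in cells if covers(c))
            worst = max(worst, abs(mass - target))
            # Rescale covered cells to the target mass and the rest to
            # 1 - target, which keeps the total probability at 1.
            inside = target / mass
            outside = (1.0 - target) / (1.0 - mass)
            for c in cells:
                p[c] *= inside if covers(c) else outside
        if worst < tol:
            break
    return p


# Illustrative supports for three rules on two binary variables (made up).
constraints = [
    (lambda c: c[0] == 1, 0.6),
    (lambda c: c[1] == 1, 0.5),
    (lambda c: c[0] == 1 and c[1] == 1, 0.4),
]
p = fit_max_entropy(2, constraints)
for cell in sorted(p):
    print(cell, round(p[cell], 4))
```

With only the first two constraints the fit reduces to the independence table, which is the maximum entropy distribution for fixed marginals. One way to use such fits for classification, under the same assumptions, is to estimate one distribution per population from that population's rule supports and assign a new observation to the population whose fitted distribution gives it the higher probability.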
