Analysis of Large Tables

Analysis of Large Contingency Tables

  • Choi, Hyun-Jip (Department of Applied Information Statistics, Kyonggi University)
  • Choi, Hyun-Jip (Applied Information Statistics Major, Division of Economics, Kyonggi University)
  • Published : 2005.07.01

Abstract

For the analysis of large tables formed by many categorical variables, we suggest a method that partitions the variables into several disjoint groups such that the variables within each group are completely associated. We use a simple function of the Kullback-Leibler divergence as a similarity measure to find the groups. Since the groups are complete hierarchical sets, we can identify the association structure of a large table with marginal log-linear models. Examples are given to illustrate the suggested method.

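For the analysis of large contingency tables formed by many categorical variables, we propose an analysis method that uses the collapsibility property. We propose a way to determine groups of variables that are completely associated with one another, using the Kullback-Leibler divergence measure; within the groups found by the proposed method, the associations among the variables can be identified with marginal log-linear models. As an application, we present the results of analyzing market basket data on consumers' purchasing behavior, a kind of large contingency table commonly encountered in data mining.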
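
The abstract does not state which function of the Kullback-Leibler divergence serves as the similarity measure. As a rough illustration only, the sketch below computes the divergence between the observed joint distribution of two categorical variables and the product of their marginals (their mutual information), then maps it into [0, 1). The function names and the transform `1 - exp(-2 * KL)` are assumptions made for this example, not the measure defined in the paper.

```python
import math


def kl_from_independence(table):
    """KL divergence between the joint cell probabilities of a two-way
    table of counts and the product of its marginal probabilities."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    kl = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing to the sum
                p = count / total
                kl += p * math.log(p / (row_p[i] * col_p[j]))
    return kl


def similarity(table):
    """Hypothetical similarity: 0 under independence, approaching 1 as
    the divergence grows. This transform is an assumption."""
    return 1.0 - math.exp(-2.0 * kl_from_independence(table))


# Two made-up 2x2 tables of counts: one independent, one fully associated.
independent = [[10, 20], [30, 60]]
associated = [[50, 0], [0, 50]]
print(similarity(independent), similarity(associated))
```

Pairs of variables with a high similarity would be candidates for the same group, for instance as input to an agglomerative clustering step. How the paper actually forms the groups is not described in the abstract.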

References

  1. Agresti, A., Lipsitz, S., and Lang, J. B. (1992). Comparing marginal distributions of large, sparse contingency tables, Computational Statistics & Data Analysis, 14, 55-73 https://doi.org/10.1016/0167-9473(92)90081-P
  2. Bergsma, W. P. and Rudas, T. (2002). Marginal models for categorical data, Annals of Statistics, 30, 140-159 https://doi.org/10.1214/aos/1015362188
  3. Christensen, R. (1997). Log-Linear Models and Logistic Regression, 2nd ed., Springer-Verlag
  4. DuMouchel, W. (1999). Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system, The American Statistician, 53, 177-190 https://doi.org/10.2307/2686093
  5. Edwards, D. (2000). Introduction to Graphical Modelling, Springer-Verlag
  6. Erosheva, E. A., Fienberg, S. E., and Junker, B. W. (2002). Alternative statistical models and representations for large sparse multi-dimensional contingency tables, Annales de la Faculté des Sciences de Toulouse, 11, 485-505 https://doi.org/10.5802/afst.1035
  7. Fienberg, S. E. (2000). Contingency tables and log-linear models: Basic results and new developments, Journal of the American Statistical Association, 95, 643-647 https://doi.org/10.2307/2669409
  8. Giudici, P. and Passerone, G. (2002). Data mining of association structures to model consumer behaviour, Computational Statistics & Data Analysis, 38, 533-541 https://doi.org/10.1016/S0167-9473(01)00077-9
  9. Kojadinovic, I. (2004). Agglomerative hierarchical clustering of continuous variables based on mutual information, Computational Statistics & Data Analysis, 46, 269-294 https://doi.org/10.1016/S0167-9473(03)00153-1
  10. Kullback, S. and Leibler, R. A. (1951). On information and sufficiency, Annals of Mathematical Statistics, 22, 79-86 https://doi.org/10.1214/aoms/1177729694
  11. Law, G. R., Cox, D. R., Machonochie, N. E. S., Simpson, J., Roman, E., and Carpenter, L. M. (2001). Large tables, Biostatistics, 2, 163-171 https://doi.org/10.1093/biostatistics/2.2.163
  12. Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics, John Wiley & Sons