Bounds of PIM-based similarity measures with partially marginal proportion

Setting upper and lower bounds of similarity measures based on probabilistic interestingness measures with partially marginal proportion

  • Received : 2015.06.11
  • Accepted : 2015.07.01
  • Published : 2015.07.31

Abstract

According to Wikipedia, data mining is the computational process of discovering patterns in large data sets, using methods at the intersection of association rules, decision trees, clustering, artificial intelligence, and machine learning. Clustering, or cluster analysis, is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. The similarity measures used in clustering can be classified into various types depending on the characteristics of the data. In this paper, we computed bounds for similarity measures based on the probabilistic interestingness measure (PIM) with partially marginal probability, namely the Peirce I, Peirce II, Cole I, Cole II, Loevinger, Park I, and Park II measures. We confirmed that the absolute value of the Loevinger measure was an upper bound on the absolute value of each of the other existing measures. The ordering of the other measures is determined by the sizes of the co-occurrence proportion, the joint non-occurrence proportion, and the mismatch proportions.

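Data mining aims to discover hidden knowledge or new rules in large data sets of various forms and to use them as information for tasks such as decision making. Cluster analysis, one of the data mining techniques, partitions objects into groups using distance or similarity measures and identifies the characteristics of each resulting group. In this paper, we established the ordering among similarity measures based on probabilistic interestingness measures that partially include marginal probabilities, namely Peirce I, Peirce II, Cole I, Cole II, and the Park I and Park II measures derived from them, both by mathematical proof and with example data. As a result, we confirmed that the Loevinger measure, which considers the Cole I and Cole II measures simultaneously, is the upper bound among the existing measures, but that when Park I and Park II are also considered, the ordering varies with the sizes of the co-occurrence proportion, the joint non-occurrence proportion, and the two types of mismatch proportion.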
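
The upper-bound claim for the existing measures can be checked numerically on a 2x2 table of proportions. The sketch below is illustrative only: this page gives no formulas, so the definitions used are the forms commonly cited in the binary-similarity literature, and the paper's own definitions may differ. The function name pim_measures and the symbols a (co-occurrence), b and c (the two mismatch proportions), and d (joint non-occurrence) are assumptions introduced here. Park I and Park II are omitted because their definitions are not stated on this page.

```python
def pim_measures(a, b, c, d):
    """Return five PIM-based similarity measures for a 2x2 table of proportions.

    a: both items occur, b: only the first occurs,
    c: only the second occurs, d: neither occurs.
    The formulas are assumed (commonly cited forms), not taken from the paper.
    """
    p1, q1 = a + b, c + d  # first item: occurs / does not occur
    p2, q2 = a + c, b + d  # second item: occurs / does not occur
    if min(p1, q1, p2, q2) <= 0:
        raise ValueError("all marginal proportions must be positive")
    cov = a * d - b * c    # shared numerator of every measure
    return {
        "peirce1": cov / (p1 * q1),
        "peirce2": cov / (p2 * q2),
        "cole1": cov / (p1 * q2),
        "cole2": cov / (p2 * q1),
        # Smaller of the two Cole denominators, so |loevinger| is the
        # larger of |cole1| and |cole2|.
        "loevinger": cov / min(p1 * q2, p2 * q1),
    }


measures = pim_measures(0.4, 0.1, 0.2, 0.3)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed definitions all five measures share one numerator, so their ordering depends only on the denominators. The Loevinger denominator is never larger than the other four, which is why its absolute value bounds the others from above.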
