A study on the ordering of PIM family similarity measures without marginal probability

Park, Hee Chang

doi:10.7465/jkdi.2015.26.2.367

Journal of the Korean Data and Information Science Society

Volume 26, Issue 2 / Pages 367-376 / 2015 / 1598-9402 (pISSN)

The Korean Data and Information Science Society

Park, Hee Chang (Department of Statistics, Changwon National University)

Received: 2015.02.10
Reviewed: 2015.03.18
Published: 2015.03.31

Abstract

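Cluster analysis, one of the techniques of data mining, groups observations with diverse characteristics into homogeneous clusters on the basis of similarity and is then used to examine the characteristics shared within each cluster. In this paper, we establish upper and lower bounds for the Yule I and II, Michael, Digby, Baulieu, and Dispersion measures, which are probabilistic interestingness measure (PIM) based similarity measures that do not take marginal probabilities into account, and thereby determine how these measures are ordered. As a result, we confirmed that three types of ordering relations hold, not only by mathematical proof but also with real data and simulated data. Because each measure takes values closest to those of the measures at its bounds, the upper and lower bounds serve as a tool for classifying the various measures, and knowing how the measures relate in terms of their actual values can help stabilize a given algorithm.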

Big data has recently become a prominent topic; it may be defined as a collection of data sets so large and complex that they are difficult to process with traditional methods. Cluster analysis identifies information in a large database by assigning a set of objects to clusters so that objects in the same cluster are more similar to each other than to objects in other clusters. The similarity measures used in cluster analysis can be classified into various types depending on the nature of the data. In this paper, we derive upper and lower bounds for probabilistic interestingness measure (PIM) based similarity measures that do not use marginal probabilities, namely the Yule I and II, Michael, Digby, Baulieu, and Dispersion measures, and we compare these measures using real data and a simulation experiment. According to Warrens (2008), coefficients that have the same quantities in the numerator and denominator, that are bounded, and that are close to each other in the ordering are likely to be more similar. Results on bounds therefore provide a means of classifying various measures, and knowing which coefficients are similar provides insight into the stability of a given algorithm.
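As a rough illustration of the kind of ordering the abstract refers to, the sketch below computes five of the named measures from a 2x2 presence/absence table with cell counts a, b, c, d. The formulas are the ones commonly given in surveys of binary similarity measures such as Choi, Cha and Tappert (2010); they are not quoted from this paper. Reading "Yule I" as Yule's Q and "Yule II" as Yule's Y is an assumption, the function name `pim_measures` and the example counts are hypothetical, and the Baulieu measure is left out because its formula is not stated here.

```python
from math import sqrt


def pim_measures(a, b, c, d):
    """Return several 2x2-table similarity measures as a dict.

    a: both attributes present, b and c: exactly one present,
    d: both absent. Formulas are assumed from common survey
    definitions, not taken from the paper itself.
    """
    n = a + b + c + d
    ad, bc = a * d, b * c
    if n == 0 or ad + bc == 0:
        raise ValueError("measures are undefined when ad + bc == 0")
    return {
        # Assumed to correspond to "Yule I" (Yule's Q).
        "yule_q": (ad - bc) / (ad + bc),
        # Assumed to correspond to "Yule II" (Yule's Y).
        "yule_y": (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc)),
        "digby": (ad ** 0.75 - bc ** 0.75) / (ad ** 0.75 + bc ** 0.75),
        "michael": 4 * (ad - bc) / ((a + d) ** 2 + (b + c) ** 2),
        "dispersion": (ad - bc) / n ** 2,
    }


# Hypothetical counts, chosen only for illustration.
m = pim_measures(a=40, b=10, c=5, d=45)
for name, value in m.items():
    print(f"{name:>10}: {value:.4f}")
```

Under these definitions, Yule's Y, Digby, and Yule's Q all have the form (x^p - y^p) / (x^p + y^p) with x = ad, y = bc and exponents p = 1/2, 3/4, and 1. This expression grows with p whenever ad > bc, so Y <= Digby <= Q holds for positively associated tables, and the order reverses when ad < bc. Whether this chain is one of the three orderings established in the paper is not confirmed by the abstract.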

References

Baulieu, F. B. (1989). A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6, 233-246. https://doi.org/10.1007/BF01908601
Choi, S. S., Cha, S. H. and Tappert, C. (2010). A survey of binary similarity and distance measures. Journal on Systemics, Cybernetics and Informatics, 8, 43-48.
Gordon, A. D. (1999). Classification, Chapman & Hall, London-New York.
Kim, M., Jeon, J., Woo, K. and Kim, M. (2010). A new similarity measure for categorical attribute-based clustering. Journal of Korean Institute of Information Scientists and Engineers : Databases, 37, 71-81.
Lee, J. H. (2013). Big data, data mining and temporary reproduction. The Journal of Intellectual Property, 8, 93-125. https://doi.org/10.1093/jiplp/jps218
Lee, K. A. and Kim, J. H. (2011). Comparison of clustering with yeast microarray gene expression data. Journal of the Korean Data & Information Science Society, 22, 741-753.
Lim, J. S. and Lim, D. H. (2012). Comparison of clustering methods of microarray gene expression data. Journal of the Korean Data & Information Science Society, 23, 39-51. https://doi.org/10.7465/jkdi.2012.23.1.039
Michael, E. L. (1920). Marine ecology and the coefficient of association. Journal of Ecology, 8, 54-59. https://doi.org/10.2307/2255213
Park, H. C. (2012). Exploration of PIM based similarity measures as association rule thresholds. Journal of the Korean Data & Information Science Society, 23, 1127-1135. https://doi.org/10.7465/jkdi.2012.23.6.1127
Park, H. C. (2014). Comparison of cosine family similarity measures in the aspect of association rule. Journal of the Korean Data Analysis Society, 16, 729-737.
Park, H. J. and Kim, J. T. (2013). Classification of universities in Daegu-Gyungpook by support vector cluster analysis. Journal of the Korean Data & Information Science Society, 24, 783-791. https://doi.org/10.7465/jkdi.2013.24.4.783
Ryu, J. Y. and Park, H. C. (2013). A study on Jaccard dissimilarity measures for negative association rule generation. Journal of the Korean Data Analysis Society, 15, 3111-3121.
Stanfill, C. and Waltz, D. (1986). Toward memory-based reasoning. Communications of the ACM, 29, 1213-1228. https://doi.org/10.1145/7902.7906
Warrens, M. J. (2008). Bounds of resemblance measures for binary (presence/absence) variables. Journal of Classification, 25, 195-208. https://doi.org/10.1007/s00357-008-9024-6
Yeo, I. K. (2011). Clustering analysis of Korea's meteorological data. Journal of the Korean Data & Information Science Society, 22, 941-949.
Yule, G. U. (1900). On the association of attributes in statistics. Philosophical Transactions of the Royal Society of London, Series A, 194, 257-319.
Yule, G. U. (1912). On the methods of measuring the association between two attributes. Journal of the Royal Statistical Society, 75, 579-652. https://doi.org/10.2307/2340126

Cited by

Bounds of PIM-based similarity measures with partially marginal proportion vol.26, pp.4, 2015, https://doi.org/10.7465/jkdi.2015.26.4.857
Generally non-linear regression model containing standardized lift for association number estimation vol.27, pp.3, 2016, https://doi.org/10.7465/jkdi.2016.27.3.629
Signed Hellinger measure for directional association vol.27, pp.2, 2016, https://doi.org/10.7465/jkdi.2016.27.2.353
