Exploration of Hierarchical Techniques for Clustering Korean Author Names

한글 저자명 군집화를 위한 계층적 기법 비교

  • In-Su Kang (School of Computer and Information Science, Kyungsung University)
  • Published: 2009.06.30

Abstract

Author resolution disambiguates same-name author occurrences in scholarly literature by mapping them to distinct real-world individuals. To do this, pairwise similarities are computed between same-name author occurrences, and the occurrences are then clustered on the basis of those similarities. Many previous studies have used hierarchical clustering for author disambiguation, but the different hierarchical clustering methods have not been compared sufficiently. This study presents an empirical evaluation and analysis of hierarchical clustering methods applied to Korean author resolution. It examines how these methods interact with several distance and similarity functions, namely the Dice coefficient, cosine similarity, Euclidean distance, the Jaccard coefficient, and the Pearson correlation coefficient, and compares their performance on Korean author names.
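
The two steps named in the abstract, pairwise similarity followed by hierarchical clustering, can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the coauthor feature sets, the 0.3 stopping threshold, and all function names are hypothetical, and only the set-based measures (Dice, Jaccard, and cosine over binary features) are shown; Euclidean distance and the Pearson correlation would need explicit feature vectors.

```python
import math
from itertools import combinations


def dice(a, b):
    """Dice coefficient of two feature sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def jaccard(a, b):
    """Jaccard coefficient of two feature sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def cosine(a, b):
    """Cosine similarity, treating each set as a binary feature vector."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if (a and b) else 0.0


def agglomerative(items, sim, threshold, linkage="average"):
    """Bottom-up clustering: repeatedly merge the most similar pair of
    clusters until no pair reaches the similarity threshold.
    Returns clusters as lists of indices into `items`."""
    clusters = [[i] for i in range(len(items))]

    def cluster_sim(c1, c2):
        sims = [sim(items[i], items[j]) for i in c1 for j in c2]
        if linkage == "single":      # closest pair of members
            return max(sims)
        if linkage == "complete":    # farthest pair of members
            return min(sims)
        return sum(sims) / len(sims)  # average over all member pairs

    while len(clusters) > 1:
        # Find the most similar pair of clusters (p < q by construction).
        (p, q), best = max(
            (((p, q), cluster_sim(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical same-name author occurrences, each described by coauthor names.
records = [
    {"coauthor_A", "coauthor_B"},
    {"coauthor_A", "coauthor_B", "coauthor_C"},
    {"coauthor_D"},
    {"coauthor_D", "coauthor_E"},
]

print(agglomerative(records, jaccard, threshold=0.3))  # [[0, 1], [2, 3]]
```

Swapping `jaccard` for `dice` or `cosine`, or changing `linkage`, is the kind of variation the study compares; on real data the choice of threshold and linkage can change the resulting clusters considerably.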
