Structure-based Clustering for XML Document Retrieval

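  • Jeong Hee Hwang (Department of Computer Science, Graduate School, Chungbuk National University);
  • Keun Ho Ryu (School of Electrical and Computer Engineering, Chungbuk National University)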
  • Published : 2004.12.01

Abstract

As XML grows in importance for managing information and exchanging data efficiently on the web, work on structural integration and structure retrieval is ongoing. For XML documents with a defined structure, the structure can be retrieved through the DTD or XML schema, but existing methods cannot be applied to XML documents that carry no structure information. In this paper we therefore propose a new clustering technique, as basic research toward fast structure retrieval over XML documents that lack structure information. We first extract the frequent structures from each XML document. We then cluster the documents by structural similarity, treating the frequent structures as the representative structure of each document, which makes retrieval faster than searching the whole collection of differently structured documents. We also perform structure retrieval over XML documents on the basis of these clusters, each of which groups similar structures. Finally, we show the efficiency of the proposed method by describing how to apply structure retrieval and presenting an example of the results.

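As the importance of XML grows for efficient information management and data exchange on the web, research on the structural integration and structure retrieval of XML is under way. Structure retrieval for XML documents with a defined structure is possible through a schema or DTD. However, existing retrieval methods cannot be applied to XML documents for which no DTD or schema is defined. This paper therefore proposes a new clustering technique as foundational work for fast structure retrieval over large numbers of XML documents that come without structure information. First, the features of frequent structures are extracted from each document. Then, with the extracted frequent structures taken as the document's representative structure, clustering based on similar structure is performed. This allows structure retrieval to be faster than searching over the entire set of documents with differing structures. In addition, structure retrieval over XML documents is performed on the basis of clusters grouped by similar structure. Finally, we describe how structure retrieval is applied and show example results to demonstrate the efficiency of the proposed technique.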
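
The pipeline outlined in the abstract (extract frequent structures per document, cluster by structural similarity, then retrieve through the clusters) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify how frequent structures are mined or how similarity is measured. The sketch assumes that a "structure" is a root-to-element tag path, that "frequent" means occurring at least `min_support` times within one document, and that similarity is Jaccard overlap with a greedy single-pass assignment. All function names, thresholds, and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Count every root-to-element tag path in one XML document."""
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def frequent_structure(xml_text, min_support=2):
    """Keep paths seen at least min_support times (assumed threshold)."""
    counts = tag_paths(xml_text)
    return frozenset(p for p, n in counts.items() if n >= min_support)


def jaccard(a, b):
    """Jaccard similarity of two path sets (assumed similarity measure)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster(structures, threshold=0.5):
    """Greedy single pass: join the most similar cluster or open a new one."""
    clusters = []  # each cluster: {"rep": set of paths, "members": [doc ids]}
    for doc_id, paths in structures.items():
        best = max(clusters, key=lambda c: jaccard(c["rep"], paths), default=None)
        if best is not None and jaccard(best["rep"], paths) >= threshold:
            best["members"].append(doc_id)
            best["rep"] |= paths  # representative grows to cover the new member
        else:
            clusters.append({"rep": set(paths), "members": [doc_id]})
    return clusters


def retrieve(clusters, structures, query_path):
    """Check cluster representatives first, then only that cluster's members."""
    hits = []
    for c in clusters:
        if query_path in c["rep"]:
            hits.extend(d for d in c["members"] if query_path in structures[d])
    return hits


# Hypothetical sample documents without any DTD or schema.
docs = {
    "a": "<lib><book><title/><author/></book><book><title/><author/></book></lib>",
    "b": "<lib><book><title/></book><book><title/><year/></book></lib>",
    "c": "<shop><item><price/></item><item><price/></item></shop>",
}
structures = {doc_id: frequent_structure(text) for doc_id, text in docs.items()}
clusters = cluster(structures)
print(retrieve(clusters, structures, "/lib/book/title"))  # ['a', 'b']
```

Because a query is first matched against each cluster's representative path set, clusters with unrelated structure (here the `shop` documents) are skipped without inspecting their members, which is the speed-up the abstract attributes to clustering.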
