Efficient Structural Information Extraction for XML Data

XML데이터를 위한 효율적인 구조 정보 추출 기법

  • Jun-Ki Min (School of Internet-Media Engineering, Korea University of Technology and Education)
  • Published : 2007.06.30

Abstract

Interest in XML has grown since it came to prominence as the standard for data representation and exchange on the Web. The structural information of XML documents serves several important purposes. Despite this importance, a schema is not mandatory for XML documents, so much research has been conducted on extracting structural information from them. In this paper, we present a technique for efficiently extracting a concise and accurate DTD from XML documents. By restricting the DTD content model using the validity constraints on mixed content in DTD and XML Schema, and by applying several heuristic rules proposed in this paper, we achieve both efficiency and conciseness. Experiments with real-life DTDs show that our approach is superior to existing approaches.
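To make the idea of a restricted content model concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes the simplest possible restriction: every element's content model is forced into the mixed-content shape `(a | b | ...)*`, with `#PCDATA` added when text is observed. The function name `infer_dtd` and the sample documents are hypothetical, and the paper's heuristic rules are not reproduced here.

```python
import xml.etree.ElementTree as ET


def infer_dtd(xml_documents):
    """Infer one <!ELEMENT ...> declaration per element name.

    Assumption (not from the paper): content models are limited to the
    mixed-content shape, i.e. an unordered, repeatable choice of children.
    """
    children = {}  # element name -> child names, in first-seen order
    has_text = {}  # element name -> True if non-whitespace text was seen

    for doc in xml_documents:
        for elem in ET.fromstring(doc).iter():
            kids = children.setdefault(elem.tag, [])
            has_text.setdefault(elem.tag, False)
            # Text directly inside the element, before its first child.
            if elem.text and elem.text.strip():
                has_text[elem.tag] = True
            for child in elem:
                if child.tag not in kids:
                    kids.append(child.tag)
                # Text that follows a child still belongs to the parent.
                if child.tail and child.tail.strip():
                    has_text[elem.tag] = True

    declarations = []
    for name, kids in children.items():
        if has_text[name] and kids:
            model = "(#PCDATA | " + " | ".join(kids) + ")*"
        elif has_text[name]:
            model = "(#PCDATA)"
        elif kids:
            model = "(" + " | ".join(kids) + ")*"
        else:
            model = "EMPTY"
        declarations.append(f"<!ELEMENT {name} {model}>")
    return declarations


# Hypothetical sample documents, used only for illustration.
SAMPLE_DOCS = [
    "<book><title>XML</title><author>Min</author></book>",
    "<book><title>DTD</title><note>see <ref/> here</note></book>",
]

if __name__ == "__main__":
    for line in infer_dtd(SAMPLE_DOCS):
        print(line)
```

A model of this shape is cheap to compute, since it needs only one pass and a set of child names per element, and it stays short. The cost is accuracy: it discards ordering and occurrence constraints, which is the kind of trade-off that additional heuristic rules would have to address.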
