An Unsupervised Clustering Technique of XML Documents based on Function Transform and FFT

  • 이호석 (Department of New Media, College of Engineering, Hoseo University)
  • Published : 2007.04.30

Abstract

This paper discusses a new unsupervised XML document clustering technique based on a function transform and the FFT (Fast Fourier Transform). An XML document is first transformed into a discrete function based on the hierarchical nesting structure of its elements. The discrete function is then transformed into vectors using the FFT. The vectors of two documents are compared using a weighted Euclidean distance metric. If the distance is below a prespecified threshold, the two documents are considered structurally similar and are grouped into the same cluster. XML clustering can be useful for the storage and searching of XML documents. Experiments were conducted with 800 synthetic documents and with 520 real documents. They showed that the function transform and the FFT are effective for the incremental, unsupervised clustering of structurally similar XML documents.
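The pipeline outlined in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the abstract does not specify the exact function transform, the FFT length, the weights, or the cluster-assignment rule. Here the discrete function is assumed to be the nesting depth of each element in document order, the sequence is zero-padded (or truncated) to a fixed power-of-two length, the vector is taken to be the FFT coefficient magnitudes, the weights are `1 / (k + 1)`, and a document joins the first cluster whose first member is closer than the threshold. All function names are hypothetical.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def depth_sequence(xml_text):
    """Assumed function transform: nesting depth of each element, in document order."""
    depths = []

    def walk(node, depth):
        depths.append(depth)
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 1)
    return depths


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def feature_vector(xml_text, size=16):
    """Magnitudes of the first size//2 + 1 FFT coefficients (size is an assumed parameter)."""
    depths = depth_sequence(xml_text)[:size]          # truncate long documents (assumption)
    padded = depths + [0] * (size - len(depths))      # zero-pad short ones (assumption)
    return [abs(c) for c in fft(padded)[: size // 2 + 1]]


def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))


def cluster_incrementally(documents, threshold, size=16):
    """Assign each document to the first cluster whose representative is within threshold."""
    weights = [1.0 / (k + 1) for k in range(size // 2 + 1)]  # assumed weighting scheme
    representatives = []  # feature vector of the first member of each cluster
    labels = []
    for doc in documents:
        vec = feature_vector(doc, size)
        for idx, rep in enumerate(representatives):
            if weighted_distance(vec, rep, weights) < threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough: start a new one.
            representatives.append(vec)
            labels.append(len(representatives) - 1)
    return labels
```

Because this sketch encodes only nesting depth, two documents with the same tree shape but different tag names map to the same vector; whether the paper's transform also encodes element names is not stated in the abstract.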
