An Unsupervised Clustering Technique of XML Documents based on Function Transform and FFT

Lee, Ho-Suk

doi:10.3745/KIPSTD.2007.14-D.2.169

The KIPS Transactions:PartD (정보처리학회논문지D)

Volume 14D, Issue 2, Pages 169-180, 2007, pISSN 1598-2866

Korea Information Processing Society (한국정보처리학회)

Lee, Ho-Suk (Department of New Media, College of Engineering, Hoseo University)

Published: 2007.04.30

https://doi.org/10.3745/KIPSTD.2007.14-D.2.169

Abstract

This paper presents a new unsupervised XML document clustering technique based on a function transform and the FFT (Fast Fourier Transform). An XML document is first transformed into a discrete function based on the hierarchical nesting structure of its elements. The discrete function is then transformed into a vector using the FFT. The vectors of two documents are compared using a weighted Euclidean distance metric. If the distance is below a pre-specified threshold, the two documents are considered structurally similar and are grouped into the same cluster. XML clustering can be useful for storing and searching XML documents. Experiments were conducted with 800 synthetic documents and with 520 real documents. They showed that the function transform and FFT are effective for the incremental, unsupervised clustering of structurally similar XML documents.

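This paper discusses a new XML document clustering technique that uses a function transform and the FFT (Fast Fourier Transform). The clustering is performed incrementally and without supervision. An XML document is transformed into a discrete function based on the hierarchical structure of its elements. The discrete function is transformed into a vector using the FFT. The vectors representing the documents are compared using a weighted Euclidean distance metric. When the result of the comparison is smaller than a predefined value, the two documents being compared are considered structurally similar and are assigned to the same group. XML document clustering can be useful for storing and searching XML documents. Experiments were carried out with 800 synthetic documents and 520 real documents. The results showed that the function transform and FFT cluster XML documents effectively, incrementally and without supervision, based on the structure of their elements.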
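The pipeline described in the abstract (document to discrete function, FFT, weighted Euclidean distance, threshold test) can be illustrated with a short Python sketch. The abstract does not give the exact function transform, the vector layout, the weights, or the threshold, so everything specific here is an assumption made for illustration: the discrete function is taken to be the nesting depth of each element in document order, the vector is the list of normalized FFT magnitudes of that sequence zero-padded or truncated to 16 samples, the weights are uniform, and each cluster is represented by its first member. All function names are hypothetical.

```python
import cmath
import math
import xml.etree.ElementTree as ET


def depth_sequence(xml_text):
    """Assumed discrete function: nesting depth of each element in document order."""
    depths = []

    def walk(node, depth):
        depths.append(depth)
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 1)
    return depths


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def feature_vector(xml_text, size=16):
    """Pad or truncate the depth sequence to `size` samples, return FFT magnitudes."""
    seq = depth_sequence(xml_text)[:size]   # truncation is lossy; an assumption
    seq = seq + [0] * (size - len(seq))     # zero-pad short documents
    return [abs(c) / size for c in fft(seq)]


def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))


def cluster_incrementally(docs, threshold, size=16):
    """Assign each document to the first cluster whose representative is close enough."""
    weights = [1.0] * size                  # assumed uniform weights
    clusters = []                           # (representative vector, member indices)
    for index, doc in enumerate(docs):
        vec = feature_vector(doc, size)
        for representative, members in clusters:
            if weighted_distance(vec, representative, weights) < threshold:
                members.append(index)
                break
        else:
            clusters.append((vec, [index]))  # no match: start a new cluster
    return [members for _, members in clusters]


docs = [
    "<a><b><c/></b><b><c/></b></a>",
    "<x><y><z/></y><y><z/></y></x>",             # same shape, different tag names
    "<r><p/><p/><p/><p/><p/><p/><p/><p/></r>",   # flat shape
]
print(cluster_incrementally(docs, threshold=0.1))
```

Because only nesting depths enter the transform in this sketch, the first two documents map to the same vector and fall into one cluster, while the flat third document starts a cluster of its own. Documents are processed one at a time and no cluster count is fixed in advance, which is one way to read the "incremental and unsupervised" property claimed in the abstract.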
