- Volume 9 Issue 3
This paper proposes a new document clustering system that combines a fuzzy logic-based genetic algorithm (GA) with semantic vector expansion. The GA literature has long noted that a genetic algorithm's success depends on two factors: the diversity of the population and the ability to converge. We use fuzzy logic-based operators to adaptively balance these two factors. In traditional document clustering, the most popular and straightforward way to represent a document is the vector space model (VSM). However, this approach not only leads to a high-dimensional feature space but also ignores the semantic relationships between important words, which can reduce clustering accuracy. In this paper we use latent semantic analysis (LSA) to expand each document into a corresponding semantic vector that represents concepts rather than individual terms. This also drastically reduces the size of the vectors. We test our clustering algorithm on the 20 Newsgroups and Reuters collection data sets. The results show that our method outperforms the conventional GA across various document representations.
Clustering; Genetic Algorithm; Latent Semantic Analysis (LSA); Semantic Vector Expansion
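To make the LSA step in the abstract concrete, the following is a minimal sketch of how documents can be mapped from a term-based VSM representation to a much smaller latent semantic space using a truncated singular value decomposition. It is not the paper's implementation: the function names, the toy corpus, the use of raw term counts instead of a weighting scheme, and the choice of k = 2 are all assumptions made for illustration.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a raw term-count matrix (terms x documents).

    Assumption: simple lowercase whitespace tokenization, no stemming,
    stop-word removal, or tf-idf weighting.
    """
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, vocab


def lsa_document_vectors(A, k):
    """Project documents onto the top-k latent dimensions via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Row j of the result is document j expressed in the k-dim latent space.
    return (s[:k, None] * Vt[:k, :]).T


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy corpus: two documents about GAs, two about markets.
docs = [
    "genetic algorithm population crossover mutation",
    "genetic algorithm fitness selection mutation",
    "stock market shares trading price",
    "market price trading investors shares",
]

A, vocab = term_document_matrix(docs)    # one dimension per distinct term
vectors = lsa_document_vectors(A, k=2)   # one dimension per latent concept

same_topic = cosine(vectors[0], vectors[1])
cross_topic = cosine(vectors[0], vectors[2])
```

In this sketch each document shrinks from one dimension per vocabulary term to k latent dimensions, and documents on the same topic end up closer in cosine similarity than documents on different topics. The reduced vectors could then serve as the input representation for a clustering algorithm such as the GA described above.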