SVD-LDA: A Combined Model for Text Classification

  • Hai, Nguyen Cao Truong (School of Electronics and Computer Engineering, Chonnam National University)
  • Kim, Kyung-Im (School of Electronics and Computer Engineering, Chonnam National University)
  • Park, Hyuk-Ro (School of Electronics and Computer Engineering, Chonnam National University)
  • Published: 2009.03.31

Abstract

Text data has always accounted for a major portion of the world's information. As the volume of information grows exponentially, the volume of text data grows significantly as well, so text classification remains an important area of research. Latent Dirichlet Allocation (LDA) is a recent probabilistic model that has been applied in many fields. It also has many applications to text data, and various enhancements have been proposed for it. However, these applications appear to pay little attention to the input given to LDA. In this paper, we suggest a way to map the input space to a reduced space, which may avoid the unreliability, ambiguity, and redundancy of individual terms as descriptors. The purpose of this paper is to show that LDA can perform well in this "clean and clear" space. Experiments are conducted on the 20 Newsgroups data set. The results show that the proposed method can improve classification results when the rank of the reduced space is chosen appropriately.
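The abstract does not spell out how the input space is mapped to a reduced space. Going by the "SVD" in the title, one plausible reading is a truncated singular value decomposition of the term-document matrix, as used in latent semantic analysis. The sketch below illustrates only that general technique with NumPy. The toy matrix, the function names `low_rank_approximation` and `to_pseudo_counts`, and the clip-and-round step that turns the approximation back into non-negative counts are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np


def low_rank_approximation(term_doc: np.ndarray, rank: int) -> np.ndarray:
    """Return the best rank-`rank` approximation of `term_doc` (Frobenius norm)."""
    if not 1 <= rank <= min(term_doc.shape):
        raise ValueError("rank must be between 1 and min(term_doc.shape)")
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the `rank` largest singular values and their singular vectors.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def to_pseudo_counts(approx: np.ndarray) -> np.ndarray:
    """Clip negatives and round to integers (an assumption of this sketch).

    A count-based topic model expects non-negative counts, while a low-rank
    approximation can contain small negative or fractional entries.
    """
    return np.rint(np.clip(approx, 0.0, None)).astype(int)


# Hypothetical term-document count matrix: rows are terms, columns are documents.
counts = np.array(
    [
        [2, 1, 0, 0],
        [1, 2, 0, 0],
        [0, 0, 3, 1],
        [0, 0, 1, 2],
        [1, 0, 0, 1],
    ],
    dtype=float,
)

approx = low_rank_approximation(counts, rank=2)
pseudo_counts = to_pseudo_counts(approx)
```

The `rank` argument plays the role of the "rank of the reduced space" mentioned in the abstract: a small rank discards more of the term-level noise but also more information, so it would normally be tuned on held-out data. A matrix like `pseudo_counts` could then be handed to any LDA implementation that accepts a count matrix.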

Cited by

  1. Study on suitability and importance of multilayer extreme learning machine for classification of text data vol.21, pp.15, 2017, https://doi.org/10.1007/s00500-016-2189-8