- Volume 11 Issue 1
This paper discusses a new weighting method for text analysis from the viewpoint of supervised learning. The term frequency and inverse document frequency measure (tf-idf measure) is a well-known weighting method for information retrieval, and it can also be used for text analysis. However, it is an empirical weighting method for information retrieval whose effectiveness has not been clarified from a theoretical viewpoint. Therefore, a more effective weighting measure may be obtainable for document classification problems. In this study, we propose the optimal weighting method for document classification problems from the viewpoint of supervised learning. When training data are used, the proposed measure is better suited to the text classification problem than the tf-idf measure. The effectiveness of our proposal is demonstrated by simulation experiments on text classification problems using newspaper articles and customer reviews posted on a web site.
Text Classification; Weighting Method; Vector Space Model; Cosine Similarity
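
For orientation, the baseline named in the abstract (tf-idf weighting in a vector space model, compared by cosine similarity) can be sketched as below. This is a minimal illustration of one common tf-idf variant, raw term count times log(N/df), and not the weighting method proposed in the paper. The function names `tfidf_vectors` and `cosine_similarity` and the toy documents are assumptions made for this example only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two finance-like sentences and one product review.
docs = [
    "the market rose as stocks gained".split(),
    "stocks fell as the market dropped".split(),
    "the camera has a sharp lens".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # share "market", "as", "stocks"
sim_02 = cosine_similarity(vecs[0], vecs[2])  # share only "the", whose idf is 0
```

In this sketch a term that occurs in every document receives weight zero, so the first and third documents have zero similarity even though both contain "the". The weights are computed without reference to class labels, which is the gap a supervised weighting scheme is meant to address.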