Improving Multinomial Naive Bayes Text Classifier

다항시행접근 단순 베이지안 문서분류기의 개선 (Improvement of the Multinomial Naive Bayes Document Classifier)

  • Published : 2003.04.01

Abstract

Although naive Bayes text classifiers are widely used because of their simplicity, techniques for improving their performance have rarely been studied. In this paper, we propose and evaluate several general and effective techniques for improving the performance of the naive Bayes text classifier. We suggest document-model-based parameter estimation and document length normalization to alleviate the problems of the traditional multinomial approach to text classification. In addition, we propose a mutual-information-weighted naive Bayes text classifier to increase the effect of highly informative words. Our techniques are evaluated on the Reuters-21578 and 20 Newsgroups collections, and significant improvements are obtained over the existing multinomial naive Bayes approach.
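For orientation, the traditional multinomial approach that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of a standard multinomial naive Bayes classifier with additive (Laplace) smoothing, not the estimation, normalization, or weighting techniques proposed in the paper, whose details the abstract does not give. The function names, the `alpha` parameter, and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods.

    docs: list of token lists; labels: one class name per document.
    alpha: additive smoothing constant (hypothetical default of 1.0).
    """
    vocab = {w for doc in docs for w in doc}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)  # pooled term frequencies per class

    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Return the class with the highest posterior log-score."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in tokens:
            # Words never seen in training are skipped.
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
train_docs = [
    ["wheat", "grain", "export"],
    ["grain", "harvest", "wheat"],
    ["stock", "market", "shares"],
    ["shares", "profit", "market"],
]
train_labels = ["grain", "grain", "finance", "finance"]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
predicted = classify(["wheat", "harvest"], log_prior, log_likelihood)
```

In this sketch the log-score adds one term per token, so its magnitude grows with document length, and every word contributes with equal weight regardless of how informative it is. These are the kinds of properties that length normalization and word weighting are generally meant to address.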
