An Experimental Study on Feature Ranking Schemes for Text Classification

  • Received: 2023.02.01
  • Reviewed: 2023.03.17
  • Published: 2023.03.30

Abstract

This study examined in detail the performance of feature ranking schemes as an efficient feature selection method for text classification. Feature ranking schemes have so far been based mostly on document frequency, and relatively few have used term frequency. Therefore, the performance of single ranking schemes that apply term frequency or document frequency individually was examined first, and then the performance of combination ranking schemes that use both was reviewed. Specifically, classification experiments were conducted on two test collections (Reuters-21578, 20NG) with five classifiers (SVM, NB, ROC, TRA, RNN), and 5-fold cross-validation and t-tests were applied to ensure the reliability of the results. Among the single ranking schemes, the document-frequency-based scheme (chi) showed good performance overall. In addition, no significant performance difference was found between the best-performing single ranking scheme and the combination ranking schemes. Therefore, in an environment where sufficient training documents can be secured, it is more efficient to use the document-frequency-based single ranking scheme (chi) as the feature selection method for text classification.
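As an illustration of what a document-frequency-based ranking scheme such as chi computes, the following is a minimal sketch and not the implementation used in the study. It assumes the standard 2x2 chi-square statistic over document counts and takes the maximum score across classes as each term's global score; the abstract does not state which globalization policy was used. The function names `chi_square` and `rank_terms_by_chi` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class table of document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_terms_by_chi(docs, labels, top_k=None):
    """Rank terms by their maximum per-class chi-square score (an assumed policy)."""
    n_docs = len(docs)
    class_size = Counter(labels)
    df = Counter()                      # term -> number of documents containing it
    df_in_class = defaultdict(Counter)  # class -> term -> document count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):        # presence only: term frequency is ignored
            df[term] += 1
            df_in_class[label][term] += 1

    scores = {}
    for term, total_df in df.items():
        best = 0.0
        for label, size in class_size.items():
            a = df_in_class[label][term]
            b = total_df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical toy collection: two classes with two documents each.
docs = [
    ["oil", "price", "rises"],
    ["oil", "export", "price"],
    ["team", "wins", "match"],
    ["team", "match", "price"],
]
labels = ["econ", "econ", "sport", "sport"]
ranked = rank_terms_by_chi(docs, labels, top_k=3)
print(ranked)
```

In this toy collection, terms that occur in every document of one class and in none of the other ("oil", "team", "match") receive the highest score, while "price", which appears in both classes, ranks lower. A term-frequency-based scheme would instead count occurrences within documents rather than only presence.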
