A Study on the Relationship between Class Similarity and the Performance of Hierarchical Classification Method in a Text Document Classification Problem

  • Jang, Soojung (Graduate School (Big Data Analytics), Ewha Womans University)
  • Min, Daiki (School of Business, Ewha Womans University)
  • Received : 2020.07.22
  • Accepted : 2020.08.14
  • Published : 2020.08.31

Abstract

The literature reports that hierarchical classification methods generally outperform flat classification methods in multi-class classification of unstructured text documents. Whereas earlier studies constructed the class hierarchy themselves, this paper compares hierarchical and flat classification methods when the class hierarchy is predefined. We conducted numerical evaluations on two data sets: research papers on climate change adaptation technologies in the water sector, and the 20NewsGroup open data set. Contrary to the literature, the results show that the hierarchical classification method does not always outperform the flat classification method; it does so only under certain conditions. The relative performance of the two methods depends on the class similarities at the upper and lower levels of the class hierarchy. In particular, the hierarchical classification method performs better when the upper-level similarity is lower than the lower-level similarity, because misclassification at the upper level is reduced.
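
The contrast between the two approaches, and the idea of comparing class similarity at the upper and lower levels, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-level hierarchy, the toy documents, the nearest-centroid classifier on raw word counts (a stand-in, not the classifier evaluated in the paper), and the use of mean pairwise cosine similarity between class centroids as the similarity measure, which the abstract does not define.

```python
import math
from collections import Counter

# Hypothetical two-level hierarchy: parent class -> child (leaf) classes.
HIERARCHY = {"water": ["supply", "flood"], "sports": ["hockey", "baseball"]}

# Hypothetical toy training documents, keyed by leaf class.
TRAIN = {
    "supply": ["reservoir storage water supply drought", "water reuse supply pipeline"],
    "flood": ["flood levee water drainage", "storm flood water warning"],
    "hockey": ["hockey ice goal team", "hockey puck team season"],
    "baseball": ["baseball pitcher team inning", "baseball bat team season"],
}


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.split())


def centroid(docs):
    """Summed term counts over all documents of a class."""
    total = Counter()
    for doc in docs:
        total.update(vectorize(doc))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


LEAF_CENTROIDS = {c: centroid(docs) for c, docs in TRAIN.items()}
PARENT_CENTROIDS = {
    p: centroid([d for c in kids for d in TRAIN[c]]) for p, kids in HIERARCHY.items()
}


def nearest(doc, centroids):
    """Class whose centroid is most similar to the document."""
    vec = vectorize(doc)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


def predict_flat(doc):
    """Flat classification: choose among all leaf classes at once."""
    return nearest(doc, LEAF_CENTROIDS)


def predict_hierarchical(doc):
    """Top-down classification: choose the parent, then one of its children."""
    parent = nearest(doc, PARENT_CENTROIDS)
    return nearest(doc, {c: LEAF_CENTROIDS[c] for c in HIERARCHY[parent]})


def mean_pairwise_similarity(centroids):
    """Average cosine similarity over all pairs of class centroids."""
    names = sorted(centroids)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return sum(cosine(centroids[a], centroids[b]) for a, b in pairs) / len(pairs)


# Upper level: similarity between parent classes.
upper_similarity = mean_pairwise_similarity(PARENT_CENTROIDS)
# Lower level: similarity between sibling classes, averaged over parents.
lower_similarity = sum(
    mean_pairwise_similarity({c: LEAF_CENTROIDS[c] for c in kids})
    for kids in HIERARCHY.values()
) / len(HIERARCHY)

print(f"upper={upper_similarity:.3f} lower={lower_similarity:.3f}")
print(predict_flat("flood water storm"), predict_hierarchical("flood water storm"))
```

In this toy setup the parent classes share no vocabulary while sibling classes do, so the upper-level similarity is below the lower-level similarity, which is the condition under which the abstract reports that the hierarchical method performs better. The toy data only shows how the quantities could be computed; it says nothing about the paper's measured results.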
