Heterogeneous Ensemble of Classifiers from Under-Sampled and Over-Sampled Data for Imbalanced Data

Kang, Dae-Ki; Han, Min-gyu

  • Received: 2019.01.16
  • Accepted: 2019.01.30
  • Published: 2019.03.31

Abstract

The data imbalance problem is common and causes serious problems in the machine learning process. Sampling is one of the effective methods for addressing it. Over-sampling increases the number of instances, so on imbalanced data it is applied to the minority instances. Under-sampling reduces the number of instances and is usually performed on the majority data. We apply under-sampling and over-sampling to imbalanced data to generate sampled data sets. From these sampled data sets and the original data set, we construct a heterogeneous ensemble of classifiers. We apply five different algorithms in the heterogeneous ensemble. Experimental results on an intrusion detection data set, used as an example of imbalanced data, show that our approach is effective.
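The sampling steps described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not say which sampling procedures, which five algorithms, or which combination rule were used, so random sampling and plurality voting are assumptions here, and the function names (random_under_sample, random_over_sample, majority_vote) are hypothetical. The base classifiers themselves are left out; the vote function only takes their predicted labels.

```python
import random
from collections import Counter


def random_under_sample(X, y, majority_label, n_keep, seed=0):
    """Keep every non-majority instance and a random subset of n_keep majority instances.

    Assumption: plain random under-sampling; the paper may use another scheme.
    """
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    others = [i for i, label in enumerate(y) if label != majority_label]
    kept = rng.sample(majority, min(n_keep, len(majority)))
    index = sorted(others + kept)
    return [X[i] for i in index], [y[i] for i in index]


def random_over_sample(X, y, minority_label, n_target, seed=0):
    """Duplicate random minority instances until that class has n_target instances.

    Assumption: plain random over-sampling with replacement (not SMOTE).
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    if not minority:
        raise ValueError("no instances with the given minority label")
    n_extra = max(0, n_target - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    index = list(range(len(y))) + extra
    return [X[i] for i in index], [y[i] for i in index]


def majority_vote(predictions):
    """Combine label lists from several classifiers by plurality vote per instance.

    Assumption: simple voting stands in for whatever combiner the ensemble uses.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy data: 8 majority instances (label 0) and 2 minority instances (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_under_sample(X, y, majority_label=0, n_keep=2)
X_over, y_over = random_over_sample(X, y, minority_label=1, n_target=8)

# Predictions of three hypothetical classifiers on three test instances.
combined = majority_vote([[0, 1, 1], [0, 0, 1], [1, 1, 1]])
```

In this sketch, the under-sampled set, the over-sampled set, and the original set would each be passed to different learning algorithms, and the resulting predictions combined by the vote function.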

Keywords

Over-sampling; Under-sampling; Heterogeneous ensemble; Imbalanced data

Acknowledgement

Supported by: National Research Foundation of Korea (NRF)