Classification of large-scale data and data batch stream with forward stagewise algorithm

  • Yoon, Young Joo (Department of Business Information Statistics, Daejeon University)
  • Received : 2014.08.28
  • Accepted : 2014.10.02
  • Published : 2014.11.30

Abstract

In this paper, we propose a forward stagewise algorithm for data that are very large or arrive sequentially in batches over time. In such settings, ordinary boosting algorithms for large-scale data and data batch streams can be too greedy and perform poorly in the presence of class noise. To overcome these problems, we modify the forward stagewise algorithm so that it applies to large-scale data and data batch streams. On simulated and real data sets, the proposed algorithm gives better results than boosting algorithms for both large-scale data and data batch streams, with or without concept drift.

In this paper, we propose a forward stagewise algorithm for classifying data that are very large or arrive sequentially over time. The AdaBoost algorithm is known to perform poorly on noisy data. One way to address this is to use forward stagewise linear regression. We propose a method that applies the forward stagewise algorithm to overcome this problem for large-scale data and sequential batch data as well. Simulation studies and real data analyses show that the proposed algorithm performs well.
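
A minimal sketch may help readers place the method: the code below implements incremental forward stagewise linear regression in the style of Hastie et al. (2001), the base procedure the abstract builds on, plus one plausible warm-start loop for data arriving in sequential batches. This is our illustration, not the paper's exact modification; the names forward_stagewise, eps and n_steps are illustrative choices.

```python
import numpy as np

def forward_stagewise(X, y, beta=None, eps=0.01, n_steps=500):
    """Incremental forward stagewise linear regression (Hastie et al., 2001).

    Each step nudges the coefficient of the predictor most correlated with
    the current residual by a small amount eps. The tiny step size is what
    makes the procedure less greedy, and hence less sensitive to class
    noise, than ordinary boosting. Passing a previous beta warm-starts the
    fit, which is one simple way to reuse work across data batches.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    residual = y - X @ beta                  # current working residual
    for _ in range(n_steps):
        corr = X.T @ residual                # correlation of each predictor with the residual
        j = np.argmax(np.abs(corr))          # most correlated predictor
        step = eps * np.sign(corr[j])        # tiny step in its direction
        beta[j] += step
        residual -= step * X[:, j]           # keep the residual in sync
    return beta

# Toy batch-stream usage: labels coded -1/+1, batches arriving over time.
rng = np.random.default_rng(0)
beta = None
for t in range(5):                           # five simulated data batches
    X = rng.standard_normal((200, 10))
    y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200))
    beta = forward_stagewise(X, y, beta=beta)    # warm-start from the previous batch
    print(f"batch {t}: accuracy {np.mean(np.sign(X @ beta) == y):.3f}")
```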

References

  1. Bache, K. and Lichman, M. (2013). UCI machine learning repository [http://archive.ics.uci.edu/ml]. University of California, School of Information and Computer Science, Irvine, CA.
  2. Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
  3. Breiman, L. (1998). Arcing classifiers (with discussion). Annals of Statistics, 26, 801-849. https://doi.org/10.1214/aos/1024691079
  4. Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and regression trees, Chapman and Hall, New York, NY.
  5. Dietterich, T. G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139-157. https://doi.org/10.1023/A:1007607513941
  6. Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139. https://doi.org/10.1006/jcss.1997.1504
  7. Hastie, T., Tibshirani, R. and Friedman, J. (2001). The elements of statistical learning, Springer-Verlag, New York, NY.
  8. Kim, S. H., Cho, D. H. and Seok, K. H. (2012). Study on the ensemble methods with kernel ridge regression. Journal of the Korean Data & Information Science Society, 23, 375-383. https://doi.org/10.7465/jkdi.2012.23.2.375
  9. Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 202-207.
  10. Kuncheva, L. I. (2004). Classifier ensembles for changing environments. Proceedings of the 5th International Workshop on Multiple Classifier Systems, 1-15.
  11. Quinlan, J. R. (1993). C4.5: Programs for machine learning, Morgan Kaufmann, San Mateo, CA.
  12. Street, W. N. and Kim, Y. S. (2001). A streaming ensemble algorithm (SEA) for large-scale classification. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 377-382.
  13. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267-288.
  14. Wang, H., Fan, W., Yu, P. S. and Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 226-235.
  15. Yoon, Y. J. (2010). Boosting algorithms for large-scale data and data batch stream (in Korean). The Korean Journal of Applied Statistics, 23, 197-206. https://doi.org/10.5351/KJAS.2010.23.1.197