
Comparison of tree-based ensemble models for regression

  • Park, Sangho (Department of Statistics, Sungkyunkwan University)
  • Kim, Chanmin (Department of Statistics, Sungkyunkwan University)
  • Received : 2022.02.18
  • Accepted : 2022.06.29
  • Published : 2022.09.30

Abstract

Tree-based ensemble models, such as random forest (RF) and Bayesian additive regression trees (BART), are built by combining multiple classification and regression trees. In this study, we compare the model structures and performance of these ensemble models in regression settings. RF fits each tree to a bootstrapped sample and, at each node, selects the splitting variable from a random subset of the predictors. The BART model is specified as a sum of trees and is fitted with a Bayesian backfitting algorithm. Through extensive simulation studies, we investigate the strengths and drawbacks of the two methods in the presence of missing data, high-dimensional data, or highly correlated data. With missing data, BART performs well in general, whereas RF provides adequate coverage. BART also outperforms RF on high-dimensional, highly correlated data. In all of the scenarios considered, however, RF has the shorter computation time. The two methods are also compared on two real data sets that represent the aforementioned situations, and the same conclusions are reached.
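
To make the two model structures concrete, the following R sketch fits both methods to the same simulated regression data. It is a minimal illustration only: it uses the CRAN packages randomForest (Liaw and Wiener, 2002) and BART (Sparapani et al., 2021) cited below, and the simulated signal, train/test split, and all tuning settings are illustrative assumptions, not the simulation design used in the paper.

    # Minimal sketch: fit RF and BART on the same simulated regression data.
    # Assumes CRAN packages randomForest and BART; all settings are illustrative.
    library(randomForest)
    library(BART)

    set.seed(1)
    n <- 500; p <- 10
    X <- matrix(rnorm(n * p), n, p)
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 + rnorm(n)

    train <- 1:400; test <- 401:500

    # Random forest: each tree is grown on a bootstrap sample, and at each node
    # the split variable is chosen from a random subset of mtry predictors.
    rf_fit  <- randomForest(x = X[train, ], y = y[train],
                            ntree = 500, mtry = floor(p / 3))
    rf_pred <- predict(rf_fit, X[test, ])

    # BART: the response is modeled as a sum of small trees; posterior draws
    # are obtained by MCMC (Bayesian backfitting over the trees).
    bart_fit  <- wbart(x.train = X[train, ], y.train = y[train],
                       x.test = X[test, ])
    bart_pred <- bart_fit$yhat.test.mean  # posterior mean prediction

    # Compare test-set RMSE for the two methods.
    rmse <- function(pred) sqrt(mean((y[test] - pred)^2))
    c(RF = rmse(rf_pred), BART = rmse(bart_pred))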

Keywords

Acknowledgements

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (No. NRF-2020R1F1A1A01048168).

References

  1. Breiman L (2001). Random forests, Machine Learning, 45, 5-32. https://doi.org/10.1023/A:1010933404324
  2. Breiman L, Friedman JH, Olshen R, and Stone CJ (1984). Classification and Regression Trees, Routledge, New York.
  3. Bühlmann P and Van De Geer S (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer Science & Business Media, New York.
  4. Chipman HA, George EI, and McCulloch RE (1998). Bayesian CART model search, Journal of the American Statistical Association, 93, 935-948. https://doi.org/10.1080/01621459.1998.10473750
  5. Chipman HA, George EI, and McCulloch RE (2010). BART: Bayesian additive regression trees, The Annals of Applied Statistics, 4, 266-298. https://doi.org/10.1214/09-AOAS285
  6. Fox EW, Hill RA, Leibowitz SG, Olsen AR, Thornbrugh DJ, and Weber MH (2017). Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology, Environmental Monitoring and Assessment, 189, 1-20. https://doi.org/10.1007/s10661-016-5706-4
  7. Friedman JH (1991). Multivariate adaptive regression splines, The Annals of Statistics, 19, 1-141. https://doi.org/10.1214/aos/1176347963
  8. Gunduz N and Fokoue E (2015). Robust classification of high dimension low sample size data, arXiv:1501.00592.
  9. Hernandez B, Raftery AE, Pennington SR, and Parnell AC (2018). Bayesian additive regression trees using Bayesian model averaging, Statistics and Computing, 28, 869-890. https://doi.org/10.1007/s11222-017-9767-1
  10. Janitza S, Celik E, and Boulesteix AL (2018). A computationally fast variable importance test for random forests for high-dimensional data, Advances in Data Analysis and Classification, 12, 885-915. https://doi.org/10.1007/s11634-016-0276-4
  11. Kapelner A and Bleich J (2015). Prediction with missing data via Bayesian additive regression trees, Canadian Journal of Statistics, 43, 224-239. https://doi.org/10.1002/cjs.11248
  12. Kapelner A and Bleich J (2016). bartMachine: Machine learning with Bayesian additive regression trees, Journal of Statistical Software, 70, 1-40.
  13. Kern C, Klausch T, and Kreuter F (2019). Tree-based machine learning methods for survey research, Survey Research Methods, 13, 73-93.
  14. Kuhn M and Johnson K (2013). Applied Predictive Modeling, Springer, New York.
  15. Liaw A and Wiener M (2002). Classification and regression by randomForest, R News, 2, 18-22.
  16. Linero AR (2018). Bayesian regression trees for high-dimensional prediction and variable selection, Journal of the American Statistical Association, 113, 626-636. https://doi.org/10.1080/01621459.2016.1264957
  17. Rubin DB (1976). Inference and missing data, Biometrika, 63, 581-592. https://doi.org/10.1093/biomet/63.3.581
  18. Sparapani R, Spanbauer C, and McCulloch R (2021). Nonparametric machine learning and efficient computation with Bayesian additive regression trees: the BART R package, Journal of Statistical Software, 97, 1-66.
  19. Stekhoven DJ and Bühlmann P (2012). MissForest-non-parametric missing value imputation for mixed-type data, Bioinformatics, 28, 112-118. https://doi.org/10.1093/bioinformatics/btr597
  20. Strobl C, Boulesteix AL, Zeileis A, and Hothorn T (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution, BMC Bioinformatics, 8, 1-21. https://doi.org/10.1186/1471-2105-8-1
  21. Tang F and Ishwaran H (2017). Random forest missing data algorithms, Statistical Analysis and Data Mining, 10, 363-377. https://doi.org/10.1002/sam.11348
  22. Waldmann P (2016). Genome-wide prediction using Bayesian additive regression trees, Genetics Selection Evolution, 48, 1-12. https://doi.org/10.1186/s12711-016-0219-8
  23. Wright MN, Wager S, and Probst P (2020). ranger: A fast implementation of random forests, R package version 0.12.1.
  24. Zhang H, Zimmerman J, Nettleton D, and Nordman DJ (2019). Random forest prediction intervals, The American Statistician, 74, 392-406. https://doi.org/10.1080/00031305.2019.1585288