Publisher : Korean Data and Information Science Society
DOI : 10.7465/jkdi.2016.27.4.855
Title & Authors
A simple diagnostic statistic for determining the size of random forest
Park, Cheolyong
This study proposes a simple diagnostic statistic for determining the size of a random forest. The method is based on the MV (margin of victory), a scaled difference between the infinite-forest votes for the first and second most popular categories of the current random forest. A negative MV indicates a discrepancy between the current and infinite forests, because the category that wins in the current forest would then trail the current runner-up in the infinite forest. More precisely, the method is based on the proportion of cases in which -MV exceeds a small fixed positive number (say, 0.03). We derive an appropriate diagnostic statistic for the method and then calculate its distribution. A simulation study compares our method with a recently proposed diagnostic statistic.
Diagnostic statistic; margin of victory; random forest; size determination
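The idea in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the paper's statistic. It treats the infinite-forest vote shares as known (in practice they are unobserved, which is why a diagnostic statistic has to be derived), and it uses the raw difference in vote shares because the abstract does not give the scaling of MV. The function names `top_two`, `simulate_votes`, `margin_of_victory`, and `discrepancy_proportion` are hypothetical.

```python
import random


def top_two(votes):
    """Return the indices of the most and second most voted categories."""
    order = sorted(range(len(votes)), key=lambda k: votes[k], reverse=True)
    return order[0], order[1]


def simulate_votes(probs, n_trees, rng):
    """Simulate a forest of n_trees trees for one case.

    Assumption: each tree votes independently for category k with
    probability probs[k], the (assumed known) infinite-forest vote share.
    """
    votes = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=n_trees):
        votes[k] += 1
    return votes


def margin_of_victory(probs, votes):
    """Unscaled MV (assumption: the paper's scaling factor is omitted).

    Infinite-forest vote share of the current winner minus that of the
    current runner-up. A negative value means the current forest ranks
    its top two categories differently from the infinite forest.
    """
    first, second = top_two(votes)
    return probs[first] - probs[second]


def discrepancy_proportion(all_probs, n_trees, threshold=0.03, seed=0):
    """Proportion of cases in which -MV exceeds the threshold."""
    rng = random.Random(seed)
    flagged = 0
    for probs in all_probs:
        votes = simulate_votes(probs, n_trees, rng)
        if -margin_of_victory(probs, votes) > threshold:
            flagged += 1
    return flagged / len(all_probs)


if __name__ == "__main__":
    # Hypothetical data: 500 cases with close infinite-forest vote shares.
    cases = [[0.45, 0.40, 0.15]] * 500
    for n_trees in (10, 100, 1000):
        print(n_trees, discrepancy_proportion(cases, n_trees))
```

Under these assumptions, running the script for increasing `n_trees` shows how the flagged proportion changes with forest size, which is the quantity the abstract proposes to monitor when deciding whether the current forest is large enough.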