Evaluating Variable Selection Techniques for Multivariate Linear Regression

  • Ryu, Nahyeon (School of Industrial Management Engineering, Korea University)
  • Kim, Hyungseok (School of Industrial Management Engineering, Korea University)
  • Kang, Pilsung (School of Industrial Management Engineering, Korea University)
  • Received : 2016.06.10
  • Accepted : 2016.10.04
  • Published : 2016.10.15

Abstract

Variable selection techniques select a subset of relevant variables for a given learning algorithm in order to improve both the accuracy and the efficiency of the prediction model. We conduct an empirical analysis to evaluate and compare seven well-known variable selection techniques for the multiple linear regression model, one of the most commonly used regression models in practice. The techniques we apply are forward selection, backward elimination, stepwise selection, genetic algorithm (GA), ridge regression, lasso (Least Absolute Shrinkage and Selection Operator), and elastic net. In experiments on 49 regression data sets, GA yielded the lowest error rates, while lasso reduced the number of variables the most. In terms of computational efficiency, forward selection, backward elimination, and lasso required less time than the other techniques.
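
To illustrate how lasso can remove variables from a linear regression model, the following is a minimal sketch of lasso fitted by cyclic coordinate descent in plain Python. It is not the implementation used in the paper: the function names (`soft_threshold`, `lasso_coordinate_descent`), the penalty value `lam = 0.2`, and the synthetic data set (six candidate variables, of which only the first two affect the response) are all assumptions made for illustration.

```python
import random


def soft_threshold(rho, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept term)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b, since b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that excludes j's own fit
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


# Synthetic data (an assumption for this sketch): only x0 and x1 drive y.
rng = random.Random(0)
n_samples, n_features = 200, 6
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.2)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", [round(b, 3) for b in beta])
print("selected variables:", selected)
```

Because the L1 penalty sets small coefficients exactly to zero, the variables that remain in `selected` form the chosen subset. Ridge regression, by contrast, only shrinks coefficients toward zero, so it would need an additional thresholding rule to discard variables.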

Acknowledgement

Supported by : National Research Foundation of Korea
