Fast robust variable selection using VIF regression in large datasets

  • Seo, Han Son (Department of Applied Statistics, Konkuk University)
  • 서한손 (건국대학교 응용통계학과)
  • Received : 2018.05.02
  • Accepted : 2018.06.11
  • Published : 2018.08.31

Abstract

This article considers variable selection algorithms for linear regression models fitted to large datasets. Many algorithms have been proposed that focus on speed and robustness. Among them, variance inflation factor (VIF) regression is fast and accurate because it uses a streamwise regression approach. However, VIF regression is susceptible to outliers because it estimates the model by least squares. A robust criterion based on a weighted estimator has been proposed to make the algorithm robust, and a robust VIF regression has been proposed for the same purpose. This article suggests a fast and robust variable selection method that detects and removes potential outliers and then applies VIF regression. A simulation study and an analysis of a dataset compare the suggested method with other methods.
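The general idea of removing potential outliers before a streamwise selection pass can be illustrated with a minimal Python sketch. It is not the procedure of the article: the abstract does not specify the outlier detection rule, so a residual cutoff based on the median absolute deviation (MAD) is assumed here, and a plain one-pass forward selection with a fixed t-statistic threshold stands in for the VIF correction and the sequential testing of actual VIF regression. The function names (`streamwise_select`, `flag_outliers`, `robust_select`), the thresholds, and the simulated data are all hypothetical.

```python
import numpy as np


def streamwise_select(X, y, t_crit=3.0):
    """One pass over the columns: keep a column if its partial t-statistic,
    given the columns already kept, exceeds t_crit (assumed fixed threshold)."""
    n = X.shape[0]
    design = np.ones((n, 1))  # start from the intercept-only model
    selected = []
    for j in range(X.shape[1]):
        trial = np.column_stack([design, X[:, j]])
        beta, *_ = np.linalg.lstsq(trial, y, rcond=None)
        resid = y - trial @ beta
        sigma2 = resid @ resid / (n - trial.shape[1])
        cov = sigma2 * np.linalg.pinv(trial.T @ trial)
        t_stat = beta[-1] / np.sqrt(cov[-1, -1])
        if abs(t_stat) > t_crit:
            design = trial
            selected.append(j)
    return selected


def flag_outliers(X, y, cols, cutoff=3.0):
    """Flag rows whose least-squares residual lies more than `cutoff`
    MAD-based scale units from the median residual (assumed rule)."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    center = np.median(resid)
    scale = 1.4826 * np.median(np.abs(resid - center))
    return np.abs(resid - center) > cutoff * scale


def robust_select(X, y, t_crit=3.0, cutoff=3.0):
    """Preliminary selection, outlier removal, then selection on clean rows."""
    first_pass = streamwise_select(X, y, t_crit)
    keep = ~flag_outliers(X, y, first_pass, cutoff)
    return streamwise_select(X[keep], y[keep], t_crit), keep


# Simulated example: columns 0 and 3 are active, 5% of responses are shifted.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(size=n)
outlier_rows = rng.choice(n, size=25, replace=False)
y[outlier_rows] += 50.0

selected, keep = robust_select(X, y)
print("selected columns:", selected)
print("rows removed:", int((~keep).sum()))
```

Refitting the full model for every candidate, as this sketch does, is exactly the cost that VIF regression is designed to avoid, so the sketch shows only the order of operations and not the speed advantage.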
