Simultaneous outlier detection and variable selection via difference-based regression model and stochastic search variable selection

  • Park, Jong Suk (Department of Statistics, Kyungpook National University) ;
  • Park, Chun Gun (Department of Mathematics, Kyonggi University) ;
  • Lee, Kyeong Eun (Department of Statistics, Kyungpook National University)
  • Received : 2018.08.14
  • Accepted : 2019.02.08
  • Published : 2019.03.31

Abstract

In this article, we propose an approach to simultaneous variable selection and outlier detection. First, we identify candidate outliers using properties of the intercept estimator in a difference-based regression model, and we carry this information into the multiple regression model by adding mean-shift parameters. Second, we select the best model from the model that includes the outlier candidates as predictors, using stochastic search variable selection. Finally, we evaluate the method through simulations and real data analysis, which yield promising results. Further work is needed to make the estimates robust. We will also extend the method to the nonparametric regression model for simultaneous outlier detection and variable selection.
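As a rough illustration of the second step, the mean-shift formulation and a stochastic search prior can be written as follows. This is a minimal sketch in generic notation, not the notation of the article itself: the candidate set $C$, the indicator matrix $D_C$, the latent indicators $\delta_j$, and the hyperparameters $\tau_j$, $c_j$, $p_j$ are all assumptions introduced here, and the prior shown is the commonly used two-component normal mixture, which may differ from the authors' exact specification. The difference-based step that produces $C$ is not sketched.

```latex
% Mean-shift outlier model: each candidate outlier i in C gets its own
% shift parameter gamma_i (assumed notation, not taken from the article).
\[
  y = X\beta + D_C\,\gamma + \varepsilon,
  \qquad \varepsilon \sim N(0, \sigma^2 I_n),
\]
% D_C is the n x |C| matrix whose columns are the unit vectors e_i, i in C,
% so observation i is shifted by gamma_i if i is in C and by 0 otherwise.

% Stack all coefficients as theta = (beta^T, gamma^T)^T and attach a
% latent inclusion indicator delta_j to each component theta_j.
\[
  \theta_j \mid \delta_j \sim (1 - \delta_j)\, N(0, \tau_j^2)
    + \delta_j\, N(0, c_j^2 \tau_j^2),
  \qquad P(\delta_j = 1) = p_j,
\]
% with tau_j small (spike near zero) and c_j > 1 (diffuse slab).
```

Under this sketch, a predictor or a candidate outlier would be retained when the posterior probability $P(\delta_j = 1 \mid y)$ is high, so variables and outliers are selected by the same mechanism.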

Acknowledgement

Supported by : National Research Foundation of Korea (NRF)
