Identification of Regression Outliers Based on Clustering of LMS-residual Plots

  • Kim, Bu-Yong
  • Oh, Mi-Hyun
  • Published : 2004.12.01

Abstract

An algorithm is proposed to identify multiple outliers in linear regression. It is based on clustering the residuals from least median of squares (LMS) estimation. A cut-height criterion for the hierarchical cluster tree is suggested, which yields the optimal clustering of the regression outliers. The effectiveness of the procedures is compared on classic and artificial data sets, and the proposed algorithm is shown to be superior to the one based on least squares estimation. In particular, the proposed algorithm deals very well with the masking and swamping effects, while the least-squares-based algorithm does not.
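
The general idea can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract gives neither the clustering variables nor the cut-height criterion, so the sketch makes several assumptions. It handles only a single predictor, approximates the LMS fit by trying the line through every pair of points, applies single-linkage clustering to the raw residuals, and takes the cut height as a user-supplied value. The function names `lms_line`, `single_linkage_clusters`, and `flag_outliers` are hypothetical.

```python
from itertools import combinations
from statistics import median


def lms_line(x, y):
    """Approximate LMS fit for one predictor (assumption: candidate lines
    are those through each pair of points; keep the one with the smallest
    median squared residual)."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, no slope defined
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        crit = median((yk - intercept - slope * xk) ** 2
                      for xk, yk in zip(x, y))
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    return best[1], best[2]


def single_linkage_clusters(values, cut_height):
    """Single-linkage clustering of 1-D values, cut at cut_height.
    In one dimension this is the same as splitting the sorted values
    wherever the gap between neighbours exceeds cut_height."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > cut_height:
            clusters.append([cur])
        else:
            clusters[-1].append(cur)
    return clusters


def flag_outliers(x, y, cut_height):
    """Flag every point outside the largest residual cluster.
    cut_height is a hypothetical tuning value, not the paper's criterion."""
    intercept, slope = lms_line(x, y)
    residuals = [yk - intercept - slope * xk for xk, yk in zip(x, y)]
    clusters = single_linkage_clusters(residuals, cut_height)
    main = max(clusters, key=len)
    return sorted(k for c in clusters if c is not main for k in c)


# Noise-free toy data: seven points on y = 2x + 1, then three outliers
# grouped together at high x, the kind of group that can mask itself
# under a least squares fit.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3, 5, 7, 9, 11, 13, 15, 5, 5, 5]
print(flag_outliers(x, y, cut_height=3.0))  # [7, 8, 9]
```

Because the LMS line follows the majority of the points, the three outliers keep large residuals and fall into their own cluster. A least squares line would be pulled toward them, which is the masking effect the abstract refers to.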

Keywords

regression outlier; robust residual; clustering; masking; swamping

Cited by

  1. A Criterion for the Selection of Principal Components in the Robust Principal Component Regression vol.18, pp.6, 2011, https://doi.org/10.5351/CKSS.2011.18.6.761