Identification of Regression Outliers Based on Clustering of LMS-residual Plots

Kim, Bu-Yong; Oh, Mi-Hyun

  • Published : 2004.12.01


An algorithm is proposed for identifying multiple outliers in linear regression. It clusters the residuals obtained from least median of squares (LMS) estimation. A cut-height criterion for the hierarchical cluster tree is suggested, which yields the optimal clustering of the regression outliers. The effectiveness of the procedures is compared on classic data sets and artificial data sets, and the proposed algorithm is shown to be superior to the one based on least squares estimation. In particular, the proposed algorithm handles the masking and swamping effects very well, whereas the least squares based procedure does not.
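The general idea can be illustrated with a minimal sketch for a single predictor. It is not the authors' implementation. The abstract does not state the cut-height criterion, so the rule used here (mean plus `k` standard deviations of the merge heights) is only an assumed stand-in. The function names `lms_line`, `cluster_residuals`, and `flag_outliers` are hypothetical, as are the parameter `k` and the toy data. The LMS fit is approximated by scanning every line through two data points. The clustering is single linkage, which for one-dimensional residuals reduces to splitting the sorted residuals wherever a gap exceeds the cut height.

```python
from itertools import combinations
from statistics import mean, median, pstdev


def lms_line(x, y):
    """Approximate LMS fit of y = a + b*x by scanning all two-point lines."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # no finite slope through these two points
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        # LMS criterion: median of the squared residuals
        crit = median((yk - a - b * xk) ** 2 for xk, yk in zip(x, y))
        if best is None or crit < best[0]:
            best = (crit, a, b)
    return best[1], best[2]


def cluster_residuals(res, k=2.0):
    """Single-linkage clusters of 1-D residuals.

    In one dimension the merge heights of single linkage are the gaps
    between consecutive sorted values. The cut height (mean + k * sd of
    those gaps) is an assumed rule, not the criterion from the paper.
    """
    order = sorted(range(len(res)), key=lambda i: res[i])
    gaps = [res[order[t + 1]] - res[order[t]] for t in range(len(order) - 1)]
    cut = mean(gaps) + k * pstdev(gaps)
    clusters = [[order[0]]]
    for t, gap in enumerate(gaps):
        if gap > cut:
            clusters.append([])  # gap above the cut height starts a new cluster
        clusters[-1].append(order[t + 1])
    return clusters


def flag_outliers(x, y, k=2.0):
    """Indices outside the largest residual cluster (assumed to be the clean set)."""
    a, b = lms_line(x, y)
    res = [yk - a - b * xk for xk, yk in zip(x, y)]
    clusters = cluster_residuals(res, k)
    main = max(clusters, key=len)
    return sorted(i for c in clusters if c is not main for i in c)


# Toy data: nine points near y = 2 + 0.5x, then three shifted upward by 10.
x = list(range(12))
y = [2.1, 2.4, 3.05, 3.45, 4.0, 4.6, 4.9, 5.55, 5.95, 16.5, 17.0, 17.5]
print(flag_outliers(x, y))  # -> [9, 10, 11]
```

Treating the largest cluster as the clean set is a further simplifying assumption. It holds in this toy example because the LMS fit follows the majority of the points, so the shifted group ends up in its own cluster of large residuals.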


regression outlier; robust residual; clustering; masking; swamping


