Probabilistic penalized principal component analysis

• Park, Chongsun (Department of Statistics, Sungkyunkwan University)
• Wang, Morgan C. (Department of Statistics, University of Central Florida)
• Mo, Eun Bi (Department of Statistics, Sungkyunkwan University)
• Accepted : 2017.02.25
• Published : 2017.03.31

Abstract

A variable selection method based on probabilistic principal component analysis (PCA) with a penalized likelihood is proposed. The method reduces variables in two steps: the first uses the probabilistic principal component framework to identify principal components, and the second applies a penalty function to identify the important variables in each component. Because dimension reduction is achieved by identifying important observed variables, we then build a model on the original data space rather than on the rotated data space defined by latent variables (principal components). Consequently, the proposed method is of more practical use than approaches that model the rotated components. With a proper choice of regularization parameters, the proposed estimators perform as well as the oracle procedure and are root-n consistent. The proposed method can be successfully applied to high-dimensional PCA problems in which a relatively large portion of the variables in the data set are irrelevant. Because the method is likelihood-based, it extends in a straightforward way to problems with missing observations through EM algorithms, and it can therefore be applied effectively when some data vectors have one or more values missing at random.
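
The two-step idea can be illustrated with a minimal Python sketch. It is not the estimator proposed in the paper: it assumes the well-known closed-form maximum likelihood solution for the probabilistic PCA loading matrix and then, as a simplification, soft-thresholds those loadings once instead of maximizing a penalized likelihood. The function names (`ppca_ml`, `soft_threshold`, `select_variables`) and the threshold `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form ML estimates for probabilistic PCA with q components.

    Returns the d x q loading matrix W (up to rotation) and the
    isotropic noise variance sigma2.
    """
    Xc = X - X.mean(axis=0)
    n, d = Xc.shape
    S = Xc.T @ Xc / n                        # ML sample covariance
    eigval, eigvec = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]         # reorder to descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    sigma2 = eigval[q:].mean()               # mean of the discarded eigenvalues
    scale = np.sqrt(np.maximum(eigval[:q] - sigma2, 0.0))
    W = eigvec[:, :q] * scale                # scale each retained eigenvector
    return W, sigma2

def soft_threshold(W, lam):
    """Shrink every loading toward zero by lam (an L1-type shrinkage step)."""
    return np.sign(W) * np.maximum(np.abs(W) - lam, 0.0)

def select_variables(X, q, lam):
    """Assumed simplification: threshold the ML loadings once and keep the
    observed variables that retain a nonzero loading on any component."""
    W, sigma2 = ppca_ml(X, q)
    W_sparse = soft_threshold(W, lam)
    keep = np.flatnonzero(np.abs(W_sparse).sum(axis=1) > 0)
    return keep, W_sparse, sigma2

if __name__ == "__main__":
    # Simulated data: 3 variables driven by one latent factor, 7 pure noise.
    rng = np.random.default_rng(0)
    z = rng.normal(size=(500, 1))
    signal = z @ np.array([[2.0, -2.0, 2.0]])
    X = np.hstack([signal, np.zeros((500, 7))]) + 0.3 * rng.normal(size=(500, 10))
    keep, W_sparse, sigma2 = select_variables(X, q=1, lam=0.5)
    print("selected variable indices:", keep)
```

A downstream model would then be fitted on `X[:, keep]`, that is, on the selected observed variables in the original data space rather than on component scores. The threshold `lam` plays the role of a regularization parameter and would need to be tuned in practice.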
