Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data

- Journal title : Communications for Statistical Applications and Methods
- Volume 22, Issue 6, 2015, pp.575-587
- Publisher : The Korean Statistical Society
- DOI : 10.5351/CSAM.2015.22.6.575

Title & Authors

Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data

Mehmood, Tahir; Rasheed, Zahid;

Abstract

Advances in data collection techniques produce high dimensional data sets, in which discrimination is an important and commonly encountered problem, and one that is crucial to resolve when the data are heterogeneous (the classes do not share a common variance-covariance structure). An example is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for studying evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high dimensional data sets. We introduce regularized elimination in partial least squares coupled with quadratic discriminant analysis (rePLS-QDA) for parsimonious variable selection and classification of high dimensional heterogeneous data sets. The procedure combines the recently introduced regularized elimination for variable selection in partial least squares (rePLS) with quadratic discriminant analysis (QDA), a classification procedure that accommodates heterogeneity. The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory, ranging from 89.1% to 100% on test data. Codon/bi-codon usages of interest, and the mutual interactions among them that influence the respective habitat preferences, are identified. The results of the proposed method also concur with known biological characteristics, which will help researchers better understand the divergence of species.
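
To illustrate why heterogeneity matters for classification, the following is a minimal sketch of the QDA step alone; it is not the authors' rePLS-QDA implementation. It assumes only two features and a toy two-class data set invented for illustration, and the function names (`fit_class`, `qda_score`, `fit_qda`, `predict`) are hypothetical. Each class gets its own mean and covariance matrix, and a point is assigned to the class with the largest quadratic discriminant score.

```python
import math


def fit_class(rows):
    """Return the mean vector and the sample covariance (sxx, sxy, syy) of one class."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def qda_score(x, mean, cov, prior):
    """Quadratic discriminant: -0.5*log|S| - 0.5*(x-m)' S^-1 (x-m) + log(prior)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance using the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)


def fit_qda(data):
    """data maps label -> list of (x1, x2); returns per-class (mean, cov, prior)."""
    total = sum(len(rows) for rows in data.values())
    model = {}
    for label, rows in data.items():
        mean, cov = fit_class(rows)
        model[label] = (mean, cov, len(rows) / total)
    return model


def predict(model, x):
    """Assign x to the class with the largest discriminant score."""
    return max(model, key=lambda label: qda_score(x, *model[label]))


# Toy data (an assumption for illustration): similar means, very different spread.
train = {
    "tight": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2), (-0.2, -0.1)],
    "wide": [(3.0, 3.0), (-3.0, 2.0), (2.0, -3.0), (-2.0, -2.0), (0.0, 4.0)],
}
model = fit_qda(train)
print(predict(model, (0.05, 0.05)))
print(predict(model, (2.5, -2.0)))
```

In this toy data the two classes have nearly the same mean and differ mainly in spread, so a rule that assumes a common covariance structure has little to separate them, while the class-specific covariances do. The rePLS variable selection step that the abstract couples with QDA is not shown here.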

Keywords

partial least squares; classification; variable selection; parsimonious model; high dimensional data sets; identification; multicollinearity; microbial

Language

English
