Multivariate Procedure for Variable Selection and Classification of High Dimensional Heterogeneous Data
Mehmood, Tahir; Rasheed, Zahid
Advances in data collection techniques produce high dimensional data sets, in which discrimination is an important and commonly encountered problem that is crucial to resolve when the high dimensional data are heterogeneous (the classes do not share a common variance-covariance structure). An example is the classification of microbial habitat preferences based on codon/bi-codon usage. Habitat preference is important for studying evolutionary genetic relationships and may help industry produce specific enzymes. Most classification procedures assume homogeneity (a common variance-covariance structure for all classes), which is not guaranteed in most high dimensional data sets. We introduce regularized elimination in partial least squares coupled with QDA (rePLS-QDA) for parsimonious variable selection and classification of high dimensional heterogeneous data sets. It combines the recently introduced regularized elimination for variable selection in partial least squares (rePLS) with quadratic discriminant analysis (QDA), a classification procedure for heterogeneous data. The proposed and existing methods are compared on a simulated data set; in addition, the proposed procedure is applied to classify microbial habitat preferences by their codon/bi-codon usage. Five bacterial habitats (Aquatic, Host Associated, Multiple, Specialized and Terrestrial) are modeled. The classification accuracy for each habitat is satisfactory and ranges from 89.1% to 100% on test data. Interesting codon/bi-codon usages, and their mutual interactions that influence the respective habitat preference, are identified. The proposed method also produced results that concur with known biological characteristics, which will help researchers better understand the divergence of species.
partial least squares; classification; variable selection; parsimonious model; high dimensional data sets; identification; multicollinearity; microbial