Bayesian bi-level variable selection for genome-wide survival study

  • Eunjee Lee (Department of Information and Statistics, Chungnam National University)
  • Joseph G. Ibrahim (Department of Biostatistics, University of North Carolina)
  • Hongtu Zhu (Department of Biostatistics, University of North Carolina)
  • Received : 2023.06.13
  • Accepted : 2023.06.27
  • Published : 2023.09.30

Abstract

Mild cognitive impairment (MCI) is a clinical syndrome characterized by the onset and evolution of cognitive impairments, and it is often considered a transitional stage to Alzheimer's disease (AD). Identifying the genetic traits of MCI patients who progress rapidly to AD can improve early diagnosis and facilitate drug discovery for AD. While a genome-wide association study (GWAS) is a standard tool for identifying single nucleotide polymorphisms (SNPs) related to a disease, it fails to detect SNPs with small effect sizes because of stringent control for multiple testing. Moreover, GWAS does not consider the group structures of SNPs, such as genes or linkage disequilibrium blocks, which can provide valuable insights into the genetic architecture. To address these limitations, we propose a Bayesian bi-level variable selection method that detects SNPs associated with the time of conversion from MCI to AD. Our approach integrates group inclusion indicators into an accelerated failure time model to identify important SNP groups. We employ data augmentation to impute censored time values from the posterior predictive distribution, and we adapt Dirichlet-Laplace shrinkage priors to incorporate the group structure for SNP-level variable selection. In the simulation study, our method outperformed competing methods in variable selection. The analysis of Alzheimer's Disease Neuroimaging Initiative (ADNI) data revealed several genes directly or indirectly related to AD, whereas a classical GWAS did not identify any significant SNPs.
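
The data augmentation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes an accelerated failure time model with normal errors on the log-time scale, so that each right-censored log-time is redrawn from a normal distribution truncated below at its censoring point. The error distribution, the function name `impute_censored_log_times`, and all argument names are assumptions made for illustration; the abstract does not specify these details, and the authors' sampler may differ.

```python
import random
from statistics import NormalDist


def impute_censored_log_times(log_obs, censored, mean, sigma, rng):
    """One data-augmentation draw (illustrative, assumes normal errors).

    Each right-censored log-time is replaced by a sample from
    N(mean_i, sigma^2) truncated to [log_obs_i, infinity); observed
    log-times are returned unchanged.
    """
    std_normal = NormalDist()
    imputed = []
    for y, is_censored, mu in zip(log_obs, censored, mean):
        if not is_censored:
            imputed.append(y)  # event observed: keep the recorded log-time
            continue
        # Inverse-CDF sampling restricted to the tail beyond the censoring point.
        lower = std_normal.cdf((y - mu) / sigma)
        u = lower + (1.0 - lower) * rng.random()
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
        draw = mu + sigma * std_normal.inv_cdf(u)
        imputed.append(max(draw, y))  # guard against rounding below the bound
    return imputed


# Hypothetical toy data: three subjects, the last two right-censored.
rng = random.Random(0)
log_obs = [1.0, 2.0, 0.5]
censored = [False, True, True]
linear_predictor = [1.2, 1.5, 1.0]  # stands in for the current x_i' beta
completed = impute_censored_log_times(log_obs, censored, linear_predictor, 0.8, rng)
```

Within a Gibbs sampler, a draw like this would alternate with updates of the regression coefficients and the error scale, so that the remaining updates can treat the completed log-times as fully observed.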

Acknowledgement

This material was based on work partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. NRF-2022M3J6A1084843, No. NRF-2021R1C1C1013936). This work was also partially supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant (No. 2020-0-01441, No. RS-2022-00155857, Artificial Intelligence Convergence Research Center (Chungnam National University)). Part of this study has been published as a PhD thesis by the first author under the supervision of the co-authors (Lee E. Advanced Bayesian models for high-dimensional biomedical data. Ph.D. Dissertation. Chapel Hill: The University of North Carolina, 2016).
