Iterative integrated imputation for missing data and pathway models with applications to breast cancer subtypes

  • Linder, Henry (Department of Statistics, University of Connecticut)
  • Zhang, Yuping (Department of Statistics, University of Connecticut)
  • Received : 2019.04.30
  • Reviewed : 2019.06.16
  • Published : 2019.07.31

Abstract

Tumor development is driven by complex combinations of biological elements. Recent advances suggest that molecularly distinct subtypes of breast cancers may respond differently to pathway-targeted therapies. Thus, it is important to dissect pathway disturbances by integrating multiple molecular profiles, such as genetic, genomic and epigenomic data. However, missing data are often present in the -omic profiles of interest. Motivated by genomic data integration and imputation, we present a new statistical framework for pathway significance analysis. Specifically, we develop a new strategy for imputation of missing data in large-scale genomic studies, which adapts low-rank, structured matrix completion. Our iterative strategy enables us to impute missing data in complex configurations across multiple data platforms. In turn, we perform large-scale pathway analysis integrating gene expression, copy number, and methylation data. The advantages of the proposed statistical framework are demonstrated through simulations and real applications to breast cancer subtypes. We demonstrate superior power to identify pathway disturbances, compared with other imputation strategies. We also identify differential pathway activity across different breast tumor subtypes.
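
The idea of iteratively imputing missing entries under a low-rank assumption can be illustrated with a small sketch. This is not the authors' algorithm, which adapts structured matrix completion across multiple platforms; it is a generic "hard-impute"-style loop restricted to rank 1, written in plain Python on a toy matrix. The function names `rank1_approx` and `iterative_impute`, the use of `None` for missing values, and all parameter defaults are assumptions made for illustration.

```python
import math


def rank1_approx(matrix, n_iter=100):
    """Best rank-1 approximation of a dense matrix via power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Uniform unit start vector: adequate for this toy example, but a start
    # orthogonal to the leading singular vector would not converge.
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [[0.0] * n_cols for _ in range(n_rows)]
        v = [x / norm for x in w]
    # With unit-norm v, (M v) v^T is the rank-1 approximation.
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


def iterative_impute(data, n_rounds=100, tol=1e-9):
    """Fill None entries by alternating a rank-1 fit with re-imputation.

    Illustrative assumption: the complete matrix is (close to) rank 1.
    Observed entries are never modified.
    """
    n_rows, n_cols = len(data), len(data[0])
    missing = [(i, j) for i in range(n_rows) for j in range(n_cols)
               if data[i][j] is None]
    filled = [list(row) for row in data]
    # Initialise each missing entry with the mean of the observed column values.
    for j in range(n_cols):
        observed = [data[i][j] for i in range(n_rows) if data[i][j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        for i in range(n_rows):
            if data[i][j] is None:
                filled[i][j] = mean
    for _ in range(n_rounds):
        approx = rank1_approx(filled)
        change = 0.0
        for i, j in missing:
            change = max(change, abs(approx[i][j] - filled[i][j]))
            filled[i][j] = approx[i][j]  # overwrite only the missing cells
        if change < tol:
            break
    return filled


# Toy example: outer product of (1, 2, 3) and (1, 2, 4) with one entry hidden.
toy = [[None, 2.0, 4.0],
       [2.0, 4.0, 8.0],
       [3.0, 6.0, 12.0]]
completed = iterative_impute(toy)
print(round(completed[0][0], 3))  # should be close to the hidden value 1.0
```

On real multi-platform data, a higher rank, regularisation, and a numerical linear algebra library would be needed; the sketch only shows the alternate-and-refill pattern.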
