Iterative integrated imputation for missing data and pathway models with applications to breast cancer subtypes

  • Linder, Henry (Department of Statistics, University of Connecticut)
  • Zhang, Yuping (Department of Statistics, University of Connecticut)
  • Received: 2019.04.30
  • Accepted: 2019.06.16
  • Published: 2019.07.31


Tumor development is driven by complex combinations of biological elements. Recent advances suggest that molecularly distinct subtypes of breast cancer may respond differently to pathway-targeted therapies. It is therefore important to dissect pathway disturbances by integrating multiple molecular profiles, such as genetic, genomic, and epigenomic data. However, the -omic profiles of interest often contain missing data. Motivated by genomic data integration and imputation, we present a new statistical framework for pathway significance analysis. Specifically, we develop a new strategy for imputing missing data in large-scale genomic studies, which adapts low-rank, structured matrix completion. Our iterative strategy enables us to impute missing data in complex configurations across multiple data platforms. In turn, we perform large-scale pathway analysis integrating gene expression, copy number, and methylation data. We demonstrate the advantages of the proposed framework through simulations and applications to real data on breast cancer subtypes. Compared with other imputation strategies, our approach shows superior power to identify pathway disturbances. We also identify differential pathway activity across breast tumor subtypes.
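
To make the idea of iterative low-rank imputation concrete, the following is a minimal sketch of a generic scheme that alternates between a truncated SVD and re-filling the missing entries. It is not the structured, multi-platform method proposed in the paper; the function name `iterative_lowrank_impute`, its parameters, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def iterative_lowrank_impute(X, rank=2, n_iter=200, tol=1e-8):
    """Fill NaN entries of X by alternating a rank-truncated SVD with re-imputation.

    Hypothetical helper for illustration only; a generic low-rank scheme,
    not the structured matrix completion used in the paper.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialize missing entries with the column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    for _ in range(n_iter):
        # Best rank-`rank` approximation of the current completed matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries fixed; update only the missing ones.
        updated = np.where(missing, low_rank, X)
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break
    return filled


# Simulated example (assumed data): an exactly rank-2 matrix with about 10% missing.
rng = np.random.default_rng(0)
truth = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 20))
mask = rng.random(truth.shape) < 0.1
observed = truth.copy()
observed[mask] = np.nan

imputed = iterative_lowrank_impute(observed, rank=2)

# Compare against plain column-mean imputation on the held-out entries.
mean_filled = np.where(mask, np.nanmean(observed, axis=0), observed)
err_lowrank = np.sqrt(np.mean((imputed[mask] - truth[mask]) ** 2))
err_mean = np.sqrt(np.mean((mean_filled[mask] - truth[mask]) ** 2))
```

In this toy setting the observed entries are left untouched and only the missing cells are updated at each pass. The rank is treated as known here, which would not hold for real expression, copy number, or methylation matrices.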

