Biological Feature Selection and Disease Gene Identification using New Stepwise Random Forests

  • Hwang, Wook-Yeon (College of Global Business, Dong-A University)
  • Received: 2016.08.13
  • Accepted: 2017.01.09
  • Published: 2017.03.30

Abstract

Identifying disease genes in the human genome is a critical task in biomedical research. The biological features used to distinguish disease genes from non-disease genes have mainly been selected with traditional feature selection approaches. However, these approaches retain many unimportant biological features. As a result, although several existing classification techniques have been applied to disease gene identification, their prediction performance has not been satisfactory. A small set of the most important biological features can improve the accuracy of disease gene identification and can also provide potentially useful knowledge for biologists and clinicians, who can further investigate both the selected features and the candidate disease genes. In this paper, we propose a new stepwise random forests (SRF) approach for biological feature selection and disease gene identification. The SRF approach consists of two stages. In the first stage, important biological features are selected iteratively in a forward selection manner based on one-dimensional random forest regression, where the updated residual vector serves as the current response vector. This stage yields a small set of important biological features. In the second stage, random forests classification on the selected biological features is applied to identify disease genes. Our extensive experiments show that the proposed SRF approach outperforms existing feature selection and classification techniques in both biological feature selection and disease gene identification.
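The first stage can be illustrated with a minimal sketch. It is not the paper's implementation, because the abstract does not give the stopping rule, the number of trees, or the tree depth. As stated assumptions, the sketch substitutes a bagged ensemble of one-split regression stumps for one-dimensional random forest regression, stops after a fixed number of features, and scores each candidate by in-sample squared error. The names `fit_stump`, `fit_bagged_stumps`, and `stepwise_select` are hypothetical, and the data are synthetic.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression stump on a single feature; return a predictor."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    best = None  # (score, threshold, left_mean, right_mean)
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing squared error.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best is None or score > best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (score, threshold, left_sum / i, right_sum / (n - i))
    if best is None:  # constant feature: predict the mean
        m = total / n
        return lambda v: m
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def fit_bagged_stumps(x, y, n_trees, rng):
    """Average stumps fitted on bootstrap samples (assumed stand-in for a 1-D forest)."""
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda v: sum(s(v) for s in stumps) / n_trees


def stepwise_select(X, y, max_features, n_trees=25, seed=0):
    """Forward selection: each step fits one feature to the current residual."""
    rng = random.Random(seed)
    n_cols = len(X[0])
    residual = [float(v) for v in y]  # the response for the first step is y itself
    selected = []
    for _ in range(max_features):  # assumed stopping rule: fixed feature budget
        best = None  # (squared error, feature index, predictions)
        for j in range(n_cols):
            if j in selected:
                continue
            col = [row[j] for row in X]
            model = fit_bagged_stumps(col, residual, n_trees, rng)
            pred = [model(v) for v in col]
            sse = sum((r - q) ** 2 for r, q in zip(residual, pred))
            if best is None or sse < best[0]:
                best = (sse, j, pred)
        _, j, pred = best
        selected.append(j)
        # The updated residual becomes the response vector for the next step.
        residual = [r - q for r, q in zip(residual, pred)]
    return selected


# Synthetic example: only column 2 determines the label.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(4)] for _ in range(80)]
y = [1.0 if row[2] > 0.5 else 0.0 for row in X]
selected = stepwise_select(X, y, max_features=2)
print(selected)
```

In the second stage, the columns indexed by `selected` would be passed to a random forests classifier. That step is omitted here because it depends on the library used.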

Keywords

Bioinformatics; Classification; Feature Evaluation and Selection; Modeling and Prediction

Acknowledgement

Supported by: Dong-A University
