Biological Feature Selection and Disease Gene Identification using New Stepwise Random Forests

  • Hwang, Wook-Yeon (College of Global Business, Dong-A University)
  • Received : 2016.08.13
  • Accepted : 2017.01.09
  • Published : 2017.03.30


Identifying disease genes in the human genome is a critical task in biomedical research. The biological features that distinguish disease genes from non-disease genes have mostly been selected with traditional feature selection approaches, which retain many unimportant features. As a result, although several existing classification techniques have been applied to disease gene identification, their prediction performance has not been satisfactory. A small set of the most important biological features can improve the accuracy of disease gene identification. It can also give biologists and clinicians useful knowledge, since they can further investigate both the selected features and the potential disease genes. In this paper, we propose a new stepwise random forests (SRF) approach for biological feature selection and disease gene identification. The SRF approach consists of two stages. In the first stage, important biological features are selected iteratively in a forward selection manner using one-dimensional random forest regression, in which the updated residual vector serves as the current response vector. This yields a small set of important biological features. In the second stage, random forests classification on the selected features is applied to identify disease genes. Our extensive experiments show that the proposed SRF approach outperforms existing feature selection and classification techniques in both biological feature selection and disease gene identification.
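The first stage described above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not the authors' implementation. Each one-dimensional random forest is approximated by bagged shallow regression trees on a single feature. The response is centered before the first step. The stopping rule (stop when the best candidate reduces the residual sum of squares by less than `min_gain`) is hypothetical, since the abstract does not state one. All function names, parameter values, and the synthetic data are illustrative.

```python
import random

def fit_tree(xs, ys, depth, min_leaf):
    """Fit a one-dimensional regression tree; a leaf is a float, a node a tuple."""
    n = len(xs)
    mean = sum(ys) / n
    if depth == 0 or n < 2 * min_leaf:
        return mean
    order = sorted(range(n), key=lambda i: xs[i])
    sx = [xs[i] for i in order]
    sy = [ys[i] for i in order]
    total = sum(sy)
    best_gain, best_k, left_sum = 0.0, None, 0.0
    for k in range(1, n):
        left_sum += sy[k - 1]
        if k < min_leaf or n - k < min_leaf or sx[k] == sx[k - 1]:
            continue
        # Reduction in squared error obtained by splitting between k-1 and k.
        gain = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k) - total ** 2 / n
        if gain > best_gain:
            best_gain, best_k = gain, k
    if best_k is None:
        return mean
    threshold = (sx[best_k - 1] + sx[best_k]) / 2
    return (threshold,
            fit_tree(sx[:best_k], sy[:best_k], depth - 1, min_leaf),
            fit_tree(sx[best_k:], sy[best_k:], depth - 1, min_leaf))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree

def forest_predict_1d(xs, ys, rng, n_trees=25, depth=2, min_leaf=20):
    """Average the predictions of trees fitted on bootstrap samples of one feature."""
    n = len(xs)
    preds = [0.0] * n
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        tree = fit_tree([xs[i] for i in idx], [ys[i] for i in idx], depth, min_leaf)
        for i in range(n):
            preds[i] += predict_tree(tree, xs[i]) / n_trees
    return preds

def stepwise_select(X, y, max_features=5, min_gain=0.10, seed=0):
    """Forward selection: each step regresses the current residual on one feature."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    mean_y = sum(y) / n
    residual = [v - mean_y for v in y]  # assumption: start from the centered response
    rss = sum(r * r for r in residual)
    selected = []
    while len(selected) < max_features:
        best = None
        for j in range(p):
            if j in selected:
                continue
            preds = forest_predict_1d(columns[j], residual, rng)
            new_res = [r - f for r, f in zip(residual, preds)]
            new_rss = sum(r * r for r in new_res)
            if best is None or new_rss < best[0]:
                best = (new_rss, j, new_res)
        # Hypothetical stopping rule: require a relative RSS reduction of min_gain.
        if best is None or rss - best[0] < min_gain * rss:
            break
        rss, residual = best[0], best[2]  # the updated residual is the next response
        selected.append(best[1])
    return selected

def make_data(n=400, p=6, seed=1):
    """Synthetic data: only features 0 and 1 determine the binary label."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [1.0 if 2 * row[0] + 1.5 * row[1] > 0 else 0.0 for row in X]
    return X, y

X, y = make_data()
selected = stepwise_select(X, y)
print("selected feature indices:", selected)
```

The second stage is not shown. Under the same assumptions, it would train a random forest classifier on only the columns listed in `selected` and use it to label candidate genes.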


Bioinformatics; Classification; Feature Evaluation and Selection; Modeling and Prediction


Supported by : Dong-A University

