An enhanced feature selection filter for classification of microarray cancer data

  • Mazumder, Dilwar Hussain (Department of Computer Science and Engineering, National Institute of Technology Nagaland)
  • Veilumuthu, Ramachandran (Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology)
  • Received : 2018.09.19
  • Accepted : 2018.12.24
  • Published : 2019.06.03


The main aim of this study is to select, from microarray cancer datasets, the optimal set of genes that contribute to the prediction of specific cancer types. This study proposes an enhanced feature selection filter algorithm based on Joe's normalized mutual information and applies it to gene selection. The proposed algorithm is implemented and evaluated on seven benchmark microarray cancer datasets, namely, central nervous system, leukemia (binary), leukemia (3 class), leukemia (4 class), lymphoma, mixed lineage leukemia, and small round blue cell tumor, using five well-known classifiers: naive Bayes, radial basis function network, instance-based classifier, decision table, and decision tree. Averaged over all five classifiers and all seven datasets, the proposed filter increases prediction accuracy by 5.1% and reduces training time by 2.86 seconds. Its performance is also compared with that of three other popular mutual information-based feature selection filters, namely, information gain, gain ratio, and symmetric uncertainty; the proposed filter compares favorably with all three across every classifier and dataset.
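The scoring step behind such a filter can be sketched briefly. Joe (1989) maps mutual information I(X;Y) into [0, 1) via the normalization δ* = √(1 − e^(−2I)), so genes can be ranked by how strongly their (discretized) expression associates with the class label. The sketch below is illustrative only, assuming pre-discretized expression values; the function names and toy data are not from the paper, and the paper's actual algorithm includes enhancements beyond plain ranking.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # (c/n) * log( (c/n) / ((px/n)*(py/n)) ) simplified to avoid tiny ratios
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return max(mi, 0.0)

def joe_nmi(xs, ys):
    """Joe's normalized measure delta* = sqrt(1 - exp(-2*I)), bounded in [0, 1)."""
    return math.sqrt(1.0 - math.exp(-2.0 * mutual_information(xs, ys)))

def rank_genes(genes, labels, k):
    """Return indices of the top-k genes by Joe's normalized MI with the class."""
    scores = [(joe_nmi(g, labels), i) for i, g in enumerate(genes)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Toy data: gene 0 tracks the class label perfectly, gene 1 is noise.
labels = [0, 0, 0, 1, 1, 1]
genes = [
    ["low", "low", "low", "high", "high", "high"],  # informative
    ["low", "high", "low", "high", "low", "high"],  # uninformative
]
print(rank_genes(genes, labels, 1))  # → [0]
```

Because δ* is a monotone transform of I(X;Y), the ranking it induces matches plain mutual information; the normalization matters when scores must be comparable on a fixed [0, 1) scale, as in thresholded filters.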


  1. M. Dash and H. Liu, Feature selection for classification, Intell. Data Anal. 1 (1997), 131-156.
  2. I. Guyon and A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003), 1157-1182.
  3. A. L. Blum and P. Langley, Selection of relevant features and examples in machine learning, Artif. Intell. 97 (1997), 245-271.
  4. H. H. Hsu, C. W. Hsieh and M. D. Lu, Hybrid feature selection by combining filters and wrappers, Expert Syst. Appl. 38 (2011), 8144-8150.
  5. J. Wang et al., Maximum weight and minimum redundancy: a novel framework for feature subset selection, Pattern Recognit. 46 (2013), 1616-1627.
  6. B. Liu et al., Discrete biogeography based optimization for feature selection in molecular signatures, Mol. Inf. 34 (2015), 197-215.
  7. S. Yazdani, J. Shanbehzadeh, and E. Aminian, Feature subset selection using constrained binary/integer biogeography based optimization, ISA Trans. 52 (2013), 383-390.
  8. V. Bolon‐Canedo et al., Statistical dependence measure for feature selection in microarray datasets, in Proc. Eur. Symp. Artif. Neural Netw. ‐ESANN, Bruges, Belgium, Apr. 27-29, 2011, pp. 23-28.
  9. P. Meyer, C. Schretter, and G. Bontempi, Information‐theoretic feature selection in microarray data using variable complementarity, IEEE J. Sel. Top. Signal Process. 2 (2008), 261-274.
  10. L. Song et al., Feature selection via dependence maximization, J. Mach. Learn. Res. 13 (2012), 1393-1434.
  11. X. Li and M. Yin, Multi‐objective binary biogeography based optimization for feature selection using gene expression data, IEEE Trans. Nano Biosci. 12 (2013), 343-353.
  12. A. Sharma, S. Imoto, and S. Miyano, A top‐r feature selection algorithm for microarray gene expression data, IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 9 (2012), 754-764.
  13. S. Thawkar and R. Ingolikar, Classification of masses in digital mammograms using biogeography‐based optimization technique, J. King Saud Univ. Comp. Inf. Sci. (2018).
  14. M. S. Mohamad et al., A modified binary particle swarm optimization for selecting the small subset of informative genes from gene expression data, IEEE Trans. Inf. Technol. Biomed. 15 (2011), 813-822.
  15. K. Kira and L. Rendell, The feature selection problem: Traditional methods and a new algorithm, in Proc. Tenth Natl. Conf. Artif. Intell., AAAI Press/The MIT Press, Menlo Park, 1992, pp. 129-134.
  16. M. Dash, H. Liu, and H. Motoda, Consistency based feature selection, in Proc. Fourth Pacific Asia Conf. Knowl. Discov. Data Min., Springer‐Verlag, 2000, pp. 98-109.
  17. M. Hall, Correlation based feature selection for machine learning, Ph.D. Thesis, Univ. Waikato, Dept. Comp. Sci. (1999).
  18. L. Yu and H. Liu, Feature selection for high‐dimensional data: a fast correlation‐based filter solution, in Proc. Twentieth Int. Conf. Mach. Learning ICML, Washington, DC, USA, Aug. 21-24, 2003, pp. 856-863.
  19. C. E. Sarndal, A comparative study of association measures, Psychometrika 39 (1974), 165-187.
  20. H. Joe, Relative entropy measures of multivariate dependence, J. Am. Stat. Assoc. 84 (1989), 157-164.
  21. C. E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27 (1948), 379-423.
  22. I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, CA, 2000.
  23. T. Li, C. Zhang, and M. Ogihara, A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics 20 (2004), 2429-2437.
  24. Z. Zhu, Y. S. Ong, and M. Dash, Markov blanket‐embedded genetic algorithm for gene selection, Pattern Recognit. 49 (2007), 3236-3248.