An enhanced feature selection filter for classification of microarray cancer data

  • Mazumder, Dilwar Hussain (Department of Computer Science and Engineering, National Institute of Technology Nagaland)
  • Veilumuthu, Ramachandran (Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology)
  • Received : 2018.09.19
  • Accepted : 2018.12.24
  • Published : 2019.06.03

Abstract

The main aim of this study is to select the optimal set of genes from microarray cancer datasets that contribute to the prediction of specific cancer types. This study proposes an enhancement of the feature selection filter algorithm based on Joe's normalized mutual information and applies it to gene selection. The proposed algorithm is implemented and evaluated on seven benchmark microarray cancer datasets, namely, central nervous system, leukemia (binary), leukemia (3 class), leukemia (4 class), lymphoma, mixed lineage leukemia, and small round blue cell tumor, using five well-known classifiers: naive Bayes, radial basis function network, instance-based classifier, decision-based table, and decision tree. Averaged over all seven datasets and all five classifiers, the prediction accuracy increases by 5.1%. The average reduction in training time is 2.86 seconds. The performance of the proposed method is also compared with those of three other popular mutual information-based feature selection filters, namely, information gain, gain ratio, and symmetric uncertainty. The results are impressive when all five classifiers are used on all the datasets.
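The four mutual information-based filter scores named in the abstract can be illustrated with a minimal sketch. This is not the authors' enhanced algorithm, which the abstract does not specify. It assumes that gene expression values have already been discretized, and it assumes the min-entropy normalization I(X;Y) / min(H(X), H(Y)) that is commonly attributed to Joe; the function names, the toy data, and the simple top-ranking step are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def _ratio(num, den):
    """Guarded division: a zero-entropy denominator yields a score of 0."""
    return num / den if den > 0 else 0.0


def filter_scores(feature, labels):
    """Score one discretized feature against the class labels."""
    h_x, h_y = entropy(feature), entropy(labels)
    mi = mutual_information(feature, labels)
    return {
        "information_gain": mi,
        "gain_ratio": _ratio(mi, h_x),
        "symmetric_uncertainty": _ratio(2 * mi, h_x + h_y),
        # Assumed form of Joe's normalized mutual information.
        "min_entropy_nmi": _ratio(mi, min(h_x, h_y)),
    }


def rank_features(columns, labels, measure="min_entropy_nmi"):
    """Return feature names sorted from most to least relevant."""
    scores = {name: filter_scores(col, labels)[measure]
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: three discretized "genes" and two class labels.
labels = ["A", "A", "B", "B"]
genes = {
    "gene1": [0, 0, 1, 1],  # separates the two classes perfectly
    "gene2": [0, 1, 0, 1],  # independent of the class
    "gene3": [0, 0, 0, 1],  # partially informative
}
ranking = rank_features(genes, labels)
```

In a filter setting, the top-ranked genes would then be passed to any classifier; the cut-off (number of genes or score threshold) is a design choice that this sketch leaves open.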
