Rank-based Multiclass Gene Selection for Cancer Classification with Naive Bayes Classifiers based on Gene Expression Profiles

Rank-Based Multiclass Gene Selection for Cancer Classification from Gene Expression Data Using a Naive Bayes Classifier

  • Jin-Hyuk Hong (Department of Computer Science, Yonsei University);
  • Sung-Bae Cho (Department of Computer Science, Yonsei University)
  • Published: 2008.08.15

Abstract

Multiclass cancer classification based on gene expression profiles has been actively investigated; it determines the type of cancer by analyzing the large volume of gene expression data collected with DNA microarray technology. Because gene expression data include many genes unrelated to the target cancer, informative genes must be selected to achieve high classification accuracy. Conventional rank-based gene selection methods often rely on ideal marker genes, which were devised for binary classification, so they are difficult to apply directly to multiclass classification. In this paper, we propose a novel multiclass gene selection method that does not use ideal marker genes but instead directly analyzes the distribution of gene expression levels. The method measures class discriminability by discretizing gene expression levels into several regions and analyzing the frequency of training samples in each region; samples are then classified with a naive Bayes classifier using the selected genes. We demonstrate the usefulness of the proposed method on several representative benchmark datasets for multiclass cancer classification.

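Multiclass cancer classification using gene expression data, an area of active recent research, determines the type of cancer by analyzing the large-scale gene information obtained from DNA microarrays. Because the collected gene expression data also include genes unrelated to the target cancer, useful genes must be selected to obtain high classification performance. Conventional rank-based gene selection was devised for binary classification and relies on ideal marker genes, which limits its direct application to multiclass cancer classification. This paper proposes a rank-based multiclass gene selection technique that does not use ideal marker genes and instead directly analyzes the distribution of gene expression levels. Gene expression levels are discretized and frequencies are computed from the training data to measure discriminability between classes; multiclass cancer classification is then performed with a naive Bayes classifier using the selected genes. Applying the proposed method to several multiclass cancer classification datasets confirmed that it outperforms existing gene selection methods.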
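
The pipeline outlined in the abstract (discretize expression levels into regions, count training samples per region and class, rank genes by class discriminability, classify with naive Bayes) can be illustrated with a minimal Python sketch. The abstract does not give the exact discretization scheme or the scoring formula, so several parts here are assumptions rather than the authors' method: equal-width binning, a simple region-purity score (`purity_score`, a hypothetical stand-in for the paper's discriminability measure), add-one smoothing, and all function names and toy data.

```python
import numpy as np


def fit_bins(X, n_bins):
    """Learn per-gene equal-width bin parameters from training data (assumed scheme)."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    # Guard against constant genes, whose range would give a zero bin width.
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)
    return lo, width


def to_bins(X, lo, width, n_bins):
    """Map expression levels to region indices 0..n_bins-1, clipping out-of-range values."""
    idx = np.floor((X - lo) / width).astype(int)
    return np.clip(idx, 0, n_bins - 1)


def region_counts(B, y, n_classes, n_bins):
    """counts[g, c, r] = number of training samples of class c with gene g in region r."""
    counts = np.zeros((B.shape[1], n_classes, n_bins))
    for c in range(n_classes):
        for r in range(n_bins):
            counts[:, c, r] = (B[y == c] == r).sum(axis=0)
    return counts


def purity_score(counts):
    """Hypothetical score: share of samples that belong to their region's majority class."""
    return counts.max(axis=1).sum(axis=1) / counts.sum(axis=(1, 2))


def train_nb(counts):
    """Naive Bayes log-parameters with add-one (Laplace) smoothing (an assumption)."""
    class_totals = counts[0].sum(axis=1)  # training samples per class
    log_prior = np.log(class_totals / class_totals.sum())
    n_bins = counts.shape[2]
    log_like = np.log((counts + 1.0) / (class_totals[None, :, None] + n_bins))
    return log_prior, log_like


def predict_nb(B, log_prior, log_like):
    """Return, for each row of B, the class with the highest posterior log-score."""
    genes = np.arange(B.shape[1])
    scores = np.stack(
        [log_prior[c] + log_like[genes, c, B].sum(axis=1) for c in range(len(log_prior))],
        axis=1,
    )
    return scores.argmax(axis=1)


# Toy data (invented for illustration): 6 samples, 3 genes, 3 classes.
# Only gene 0 separates the classes; gene 1 alternates and gene 2 is constant.
X_train = np.array([
    [0.0, 1.0, 3.0],
    [0.5, 9.0, 3.0],
    [5.0, 1.0, 3.0],
    [5.5, 9.0, 3.0],
    [10.0, 1.0, 3.0],
    [9.5, 9.0, 3.0],
])
y_train = np.array([0, 0, 1, 1, 2, 2])
n_bins, n_classes, top_k = 3, 3, 1

lo, width = fit_bins(X_train, n_bins)
B_train = to_bins(X_train, lo, width, n_bins)
counts = region_counts(B_train, y_train, n_classes, n_bins)
scores = purity_score(counts)
selected = np.argsort(-scores)[:top_k]  # indices of the top-ranked genes

log_prior, log_like = train_nb(counts[selected])
X_test = np.array([[0.2, 5.0, 3.0], [9.0, 5.0, 3.0], [4.8, 1.0, 3.0]])
B_test = to_bins(X_test, lo, width, n_bins)[:, selected]
pred = predict_nb(B_test, log_prior, log_like)
print(selected, pred)
```

Because the score is computed per gene from the region-by-class frequency table, it applies to any number of classes without defining an ideal marker pattern, which is the property the abstract emphasizes. The bin parameters are learned on the training set only and reused for test samples, so no test information leaks into gene selection.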
