Comparison of clustering methods of microarray gene expression data

  • Lim, Jin-Soo (Department of Biological Sciences, Busan National University) ;
  • Lim, Dong-Hoon (Department of Information Statistics, Gyeongsang National University)
  • Received : 2011.10.30
  • Accepted : 2011.12.12
  • Published : 2012.01.31

Abstract

Cluster analysis has proven to be a useful tool for investigating the association structure among genes and samples in a microarray data set. We applied several cluster validation measures to evaluate the performance of clustering algorithms for microarray gene expression data: hierarchical clustering, K-means, PAM (partitioning around medoids), SOM (self-organizing maps) and model-based clustering. The validation measures fall into three general categories: internal, stability and biological. Performance is evaluated on simulated data and on the SRBCT (small round blue cell tumor) microarray data. On the simulated data, nearly every method performs well under all three categories of measures, recovering a number of clusters equal to the number of classes in the original data. For the SRBCT data, the best choice for the number of clusters is less clear than for the simulated data. Under the Silhouette width (an internal measure), PAM, SOM and model-based clustering gave results similar to those on the simulated data, as did PAM and model-based clustering under the biological measure, while model-based clustering had the best value of the stability measure.

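As an illustration of how an internal measure such as the Silhouette width can be used to choose the number of clusters, the following is a minimal sketch in plain Python. It is not the authors' code or data: the simulated two-dimensional "expression profiles", the three group centres, the simple K-means routine with farthest-point initialisation, and all function names are assumptions made for this example only.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain Lloyd-style K-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old centre if a cluster empties
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette_width(points, labels):
    """Average silhouette width: mean of (b - a) / max(a, b) over all points."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Simulated data (assumed for illustration): three well-separated groups of 20 profiles.
rng = random.Random(0)
group_centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in group_centres
    for _ in range(20)
]

# Evaluate candidate numbers of clusters and keep the one with the largest width.
widths = {k: silhouette_width(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(widths, key=widths.get)
print(widths, best_k)
```

With groups this well separated, the largest average silhouette width should occur at the true number of groups, which mirrors the behaviour the abstract reports for simulated data. On real data such as SRBCT the curve is typically flatter, so the choice of the number of clusters is less clear-cut.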
