A Prediction Model for the Development of Cataract Using Random Forests

A Cataract Prediction Model Using the Random Forests Technique: Health Screening Data from a Single University Hospital

Han, Eun-Jeong; Song, Ki-Jun; Kim, Dong-Geon

  • Published : 2009.08.31


Cataract is the main cause of blindness and visual impairment; age-related cataract in particular accounts for about half of the 32 million cases of blindness worldwide. As life expectancy rises and the elderly population grows, the number of cataract cases increases as well, creating a serious economic and social problem throughout the country. However, the incidence of cataract can be reduced dramatically through early diagnosis and prevention. In this study, we developed a prediction model of cataracts for early diagnosis using hospital data on 3,237 subjects who first received a screening test and later visited the medical center for a cataract check-up between 1994 and 2005. To develop the prediction model, we used random forests and compared its predictive performance with that of other common classification models: logistic regression, discriminant analysis, decision tree, naive Bayes, and two popular ensemble methods, bagging and arcing. The accuracy of the random forests model was 67.16% and its sensitivity was 72.28%; the main factors included in the model were age, diabetes, WBC, platelet, triglyceride, and BMI, among others. The results show that the model could detect about 70% of cataract cases from screening test results alone, without any information from a direct eye examination by an ophthalmologist. We expect that our model may contribute to diagnosing cataract and help prevent it in its early stages.
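As a reading aid for the accuracy and sensitivity figures reported above, the following minimal sketch shows how the two measures are derived from a binary confusion matrix. The function name and the counts are hypothetical, chosen only for illustration; they are not the study's data, and the abstract does not report the underlying confusion matrix.

```python
def accuracy_and_sensitivity(tp, fp, tn, fn):
    """Return (accuracy, sensitivity) from binary confusion-matrix counts.

    tp: cataract cases predicted as cataract
    fn: cataract cases predicted as non-cataract
    tn: non-cataract cases predicted as non-cataract
    fp: non-cataract cases predicted as cataract
    """
    total = tp + fp + tn + fn
    positives = tp + fn
    if total == 0 or positives == 0:
        raise ValueError("need at least one subject and one positive case")
    accuracy = (tp + tn) / total    # share of all subjects classified correctly
    sensitivity = tp / positives    # share of true cataract cases detected
    return accuracy, sensitivity


# Hypothetical counts for illustration only (not taken from the paper).
acc, sens = accuracy_and_sensitivity(tp=70, fp=35, tn=65, fn=30)
print(f"accuracy = {acc:.2%}, sensitivity = {sens:.2%}")
```

Sensitivity is the measure that matters most for a screening tool, since it is the proportion of actual cataract cases that the model flags for follow-up.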


Random forest; screening test; prediction model of cataracts; accuracy; sensitivity

