Scoring Korean Written Responses Using English-Based Automated Computer Scoring Models and Machine Translation: A Case of Natural Selection Concept Test

영어기반 컴퓨터자동채점모델과 기계번역을 활용한 서술형 한국어 응답 채점 -자연선택개념평가 사례-

  • Received : 2016.03.21
  • Accepted : 2016.05.05
  • Published : 2016.06.30


This study aims to test the efficacy of English-based automated computer scoring models and machine translation to score Korean college students' written responses on natural selection concept items. To this end, I collected 128 pre-service biology teachers' written responses on four-item instrument (total 512 written responses). The machine translation software (i.e., Google Translate) translated both original responses and spell-corrected responses. The presence/absence of five scientific ideas and three $na{\ddot{i}}ve$ ideas in both translated responses were judged by the automated computer scoring models (i.e., EvoGrader). The computer-scored results (4096 predictions) were compared with expert-scored results. The results illustrated that no significant differences in both average scores and statistical results using average scores was found between the computer-scored result and experts-scored result. The Pearson correlation coefficients of composite scores for each student between computer scoring and experts scoring were 0.848 for scientific ideas and 0.776 for $na{\ddot{i}}ve$ ideas. The inter-rater reliability indices (Cohen kappa) between computer scoring and experts scoring for linguistically simple concepts (e.g., variation, competition, and limited resources) were over 0.8. These findings reveal that the English-based automated computer scoring models and machine translation can be a promising method in scoring Korean college students' written responses on natural selection concept items.


automated computer scoring;written response;natural selection;assessment


  1. Anderson, D. L., Fisher, K. M., & Norman, G. J. (2002). Development and evaluation of the conceptual inventory of natural selection. Journal of Research in Science Teaching, 39(10), 952-978.
  2. Basu, S., Jacobs, C., & Vanderwende, L. (2013). Powergrading: A clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics, 1, 391-402.
  3. Beggrow, E. P., Ha, M., Nehm, R. H., Pearl, D., & Boone, W. J. (2014). Assessing scientific practices using machine-learning methods: How closely do they match clinical interview performance?. Journal of Science Education and Technology, 23(1), 160-182.
  4. Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
  5. Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
  6. Ha, M. (2013). Assessing scientific practices using machine learning methods: Development of automated computer scoring models for written evolutionary explanations. Unpublished Doctoral Dissertation. Columbus: The Ohio State University.
  7. Ha, M., & Nehm, R. H. (2016a). The impact of misspelled words on automated computer scoring: A case study of scientific explanations. Journal of Science Education and Technology, 25, 358-374.
  8. Ha, H., & Nehm, R. H. (2016b). Predicting the accuracy of computer scoring of text: Probabilistic, multi-model, and semantic similarity approaches. Paper in proceedings of the National Association for Research in Science Teaching, Baltimore, MD, April 14-17.
  9. Haudek, K. C., Prevost, L. B., Moscarella, R. A., Merrill, J., & Urban-Lurain, M. (2012). What are they thinking? Automated analysis of student writing about acid-base chemistry in introductory biology. CBE-Life Sciences Education, 11(3), 283-293.
  10. Kaplan, J. J., Haudek, K. C., Ha, M., Rogness, N., & Fisher, D. G. (2014). Using lexical analysis software to assess student writing in statistics. Technology Innovations in Statistics Education, 8(1).
  11. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 159-174.
  12. Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389-405.
  13. Levesque, A. A. (2011). Using clickers to facilitate development of problem-solving skills. CBE-Life Sciences Education, 10(4), 406-417.
  14. Liu, O. L., Rios, J. A., Heilman, M., Gerard, L., & Linn, M. C. (2016). Validation of automated scoring of science assessments. Journal of Research in Science Teaching, 53(2), 215-233.
  15. Magnusson, S. J., Templin, M., & Boyle, R. A. (1997). Dynamic science assessment: A new approach for investigating conceptual change. The Journal of the Learning Sciences, 6(1), 91-142.
  16. Makiko, M., Yuta, T., & Kazuhide, Y. (2011). Phrase-based statistical machine translation via Chinese characters with small parallel corpora. IJIIP: International Journal of Intelligent Information Processing, 2(3), 52-61.
  17. Mathan, S. A., & Koedinger, K. R. (2005). Fostering the intelligent novice: Learning from errors with metacognitive tutoring. Educational Psychologist, 40(4), 257-265.
  18. Moharreri, K., Ha, M., & Nehm, R. H. (2014). EvoGrader: an online formative assessment tool for automatically evaluating written evolutionary explanations. Evolution: Education and Outreach, 7(1), 1-14.
  19. Nehm, R. H., Ha, M., & Mayfield, E. (2012). Transforming biology assessment with machine learning: automated scoring of written evolutionary explanations. Journal of Science Education and Technology, 21(1), 183-196.
  20. Nehm, R. H., Ha, M., Rector, M., Opfer, J. E., Perrin, L., Ridgway, J. et al. (2010). Scoring guide for the open response instrument (ORI) and evolutionary gain and loss test (ACORNS). Technical Report of National Science Foundation REESE Project 0909999.
  21. Odom, A. L., & Barrow, L. H. (1995). Development and application of a two-tier diagnostic test measuring college biology students' understanding of diffusion and osmosis after a course of instruction. Journal of Research in Science Teaching, 32(1), 45-61.
  22. Opfer, J. E., Nehm, R. H., & Ha, M. (2012). Cognitive foundations for science assessment design: Knowing what students know about evolution. Journal of Research in Science Teaching, 49(6), 744-777.
  23. Rutledge, M. L., & Warden, M. A. (1999). The development and validation of the measure of acceptance of the theory of evolution instrument. School Science and Mathematics, 99(1), 13-18.
  24. Sato, T., Yamanishi, Y., Kanehisa, M., & Toh, H. (2005). The inference of protein-protein interactions by co-evolutionary analysis is improved by excluding the information about the phylogenetic relationships. Bioinformatics, 21(17), 3482-3489.
  25. Shute, V. J. (2008). Focus on formative feedback. Review of educational research, 78(1), 153-189.
  26. Weston, M., Haudek, K. C., Prevost, L., Urban-Lurain, M., & Merrill, J. (2015). Examining the impact of question surface features on students' answers to constructed-response questions on photosynthesis. CBE-Life Sciences Education, 14(2), ar19.
  27. Zhu, Z., Pilpel, Y., & Church, G. M. (2002). Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. Journal of Molecular Biology, 318(1), 71-81.
  28. Crossgrove, K., & Curran, K. L. (2008). Using clickers in nonmajors-and majors-level biology courses: student opinion, learning, and long-term retention of course material. CBE-Life Sciences Education, 7(1), 146-154.