An Analysis on Rater Error in Holistic Scoring for Performance Assessments of Middle School Students' Science Investigation Activities

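Analysis of the Types of Inter-Rater Disagreement in Holistic Scoring of Performance Assessments of Middle School Students' Science Inquiry Activities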

  • Received : 2011.10.26
  • Accepted : 2012.01.16
  • Published : 2012.02.29


The purpose of this study is to understand the errors raters make when scoring performance assessments of science inquiry. Sixty middle school students carried out a scientific inquiry on sound propagation, and four trained raters scored their activity sheets. In a generalizability analysis with a person, task, and rater design, the estimated variance components for rater, rater by person, and rater by task were about 25%. Two of the four raters were more severe than the other two, and their severities were stable. All four raters agreed with one another in 51 of the 240 cases. Through raters' conferences, the rater errors in the 189 cases of disagreement were classified into three types: different salience, severity, and overlooking. Error type 1, different salience, accounted for 38% of the disagreed cases; the raters differed in which tasks and which assessment components they treated as salient. Error type 2, severity, accounted for 25%, and error type 3, overlooking, for 31%. Error type 2 appeared to occur when a student's response lay on the border between two levels. Error type 3 appeared to occur when a rater overlooked an important part of a student's response because of being absorbed in his or her own salient points. To reduce these rater errors, raters need to confer on the salience of tasks and assessment components before holistically scoring complex tasks. Raters also need to recognize their own severity and make an effort to keep it consistent. Multiple raters are needed to prevent overlooking errors. Further studies on raters' tendencies and on the sources of differing interpretations of the rubric are suggested.
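The agreement figures in the abstract (51 fully agreed cases out of 240, leaving 189 disagreed cases) can be illustrated with a minimal sketch. The function name `count_full_agreement` and the small score table are hypothetical and are not taken from the study; only the totals 240 and 51 come from the abstract. The sketch assumes that a "case" is one scored response and that agreement means all four raters gave the same score.

```python
from typing import Sequence


def count_full_agreement(ratings: Sequence[Sequence[int]]) -> int:
    """Count the cases in which every rater assigned the same score."""
    return sum(1 for case in ratings if len(set(case)) == 1)


# Hypothetical scores: each row is one case, each column is one of four raters.
example_ratings = [
    [3, 3, 3, 3],  # all four raters agree
    [2, 3, 2, 2],  # one rater differs
    [4, 4, 4, 4],  # all four raters agree
    [1, 2, 2, 3],  # three different scores
]
agreed = count_full_agreement(example_ratings)
disagreed = len(example_ratings) - agreed

# Totals reported in the abstract.
total_cases, agreed_cases = 240, 51
disagreed_cases = total_cases - agreed_cases   # cases analyzed for error type
agreement_rate = agreed_cases / total_cases    # share of fully agreed cases

print(agreed, disagreed, disagreed_cases, agreement_rate)
```

With the reported totals, full agreement covers roughly a fifth of the cases, which is why the remaining 189 cases were the focus of the error-type classification.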


rater error; salient assessment component; severity; overlook; performance assessment; science investigation activity; middle school student


