A Study on Comparison of Generalized Kappa Statistics in Agreement Analysis
Kim, Min-Seon; Song, Ki-Jun; Nam, Chung-Mo; Jung, In-Kyung
Agreement analysis assesses the reliability of ratings made repeatedly on the same subjects by one or more raters. The kappa statistic is commonly used when the rating scale is categorical. The simple and weighted kappa statistics measure the degree of agreement between two raters, and the generalized kappa statistics measure the degree of agreement among more than two raters. In this paper, we compare the performance of four generalized kappa statistics, proposed by Fleiss (1971), Conger (1980), Randolph (2005), and Gwet (2008a). We also examine how sensitive each of the four statistics is to the marginal probability distribution, that is, to whether or not marginal balancedness and/or marginal homogeneity hold. The four methods are compared in terms of relative bias and coverage rate through simulation studies under various scenarios with different numbers of raters, subjects, and categories. A real data example is also presented to illustrate the four methods.
Agreement; generalized kappa; marginal probability distribution
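As a rough illustration of how the generalized kappa statistics named in the abstract differ, the sketch below computes three of them from a subjects-by-categories table of rating counts. It uses the commonly cited textbook forms, in which all three share the same observed agreement and differ only in the chance-agreement term; it is not taken from the paper itself, and the function names and the toy data are assumptions made for illustration. It also assumes every subject is rated by the same number of raters. Conger's (1980) statistic is omitted because it needs rater-level marginals, which a count table does not retain.

```python
def observed_agreement(counts):
    """Mean proportion of agreeing rater pairs per subject.

    counts[i][j] = number of raters who put subject i in category j.
    Assumes every row sums to the same number of raters n >= 2.
    """
    n = sum(counts[0])
    per_subject = [
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ]
    return sum(per_subject) / len(counts)


def category_proportions(counts):
    """Overall share of all ratings that fall in each category."""
    n = sum(counts[0])
    total = len(counts) * n
    q = len(counts[0])
    return [sum(row[j] for row in counts) / total for j in range(q)]


def chance_corrected(p_o, p_e):
    """Generic kappa-type form: (observed - chance) / (1 - chance)."""
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(counts):
    # Chance agreement from the pooled (fixed) marginals: sum of p_j^2.
    p = category_proportions(counts)
    return chance_corrected(observed_agreement(counts), sum(x * x for x in p))


def randolph_kappa(counts):
    # Free-marginal chance agreement: 1 / (number of categories).
    q = len(counts[0])
    return chance_corrected(observed_agreement(counts), 1 / q)


def gwet_ac1(counts):
    # AC1 chance agreement: sum of p_j * (1 - p_j), divided by (q - 1).
    p = category_proportions(counts)
    q = len(counts[0])
    p_e = sum(x * (1 - x) for x in p) / (q - 1)
    return chance_corrected(observed_agreement(counts), p_e)


# Hypothetical toy data: 4 subjects, 3 raters, 2 categories,
# with strongly unbalanced marginals (11 of 12 ratings in category 0).
toy = [[3, 0], [3, 0], [3, 0], [2, 1]]
print(fleiss_kappa(toy), randolph_kappa(toy), gwet_ac1(toy))
```

On this toy table the observed agreement is high (10/12), yet Fleiss' kappa is slightly negative (-1/11), while Randolph's kappa (2/3) and Gwet's AC1 (98/122) stay well above zero. This is the kind of sensitivity to unbalanced marginals that the abstract refers to.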
Berry, K. J. and Mielke, P. W. (1988). A generalization of Cohen's kappa, Educational and Psychological Measurement, 48, 921-933.
Brennan, R. L. and Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives, Educational and Psychological Measurement, 41, 687-699.
Cohen, J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit, Psychological Bulletin, 70, 213-220.
Conger, A. J. (1980). Integration and generalization of kappas for multiple raters, Psychological Bulletin, 88, 322-328.
Feinstein, A. R. and Cicchetti, D. V. (1990). High agreement but low kappa: 1. The problems of two paradoxes, Journal of Clinical Epidemiology, 43, 543-549.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters, Psychological Bulletin, 76, 378-382.
Gwet, K. L. (2008a). Computing inter-rater reliability and its variance in the presence of high agreement, British Journal of Mathematical and Statistical Psychology, 61, 29-48.
Gwet, K. L. (2008b). Variance estimation of nominal-scale interrater reliability with random selection of raters, Psychometrika, 73, 407-430.
Gwet, K. L. (2010). Handbook of Inter-Rater Reliability, 2nd edn. Advanced Analytics, LLC.
Janson, H. and Olsson, U. (2001). A measure of agreement for interval or nominal multivariate observations, Educational and Psychological Measurement, 61, 277-289.
Janson, H. and Olsson, U. (2004). A measure of agreement for interval or nominal multivariate observations by different sets of judges, Educational and Psychological Measurement, 64, 62-70.
Park, M. H. and Park, Y. G. (2007). A new measure of agreement to resolve the two paradoxes of Cohen's kappa, The Korean Journal of Applied Statistics, 20, 117-132.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series, Journal of the Royal Statistical Society, Series B (Methodological), 11, 68-84.
Randolph, J. J. (2005). Free-marginal multirater kappa: An alternative to Fleiss' fixed-marginal multirater kappa, Paper presented at the Joensuu University Learning and Instruction Symposium.
Scott, W. (1955). Reliability of content analysis: The case of nominal scale coding, Public Opinion Quarterly, 19, 321-325.