
Reports and Communications


Social Science Computer Review 28(3) 391-396
© The Author(s) 2010
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0894439309351394
http://sscr.sagepub.com

Ag09
A Computer Program for Interrater Agreement for Judgments
Roel Popping¹

Abstract
This text describes a computer program for computing the interrater agreement index Scott's π (pi) for two or more ratings per object. The program allows the use of weights; the user is therefore not restricted to data on a nominal level of measurement. If wanted, the agreement per category can also be computed.

Keywords
nominal scale agreement, Scott's π, computer programs

¹ Department of Sociology, University of Groningen, Groningen, Netherlands

Corresponding Author: Roel Popping, Department of Sociology, University of Groningen, Groningen 9712 TG 31, Netherlands. Email: R.Popping@rug.nl

Introduction
Interrater reliability or agreement is the widely used term for the extent to which independent raters evaluate a characteristic of a message or artifact and reach the same conclusion. Within the social and behavioral sciences, interrater agreement is mostly used in text analysis studies or observational studies. In the first case, the codings of texts or text fragments by several independently operating raters are compared; in the second case, behavior is classified by, again independently operating, observers. The comparison of all classifications results in the score on an agreement index.

Agreement is very often considered a special kind of association, but there are differences. In the case of association, one investigates the strength of the linear relationship between variables; the goal is to predict the values of one variable from those of the other. With regard to agreement, the behavior of one rater does not have to be predicted from that of the other: what matters is the similarity of the raters' behavior (in a broad sense), with the goal of determining the degree to which this behavior is identical. The basic idea of an agreement index is to look at the fraction of observations on which the raters agree. The difference between reliability and agreement has been described very well: "Interrater reliability provides an indication of the extent to which the variance in the ratings is attributable to differences
among the rated objects. . . . Interrater agreement represents the extent to which the different judges tend to assign exactly the same rating to each object" (Tinsley & Weiss, 2000, p. 98). One should be aware, however, that high reliability does not automatically mean high validity. Reliability is the agreement between two efforts to measure the same trait through maximally similar methods; validity is represented in the agreement between two attempts to measure the same trait through maximally different methods (Campbell & Fiske, 1959, p. 83).

In most situations in which agreement is computed within the social and behavioral sciences, the raters should be trained in recognizing the characteristics that are relevant for a correct classification. In medicine, one might need a specialist who is able to recognize a specific disease; in the fields mentioned first, however, such specialist skills are not necessary.

A few assumptions should be satisfied if one wants to compute interrater agreement:
1. the objects are independent;
2. the raters operate independently;
3. the categories are used more or less with the same frequency.
In case the data are measured on a nominal scale, the categories are independent, exclusive, and exhaustive. For a higher level of measurement, the requirements posed by that level are to be fulfilled. In the index, a correction for chance agreement is to be made (Galtung, 1979). The general formulation of the agreement index therefore is

I = \frac{P_o - P_e}{1 - P_e},   (1)

where P_o denotes the observed proportion of agreement and P_e the proportion of agreement under the null hypothesis of independence. Several quality criteria for agreement indices have been proposed (Popping, 1988). Two very important ones are that the maximum value of the index should be 1, no matter the number of raters, objects (that which is judged), or categories, and that in the case of statistical independence between the answers of the raters the index should take the value 0. Even in this extreme case of independence, the raters will agree on the classification of some objects purely by chance. The correction for such chance agreement has played an important role in the literature (see Note 1).

In most agreement studies, a classification by raters is available for only a limited set of objects. It is frequently desirable to generalize the conclusions to a larger set of objects, either those for which only one classification is available or those that will be classified in the future. Statistical methods allow doing this, provided that the objects on which agreement is calculated are a random sample from this larger population of objects. Statistical generalizations to other raters are also feasible, provided that the actual raters for whom one has data are a sample from a larger pool of raters. In many other applications, however, the raters are fixed and not chosen at random. There is no reason to assume that one rater has better skills for performing a coding task than another. A good investigator ensures that all raters receive a training in which they learn how to look at the objects and at traits that might be relevant in the coding task; all raters should be equally trained. This is especially so when there is no right or wrong assignment: usually what is coded is whether a certain issue is present (according to the rater) in the object. Under these conditions it is best to base the expected agreement on the marginal totals pooled over all raters. This position is taken by Scott (1955). The position was attacked by Cohen: "one source of disagreement between a pair of judges is precisely their proclivity to distribute their judgments differently over the categories" (Cohen, 1960, p. 41). As indicated before, the situation in which this objection applies is not met within the behavioral and social sciences; for most situations in these fields, the requirement is unrealistic.

In the literature, the index by Cohen has received the most attention. More coding tasks have been performed in his field of application than in the field Scott was working in, which is the field focused on here. A consequence is that for most users all indices of type (1) became known as κ (kappa) indices, as Cohen named his index. It became even more complex when Fleiss (1971) presented an index that he considered an extension of Cohen's index but that actually was an extension of the index proposed by Scott.
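To make the chance correction in Equation 1 concrete, here is a minimal Python sketch; it is an illustration only, not code from the Ag09 program, and the function name is chosen for this example.

```python
def chance_corrected_index(p_obs, p_exp):
    """Generic chance-corrected agreement index I = (Po - Pe) / (1 - Pe)."""
    if p_exp >= 1.0:
        raise ValueError("expected agreement must be below 1")
    return (p_obs - p_exp) / (1.0 - p_exp)

# Example: raters agree on 80% of the judgments, while 50% agreement
# would already be expected by chance; the corrected index is 0.6.
print(round(chance_corrected_index(0.80, 0.50), 3))  # 0.6
```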

Scott's index
Scott (1955) derived an agreement index of the type in Equation 1, which he called pi (π). The index differs from the index proposed by Cohen (1960): the expected agreement is, as mentioned, based on the marginal totals pooled over all raters, whereas in Cohen's index the expected agreement is based on the marginals of each of the individual raters.

Assume there are N objects and c categories. Define n_{sj} as the number of times object s has been assigned to category j. A matrix w contains weights; w_{ij} is the weight for the situation in which one rating of an object is to category i and the other rating is to category j. In the standard situation, one uses w_{ii} = 1 and w_{ij} = 0 (where i ≠ j). The scheme w_{ij} = 1 - |i - j| / (c - 1) indicates a linearly weighted relation, and w_{ij} = 1 - (i - j)^2 / (c - 1)^2 refers to the squared (quadratically) weighted relation (Cicchetti, 1972). In practice, these schemes are used when the data are on an ordinal or interval level of measurement.

The definition of agreement has also been a subject of discussion: pairwise (the proportion of agreeing pairs of judgments), simultaneous (there is agreement with respect to an object only when all judgments fall into the same category), or majority (there is agreement with respect to an object only when k of the m judgments fall into the same category). Popping (2009) has shown that pairwise agreement is to be used.

The total number of ratings of object s can now be computed:

n_{s.} = \sum_{i=1}^{c} n_{si}.   (2)
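The weighting schemes described above translate directly into a c x c weight matrix. The sketch below is an illustration; the function name weight_matrix and the NumPy dependency are choices made for this example and are not part of Ag09.

```python
import numpy as np

def weight_matrix(c, scheme="identity"):
    """Return a c x c weight matrix w.

    identity:  w_ij = 1 if i == j else 0           (nominal data)
    linear:    w_ij = 1 - |i - j| / (c - 1)        (ordinal data)
    quadratic: w_ij = 1 - (i - j)**2 / (c - 1)**2  (interval data)
    """
    i, j = np.indices((c, c))
    if scheme == "identity":
        return (i == j).astype(float)
    if scheme == "linear":
        return 1.0 - np.abs(i - j) / (c - 1)
    if scheme == "quadratic":
        return 1.0 - (i - j) ** 2 / (c - 1) ** 2
    raise ValueError("unknown weighting scheme")

print(weight_matrix(3, "linear"))
# [[1.  0.5 0. ]
#  [0.5 1.  0.5]
#  [0.  0.5 1. ]]
```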

This number must be at least 2, because otherwise no comparison with another coding can be made. It allows computing the proportion of codings into a pair of categories (Schouten, 1986):

p_{ii} = \sum_{s=1}^{N} \frac{n_{si} (n_{si} - 1)}{N n_{s.} (n_{s.} - 1)},   (3)

and for all i ≠ j,

p_{ij} = \sum_{s=1}^{N} \frac{n_{si} n_{sj}}{N n_{s.} (n_{s.} - 1)}.   (4)

The total proportion of codings into category i is

p_i = \sum_{s=1}^{N} \frac{n_{si}}{N n_{s.}}.   (5)
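Equations 3 to 5 translate directly into code once the data are available as an N x c matrix of counts n_{si}. The sketch below is an illustration under that assumption; the function name and the NumPy dependency are choices of this example, not the Ag09 implementation.

```python
import numpy as np

def pairwise_proportions(n):
    """n: N x c array with n[s, i] = number of ratings of object s in category i.

    Returns (p, p_marg):
      p[i, j]   -- proportion of pairs of codings into categories i and j
                   (Equations 3 and 4),
      p_marg[i] -- overall proportion of codings into category i (Equation 5).
    """
    n = np.asarray(n, dtype=float)
    N = n.shape[0]
    n_s = n.sum(axis=1)                      # number of ratings per object, n_s.
    if np.any(n_s < 2):
        raise ValueError("every object needs at least two ratings")
    denom = N * n_s * (n_s - 1.0)            # N * n_s. * (n_s. - 1), per object
    # All pairs: sum_s n_si * n_sj / denom_s; the diagonal still counts pairs
    # of a coding with itself, so subtract sum_s n_si / denom_s there.
    p = np.einsum('si,sj,s->ij', n, n, 1.0 / denom)
    p[np.diag_indices_from(p)] -= (n / denom[:, None]).sum(axis=0)
    p_marg = (n / (N * n_s[:, None])).sum(axis=0)
    return p, p_marg

# Example: 4 objects, 3 categories, 2 ratings per object.
n = [[2, 0, 0],
     [1, 1, 0],
     [0, 2, 0],
     [0, 0, 2]]
p, p_marg = pairwise_proportions(n)
print(p.sum())   # 1.0: all pairs of codings are accounted for
print(p_marg)    # [0.375 0.375 0.25]
```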


Now the observed and expected agreement over all categories can be defined:

P_o = \sum_{i=1}^{c} \sum_{j=1}^{c} w_{ij} p_{ij},   (6)

P_e = \sum_{i=1}^{c} \sum_{j=1}^{c} w_{ij} p_i p_j,   (7)

as well as the values for the single category i:

P_{oi} = \sum_{j=1}^{c} w_{ij} p_{ij} / p_i,   (8)

P_{ei} = \sum_{j=1}^{c} w_{ij} p_j.   (9)

Agreement for a single category might also be formulated in terms of assignment versus no assignment to category i. Now

P_{oi} = 1 - 2 (p_i - p_{ii}),   (10)

P_{ei} = 1 - 2 p_i (1 - p_i).   (11)
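As an illustration of Equations 6 to 11, the sketch below computes the overall index and the per-category versions from the pairwise proportions, the pooled marginals, and a weight matrix (for example as produced by the hypothetical pairwise_proportions and weight_matrix sketches above); it is not the Ag09 code.

```python
import numpy as np

def weighted_agreement(p, p_marg, w):
    """Overall index of type (1) from Equations 6 and 7, given the pairwise
    proportions p[i, j], pooled marginal proportions p_marg[i], and weights w."""
    p_o = (w * p).sum()                          # Equation 6
    p_e = (w * np.outer(p_marg, p_marg)).sum()   # Equation 7
    return (p_o - p_e) / (1.0 - p_e)

def category_agreement(p, p_marg, w, i):
    """Agreement for the single category i (Equations 8 and 9)."""
    p_oi = (w[i] * p[i]).sum() / p_marg[i]       # Equation 8
    p_ei = (w[i] * p_marg).sum()                 # Equation 9
    return (p_oi - p_ei) / (1.0 - p_ei)

def category_agreement_dichotomized(p, p_marg, i):
    """Category i versus 'not i' (Equations 10 and 11, unweighted case)."""
    p_oi = 1.0 - 2.0 * (p_marg[i] - p[i, i])          # Equation 10
    p_ei = 1.0 - 2.0 * p_marg[i] * (1.0 - p_marg[i])  # Equation 11
    return (p_oi - p_ei) / (1.0 - p_ei)

# With the count matrix n and the helper sketches from above:
#   p, p_marg = pairwise_proportions(n)
#   w = weight_matrix(p.shape[0], "identity")
#   weighted_agreement(p, p_marg, w)   # about 0.619 for that small example
```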

The sampling distributions are known (Schouten, 1982). The index satisfies criteria 1 to 7 listed in Note 1 and is simple. Missing values, that is, a different number of codings per object, are accommodated in Equations 3 and 4: n_{s.} may take a different value for each object.

Program features
The Ag09 package runs on personal computers under Windows. The program is menu based and is used interactively. It accepts both agreement tables and data matrices, that is, matrices with objects in the rows and raters (and items) in the columns. The data can be entered via the program's own spreadsheet by using the keyboard, but they can also be copied from any text file (extension: txt or dat).

The standard output consists of the value computed for the requested agreement index and the amounts of observed, expected, and maximum agreement, as well as the number of objects used in the computation. Depending on the settings of the program, the variance, null variance, z value, and probability are also computed. Optionally, the agreement per category is computed, as well as confidence intervals or the deviation from a fixed result. It is always possible to consult results from an earlier computation in the same run of the program. Results of computations appear in a text window and can be copied to a file or sent to a printer.

The program allows performing several analyses one after the other; these can be analyses on different parts of the data matrix. Missing values can be marked as such, in which case the corresponding objects are kept out of the analyses. The user has to inform the program of the minimum number of ratings per object that is required for an object to participate in the computation. The data matrix can contain 90 different kinds of assignments in the columns; these are assignments by a number of raters to one or more items (remember, analyses can be performed on only a part of the data matrix). At most 30,000 objects are allowed. Per analysis, the assignments by 30 raters can be entered, and the raters may have assigned the objects to 40 different categories.
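Ag09 reads such a data matrix directly. Purely to illustrate how an objects-by-raters data matrix relates to the count matrix n_{si} used in the formulas above, the sketch below builds the counts and drops objects with fewer than a chosen minimum number of ratings, mirroring the option described here; the function name and the coding of missing values are assumptions of this example.

```python
import numpy as np

def counts_from_data_matrix(data, c, min_ratings=2):
    """data: N x m array of category codes 1..c; any other value (e.g. -1)
    is treated as a missing coding.

    Returns the n_si count matrix, keeping only objects that received at
    least `min_ratings` valid codings.
    """
    data = np.asarray(data)
    counts = np.stack([(data == k).sum(axis=1) for k in range(1, c + 1)], axis=1)
    keep = counts.sum(axis=1) >= min_ratings
    return counts[keep]

# Example: 3 objects judged by 3 raters; -1 denotes a missing coding.
data = [[1, 1, 2],
        [3, 3, -1],
        [2, -1, -1]]   # this object has only one valid coding and is dropped
print(counts_from_data_matrix(data, c=3))
# [[2 1 0]
#  [0 0 2]]
```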

The program can be downloaded from the web site www.gmw.rug.nl/~popping. Other agreement indices, for example the one defined by Cohen (1960), are computed in the computer program Agree (Popping, 1984), which is also available under Windows. An index resembling Scott's π is Krippendorff's α; for this index, a macro is available for use in the Statistical Package for the Social Sciences (SPSS; Hayes & Krippendorff, 2007).

Notes
1. Desiderata have been listed for the situation in which one uses fixed raters, each having their own marginal distribution (Popping, 1988). Below, these desiderata are formulated for the situation in which one works with judgments.
   1. The maximum possible value of the index is 1, regardless of the number of judgments per object or the number of categories.
   2. In the case of independence, given the marginals, the index takes the value 0.
   3. Permutations of categories may not lead to other results.
   4. The estimated value of the index is independent of the number of objects.
   5. If there are more than two categories, it should be possible to compute the amount of agreement for all categories together but also per single category.
   6. If there are more than two ratings, it should be possible to compute the amount of agreement for all ratings together.
   7. The sampling distribution of the index, or at least the variance, should be known.
   8. The index should be robust.
   9. The index should be valid.
   10. The index should be simple and interpretable.
   The last three desiderata are very hard to test; therefore, one usually concentrates on criteria 1 to 7.

Declaration of Conflicting Interests
The author declared no conflicts of interest with respect to the authorship and/or publication of this article.

Funding
The author received no financial support for the research and/or authorship of this article.

References
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cicchetti, D. V. (1972). A new measure of agreement between rank ordered variables. Proceedings of the 80th annual convention. American Statistical Association, 7, 17-18.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Galtung, J. (1979). Measurement of agreement. In J. Galtung (Ed.), Papers on methodology. Theory and methods of social research (Vol. II, pp. 82-135). Copenhagen: Christian Eijlers.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77-89.
Popping, R. (1984). AGREE, a package for computing nominal scale agreement. Computational Statistics and Data Analysis, 2, 182-185.


Popping, R. (1988). On agreement indices for nominal data. In W. E. Saris & I. N. Gallhofer (Eds.), Sociometric research: Data collection and scaling (pp. 90-105). Hampshire: Macmillan.
Popping, R. (2009). Some views on agreement to be used in content analysis studies. Quality & Quantity. DOI 10.1007/s11135-009-9258-3.
Schouten, H. J. A. (1982). Measuring pairwise agreement among many observers. II. Some improvements and additions. Biometrical Journal, 24, 431-435.
Schouten, H. J. A. (1986). Nominal scale agreement among observers. Psychometrika, 51, 453-466.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19, 321-325.
Tinsley, H. E. A., & Weiss, D. J. (2000). Interrater reliability and agreement. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 95-124). San Diego, CA: Academic Press.

Bio
Roel Popping is at the Department of Sociology, University of Groningen. His research interests include methodology, with a specialty in text analysis. He has applied these methods in analyses of developments in democracy and culture, primarily in Hungary and the Netherlands.
