
Best practices in quantitative methods
3 Best Practices in Interrater Reliability: Three Common Approaches

Contributors: Steven E. Stemler & Jessica Tsai
Editors: Jason Osborne
Book Title: Best practices in quantitative methods
Chapter Title: "3 Best Practices in Interrater Reliability: Three Common Approaches"
Pub. Date: 2008
Access Date: May 10, 2014
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9781412940658
Online ISBN: 9781412995627

DOI: http://dx.doi.org/10.4135/9781412995627.d5
Print pages: 29-50
©2008 SAGE Publications, Inc. All Rights Reserved.


3 Best Practices in Interrater Reliability
Three Common Approaches
Steven E. Stemler

Jessica Tsai
The concept of interrater reliability permeates many facets of modern society.

For example, court cases based on a trial by jury require unanimous agreement from
jurors regarding the verdict; life-threatening medical diagnoses often require a second
or third opinion from health care professionals; student essays written in the context
of high-stakes standardized testing receive points based on the judgment of multiple
readers; and Olympic competitions, such as figure skating, award medals to participants
based on quantitative ratings of performance provided by an international panel of
judges.

Any time multiple judges are used to determine important outcomes, certain technical
and procedural questions emerge. Some of the more common questions are as follows:
How many raters do we need to be confident in our results? What is the minimum level
of agreement that our raters should achieve? And is it necessary for raters to agree
exactly, or is it acceptable for them to differ from each other so long as their difference
is systematic and can therefore be corrected?


Key Questions to Ask Before Conducting
an Interrater Reliability Study
If you are at the point in your research where you are considering conducting an
interrater reliability study, then there are three important questions worth considering:

• What is the purpose of conducting your interrater reliability study?
• What is the nature of your data?
• What resources do you have at your disposal (e.g., technical expertise, time,
money)?

The answers to these questions will help determine the best statistical approach to use
for your study.


What Is the Purpose of Conducting Your
Interrater Reliability Study?
There are three main reasons why people may wish to conduct an interrater reliability
study. Perhaps the most popular reason is that the researcher is interested in getting
a single final score on a variable (such as an essay grade) for use in subsequent data
analysis and statistical modeling but first must prove that the scoring is not “subjective”
or “biased.” For example, this is often the goal in the context of educational testing
where large–scale state testing programs might use multiple raters to grade student
essays for the ultimate purpose of providing an overall appraisal of each student's
current level of academic achievement. In such cases, the documentation of interrater
reliability is usually just a means to an end—the end of creating a single summary
score for use in subsequent data analyses—and the researcher may have little inherent
interest in the details of the interrater reliability analysis per se. This is a perfectly
acceptable reason for wanting to conduct an interrater reliability study; however,
researchers must be particularly cautious about the assumptions they are making when
summarizing the data from multiple raters to generate a single summary score for each
student. For example, simply taking the mean of the ratings of two independent raters
may, in some circumstances, actually lead to biased estimates of student ability, even
when the scoring by independent raters is highly correlated (we return to this point later
in the chapter).

A second common reason for conducting an interrater reliability study is to evaluate
a newly developed scoring rubric to see if it is “working” or if it needs to be modified.
For example, one may wish to evaluate the accuracy of multiple ratings in the absence
of a “gold standard.” Consider a situation in which independent judges must rate
the creativity of a piece of artwork. Because there is no objective rule to indicate the
“true” creativity of a piece of art, a minimum first step in establishing that there is such
a thing as creativity is to demonstrate that independent raters can at least reliably
classify objects according to how well they meet the assumptions of the construct.
Thus, independent observers must subjectively interpret the work of art and rate the
degree to which an underlying construct (e.g., creativity) is present. In situations such
as these, the establishment of interrater reliability becomes a goal in and of itself. If a
researcher is able to demonstrate that independent parties can reliably rate objects
along the continuum of the construct, this provides some good objective evidence for
the existence of the construct. A natural subsequent step is to analyze individual scores
according to the criteria.

Finally, a third reason for conducting an interrater reliability study is to validate how
well ratings reflect a known “true” state of affairs (e.g., a validation study). For example,
suppose that a researcher believes that he or she has developed a new colon cancer
screening technique that should be highly predictive. The first thing the researcher
might do is train another provider to use the technique and compare the extent to
which the independent rater agrees with him or her on the classification of people
who have cancer and those who do not. Next, the researcher might attempt to predict
the prevalence of cancer using a formal diagnosis via more traditional methods (e.g.,
biopsy) to compare the extent to which the new technique is accurately predicting the
diagnosis generated by the known technique. In other words, the reason for conducting
an interrater reliability study in this circumstance is because it is not enough that
independent raters have high levels of interrater reliability; what really matters is the
level of reliability in predicting the actual occurrence of cancer as compared with a “gold
standard”—in this case, the rate of classification based on an established technique.

Once you have determined the primary purpose for conducting an interrater reliability
study, the next step is to consider the nature of the data that you have or will collect.

What Is the Nature of Your Data?
There are four important points to consider with regard to the nature of your data. First,
it is important to know whether your data are considered nominal, ordinal, interval, or
ratio (Stevens, 1946). Certain statistical techniques are better suited to certain types
of data. For example, if the data you are evaluating are nominal (i.e., the differences
between the categories you are rating are qualitative), then there are relatively few
statistical methods for you to choose from (e.g., percent agreement, Cohen's kappa).
If, on the other hand, the data are measured at the ratio level, then the data
meet the criteria for use by most of the techniques discussed in this chapter.

Once you have determined the type of data used for the rating scale, you should then
examine the distribution of your data using a histogram or bar chart. Are the ratings
of each rater normally distributed, uniformly distributed, or skewed? If the rating data
exhibit restricted variability, this can severely affect consistency estimates as well
as consensus–based estimates, threatening the validity of the interpretations made
from the interrater reliability estimates. Thus, it is important to have some idea of the
distribution of ratings in order to select the best statistical technique for analyzing the
data.

The third important thing to investigate is whether the judges who rated the data agreed
on the underlying trait definition. For example, if two raters are judging the creativity
of a piece of artwork, one rater may believe that creativity is 50% novelty and 50%
task appropriateness. By contrast, another rater may judge creativity to consist of
50% novelty, 35% task appropriateness, and 15% elaboration. These differences in
perception will introduce extraneous error into the ratings. The extent to which your
raters are defining the construct in a similar way can be empirically evaluated using
measurement approaches to interrater reliability (e.g., factor analysis, a procedure that is further described later in this chapter).

Finally, even if the raters agree as to the structure of the construct, are they using the rating categories the same way? In other words, do they assign people into the same category along the continuum, or does one judge assign a person "poor" in mathematics while another judge classifies that same person as "good"? This can be evaluated using consensus estimates (e.g., kappa, via tests of marginal homogeneity).

After specifying the purpose of the study and thinking about the nature of the data that will be used in the analysis, the final question to ask is the pragmatic question of what resources you have at your disposal.

What Resources Do You Have at Your Disposal?

As most people know from their life experience, "best" does not always mean most expensive or most resource intensive. Similarly, within the context of interrater reliability, it is not always necessary to choose a technique that yields the maximum amount of information or that requires sophisticated statistical analyses in order to gain useful information. There are times when a crude estimate may yield sufficient information, for example, within the context of a low-stakes, exploratory research study. There are other times when the estimates must be as precise as possible, for example, within the context of situations that have direct, important stakes for the participants in the study.

The question of resources often has an influence on the way that interrater reliability studies are conducted. For example, if you are a newcomer who is running a pilot study to determine whether to continue on a particular line of research, and time and money are limited, then a simpler technique such as the percent agreement or even correlational estimates may be the best match. On the other hand, if you are in a situation where you have a high-stakes test that needs to be graded relatively quickly, and money is not a major issue, then a more advanced measurement approach (e.g., the many-facets Rasch model) is most likely the best selection.

As an additional example, if the goal of your study is to understand the underlying nature of a construct that to date has no objective, agreed-on definition (e.g., wisdom), then achieving consensus among raters in applying a scoring criterion will be of paramount importance. By contrast, if the goal of the study is to generate summary scores for individuals that will be used in later analyses, and it is not critical that raters come to exact agreement on how to use a rating scale, then consistency or measurement estimates of interrater reliability will be sufficient.

Summary

Once you have answered the three main questions discussed in this section, you will be in a much better position to choose a suitable technique for your project.

Choosing the Best Approach for the Job

Many textbooks in the field of educational and psychological measurement and statistics (e.g., Anastasi & Urbina, 1997; Cohen, Cohen, West, & Aiken, 2003; Crocker & Algina, 1986; Hopkins, 1998; von Eye & Mun, 2004) describe interrater reliability as if it were a unitary concept lending itself to a single, "best" approach across all situations. Yet the methodological literature related to interrater reliability constitutes a hodgepodge of statistical techniques, each of which provides a particular kind of solution to the problem of establishing interrater reliability. In the next section of this chapter, we will discuss (a) the most popular statistics used to compute interrater reliability, (b) the computation and interpretation of the results of statistics using worked examples, (c) the implications for summarizing data that follow from each technique, and (d) the advantages and disadvantages of each technique.

Building on the work of Uebersax (2002) and J. R. Hayes and Hatch (1999), Stemler (2004) has argued that the wide variety of statistical techniques used for computing interrater reliability coefficients may be theoretically classified into one of three broad categories: (a) consensus estimates, (b) consistency estimates, and (c) measurement estimates. Statistics associated with these three categories differ in their assumptions about the purpose of the interrater reliability study, the nature of the data, and the implications for summarizing scores from various raters.

Consensus Estimates of Interrater Reliability

Consensus estimates are often used when one is attempting to demonstrate that a construct that traditionally has been considered highly subjective (e.g., wisdom, creativity, hate) can be reliably captured by independent raters. The assumption is that if independent raters are able to come to exact agreement about how to apply the various levels of a scoring rubric (which operationally defines behaviors associated with the construct), then the two judges may be said to share a common interpretation of the construct. Furthermore, if two independent judges demonstrate high levels of agreement in their application of a scoring rubric to rate behaviors, then this provides some defensible evidence for the existence of the construct.

Consensus estimates tend to be the most useful when data are nominal in nature and different levels of the rating scale represent qualitatively different ideas. Consensus estimates also can be useful when different levels of the rating scale are assumed to represent a linear continuum of the construct but are ordinal in nature (e.g., a Likert-type scale). In such cases, the judges must come to exact agreement about each of the quantitative levels of the construct under investigation.

The three most popular types of consensus estimates of interrater reliability found in the literature include (a) percent agreement and its variants, (b) Cohen's kappa and its variants (Agresti, 1996; Cohen, 1960, 1968; Krippendorff, 2004), and (c) odds ratios. Other less frequently used statistics that fall under this category include Jaccard's J and the G-Index (see Barrett, 2001).

Percent Agreement. Perhaps the most popular method for computing a consensus estimate of interrater reliability is through the use of the simple percent agreement statistic. The percent agreement statistic has several advantages: it has a strong intuitive appeal, it is easy to calculate, and it is easy to explain. For example, in a study examining creativity, Sternberg and Lubart (1995) asked sets of judges to rate the level of creativity associated with each of a number of products generated by study participants (e.g., draw a picture illustrating Earth from an insect's point of view, write an essay based on the title "2983"). The goal of their study was to demonstrate that creativity could be detected and objectively scored with high levels of agreement across independent judges. The authors reported percent agreement levels across raters of .92 (Sternberg & Lubart, 1995, p. 31).

The statistic also has some distinct disadvantages. If the behavior of interest has a low or high incidence of occurrence in the population, then it is possible to get artificially inflated percent agreement figures simply because most of the values fall under one category of the rating scale (J. R. Hayes & Hatch, 1999). Another disadvantage to using the simple percent agreement figure is that it is often time-consuming and labor-intensive to train judges to the point of exact agreement.

One popular modification of the percent agreement figure found in the testing literature involves broadening the definition of agreement by including the adjacent scoring categories on the rating scale. For example, some testing programs include writing sections that are scored by judges using a rating scale with levels ranging from 1 (low) to 6 (high) (College Board, 2006). If a percent adjacent agreement approach were used to score this section of the exam, this would mean that the judges would not need to come to exact agreement about the ratings they assign to each participant; rather, so long as the ratings did not differ by more than one point above or below the other judge, the two judges would be said to have reached consensus. Thus, if Rater A assigns an essay a score of 3 and Rater B assigns the same essay a score of 4, the two raters are close enough together to say that they "agree," even though their agreement is not exact.

The rationale for the adjacent percent agreement approach is often a pragmatic one. It is extremely difficult to train independent raters to come to exact agreement, no matter how good one's scoring rubric. Yet raters often give scores that are "pretty close" to the same.

2s. In other words. All Rights Reserved. it seems logical to average the scores assigned by Rater B and Rater C. this seems. Rater A is systematically one point lower than Rater B. again. the average score would be the same. 1994). on the surface. the scores assigned by raters must be evenly distributed across all possible score categories. the difference between raters must be randomly distributed across items. the thinking is that if we have a situation in which two raters never differ by more than one score point in assigning their ratings. 5s. and 6s across the population of essays that they have read. Yet. thereby demonstrating “interrater reliability. suppose that dozens of raters are used to score the essays.” Which student would you rather be? The one who was lucky enough to draw the B/C rater combination or the one who unfortunately was scored by the A/B combination? Page 11 of 45 Best practices in quantitative methods: 3 Best Practices in Interrater Reliability Three Common Approaches . and we do not want to discard this information. this could lead to a situation in which the validity of the resultant summary scores is dubious (see the box below). Imagine that Rater C is also called in to rate the same essay for a different sample of students. either of these assumptions is violated.. Consider a situation in which Rater A systematically assigns scores that are one power lower than Rater B. both raters should give equal numbers of ‘s. 4s. In other words. However. On the surface. Second.University of Arizona ©2008 SAGE Publications. Rater C is paired up with Rater B within the context of an overlapping design to maximize rater efficiency (e. even though neither combination of raters differed by more than one score point in their ratings. SAGE Research Methods the same. Assume that they have each rated a common set of 100 essays. This logic holds under two conditions. If both of these assumptions are met. however. McArdle. First. In other words. If. Inc. to be defensible because it really does not matter whether Rater A or Rater B is assigning the high or low score because even if Rater A and Rater B had no systematic difference in severity of ratings. Suppose that we find a situation in which Rater B is systematically lower than Rater C in assigning grades. we now find ourselves in a situation in which the students rated by the Rater B/C pair score systematically one point higher than the students rated by the Rater A/B pair. and Rater B is systematically one point lower than Rater C. Rater A should not give systematically lower scores than Rater B.g. If we average the scores of the two raters across all essays to arrive at individual student scores. then the adjacent percent agreement approach is defensible. 3s. Thus. then we have a justification for taking the average score across all ratings.

Thus. whereas a rating of 5 would allow you to potentially “agree” with three scores [p. the percent agreement expected by chance jumps to 25%. Now let us examine what happens if the second assumption of the adjacent percent agreement approach is violated.. If only four categories are used. it is not enough to demonstrate adjacent percent agreement between rater pairs. 6) has two potential scores with which it can overlap (i. 4. a 1–4 scale).. and you are told that you will be retained only if you are able to demonstrate interrater reliability with everyone else.. a 6–point scale that uses adjacent percent agreement scoring is most likely functionally equivalent to a 4–point scale that uses exact agreement scoring. This approach is advantageous in that it relaxes the strict criterion that the judges agree exactly. On the other hand. Inc. In other words. SAGE Research Methods Thus.. If three categories. reducing the overall variability in scores given across the spectrum of participants. 5 or 6). you would naturally look for your best strategy to maximize interrater reliability. a 33% chance agreement is expected. If the rating scale has a limited number of points.g. it is entirely likely that the scale will go from being a 6–point scale to a 4–point scale. it must also be demonstrated that there is no systematic difference in rater severity between the rater set pairs. 5. two participants are expected to agree on ratings by chance alone only 17% of the time. thereby maximizing your chances of agreeing with the second rater. if two categories. in order to make a validity argument for summarizing the results of multiple raters..University of Arizona ©2008 SAGE Publications. If you are a rater for a large testing company. 34 ↓ ] (i. For example. Why? Because a rating at the extreme end of the scale (e. a rating of 1 or a rating of 6). percent agreement using adjacent categories can lead to inflated estimates of interrater reliability if there are only a limited number of categories to choose from (e. If you are then told that your scores can differ by no more than one point from the other raters.e. Page 12 of 45 Best practices in quantitative methods: 3 Best Practices in Interrater Reliability Three Common Approaches . This can be demonstrated (and corrected for in the final score) through the use of the many–facet Rasch model. then the percent agreement statistics will be artificially inflated due to chance factors. when a scale is 1 to 6. a 50% chance agreement is expected. All Rights Reserved.e.g.e. you would quickly discover that your best bet then is to avoid giving any ratings at the extreme ends of the scale (i. or 6). When the scale is reduced to 1 to 4.

Cohen's Kappa. Another popular consensus estimate of interrater reliability is Cohen's kappa statistic (Cohen, 1960, 1968). Cohen's kappa was designed to estimate the degree of consensus between two judges and to determine whether the level of agreement is greater than would be expected to be observed by chance alone (see Stemler, 2001, for a practical example with calculation). Kappa is a highly useful statistic when one is concerned that the percent agreement statistic may be artificially inflated due to the fact that most observations fall into a single category.

Kappa is often useful within the context of exploratory research. For example, Stemler and Bebell (1999) conducted a study aimed at detecting the various purposes of schooling articulated in school mission statements. Judges were given a scoring rubric that listed 10 possible thematic categories under which the main idea of each mission statement could be classified (e.g., social development, cognitive development, civic development). Judges then read a series of mission statements and attempted to classify each sampling unit according to the major purpose of schooling articulated. If both judges consistently rated the dominant theme of the mission statement as representing elements of citizenship, then they were said to have communicated with each other in a meaningful way because they had both classified the statement in the same way. If one judge classified the major theme as social development and the other judge classified the major theme as citizenship, then a breakdown in shared understanding occurred; the judges were not coming to a consensus on how to apply the levels of the scoring rubric. The authors chose to use the kappa statistic to evaluate the degree of consensus because they did not expect the frequency of the major themes of the mission statements to be evenly distributed across the 10 categories of their scoring rubric.

The interpretation of the kappa statistic is slightly different from the interpretation of the percent agreement figure (Agresti, 1996). A value of zero on kappa does not indicate that the two judges did not agree at all; rather, it indicates that the two judges did not agree with each other any more than would be predicted by chance alone. Consequently, it is possible to have negative values of kappa if judges agree less often than chance would predict.

Although some authors (Landis & Koch, 1977) have offered guidelines for interpreting kappa values, other authors (Krippendorff, 2004; Uebersax, 2002) have argued that the kappa values for different items or from different studies cannot be meaningfully compared unless the base rates are identical. Consequently, it is difficult to apply rules of thumb for interpreting kappa across different circumstances. Instead, these authors suggest that although the statistic gives some indication as to whether the agreement is better than that predicted by chance alone, researchers using the kappa coefficient should look at it for an up or down evaluation of whether ratings are different from chance but should not get too invested in its interpretation (Uebersax, 2002).

Krippendorff (2004) has introduced a new coefficient alpha into the literature that claims to be superior to kappa because alpha is capable of incorporating the information from multiple raters, dealing with missing data, and yielding a chance-corrected estimate of interrater reliability. The major disadvantage of Krippendorff's alpha is that it is computationally complex; however, statistical macros that compute Krippendorff's alpha have been created and are freely available (K. Hayes, 2006). In addition, some research suggests that in practice, alpha values tend to be nearly identical to kappa values (Dooley, 2006).

Odds Ratios. A third consensus estimate of interrater reliability is the odds ratio. The odds ratio is most often used in circumstances where raters are making dichotomous ratings (e.g., presence/absence of a phenomenon), although it can be extended to ordered category ratings. In a 2 × 2 contingency table, the odds ratio indicates how much the odds of one rater making a given rating (e.g., positive/negative) increase for cases when the other rater has made the same rating.

For example, suppose that in a music competition with 100 contestants, Rater 1 gives 90 of them a positive score for vocal ability, while in the same sample of 100 contestants, Rater 2 only gives 20 of them a positive score for vocal ability. The odds of Rater 1 giving a positive vocal ability score are 90 to 10, or 9:1 = 9.0, while the odds of Rater 2 giving a positive vocal ability score are only 20 to 80, or 1:4 = 0.25. The odds ratio is therefore 9/0.25 = 36.0. From the perspective of interrater reliability, it would be most desirable to have an odds ratio that is close to 1.0, which would indicate that Rater 1 and Rater 2 rated the same proportion of contestants as having high vocal ability. Within the context of interrater reliability, the important idea captured by the odds ratio is whether it deviates substantially from 1.0.

The larger the odds ratio value, the larger the discrepancy there is between raters in terms of their level of consensus. The odds ratio has the advantage of being easy to compute and of being familiar from other statistical applications (e.g., logistic regression). The disadvantage to the odds ratio is that it is most intuitive within the context of a 2 × 2 contingency table with dichotomous rating categories. Although the technique can be generalized to ordered category ratings, it involves extra computational complexity that undermines its intuitive advantage. Furthermore, as Osborne (2006) has pointed out, the interpretation of the statistic is not always easy to convey, particularly to a lay audience.

Computing Common Consensus Estimates of Interrater Reliability

Let us now turn to a practical example of how to calculate each of these coefficients. As an example data set, we will draw from Stemler, Grigorenko, Jarvin, and Sternberg's (2006) study in which they developed augmented versions of the Advanced Placement Psychology Examination. Participants were required to complete a number of essay items that were subsequently scored by different sets of raters. Essay Question 1, Part d, asked participants to give advice to a friend who is having trouble sleeping, based on what they know about various theories of sleep. The item was scored using a 5-point scoring rubric. For this particular item, 75 participants received scores from two independent raters.

Percent Agreement. Percent agreement is calculated by adding up the number of cases that received the same rating by both judges and dividing that number by the total number of cases rated by the two judges. Using SPSS, one can run the crosstabs procedure and generate a table to facilitate the calculation (see Table 3.1). The percent agreement on this item is 42%; the percent adjacent agreement is 87%.
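For readers working outside SPSS, the percent agreement and percent adjacent agreement calculations can be sketched in a few lines of Python. The ratings below are hypothetical stand-ins, not the actual Stemler et al. (2006) data:

# Illustrative ratings on the 0-4 rubric for ten hypothetical essays;
# the actual study used 75 essays scored by two raters.
rater1 = [0, 1, 2, 2, 3, 4, 4, 1, 2, 3]
rater2 = [0, 2, 2, 3, 3, 4, 3, 1, 1, 4]

pairs = list(zip(rater1, rater2))
exact = sum(a == b for a, b in pairs) / len(pairs)              # identical scores
adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)  # scores within one point

print(f"percent agreement: {exact:.0%}")                 # 50% for these made-up data
print(f"percent adjacent agreement: {adjacent:.0%}")     # 100% for these made-up data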

Cohen's Kappa. The formula for computing Cohen's kappa is listed in Formula 1:

κ = (P_A − P_C) / (1 − P_C)

where P_A = the proportion of units on which the raters agree, and P_C = the proportion of units for which agreement is expected by chance. It is possible to compute Cohen's kappa in SPSS by simply specifying in the crosstabs procedure the desire to produce Cohen's kappa (see Table 3.1). For this data set, the kappa value is .23, which indicates that the two raters agreed on the scoring only slightly more often than we would predict based on chance alone.

Table 3.1 SPSS Code and Output for Percent Agreement and Percent Adjacent Agreement and Cohen's Kappa
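The kappa computation in Formula 1 can likewise be sketched directly, taking P_A from the observed agreement and P_C from the product of each rater's marginal proportions; the ratings are again hypothetical:

from collections import Counter

def cohens_kappa(r1, r2):
    # kappa = (P_A - P_C) / (1 - P_C), following Formula 1.
    n = len(r1)
    p_a = sum(a == b for a, b in zip(r1, r2)) / n                  # observed agreement
    m1, m2 = Counter(r1), Counter(r2)
    categories = set(r1) | set(r2)
    p_c = sum((m1[c] / n) * (m2[c] / n) for c in categories)       # chance agreement
    return (p_a - p_c) / (1 - p_c)

rater1 = [0, 1, 2, 2, 3, 4, 4, 1, 2, 3]   # hypothetical scores, not the study data
rater2 = [0, 2, 2, 3, 3, 4, 3, 1, 1, 4]
print(round(cohens_kappa(rater1, rater2), 2))   # roughly .37 for these made-up data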

Odds Ratios. The formula for computing an odds ratio is shown in Formula 2.
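As a sketch of the odds ratio idea, the snippet below mirrors the chapter's approach: each rater's scores are first dichotomized into failing (0-2) and passing (3-4), and the odds ratio is then taken as the ratio of the two raters' odds of a passing classification, as in the vocal-ability example earlier. The scores are hypothetical:

def passing(scores):
    # Dichotomize the 0-4 rubric: 0, 1, 2 -> failing (0); 3, 4 -> passing (1).
    return [1 if s >= 3 else 0 for s in scores]

def odds(p):
    return p / (1 - p)

# Hypothetical ratings for 10 essays (the study itself used 75).
rater1 = passing([0, 1, 2, 2, 3, 4, 4, 1, 2, 3])
rater2 = passing([0, 2, 2, 3, 3, 4, 3, 1, 1, 4])

p1 = sum(rater1) / len(rater1)   # proportion rated passing by Rater 1
p2 = sum(rater2) / len(rater2)   # proportion rated passing by Rater 2

odds_ratio = odds(p1) / odds(p2)   # ratio of the raters' odds of a passing rating
print(round(odds_ratio, 2))        # 0.67 here; values far from 1 flag a discrepancy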

The SPSS code for computing the odds ratio is shown in Table 3.2. In order to compute the odds ratio using the crosstabs procedure in SPSS, it was necessary to recode the data so that the ratings were dichotomous. Thus, ratings of 0, 1, and 2 were assigned a value of 0 (failing), while ratings of 3 and 4 were assigned a value of 1 (passing). The odds ratio for the current data set is 30, indicating that there was a substantial difference between the raters in terms of the proportion of students classified as passing versus failing.

Implications for Summarizing Scores From Various Raters

If raters can be trained to the point where they agree on how to assign scores from a rubric, then scores given by the two raters may be treated as equivalent, since high interrater reliability indicates that the judges agree about how to apply the rating scale. A typical guideline found in the literature for evaluating the quality of interrater reliability based on consensus estimates is that they should be 70% or greater. Consequently, the summary scores may be calculated by simply taking the score from one of the judges or by averaging the scores given by all of the judges. This fact has practical implications for determining the number of raters needed to complete a study. If raters are shown to reach high levels of consensus, then adding more raters adds little extra information from a statistical perspective and is probably not justified from the perspective of resources. Furthermore, the remaining work of rating subsequent items can be split between the raters without both raters having to score all items.

Table 3.2 SPSS Code and Output for Odds Ratios

Advantages of Consensus Estimates

One particular advantage of the consensus approach to estimating interrater reliability is that the calculations are easily done by hand.

. Inc. particularly in applications where exact agreement is unnecessary (e. In practice. A third advantage is that consensus estimates can be useful in diagnosing problems with judges’ interpretations of how to apply the rating scale. 38 ↓ ] consensus estimate of interrater reliability is that both judges need not score all remaining items. when reporting consensus–based interrater reliability estimates. All Rights Reserved. however. One implication of a high [p. A second disadvantage is that the amount of time and energy it takes to train judges to come to exact agreement is often substantial. For example. inspection of the information from a crosstab table may allow the researcher to realize that the judges may be unclear about the rules for when they are supposed to score an item as zero as opposed to when they are supposed to score the item as missing. it implies that both judges are essentially providing the same information.University of Arizona ©2008 SAGE Publications. SAGE Research Methods A second advantage is that the techniques falling within this general category are well suited to dealing with nominal variables whose levels on the rating scale represent qualitatively different categories. if the exact application of the levels of the scoring rubric is not important. Disadvantages of Consensus Estimates One disadvantage of consensus estimates is that interrater reliability statistics must be computed separately for each item and for each pair of judges. A visual analysis of the output allows the researcher to go back to the data and clarify the discrepancy or retrain the judges. but rather a means to the end of getting a summary score for each respondent). Page 20 of 45 Best practices in quantitative methods: 3 Best Practices in Interrater Reliability Three Common Approaches . and median estimates for all items and for all pairs of judges. For example.g. if there were 100 tests to be scored after the interrater reliability study was finished. one should report the minimum. maximum. When judges exhibit a high level of consensus. it is usually a good idea to build in a 30% overlap between judges even after they have been trained. it would be most efficient to ask Judge A to rate exams 1 to 50 and Judge B to rate exams 51 to 100 because the two judges have empirically demonstrated that they share a similar meaning for the scoring rubric. Consequently. in order to provide evidence that the judges are not drifting from their consensus as they read more items.

Third, consensus estimates can be overly conservative if two judges exhibit systematic differences in the way that they use the scoring rubric but simply cannot be trained to come to a consensus. As we will see in the next section, it is possible to have a low consensus estimate of interrater reliability while having a high consistency estimate, and vice versa. Consequently, sole reliance on consensus estimates of interrater reliability might lead researchers to conclude that "interrater reliability is low" when it may be more precisely stated that the consensus estimate of interrater reliability is low. Finally, as Linacre (2002) has noted, training judges to a point of forced consensus may actually reduce the statistical independence of the ratings and threaten the validity of the resulting scores.

Consistency Estimates of Interrater Reliability

Consistency estimates of interrater reliability are based on the assumption that it is not really necessary for raters to share a common interpretation of the rating scale, so long as each judge is consistent in classifying the phenomenon according to his or her own definition of the scale. For example, if Rater A assigns a score of 3 to a certain group of essays, and Rater B assigns a score of 1 to that same group of essays, the two raters have not come to a consensus about how to apply the rating scale categories, but the difference in how they apply the rating scale categories is predictable.

Consistency approaches to estimating interrater reliability are most useful when the data are continuous in nature, although the technique can be applied to categorical data if the rating scale categories are thought to represent an underlying continuum along a unidimensional construct. The three most popular types of consistency estimates are (a) correlation coefficients (e.g., Pearson, Spearman), (b) Cronbach's alpha (Cronbach, 1951), and (c) intraclass correlation. Values greater than .70 are typically acceptable for consistency estimates of interrater reliability (Barrett, 2001). For information regarding additional consistency estimates of interrater reliability, see Bock, Brennan, and Muraki (2002), Burke and Dunlap (2002), LeBreton, Burgess, Kaiser, Atchley, and James (2003), and Uebersax (2002).

Correlation Coefficients. Correlation coefficients measure the association between independent raters. Perhaps the most popular statistic for calculating the degree of consistency between raters is the Pearson correlation coefficient. The Pearson correlation coefficient can be computed by hand (Glass & Hopkins, 1996) or can easily be computed using most statistical packages. One beneficial feature of the Pearson correlation coefficient is that the scores on the rating scale can be continuous in nature (e.g., they can take on partial values such as 1.5). Values approaching +1 or -1 indicate that the two raters are following a systematic pattern in their ratings, while values approaching zero indicate that it is nearly impossible to predict the score one rater would give by knowing the score the other rater gave. Like the percent agreement statistic, however, the Pearson correlation coefficient can be calculated only for one pair of judges at a time and for one item at a time.

It is important to note that even though the correlation between scores assigned by two judges may be nearly perfect, there may be substantial mean differences between the raters. For example, two raters may differ in the absolute values they assign to each rating by two points; so long as there is a 2-point difference for each rating they assign, the raters will have achieved high consistency estimates of interrater reliability. In this case, then, a large value for a measure of association does not imply that the raters are agreeing on the actual application of the rating scale, only that they are consistent in applying the ratings according to their own unique understanding of the scoring rubric.

A potential limitation of the Pearson correlation coefficient is that it assumes that the data underlying the rating scale are normally distributed. Consequently, if the data from the rating scale tend to be skewed toward one end of the distribution, this will attenuate the upper limit of the correlation coefficient that can be observed. The Spearman rank coefficient provides an approximation of the Pearson correlation coefficient but may be used in circumstances where the data under investigation are not normally distributed. For example, rather than using a continuous rating scale, each judge may rank order the essays that he or she has scored from best to worst. Since both ratings being correlated are then in the form of rankings, a correlation coefficient can be computed that is governed by the number of pairs of ratings (Glass & Hopkins, 1996).

The major disadvantage to Spearman's rank coefficient is that it requires both judges to rate all cases.

Cronbach's Alpha. In situations where more than two raters are used, another approach to computing a consistency estimate of interrater reliability would be to compute Cronbach's alpha coefficient (Crocker & Algina, 1986). Cronbach's alpha coefficient is a measure of internal consistency reliability and is useful for understanding the extent to which the ratings from a group of judges hold together to measure a common dimension. If the Cronbach's alpha estimate among the judges is high, then chances are good that excellent interrater reliability has been achieved. If the Cronbach's alpha estimate among the judges is low, then this implies that the majority of the variance in the total composite score is really due to error variance and not true score variance (Crocker & Algina, 1986).

The major advantage of using Cronbach's alpha comes from its capacity to yield a single consistency estimate of interrater reliability across multiple judges. In addition, as Barrett (2001) has noted, "because of this 'averaging' of ratings, we reduce the variability of the judges' ratings such that when we average all judges' ratings, we effectively remove all the error variance for judges" (p. 7). The major disadvantage of the method is that each judge must give a rating on every case; in other words, if just one rater fails to score a particular individual, that individual will be left out of the analysis, or else the alpha will only be computed on a subset of the data.

Intraclass Correlation. A third popular approach to estimating interrater reliability is through the use of the intraclass correlation coefficient. The major advantage of the intraclass correlation is its capacity to incorporate information from different types of rater reliability data. An interesting feature of the intraclass correlation coefficient is that it confounds two ways in which raters differ: (a) consensus (or bias, i.e., mean differences) and (b) consistency (or association). As a result, the value of the intraclass correlation coefficient will be decreased in situations where there is a low correlation between raters and in situations where there are large mean differences between raters. For this reason, the intraclass correlation may be considered a conservative estimate of interrater reliability. On the other hand, as Uebersax (2002) has noted, "If the goal is to give feedback to raters to improve future ratings, one should distinguish between these two sources of disagreement" (p. 5).

Finally, it is important to note that, like the Pearson correlation coefficient, the intraclass correlation coefficient will be attenuated if assumptions of normality in the rating data are violated. In addition, because the intraclass correlation represents the ratio of within-subject variance to between-subject variance on a rating scale, the results may not look the same if raters are rating a homogeneous subpopulation as opposed to the general population. Simply by restricting the between-subject variance, the intraclass correlation will be lowered. Therefore, ICCs are not directly comparable across populations, and it is important to pay special attention to the population being assessed and to understand that this can influence the value of the intraclass correlation coefficient (ICC).

Computing Common Consistency Estimates of Interrater Reliability

Let us now turn to a practical example of how to calculate each of these coefficients. We will use the same data set and compute each estimate on the data.

Correlation Coefficients. The formula for computing the Pearson correlation coefficient is listed in Formula 3. Using SPSS, one can run the correlate procedure and generate a table similar to Table 3.3; one may request both Pearson and Spearman correlation coefficients. The Pearson correlation coefficient on this data set is .76, and the Spearman correlation coefficient is .74.

Table 3.3 SPSS Code and Output for Pearson and Spearman Correlations
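Outside SPSS, both consistency coefficients can be obtained with scipy (assumed to be available here). The ratings below are hypothetical and deliberately show Rater 2 scoring systematically higher than Rater 1, which leaves consistency high even though exact agreement is poor:

from scipy.stats import pearsonr, spearmanr

# Hypothetical ratings; Rater 2 runs about one point above Rater 1 throughout.
rater1 = [0, 1, 1, 2, 2, 3, 3, 4]
rater2 = [1, 2, 2, 3, 3, 4, 4, 4]

r_pearson, _ = pearsonr(rater1, rater2)
r_spearman, _ = spearmanr(rater1, rater2)
print(round(r_pearson, 2), round(r_spearman, 2))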

Cronbach's Alpha. The Cronbach's alpha value is calculated using Formula 4:

α = (N / (N − 1)) × (1 − Σσ²_yi / σ²_x)

where N is the number of components (raters), σ²_x is the variance of the observed total scores, and σ²_yi is the variance of component i. In order to compute Cronbach's alpha using SPSS, one may request it through the reliability procedure (see Table 3.4). For this example, the alpha value is .86.
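Formula 4 translates directly into code. In the sketch below, rows are examinees, columns are raters, and the data are hypothetical:

import numpy as np

def cronbach_alpha(ratings):
    # ratings: rows = examinees, columns = raters (components).
    # alpha = (N / (N - 1)) * (1 - sum of component variances / variance of total scores)
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings.shape[1]
    component_vars = ratings.var(axis=0, ddof=1).sum()   # sum of each rater's variance
    total_var = ratings.sum(axis=1).var(ddof=1)          # variance of the summed scores
    return (n_raters / (n_raters - 1)) * (1 - component_vars / total_var)

scores = [[0, 1], [1, 2], [2, 2], [2, 3], [3, 3], [4, 4], [4, 3], [1, 1]]
print(round(cronbach_alpha(scores), 2))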

41 ↓ ] Table 3.University of Arizona ©2008 SAGE Publications. All Rights Reserved.4 SPSS Code and Output for Cronbach's Alpha Page 26 of 45 Best practices in quantitative methods: 3 Best Practices in Interrater Reliability Three Common Approaches . Inc. SAGE Research Methods [p.

Intraclass Correlation. Formula 5 presents the equation used to compute the intraclass correlation value, where σ²(b) is the variance of the ratings between judges and σ²(w) is the pooled variance within raters. In order to compute the intraclass correlation, one may specify the procedure in SPSS using the code listed in Table 3.5. The intraclass correlation coefficient for this data set is .75.
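Because Formula 5 is defined in terms of between- and within-rater variance components, a rough sketch of the computation is possible with a one-way random-effects ICC. Note that this is only one of several ICC variants and may not match the exact form used in the chapter; the data are hypothetical:

import numpy as np

def icc_oneway(ratings):
    # One-way random-effects ICC(1,1):
    # (MS_between - MS_within) / (MS_between + (k - 1) * MS_within),
    # where rows are the rated targets and columns are the raters.
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    ss_between = k * ((row_means - grand) ** 2).sum()
    ss_within = ((x - row_means[:, None]) ** 2).sum()
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

scores = [[0, 1], [1, 2], [2, 2], [2, 3], [3, 3], [4, 4], [4, 3], [1, 1]]
print(round(icc_oneway(scores), 2))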

Table 3.5 SPSS Code and Output for Intraclass Correlation

Implications for Summarizing Scores From Various Raters

It is important to recognize that although consistency estimates may be high, the means and medians of the different judges may be very different. Thus, if one judge consistently gives scores that are 2 points lower on the rating scale than does a second judge, the scores will ultimately need to be corrected for this difference in judge severity if the final scores are to be summarized or subjected to further analyses.

Advantages of Consistency Estimates

There are three major advantages to using consistency estimates of interrater reliability. First, the approach places less stringent demands on the judges in that they need not be trained to come to exact agreement with one another, so long as each judge is consistent within his or her own definition of the rating scale (i.e., exhibits high intrarater reliability). A second advantage of consistency estimates is that certain methods within this category (e.g., Cronbach's alpha) allow for an overall estimate of consistency among multiple judges. The third advantage is that consistency estimates readily handle continuous data. Consistency estimates also allow for the detection of systematic differences between judges, which may then be adjusted statistically. For example, if Judge A consistently gives scores that are 2 points lower than Judge B does, then adding 2 extra points to the exams of all students who were scored by Judge A would provide an equitable adjustment to the raw scores, regardless of whether the two judges exhibit exact agreement.

It is sometimes the case that the exact application of the levels of the scoring rubric is not important in itself. Instead, the scoring rubric is a means to the end of creating scores for each participant that can be summarized in a meaningful way. If summarization is the goal, then what is most important is that each judge apply the rating scale consistently within his or her own definition of the rating scale.

Disadvantages of Consistency Estimates

One disadvantage of consistency estimates is that if the construct under investigation has some objective meaning, then it may not be desirable for the two judges to "agree to disagree." In that case, it may be important for the judges to come to an exact agreement on the scores that they are generating.

A second disadvantage of consistency estimates is that judges may differ not only systematically in the raw scores they apply but also in the number of rating scale categories they use. In that case, a mean adjustment for a severe judge may provide a partial solution, but the two judges may also differ on the variability in the scores they give, and a mean adjustment alone will not effectively correct for this difference.

A third disadvantage of consistency estimates is that they are highly sensitive to the distribution of the observed data. In other words, if most of the ratings fall into one or two categories, the correlation coefficient will necessarily be deflated due to restricted variability. Consequently, a reliance on the consistency estimate alone may lead the researcher to falsely conclude that interrater reliability was poor, without specifying more precisely that the consistency estimate of interrater reliability was poor and providing an appropriate rationale.

Measurement Estimates of Interrater Reliability

Measurement estimates are based on the assumption that one should use all of the information available from all judges (including discrepant ratings) when attempting to create a summary score for each respondent. As Linacre (2002) has noted, "It is the accumulation of information, not the ratings themselves, that is decisive" (p. 858). In other words, under the measurement approach, each judge is seen as providing some unique information that is useful in generating a summary score for a person. Thus, it is not necessary for two judges to come to a consensus on how to apply a scoring rubric, because differences in judge severity can be estimated and accounted for in the creation of each participant's final score. Measurement estimates are also useful in circumstances where multiple judges are providing ratings and it is impossible for all judges to rate all items. They are best used when different levels of the rating scale are intended to represent different levels of an underlying unidimensional construct (e.g., mathematical competence).

The two most popular types of measurement estimates are (a) factor analysis and (b) the many-facets Rasch model (Linacre, 1994; Linacre, Engelhard, Tatum, & Myford, 1994; Myford & Cline, 2002) or log-linear models (von Eye & Mun, 2004).

Factor Analysis. One popular measurement estimate of interrater reliability is computed using factor analysis (Harman, 1967). Using this method, multiple judges may rate a set of participants. The judges' scores are then subjected to a common factor analysis in order to determine the amount of shared variance in the ratings that could be accounted for by a single factor. If the shared variance is high (e.g., greater than 60%), then this gives some indication that the judges are rating a common construct. The percentage of variance that is explainable by the first factor gives some indication of the extent to which the multiple judges are reaching agreement. The technique can also be used to check the extent to which judges agree on the number of underlying dimensions in the data set. Once interrater reliability has been established in this way, each participant may then receive a single summary score corresponding to his or her loading on the first principal component underlying the set of ratings. This score can be computed automatically by most statistical packages. The advantage of this approach is that it assigns a summary score for each participant that is based only on the relevance of the strongest dimension underlying the data. The disadvantage to the approach is that it assumes that ratings are assigned without error by the judges.

Many-Facets Rasch Measurement and Log-Linear Models. A second measurement approach to estimating interrater reliability is through the use of the many-facets Rasch model (Linacre, 1994). Recent advances in the field of measurement have led to an extension of the standard Rasch measurement model (Rasch, 1960/1980; Wright & Stone, 1979). This new, extended model, known as the many-facets Rasch model, allows judge severity to be derived using the same scale (i.e., the logit scale) as person ability and item difficulty. Using a many-facets analysis, the equivalence of the ratings between judges can be empirically determined. In other words, rather than simply assuming that a score of 3 from Judge A is equally difficult for a participant to achieve as a score of 3 from Judge B, it could be the case that a score of 3 from Judge A is really closer to a score of 5 from Judge B (i.e., Judge A is a more severe rater). Thus, each essay item or behavior that was rated can be directly compared.
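A sketch of the factor-analytic approach described above: each column holds one judge's ratings of the same objects, and the share of variance captured by the leading component of the judges' correlation matrix indicates how strongly the judges line up on a single dimension. This uses a plain principal components decomposition rather than a full common factor analysis, and the ratings are hypothetical:

import numpy as np

# Hypothetical ratings: rows are 6 rated objects, columns are 3 judges.
ratings = np.array([
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
    [2, 3, 2],
], dtype=float)

# Eigen-decompose the judges' correlation matrix; the leading eigenvalue's share
# of the total shows how much rating variance one dimension accounts for.
corr = np.corrcoef(ratings, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]
share_first = eigenvalues[0] / eigenvalues.sum()
print(f"variance explained by first component: {share_first:.0%}")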

In other words, the technique puts rater severity on the same scale as item difficulty and person ability (i.e., the logit scale). The mathematical representation of the many-facets Rasch model is fully described in Linacre (1994).
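For readers who want the model in front of them, a common way of writing it is shown below. This is a standard presentation from the Rasch measurement literature and is offered only as a sketch; the notation is ours and is not quoted from Linacre (1994).

    \ln\!\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = B_n - D_i - C_j - F_k

Here P_nijk is the probability that person n receives a rating in category k (rather than k-1) from judge j on item i, B_n is the ability of person n, D_i is the difficulty of item i, C_j is the severity of judge j, and F_k is the difficulty of the step from category k-1 to category k. Because all parameters are expressed in logits, judge severity, item difficulty, and person ability can be compared directly on a single scale.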

The many-facets Rasch approach has several advantages. First, even if judges differ in their interpretation of the rating scale, this feature allows for the computation of a single final summary score that is already corrected for rater severity. Consequently, this provides a distinct advantage over generalizability studies, since the goal of a generalizability study is to determine the error variance associated with each judge's ratings. That approach, however, was developed for use in contexts in which only estimates of population parameters are of interest to researchers. As Linacre (1994) has noted, "For this to be useful, examinees must be regarded as randomly sampled from some population of examinees, which means that there is no way to correct an individual examinee's score for judge behavior, so that correction can be made to ratings awarded by a judge when he is the only one to rate an examinee, in a way which would be helpful to an examining board" (p. 29).

Second, in addition to providing information that allows for the evaluation of the severity of each judge in relation to all other judges, the facets approach also allows one to evaluate the extent to which each of the individual judges is using the scoring rubric in a manner that is internally consistent (i.e., an estimate of intrarater reliability). The item fit statistics provide some estimate of the degree to which each individual rater was applying the scoring rubric in an internally consistent manner; in other words, the fit statistics will indicate the extent to which a given judge is faithful to his or her own definition of the scale categories across items and people. In addition, high fit statistic values are an indication of rater drift over time.

Third, the technique works with multiple raters and does not require all raters to evaluate all objects. So long as there is sufficient connectedness in the data set (Engelhard, 1997), the severity of all raters can be evaluated relative to each other. In other words, the technique is well suited to overlapping research designs, which allows the researcher to use resources more efficiently. Finally, the severity of all judges who rated the items, as well as the difficulty of each item, can be directly compared. For example, if a history exam included five essay questions and each of the essay questions was rated by 3 judges (2 unique judges per item and 1 judge who scored all items), the facets approach would allow the researcher to directly compare the severity of a judge who rated only Item 1 with the severity of a judge who rated only Item 4. Each of the 11 judges (2 unique judges per item + 1 judge who rated all items = 5 × 2 + 1 = 11) could be directly compared. The major disadvantage of the many-facets Rasch approach is that it is computationally intensive and therefore is best implemented using specialized statistical software (Linacre, 1988). In addition, the technique is best suited to data that are ordinal in nature.

Computing Common Measurement Estimates of Interrater Reliability

Measurement estimates of interrater reliability tend to be much more computationally complex than consensus or consistency estimates. Consequently, rather than present the detailed formulas for each technique in this section, we instead refer to some excellent sources that are devoted to fully expounding the detailed computations involved. This will allow us to focus on the interpretation of the results of each of these techniques.

Factor Analysis. The mathematical formulas for computing factor-analytic solutions are expounded in several excellent texts (e.g., Harman, 1967; Kline, 1998). When using factor analysis to estimate interrater reliability, the data set should be structured in such a way that each column corresponds to the score given by Rater X on Item Y, and each object that was rated receives its own row. In other words, if five raters were to score three essays from 100 students, the data set should contain 15 columns (e.g., Rater1_Item1, Rater1_Item2, Rater2_Item1, and so on) and 100 rows. In this example, we would run a separate factor analysis for each essay item (e.g., a 5 × 100 data matrix).
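As a rough illustration of this data layout and of the "shared variance of the first factor" computation, the following sketch shows how the first-factor variance and participant summary scores might be obtained for a single essay item rated by five raters. It is ours rather than the chapter's; the simulated ratings and column interpretation (Rater1_Item1 through Rater5_Item1) are hypothetical, and it uses a simple eigen-decomposition of the correlation matrix rather than the SPSS procedure referenced in Table 3.6.

    import numpy as np

    # Hypothetical ratings: 100 students (rows) x 5 raters (columns) for one essay item,
    # i.e., the columns Rater1_Item1 ... Rater5_Item1 of the layout described above.
    rng = np.random.default_rng(0)
    true_score = rng.normal(size=(100, 1))                  # latent essay quality
    ratings = true_score + 0.5 * rng.normal(size=(100, 5))  # each rater adds noise

    # Correlation matrix of the five raters' scores.
    R = np.corrcoef(ratings, rowvar=False)

    # Eigen-decomposition of the correlation matrix (principal components).
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Proportion of variance explained by the first factor: the "shared variance"
    # interpreted as evidence of interrater agreement (e.g., greater than 60%).
    shared_variance = eigvals[0] / eigvals.sum()
    print(f"Variance explained by first factor: {shared_variance:.0%}")

    # Each student's summary score: the score on the first principal component
    # of the standardized ratings.
    z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0, ddof=1)
    first_factor_scores = z @ eigvecs[:, 0]
    print(first_factor_scores[:5])

In practice the same two quantities, the percentage of variance explained by the first factor and each participant's score on that factor, are what the SPSS output in Table 3.6 is read for.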

Table 3.6 shows the SPSS code and output for running the factor analysis procedure. The example output presented here is derived from the larger Stemler et al. (2006) data set. There are two important pieces of information generated by the factor analysis. The first is the value of the explained variance in the first factor. In the example output, the shared variance of the first factor is 76%, indicating that, to an important degree, independent raters agree on the underlying nature of the construct being rated, which is evidence of interrater reliability. In practice, it may turn out that the variance in ratings is distributed over more than one factor. If that is the case, then this provides some evidence to suggest that the raters are not interpreting the underlying construct in the same manner (e.g., recall the example about creativity mentioned earlier in this chapter). The second important piece of information comes from the factor loadings. Each object that has been rated will have a loading on each underlying factor. Assuming that the first factor explains most of the variance, the score to be used in subsequent analyses should be the loading on the primary factor.

Many-Facets Rasch Measurement. In practice, the many-facets Rasch model is best implemented through the use of specialized software (Linacre, 1988). The mathematical formulas for computing results using the many-facets Rasch model may be found in Linacre (1994). An example output of a many-facets Rasch analysis, also derived from the Stemler et al. (2006) data set, is listed in Table 3.7. The key values to interpret within the context of the many-facets Rasch approach are rater severity measures and fit statistics. Rater severity indices are useful for estimating the extent to which systematic differences exist between raters with regard to their level of severity. In the example output, rater CL was the most severe rater, with an estimated severity measure of +0.89 logits. Consequently, students whose test items were scored by CL would be more likely to receive lower raw scores than students who had the same test item scored by any of the other raters used in this project. At the other extreme, rater AP was the most lenient rater, with a rater severity measure of -0.91 logits.

Consequently, students who had their exams scored by AP would be more likely to receive substantially higher raw scores than if the same item were rated by any of the other raters. If these differences were not taken into account when calculating student ability, simply using raw scores would lead to biased estimates of student proficiency, since student estimates would depend, to an important degree, on which rater scored their essay. The facets program corrects for these differences and incorporates them into student ability estimates.

Table 3.6 SPSS Code and Output for Factor Analysis

The results presented in Table 3.7 show that there is about a 1.5-logit spread in systematic differences in rater severity (from -0.91 to +0.89). Consequently, assuming that all raters are defining the rating scales they are using in the same way is not a tenable assumption, and differences in rater severity must be taken into account in order to come up with precise estimates of student ability.

In addition to providing information that allows us to evaluate the severity of each rater in relation to all other raters, the facets approach also allows us to evaluate the extent to which each of the individual raters is using the scoring rubric in a manner that is internally consistent (i.e., intrarater reliability). In other words, even if raters differ in their own definition of how they use the scale, the fit statistics will indicate the extent to which a given rater is faithful to his or her own definition of the scale categories across items and people.

Table 3.7 Output for a Many-Facets Rasch Analysis

Rater fit statistics are presented in columns 5 and 6 of Table 3.7. Fit statistics provide an empirical estimate of the extent to which the expected response patterns for each individual match the observed response patterns. These fit statistics are interpreted much the same way as item or person infit statistics are interpreted (Bond & Fox, 2001; Wright & Stone, 1979).
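For reference, the mean-square fit statistics discussed here are commonly defined as shown below. This is the standard Rasch formulation; the notation is ours and is not taken from Table 3.7. For the N_j ratings involving rater j, with observed rating x, model-expected rating E, and model variance W for each rating:

    \text{Outfit MS}_j = \frac{1}{N_j}\sum \frac{(x - E)^2}{W}, \qquad
    \text{Infit MS}_j = \frac{\sum (x - E)^2}{\sum W}

Outfit is an unweighted average of squared standardized residuals, while infit weights each residual by its model variance, so it is less sensitive to a few wildly unexpected ratings. Both indices have an expected value of 1.0 under the model, which is why values well above 1 signal noise and values well below 1 signal muted, overly predictable response patterns, as interpreted in the text that follows.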

An infit value greater than 1.4 indicates that there is 40% more variation in the data than predicted by the Rasch model; conversely, an infit value of 0.5 indicates that there is 50% less variation in the data than predicted by the Rasch model. Infit mean squares greater than 1.3 indicate that there is more unpredictable variation in the raters' responses than we would expect based on the model, while infit mean square values less than 0.7 indicate that there is less variation in the raters' responses than we would predict based on the model. Myford and Cline (2002) note that high infit values may suggest that ratings are noisy as a result of the raters' overuse of the extreme scale categories (i.e., the lowest and highest values on the rating scale), while low infit mean square indices may be a consequence of overuse of the middle scale categories (e.g., moderate response bias). The expectation for the mean square index is 1.0, and the range is 0 to infinity (Myford & Cline, 2002, p. 14). The infit mean square is an information-weighted index and the outfit mean square is an unweighted index; both are unstandardized. By contrast, the infit and outfit standardized indices are standardized toward a unit-normal distribution. These standardized indices are sensitive to sample size, and, consequently, the accuracy of the standardization is data dependent.

The results in Table 3.7 reveal that 6 of the 12 raters had infit mean-square indices that exceeded 1.3. Raters CL (infit of 3.4), JW (infit of 2.4), and AM (infit of 2.2) appear particularly problematic. Their high infit values suggest that these raters are not using the scoring rubrics in a consistent way. The table of misfitting ratings provided by the facets computer program output allowed for an investigation of the exact nature of the highly unexpected response patterns associated with each of these raters. This table provides information on discrepant ratings based on two criteria: (a) how the other raters scored the item and (b) the particular rater's typical level of severity in scoring items of similar difficulty.

Implications for Summarizing Scores From Various Raters

Measurement estimates allow for the creation of a summary score for each participant that represents that participant's score on the underlying factor of interest, taking into account the extent to which each judge influences the score.

Advantages of Measurement Estimates

There are several advantages to estimating interrater reliability using the measurement approach. First, the summary scores generated from measurement estimates of interrater reliability tend to more accurately represent the underlying construct of interest than do the simple raw score ratings from the judges. Second, measurement estimates effectively handle ratings from multiple judges by simultaneously computing estimates across all of the items that were rated, as opposed to calculating estimates separately for each item and each pair of judges. Third, unlike the percent agreement figure or the correlation coefficient, measurement estimates can take into account errors at the level of each judge or for groups of judges. Furthermore, measurement estimates have the distinct advantage of not requiring all judges to rate all items in order to arrive at an estimate of interrater reliability. Rather, judges may rate a particular subset of items, and as long as there is sufficient connectedness (Linacre, 1994; Linacre et al., 1994) across the judges and ratings, it will be possible to directly compare judges.

Disadvantages of Measurement Estimates

The major disadvantage of measurement estimates is that they are unwieldy to compute by hand; measurement approaches typically require the use of specialized software. In addition, the file structure required to use facets is somewhat counterintuitive. A second disadvantage is that certain methods for computing measurement estimates (e.g., facets) can handle only ordinal-level data.

Summary and Conclusion

In this chapter, we have attempted to outline a framework for thinking about interrater reliability as a multifaceted concept. There are multiple techniques for computing interrater reliability, each with its own assumptions and implications, and we believe that there is no silver bullet "best" approach for its computation. Rather, each technique (and class of techniques) has its own strengths and weaknesses, depending on the goals of the study.

Consensus estimates of interrater reliability (e.g., percent agreement, Cohen's kappa, odds ratios) are generally easy to compute and useful for diagnosing rater disparities. However, training raters to exact consensus requires substantial time and energy and may not be entirely necessary. As Snow, Cook, Lin, Morgan, and Magaziner (2005) have noted, "Percent/proportion agreement is affected by chance, kappa and weighted kappa are affected by low prevalence of condition of interest, and correlations are affected by low variability, distribution shape, and mean shifts" (p. 1682).

Consistency estimates of interrater reliability (e.g., Cronbach's alpha, Pearson and Spearman correlations, and intraclass correlations) are familiar and fairly easy to compute. They have the additional advantage of not requiring raters to agree perfectly with each other; they require only consistent application of a scoring rubric within raters, and systematic variance between raters is easily tolerated. The disadvantage of consistency estimates, however, is that they are sensitive to the distribution of the data (the more it departs from normality, the more attenuated the results). Furthermore, even if one achieves high consistency estimates, further adjustment to an individual's raw scores may be required in order to arrive at an unbiased final score that may be used in subsequent data analyses.

Measurement estimates of interrater reliability (e.g., factor analysis, many-facets Rasch measurement) can deal effectively with multiple raters, easily derive adjusted summary scores that are corrected for rater severity, and allow for highly efficient designs (e.g., not all raters need to rate all objects). This comes, however, at the expense of added computational complexity and increased demands on resources (e.g., time and expertise).

In the end, the best technique will always depend on (a) the goals of the analysis (e.g., the stakes associated with the study outcomes), (b) the nature of the data, and (c) the desired level of information based on the resources available. The answers to these three questions will help to determine how many raters one needs, whether the raters need to be in perfect agreement with each other, and how to approach creating summary scores across raters.

We conclude this chapter with a brief table intended to provide rough interpretive guidance with regard to acceptable interrater reliability values (see Table 3.8). These values simply represent conventions the authors have encountered in the literature and via discussions with colleagues and reviewers. The conventions articulated here assume that the interrater reliability study is part of a low-stakes, exploratory research study. Keep in mind, however, that these guidelines are just rough estimates and will vary depending on the purpose of the study and the stakes associated with the outcomes.

Table 3.8 General Guidelines for Interpreting Various Interrater Reliability Coefficients

Notes

1. Also known as interobserver or interjudge reliability or agreement.
2. Readers interested in this model can refer to Chapters 4 and 5 on Rasch measurement for more information.

References

Agresti, A. (1996). An introduction to categorical data analysis. New York: John Wiley.
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Barrett, P. (2001, March). Assessing the reliability of rating data. Retrieved June 16, 2003, from http://www.liv.ac.uk/~pbarrett/rater.pdf
Bock, R. D., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26(4), 364-375.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model. Mahwah, NJ: Lawrence Erlbaum.
Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A user's guide. Organizational Research Methods, 5(2), 159-172.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

College Board. (2006). How the essay is scored. Retrieved November 4, 2006, from http://www.collegeboard.com/student/testing/sat/about/sat/essay_scoring.html
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Dooley, K. (2006). Questionnaire Programming Language: Interrater reliability report. Retrieved November 4, 2006, from http://qpl.gao.gov/ca050404.htm
Engelhard, G. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.
Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology. Boston: Allyn & Bacon.
Harman, H. H. (1967). Modern factor analysis. Chicago: University of Chicago Press.
Hayes, A. F. (2006). SPSS macro for computing Krippendorff's alpha. Retrieved November 4, 2006, from http://www.comm.ohio-state.edu/ahayes/SPSS%20programs/kalpha.htm
Hayes, J. R., & Hatch, J. A. (1999). Issues in measuring reliability: Correlation versus percentage of agreement. Written Communication, 16(3), 354-367.
Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston: Allyn & Bacon.
Kline, R. B. (1998). Principles and practice of structural equation modeling. New York: Guilford.
Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411-433.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

LeBreton, J. M., Burgess, J. R. D., Kaiser, R. B., Atchley, E. K., & James, L. R. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6(1), 80-128.
Linacre, J. M. (1988). FACETS: A computer program for many-facet Rasch measurement (Version 3.0). Chicago: MESA Press.
Linacre, J. M. (1994). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (2002). Judge ratings with forced agreement. Rasch Measurement Transactions, 16(1), 857-858.
Linacre, J. M., Englehard, G., Tatem, D. S., & Myford, C. M. (1994). Measurement with judges: Many-faceted conjoint measurement. International Journal of Educational Research, 21(4), 569-577.
McArdle, J. J. (1994). Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research, 29(4), 409-454.
Myford, C. M., & Cline, F. (2002, April). Looking for patterns in disagreements: A facets analysis of human raters' and e-raters' scores on essays written for the Graduate Management Admission Test (GMAT). Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Osborne, J. W. (2006). Bringing balance and technical accuracy to reporting odds ratios and the results of logistic regression analyses. Practical Assessment, Research & Evaluation, 11(7). Retrieved from http://pareonline.net/getvn.asp?v=11&n=17
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press. (Original work published 1960)
Snow, A. L., Cook, K. F., Lin, P.-S., Morgan, R. O., & Magaziner, J. (2005). Proxies and other external raters: Methodological considerations. Health Services Research, 40(5), 1676-1693.

Stemler, S. E. (2001). An overview of content analysis. Practical Assessment, Research & Evaluation, 7(17). Retrieved from http://ericae.net/pare/getvn.asp?v=7&n=17
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved from http://pareonline.net/getvn.asp?v=9&n=4
Stemler, S. E., & Bebell, D. (1999, April). An empirical approach to understanding and analyzing the mission statements of selected educational institutions. Paper presented at the New England Educational Research Organization (NEERO), Portsmouth, NH.
Stemler, S. E., Grigorenko, E. L., Jarvin, L., & Sternberg, R. J. (2006). Using the theory of successful intelligence as a basis for augmenting AP exams in psychology and statistics. Contemporary Educational Psychology, 31(2), 75-108.
Sternberg, R. J., & Lubart, T. I. (1995). Defying the crowd: Cultivating creativity in a culture of conformity. New York: Free Press.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Uebersax, J. S. (2002). Statistical methods for rater agreement. Retrieved August 9, 2002, from http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm
von Eye, A., & Mun, E. Y. (2004). Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Lawrence Erlbaum.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.

http://www.jmde.com/

Articles

Quantitative Methods for Estimating the Reliability of Qualitative Data

Jason W. Davey, Fenwal, Inc.
P. Cristian Gugiu, Western Michigan University
Chris L. S. Coryn, Western Michigan University

Background: Measurement is an indispensable aspect of conducting both quantitative and qualitative research and evaluation. With respect to qualitative research, measurement typically occurs during the coding process. Although some qualitative researchers disagree with such applications, a link between the qualitative and quantitative fields is successfully established through data collection and coding procedures.

Purpose: This paper presents quantitative methods for determining the reliability of conclusions drawn from qualitative data sources.

Setting: Not applicable.

Intervention: Not applicable.

Research Design: Case study.

Data Collection and Analysis: Narrative data were collected from a random sample of 528 undergraduate students and 28 professors.

Findings: The calculation of the kappa statistic, weighted kappa statistic, ANOVA binary intraclass correlation, and Kuder-Richardson 20 is illustrated through a fictitious example. Formulae are presented so that the researcher can calculate these estimators without the use of sophisticated statistical software.

Keywords: qualitative coding; reliability coefficients; qualitative methodology

Journal of MultiDisciplinary Evaluation, Volume 6, Number 13, ISSN 1556-8180, February 2010

The rejection of using quantitative methods for assessing the reliability of qualitative findings by some qualitative researchers is both frustrating and perplexing from the vantage point of quantitative and mixed method researchers. Statisticians are portrayed as detached and neutral investigators, while qualitative researchers are portrayed as embracing personal viewpoints and even biases to describe and interpret the subjective experience of the phenomena they study (Miller, 2008). Qualitative methods are often characterized as flexible, inductive, and multifaceted, whereas quantitative methods are characterized as inflexible and fixed, and statisticians are faulted for paying too much attention to measures of central tendency (e.g., the mean and median) at the expense of interesting outliers. While parts of these characterizations do, indeed, differentiate between the two groups of researchers, the reasoning sometimes used to justify the wholesale rejection of all concepts associated with quantitative analysis is replete with mischaracterizations, overreaching arguments, and inadequate substitutions.

One of the first lines of attack against the use of quantitative analysis centers on philosophical arguments. Many qualitative researchers have come to view quantitative methods as characteristic of a positivist paradigm (e.g., Davis, 2008; Healy & Perry, 2000; Stenbacka, 2001), a term that has come to take on a derogatory connotation.
Paley (2008), for example, states that "doing quantitative research entails commitment to a particular ontology and to a belief in a single, objective reality that can be described by universal laws" (p. 649). While the need to distinguish one methodological approach from another is understandable, this is not the same as believing in one all-inclusive truth. Rather, most statisticians believe that multiple truths may exist and that, despite the best efforts of the researcher, these truths are measured with some degree of error. If that were not the case, statisticians would ignore interaction effects, fail to consider whether differences may exist between groups, and assume that measurement errors do not exist; in practice, most statisticians consider all of these factors before formulating their conclusions. Quantitative analysis should not be synonymous with the positivist paradigm, because statistical inference is concerned with probabilistic, as opposed to deterministic, explanations (Healy & Perry, 2000). Other qualitative researchers have come to equate positivism, and by extension the use of statistical methods, with causal explanations. However, although the gold standard for substantiating causal claims is a well-conducted experimental design, the implementation of an experimental design does not necessitate the use of quantitative analysis, and quantitative analysis may be conducted for any type of research design. Moreover, certain statistical methods lend themselves to, and were even specifically developed for, the analysis of qualitative data (e.g., reliability analysis). The distinction between objective research and subjective research, which also appears to emerge from this paradigm debate, does little to separate the two camps either: the formulas used to conduct quantitative analyses do not know or care whether the data were gathered using an objective rather than a subjective method.

Yet the wholesale rejection of all concepts perceived to be quantitative has extended to general research concepts like reliability and validity. For some qualitative researchers (e.g., Stenbacka, 2001), "reliability has no relevance in qualitative research" because "the purpose in qualitative research never is to measure anything. A qualitative method seeks for a certain quality that is typical for a phenomenon or that makes the phenomenon different from others" (p. 551). Stenbacka (2001) also objected to traditional forms of validity on the grounds that, in qualitative research, "it is impossible to differentiate between researcher and method" (p. 552). It would seem to the present authors that this notion is inconsistent with traditional qualitative research and, as is the central premise of this paper, inaccurate, because measurement is performed in a qualitative context as well. In part, the rejection may be attributed to a misunderstanding on the part of many researchers as to what measurement is. Measurement is the process of assigning numbers, symbols, or codes to phenomena (e.g., ideas, features, events, behaviors) based on a set of prescribed rules (i.e., a coding rubric). Measurement is an indispensable aspect of conducting research, regardless of whether it is quantitative or qualitative. From the perspective of quantitative research, measurement typically occurs during data collection; with respect to qualitative research, measurement occurs during the coding process. While the importance of coding to qualitative research is self-evident to all those who have conducted such research, the role of measurement may not be as obvious.

The coding process refers to the steps the researcher takes to identify, arrange, and systematize the ideas, concepts, and categories uncovered in the data. Coding consists of identifying potentially interesting events, features, phrases, behaviors, or stages of a process and distinguishing them with labels. As Benaquisto (2008) noted:
These are then further differentiated or integrated so that they may be reworked into a smaller number of categories, relationships, and patterns so as to tell a story or communicate conclusions drawn from the data. (p. 85)

There is nothing inherently quantitative about this process, nor does describing it in these terms limit qualitative research in any way. In the absence of a coding rubric, moreover, researchers would be forced to provide readers with all of the data, which, of course, would place the burden of interpretation on the reader.

Illustrating the integral nature of coding in qualitative research, suppose that a researcher conducts an interview with an informant who states that "the bathrooms in the school are very dirty." Now further suppose that the researcher developed a coding rubric, which, for the sake of simplicity, only contained two levels: cleanliness and academic performance. Whether the researcher chooses to assign this statement a checkmark or a 1 for the cleanliness category, and an 'X' or 0 (zero) for the academic performance category, does not make a difference; what matters is that the informant's statement addressed the first level (cleanliness) and not the second. The researcher clearly used his or her judgment to transform the raw statement made by the informant into a code.

In doing so, he or she also performed a measurement process. Like quantitative research, then, qualitative research depends upon measurement to render judgments. Furthermore, if one accepts this line of reasoning, three questions may be asked. First, does statement X fit the definition of code Y? Second, how many of the statements collected fit the definition of code Y? And third, how reliable is the definition of code Y for differentiating between statements? Various means can be used to establish a high degree of agreement within and across researchers (i.e., intrarater and interrater reliability, respectively).

Not every qualitative researcher has accepted Stenbacka's notion. Like quantitative researchers, qualitative researchers compete for funding and therefore must persuade funders of the accuracy of their methods and results (Cheek, 2008). Consequently, the concepts of reliability and validity permeate qualitative research, although, owing in part to the desire to differentiate itself from quantitative research, qualitative researchers have espoused the use of "interpretivist alternative" terms (Seale, 1999). Some of the most popular terms substituted for reliability include credibility, confirmability, dependability, and replicability (Coryn, 2007; Given & Saumure, 2008; Golafshani, 2003; Jensen, 2008; Lincoln & Guba, 1985; Miller, 2008; Morse, Barrett, Mayan, Olson, & Spiers, 2002; Saumure & Given, 2008).

Credibility is concerned with the harmony between the raw data and the researcher's interpretations and conclusions. Various means can be used to enhance credibility, including accurately and richly describing data, citing negative cases, using multiple researchers to review and critique the analysis and findings, and conducting member checks (Given & Saumure, 2008; Jensen, 2008). Dependability recognizes that the most appropriate research design cannot be completely predicted a priori; researchers may need to alter their research design to meet the realities of the research context in which they conduct the study, as compared to the context they predicted to exist a priori (Jensen, 2008). Dependability can be addressed by providing a rich description of the research procedures and instruments used so that other researchers may be able to collect data in similar ways. Confirmability is concerned with confirming that the researcher's interpretations and conclusions are grounded in actual data that can be verified (Jensen, 2008), for example, through the use of audit trails. Finally, replicability is concerned with repeating a study on participants from a background similar to that of the original study, asking similar questions, and coding data in a similar fashion to the original study (Firmin, 2008). Researchers may address this reliability indicator by conducting the new study on participants with similar demographic variables. The idea being that if a different set of researchers use similar
research methodology and data sources, then they should reach similar conclusions (Given & Saumure, 2008).

A closer review of these qualitative alternative terms, however, reveals them to be only indirectly associated with quantitative notions of reliability, such as interrater and intrarater reliability, test-retest reliability, internal consistency, and interclass correlations, to name a few (Crocker & Algina, 1986; Hopkins, 1998). Although replicability is conceptually equivalent to test-retest reliability, the other three terms appear to describe research processes tangentially related to reliability. As substitutes for reliability, they also have two major liabilities. First, they fail to consider interrater reliability, which, in our experience, accounts for a considerable amount of the variability in findings in qualitative studies. In other words, do different raters interpret qualitative data in similar ways? Second, they place the burden of assessing reliability squarely on the reader. For example, if a reader wanted to determine the confirmability of a finding, they would need to review the audit trail and make an independent assessment; similar reviews of the data would be necessary if a reviewer wanted to assess the credibility of a finding or the dependability of a study design.

Interrater reliability is concerned with the degree to which different raters or coders appraise the same information (e.g., phrases, events, behaviors) in the same way (van den Hoonaard, 2008).
The process of conducting an interrater reliability analysis, which is detailed in the next section, is relatively straightforward. Essentially, at least two or more raters must independently rate all of the qualitative data using the coding rubric. Although collaboration, in the form of consensus agreement, may be used to finalize ratings after each rater has had an opportunity to rate all data, each rater must work independently of the others to reduce bias in the first phase of analysis. In our experience, this task is greatly facilitated by use of a database system that (1) displays the smallest codable unit of a transcript (e.g., a single sentence), (2) presents the available coding options, and (3) records the rater's code before displaying the next codable unit. Moreover, this is an indispensable process for attaining a reasonable level of interrater reliability.

An example of the perils of not attending to this issue may be found in an empirical study conducted by Armstrong, Gosling, Weinman, and Marteau (1997). Armstrong and his colleagues invited six experienced qualitative researchers from Britain and the United States to analyse a transcript (approximately 13,500 words long) from a focus group comprised of adults living with cystic fibrosis that was convened to discuss the topic of genetic screening. In return for a fee, each researcher was asked to prepare an independent report in which they identified and described the main themes, up to a maximum of five, that emerged from the focus group discussion. Beyond these instructions, each researcher was permitted to use any method for extracting the main themes they felt was appropriate. Once the reports were submitted, they were thematically analyzed by one of the authors, who deliberately abstained from reading the original transcript to reduce external bias.

The results uncovered by Armstrong and his colleagues paint a troubling picture. Five of the six researchers identified five themes, while one identified four, so a reasonable level of consensus in the identification of themes was achieved; only four themes are discussed in the article: visibility, ignorance, health service provision, and genetic screening. There was unanimous agreement for the visibility and genetic screening themes, while agreement rates were slightly lower for the ignorance and health service provision themes (83% and 67%, respectively). In general, these are good rates of agreement. Of equal importance to the presence of each theme, however, are the importance accorded to themes, the labels used to describe them, and their structural organization, and a deeper examination of the findings revealed two troubling issues.

First, a significant amount of disagreement existed with respect to how the themes were organized. Some researchers classified a theme as a basic structure, whereas others organized it under a larger basic structure (i.e., gave it less importance than the overarching theme they assigned it to). Second, a significant amount of disagreement existed with respect to the manner in which themes were interpreted. For example, some of the researchers felt that the ignorance theme suggested a need for further education, whereas the remainder interpreted it differently. Similar inconsistencies with regard to interpretability occurred for the genetic screening theme, where three researchers indicated that genetic screening provided parents with choice, while one linked it with the eugenic threat.
These results serve as an example of the importance of reducing the variability in research findings attributed solely to researcher variability. In other words, are the researcher's findings truly grounded in the data, or do they reflect his or her personal ideological perspectives? For a politician, for example, knowing the answer to this question may mean the difference between passing and rejecting a policy that allows parents to genetically test embryos. In judging the quality of a research finding, it is reasonable to assume that it is desirable to differentiate between the perspectives of the informants and those of the researcher, which requires knowledge of the degree to which consensus is reached by knowledgeable researchers; with respect to an individual researcher, one may also consider examining intrarater reliability. There certainly may be instances in which reliability is not important, because one is only interested in the findings of a specific researcher and the perspectives of others are not desired, for example, to illustrate the degree to which "reality" is relative to the researcher doing the interpretation. However, where reliability of findings across different researchers is a desirable quality, the likelihood of achieving a reasonable level of reliability will be low if each researcher develops and applies his or her own coding scheme, simply due to researcher differences (e.g., in the interpretation of data). That being the case, it would greatly benefit qualitative researchers to utilize a common coding rubric, particularly if consensus is reached beforehand by all the researchers on the rubric that will be used to code all the data. Use of a common coding rubric does not greatly interfere with normal qualitative procedures.

That said, although researchers remain the instrument by which qualitative data are interpreted (Brodsky, 2008), reliability estimates should not be regarded as products of a positivist position. In fact, reporting the discrepancies and consistencies between researchers can be thought of as a form of methodological triangulation and can help illustrate the degree to which reality is socially constructed or not. However, three issues still remain. First, reporting the findings of multiple researchers will only permit readers to get an approximate sense of the level of interrater reliability or whether it meets an acceptable standard. Second, reporting the findings of multiple researchers places the burden of synthesis on the reader; where appropriate and possible, researchers should implement a method to synthesize all the findings through a consensus-building procedure or by averaging results. Third, comparisons between the reliability of one qualitative study and another are impractical for complex studies, and judging the reliability of a study requires that deidentified data be made available to anyone who requests it. While no one, to the best of our knowledge, has studied the degree to which this is practiced, our experience suggests it is not prevalent in the research community. Fortunately, simple quantitative solutions exist that enable researchers to report the reliability of their conclusions rather than shift the burden to the reader. The present paper will expound upon four quantitative methods for calculating interrater reliability that can be specifically applied to qualitative data. Interrater reliability can roughly be conceptualized as the degree to which the variability of research findings is, or is not, due to differences in researchers: data whose interpretations are consistent will likely produce high reliability estimates, whereas data that are subject to a wide range of interpretations will likely produce low reliability estimates.

Method

Data Collection Process

Narrative data were collected from 528 undergraduate students and 28 professors randomly selected from a university population, with the help of an open-ended survey that asked respondents to identify the primary challenges facing the university that should be immediately addressed by the university's administration. Data were transcribed from the surveys to an electronic database (Microsoft Access) programmed to resemble the original questionnaire. Validation checks were performed by upper-level graduate students to assess the quality of the data entry process, and corrections to the database were made by the graduate students in the few instances in which discrepancies were found between the responses noted on the survey and those entered in the database. Due to the design of the original questionnaire, which encouraged respondents to bullet their responses, little additional work was necessary to further break responses into the smallest codable units (typically 1-3 sentences). Even so, it was possible for the smallest codable units to contain multiple themes, although the average number of themes was less than two per unit of analysis.

Coding Procedures

Coding qualitative data is an arduous task that requires iterative passes through the raw data in order to generate a reliable and comprehensive coding rubric. This task was conducted by two experienced qualitative researchers who independently read the original narratives and identified primary and secondary themes, categorized these themes based on their perception of the internal structure (open coding; Benaquisto, 2008), and produced labels for each category and subcategory based on the underlying data (axial coding; Benaquisto, 2008). Following this initial step, the two researchers attempted an initial coding of the raw data to determine (1) the ease with which the coding rubric could be applied, (2) problem areas that needed further clarification, (3) the trivial categories that could be eliminated or integrated with other categories, (4) the categories that could be further refined to make important distinctions, and (5) the overall coverage of the coding rubric. Not surprisingly, several iterations were necessary before the coding rubric was finalized, at which point the researchers further differentiated or integrated their individual coding rubrics (selective coding; Benaquisto, 2008) into a unified coding rubric. Using the unified coding rubric, coders indicate whether a particular theme either is or is not present in the data. When two or more individuals code data to identify such themes and patterns, the reliability of the coders' efforts can be determined, typically by coefficients of agreement. This type of estimate can be used as a measure that objectively permits a researcher to substantiate that his or her coding scheme is replicable.

Statistical Procedures

In this paper, four methods that can be utilized to assess the reliability of binomial coded agreement data are presented: the kappa statistic (κ), the weighted kappa statistic (κW), the ANOVA binary intraclass correlation coefficient (ICC), and the Kuder-Richardson 20 (KR-20). Most estimators for gauging the reliability of agreement data evolved predominately from psychometric theory (e.g., Gulliksen, 1950; Lord & Novick, 1968; Rozeboom, 1966). The kappa statistic was one of the first statistics developed for assessing the reliability of binomial data between two or more coders (Cohen, 1960), and similar methods for binomial agreement data shortly followed (Cohen, 1968; Fleiss, 1971). A modified version of this statistic, the weighted kappa, introduced the use of numerical weights, allowing the user to apply different probability weights to cells in a contingency table and thereby assign different levels of importance to various coding frequencies (Fleiss, Cohen, & Everitt, 1969). Newer forms of these estimators were later developed to handle more explicit patterns in agreement data (Fleiss & Cuzick, 1979; Kleinman, 1973; Lipsitz, Laird, & Brennan, 1994; Mak, 1988; Nelder & Pregibon, 1987; Smith, 1983; Tamura & Young, 1987; Yamamoto & Yanagimoto, 1992). The ANOVA binary ICC is based on the mean squares from an analysis of variance (ANOVA) model modified for binomial data (Elston, 1977). Very often, reliability estimates are presented for only a single category, and, for ease of illustration, that convention is followed here.

The last estimator, the Kuder-Richardson 20, was developed by Kuder and Richardson (1937); because it was the 20th numbered formula in their seminal article, it is commonly known as KR-20 or KR(20). This estimator is based on the ratios of agreement to the total discrete variance.

These four reliability statistics are functions of i x j contingency tables, also known as cross-tabulation tables. Because the coding patterns come from two coders and the coded responses are binomial (i.e., a theme either is or is not present in a given interview response), the contingency table has two rows (i = 2) and two columns (j = 2). The general layout of this table is provided in Table 1. The first cell, denoted (i1 = Present, j1 = Present), consists of the total frequency of cases where both Coder 1 and Coder 2 agree that a theme is present in the interview responses. The second cell, denoted (i1 = Present, j2 = Not Present), consists of the total frequency of cases where Coder 2 feels that a theme is present and the first coder does not agree with this assessment. The third cell, denoted (i2 = Not Present, j1 = Present), consists of the total frequency of cases where Coder 1 feels that a theme is present in the interview responses and the second coder does not agree with this assessment. The fourth cell, denoted (i2 = Not Present, j2 = Not Present), consists of the total frequency of cases where both Coder 1 and Coder 2 agree that a theme is not present in the interview responses (Soeken & Prescott, 1986).

Table 1
General Layout of Binomial Coder Agreement Patterns for Qualitative Data

                                       Coder 1
                                       Theme present (j1)    Theme not present (j2)
Coder 2    Theme present (i1)          Cell11                Cell21
           Theme not present (i2)      Cell12                Cell22

Participants

Interview data were collected for, and transcribed from, 28 professors and 528 undergraduate students randomly selected from a university population. The current paper illustrates the use of these estimators for a study dataset that comprises the binomial coding patterns of two investigators; the binomial coding agreement patterns for the two groups of interview participants are provided in Table 2 and Table 3. For the professor and student interview participants, the coders agreed that one professor and 500 students provided a response that pertained to overall satisfaction with university facilities. Coder 2 felt that one additional professor and one additional student made a response pertinent to overall satisfaction, whereas Coder 1 did not feel that those responses pertained to the topic of interest. Conversely, Coder 1 felt that an additional seven professors and two students made a response pertinent to overall satisfaction, whereas Coder 2 did not agree with this assessment. Finally, Coder 1 and Coder 2 agreed that responses from the final 19 professors and 25 students did not pertain to the topic of interest.
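As a small illustration of how the cells of Table 1 can be tallied from raw codes, the sketch below (ours, not part of the article; the two code vectors are hypothetical) counts the four cells from two coders' binary present/absent judgments:

    from collections import Counter

    def agreement_table(coder1, coder2):
        """Tally a 2 x 2 coder agreement table from two binary code vectors.

        coder1, coder2: sequences of 1 (theme present) / 0 (theme not present),
        one entry per codable unit. Counts are keyed by (Coder 2, Coder 1),
        mirroring the row/column layout of Table 1.
        """
        counts = Counter(zip(coder2, coder1))
        return {
            "both_present":     counts[(1, 1)],  # Cell11
            "c2_present_only":  counts[(1, 0)],  # Cell21
            "c1_present_only":  counts[(0, 1)],  # Cell12
            "both_not_present": counts[(0, 0)],  # Cell22
        }

    # Hypothetical codes for six codable units.
    coder1 = [1, 1, 0, 0, 1, 0]
    coder2 = [1, 0, 0, 0, 1, 0]
    print(agreement_table(coder1, coder2))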

Table 2
Binomial Coder Agreement Patterns for Professor Interview Participants

                                       Coder 1
                                       Theme present (j1)    Theme not present (j2)
Coder 2    Theme present (i1)          1                     1
           Theme not present (i2)      7                     19

Table 3
Binomial Coder Agreement Patterns for Student Interview Participants

                                       Coder 1
                                       Theme present (j1)    Theme not present (j2)
Coder 2    Theme present (i1)          500                   1
           Theme not present (i2)      2                     25

Four Estimators for Calculating the Reliability of Qualitative Data

Kappa

According to Brennan and Hays (1992), the κ statistic "determines the extent of agreement between two or more judges exceeding that which would be expected purely by chance" (p. xx). This statistic is based on the observed and expected levels of agreement between two or more raters with two or more levels. The observed level of agreement (po) equals the frequency of records where both coders agree that a theme is present, plus the frequency of records where both coders agree that a theme is not present, divided by the total number of ratings. The expected level of agreement (pe) equals the summation of the cross products of the corresponding marginal probabilities; in other words, this is the expected rate of agreement by random chance alone. The kappa statistic (κ) then equals (po - pe)/(1 - pe). The traditional formulae for po and pe are

    po = Σ p_ii   (the sum of the diagonal cell proportions)
    pe = Σ p_i. × p_.i   (the sum of the products of the corresponding row and column marginal proportions),

where the sums run over the c categories, p_i. denotes the marginal proportion for the ith row, and p_.j denotes the marginal proportion for the jth column (Fleiss, 1971; Soeken & Prescott, 1986). These formulae are illustrated in Table 4.

Table 4
2 x 2 Contingency Table for the Kappa Statistic

                                   Coder 1                              Marginal Row Probabilities
                        Theme present      Theme not present
Coder 2  Theme present      c11                  c21                    p1. = (c11 + c21)/N
         Theme not present  c12                  c22                    p2. = (c12 + c22)/N
Marginal Column         p.1 = (c11 + c12)/N  p.2 = (c21 + c22)/N        N = c11 + c21 + c12 + c22
Probabilities

    po = (c11 + c22)/N     and     pe = p1. p.1 + p2. p.2

Estimates from professor interview participants for calculating the kappa statistic are provided in Table 5.

Table 5
Estimates from Professor Interview Participants for Calculating the Kappa Statistic

                                   Coder 1                              Marginal Row Probabilities
                        Theme present      Theme not present
Coder 2  Theme present       1                    1                     p1. = 2/556 = 0.0036
         Theme not present   7                   19                     p2. = 26/556 = 0.0468
Marginal Column         p.1 = 8/556 = 0.0144  p.2 = 20/556 = 0.0360     N = 28 + 528 = 556
Probabilities

The observed level of agreement for professors is (1 + 19)/556 = 0.0360. The expected level of agreement for professors is 0.0036(0.0144) + 0.0468(0.0360) = 0.0017. The observed level of agreement for students is (500 + 25)/556 = 0.9442. Estimates from student interview participants for calculating the kappa statistic are provided in Table 6.

The expected level of agreement for students is 0.9011(0.9029) + 0.0486(0.0468) = 0.8160.

Table 6
Estimates from Student Interview Participants for Calculating the Kappa Statistic

                                   Coder 1                              Marginal Row Probabilities
                        Theme present      Theme not present
Coder 2  Theme present      500                   1                     p1. = 501/556 = 0.9011
         Theme not present    2                  25                     p2. = 27/556 = 0.0486
Marginal Column         p.1 = 502/556 = 0.9029  p.2 = 26/556 = 0.0468   N = 556
Probabilities

The total observed level of agreement for the professor and student interview groups is po = 0.0360 + 0.9442 = 0.9802. The total expected level of agreement for the professor and student interview groups is pe = 0.0017 + 0.8160 = 0.8177. For the professor and student groups, the kappa statistic equals κ = (0.9802 - 0.8177)/(1 - 0.8177) = 0.891. The level of agreement between the two coders is 0.891 beyond that which is expected purely by chance.
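To make the arithmetic above easy to check, the following short Python sketch reproduces the grouped kappa computation exactly as laid out in Tables 5 and 6 (observed and expected agreement for each group computed against the pooled N of 556 and then summed). The function and variable names are illustrative and not part of any published package.

    # Grouped kappa for two 2 x 2 coder-agreement tables, following the
    # computation laid out in Tables 5 and 6 (pooled N in every denominator).

    def grouped_kappa(tables):
        N = sum(sum(sum(row) for row in t) for t in tables)   # pooled total (556)
        po = 0.0   # total observed agreement
        pe = 0.0   # total chance-expected agreement
        for t in tables:
            (c11, c21), (c12, c22) = t                  # rows = Coder 2, columns = Coder 1
            po += (c11 + c22) / N                       # diagonal agreement for this group
            p1_, p2_ = (c11 + c21) / N, (c12 + c22) / N # row marginals
            p_1, p_2 = (c11 + c12) / N, (c21 + c22) / N # column marginals
            pe += p1_ * p_1 + p2_ * p_2                 # chance agreement for this group
        return (po - pe) / (1 - pe)

    professors = [[1, 1], [7, 19]]    # Table 2
    students = [[500, 1], [2, 25]]    # Table 3
    print(round(grouped_kappa([professors, students]), 3))
    # prints 0.892; the 0.891 reported in the text reflects rounding of intermediate values

Running the sketch confirms that the grouped observed and expected agreement reproduce the reported kappa to within rounding.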

Weighted Kappa

The weighted kappa statistic, κW, has the same interpretation as the kappa statistic, but the researcher can differentially weight each cell to reflect varying levels of importance. According to Cohen (1968), κW is "the proportion of weighted agreement corrected for chance, to be used when different kinds of disagreement are to be differentially weighted in the agreement index" (p. xx). The weighted observed level of agreement (pow) equals the frequency of records where both coders agree that a theme is present times a weight, plus the frequency of records where both coders agree that a theme is not present times another weight, divided by the total number of ratings. The same logic can be applied where the coders disagree on the presence of a theme in participant responses. As an example, the frequencies of coding patterns where both raters agree that a theme is present can be given a larger weight than patterns where both raters agree that a theme is not present. The weighted expected level of agreement (pew) equals the summation of the cross products of the weighted marginal probabilities, where each cell in the contingency table has its own weight. The weighted kappa statistic κW then equals (pow - pew)/(1 - pew). The traditional formulae for pow and pew are

    pow = sum_{i=1}^{c} sum_{j=1}^{c} w_ij p_ij     and     pew = sum_{i=1}^{c} sum_{j=1}^{c} w_ij p_i. p_.j ,

where c denotes the total number of cells, i denotes the ith row, j denotes the jth column, and w_ij denotes the i,jth cell weight (Fleiss, Cohen, & Everitt, 1969; Cohen, 1968). These formulae are illustrated in Table 7.

Table 7
2 x 2 Contingency Table for the Weighted Kappa Statistic

                                   Coder 1                                    Marginal Row Probabilities
                        Theme present          Theme not present
Coder 2  Theme present      w11c11                  w21c21                    p1. = (w11c11 + w21c21)/N
         Theme not present  w12c12                  w22c22                    p2. = (w12c12 + w22c22)/N
Marginal Column         p.1 = (w11c11 + w12c12)/N  p.2 = (w21c21 + w22c22)/N  N = c11 + c21 + c12 + c22
Probabilities

There is no single standard for applying probability weights to each cell in a contingency table. Karlin, Cameron, and Williams (1981) provided three methods for weighting probabilities as applied to the calculation of a kappa statistic. The first method equally weights each pair of observations; these weights can be calculated as w_i = 1/(k n_i (n_i - 1)), where k is the number of groups (e.g., k = 2), n_i is the sample size of each cell, and N is the sum of the sample sizes from all of the cells of the contingency table. The second method equally weights each group (e.g., undergraduate students and professors), irrespective of its size; the formula for this weighting option is w_i = 1/(N(n_i - 1)). The last method weights each cell according to the sample size in each cell; this weight is calculated as w_i = n_i/N.

For this study, the probability weights were chosen arbitrarily. In the first row and first column, the probability weight is 0.80; this weight was chosen to reflect the overall level of importance of agreement that a theme is present as identified by both coders. In the first row and second column, the probability weight is 0.09, and in the second row and first column, the probability weight is 0.10; these two weights were used to reduce the impact of differing levels of experience in qualitative research between the two raters. In the second row and second column, the probability weight is 0.01; this weight was employed to reduce the effect of the lack of existence of a theme in the interview data. The probability weights used are provided in Table 8.

Table 8
Probability Weights on Binomial Coder Agreement Patterns for Professor and Student Interview Participants

                                        Coder 1
                              Theme Present (j1)   Theme Not Present (j2)
Coder 2  Theme Present (i1)        0.80                   0.09
         Theme Not Present (i2)    0.10                   0.01
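As a quick illustration of the three weighting options just described, the snippet below (a sketch; the helper name is ours) computes each candidate weight for cell or group sizes like those in this study. It is not the weighting actually used in the paper, which instead fixes the four cell weights at 0.80, 0.09, 0.10, and 0.01.

    # Illustrative computation of the three probability-weighting options of
    # Karlin, Cameron, and Williams (1981), for k = 2 groups of sizes 28 and 528.
    # The sizes below are assumptions for demonstration; the paper itself uses
    # fixed, arbitrarily chosen cell weights rather than any of these formulas.

    def karlin_weights(n_i, k, N):
        return {
            "equal_pairs": 1.0 / (k * n_i * (n_i - 1)),   # each pair of observations weighted equally
            "equal_groups": 1.0 / (N * (n_i - 1)),        # each group weighted equally
            "proportional": n_i / N,                      # weight proportional to sample size
        }

    sizes, k = [28, 528], 2
    N = sum(sizes)
    for n_i in sizes:
        print(n_i, karlin_weights(n_i, k, N))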

Estimates from professor interview participants for calculating the weighted kappa statistic are provided in Table 9.

Table 9
Estimates from Professor Interview Participants for Calculating the Weighted Kappa Statistic

                                   Coder 1                                    Marginal Row Probabilities
                        Theme present          Theme not present
Coder 2  Theme present    0.8(1) = 0.8            0.09(1) = 0.09              p1. = 0.89/556 = 0.0016
         Theme not present 0.1(7) = 0.7           0.01(19) = 0.19             p2. = 0.89/556 = 0.0016
Marginal Column         p.1 = 1.5/556 = 0.0027  p.2 = 0.28/556 = 0.0005       N = 28 + 528 = 556
Probabilities

The observed level of agreement for professors is [0.8(1) + 0.01(19)]/556 = 0.0018. The expected level of agreement for professors is 0.0016(0.0027) + 0.0016(0.0005) = 0.00001. The observed level of agreement for students is [0.8(500) + 0.01(25)]/556 = 0.7199. Estimates from student interview participants for calculating the weighted kappa statistic are provided in Table 10.

Table 10
Estimates from Student Interview Participants for Calculating the Weighted Kappa Statistic

                                   Coder 1                                    Marginal Row Probabilities
                        Theme present           Theme not present
Coder 2  Theme present    0.8(500) = 400           0.09(1) = 0.09             p1. = 400.09/556 = 0.7196
         Theme not present 0.1(2) = 0.2            0.01(25) = 0.25            p2. = 0.45/556 = 0.0008
Marginal Column         p.1 = 400.2/556 = 0.7198  p.2 = 0.34/556 = 0.0006     N = 28 + 528 = 556
Probabilities

The expected level of agreement for students is 0.7196(0.7198) + 0.0008(0.0006) = 0.5180. The total observed level of agreement for the professor and student interview groups is pow = 0.0018 + 0.7199 = 0.7217. The total expected level of agreement for the professor and student interview groups is pew = 0.00001 + 0.5180 = 0.5181. For the professor and student groups, the weighted kappa statistic equals κW = (0.7217 - 0.5181)/(1 - 0.5181) = 0.423. The level of agreement between the two coders is 0.423 beyond that which is expected purely by chance after applying importance weights to each cell. This reliability statistic is notably smaller than the unadjusted kappa statistic because of the down-weighting of cases where both coders agreed that the theme is not present in the interview responses.
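The weighted computation in Tables 8 through 10 can likewise be checked with a few lines of Python. The sketch below follows the paper's grouped formulation (weighted diagonal agreement and weighted marginal probabilities, every denominator the pooled N); the names are illustrative.

    # Grouped weighted kappa following Tables 8-10: weighted diagonal agreement
    # and weighted marginal probabilities, with every denominator the pooled N.

    W = [[0.80, 0.09],   # cell weights: rows = Coder 2, columns = Coder 1
         [0.10, 0.01]]

    def grouped_weighted_kappa(tables, w):
        N = sum(sum(sum(row) for row in t) for t in tables)
        pow_, pew = 0.0, 0.0
        (w11, w21), (w12, w22) = w
        for t in tables:
            (c11, c21), (c12, c22) = t
            pow_ += (w11 * c11 + w22 * c22) / N          # weighted agreement cells
            p1_ = (w11 * c11 + w21 * c21) / N            # weighted row marginals
            p2_ = (w12 * c12 + w22 * c22) / N
            p_1 = (w11 * c11 + w12 * c12) / N            # weighted column marginals
            p_2 = (w21 * c21 + w22 * c22) / N
            pew += p1_ * p_1 + p2_ * p_2
        return (pow_ - pew) / (1 - pew)

    professors = [[1, 1], [7, 19]]
    students = [[500, 1], [2, 25]]
    print(round(grouped_weighted_kappa([professors, students], W), 3))   # 0.423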

ANOVA Binary ICC

The intraclass correlation coefficient (ICC) is appropriate when two or more raters rate the same interview participants for some item of interest. This reliability statistic measures the consistency of coding between the two coders from an analysis of variance model modified for binary response variables by Elston (1977). More specifically, this version of the ICC is based on within mean squares and between mean squares of the ratings (Shrout & Fleiss, 1979). From the writings of Shrout and Fleiss (1979), the currently available ANOVA Binary ICC that is appropriate for the current data set is based on what they refer to as ICC(3,1). ICC(3,1) assumes that the raters are fixed, that is, the same raters are utilized to code multiple sets of data. The statistic ICC(2,1), which assumes the coders are randomly selected from a larger population of raters (Shrout & Fleiss, 1979), is recommended for use but is not currently available for binomial response data.

The traditional formulae for the within and between mean squares, along with an adjusted sample size estimate, are provided in Table 11. In these formulae, Yi denotes the frequency of agreements (both coders indicate a theme is present, or both coders indicate a theme is not present) between coders for the ith group or category, k denotes the total number of groups or categories, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories. The ANOVA binary ICC equals

    rho_AOV = (MSB - MSW) / [MSB + (n0 - 1) MSW]     (Elston, 1977; Ridout, Demetrio, & Firth, 1999).

Table 11
Formulae and Estimates from Professor and Student Interview Participants for Calculating the ANOVA Binary ICC

Description of Statistic       Formula                                       Statistic
Within Mean Squares (MSW)      [sum Yi - sum(Yi^2/ni)] / (N - k)             (545 - 536.303)/(556 - 2) = 0.0157
Between Mean Squares (MSB)     [sum(Yi^2/ni) - (sum Yi)^2/N] / (k - 1)       (536.303 - 545^2/556)/(2 - 1) = 2.0854
Adjusted Sample Size (n0)      [N - sum(ni^2)/N] / (k - 1)                   [556 - (528^2 + 28^2)/556]/(2 - 1) = 54.5827

Note: sum Yi denotes the total number of cases where both coders indicate that a theme either is or is not present in a given response.

The within and between mean squares equal 0.0157 and 2.0854, respectively, and the adjusted sample size equals 54.5827. Given that k = 2 and N = 556, the ANOVA binary ICC equals (2.0854 - 0.0157)/[2.0854 + (54.5827 - 1)(0.0157)] = 0.714, which denotes the consistency of coding between the two coders on the professor and student interview responses.
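The mean squares and adjusted sample size in Table 11 follow directly from the counts of agreements in each group. The Python sketch below applies the formulas as written (Yi = agreements per group, ni = group size). The value it returns, about 0.713, and the adjusted sample size it computes, about 53.2, differ slightly from the 0.714 and 54.5827 printed above, presumably because of rounding or transcription in the original; the sketch presents the formulas as written rather than resolving that discrepancy.

    # One-way ANOVA binary ICC (Elston, 1977) computed from the number of
    # coder agreements per group; a sketch of the Table 11 formulas.

    def anova_binary_icc(Y, n):
        k, N = len(Y), sum(n)
        sum_Y = sum(Y)
        sum_Y2_over_n = sum(y * y / m for y, m in zip(Y, n))
        msw = (sum_Y - sum_Y2_over_n) / (N - k)             # within mean squares
        msb = (sum_Y2_over_n - sum_Y ** 2 / N) / (k - 1)    # between mean squares
        n0 = (N - sum(m * m for m in n) / N) / (k - 1)      # adjusted sample size
        return (msb - msw) / (msb + (n0 - 1) * msw)

    Y = [20, 525]    # agreements: professors (1 + 19), students (500 + 25)
    n = [28, 528]    # group sizes
    print(round(anova_binary_icc(Y, n), 3))   # about 0.713; the paper reports 0.714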

Kuder-Richardson 20

In their landmark article, Kuder and Richardson (1937) presented the derivation of the KR-20 statistic. This estimator is a function of the sample size, the summation of item variances, and the total variance. Crocker and Algina (1986) present examples of the calculation of the KR-20 in their Table 7.1 (pp. 136-140). In that example, the correlation between the two split-halves is presented, but it is not indicated that this statistic is the Pearson correlation, and these authors do not appear to discuss the distributional requirements of the data in relation to the calculation of the correlation r_ii. Kuder and Richardson (1937) present formulae for the calculation of the total variance and r_ii that are not mutually exclusive, possibly due to the paper's time of development in relation to the infancy of mathematical statistics; this lack of exclusiveness has led to some confusion in, and incorrect calculations of, the KR-20. Lord and Novick (1968) indicated that this statistic is equal to coefficient alpha (continuous) under certain circumstances, and Crocker and Algina (1986) elaborated on this statement by indicating "This formula is identical to coefficient alpha with the substitution of piqi for the item variances" (p. 139). This is problematic because that treatment assumes the two random variables are continuous, when in actuality they are discrete. If the underlying distribution of the data is binomial, the KR-20 should be based on discrete expectations; otherwise, the KR-20 will be a function of a total variance based on a continuous level of measurement, and the correlation, as well as the KR-20, may be notably to substantially inflated.

For continuous data, E(X) is the integral of x f(x) and E(X^2) is the integral of x^2 f(x), where f(x) denotes the probability density function (pdf); the variance equals the second moment minus the square of the first moment, E(X^2) - [E(X)]^2, so that E(X) = mu and E(X^2) - [E(X)]^2 = sigma^2 (Hogg, McKean, & Craig, 2004; Ross, 1998). All estimators should be based on the level of measurement appropriate for the distribution. The total variance (sigma_T^2) for coder agreement patterns equals the summation of the main and off diagonals of a variance-covariance matrix. The variance-covariance matrix takes the general form of a k x k matrix with the variances sigma_i^2 on the diagonal and the covariances rho_ij sigma_i sigma_j off the diagonal (Kim & Timm, 2007). For a coding scheme comprised of two raters, this reduces to

    Sigma = [ sigma_1^2             rho_12 sigma_1 sigma_2 ]
            [ rho_21 sigma_2 sigma_1        sigma_2^2      ]

and the total variance for coder agreement patterns equals sigma_T^2 = sigma_1^2 + sigma_2^2 + 2 COV(X1, X2) = sigma_1^2 + sigma_2^2 + 2 rho_12 sigma_1 sigma_2 (Stapleton, 1995). Not only must this substitution be made for the numerator variances; the denominator variances must also be adjusted in the same manner. Otherwise, the correlation, and with it the KR-20, may be notably underestimated if the incorrect distribution is assumed. An appropriate formula for the KR-20 based on coder agreement patterns is

    KR-20 = [N/(N - 1)] [1 - (1/sigma_T^2) sum_{i=1}^{k} (Yi/ni)(1 - Yi/ni)],

where k denotes the total number of groups or categories, Yi denotes the number of agreements between coders for the ith group or category, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories (Lord & Novick, 1968).

For discrete data, E(X) = sum x_i Prob(X = x_i) and E(X^2) = sum x_i^2 Prob(X = x_i), where Prob(X = x_i) denotes the pdf for discrete data (Hogg & Craig, 1995). For the binomial pdf, E(X) = n p and E(X^2) - [E(X)]^2 = n p(1 - p). If a discrete distribution cannot be assumed or is unknown, it is most appropriate to use distribution-free expectation methods; only basic algebra is needed to solve for E(X) and E(X^2). In that scenario, bootstrap methods for calculating these expectations should not be utilized, because the standard error can be inflated (Efron & Tibshirani, 1993). It is also important to note that if the underlying distribution is discrete, methods assuming continuity will inflate standard errors and reduce the accuracy of statistical inference (Bartoszynski & Niewiadomska-Bugaj, 1996). For data that take the form of either the presence or absence of a theme, which clearly have a discrete distribution, the correlation should be based on distributions suitable for this type of data. In this paper, the correlation rho_12 for agreement patterns between the coders is therefore the Kendall tau-c correlation (Bonett & Wright, 2000; Hettmansperger & McKean, 1998), which can be readily estimated using the PROC CORR procedure in the statistical software package SAS.

Estimates for calculating the KR-20 based on coder agreement patterns for the professor and student interview groups are provided in Table 12. The individual variances for the professor and student interview responses on the theme of interest equal 0.816 and 0.023, respectively. Scoring agreed responses 1 and non-agreed responses 2, the Kendall tau-c correlation between the professor and undergraduate student interview groups equals 0.881, and the covariance between the groups equals 0.881(0.816)^(1/2)(0.023)^(1/2) = 0.121. The total variance then equals 0.816 + 0.023 + 2(0.121) = 1.081. The final component of the KR-20 formula is the proportion of agreement times one minus this proportion, that is, p_i(1 - p_i), for each of the groups: 0.714(1 - 0.714) = 0.204 for the professor group and 0.994(1 - 0.994) = 0.006 for the student group, for a sum of 0.210.

Table 12
Estimates from Professor and Student Interview Participants for Calculating the KR-20

Estimate                    Professor Group                         Student Group
Individual Variances        0.816                                   0.023
Kendall tau-c Correlation   0.881
Covariance                  0.881(0.816)^(1/2)(0.023)^(1/2) = 0.121
Total Variance              0.816 + 0.023 + 2(0.121) = 1.081
p_i(1 - p_i)                0.714(1 - 0.714) = 0.204                0.994(1 - 0.994) = 0.006
Sum of p_i(1 - p_i)         0.210

The KR-20 reliability estimate thus equals [556/(556 - 1)][1 - 0.210/1.081] = 0.807.

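Given the group-level quantities in Table 12, the KR-20 formula above is a one-liner. The Python sketch below takes the individual variances and the Kendall tau-c correlation as given inputs (in practice they would be estimated from the coded data, for example with SAS PROC CORR) and reproduces the 0.807 estimate; the function name is ours.

    import math

    # KR-20 for binomial coder-agreement data, using the discrete total variance
    # sigma_T^2 = s1 + s2 + 2 * rho * sqrt(s1 * s2). Inputs follow Table 12.

    def kr20(Y, n, variances, tau_c):
        N = sum(n)
        s1, s2 = variances
        total_var = s1 + s2 + 2 * tau_c * math.sqrt(s1) * math.sqrt(s2)
        p_terms = sum((y / m) * (1 - y / m) for y, m in zip(Y, n))   # p_i(1 - p_i)
        return (N / (N - 1)) * (1 - p_terms / total_var)

    Y, n = [20, 525], [28, 528]        # agreements and group sizes
    print(round(kr20(Y, n, variances=(0.816, 0.023), tau_c=0.881), 3))   # 0.807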
Discussion

This paper presented four quantitative methods for gauging the interrater reliability of qualitative findings that follow a binomial distribution (theme is present, theme is absent). The κ statistic is a measure of the proportion of observed agreement beyond the expected agreement between two or more coders. The κW statistic has the same interpretation as the kappa statistic, but permits differential weights on the cell frequencies reflecting patterns of coder agreement. The ANOVA (binary) ICC measures the degree to which two or more ratings are consistent. The KR-20 statistic is a reliability estimator based on a ratio of variances.

The resulting question from these findings is "Are these reliability estimates sufficient?" The answer depends upon the focus of the study, the complexity of the theme(s) under investigation, and the comfort level of the researcher; the more complicated the topic being investigated, the lower the agreement between the coders. Some researchers have developed tools for interpreting reliability coefficients, but they do not provide guidelines for determining the sufficiency of such statistics. According to Landis and Koch (1977), coefficients of 0.41-0.60 demonstrate 'Moderate,' 0.61-0.80 'Substantial,' and 0.81-1.00 'Almost Perfect' agreement. George and Mallery (2003) indicate that reliability coefficients of 0.9-1.0 are "Excellent," 0.8-0.9 are "Good," 0.7-0.8 are "Acceptable," 0.6-0.7 are "Questionable," 0.5-0.6 are "Poor," and less than 0.5 are "Unacceptable." According to Cascio (1991), Nunnally (1978), and Schmitt (1996), reliabilities of at least 0.70 are typically sufficient for use, although higher values may be a researcher's target.

According to these guidelines, the obtained κ of 0.891 demonstrates 'Almost Perfect' to 'Good' agreement between the coders, the obtained κW of 0.423 demonstrates 'Fair' to 'Unacceptable' agreement between the coders, the obtained ANOVA ICC of 0.714 demonstrates 'Substantial' to 'Acceptable' agreement between the coders, and the obtained KR-20 of 0.807 demonstrates 'Substantial' to 'Good' agreement between the coders. The κ, ANOVA ICC, and KR-20 statistics meet the 0.70 cutoff, demonstrating acceptable reliability. That being said, it is important to note that the reliability of binomial coding patterns is invalid if based on continuous agreement statistics (Maclure & Willett, 1987).

What happens if the researcher has an acceptable level of reliability in mind, but the data do not meet that requirement? What methods should be employed in this situation? If a desired reliability coefficient is not achieved, it is recommended that the coders revisit their coding decisions on patterns of disagreement on the presence of themes in the binomial data (e.g., interview data). After the coders revisit their coding decisions, the reliability coefficient would be re-estimated, and this process would be recursive until the desired reliability coefficient is achieved. Although this process may seem tedious, the confidence with which the coders identified themes increases and thus improves the interpretability of the data.

Future Research

Three areas of research are recommended for furthering the use of reliability estimators for discrete coding patterns of binomial responses (e.g., interview data).

1. In the current paper, estimators of agreement-pattern reliability within a theme were presented. Quality reliability estimators applicable across themes should be further developed and investigated. This would allow researchers to determine the reliability of one's grounded theory, as opposed to a component of the theory.

2. Sample size estimation methods also should be further developed for reliability estimators, but these are presently limited to estimators such as coefficient alpha and the κ statistic (Bonett, 2002; Feldt & Ankenmann, 1998). Sample size estimation would inform the researcher, for example, as to how many interviews should be conducted in order to achieve a desired reliability coefficient on their coded qualitative interview data with a certain likelihood, prior to the initiation of data collection.

3. The current study used coder agreement data that follow a binomial probability density function. Further investigation should be conducted to determine whether there are more appropriate discrete distributions to model agreement patterns; possible densities include the negative binomial, beta-binomial, geometric, and Poisson distributions. This development could lead to better estimators of reliability coefficients (e.g., for the investigation of 'rare' events).

References

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3), 597-606.
Bartoszynski, R., & Niewiadomska-Bugaj, M. (1996). Probability and statistical inference. New York, NY: John Wiley.
Benaquisto, L. (2008). Axial coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 51-52). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Coding frame. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 88-89). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Open coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 581-582). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Selective coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2). Thousand Oaks, CA: Sage.

Bonett, D. G. (2002). Sample size requirements for testing and estimating coefficient alpha. Journal of Educational and Behavioral Statistics, 27, 335-340.
Bonett, D. G., & Wright, T. A. (2000). Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika, 65, 23-28.
Brodsky, A. E. (2008). Researcher as instrument. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, p. 766). Thousand Oaks, CA: Sage.
Burla, L., Knierim, B., Barth, J., Liewald, K., Duetz, M., & Abel, T. (2008). From text to codings: Intercoder reliability assessment in qualitative content analysis. Nursing Research, 57, 113-117.
Cascio, W. F. (1991). Applied psychology in personnel management (4th ed.). Englewood Cliffs, NJ: Prentice Hall.
Cheek, J. (2008). Funding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 360-363). Thousand Oaks, CA: Sage.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Coryn, C. L. S. (2007). The holy trinity of methodological rigor: A skeptical view. Journal of MultiDisciplinary Evaluation, 4(7), 26-31.
Creswell, J. W. (2007). Qualitative inquiry and research design: Choosing among five approaches (2nd ed.). Thousand Oaks, CA: Sage.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Holt, Rinehart and Winston.
Davis, D. (2008). Replication. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 754-755). Thousand Oaks, CA: Sage.
Dillon, W. R., & Mulani, N. (1984). A probabilistic latent class model for assessing inter-judge reliability. Multivariate Behavioral Research, 19, 438-458.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall/CRC.
Elston, R. C. (1977). Query: Estimating "heritability" of a dichotomous trait. Biometrics, 33, 231-236.
Everitt, B. S. (1968). Moments of the statistics kappa and weighted kappa. The British Journal of Mathematical and Statistical Psychology, 21, 97-103.
Feldt, L. S., & Ankenmann, R. D. (1998). Appropriate sample size for comparing alpha reliabilities. Applied Psychological Measurement, 22, 170-178.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323-327.
George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference, 11.0 update (4th ed.). Boston, MA: Allyn & Bacon.
Given, L. M., & Saumure, K. (2008). Trustworthiness. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 895-896). Thousand Oaks, CA: Sage.
Golafshani, N. (2003). Understanding reliability and validity in qualitative research. The Qualitative Report, 8(4), 597-607.
Greene, J. C. (2007). Mixed methods in social inquiry. San Francisco, CA: Jossey-Bass.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley.
Hettmansperger, T. P., & McKean, J. W. (1998). Robust nonparametric statistical methods (Kendall's Library of Statistics 5). London: Arnold.
Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical statistics (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Hogg, R. V., McKean, J. W., & Craig, A. T. (2004). Introduction to mathematical statistics (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston, MA: Allyn and Bacon.
Jensen, D. (2008). Confirmability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, p. 112). Thousand Oaks, CA: Sage.
Jensen, D. (2008). Credibility. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 138-139). Thousand Oaks, CA: Sage.
Jensen, D. (2008). Dependability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 208-209). Thousand Oaks, CA: Sage.
Karlin, S., Cameron, E. C., & Williams, P. T. (1981). Sibling and parent-offspring correlation estimation with variable family size. Proceedings of the National Academy of Sciences, 78, 2664-2668.
Kim, K., & Timm, N. (2007). Univariate and multivariate general linear models: Theory and applications with SAS (2nd ed.). Chapman & Hall/CRC.
Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Newbury Park, CA: Sage.
Lipsitz, S. R., Laird, N. M., & Brennan, T. A. (1994). Simple moment estimates of the kappa coefficient and its variance. Applied Statistics, 43, 309-323.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126, 161-169.
Mak, T. K. (1988). Analysing intraclass correlation for dichotomous variables. Applied Statistics, 37, 344-352.
Marshall, C., & Rossman, G. B. (2006). Designing qualitative research (4th ed.). Thousand Oaks, CA: Sage.
Maxwell, A. E. (1977). Coefficients of agreement between observers and their interpretation. British Journal of Psychiatry, 130, 79-83.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Thousand Oaks, CA: Sage.
Miller, P. (2008). Reliability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 753-754). Thousand Oaks, CA: Sage.
Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification strategies for establishing reliability and validity in qualitative research. International Journal of Qualitative Methods, 1(2), 13-22.
Nelder, J. A., & Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika, 74, 221-232.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw-Hill.
Paley, J. (2008). Positivism. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 646-650). Thousand Oaks, CA: Sage.
Ridout, M. S., Demetrio, C. G. B., & Firth, D. (1999). Estimating intraclass correlation for binary data. Biometrics, 55, 137-148.
Ross, S. (1998). A first course in probability (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Rozeboom, W. W. (1966). Foundations of the theory of prediction. Homewood, IL: Dorsey.
Saumure, K., & Given, L. M. (2008). Rigor in qualitative research. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 795-796). Thousand Oaks, CA: Sage.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350-353.
Seale, C. (1999). Quality in qualitative research. Qualitative Inquiry, 5(4), 465-478.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Smith, D. M. (1983). Algorithm AS189: Maximum likelihood estimation of the parameters of the beta binomial distribution. Applied Statistics, 32, 196-204.
Soeken, K. L., & Prescott, P. A. (1986). Issues in the use of kappa to estimate reliability. Medical Care, 24, 733-741.
Stapleton, J. H. (1995). Linear statistical models. New York, NY: John Wiley & Sons.
Stenbacka, C. (2001). Qualitative research requires quality concepts of its own. Management Decision, 39(7), 551-555.
Tamura, R. N., & Young, S. S. (1987). A stabilized moment estimator for the beta-binomial distribution. Biometrics, 43, 813-824.
van den Hoonaard, W. C. (2008). Inter- and intracoder reliability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 445-446). Thousand Oaks, CA: Sage.
Yamamoto, E., & Yanagimoto, T. (1992). Moment estimators for the binomial distribution. Journal of Applied Statistics, 19, 273-283.

Journal of Consumer Psychology, 10(1&2), 71-73. Copyright 2001, Lawrence Erlbaum Associates, Inc. From Dawn Iacobucci (Ed.), Journal of Consumer Psychology's Special Issue on Methodological and Statistical Concerns of the Experimental Behavioral Researcher.

IV. Interrater Reliability

Contributors: Professor Kent Grayson, London Business School; Professor Roland Rust, Vanderbilt University.

Researcher: What is the best way to assess reliability in content analysis? Is percentage agreement between judges best? Or, stated in a slightly different manner by another researcher: There are several tests that give indexes of rater agreement for nominal data and some other tests or coefficients that give indexes of interrater reliability for metric scale data. For my data based on metric scales, I have established interrater reliability using the intraclass correlation coefficient, but I also want to look at interrater agreement (for two raters). What appropriate test is there for this? I have hunted around but cannot find anything. I have thought that a simple percentage of agreement (i.e., a 1-point difference using a 10-point scale is 10% disagreement), adjusted for the amount of variance for each question, may be suitable. The one that I have been using lately is "Krippendorff's alpha" (Krippendorff, 1980), which he described in his chapter on reliability; I use it because the math seems intuitive, roughly based on the observed and expected logic that underlies chi-square.

Editor: Simply using percentage of agreement between judges is not so good, because some agreement is sure to occur, if only by chance. Kolbe and Burnett (1991) offered a nice (and pretty damning) critique of the quality of content analysis in consumer research, and they highlight a number of criticisms of percentage agreement as a basis for judging the quality of content analysis. The basic concern is that percentages do not take into account the likelihood of chance agreement between raters. Chance is likely to inflate agreement percentages in all cases, but especially with two coders and low degrees of freedom on each coding choice (i.e., few coding categories). For example, if Coder A and Coder B have to decide yes-no whether a coding unit has property X, then mere chance will have them agreeing at least 50% of the time: in a 2 x 2 table with codes randomly distributed, there would be 25% in each cell and thus 50% of the scores already along the diagonal, which would represent spurious apparent agreement. The fewer the number of categories, the more random agreement is likely to occur, thus making the reliability appear better than it really is. This random agreement was the conceptual basis of Cohen's kappa (Cohen, 1960), and several scholars have offered statistics that try to correct for chance agreement.

Hughes and Garrett (1990) outlined a number of different options (including Krippendorff's, 1980, measure) and then offered their own solution based on a generalizability theory approach. Recent work in psychometrics (Cooil & Rust, 1994, 1995; Rust & Cooil, 1994) took a "proportional reduction in loss" approach and provided a general framework for reliability indexes for quantitative and qualitative data; the criterion considers the expected loss to a researcher from wrong decisions. Proportional reduction of loss (PRL) serves as a general criterion of reliability that subsumes both the categorical case and the metric case, and it turns out to include some popularly used methods (e.g., generalizability theory: Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Cronbach's alpha: Cronbach, 1951; and Perreault and Leigh's, 1989, measure) as special cases. One popular measure is not a PRL measure and has other undesirable properties (e.g., overconservatism and, under some conditions, an inability to reach 1.0).

For the discussion that follows, imagine a concrete example. Perhaps two experts have been asked to code whether 60 print advertisements are "emotional imagery," "rational informative," or "mixed ambiguous" in appeal. The ratings table that follows depicts the coding scheme, with the rows and columns representing the codes assigned to the advertisements by the two independent judges.
not so good becausesome agreementis sureto occur.The one thatI havebeen using lately is advertisementsby the two independentjudges.1980). larmeasurethatis not a PRLmeasureandhas otherbadprop- count the likelihood of chance agreementbetween raters. . Forexample.INTERRATERRELIABILITY Alternatively.25% in each cell.the more research. 1994)haveset forththeconceptofproportional age of agreement(i. even if thereis perfectagreement). Researcher. generalizability theory approach. 10 (1&2). Cronbach's alpha: Cronbach. theconceptualbasisofCohen's kappa(Cohen.nl +. then mere chancewill have themagree.I use it becausethe math thatRaterI judgedas rationalthatRater2 thoughtemotional. which ratingstablethatfollows depictsthe coding scheme with the would representspuriousapparentagreement). few codingcategories).Theyhighlighta numberof criticisms. Whatappropriatetest is therefor this? I have huntedaround Recent work in psychometrics(Cooil & Rust.having servedand expected logic thatunderlieschi-square. example..

In terms of an index suited for general care only about the diagonal entries in the matrix). Fleiss. 1985). 38-39) also criticized the use of the Ir ? (1.1980. dependingon the rational n21 n22 n23 marginaldistributions).)They stated.in essence a test of the signifi- cance of the reliabilityindex: ago. cell would be e22 = (n2+/n++)(n+2/n++)(n++)(just as you In conclusion. 1981.. errorfor the index. Cohen (1960. Iris set to zero. Hughes and Garrett(1990) and Kolbe and Burett (1991).g. To be fair. row properties and proposed extensions (e. kappa. ability and validity from coding judgments using a latent ample. . Brennan & Prediger. chi-squaretest of associationin this application. Cohen emotional n.72 ASSESSMENT RELIABILITY all ads would fall into the nl. thatthe upperboundfor kappacan be less than 1. 1n22.or n33cells.60 soned throughexpectedlevels of chanceagreementin a way thatdidnotdependon themarginalfrequencies. The termpc is the propor.Cohen the numberof codingcategoriesas definedpreviously.what might n+2 n+3 serve as a superiorindex? Perreaultand Leigh (1989) rea- sums e. 37-38) between interrateragreementand item reliabilityas pursuitsof evalu- Pa\Pa ating the quality of data. Tanner& Young. ences between coders due to mean or variance. requirementof agreementis more stringent. respectively. 143): appreciatethe inadequacyof this solution"(p.0.only 10 years form a confidence interval. Hughes and Garrett(1990) used generalizabilitytheory.Thus... coding conditions. with zeros off thanproportionsandanequationforanapproximatestandard the main diagonal. and so on. p. 1980. The prob- lem to which he speaks is thatthis index does not correctfor - |IIr(l-Ir) .stimuli.) tion of agreementsone would expect by chance.Perreaultand Leigh's (1989) index (discussed also provided his equation in terms of frequencies rather earlier)would seem to fit manyresearchcircumstances(e.g. Hubert. (Hughes & Garrettalso criticized the use of wherepais the proportionof agreedonjudgments(in ourex.That is. Cohen endorsement. (rationalcode) 2) class approach. suggestthatthe issues andsolutionsstill havenotperme- ated the social sciences. pp."Ittakesrelativelylittle in the way of sophisticationto also providean estimatedstandarderror(p. \ ( C) c-\ ) provementon the simple (he says "primitive")computation of percentageagreement." drawing the analogy (pp. His index was intendedas an im.g. we interraterreliability.. Pa = (nll + n22 + n33)/n++). column n n If Cohen's(1960) kappahas some problems. Kaye.which is prevalentin marketing. n. agree- mentrequiresall non-zerofrequenciesto be along the diago.Theydefined Cohen (1960) proposed K. these two statisticsmay be used in conjunctionto mary index of interraterreliability. 1977.pc = (el + Ubersax (1988) attemptedto simultaneouslyestimate reli- e22 + e33)/n++). coefficient alphafor ratingscales andinterrateragree- ment for categorical coding).. 38). 141): agreement.where eii = (ni+/n++)(n+i/n++)(n++).becausethe Rust and Cooil (1994) is anotherlevel of achievement. (Recallthatc is numberof units being coded: (nll + n22+ n33)/n++.which is basedon a random-ef- K=_(Pa -Pc) fects analysis of a variance-like modeling approach to (1-pc) aportionvariance due to rater.becausewhentheconditionholdsthatIrx they reviewedas relyingon percentageagreementas the pri- n++> 5. O SI = the fact that there will be some agreementsimply due to n++ chance. 
    Ir = { [pa - (1/c)] [c/(c - 1)] }^(1/2)     when pa >= 1/c;

if pa < (1/c), Ir is set to zero. (Recall that c is the number of coding categories as defined previously and pa is the proportion of observed agreement.) They also provide an estimated standard error (p. 143),

    sI = [ Ir(1 - Ir) / n++ ]^(1/2),

and these two statistics may be used in conjunction to form a confidence interval: Ir plus or minus (1.96)sI.

In conclusion, in terms of an index suited for general use, Perreault and Leigh's (1989) index (discussed earlier) would seem to fit many research circumstances (e.g., coding conditions, stimuli). Furthermore, it appears sufficiently straightforward that one could compute the index without a mathematically induced coronary. Rust and Cooil (1994) is another level of achievement, extending Perreault and Leigh's (1989) index to the situation of three or more raters and creating a framework that subsumes quantitative and qualitative indexes of reliability (e.g., coefficient alpha for rating scales and interrater agreement for categorical coding), drawing the analogy (pp. 37-38) between interrater agreement and item reliability as pursuits of evaluating the quality of data. Whatever index one chooses, perhaps we can at least agree to finally banish the simple percentage agreement as an acceptable index of interrater reliability.
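For readers who want to try Perreault and Leigh's index on their own data, the short Python sketch below implements Ir and its approximate confidence interval exactly as defined above. The advertisement counts used here are hypothetical, made up only for illustration; the original exchange does not report the full 3 x 3 table.

    import math

    # Perreault and Leigh's (1989) reliability index I_r with its approximate
    # 95% confidence interval, as defined in the text above. The counts below
    # are hypothetical; only the structure (two judges, c = 3 categories,
    # 60 ads) mirrors the example discussed in the exchange.

    def perreault_leigh(table):
        c = len(table)                                        # number of coding categories
        n_total = sum(sum(row) for row in table)
        p_a = sum(table[i][i] for i in range(c)) / n_total    # proportion of agreement
        i_r = 0.0 if p_a < 1.0 / c else math.sqrt((p_a - 1.0 / c) * c / (c - 1))
        se = math.sqrt(i_r * (1 - i_r) / n_total)
        return i_r, (i_r - 1.96 * se, i_r + 1.96 * se)

    hypothetical = [[18, 2, 1],     # rows = Rater 2, columns = Rater 1
                    [3, 20, 2],
                    [1, 2, 11]]     # 60 print ads in total
    print(perreault_leigh(hypothetical))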

REFERENCES

Brennan, R. L., & Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cooil, B., & Rust, R. T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59, 203-216.
Cooil, B., & Rust, R. T. (1995). General estimators for the reliability of qualitative data. Psychometrika, 60, 199-220.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Hubert, L. (1977). Kappa revisited. Psychological Bulletin, 84, 289-297.
Hughes, M. A., & Garrett, D. E. (1990). Intercoder reliability estimation approaches in marketing: A generalizability theory framework for quantitative data. Journal of Marketing Research, 27, 185-195.
Kaye, K. (1980). Estimating false alarms and missed events from interobserver agreement: A rationale. Psychological Bulletin, 88, 458-468.
Kolbe, R. H., & Burnett, M. S. (1991). Content-analysis research: An examination of applications with directives for improving research reliability and objectivity. Journal of Consumer Research, 18, 243-250.
Kraemer, H. C. (1980). Extension of the kappa coefficient. Biometrics, 36, 207-216.
Krippendorff, K. (1980). Content analysis: An introduction to its methodology. Newbury Park, CA: Sage.
Perreault, W. D., Jr., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.
Rust, R. T., & Cooil, B. (1994). Reliability measures for qualitative data: Theory and implications. Journal of Marketing Research, 31, 1-14.
Tanner, M. A., & Young, M. A. (1985). Modeling agreement among raters. Journal of the American Statistical Association, 80(389), 175-180.
Ubersax, J. S. (1988). Validity inferences from inter-observer agreement. Psychological Bulletin, 104, 405-416.

Literature Review of Inter-rater Reliability

Inter-rater reliability, simply defined, is the extent to which the information being collected is being collected in a consistent manner (Keyton, et al., 2004; Neuendorf, 2002). That is, is the information-collecting mechanism, together with the procedures being used to collect the information, solid enough that the same results can repeatedly be obtained? This should not be left to chance. Statistical measures are used to measure inter-rater reliability in order to provide a logistical proof that the similar answers collected are more than simple chance (Krippendorf, 2004a, 2004b; Neuendorf, 2002).

Inter-rater reliability also alerts project managers to problems that may occur in the research process (Capwell, 1997; Keyton, et al., 2004). These problems include poorly executed coding procedures in qualitative surveys/interviews (such as a poor coding scheme, inadequate coder training, coder fatigue, mistakes on the part of those recording answers, or the presence of a rogue coder, all examined in a later section of this literature review) as well as problems regarding poor survey/interview administration (facilitators rushing the process, the presence of a rogue administrator) or design (see the Survey Methods, Interview/Re-Interview Methods, or Interview/Re-Interview Design literature reviews). From the potential problems listed here alone, it is evident that measuring inter-rater reliability is important in the interview and re-interview process. Having a good measure of the inter-rater reliability rate (combined with solid survey/interview construction procedures) allows project managers to state with confidence that the information they have collected is dependable.

Preparing qualitative/open-ended data for inter-rater reliability checks

If closed data was not collected for the survey/interview, then the data will have to be coded before it is analyzed for inter-rater reliability. Even if closed-ended data was collected, coding may still be important, because in many cases closed-ended data has a large number of possible values; it would be difficult, for instance, to determine inter-rater reliability for information such as birthdates. A common consideration, then, requires answers to be placed into yes or no paradigms as a simple data coding mechanism for determining inter-rater reliability. Instead of recording birthdates, it can be determined whether the two data collections netted the same result. If so, then YES can be recorded for each respective survey. If not, then YES should be recorded for one survey and NO for the other (do not enter NO for both, as that would indicate agreement).

While placing qualitative data into a YES/NO priority (Green, 2004) could be a working method for the information collected in the ConQIR Consortium, given the high likelihood that interview data will match, the forced categorical separation is not considered to be the best available practice and could prove faulty in accepting or rejecting hypotheses (or for applying analyzed data toward other functions). It should, however, be sufficient in evaluating whether reliable survey data is being obtained for agency use. For best results, the survey design should be created with reliability checks in mind, employing either a YES/NO choice option (this is different from what is reviewed above; a YES/NO option would include questions like "Were you born before July 13, 1979?" where the participant would have to answer yes or no) or a likert-scale type mechanism. See the Interview/Re-Interview Design literature review for more details.

How to compute inter-rater reliability

Fortunately, computing inter-rater reliability is a relatively easy process involving a simple mathematical formula based on a complicated statistical proof (Keyton, et al., 2004). For closed-ended surveys or interviews where participants are forced to choose one choice, the collected data is immediately ready for inter-rater checks (although quantitative checks often produce lower reliability scores, especially when the likert scale is used) (Friedman, et al., 2003). In the case of qualitative studies, where survey or interview questions are open-ended, some sort of coding scheme will need to be put into place before using this formula (Friedman, et al., 2003; Keyton, et al., 2004). To compute inter-rater reliability in quantitative studies (where closed-answer question data is collected using a likert scale, a series of options, or yes/no answers), follow these steps to determine Cohen's kappa (1960), a statistical measure of inter-rater reliability:

1. Arrange the responses from the two different surveys/interviews into a contingency table. This means you will create a table that demonstrates, essentially, how many of the answers agreed and how many answers disagreed (and how much they disagreed, even). For example, if two different survey/interview administrators asked ten yes or no questions, their answers would first be laid out and observed:

Question Number    1   2   3   4   5   6   7   8   9   10
Interviewer #1     Y   N   Y   N   Y   Y   Y   Y   Y   Y
Interviewer #2     Y   N   Y   N   Y   Y   Y   N   Y   N

From this data, a contingency table would be created:

                            RATER #1 (Going across)
RATER #2 (Going down)        YES      NO
YES                           6        0
NO                            2        2

Notice that the number six (6) is entered in the first column because, when looking at the answers, there were six times when both interviewers found a YES answer to the same question. Accordingly, these are placed where the two YES answers overlap in the table (with the YES going across the top of the table representing Rater/Interviewer #1 and the YES going down the left side of the table representing Rater/Interviewer #2). A zero (0) is entered in the second column of the first row because for that particular intersection in the table there were no occurrences (that is, Interviewer/Rater #1 never found a NO answer when Interviewer/Rater #2 found a YES). The number two (2) is entered in the first column of the second row since Interviewer/Rater #1 found a YES answer two times when Interviewer/Rater #2 found a NO, and a two (2) is entered in the second column of the second row because both Interviewer/Rater #1 and Interviewer/Rater #2 found NO answers to the same question two different times.

NOTE: It is important to consider that the above table is for a YES/NO type survey. If a different number of answers are available for the questions in a survey, then the number of answers should be taken into consideration in creating the table. For instance, if a five-answer likert scale were used in a survey/interview, then the table would have five rows and five columns (and all answers would be placed into the table accordingly).

2. Sum the row and column totals for the items. To find the sum for the first row in the previous example, the number six would be added to the number zero for a first row total of six; the number two would be added to the number two for a second row total of four. Then the columns would be added: the first column would find six being added to two for a total of eight, and the second column would find zero being added to two for a total of two.

3. Add the respective sums from step two together. In the running example, six (first row total) would be added to four (second row total) for a row total of ten (10), and eight (first column total) would be added to two (second column total) for a column total of ten (10). At this point, it can be determined whether the data has been entered and computed correctly by whether or not the row total matches the column total. In the case of this example, the data seems to be in order since both the row and column totals equal ten.

4. Add all of the agreement cells from the contingency table together. For the running example, this would lead to six being added to two for a total of eight, because there were six times where the YES answers matched from both interviewers/raters (as designated by the first column in the first row) and two times where the NO answers matched from both interviewers/raters (as designated by the second column in the second row). The sum of agreement, then, would be eight (8). The agreement cells will always appear in a diagonal pattern across the chart, so, for instance, if participants had five possibilities for answers, then there would be five cells going across and down the chart in a diagonal pattern to be added.

So. if participants had five possibilities for answers then there should be five cells going across and down the chart in a diagonal pattern that will be added. since it does not take into account the probability that some of these agreements in answers could have been by chance. for instance. if more answers were possible. In the case of this example. 5. however. first the cell containing the number six would be located (since it is the first agreement cell located in row one column one) and the column and row totals would be multiplied by each other (these were found in step two) and then divided by the total: 6 x 8 =48 48/10=4. then the step would be repeated as many times as there are answers. Divide this by the total number possible for all answers (this is the row/column total from step three). for this example. this is the final computation that needs to be made in this step for the sample problem.8.8. for instance. that would lead to eight being divided by ten for a result of 0. NOTE: At this point simple agreement can be computed by dividing the answer in step four by the answer in step five. always appear in a diagonal pattern across the chart – so. Since this is the final cell in the diagonal. To do this. This number would be rejected by many researchers. Compute the expected frequency for each of the agreement cells appearing in the diagonal pattern going across the chart. find the row total for the first agreement cell (row one column one) and multiply that by the column total for the same cell. For a five answer likert scale. The next diagonal cell (one over to the right and one down) is the next row to be computed: 2 x 4=8 8/10=0.8. however. That is why the rest of the steps must be followed to determine a more accurate assessment of inter-rater reliability. .

6. Add all of the expected frequencies found in step five together. For the example used in this literature review, that would be 4.8 + 0.8 for a sum of 5.6. This represents the expected frequency of agreement by chance. For a five-answer likert scale, all five of the totals found in step five would be added together.

7. Compute kappa. To do this, take the answer from step four and subtract the answer from step six, and place the result of that computation aside. Then take the total number of items from the survey/interview and subtract the answer from step six. After this has been completed, take the first computation from this step (the one that was set aside) and divide it by the second computation from this step. The resulting computation represents kappa. For the running example that has been provided in this literature review, it would look like this: 8 - 5.6 = 2.4; 10 - 5.6 = 4.4; 2.4/4.4 = 0.545.

8. Determine whether the reliability rate is satisfactory. If kappa is at 0.7 or higher, then the inter-rater reliability rate is generally considered satisfactory (CITE). If not, then the data is often rejected.
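The eight steps above map directly onto a few lines of code. The following Python sketch (function names are ours) reproduces the worked example: simple agreement of 0.8 and a kappa of roughly 0.545 for the two interviewers' ten yes/no answers.

    # Cohen's kappa for the ten-question yes/no example worked through above.

    def cohens_kappa(answers_1, answers_2, categories=("Y", "N")):
        n = len(answers_1)
        # Step 1: contingency table of the two raters' answers.
        table = {(a, b): 0 for a in categories for b in categories}
        for a, b in zip(answers_1, answers_2):
            table[(a, b)] += 1
        # Steps 2-4: observed agreement (sum of the diagonal cells).
        observed = sum(table[(c, c)] for c in categories)
        # Steps 5-6: expected agreement by chance from the marginal totals.
        expected = sum(
            (sum(table[(c, b)] for b in categories) *      # rater 1's total for category c
             sum(table[(a, c)] for a in categories)) / n   # rater 2's total for category c
            for c in categories
        )
        # Step 7: kappa.
        return (observed - expected) / (n - expected)

    rater1 = list("YNYNYYYYYY")
    rater2 = list("YNYNYYYNYN")
    print(round(cohens_kappa(rater1, rater2), 3))   # approximately 0.545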

What to do if inter-rater reliability is not at an appropriate level

Unfortunately, if inter-rater reliability is not at the appropriate level (generally 0.7 or higher), then it is often recommended that the data be thrown out (Krippendorf, 2004a). In cases such as these, it is often wise to administer an additional data collection so a third set of information can be compared to the other collected data (and calculated against both in order to determine whether an acceptable inter-rater reliability level was achieved with either of the previous data collecting attempts). If many cases of inter-rater issues are occurring, then the data from these cases can often be observed in order to determine what the problem may be (Keyton et al., 2004). If the data have been prepared for inter-rater checks from qualitative collection measures, the coding scheme used to prepare the data may be examined. It may also be helpful to check with the person who coded the data to make sure they understood the coding procedure (Keyton et al., 2004). This inquiry can also include questions about whether they became fatigued during the coding process (those coding large sets of information often tend to make more mistakes) and whether or not they agree with the process selected for coding (Keyton et al., 2004). In some cases a rogue coder may be the culprit for the failure to achieve inter-rater reliability (Neuendorf, 2002); rogue coders are coders who disapprove of the methods used for analyzing the data and who assert their own coding paradigms. Facilitators of projects may also be to blame for low inter-rater reliability, for instance if they have rushed the process (causing rushed and hasty coding), required one individual to code a large amount of information (leading to fatigue), or tampered with the data (Keyton et al., 2004).

References

Capwell, A. (1997). Chick flicks: An analysis of self-disclosure in friendships. Cleveland, OH: Cleveland State University.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Green, B. (2004). Personal construct psychology and content analysis. Personal Construct Theory and Practice, 1, 82-91.

Keyton, J., King, T., Mabachi, N. M., Manning, J., Leonard, L. L., & Schill, D. (2004). Content analysis procedure book. Lawrence, KS: University of Kansas.

Krippendorf, K. (2004a). Content analysis: An introduction to its methodology. Thousand Oaks, CA: Sage.

Krippendorf, K. (2004b). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30, 411-433.

Manning, J., et al. (2003). Analysis of ethnographic interview research procedures in communication studies: Prevailing norms and exciting innovations. Paper presented at the National Communication Association convention, Miami, FL.

Neuendorf, K. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.


The Qualitative Report Volume 10 Number 3 September 2005 439-462
http://www.nova.edu/ssss/QR/QR10-3/marques.pdf

The Application of Interrater Reliability as a Solidification Instrument in a Phenomenological Study

Joan F. Marques, Woodbury University, Burbank, California
Chester McCall, Pepperdine University, Malibu, California

Interrater reliability has thus far not been a common application in phenomenological studies. However, once the suggestion was brought up by a team of supervising professors during the preliminary orals of a phenomenological study, the utilization of this verification tool turned out to be vital to the credibility level of this type of inquiry, particularly because in a phenomenological study the researcher is perceived as the main instrument and bias may, hence, be difficult to eliminate. With creativeness and the appropriate calculation approach, the researcher of the here reviewed qualitative study managed to apply this verification tool and found that the establishment of interrater reliability served as a great solidification of the research findings.

Key Words: Phenomenology, Interrater Reliability, Qualitative Study, Research Findings, Applicability, Bias Reduction, and Study Solidification

Introduction

This paper intends to serve as support for the assertion that interrater reliability should not merely be limited to being a verification tool for quantitative research, but that it should be applied as a solidification strategy in qualitative analysis as well. This should be applied particularly in a phenomenological study, where the researcher is considered the main instrument and where, for that reason, the elimination of bias may be more difficult than in other study types.

A "verification tool," as interrater reliability is often referred to in quantitative studies, is generally perceived as a means of verifying coherence in the understanding of a certain topic. The term "solidification strategy," as referred to in this case of a qualitative study, reaches even further: not just as a means of verifying coherence in understanding, but at the same time a method of strengthening the findings of the entire qualitative study.

The following provides clarification of the distinction between using interrater reliability as a verification tool in quantitative studies and using this test as a solidification tool in qualitative studies. Quantitative studies, which are traditionally regarded as more scientifically based than qualitative studies, mainly apply interrater reliability as a percentage-based agreement in findings that are usually fairly straightforward in their interpretability. The interraters in a quantitative study are not necessarily required to engage deeply into the material in order to obtain an understanding of the study's findings for rating purposes, due to the predominantly numerical-based nature of the quantitative findings.

The findings of a quantitative study are usually obvious, easily understandable among the interraters, and require only a brief review from the interraters in order to state their interpretations. The entire process can be a very concise and insignificant one. However, in a qualitative study the findings are usually not represented in plain numbers; they first need to be interpreted. Applying interrater reliability in such a study requires the interraters to engage in attentive reading of the material, while at the same time the interraters are expected to display a similar or basic understanding of the topic. The use of interrater reliability in these studies is therefore advocated as more than just a verification tool. Because qualitative studies are thus far not unanimously considered scientifically sophisticated (this type of study is regarded as less scientific and its findings are perceived in a more imponderable light), it is seen more as a solidification tool that can contribute to the quality of these types of studies and the level of seriousness with which they will be considered in the future.

As explained earlier, the researcher is usually considered the instrument in a qualitative study. As the "instrument" in the study, the researcher can easily fall into the trap of having his or her bias influence the study's findings. This may happen even though the study guidelines assume that he or she will dispose of all preconceived opinions before immersing himself or herself into the research. It is exactly for this reason that involving independent interraters, who have no prior connection with the study, in the analysis of the obtained data will provide substantiation of the "instrument" and significantly reduce the chance of bias influencing the outcome. By using interrater reliability as a solidification tool, the interraters could become true validators of the findings of the qualitative study, thereby elevating the level of believability and generalizability of the outcomes of this type of study.

As a clarification to the above, the word "generalizability" is defined as the degree to which the findings can be generalized from the study sample to the entire population. Regarding the "generalizability" enhancement, Myers (2000) asserts, "Despite the many positive aspects of qualitative research, [these] studies continue to be criticized for their lack of objectivity and generalizability" (¶ 9). Myers continues, "The goal of a study may be to focus on a selected contemporary phenomenon […] where in-depth descriptions would be an essential component of the process" (¶ 9). This author subsequently suggests that, "in such situations, small qualitative studies can gain a more personal understanding of the phenomenon and the results can potentially contribute valuable knowledge to the community" (¶ 9). It was exactly for this reason, the potential contribution of valuable knowledge to the community, that the researcher mentioned the elevation of generalizability in qualitative studies through the application of interrater reliability as a solidification and thus bias-reducing tool.

Before immersing into specifics, it might be appropriate to explain that there are two main prerequisites considered when applying interrater reliability to qualitative research: (1) the data to be reviewed by the interraters should only be a segment of the total amount, since data in qualitative studies are usually rather substantial and interraters usually only have limited time, and (2) it needs to be understood that there may be different configurations in the packaging of the themes, as listed by the various interraters, so that the researcher will need to review the context in which these themes are listed in order to determine their correspondence (Armstrong, Gosling, Weinman, & Marteau, 1997). It may also be important to emphasize here that most definitions and explanations about the use of interrater reliability to date are mainly applicable to the quantitative field, which suggests that the application of this solidification strategy in the qualitative area still needs significant review and subsequent formulation regarding its possible applicability.

This paper will first explain the two main terms to be used, namely "interrater reliability" and "phenomenology," after which the application of interrater reliability in a phenomenological study will be discussed. The phenomenological study that will be used for analysis in this paper is one that was conducted to establish a broadly acceptable definition of spirituality in the workplace. In this study the researcher interviewed six selected participants in order to obtain a listing of the vital themes of spirituality in the workplace.

This process was executed as follows: First, the researcher formulated the criteria which each participant should meet. Subsequently, she identified the participants. The six participants were selected through a snowball sampling process: two participants referred two other participants, who each referred to yet another eligible person. The researcher interviewed each participant in a similar way, using an interview protocol that was validated on its content by two recognized authors on the research topic, Drs. Ian Mitroff and Judi Neal.

• Ian Mitroff is "distinguished professor of business policy and founder of the USC Center for Crisis Management at the Marshall School of Business," University of Southern California, Los Angeles (Ian I. Mitroff, 2005, ¶ 2). He has published "over two hundred and fifty articles and twenty-one books of which his most recent are Smart Thinking for Difficult Times: The Art of Making Wise Decisions, A Spiritual Audit of Corporate America, and Managing Crises Before They Happen" (Ian I. Mitroff, 2005, ¶ 10-11).
• Judi Neal is the founder of the Association for Spirit at Work and the author of several books and "numerous academic journal articles on spirituality in the workplace" (Association for Spirit at Work, 2005, ¶ 1). She has also established her authority in the field of spirituality in the workplace in her position as "executive director of The Center for Spirit at Work at the University of New Haven, […] a membership organization and clearinghouse that supports personal and organizational transformation through coaching, education, research, and publications" (School of Business at the University of New Haven, 2005, ¶ 4).

After transcribing the six interviews, the researcher developed a horizonalization table in which all six answers to each question were listed horizontally. She subsequently eliminated redundancies in the answers and clustered the themes that emerged from this process, which in phenomenological terms is referred to as "phenomenological reduction."

This, then, is how the "themes" emerged. The process was fairly easy, as the majority of questions in the interview protocol were worded in such a way that they solicited enumerations of topical phenomena from the participants. To clarify this with an example, one of the questions was "What are some words that you consider to be crucial to a spiritual workplace?" This question solicited a listing of words that the participants considered identifiable with a spiritual workplace. From six listings of words, received from six participants, it was relatively uncomplicated to distinguish overlapping words and eliminate them. Hence, phenomenological reduction is much easier to execute for these types of answers than for answers provided in essay form. To provide the reader with even more clarification regarding the question formulations, the interview protocol that was used in this study is included as an appendix (see Appendix A).

Interrater Reliability

Interrater reliability is the extent to which two or more individuals (coders or raters) agree: "Interrater reliability addresses the consistency of the implementation of a rating system" (Colorado State University, 1997, ¶ 1). Interrater reliability is dependent upon the ability of two or more individuals to be consistent, and training, education, and monitoring skills can enhance interrater reliability. The CSU on-line site further clarifies interrater reliability as follows:

A test of interrater reliability would be the following scenario: Two or more researchers are observing a high school classroom. The class is discussing a movie that they have just viewed as a group. The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the student's oral responses. Interrater reliability assesses the consistency of how the rating system is implemented. For example, if one researcher gives a "1" to a student response, while another researcher gives a "5," obviously the interrater reliability would be inconsistent. (¶ 2)

Tashakkori and Teddlie (1998) refer to this type of reliability as "interjudge" or "interobserver" reliability, describing it as the degree to which ratings of two or more raters, or observations of two or more observers, are consistent with each other. According to these authors, interrater reliability can be determined by calculating the correlation between a set of ratings done by two raters ranking an attribute in a group of individuals. Tashakkori and Teddlie continue, "for qualitative observations, interrater reliability is determined by evaluating the degree of agreement of two observers observing the same phenomena in the same setting" (p. 85).

Although widely used in quantitative analyses, this verification strategy has been practically barred from qualitative studies since the 1980's because "a number of leading qualitative researchers argued that reliability and validity were terms pertaining to the quantitative paradigm and were not pertinent to qualitative inquiry" (Morse, Mayan, Olson, Barrett, & Spiers, 2002, p. 1). A variety of new criteria were introduced for the assurance of credibility in these research types instead. In the past several years interrater reliability has therefore rarely been used as a verification tool in qualitative studies; according to Morse et al. (2002), this was particularly the case in the United States. The main argument against using verification tools with the stringency of interrater reliability in qualitative research has, so far, been that "expecting another researcher to have the same insights from a limited data base is unrealistic" (Armstrong et al., 1997, p. 4). Many of the researchers that oppose the use of interrater reliability in qualitative analysis argue that it is practically impossible to obtain consistency in qualitative data analysis because "a qualitative account cannot be held to represent the social world; rather it 'evokes' it," which means, presumably, that different researchers would offer different evocations (Armstrong et al., 1997, p. 7), and it can therefore not be considered a verification strategy.

On the other hand, there are qualitative researchers who maintain that responsibility for reliability and validity should be reclaimed in qualitative studies, through the implementation of verification strategies that are integral and self-correcting during the conduct of inquiry itself (Morse et al., 2002). Refraining from doing so leaves room for assumptions "that qualitative research must therefore be unreliable and invalid, lacking in rigor, and unscientific" (Morse et al., 2002, p. 598). These researchers claim that the currently used verification tools for qualitative research are more of an evaluative (post hoc) than of a constructive (during the process) nature (Morse et al., 2002). These investigators further explain that post-hoc evaluation does "little to identify the quality of [research] decisions, the rationale behind those decisions, or the responsiveness and sensitivity of the investigator to data" (Morse et al., 2002, p. 598). The above-mentioned researchers emphasize that the currently used post-hoc procedures may very well evaluate rigor but do not ensure it (Morse et al., 2002).

As suggested on the Colorado State University (CSU) website (1997), interrater reliability should preferably be established outside of the context of the measurement in your study; this source claims that interrater reliability should preferably be executed as a side study or pilot study. The suggestion of executing interrater reliability as a side study corresponds with the above-presented perspective from Morse et al. (2002) that verification tools should not be executed post hoc, but constructively during the execution of the study. The concerns addressed by Morse et al. (2002) about verification tools in qualitative research being more of an evaluative (post hoc) than of a constructive (during the process) nature can thus be omitted by utilizing interrater reliability as it was applied to this study: as a "side study" at a critical point during the execution of the main study, right after the initial attainment of themes by the researcher yet before formulating conclusions based on the themes registered. This method of verifying the study's findings represents a constructive way (during the process) of measuring the consistency in the interpretation of the findings rather than an evaluative (post hoc) way. It therefore avoids the problem of concluding insufficient consistency in the interpretations after the study has been completed, and it leaves room for the researcher to further substantiate the study before it is too late. The results from establishing interrater reliability as a "side study" at a critical point during the execution of the main study will enable the researcher, in case of insufficient consistency between the interraters, to perform some additional research in order to obtain greater consensus. The substantiation can happen in various ways: for instance, this might be done by seeking additional study participants and adding their answers to the material to be reviewed, performing a new cycle of phenomenological reduction, or resubmitting the package of text to the interraters for another round of theme listing.
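As a brief illustration of the quantitative usage described above by Tashakkori and Teddlie (1998), the following Python sketch computes the correlation between two raters' scores for a small group of individuals. It is an editorial example: the ratings are invented and the helper function is not drawn from any of the cited sources.

# Hypothetical example: two raters score the same ten individuals on a 1-5 attribute.
rater_1 = [4, 3, 5, 2, 4, 1, 3, 5, 2, 4]
rater_2 = [5, 3, 4, 2, 4, 2, 3, 5, 1, 4]

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(f"interrater correlation r = {pearson_r(rater_1, rater_2):.2f}")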

In the opinion of the researcher of this study, using interrater reliability as a "pilot study," the second option suggested by CSU, would mainly establish consistency in the understandability of the instrument, or a part thereof; in this case such would be the interview protocol to be used in the research. The researcher perceives no difference between this interpretation of interrater reliability and the content validation here applied to the interview protocol by Mitroff and Neal, since there would not be any findings to be evaluated at that time. The researcher further questions the value of such a measurement without the additional review of study findings. For this reason, then, the researcher decided that interrater reliability in this qualitative study would deliver optimal value if performed on critical parts of the study findings, and the side-study approach is what was implemented in the here reviewed case.

Phenomenology

A phenomenological study entails the research of a phenomenon by obtaining authorities' verbal descriptions based on their perceptions of this phenomenon, aiming to find common themes or elements that comprise the phenomenon. Phenomenology is regarded as one of the frequently used traditions in qualitative studies. According to Creswell (1998), a phenomenological study describes the meaning of the lived experiences of several individuals about a concept or phenomenon. Creswell (1998) describes the procedure that is followed in a phenomenological approach as one undertaken

in a natural setting where the researcher is an instrument of data collection who gathers words or pictures, analyzes them inductively, focuses on the meaning of participants, and describes a process that is expressive and persuasive in language. (p. 14)

Blodgett-McDeavitt (1997) presents the following definition:

Phenomenology is a research design used to study deep human experience. Not used to create new judgments or find new theories, phenomenology reduces rich descriptions of human experience to underlying, common themes, resulting in a short description in which every word accurately depicts the phenomenon as experienced by co-researchers. The study is intended to discover and describe the elements (texture) and the underlying factors (structure) that comprise the experience of the researched phenomenon. (¶ 10)

Creswell suggests that for a phenomenological study the process of collecting information should involve primarily in-depth interviews with as many as 10 individuals: "Dukes recommends studying 3 to 10 subjects, and the Riemen study included 10. The important point is to describe the meaning of a small number of individuals who have experienced the phenomenon" (p. 122). Given these recommendations, the researcher of the phenomenological study described here chose to interview a number of participants between 3 and 10 and ended up with the voluntary choice of 6.

Like all qualitative studies, the researcher who engages in the phenomenological approach should realize that "phenomenology is an influential and complex philosophic tradition" (Van Manen, 2002a, ¶ 1) as well as "a human science method" (Van Manen, 2002b, ¶ 1) which "draws on many types and sources of meaning" (Van Manen, 2002a, ¶ 2). Creswell (1998) presents the procedure in a phenomenological study as follows:

1. The researcher begins the study with a full description of his or her own experience of the phenomenon (p. 147).
2. The researcher then finds statements (in the interviews) about how individuals are experiencing the topic, lists out these significant statements (horizonalization of the data), treats each statement as having equal worth, and works to develop a list of nonrepetitive, nonoverlapping statements (p. 147).
3. These statements are then grouped into "meaning units": the researcher lists these units, and he or she writes a description of the "textures" (textural description) of the experience, that is, what happened, including verbatim examples (p. 150).
4. The researcher next reflects on his or her own description and uses imaginative variation or structural description, seeking all possible meanings and divergent perspectives, varying the frames of reference about the phenomenon, and constructing a description of how the phenomenon was experienced (p. 150).
5. The researcher then constructs an overall description of the meaning and the essence of the experience (p. 150).
6. This process is followed first for the researcher's account of the experience and then for that of each participant. After this, a "composite" description is written (p. 150).

Based on the above-presented explanations and their subsequent incorporation in a study on workplace spirituality, the researcher developed the following model (Figure 1), which may serve as an example of a possible phenomenological process with incorporation of interrater reliability as a constructive solidification tool.

[Figure 1. Research process in picture. The figure traces the design from the six interviewees (A through F) through the horizonalization table, phenomenological reduction, and meaning clusters to the emergent themes, which were checked by Interrater 1 and Interrater 2. The emergent themes were then organized into internal, external, integrated external/internal, leadership-imposed, and employee-imposed aspects, leading to the meaning of the phenomenon, textural and structural descriptions, possible structural meanings of the experience, underlying themes and contexts, precipitating factors, invariant themes, a definition of spirituality in the workplace, implications of the findings, and recommendations for individuals and organizations.]

In the here-discussed phenomenological study, which aimed to establish a broadly acceptable definition of spirituality in the workplace and therefore sought to obtain vital themes that would be applicable in such a work environment, the researcher considered the application of interrater reliability most appropriate at the time when the phenomenological reduction was completed. The meaning clusters also had been formed. Since the most important research findings would be derived from the emergent themes, this seemed to be the most crucial as well as the most applicable part for soliciting interrater reliability. However, the researcher did not submit any pre-classified information to the interraters, but instead provided them the entirety of raw transcribed data with highlights of 3 topical questions from which common themes needed to be derived. In other words, the researcher first performed phenomenological reduction, concluded which questions provided the largest numbers of theme listings, and then submitted the raw version of the answers to these questions to the interraters to find out whether they would come up with a decent amount of similar theme findings. This process will be explained in more detail later in the paper.

Blodgett-McDeavitt (1997) cites one of the prominent researchers in phenomenology, Moustakas, in a presentation of the four main steps of phenomenological processes: epoche, reduction, imaginative variation, and synthesis of composite textural and composite structural descriptions.

The way Moustakas's steps can be considered to correspond with the earlier presented procedure, as formulated by Creswell, is that "epoche" (the process of bracketing the researcher's previous knowledge of the topic) happens when the researcher describes his or her own experiences of the phenomenon and thereby symbolically "empties" his or her mind (see Creswell step 1); "reduction" occurs when the researcher finds nonrepetitive, nonoverlapping statements and groups them into meaning units (Creswell steps 2 and 3); "imaginative variation" takes place when the researcher engages in reflection (Creswell step 4); and "synthesis" is applied when the researcher constructs an overall description and formulates his or her own accounts as well as those of the participants (Creswell steps 5 and 6). Elaborating on the interpretation of epoche, Blodgett-McDeavitt (1997) explains:

Bracketing biases is stressed in qualitative research as a whole, but the study of and mastery of epoche informs how the phenomenological researcher engages in life itself. Epoche clears the way for a researcher to comprehend new insights into human experience. A researcher experienced in phenomenological processes becomes able to see data from a new, naive perspective from which fuller, richer, more authentic descriptions may be rendered. (p. 3)

Although epoche may be considered an effective way for the experienced phenomenologist to empty him or herself and subsequently see the obtained data from a naive perspective, the chance is that bias is still very present for the less experienced investigator. The inclusion of interrater reliability as a bias reduction tool could therefore lead to significant quality enhancement of the study's findings (as will be discussed below).

Using Interrater Reliability in a Phenomenological Study

Interrater reliability has thus far not been a common application in phenomenological studies. Because of the uncommonness of using this verification strategy in a qualitative study, especially a phenomenology where the researcher is highly involved in the formulation of the research findings, it was fairly difficult to determine the applicability and positioning of this tool in the study. However, once the suggestion was brought up by a team of supervising professors, the utilization of this constructive verification tool, applied here to the vital themes in a spiritual workplace, emerged into an interesting challenge and, at the same time, required a high level of creativeness from the researcher in charge.

The first step for the researcher in this study was to find a workable definition for this verification tool. After in-depth source reviews, the researcher concluded that there was no established consistency to date in defining interrater reliability, nor in the proper methods of applying it, since the appropriateness of its outcome depends on the purpose it is used for. It was even more complicated to formulate the appropriate approach to calculating this rate, since there were various ways possible for computing it. It was rather obvious that the application of this solidification strategy toward the typical massive amount of descriptive data of a phenomenology would have to differ significantly from the way this tool was generally used in quantitative analysis, where kappa coefficients are the common way to go.

This suggests that there is still intense review and formulation needed in order to determine the applicability of interrater reliability in qualitative analyses, and that every researcher who takes on the challenge of applying this solidification strategy in his or her qualitative study will therefore be a pioneer.

The first step for the researcher of this phenomenological study was attempting to find the appropriate degree of coherence that should exist in the establishment of interrater reliability. It was the intention of the researcher to use a generally agreed upon percentage, if existing, as a guideline in her study. Sources consulted included Isaac and Michael's (1997) Handbook in Research and Evaluation, Tashakkori and Teddlie's (1998) Mixed Methodology, and McMillan and Schumacher's (2001) Research in Education, as well as Proquest's extensive article and paper database, its digital dissertations file, and other common search engines such as "Google." However, after assessing multiple electronic (online) and written sources regarding the application of interrater reliability in various research disciplines, the researcher did not succeed in finding a consistent percentage for use of this solidification strategy.

Isaac and Michael (1997) illuminate this by stating that "there are various ways of calculating interrater reliability, and that different levels of determining the reliability coefficient take account of different sources of error" (p. 134). McMillan and Schumacher (2001) elaborate on the inconsistency issue by explaining that researchers often ask how high a correlation should be for it to indicate satisfactory reliability; they conclude that this question is not answered easily. According to them, it depends on the type of instrument (personality questionnaires generally have lower reliability than achievement tests), the purpose of the study (whether it is exploratory research or research that leads to important decisions), and whether groups or individuals are affected by the results (since action affecting individuals requires a higher correlation than action affecting groups).

Aside from the above-presented statements about the divergence in opinions with regard to the appropriate correlation coefficient to be used, it is also a fact that most or all of these discussions pertain to the quantitative field. This researcher therefore presents the following illustrations of the observed basic inconsistency, as she perceived it throughout a variety of studies, which were not necessarily qualitative in nature.

1. Mott, Etsler, and Drumgold (2003) presented the following reasoning for the interrater reliability findings in their study, Applying an Analytic Writing Rubric to Children's Hypermedia "Narratives," a comparative examination of the technical qualities of a pen and paper writing assessment for elementary students' hypermedia-created products: Pearson correlations averaged across 10 pairs of raters found acceptable interrater reliability for four of the five subscales (setting, character, theme, plot, and communication). For the four acceptable subscales, the r values were .49, .50, .55, and .59 (Mott, Etsler, & Drumgold, 2003).

2. Srebnik, Russo, Uehara, Comtois, Smukler, and Snowden (2002) approach interrater reliability in their study on the psychometric properties and utility of the Problem Severity Summary for adults with serious mental illness as follows: "Interrater reliability: A priori, we interpreted the intraclass correlations in the following manner: .60 or greater, strong; .40 to .59, moderate; and less than .40, weak" (¶ 15).

3. Butler and Strayer (1998) assert the following in their online-presented research document, administered by Stanford University and titled The Many Faces of Empathy: "Acceptable interrater reliability was established across both dialogues and monologues for all of the verbal behaviors coded." The Pearson correlations between the two independent raters for the individual variables, among them average intimacy of disclosure, focused empathy, and shared affect, were all significant at p < .05, with r values up to .94 (¶ 1).

The researcher found that in the phenomenological studies she reviewed through the Proquest digital dissertation database, interrater reliability had not been applied. She did, however, find a master's thesis from Trinity Western University that briefly mentioned the issue of using reliability in a phenomenological study by stating:

Phenomenological research must concern itself with reliability for its results to have applied meaning. Specifically, reliability is concerned with the ability of objective, outside persons to classify meaning units with the appropriate primary themes. A high degree of agreement between two independent judges will indicate a high level of reliability in classifying the categories. Generally, a level of 80 percent agreement indicates an acceptable level of reliability. (Graham, 2001, p. 66)

Graham (2001) then states that "the percent agreement between researcher and the student [the external judge] was 78 percent" (p. 67). However, in the explanation afterwards it becomes apparent that this percentage was not obtained by comparing the findings from two independent judges aside from the researcher, but by comparing the findings of the researcher to those of one external rater, whereas the citation Graham used as a guideline in his thesis referred to the agreement between two independent judges and not to the agreement between one independent judge and the researcher. Considering the fact that the researcher in a phenomenological study always ends up with an abundance of themes on his or her list (since he or she manages the entirety of the data, while the external rater only reviews a limited part of the data), calculating a score as high as 78% should not be difficult to obtain, depending on the calculation method (as will be demonstrated later in this paper).

Through multiple reviews of accepted reliability rates in various studies, this researcher finally concluded that the acceptance rate for interrater reliability varies between 50% and 90%, depending on the considerations mentioned above in the citation of McMillan and Schumacher (2001). The researcher did not succeed in finding a fixed percentage for interrater reliability in general, and definitely not for phenomenological research. She therefore contacted the guiding committee of this study to agree upon a usable rate.

The guiding committee, in which there were quantitative as well as qualitative oriented authorities, came after thorough discussion to the conclusion that there were variable acceptable rates for interrater reliability in use. The team also considered the nature of the study and the multi-interpretability of the themes to be listed, and subsequently decided the following: Given the study type and the fact that the interraters would only review part of the data, it should be understood that a correspondence percentage higher than 66.7% between two external raters might be hard to attain. This correspondence percentage becomes even harder to achieve if one considers that there might also be such a high number of themes to be listed, even in the limited data provided, that one rater could list entirely different themes than the other, without necessarily having a different understanding of the text. The guiding committee for this particular research therefore agreed, at the time of the suggestion for applying this solidification tool, upon an acceptable interrater reliability of two thirds, or 66.7%. The researcher of the here-discussed phenomenological study on spirituality in the workplace also learned that the application of this solidification tool in qualitative studies has been a subject of ongoing discussion (without resolution) in recent years, which may explain the lack of information and consistent guidelines currently available.

The researcher subsequently performed the following measuring procedure:

1. The data gained for the purpose of this study were first transcribed and saved. The study was done by obtaining a listing of the vital themes applicable to a spiritual workplace and consisted of interviews taken with a pre-validated interview protocol from 6 participants.

2. For this procedure, the transcribed raw data were presented to two pre-identified interraters. The interraters were both university professors and administrators, each with an interest in spirituality in the workplace and, expectedly, with a fairly compatible level of comprehensive ability. These individuals were approached by the researcher and, after their approval for participation, separately visited for an instructional session. The interraters, although acquainted with each other, were not aware of each other's assignment as an interrater; the researcher chose this option to guarantee maximal individual interpretation and eliminate mutual influence. Each interrater was thoroughly instructed with regard to the philosophy behind being an interrater, as well as with regard to the vitality of detecting themes that were common (either through direct wording or interpretative formulation by the 6 participants). Since one of the essential procedures in phenomenology is to find common themes in participants' statements, the researcher made sure to select those questions that solicited a listing of words and phrases from the participants, so as to provide the interraters with a fairly clear and obvious overview of possible themes to choose from. The interraters were thus presented with the request to list all the common themes they could detect from the answers to three particular interview questions. During this session, the researcher handed each interrater a form she had developed, in which the interrater could list the themes he found when reviewing the 6 answers to each of the three selected questions.

Each interrater thus had to produce three lists of common themes: one for each highlighted topical question.

3. The highlighted questions in each of the six interviews were: (1) What are some words that you consider to be crucial to a spiritual workplace? (2) If a worker was operating at his or her highest level of spiritual awareness, what would he or she actually do? and (3) If an organization is consciously attempting to nurture spirituality in the workplace, what will be present? One reason for selecting these particular responses was that the questions that preceded these answers asked for a listing of words from the interviewees, which could easily be translated into themes. Another important reason was that these were also the questions from which the researcher derived most of the themes she listed.

4. The interraters were asked to list the common themes per highlighted question on a form that the researcher developed for this purpose and enclosed in the data package. The researcher did not set a fixed number of themes for the interraters to come up with, but rather left it up to them to find as many themes as they considered vital in the text provided. The reason for refraining from limiting the interraters to a predetermined number of themes was that the researcher feared a restriction could prompt random choices by each interrater among a possible abundance of available themes, ultimately leading to entirely divergent lists and an unrealistic conclusion of low or no interrater reliability. The purpose of having the interraters list these common themes was to distinguish the level of coordinating interpretations between the findings of both interraters, as well as the level of coordinating interpretations between the interraters' findings and those of the researcher. The researcher did not share any of the classifications she had developed with the interraters, but had them list their themes individually instead in order to be able to compare their findings with hers.

5. After the forms were filled out and received from the interraters, the researcher compared their findings to each other and subsequently to her own. Since the researcher serves as the main instrument in a phenomenological study, and even more because this researcher first extracted themes from the entire interviews, her list was much more extensive than those of the interraters, who only reviewed answers to a selected number of questions. It may therefore not be very surprising that there was 100% agreement between the limited numbers of themes submitted by the interraters and the abundance of themes found by the researcher: all themes of interrater 1 and all themes of interrater 2 were included in the theme-list of the researcher. It is for this reason that the agreement between the researcher's findings and the interraters' findings was not used as a measuring scale in the determination of the interrater reliability percentage. A complication occurred when the researcher found that the interraters did not return an equal amount of common themes per question. This could happen because the researcher omitted setting a mandatory amount of themes to be submitted. Interrater reliability would be established, as recommended by the dissertation committee for this particular study, if at least 66.7% (2/3) agreement was found between interraters and between interraters' and researcher's findings. The computation methods that the researcher applied in this study will be explained further in this paper.

To clarify the researcher's considerations, a simple example would be if there were a total of 100 obvious themes to choose from and the researcher required the submission of only 15 themes per interrater. In that case there would be no guarantee which part of the 100 available themes each interrater would choose. It could very well be that interrater 1 would select the first 15 themes encountered, while interrater 2 would choose the last 15. If this were the case there would be zero percent interrater reliability, even though the interraters may have actually had a perfect common understanding of the topic. Based on this conviction, the researcher decided to simply ask each interrater to list as many common themes as he could find among the highlighted statements from the 6 participants. Dealing with the problem of establishing interrater reliability with an unequal amount of submissions from the interraters was thus another interesting challenge.

Before illustrating how the researcher calculated interrater reliability for this particular case, it may be useful to clarify that phenomenology is a very divergent and complicated study type, entailing various sub-disciplines and oftentimes described as "the study of essences, including the essence of perception and of consciousness" (Scott, 2002, ¶ 1). In his presentation of Merleau-Ponty's Phenomenology of Perception, Scott explains, "phenomenology is a method of describing the nature of our perceptual contact with the world. Phenomenology is concerned with providing a direct description of human experience" (¶ 1). This may clarify to the reader that the phenomenological researcher is aware that reality is a subjective phenomenon, interpretable in many different ways. It may also be appropriate to stress here that the researcher explained well in advance to the raters what the purpose of the study was, so there would be no confusion with regard to the understanding of what exactly were considered to be "themes." Finally, this researcher did not make any pre-judgments on the quality of the various calculation methods presented below, but merely utilized them on the basis of their perceived applicability to this study type.

Before discussing the calculation methods reviewed by this researcher, note the following information:

• Interrater 1 (I1) submitted a total of 13 detected themes for the selected questions.
• Interrater 2 (I2) submitted a total of 17 detected themes for the selected questions.
• The researcher listed a total of 27 detected themes for the selected questions.

Between both interraters there were 10 common themes found. The agreement was determined on two counts: (1) on the basis of exact listing, which was the case with 7 of these 10 themes, and (2) on the basis of similar interpretability, such as "giving to others" and "contributing," "encouraging" and "motivating," and "aesthetically pleasing workplace" and "beauty," of which the latter was mentioned in the context of a nice environment. The researcher color-coded the themes that corresponded between the two interraters (yellow) and subsequently color-coded the additional themes that she shared with either interrater (green for additional corresponding themes between the researcher and interrater 1, and blue for additional corresponding themes between the researcher and interrater 2). All of the corresponding themes between both interraters (the yellow category) were also on the list of the researcher and were therefore also colored yellow on her list.

The researcher came across various possible methods for calculating interrater reliability, described below.

Calculation Method 1

Various electronic sources, among which a website from Richmond University (n.d.), mention the percent agreement between two or more raters as the easy way to calculate interrater reliability. In this case, reliability would be calculated as (total # agreements) / (total # observations) x 100, and the outcome would be 20/30 x 100 = 66.7%, whereby 20 equals the added number of agreements from both interraters (10 + 10) and 30 equals the added number of observations from both interraters (13 + 17). The recommendation from Posner, Sampson, Ward, and Cheney (1990), also a valid option for calculating interrater reliability, is R = number of agreements / (number of agreements + number of disagreements), which leads to the same outcome: 20 / (20 + 10) = 2/3 = 66.7%.

Various authors recommend the "confusion matrix," which is a standard classification matrix. A confusion matrix, according to Hamilton, Findlater, Sampson, Ward, Gurak, and Olive (2003), "contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix" (¶ 1). According to these authors, the meaning of the entries in the confusion matrix should be specified as they pertain to the context of the study. The confusion matrix that Hamilton et al. (2003) present is similar to the one displayed in Table 1, in which this researcher has specified the entries as recommended by these authors for the purpose of this study.

Table 1
Confusion Matrix 1

                              Interrater 1
                         Agree        Disagree
Interrater 2   Agree       a              b
               Disagree    c              d

In this study the following meanings are ascribed to the various entries: a is the number of agreeing themes that Interrater 1 listed in comparison with Interrater 2, b is the number of disagreeing themes that Interrater 1 listed in comparison with Interrater 2, c is the number of disagreeing themes that Interrater 2 listed in comparison with Interrater 1, and d is the total number of disagreeing themes that both interraters listed.
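For readers who want to reproduce the counts used in Calculation Method 1, the following Python sketch is an editorial illustration; the theme strings are placeholders (the actual theme lists are not reproduced in the paper), so only the counts 13, 17, and 10 are meaningful.

# Hypothetical placeholders: the paper reports 13 themes for interrater 1, 17 for
# interrater 2, and 10 themes judged common; the real theme wording is not shown here.
themes_i1 = {f"theme_{i}" for i in range(1, 14)}                                        # 13 themes
themes_i2 = {f"theme_{i}" for i in range(1, 11)} | {f"other_{i}" for i in range(1, 8)}  # 17 themes, 10 shared

common = themes_i1 & themes_i2                    # the 10 agreed-upon themes
agreements = 2 * len(common)                      # counted once per rater: 10 + 10 = 20
observations = len(themes_i1) + len(themes_i2)    # 13 + 17 = 30

percent_agreement = agreements / observations * 100
print(f"{percent_agreement:.1f}%")                # 66.7%, as in Calculation Method 1

Note that a plain set intersection only captures exact wording; the paper's additional "similar interpretability" matches (3 of the 10 common themes) would have to be mapped to a shared label before the lists are compared this way.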

The researcher of this study substituted the actual values pertaining to this particular study in the authors' equations, by inserting the values derived from the interrater reliability test for this study about spirituality in the workplace into the recommended columns and rows (Table 2), and came to some interesting findings. The interraters are referred to as R1 and R2.

Table 2
Confusion Matrix 2 with Substitution of Actual Values

                            R1
                     Agree    Disagree    Total
R2      Agree         10          3         13
        Disagree       7         10         17
        Total         17         13         30

1. The rate that these authors label as "the accuracy rate" (AC) is calculated as seen below (adopted from Hamilton et al., 2003, and modified toward the values used in this particular study):

AC = (a + d) / (a + b + c + d) = (10 + 10) / (10 + 3 + 7 + 10) = 20/30 = 66.7%

2. The rate that these authors label as "the true agreement rate" (the title of this rate has been modified by substituting the names of values applicable in this particular study) measures the proportion of the total number of findings from Interrater 1, the one with the lowest number of themes submitted, that are "accurate," where "accurate" means in agreement with the submissions of Interrater 2 (adopted from Hamilton et al., 2003, and modified toward the values used in this particular study). It is calculated as seen below:

TA = a / (a + b) = 10 / (10 + 3) = 10/13 = 76.9%

Dr. Brian Dyre (2003), associate professor of experimental psychology at the University of Idaho, also uses the confusion matrix for determining interrater reliability. Under the heading Establishing Reliable Measures for Non-Experimental Research, Dyre recommends the following computation, which is similar to the earlier discussed accuracy rate (AC) from Hamilton et al. (2003): interrater reliability = (number of agreeing themes + number of disagreeing themes) / (total number of observed themes) = (10 + 10) / 30 = 2/3 = 66.7%.
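The two rates above reduce to a few lines of arithmetic. The sketch below is an editorial illustration rather than code from the paper; it fills in the cells of Table 2 and reproduces both values.

# Confusion matrix entries as defined in Table 2 (R1 across the top, R2 down the side).
a = 10      # themes on which both interraters agree
b = 3       # non-corresponding themes from the rater with the fewer submissions (13 - 10)
c = 7       # non-corresponding themes from the rater with the more submissions (17 - 10)
d = b + c   # total disagreeing themes listed by both interraters = 10

accuracy_rate = (a + d) / (a + b + c + d)   # AC = 20/30
true_agreement = a / (a + b)                # TA = 10/13

print(f"AC = {accuracy_rate:.1%}")          # 66.7%
print(f"TA = {true_agreement:.1%}")         # 76.9%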

Calculation Method 2

Since the interraters did not submit an equal number of observations, the above-calculated rate of 66.7% can be disputed. Although the researcher did not manage to find any written source on which to base the following computation, she considered it logical that, in case of unequal submissions, the lowest submitted number of findings from similar data by any of two or more interraters used in a study should be used as the denominator in measuring the level of agreement. In such a case, interrater reliability would be: (number of common themes) / (lowest number of submissions) x 100 = 10/13 x 100% = 76.9%, which exceeds the minimally consented rate of 66.7%.

Rationale for this calculation: if the numbers of submissions by both interraters had varied even more, say 13 for interrater 1 versus 30 for interrater 2, interrater reliability would have been impossible to establish with the calculations presented under calculation method 1, even if all 13 themes submitted by interrater 1 were also on the list of interrater 2. The outcome would then be (13 + 13) / (30 + 13) = 26/43 = 60.5%, whereby 13 would be the number of agreements and 43 the total number of observations. This does not correspond at all with the logical conclusion that a total level of agreement from one interrater's list onto the other should equal 100%. If, therefore, the "rational" justification of calculation method 2 is accepted, then interrater reliability is 76.9%.

The researcher thought it to be interesting that the percentage of 76.9% between both interraters was also reached in the true agreement rate (TA) presented earlier by Hamilton et al. (2003); it is calculated using the equation TA = a / (a + b), whereby "a" equals the amount of corresponding themes between both interraters and "b" equals the amount of non-corresponding themes as submitted by the interrater with the lowest number of themes. As is general practice in interrater reliability measures, further comparison leads to the following findings: all 13 listed themes from interrater 1 (13/13 x 100% = 100%) were on the researcher's list, and 16 of the 17 themes on interrater 2's list (16/17 x 100% = 94.1%) were also on the list of the researcher.

Calculation Method 3

Elaborating on Hamilton et al.'s (2003) true agreement rate (TA), the researcher thought it to be interesting to examine the calculated outcomes in the case that the names of the two interraters would have been placed differently in the confusion matrix. To illustrate this, the confusion matrix is presented in Table 3 with the names of the interraters switched.

Table 3
Confusion Matrix with Names of Interraters Switched

                              Interrater 2
                         Agree        Disagree
Interrater 1   Agree       a              b
               Disagree    c              d

With this exchange, the outcome for TA changes significantly:

1. The rate that these authors label as "the accuracy rate" (AC) remains the same:

AC = (a + d) / (a + b + c + d) = (10 + 10) / (10 + 3 + 7 + 10) = 20/30 = 66.7%

2. "The true agreement rate" (title name substituted with names of values applicable in this study), which is the proportion of corresponding themes identified between both interraters, turns out differently, since the value substituted for "b" has now become the number of non-corresponding themes as submitted by the interrater with the highest number of themes:

TC = a / (a + b) = 10 / (10 + 7) = 10/17 = 58.8%

This new computation thus led to an unfavorable, but also unrealistic, interrater reliability rate of 58.8%. The "unrealistic" reference lies in the fact that the interrater reliability rate starts turning out extremely low as the submission numbers of the two interraters start differing to an increasing degree. In a case in which the numbers of submissions would diverge significantly (30 vs. 13), it does not even matter anymore whether the two interraters have full correspondence as far as the submissions of the lowest submitter go: the percentage of interrater reliability, which is supposed to reflect the common understanding of both interraters, will decrease to almost zero. As a reminder to the reader of the irrationality of using the highest number of submissions as the denominator may serve the example given under the "rationale" section for calculation method 2. On the one hand, TA, rationally calculated with the lowest number of submissions as the denominator (see calculation method 2), presented a rate of 76.9%, which was higher than the minimum requirement of 66.7%; on the other hand, the less logical process of exchanging the interraters' positions, so that the highest number of submissions is used as the common denominator instead of the lowest (see the first part of calculation method 3), delivered a percentage below the minimum requirement.

It is the researcher's opinion that this newly suggested "computation of moderation" between the two outcomes would lead to the following result for the true agreement reliability rate (TAR):

TAR = ((TA-1) + (TA-2)) / 2 = (76.9% + 58.8%) / 2 = 135.7% / 2 = 67.9%
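The competing rates from calculation methods 1, 2, and 3 can be compared side by side. The following Python sketch is an editorial illustration built only on the study's reported counts (13 and 17 submissions, 10 common themes); the function names are invented for clarity.

def percent_agreement(common, n1, n2):
    """Calculation method 1: agreements counted once per rater, over all observations."""
    return 2 * common / (n1 + n2)

def true_agreement_lowest(common, n1, n2):
    """Calculation method 2: common themes over the smaller submission count."""
    return common / min(n1, n2)

def true_agreement_highest(common, n1, n2):
    """The switched variant from calculation method 3 (judged unrealistic in the paper)."""
    return common / max(n1, n2)

def moderated_rate(common, n1, n2):
    """The suggested 'computation of moderation' (TAR): average of the two TA variants."""
    return (true_agreement_lowest(common, n1, n2) + true_agreement_highest(common, n1, n2)) / 2

n1, n2, common = 13, 17, 10
print(f"method 1 : {percent_agreement(common, n1, n2):.1%}")        # 66.7%
print(f"method 2 : {true_agreement_lowest(common, n1, n2):.1%}")    # 76.9%
print(f"switched : {true_agreement_highest(common, n1, n2):.1%}")   # 58.8%
print(f"TAR      : {moderated_rate(common, n1, n2):.1%}")           # 67.9%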

Recommendations

The researcher of this study has found that interraters in a phenomenological study, and presumably generally in qualitative studies, can very well select themes with a similar understanding of the essentials in the data. She also found, however, that there are three major attention points to address in order to enhance the success rate and swiftness of the process:

1. All interrater reliability calculation methods assume equal numbers of submissions by the interraters; in this study, the calculation of interrater reliability was complicated by the unequal numbers of submissions. In order to obtain results with similar depth, as well as an equal level of input from both interraters, the researcher should set standards in the number of observations to be listed by the interraters as well as the time allotted to them. If, for example, both interraters had been required to select 15 themes within an equal time span of, say, one week, the puzzle regarding the use of either the lowest or highest common denominator would be resolved, because there would be only one denominator. If, in this case, the interraters came up with 12 common themes out of 15, the interrater reliability rate could be easily calculated as 12/15 = .8 = 80%. Even in the case of only 10 common themes on a total required submission of 15, the rate would still meet the minimum requirement: 10/15 = .67 = 66.7% (see the sketch after this list). The officially recognized reliability rate of 66.7% for this study is therefore lower than it would have been if both interraters had been limited to a pre-specified number of themes. The fact that these confines were not specified to the interraters resulted in a diverged level of input: one interrater spent only two days in listing the words and came up with a total of 13 themes, while the other interrater spent approximately one week in preparing his list and consequently came up with a more detailed list of 17 themes. The data to be reviewed by the interraters should also be only a segment of the total amount, since data in qualitative studies are usually rather substantial and interraters usually only have limited time.

2. The solicited number of submissions from the interraters should be set as high as possible, especially if there is a multiplicity of themes to choose from. If the solicited number is kept too low, it may be that two raters have a perfectly similar understanding of the text yet submit different themes. The researcher will also need to understand that there are different configurations possible in the packaging of the themes as listed by the various interraters, so that he or she will need to review the context in which these themes are listed in order to determine their correspondence (Armstrong et al., 1997). In this paper the researcher gave examples of themes that could be considered similar, such as "giving to others" and "contributing," "encouraging" and "motivating," and "aesthetically pleasing workplace" and "beauty," of which the latter was mentioned in the context of a nice environment.

3. The interraters should have at least a reasonable degree of similarity in intelligence, background, and interest level in the topic in order to ensure a decent degree of interpretative coherence. It would further be advisable to attune the educational and interest level of the interraters to the target group of the study, so that the reader could encounter a greater level of recognition with the study topic as well as the findings.

This may be valuable advice for future applications of this valuable tool to qualitative studies.
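As a brief illustration of the first recommendation, the fixed-submission scenario from point 1 can be computed directly (the helper name is illustrative, not the author's):

def fixed_submission_rate(common: int, required: int) -> float:
    """Agreement rate when both interraters must submit the same required number of themes."""
    return common / required

# 12 common themes out of a required 15 per interrater.
print(round(fixed_submission_rate(12, 15) * 100, 1))  # 80.0
# Even with only 10 common themes out of 15, the 66.7% minimum is still met.
print(round(fixed_submission_rate(10, 15) * 100, 1))  # 66.7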

Conclusion

As mentioned previously, interrater reliability has, up to now, not been a commonly used tool in phenomenological studies. This was possibly attributable to the fact that various qualitatively oriented scholars have asserted in past years that it is difficult to obtain consistency in qualitative data analysis and interpretation (Armstrong et al., 1997). This perception is supported by the arguments from various scholars that different reviewers cannot coherently analyze a single package of qualitative data, and it feeds the still general perception that qualitative study, a category to which phenomenology belongs, is less scientifically grounded than quantitative study. These scholars instead introduced a variety of "new criteria for determining reliability and validity, and hence ensuring rigor, in qualitative inquiry" (Morse et al., 2002, p. 2). Unfortunately, the majority of these criteria are either of a "post hoc" (evaluative) nature, which entails that they are applied after the study has been executed, when correction is not possible anymore, or of a non-rigorous nature, such as member checks, which are merely used as a confirmation tool for the study participants regarding the authenticity of the provided raw data but have nothing to do with the data analysis procedures (Morse et al., 2002).

Of the eight phenomenology dissertations that this researcher reviewed prior to embarking on her own experiential journey, none applied this instrument of control and solidification. However, having been confronted by the guiding committee in a phenomenological study on spirituality in the workplace with the application of this tool as an enhancement of the reliability of the findings as well as a bias reduction mechanism, the researcher found that the establishment of interrater reliability or interrater agreement was a major solidification of the themes that were ultimately listed as the most significant ones in this study.

This conclusion is shared with Armstrong et al. (1997), who came to similar findings in an empirical study in which they attempted to detect the level to which various external raters could detect themes from similar data and demonstrate similar interpretations. The two main prerequisites presented by Armstrong et al., entailing data limitation and contextual interpretability, were similar to those from the researcher in this phenomenological study. These prerequisites were presented in this paper in the recommendations section.

Given the prerequisite of a certain minimal similarity in educational and cultural background as well as interest, the researcher of this particular study has found that the interraters could very well select themes with a similar understanding of the essentials in the data, even when those themes were "packaged" differently, which may erroneously elicit the idea that there was not enough coherence in the raters' perceptions and, thus, no sufficient interrater reliability. It is the researcher's opinion that the process of interrater reliability should be applied more often to phenomenological studies in order to provide them with a more scientifically recognizable basis.

An interesting lesson from this experience for the researcher was that the number of observations to be listed by the interraters, as well as the time allotted to the interraters, should preferably be kept synchronous. Yet, due to the risk of too widely varied choices being selected by the interraters if there are many themes available, one might also attempt to set as high a number of submissions as possible: with too few solicited submissions, raters may list different themes in spite of perfect common understanding, which may wrongfully educe the idea that there is not enough consistency in comprehension between the raters and, henceforth, no interrater reliability. The justifications for this argument are also presented in the recommendations section of this paper.

References

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3), 597-606.
Association for Spirit at Work. (2005). The professional association for people involved with spirituality in the workplace. Retrieved from http://www.spiritatwork.com/aboutSAW/profile_JudiNeal.cfm
Blodgett-McDeavitt, C. (1997, October). Meaning of participating in technology training: A phenomenology. Paper presented at the meeting of the Midwest Research-to-Practice Conference in Adult, Continuing and Community Education, Michigan State University, East Lansing, MI.
Butler, & Strayer, J. The many faces of empathy. Poster presented at the annual meeting of the Canadian Psychological Association, Edmonton, Alberta, Canada.
Colorado State University. Interrater reliability. Retrieved from http://writing.colostate.edu/guides/research/relval/com2a5.cfm
Creswell, J. (1998). Qualitative inquiry and research design: Choosing among five traditions. Thousand Oaks, CA: Sage.
Dyre, B. Brian Dyre's pages. Retrieved from http://129.101.107/brian/218%20Lecture%20Slides/L10%20research%20designs.pdf
Hamilton, H., Gurak, E., Findlater, L., & Olive, W. (2003, May 6). The confusion matrix. Retrieved from http://www2.cs.uregina.ca/~hamilton/courses/831/notes/confusion_matrix/confusion_matrix.html
Interrater reliability. (n.d.). Retrieved from http://schatz.sju.edu/multivar/reliab/interrater.html
Isaac, S., & Michael, W. (1997). Handbook in research and evaluation (Vol. 3). San Diego, CA: Edits.
McMillan, J., & Schumacher, S. (1997). Research in education (5th ed.). New York: Longman.
Merleau-Ponty's phenomenology of perception. Retrieved from http://www.angelfire.com/md2/timewarp/merleauponty.html
Mitroff, I. I. Retrieved from the University of Southern California Marshall School of Business Web site: http://www.marshall.usc.edu/web/MOR.cfm?doc_id=3055
Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification strategies for establishing reliability and validity in qualitative research. International Journal of Qualitative Methods, 1(2), 1-19.
Mott, M., Etsler, C., & Drumgold, D. (2003). Applying an analytic writing rubric to children's hypermedia "narratives." Early Childhood Research & Practice, 5(1). Retrieved from http://ecrp.uiuc.edu/v5n1/mott.html
Myers, M. (2000, March). Qualitative research and the generalizability question: Standing firm with Proteus. The Qualitative Report, 4(3/4). Retrieved from http://www.nova.edu/ssss/QR/QR4-3/myers.html
A phenomenological study of quest-oriented religion. Retrieved from http://www.twu.ca/cpsy/Documents/Theses/Matt%20Thesis.pdf
Posner, K. L., Sampson, P. D., Ward, R. J., & Cheney, F. W. (1990). Measuring interrater reliability among multiple raters: An example of methods for nominal data. Statistics in Medicine, 9, 1103-1115.
Richmond University. (n.d.). Interrater reliability. Retrieved from http://www.richmond.edu/~pli/psy200_old/measure/interrater.html
School of Business at the University of New Haven. Dr. Judi Neal, Associate Professor. Retrieved from http://www.newhaven.edu/faculty/neal/
Srebnik, D., Uehara, E., Smukler, M., Russo, J., Comtois, K., & Snowden, M. (2002). Psychometric properties and utility of the Problem Severity Summary for adults with serious mental illness. Psychiatric Services, 53, 1010-1017. Retrieved from http://ps.psychiatryonline.org/cgi/content/full/53/8/1010
Tashakkori, A., & Teddlie, C. (1998). Mixed methodology (Vol. 46). Thousand Oaks, CA: Sage.
Van Manen, M. (2002a). Phenomenological inquiry. Retrieved from http://www.phenomenologyonline.com/inquiry/1.html
Van Manen, M. (2002b). Retrieved from http://www.phenomenologyonline.com/inquiry/49.html

Appendix A

Interview Protocol

Project: Spirituality in the Workplace: Establishing a Broadly Acceptable Definition of this Phenomenon

Time of interview:
Date:
Place:
Interviewer:
Interviewee:
Position of interviewee:


To the interviewee:
Thank you for participating in this study and for committing your time and effort.
I value the unique perspective and contribution that you will make to this study.

My study aims to establish a broadly acceptable definition of “spirituality in the
workplace” by exploring the experiences and perceptions of a small group of recognized
interviewees, who have had significant exposure to the phenomenon, either through
practical or theoretical experience. You have been identified as one of these icons. You will be
asked for your personal definitions and perceived essentials (meanings, thoughts, and
backgrounds) regarding spirituality in the workplace. I am looking for accurate and
comprehensive portrayals of what these essentials are like for you: your thoughts,
feelings, insights, and recollections that might illustrate your statements. Your
participation will hopefully help me understand the essential elements of “spirituality in
the workplace.”

Questions

1. Definition of Spirituality in the Workplace
1.1 How would you describe spirituality in the workplace?
1.2 What are some words that you consider to be crucial to a spiritual workplace?
1.3 Do you consider these words applicable to all work environments that meet your
personal standards of a spiritual workplace?
1.4 What is essential for the experience of a spiritual workplace?

2. Possible structural meanings of experiencing spirituality in the workplace
2.1 If a worker was operating at his or her highest level of spiritual awareness, what
would he or she actually do?
2.2 If a worker was operating at his or her highest level of spiritual awareness, what
would he or she not do?
2.3 What is easy about living in alignment with spiritual values in the workplace?
2.4 What is difficult about living in alignment with spiritual values in the workplace?

3. Underlying themes and contexts for the experience of a spiritual workplace
3.1 If an organization is consciously attempting to nurture spirituality in the workplace,
what will be present?
3.2 If an organization is consciously attempting to nurture spirituality in the workplace,
what will be absent?

4. General structures that precipitate feelings and thoughts about the experience of
spirituality in the workplace.
4.1 What are some of the organizational reasons that could influence the transformation
from a workplace that does not consciously attempt to nurture spirituality and the human
spirit to one that does?
4.2 From the employee’s perspective, what are some of the reasons to transform from a
worker who does not attempt to live and work with spiritual values and practices to one
that does?


5. Conclusion
Would you like to add, modify, or delete anything significant from the interview that
would give a better or fuller understanding toward the establishment of a broadly
acceptable definition of “spirituality in the workplace”?

Thank you very much for your participation.

Author Note

Joan Marques was born in Suriname, South America, where she made a career in
advertising, public relations, and program hosting. She founded and managed an
advertising and P.R. company as well as a foundation for women’s awareness issues. In
1998 she immigrated to California and embarked upon a journey of continuing education
and inspiration. She holds a Bachelor's degree in Business Economics from M.O.C. in
Suriname, a Master’s degree in Business Administration from Woodbury University, and
a Doctorate in Organizational Leadership from Pepperdine University. Her recently
completed dissertation was centered on the topic of “spirituality in the workplace.” Dr.
Marques is currently affiliated with Woodbury University as an instructor of Business &
Management. She has authored a wide variety of articles pertaining to workplace
contentment for audiences in different continents of the globe. Joan Marques, 712 Elliot
Drive # B, Burbank, CA 91504; E-mail: jmarques01@earthlink.net; Telephone: (818)
845 3063
Chester H. McCall, Jr., Ph.D. entered Pepperdine University after 20 years of
consulting experience in such fields as education, health care, and urban transportation.
He has served as a consultant to the Research Division of the National Education
Association, several school districts, and several emergency health care programs,
providing survey research, systems evaluation, and analysis expertise. He is the author of
two introductory texts in statistics, more than 25 articles, and has served on the faculty of
The George Washington University. At Pepperdine, he teaches courses in data analysis,
research methods, and a comprehensive exam seminar, and also serves as chair for
numerous dissertations. Email: cmccall@pepperdine.edu

Copyright 2005: Joan F. Marques, Chester McCall, and Nova Southeastern
University

Article Citation

Marques, J. F. (2005). The application of interrater reliability as a solidification
instrument in a phenomenological study. The Qualitative Report 10(3), 439-462.
Retrieved [Insert date], from http://www.nova.edu/ssss/QR/QR10-4/marques.pdf

Sociology, August 1997, v31 n3 p597(10)

The place of inter-rater reliability in qualitative research: an empirical study

by David Armstrong, Ann Gosling, Josh Weinman and Theresa Marteau

Assessing inter-rater reliability, whereby data are independently coded and the codings compared
for agreement, is a recognised process in quantitative research. However, its applicability to
qualitative research is less clear: should researchers be expected to identify the same codes or
themes in a transcript or should they be expected to produce different accounts? Some
qualitative researchers argue that assessing inter-rater reliability is an important method for
ensuring rigour, others that it is unimportant; and yet it has never been formally examined in an
empirical qualitative study. Accordingly, to explore the degree of inter-rater reliability that might
be expected, six researchers were asked to identify themes in the same focus group transcript.
The results showed close agreement on the basic themes but each analyst ’packaged’ the themes
differently.

Key words: inter-rater reliability, qualitative research, research methods.

© COPYRIGHT 1997 British Sociological Association Publication Ltd. (BSA)

Reliability and validity are fundamental concerns of the quantitative researcher but seem to have an uncertain place in the repertoire of the qualitative methodologist. Indeed, for some researchers the problem has apparently disappeared: as Denzin and Lincoln have observed, 'Terms such as credibility, transferability, dependability and confirmability replace the usual positivist criteria of internal and external validity, reliability and objectivity' (1994:14). Nevertheless, the ghost of reliability and validity continues to haunt qualitative methodology and different researchers in the field have approached the problem in a number of different ways.

One strategy for addressing these concepts is that of 'triangulation'. This device, it is claimed, follows from navigation science and the techniques deployed by surveyors to establish the accuracy of a particular point (though it bears remarkable similarities to the psychometric concepts of convergent and construct validity). In this way, it is argued, diverse confirmatory instances in qualitative research lend weight to findings. Denzin (1978) suggested that triangulation can involve a variety of data sources; multiple theoretical perspectives to interpret a single set of data; multiple methodologies to study a single problem; and several different researchers or evaluators. This latter form of triangulation implies that the difference between researchers can be used as a method for promoting better understanding. But what role is there for the more traditional concept of reliability? Should the consistency of researchers' interpretations, rather than their differences, be used as a support for the status of any findings?

In general, qualitative methodologies do not make explicit use of the concept of inter-rater reliability to establish the consistency of findings from an analysis conducted by two or more researchers. However, the concept emerges implicitly in descriptions of procedures for carrying out the analysis of qualitative data. The frequent stress on an analysis being better conducted as a group activity suggests that results will be improved if one view is tempered by another. Waitzkin described meeting with two research assistants to discuss and negotiate agreements and disagreements about coding in a process he described as 'hashing out' (1991:69). Another example is afforded by Olesen and her colleagues (1994) who described how they (together with their graduate students, a standard resource in these reports) 'debriefed' and 'brainstormed' to pull out first-order statements from respondents' accounts and agree them. Indeed, in commenting on Olesen and her colleagues' work, Bryman and Burgess (1994) wondered whether members of teams should produce separate analyses and then resolve any discrepancies, or whether joint meetings should generate a single, definitive coded set of materials.

Qualitative methodologists are keen on stressing the transparency of their technique, for example, in carefully documenting all steps, presumably so that they can be 'checked' by another researcher: 'by keeping all collected data in well-organized, retrievable form, researchers can make them available easily if the findings are challenged or if another researcher wants to reanalyze the data' (Marshall and Rossman 1989:146). Although there is no formal description of how any reanalysis of data might be used, there is clearly an assumption that comparison with the original findings can be used to reject, or sustain, any challenge to the original interpretations. In other words, there is an implicit notion of reliability within the call for transparency of technique.
Unusually for a literature that is so opaque about the importance of independent analyses of a single dataset, Mays and Pope explicitly use the term 'reliability' and, moreover, claim that it is a significant criterion for assessing the value of a piece of qualitative research: 'the analysis of qualitative data can be enhanced by organising an independent assessment of transcripts by additional skilled qualitative researchers and comparing agreement between the raters' (1995:110). This approach was used by Daly et al. (1992) in a study of clinical encounters between cardiologists and their patients when the transcripts were analysed by the principal researcher and 'an independent assessment panel', though the process was actually one of ascribing quantitative weights to pregiven 'variables' which were then subjected to statistical analysis (1992:204).

A contrary position is taken by Morse who argues that the use of 'external raters' is more suited to quantitative research and that expecting another researcher to have the same 'insights' from a limited data base is unrealistic: 'No-one takes a second reader to the library to check that indeed he or she is interpreting the original sources correctly, so why does anyone need a reliability checker for his or her data?' (Morse 1994:231). Presumably this latter position would be supported by most qualitative researchers, particularly those drawing inspiration from Glaser and Strauss's seminal text which claimed that the virtue of inductive processes was that they ensured that the ensuing theory was 'closely related to the daily realities (what is actually going on) of substantive areas' (1967:239). This position is taken further by those so-called 'post-modernist' qualitative researchers (Vidich and Lyman 1994) who would challenge the whole notion of consistency in analysing data. Tyler (1986) claims that a qualitative account cannot be held to 'represent' the social world; rather, it 'evokes' it. The researcher's analysis bears no direct correspondence with any underlying 'reality', and different researchers would be expected to offer different accounts; as reality itself (if indeed it can be accessed) is characterised by multiplicity, different researchers would be expected to offer different evocations. Hammersley (1991), by contrast, argues that this position risks privileging the rhetorical over the 'scientific' and that quality of argument and use of evidence should remain the arbiters of qualitative accounts, allowing some correspondence between the description and reality that would allow a role for 'consistency'.

In summary, the debates within qualitative methodology on the place of the traditional concept of reliability (and validity) remain confused. On the one hand are those researchers such as Mays and Pope who believe reliability should be a benchmark for judging qualitative research. On the other hand are those who adopt such a relativist position that issues of consistency are meaningless, as all accounts have some 'validity' whatever their claims. A theoretical resolution of these divergent positions is impossible as their core ontological assumptions are so different. Yet this still leaves a simple empirical question: do qualitative researchers actually show consistency in their accounts? The answer to this question may not resolve the methodological confusion, but it may clarify the nature of the debate: if accounts diverge, then for the modernists there is a methodological problem and for the postmodernists a confirmation of diversity; if accounts agree, the modernists' search for measures of consistency is reinforced and the postmodernists need to recognise that accounts do not necessarily capture the multiple character of reality. The study therefore involved asking a number of qualitative researchers to identify themes in the same data set.

Method

As part of a wider study of the relationship between perceptions of disability and genetic screening, a number of focus groups were organised. One of these focus groups consisted of adults with cystic fibrosis (CF), a genetic disorder affecting the secretory tissues of the body, particularly the lung, and a condition for which widespread genetic screening was being advocated. The aim of such a screening programme was to identify 'carriers' of the gene so that their reproductive decisions might be influenced to prevent the birth of children with the disorder.

The focus group was invited to discuss the topic of genetic screening. The session was introduced with a brief summary of what screening techniques were currently available, and then discussion from the group on views of genetic screening was invited and facilitated. The ensuing discussion was tape recorded and transcribed.

The focus group interview with the adults with cystic fibrosis was transcribed into a document 13,500 words long and sent to the six designated researchers: six experienced qualitative investigators in Britain and the United States who had some interest in this area of work were approached and asked if they would 'analyse' the transcript and prepare an independent report on it, identifying, and where possible rank ordering, the main themes emerging from the discussion (with a maximum of five themes). The analysts were offered a fee for this work. All six researchers returned reports, and these accounts were then themselves subjected to analysis to identify the degree of concordance between them.

The purpose of the study was to see the extent to which researchers show consistency in their accounts and whether or not there were similarities and differences between them. The choice of method for examining the six reports was made on pragmatic grounds. One method would have been to ask a further six researchers to write reports on the degree of consistency that they perceived in the initial accounts; not only might these accounts themselves have needed yet further assessment, but at some point a 'final' judgement of consistency needed to be made, and it was thought that this could just as easily be made on the first set of reports. Accordingly, one of the authors (DA) scrutinised all six reports and deliberately did not read the original focus group transcript. The approach involved listing the themes that were identified by the six researchers and making judgements about the level of agreement between them.

Results

In broad outline, the six analysts did identify similar themes, but there were significant differences in the way they were 'packaged'. All six analysts identified a similar constellation of themes around such issues as the relative invisibility of genetic disorders, people's ignorance, the eugenic debate, 'health service provision' and 'genetic screening', and health care choices and the future outcomes for society. Perhaps because of the number requested, four analysts identified five themes each. The sixth analyst returned a lengthy and discursive report that commented extensively on the dynamics of the focus group but then discussed a number of more thematic issues, from which the same five themes could be abstracted. Although each theme was given a label, it was more than a simple descriptor: the theme was placed in a context that gave it coherence, and even where the reports differed in the actual label they applied to a theme (e.g., 'visibility', 'relative deprivation in relation to visibly disabled', 'misperceptions of the disabled', and 'images of disability' were worded differently), it was clear from the accompanying description that they all related to the same phenomenon. These differences can be illustrated by examining four different themes that the researchers identified in the transcript.

Visibility. At its simplest, the packaging of themes can be illustrated by the way that the theme of the relative invisibility of genetic disorders as forms of disability was handled. All six analysts agreed that it was an important theme, and in those instances when the analysts attempted a ranking, most placed it first. All also expressed it as a comparative phenomenon: traditional disability is visible while CF is invisible. For example: the general public were prepared to identify - and give consideration to - disability that was overt, gross and physical, the stereotypes of the disabled person in the wheelchair, in contrast with traditional images of deviance, whereas genetic disorders such as CF were more hidden and less likely to elicit a sympathetic response. While all analysts identified an invisibility theme, it came with an implicit package of associated issues, and it was used by some analysts as a vehicle for other issues that they thought were related: a link with stigma was mentioned by two analysts, another pointed out the difficulty of managing invisibility by CF sufferers, and others noted the special problems posed by the general invisibility of so many genetic disabilities. Even so, there was general agreement on the theme and its meaning across all the analysts. Whereas the theme of invisibility had a clear referent of visibility against which there could be general consensus, however, other themes offered fewer such 'natural' backdrops.

Ignorance. The theme of people's ignorance about genetic matters was picked up by five of the six analysts, but presented in different ways (labels included 'ignorance' and 'ignorance and fear about genetic disorders and screening'). The main attitudes expressed were of great concern at the low levels of public awareness and understanding of genetic disorders and screening, and of great concern that more educational effort should be put into putting this right: the group saw the public as associating genetic technologies with Hitler, eugenics, and sex selection. Only one analyst expressed ignorance as a basic theme, while others chose to link ignorance with other issues to make a broader theme: analysts frequently linked it explicitly with the need for education, and three other analysts tied the population's ignorance to the eugenic threat.
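The comparison procedure described in the Method section, listing each analyst's themes and judging their overlap, can be illustrated with a small sketch. The theme lists below are hypothetical placeholders rather than the study's actual reports; the sketch simply counts how many of the six analysts endorsed each (crudely normalised) theme label.

from collections import Counter

# Hypothetical, simplified theme lists for six analysts (not the actual reports).
reports = {
    "analyst_1": ["invisibility", "ignorance", "eugenics", "health service provision", "screening choices"],
    "analyst_2": ["invisibility", "ignorance and fear", "eugenic threat", "genetic screening", "stigma"],
    "analyst_3": ["relative invisibility", "public ignorance", "eugenics", "health care choices", "education"],
    "analyst_4": ["images of disability", "ignorance", "eugenics", "screening", "future outcomes"],
    "analyst_5": ["invisibility", "ignorance", "eugenics", "health service provision", "education"],
    "analyst_6": ["group dynamics", "invisibility", "ignorance", "eugenics", "screening"],
}

# Very crude normalisation of differently "packaged" labels onto shared themes.
SYNONYMS = {
    "relative invisibility": "invisibility",
    "images of disability": "invisibility",
    "public ignorance": "ignorance",
    "ignorance and fear": "ignorance",
    "eugenic threat": "eugenics",
    "genetic screening": "screening",
    "screening choices": "screening",
    "health care choices": "health service provision",
}

def normalise(label: str) -> str:
    return SYNONYMS.get(label, label)

# Count how many analysts endorsed each normalised theme (each analyst counts once per theme).
endorsements = Counter()
for themes in reports.values():
    for theme in {normalise(t) for t in themes}:
        endorsements[theme] += 1

for theme, n in endorsements.most_common():
    print(f"{theme}: identified by {n} of {len(reports)} analysts")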


