VOL. 17, NO. 3, 1991
Measurement and Reliability: Statistical Thinking Considerations
Abstract: Reliability is defined as the degree to which multiple assessments of a subject agree (reproducibility). There is increasing awareness among researchers that the two most appropriate measures of reliability are the intraclass correlation coefficient and kappa. However, unacceptable statistical measures of reliability, such as chi-square, percent agreement, the product moment correlation, any measure of association, and Yule's Y, still appear in the literature. There are costs associated with improper measurements, unreliable diagnostic systems, inappropriate statistics and measures of reliability, and poor-quality research. Costs are incurred when misleading information directs resources and talents into nonproductive avenues of research. The consequences of unreliable measurements and diagnosis are illustrated with some studies of schizophrenia.

by John J. Bartko

Measures of reliability, such as the intraclass correlation coefficient (ICC) and kappa, depend on two components of variance. Reliability, defined as the degree to which multiple assessments of a subject agree (reproducibility), is in concept and computation a function (ratio) of these two variances.

What are the variance components that influence reliability, and what are their sources? A taxonomy example will illustrate. Suppose a group of taxonomists, using a "mammal rating scale," are presented with only elephants, representing zero between-mammals variance. (Between-subject variance among the rated subjects is one of the two reliability variance components.) Presumably, the raters will have perfect agreement for elephants. Is this a reportable reliability measure for the mammal instrument? What will be the raters' performance if they also encounter mice? With mice as well as elephants as rating targets, the between-mammals variance is nonzero. This setting presents a broader test of the raters' skills, and the reliability measure is more representative of those skills.

The second component of variance critical to reliability expressions is the error variance, or within-subjects variance. This variance expresses the degree to which raters agree for a given mammal, target, or subject. If they agree when they observe both an elephant and a mouse, and they have no mouse-elephant disagreements, the within-subject variance is zero and their reliability is unity, that is, perfect.

The best reliability setting occurs when (1) raters are presented with a large and varied number of subjects to rate (nonzero between-subjects or target variance; in psychometrics this is noted as true score variance), and (2) most of the raters agree among themselves for each and every one of the targets or subjects rated (low within-subject or error variance).

In heterogeneous psychiatric study groups, there are usually many opportunities (targets) to explore the full range of the rating instrument. That being the case, the true score or target variance is typically large. However, in homogeneous environments, such as community surveys in which the prevalence (base rate) for a disorder is low, the between-subjects target variance is low.

Reprint requests should be sent to Dr. J.J. Bartko, NIH Campus, NIMH, Bldg. 10, Rm. 3N-204, 9000 Rockville Pike, Bethesda, MD 20892.
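The ratio described above can be sketched numerically. Reliability is expressed as TV/(TV + EV), where TV is the between-subject (true) variance and EV the within-subject (error) variance; the toy numbers below are mine, chosen only to echo the elephants-and-mice illustration, not taken from the article.

```python
# Minimal sketch (illustrative numbers are mine): reliability as the ratio
# of between-subject ("true") variance to total variance.

def reliability(tv, ev):
    """Reliability = TV / (TV + EV)."""
    return tv / (tv + ev)

# Elephants only: zero between-mammal variance, so even near-perfect
# agreement yields an uninformative reliability of 0.
print(reliability(tv=0.0, ev=0.5))   # 0.0

# Elephants and mice: large target variance, small rater disagreement.
print(reliability(tv=10.0, ev=0.5))  # ~0.95
```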


A low base rate places natural limits on use of the research instrument's range. Simultaneously, low prevalence places restrictions on the adequacy of the estimates of the two variance components, TV and EV, necessary for computing reliability measures. The examples that follow illustrate the role that diagnostic reliability has played in studies of schizophrenia and how poor reliability in diagnosis and measurement translates into costs.

The measurement of reliability is exacting. A slogan in quality control circles is "quality pays, it doesn't cost." Without quality there are costs, such as those associated with misleading and inappropriate statistics and the subsequent misuse of patients and their data. This creates a greater need to minimize "mistakes." Fewer mistakes lead to smaller within-subjects error variance.

Quality in measurement, products, and research can have a revolutionary effect on social and economic structure. Forty years ago, W. Edwards Deming, a U.S. statistician, was invited to Japan to teach statistical quality control, the importance of variance and statistical thinking, and the integration of these into management practice. Although Deming (1978, 1986) is revered and honored in Japan and has a major quality control prize named for him, he was largely ignored in the United States until about 1980. As consumers we know how effective this thinking has been and the dramatic effect it has had on our lives; the automobile and electronics industries provide a dramatic illustration.

Reliability: Continuous Data

Illustrations of appropriate and inappropriate measures of reliability, and of the effect variances have on reliability, are presented in table 1. The data sets in table 1 are for two raters and five subjects, assumed suitable for ANOVA. The appropriate measure of reliability for continuous data is the ICC (Bartko 1966; Bartko and Carpenter 1976; Bartko et al. 1988). The table also illustrates an inappropriate measure, the Pearson product moment correlation (r).

Table 1. Comparison of intraclass correlation coefficient (ICC) and Pearson product moment correlation (r) for three data sets

                 Data set 1      Data set 2       Data set 3
                 (R2 = R1)       (R2 = R1 + 4)    (R2 = 2 x R1)
Subject          R1    R2        R1    R2         R1    R2
1                 1     1         1     5          1     2
2                 2     2         2     6          2     4
3                 3     3         3     7          3     6
4                 4     4         4     8          4     8
5                 5     5         5     9          5    10

ICC              1.0            -0.23             0.34
r                1.0             1.0              1.0
EV               0.0             8.0              5.5
TV               2.5            -1.5              2.875
TV/(TV + EV)¹    1.0            -0.23             0.34
M = TV/EV²       2.5/0          -0.1875           0.52

Note.—R = rater; TV = true variance; EV = error variance.
¹The expression of reliability as a function of the two variance components, TV and EV.
²M is an expression of how large the TV is relative to the EV.

The ICC measures how closely raters agree for each and every subject. The method for computing the ICC by analysis of variance (ANOVA) is presented elsewhere (Bartko and Carpenter 1976; Fleiss 1981). The ANOVA is the natural statistical tool for assembling the variance expressions required to compute the ICC; it is a classic technique for the estimation of variance components given their expected mean square expressions (Bartko 1966; Winer 1971).

True variance (TV) is an expression of how much variability there is among the subjects. One can think of it as an expression of how wide and varied the opportunities are for the raters to use the full range of the rating instrument. Statistically, it is defined as the one-way ANOVA between-subjects mean square minus the within-subjects mean square, divided by 2. Error variance (EV) measures the dispersion of ratings within a subject. It is zero if the raters agree for each and every subject; otherwise it is greater than zero. In the one-way ANOVA table, EV is simply the within-subjects mean square. This variance information is summarized across the rated subjects and collected into the ICC reliability measure.

For data set 1, the two raters agree on each of the subjects; therefore the within-subjects variance is zero. Note that there are differences (between-subject variance) among the five subjects, so the raters have an opportunity to use a broad range of the instrument. Statistically, the ICC value of unity reflects the nature of the data, that is, that the raters agree perfectly (zero within-subject variance). The Pearson correlation is unity for data set 1 (as well as for the other two data sets) only because the rater data are linearly associated. Correlations (Pearson, Spearman, etc.) measure association, not agreement, and therefore measure association, not reliability.

Data set 2 illustrates variability among the five subjects; however, rater 2 is consistently four units higher than rater 1. This data set shows very large within-subject EV: the disagreement between the two raters over the subjects is larger than the variance among the five subjects. The calculation for this data set produces a negative or zero ICC, whereas the inappropriate Pearson correlation is again unity. The peculiar nature of the data, the constant difference between raters, has no effect on the Pearson correlation. If a Pearson coefficient of unity were to be reported for data set 2, the unaware reader would assume that the ratings were identical for each and every subject, the commonly accepted meaning of a reliability of unity. Suppose the data appeared in two research centers, where one reported the appropriate ICC and the other the inappropriate Pearson coefficient; one center would claim perfect reliability at the expense of the other.

Data set 3 illustrates less disagreement at the lower end but more at the higher end of the scale. The ICC for this data set is 0.34, while the inappropriate Pearson correlation is again unity, reflecting the linear (multiplicative, R2 = 2 x R1) relationship between the two raters. Unlike data set 2, here the EV does not overwhelm the TV.

Other entries in table 1 should be emphasized. Note that the second-to-last line of the table, TV/(TV + EV), is identical to the ICC values in the same table: this line expresses reliability as a ratio of the two component variances. The relationship between the variances is summarized in "M," an expression of how large the TV is relative to the EV.

Clearly, it is not acceptable to report simply that "the reliability was 0.7"; if an inappropriate measure was used, the number is misleading. It is essential to state the name of the reliability measure and perhaps even how it was computed; the reader will then be in a position to make a judgment regarding quality. Reliability coefficients are only one of several benchmarks used to assess quality in research. Results of studies are difficult enough to interpret without the added burden of unsuitable statistics, misinformation, and substandard methods. We cannot afford the cost of misinterpretation.

Reliability: Categorical Data

Table 2 illustrates a simple 2 x 2 categorical data format with the appropriate reliability measure, kappa. (Computation expressions for kappa can be found in Bartko and Carpenter [1976] and Fleiss [1981].) Kappa can be expressed in terms of variances (not elaborated on here), and it has been generalized to more than two raters, more than two categories, and so forth (Bartko and Carpenter 1976; Fleiss 1981). Inappropriate measures for such data formats include percent agreement, chi-square, and Yule's Y (Yule 1912).

Percent agreement does not make allowance for chance agreement, whereas kappa does. Chi-square measures association, not agreement. An illustration of chi-square's inappropriateness follows. Suppose cells b and c are zero and a and d are not; in a second data set, suppose cells a and d are zero but b and c are not (for the same n). The first case represents maximal agreement, while the second shows maximal disagreement, yet these two data sets generate the same chi-square value. A contingency coefficient based on the chi-square is not a remedy.

Some writers (e.g., Spitznagel and Helzer 1985) have proposed Yule's Y as a measure of reliability; they denounced chance-corrected kappa for allegedly having a base rate problem (low prevalence in homogeneous study groups). Yule, however, proposed Y to measure association, not agreement, and Y is limited to 2 x 2 tables. Yule's Y may be large when there is major rater disagreement, for example, when cell b is large and cell c is zero or small. Moreover, Y has no comparable interpretive standards, whereas those of kappa are well established (Landis and Koch 1977): values of kappa greater than 0.70 are regarded as excellent.

The first data set of table 2 illustrates agreement on only one case (a = 1), a 1 percent "prevalence" rate; kappa is only 0.49. In data sets 2 and 3 the prevalence rates are 2 percent and 8 percent, respectively. The disagreement pattern (two disagreements, one per rater) is the same across the data sets, yet kappa changes dramatically, in keeping with the striking change in TV and EV.
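The one-way ANOVA computation described above can be checked with a short script. This is an illustrative sketch (function and variable names are mine, not from the article) of the ICC for two raters and n subjects:

```python
# Sketch of the one-way ANOVA ICC for two-rater data (names are mine).

def icc_oneway(ratings):
    """ratings: list of (rater1, rater2) pairs, one per subject.
    Returns (ICC, TV, EV) from the one-way ANOVA mean squares."""
    n = len(ratings)
    k = 2  # ratings per subject
    grand = sum(r for pair in ratings for r in pair) / (n * k)
    means = [sum(pair) / k for pair in ratings]
    # Between-subjects mean square (BMS) and within-subjects mean square (WMS).
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum((r - m) ** 2 for pair, m in zip(ratings, means) for r in pair) / (n * (k - 1))
    tv = (bms - wms) / k   # true (between-subjects) variance component
    ev = wms               # error (within-subject) variance
    icc = (bms - wms) / (bms + (k - 1) * wms)  # equals TV / (TV + EV)
    return icc, tv, ev

set1 = [(i, i) for i in range(1, 6)]      # R2 = R1
set2 = [(i, i + 4) for i in range(1, 6)]  # R2 = R1 + 4
set3 = [(i, 2 * i) for i in range(1, 6)]  # R2 = 2 x R1

for name, data in [("set 1", set1), ("set 2", set2), ("set 3", set3)]:
    icc, tv, ev = icc_oneway(data)
    print(f"{name}: ICC={icc:+.2f}  TV={tv:+.3f}  EV={ev:.1f}")
```

Run on the three data sets, this reproduces the ICC, TV, and EV rows of table 1, and in each case ICC = TV/(TV + EV).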

Kappa increases by 35 percent with a slight change from a = 1 to a = 2. Note that Yule's Y, which is not a function of the two variances, changes little.

Table 2. General format for 2 x 2 tables and three data sets

General format:
                      Rater B
Rater A          Scz    Not Scz   Total
  Scz             a        b      a + b
  Not Scz         c        d      c + d
  Total         a + c    b + d      n

Data set 1 (kappa = 0.49; M = 0.97; Yule's Y = 0.82):
                      Rater B
Rater A          Scz    Not Scz   Total
  Scz             1        1        2
  Not Scz         1       97       98
  Total           2       98      100

Data set 2 (kappa = 0.66; M = 1.93; Yule's Y = 0.86):
                      Rater B
Rater A          Scz    Not Scz   Total
  Scz             2        1        3
  Not Scz         1       96       97
  Total           3       97      100

Data set 3 (kappa = 0.88; M = 7.3; Yule's Y = 0.93):
                      Rater B
Rater A          Scz    Not Scz   Total
  Scz             8        1        9
  Not Scz         1       90       91
  Total           9       91      100

Note.—Scz = schizophrenic; M = ratio of true variance to error variance.

In the first data set, the TV and EV (M = 0.97) are approximately the same. In the second data set, the TV is about twice the EV. In these three examples the EVs are identical while the target variance increases, and kappa increases accordingly.

Reliability and Variance

We have seen that two components of variance form the statistical foundation of reliability. Tables 1 and 2 illustrate several continuous and categorical data sets with the two components, TV and EV. All of the energy devoted (e.g., Spitznagel and Helzer 1985) to discussing the "low base rate problem" and its impact on the measurement of reliability has misfocused on problems with statistics and ignored the demands placed on the statistics in low-TV settings. In such settings the onus is on the investigator to use designs and procedures that reduce EV, the one component over which the investigator can exercise some control. Rather than exercise such control, researchers mount a post hoc search for a "better" statistic. This practice avoids the real issue (Kraemer et al. 1987; Shrout et al. 1987). As Shrout et al. (1987) said, "the use of Y will mislead researchers into thinking that measurement error is not a problem when in fact it is" (p. 176). The most important reason for abandoning Yule's Y for reliability is that its use for such purposes ignores statistical foundations, namely the fundamental role that variance plays in statistics and statistical thinking.

Reliability and Sources of Variance

Reliability is large if the TV is large or if the EV is small. Figure 1 demonstrates how reliability varies with the relationship between TV and EV, where M = TV/EV. As M increases, that is, as the TV increases relative to the EV, reliability increases. For example, suppose M = 1, the case where EV and TV are equal; the figure shows a reliability of approximately 0.5, which approximates the kappa found for data set 1 in table 2. From figure 1, the corresponding reliability value is 0.34 for data set 3 of table 1, in which M = 0.52. Likewise, figure 1 illustrates the kappas for the other two data sets in table 2. While a graph such as figure 1 holds for the ICC and kappa, it does not hold for correlations, percent agreement, chi-square, Yule's Y, or any reliability measure not structured on the statistical variance foundation.

Figure 1. Reliability as a function of variances. [Graph omitted: reliability = true variance/(true variance + error variance), plotted against values of the multiplier M from 1 to 10, where true variance = M x error variance.]

Control can often be exercised over the factors that contribute to EV. A list of some of these factors follows:

• Criteria: Which diagnostic instrument is used? Are some instruments more valid than others? Which have established training environments and thus provide for better rater competence? Is it easier to sort out rater disagreements with some instruments than with others?

• Occasion: How and when were the data collected? What is the time between responses? Test-retest settings have added variance sources, in that differences in ratings may be due to the raters or attributed to the changing condition of the subject.

• Setting: In heterogeneous settings the true target variance is usually large, while in homogeneous settings the EV can easily overwhelm the true target variance, making for low reliability.

• Signal: Have the data been drawn from case records or from current structured interviews? What is the condition of the subject being rated? How reliable is the subject's memory?

• Training: Automobiles and electronic devices have intrinsic reliability. In psychiatry, however, reliability depends on the raters, not the research instrument. Raters must have training and competence in the use of instruments, and their reliability experience should be reported. Although this seems obvious, regrettably one can find in the literature statements such as "We used XYZ, which has demonstrated reliability." Such statements are likely to appear without an expression of rater reliability within that setting.

Some Studies of Schizophrenia

It is worth exploring one component of EV and its impact in the literature, particularly criteria in schizophrenia studies. The number of schizophrenia classifications varies considerably within studies as the diagnostic system varies.

Strauss and Gift (1977) reported on 272 subjects with functional psychiatric disorders using seven diagnostic systems, in which the number of schizophrenia diagnoses ranged from 4 to 68. The number of subjects who met no criteria for schizophrenia was 150. Kendell (1982) reported on 119 psychotic subjects using nine diagnostic systems, in which the number of schizophrenia diagnoses ranged from 4 to 45, with a median of 29. The number of patients who met no criteria for schizophrenia was not stated. Young et al. (1982) reported on 196 inpatients using four diagnostic systems, in which the numbers of schizophrenia diagnoses were 38, 52, 55, and 58. The number of subjects who met no criteria for schizophrenia was 94. Where reported, kappas among the diagnostic systems in these studies were low, ranging from roughly 0.11 to 0.48; in one study reliability was not reported.

These studies illustrate dramatically that the criteria used can have a measurable impact on variance. Within a given study, the selection of a diagnostic instrument can be a pivotal element in the number of diagnoses reported. In comparing one study with another, or in any meta-analysis of such studies, a major source of differences and variance will be the diagnostic system selected and its reliability.

Discussion

The ICC for continuous data and kappa for categorical data are good measures of reliability; any other measures of reliability should be rejected. Reliability in measurement and diagnosis has been illustrated in some major studies of schizophrenia.

With unreliable diagnostic systems, costs are incurred when subjects are missed for treatment (false negatives) or when they are treated unnecessarily or are treated for another ailment (false positives). Costs also result when groups of subjects are misclassified; this can lead to the contamination of control or diagnostic groups, adding another error component to the results of multivariate group classification methods. Unreliable diagnostic systems, by their very nature, generate improper statistics.

Gore and Altman (1982) have argued that the reporting of inappropriate and misleading statistics is unethical. Studies by Gore et al. (1977) and White (1979) indicate that up to 50 percent of published articles have statistical errors, some serious enough to invalidate conclusions and results. Discovering substandard statistical methodology and design in a refereeing process is akin to the task of quality control inspectors in a production line: quality should be an integral part of the design process, not the inspection process. Without good statistical reviewing procedures, inappropriate statistics are often not discovered before publication.

There are costs associated with improper measurements, unreliable diagnostic systems, and inappropriate statistics and measures of reliability. These costs are felt when resources and talents are squandered and misleading information directs investigators into nonproductive avenues of research. It is wasteful of investigator resources and talent to focus on misleading reports.

References

Bartko, J.J. The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19:3-11, 1966.

Bartko, J.J., and Carpenter, W.T., Jr. On the methods and theory of reliability. Journal of Nervous and Mental Disease, 163:307-317, 1976.

Bartko, J.J.; Carpenter, W.T., Jr.; and McGlashan, T.H. Statistical issues in long-term followup studies. Schizophrenia Bulletin, 14:575-587, 1988.

Deming, W.E. Statistics Day in Japan. [Letter] American Statistician, 32:145, 1978.

Deming, W.E. Out of the Crisis. Cambridge, MA: Massachusetts Institute of Technology, Center for Advanced Engineering Study, 1986.

Fleiss, J.L. Statistical Methods for Rates and Proportions. 2d ed. New York: John Wiley & Sons, 1981.

Gore, S.M., and Altman, D.G. Statistics in Practice. London, England: British Medical Association, 1982.

Gore, S.M.; Jones, I.G.; and Rytter, E.C. Misuse of statistical methods: Critical assessment of articles in BMJ from January to March 1976. British Medical Journal, 1:85-87, 1977.

Grove, W.M.; Andreasen, N.C.; McDonald-Scott, P.; Keller, M.B.; and Shapiro, R.W. Reliability studies of psychiatric diagnosis: Theory and practice. Archives of General Psychiatry, 38:408-413, 1981.

Kendell, R.E. The choice of diagnostic criteria for biological research. Archives of General Psychiatry, 39:1334-1339, 1982.

Kraemer, H.C.; Pruyn, J.P.; Gibbons, R.D.; Greenhouse, J.B.; Grochocinski, V.J.; Waternaux, C.; and Kupfer, D.J. Methodology in psychiatric research: Report on the 1986 MacArthur Foundation Network I Methodology Institute. Archives of General Psychiatry, 44:1100-1106, 1987.

Landis, J.R., and Koch, G.G. The measurement of observer agreement for categorical data. Biometrics, 33:159-174, 1977.

Shrout, P.E.; Spitzer, R.L.; and Fleiss, J.L. Quantification of agreement in psychiatric diagnosis revisited. Archives of General Psychiatry, 44:172-177, 1987.

Spitznagel, E.L., and Helzer, J.E. A proposed solution to the base rate problem in the kappa statistic. Archives of General Psychiatry, 42:725-728, 1985.

Strauss, J.S., and Gift, T.E. Choosing an approach for diagnosing schizophrenia. Archives of General Psychiatry, 34:1248-1253, 1977.

White, S.J. Statistical errors in papers in the British Journal of Psychiatry. British Journal of Psychiatry, 135:336-342, 1979.

Winer, B.J. Statistical Principles in Experimental Design. 2d ed. New York: McGraw-Hill, 1971.

Young, M.A.; Tanner, M.A.; and Meltzer, H.Y. Operational definitions of schizophrenia: What do they identify? Journal of Nervous and Mental Disease, 170:443-447, 1982.

Yule, G.U. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75:581-642, 1912.

The Author

John J. Bartko, Ph.D., is Research and Consulting Mathematical Statistician, National Institute of Mental Health, National Institutes of Health, Bethesda, MD.

Announcement

The 8th annual conference, Latest Developments in the Etiology and Treatment of Schizophrenia, will be held November 1, 1991, at the Hilton Hotel, Pittsburgh, Pennsylvania. The special 1-day conference is an update of basic science research on the etiology of the disease and recent advances in treatment. Speakers will address etiologic theories and biochemical and neuroanatomic abnormalities, as well as current strategies to find better and more specific pharmacologic and nonpharmacologic treatments for schizophrenia.

For further information about the conference, please contact: Sheila Woodland, Conference Manager, OERP, Western Psychiatric Institute and Clinic, 3811 O'Hara Street, Pittsburgh, PA 15213. Telephone: 412-647-8262.