Reliability

"Reliability" refers to the consistency of scores obtained by the same persons when they are reexamined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions. This concept of reliability underlies the computation of the error of measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single individual's score as a result of irrelevant or unknown chance factors.

The concept of reliability has been used to cover several aspects of score consistency. In its broadest sense, test reliability indicates the extent to which individual differences in test scores are attributable to "true" differences in the characteristics under consideration and the extent to which they are attributable to chance errors. To put it in more technical terms, measures of test reliability make it possible to estimate what proportion of the total variance of test scores is error variance. The crux of the matter, however, lies in the definition of error variance. Factors that might be considered error variance for one purpose would be classified under true variance for another. For example, if we are interested in measuring fluctuations of mood, then the day-by-day changes in scores on a test of cheerfulness-depression would be relevant to the purpose of the test and would hence be part of the true variance of the scores. If, on the other hand, the test is designed to measure more permanent personality characteristics, the same daily fluctuations would fall under the heading of error variance.

The error terminology has been inherited from an earlier era in psychology, when interest centered on finding general laws of behavior and assessing individuals on what were assumed to be rigidly fixed underlying traits. Today, psychologists recognize variability as an intrinsic property of all behavior and seek to investigate and sort out the many sources of such variability. These are not "errors" in the sense that they could be avoided or corrected through improved methodology. With regard to score reliability, however, any condition that is irrelevant to the purpose of the test represents error variance. Thus, when examiners try to maintain uniform testing conditions by controlling the testing environment, instructions, time limits, rapport, and other similar factors, they are reducing error variance and making the test scores more reliable. Despite optimum testing conditions, however, no test is a perfectly reliable instrument. Hence, every test should be accompanied by a statement of its reliability. Such a measure of reliability characterizes the test when it is administered under standard conditions and given to persons similar to those constituting the normative sample. The characteristics of this sample should therefore be specified, together with the type of reliability that was measured.

There could, of course, be as many varieties of test reliability as there are conditions affecting test scores, since any such conditions might be irrelevant for a certain purpose and would thus be classified as error variance.¹ The types of reliability computed in actual practice, however, are relatively few. In this chapter, the principal techniques for measuring the reliability of test scores will be examined, together with the sources of error variance identified by each. Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient. Accordingly, the next section will consider some of the basic characteristics of correlation coefficients, in order to clarify their use and interpretation. More technical discussion of correlation, as well as more detailed specifications of computing procedures, can be found in any elementary textbook of educational or psychological statistics, such as Runyon and Haber (1991) or D. Howell (1997).

¹This approach to score reliability has sometimes been called a generalizability theory of reliability (see Brennan, 1994; Crick & Brennan, 1982; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Feldt & Brennan, 1989; Shavelson & Webb, 1991). This is not a sufficiently specific designation, however, because generalizability applies to all aspects of a test score and, in fact, to all scientific data. A more precise description of this reliability procedure is based on its identification of variance components as relevant or irrelevant.

THE CORRELATION COEFFICIENT

Meaning of Correlation. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.
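The variance-partition idea introduced above, that a reliability coefficient gives the proportion of total score variance that is "true" variance, with the remainder treated as error variance, can be put in a minimal sketch. The value .85 is a hypothetical reliability coefficient, not one taken from the text:

```python
# Minimal sketch: a reliability coefficient partitions total score variance
# into a "true" proportion and an error proportion. The input value is
# purely illustrative.

def variance_partition(r_tt):
    """Return (true_variance_proportion, error_variance_proportion)."""
    if not 0.0 <= r_tt <= 1.0:
        raise ValueError("a reliability coefficient lies between 0 and 1")
    return r_tt, 1.0 - r_tt

true_prop, error_prop = variance_partition(0.85)
print(true_prop, round(error_prop, 2))  # .85 of the variance is "true", .15 is error
```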
A hypothetical illustration of a perfect positive correlation is shown in Figure 4-1. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one person in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along the diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each person occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.

Figure 4-1. Bivariate Distribution for a Hypothetical Correlation of +1.00.

Figure 4-2 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all persons fall on the diagonal extending from the upper left- to the lower right-hand corner. This diagonal runs in the reverse direction from that in Figure 4-1.

Figure 4-2. Bivariate Distribution for a Hypothetical Correlation of -1.00.

A zero correlation indicates complete absence of relationship, such as might occur by chance. If each person's name were pulled at random out of a box to determine her or his position in variable 1, and if the process were repeated for variable 2, a zero or near-zero correlation would result. Under these conditions, it would be impossible to predict an individual's relative standing in variable 2 from a knowledge of her or his score in variable 1. The top-scoring person in variable 1 might score high, low, or average in variable 2. By chance, some persons might score above average in both variables, or below average in both; others might fall above average in one variable and below in the other; still others might be above the average in one and at the average in the second; and so forth. There would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these extremes, having some value higher than zero but lower than 1.00. Correlations between measures of abilities are nearly always positive, although frequently low. When a negative correlation is obtained between two such variables, it usually results from the way in which the scores are expressed. For example, if each person's score on an arithmetic computation test is recorded as the number of minutes required to complete all items, while the score on an arithmetic reasoning test represents the number of problems correctly solved, a negative correlation can be expected when time scores are correlated with amount scores. In such a case, the poorest (i.e., slowest) individual will have the numerically highest score on the first test, while the best individual will have the highest score on the second.
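The extreme cases just described can be illustrated numerically. The short sketch below uses the standard product-moment formula; the score lists are hypothetical:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

scores = [1, 2, 3, 4, 5]
print(pearson_r(scores, scores))           # +1.0: same relative position in both variables
print(pearson_r(scores, scores[::-1]))     # -1.0: complete reversal of scores
print(pearson_r(scores, [3, 1, 4, 1, 5]))  # an intermediate value, as found in practice
```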
The Pearson Product-Moment Correlation Coefficient. Correlation coefficients may be computed in various ways, depending on the nature of the data. The most common is the Pearson Product-Moment Correlation Coefficient. This correlation coefficient takes into account not only the person's position in the group but also the amount of her or his deviation above or below the group mean. It will be recalled that when each person's standing is expressed in terms of standard scores, persons falling above the average receive positive standard scores, while those below the average receive negative scores. Thus, an individual who is superior in both variables to be correlated would have two positive standard scores; one inferior in both would have two negative standard scores. If, now, we multiply each individual's standard score in variable 1 by her or his standard score in variable 2, all of these products will be positive, provided that each person falls on the same side of the mean on both variables. The Pearson correlation coefficient is simply the mean of these products. It will have a high positive value when corresponding standard scores are of equal sign and of approximately equal amount in the two variables. When persons above the average in one variable are below the average in the other, the corresponding cross-products will be negative. If the sum of the cross-products is negative, the correlation will be negative. When some products are positive and some negative, the correlation will be close to zero.

In actual practice, it is not necessary to convert each raw score to a standard score before finding the cross-products, since this conversion can be made once for all after the cross-products have been added. There are many shortcuts for computing the Pearson correlation coefficient. The method demonstrated in Table 4-1 is not the quickest, but it illustrates the meaning of the correlation coefficient more clearly than other methods that utilize computational shortcuts. Table 4-1 shows the computation of a Pearson r between the mathematics and reading scores of 10 children. Next to each child's name are her or his scores in the mathematics test (X) and the reading test (Y). The sums and means of the 10 scores are given under the respective columns. The third column shows the deviation (x) of each mathematics score from the mathematics mean, and the fourth column shows the deviation (y) of each reading score from the reading mean. These deviations are squared in the next two columns, and the sums of the squares are used in computing the standard deviations of the mathematics and reading scores by the method described in chapter 3. Rather than dividing each x and y by its corresponding SD to find standard scores, we perform this division only once at the end, as shown in the correlation formula in Table 4-1. The cross-products in the last column (xy) have been found by multiplying the corresponding deviations in the x and y columns. To compute the correlation (r), the sum of these cross-products is divided by the number of cases (N) and by the product of the two standard deviations (SDx·SDy).

Table 4-1. Computation of Pearson Product-Moment Correlation Coefficient.

Statistical Significance. The correlation of .40 found in Table 4-1 indicates a moderate degree of positive relationship between the mathematics and reading scores. There is some tendency for those children doing well in mathematics also to perform well on the reading test, and vice versa, although the relation is not close. If we are concerned only with the performance of these 10 children, we can accept this correlation as an adequate description of the degree of relation existing between the two variables in this group. In psychological research, however, we are usually interested in generalizing beyond the particular sample of individuals tested to the larger population that they represent. For example, we might want to know whether mathematics and reading ability are correlated among American schoolchildren of the same age as those we tested. Obviously, the 10 cases actually examined would constitute a very inadequate sample of such a population. Another comparable sample of the same size might yield a much lower or a much higher correlation. There are statistical procedures for estimating the probable fluctuation to be expected from sample to sample in the size of correlations, means, standard deviations, and any other group measures.

The question usually asked about correlations, however, is simply whether the correlation is significantly greater than zero. In other words, if the correlation in the population is zero, could a correlation as high as that obtained in our sample have resulted from sampling error alone? When we say that a correlation is "significant at the 1% (.01) level," we mean the chances are no greater than one out of 100 that the population correlation is zero. Hence, we conclude that the two variables are truly correlated.
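The Table 4-1 procedure can be sketched in code. The ten score pairs below are illustrative stand-ins, not the actual Table 4-1 data, and the standard deviations are computed from the summed squared deviations as described above:

```python
import math

# Deviation-score method of Table 4-1: r = Σxy / (N · SDx · SDy).
# Hypothetical mathematics (X) and reading (Y) scores for 10 children.
math_scores = [41, 38, 48, 32, 34, 36, 41, 43, 47, 40]
read_scores = [17, 28, 22, 16, 18, 15, 24, 20, 23, 27]

N = len(math_scores)
mean_x = sum(math_scores) / N
mean_y = sum(read_scores) / N

x_dev = [x - mean_x for x in math_scores]          # third column: x = X - mean(X)
y_dev = [y - mean_y for y in read_scores]          # fourth column: y = Y - mean(Y)

sd_x = math.sqrt(sum(d * d for d in x_dev) / N)    # SD from the summed squares
sd_y = math.sqrt(sum(d * d for d in y_dev) / N)

sum_xy = sum(x * y for x, y in zip(x_dev, y_dev))  # last column: cross-products

r = sum_xy / (N * sd_x * sd_y)
print(round(r, 2))  # 0.4 for these illustrative data: a moderate positive correlation
```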
Significance levels refer to the risk of error we are willing to take in drawing conclusions from our data. If a correlation is said to be significant at the .05 level, the probability of error is 5 out of 100. Most psychological research applies either the .01 or the .05 levels, although other significance levels may be employed for special reasons. The minimum correlations significant at the .01 and .05 levels for groups of different sizes can be found by consulting tables of the significance of correlations in any statistics textbook. For interpretive purposes in this book, only an understanding of the general concept is required.

The correlation of .40 found in Table 4-1 fails to reach significance even at the .05 level. As might have been anticipated, with only 10 cases it is difficult to establish a general relationship conclusively. With this size of sample, the smallest correlation significant at the .05 level is .63. Any correlation below that value simply leaves unanswered the question of whether the two variables are correlated in the population from which the sample was drawn.

For many years, significance levels have been the traditional way of evaluating correlations. Nevertheless, there is now a growing awareness of the inadequacy and defects of this procedure. Especially when obtained on a small sample, even a high correlation fails to meet the "significance test." To show that a reliability coefficient (or any correlation) is significantly greater than zero provides little knowledge for either theoretical or practical purposes. A substitute approach that has been receiving increasing attention considers the actual magnitude of the obtained correlation and estimates a confidence interval within which the population correlation is likely to fall at a specified confidence level (see, e.g., Carver, 1993; J. Cohen, 1994; Hunter & Schmidt, 1990; Olkin & Finn, 1995; F. Schmidt, 1996; W. W. Tryon, 1996). This trend toward confidence intervals, as a supplement to if not a substitute for significance testing, foreshadows an important shift in the analysis of correlation coefficients in the years ahead.

The Reliability Coefficient. Correlation coefficients have many uses in the analysis of psychometric data. The measurement of test reliability represents one application of such coefficients. An example of a reliability coefficient, computed by the Pearson Product-Moment method, is given in Figure 4-3. In this case, the scores of 104 persons on two equivalent forms of a Word Fluency test² were correlated. In one form, the test takers were given five minutes to write as many words as they could that began with a given letter. The second form was identical, except that a different letter was employed. The two letters were chosen by the test authors as being approximately equal in difficulty for this purpose.

²One of the subtests of the SRA Tests of Primary Mental Abilities for Ages 11 to 17. The data were obtained in an investigation by Anastasi and Drake (1954).

The correlation between the number of words written in the two forms of this test was found to be .72. This correlation is high and significant at the .01 level. With 104 cases, any correlation of .25 or higher is significant at this level. Nevertheless, the obtained correlation is somewhat lower than is desirable for reliability coefficients, which usually fall in the .80s or .90s. An examination of the scatter diagram in Figure 4-3 shows a typical bivariate distribution of scores corresponding to a high positive correlation. It will be noted that the tallies cluster close to the diagonal extending from the lower left- to the upper right-hand corner; the trend is definitely in this direction, although there is a certain amount of scatter of individual entries.

Figure 4-3. A Reliability Coefficient of .72. (Data from Anastasi & Drake, 1954.)

In the following section, the use of the correlation coefficient in computing different measures of test reliability will be considered.

TYPES OF RELIABILITY

Test-Retest Reliability. The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion.
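Both approaches discussed above, the significance test and the confidence-interval alternative, can be sketched with the Fisher r-to-z transformation. The text does not prescribe a particular computational procedure, so the normal approximation used here is an assumption, though a standard one; the critical values 1.960 and 2.576 correspond to the two-tailed .05 and .01 levels:

```python
import math

def fisher_z_test(r, n):
    """Test statistic for H0: population correlation is zero (normal approximation)."""
    return math.atanh(r) * math.sqrt(n - 3)

def confidence_interval(r, n, z_crit=1.960):
    """Approximate confidence interval for the population correlation."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# r = .40 with N = 10 does not reach the .05 level ...
print(fisher_z_test(0.40, 10) > 1.960)   # False
# ... while r = .72 with N = 104 is significant even at the .01 level.
print(fisher_z_test(0.72, 104) > 2.576)  # True
# The confidence interval conveys magnitude, not just significance:
print(confidence_interval(0.72, 104))
```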
The retest reliability coefficient (rtt) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test. The error variance corresponds to the random fluctuations of performance from one test session to the other. These variations may result in part from uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other distractions, or a broken pencil point. To some extent, however, they arise from changes in the condition of the test takers themselves, as illustrated by illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant nature, and the like. Retest reliability shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the less susceptible the scores are to the random daily changes in the condition of the test takers or of the testing environment.

When retest reliability is reported in a test manual, the interval over which it was measured should always be specified. Since retest correlations decrease progressively as this interval lengthens, there is not one but an infinite number of retest reliability coefficients for any test. It is also desirable to give some indication of relevant intervening experiences of the persons on whom reliability was measured, such as educational or job experiences, counseling, or psychotherapy.

Apart from the desirability of reporting length of interval, what considerations should guide the choice of interval? Again we must fall back on an analysis of the purposes of the test and on a thorough understanding of the behavior the test is designed to predict. Illustrations could readily be cited of tests showing high reliability over periods of a few days or weeks, but whose scores reveal an almost complete lack of correspondence when the interval is extended to as long as ten or fifteen years. Many preschool intelligence tests, for example, yield moderately stable measures within the preschool period but are virtually useless as predictors of late childhood or adult IQs. In actual practice, however, a simple distinction can usually be made. Short-range, random fluctuations that occur during intervals ranging from a few hours to a few months are generally included under the error variance of the test score. Thus, in checking this type of test reliability, an effort is made to keep the interval short. In testing young children, the period should be even shorter than for older persons, since at early ages progressive developmental changes are discernible over a period of a month or even less. For any type of person, the interval between retests should rarely exceed six months.

Any additional changes in the relative test performance of individuals that occur over longer periods of time are apt to be cumulative and progressive rather than entirely random. Moreover, they are likely to characterize a broader area of behavior than that covered by the test performance itself. Thus, one's general level of scholastic aptitude, mechanical comprehension, or artistic judgment may have altered appreciably over a ten-year period, owing to unusual intervening experiences. The individual's status may have either risen or dropped appreciably in relation to others of her or his own age, because of circumstances peculiar to the individual's own home, school, or community environment, or for other reasons such as illness or emotional disturbance.

The extent to which such factors can affect an individual's psychological development provides an important problem for investigation. This question, however, should not be confused with that of the reliability of a particular test. When we measure the reliability of the Stanford-Binet, for example, we do not ordinarily correlate retest scores over a period of ten years, or even one year, but over a few weeks. To be sure, long-range retests have been conducted with such tests, but the results are generally discussed in terms of the predictability of adult intelligence from childhood performance, rather than in terms of the reliability of a particular test. The concept of reliability is generally restricted to short-range, random changes that characterize the test performance itself rather than the entire behavior domain that is being tested.

It should be noted that different behavior functions may themselves vary in the extent of daily fluctuation they exhibit. For example, steadiness of delicate finger movements is undoubtedly more susceptible to slight changes in the person's condition than is verbal comprehension. If we wish to obtain an overall estimate of the individual's habitual finger steadiness, we would probably require repeated tests on several days, whereas a single test session would suffice for verbal comprehension.

Although apparently simple and straightforward, the test-retest technique presents difficulties when applied to most psychological tests. Practice will probably produce varying amounts of improvement in the retest scores of different individuals. Moreover, if the interval between retests is fairly short, the test takers may recall many of their former responses. In other words, the same pattern of right and wrong responses is likely to recur through sheer memory. Thus, the scores on the two administrations of the test are not independently obtained, and the correlation between them will be spuriously high. The nature of the test itself may also change with repetition. This is especially true of problems involving reasoning or ingenuity. Once the test taker has grasped the principle involved in the problem or has worked out a solution, he or she can reproduce the correct response in the future without going through the intervening steps. Only tests that are not appreciably affected by repetition lend themselves to the retest technique; a number of sensory discrimination and motor tests would fall into this category. For the large majority of psychological tests, however, retesting with the identical test is not an appropriate technique for finding a reliability coefficient.

Alternate-Form Reliability. One way of avoiding the difficulties encountered in test-retest reliability is through the use of alternate forms of the test. The same persons can thus be tested with one form on the first occasion and with another, equivalent form on the second. The correlation between the scores obtained on the two forms represents the reliability coefficient of the test. It will be noted that such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms). This coefficient thus combines two types of reliability. Since both types are important for most testing purposes, however, alternate-form reliability provides a useful measure for evaluating many tests.
As in the case of test-retest reliability, alternate-form reliability should always be accompanied by a statement of the length of the interval between test administrations, as well as a description of relevant intervening experiences. If the two forms are administered in immediate succession, the resulting correlation shows reliability across forms only, not across occasions. The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.

The concept of item sampling, or content sampling, underlies not only alternate-form reliability but also other types of reliability to be discussed shortly. It is therefore appropriate to examine it more closely. Let us suppose that a 40-item vocabulary test has been constructed as a measure of general verbal comprehension. To what extent do scores on this test depend on factors specific to the particular selection of items? If a different investigator, working independently, were to prepare another test in accordance with the same specifications, how much would an individual's score differ on the two tests? Suppose that a second list of 40 different words is assembled for the same purpose, and that the items are constructed with equal care to cover the same range of difficulty as the first test. The differences in the scores obtained by the same individuals on these two tests illustrate the type of error variance under consideration. Owing to fortuitous factors in the past experience of different individuals, the relative difficulty of the two lists will vary somewhat from person to person. Thus, the first list might contain a larger number of words unfamiliar to individual A than does the second list. The second list, on the other hand, might contain a disproportionately large number of words unfamiliar to individual B. If the two individuals are approximately equal in their overall word knowledge (i.e., in their "true scores"), B will nevertheless excel A on the first list, while A will excel B on the second. The relative standing of these two persons will therefore be reversed on the two lists, owing to chance differences in the selection of items.

Most students have probably had the experience of taking a course examination in which they felt they had a "lucky break" because many of the items covered the very topics they happened to have studied most carefully. On another occasion, they may have had the opposite experience, finding an unusually large number of items on areas they had failed to review. This familiar situation illustrates error variance resulting from content sampling.

In the development of alternate forms, care should, of course, be exercised to ensure that they are truly parallel. Fundamentally, parallel forms of a test should be independently constructed tests designed to meet the same specifications. The tests should contain the same number of items, and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for equivalence.

It should be added that the availability of parallel test forms is desirable for other reasons besides the determination of test reliability. Alternate forms are useful in follow-up studies or in investigations of the effects of some intervening experimental factor on test performance. The use of several alternate forms also provides a means of reducing the possibility of coaching or cheating.

Like test-retest reliability, alternate-form reliability also has certain limitations. In the first place, if the behavior functions under consideration are subject to a large practice effect, the use of alternate forms will reduce but not eliminate such an effect. To be sure, if all test takers were to show the same improvement with repetition, the correlation between their scores would remain unaffected, since adding a constant amount to each score does not alter the correlation coefficient. It is much more likely, however, that individuals will differ in amount of improvement, owing to extent of previous practice with similar material, motivation in taking the test, and other factors. Under these conditions, the practice effect represents another source of variance that will tend to reduce the correlation between the two test forms. If the practice effect is small, the reduction will be negligible.

Another related question concerns the degree to which the nature of the test will change with repetition. In certain types of ingenuity problems, for example, any item involving the same principle can be readily solved by most persons once they have worked out the solution to the first. In such a case, changing the specific content of the items in the second form would not suffice to eliminate this carry-over from the first form. Finally, it should be added that alternate forms are unavailable for many tests, because of the practical difficulties of constructing truly equivalent forms. For all these reasons, other techniques for estimating test reliability are often required.

Split-Half Reliability. From a single administration of one form of a test it is possible to arrive at a measure of reliability by various split-half procedures. In such a way, two scores are obtained for each person by dividing the test into equivalent halves. It is apparent that split-half reliability provides a measure of consistency with regard to content sampling. Temporal stability of the scores does not enter into such reliability, because only one test session is involved. This type of reliability coefficient is sometimes called a coefficient of internal consistency, since only a single administration of a single form is required.

In finding split-half reliability, the first problem is how to split the test in order to obtain the most nearly equivalent halves. Any test can be divided in many different ways. In most tests, the first half and the second half would not be equivalent, owing to differences in nature and difficulty level of items, as well as to the cumulative effects of warming up, practice, fatigue, boredom, and any other factors varying progressively from the beginning to the end of the test. A procedure that is adequate for most purposes is to find the scores on the odd and even items of the test. If the items were originally arranged in an approximate order of difficulty, such a division yields very nearly equivalent half-scores. One precaution to be observed in making such an odd-even split pertains to groups of items dealing with a single problem, such as questions referring to a particular mechanical diagram or to a given passage in a reading test. In this case, a whole group of items should be assigned intact to one or the other half. Were the items in such a group to be placed in different halves of the test, the similarity of the half-scores would be spuriously inflated, since any single error in understanding the problem might affect items in both halves.
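The odd-even procedure can be sketched as follows. The response matrix is hypothetical, with one row per person and one column per item scored right (1) or wrong (0):

```python
import math

# Hypothetical responses of six persons to a 10-item test.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
]

# Items 1, 3, 5, ... (indices 0, 2, 4, ...) form one half; items 2, 4, 6, ... the other.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_hh = pearson_r(odd_half, even_half)  # reliability of a half-test only
print(round(r_hh, 2))  # 0.67 for these illustrative responses
```

The resulting r_hh characterizes only a half-test; the Spearman-Brown formula, rtt = 2rhh / (1 + rhh), estimates the reliability of the full-length test.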
more heterogeneous test. 1991). but such individual variations are slight in comparison with those found in a more heterogeneous test. A highly relevant question in this connection is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. Th~ Spearman-Brown formula is widely used in determining reliability by the split-half method.Thlssen. When this error variance is subtracted from LOO. multiplication. another may score relatively well on the division items. they may be correlated by the usual method.3 Once the two half-scores have been obtained for each person.: It is reasonable to expect that. Moreover. and considerable research has accumu lated on the statistical treatment of such integrated item groupings. Can we conclude that the performances of the two on this test were equal? Not at alL Smith may have correctly completed 10 vocabulary items. and multiplication. For example. if the number of test items is increased from 25 to 100. subtraction. it can be simplified as follows: r ::-. if one test includes only multiplication items while another comprises addition. since any single error in understanding the problem might affect items in both halves. there might be little or no relationship between an individual's performance on the different types of items. which yields the reliability of the whole test directly: • It is interesting to note the relationship of this formula to the definition of error Yrhere is now good empirical evidence to support this expectation. The variance of these differences. however. Thus. these two values are substituted in the following formula. It is apparent that test scores will be less ambiguous when derived from relatively homogeneous tests. and so on. rtt is the obtained coefficient. In both test-retest and alternate-form reliability. & Wainer. will increase only its consistency over time (see Cureton. 
The effect that lengthening or shortening a test will have on its coefficient can be estimated by means of the Spearman-Brown formula, given below:

rnn = n rtt / [1 + (n - 1) rtt]

in which rnn is the estimated coefficient, rtt is the obtained coefficient, and n is the number of times the test is lengthened or shortened. Thus, if the number of test items is increased from 25 to 100, n is 4; if it is decreased from 60 to 30, n is 1/2. When applied to split-half reliability, the formula always involves doubling the length of the test. This usage is common, many test manuals reporting reliability in this form. It should be noted, however, that the correlation between half-scores actually gives the reliability of only a half-test: if the entire test consists of 100 items, for example, the correlation is computed between two sets of scores each of which is based on only 50 items. In both test-retest and alternate-form reliability, on the other hand, each score is based on the full number of items in the test. Other things being equal, the longer a test, the more reliable it will be.4 It may also be noted that if groups of items dealing with a single problem were placed in different halves of the test, the similarity of the half-scores would be spuriously inflated.

An alternative way of evaluating a split-half correlation uses the variance of the differences between each person's scores on the two half-tests (SDd²) and the variance of total scores (SDt²). Any difference between a person's scores on the two half-tests represents irrelevant or error variance. The variance of these differences, divided by the variance of total scores, gives the proportion of error variance in the scores. When this error variance is subtracted from 1.00, the remainder gives the proportion of "true" variance for a specified test use, which is equal to the reliability coefficient.

Kuder-Richardson Reliability and Coefficient Alpha. A fourth method for finding reliability, also utilizing a single administration of a single form, is based on the consistency of responses to all items in the test. This interitem consistency is influenced by two sources of error variance: (1) content sampling (as in alternate-form and split-half reliability); and (2) heterogeneity of the behavior domain sampled. The more homogeneous the domain, the higher the interitem consistency. A test consisting of 40 vocabulary items, for instance, will show more interitem consistency than a 40-item test containing 10 vocabulary, 10 spatial relations, 10 arithmetic reasoning, and 10 perceptual speed items. In the latter test, one test taker may perform better in subtraction than in any of the other arithmetic operations, but less well in addition, and so on.

Suppose that in the highly heterogeneous 40-item test just described, Smith and Jones both obtain a score of 20. Can we conclude that the performances of the two on this test were equal? Not at all. Smith may have correctly completed 10 vocabulary and 10 perceptual speed items, and none of the arithmetic reasoning and spatial relations items. Jones may have received a score of 20 by the successful completion of 5 perceptual speed, 5 spatial relations, and 10 arithmetic reasoning items, and no vocabulary items. Many other combinations could obviously produce the same total score of 20, and this score would have a very different meaning when obtained through such dissimilar combinations of items. In the relatively homogeneous vocabulary test, on the other hand, a score of 20 would probably mean that the test taker had succeeded with approximately the first 20 words, if the items were arranged in ascending order of difficulty. He or she might have failed two or three easier words and correctly responded to two or three more difficult items beyond the 20th, but such variations are minor.

Although homogeneous tests are to be preferred because their scores permit fairly unambiguous interpretation, a single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion.

4 Lengthening a test, however, will increase only its consistency in terms of content sampling, not its stability over time (see Cureton, 1965; Cureton et al., 1973).
An alternate method for finding split-half reliability was developed by Rulon (1939); it yields the reliability of the whole test directly:

rtt = 1 - (SDd² / SDt²)

in which SDd² is the variance of the differences between each person's scores on the two half-tests and SDt² is the variance of total scores.
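Rulon's procedure can be sketched as follows (Python; the half-scores below are invented for illustration):

```python
import statistics

def rulon(half1, half2):
    """Rulon split-half reliability: 1 minus the ratio of the variance of
    half-score differences to the variance of total scores."""
    diffs  = [a - b for a, b in zip(half1, half2)]
    totals = [a + b for a, b in zip(half1, half2)]
    return 1 - statistics.pvariance(diffs) / statistics.pvariance(totals)

# Hypothetical odd- and even-half scores for six test takers:
odd  = [10, 12, 15, 9, 14, 11]
even = [11, 12, 14, 8, 15, 10]
print(round(rulon(odd, even), 3))   # 0.958
```

Because the difference variance is the error term, no Spearman-Brown step-up is needed afterward.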
The most common procedure for finding interitem consistency is that developed by Kuder and Richardson (1937). As in the split-half methods, interitem consistency is found from a single administration of a single test. Rather than requiring two half-scores, however, this technique is based on an examination of performance on each item. Of the various formulas derived in the original article, the most widely applicable, commonly known as "Kuder-Richardson formula 20," is the following:

rtt = [n/(n - 1)] × [(SDt² - Σpq)/SDt²]

In this formula, rtt is the reliability coefficient of the whole test, n is the number of items in the test, and SDt is the standard deviation of total scores on the test. The only new term, Σpq, is found by tabulating the proportion of persons who pass (p) and the proportion who do not pass (q) each item. The product of p and q is computed for each item, and these products are then added for all items, to give Σpq. Since in the process of test construction p is often routinely recorded in order to find the difficulty level of each item, this method of determining reliability involves little additional computation.

Unless the test items are highly homogeneous, the Kuder-Richardson coefficient will be lower than the split-half reliability. Hence, the difference between Kuder-Richardson and split-half reliability coefficients may serve as a rough index of the heterogeneity of a test. An extreme example will serve to highlight the difference. Suppose we construct a 50-item test out of 25 different kinds of items, such that items 1 and 2 are vocabulary items, items 3 and 4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and even halves would nevertheless be closely similar, thus yielding a high split-half reliability coefficient. (Some tests, it may be added, are deliberately constructed with a planned split designed to yield equivalent sets of items.) In such a case, however, we would expect the Kuder-Richardson reliability to be much lower than the split-half reliability, since there would be little consistency of performance among the entire set of 50 items.

The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system. Some tests, however, may have multiple-scored items. On a personality inventory, for example, the respondent may receive a different numerical score on an item, depending on which alternative he or she checks. For such tests, a generalized formula has been derived, known as coefficient alpha (Cronbach, 1951). In this formula, the value Σpq is replaced by ΣSDi², the sum of the variances of item scores. The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items. The complete formula for coefficient alpha is given as follows:

rtt = [n/(n - 1)] × [(SDt² - ΣSDi²)/SDt²]

When a test is deliberately designed to sample a broad behavior domain, it should be recalled, the heterogeneity of test items would not necessarily represent error variance. Traditional intelligence tests provide a good example of heterogeneous tests designed to predict heterogeneous criteria. In the prediction of a heterogeneous criterion, it may be desirable to construct several relatively homogeneous tests, each measuring a different phase of the heterogeneous criterion. Thus, unambiguous interpretation of test scores could be combined with adequate criterion coverage.

It should now be apparent that the different types of reliability vary in the factors they subsume under error variance. In one case, error variance covers temporal fluctuations; in another, it refers to differences between sets of parallel items; and in still another, it includes any interitem inconsistency. On the other hand, the factors excluded from measures of error variance are broadly of two types: (a) those factors whose variance should remain in the scores, since they are part of the true differences under consideration; and (b) those irrelevant factors that can be experimentally controlled. For example, it is not customary to report the error of measurement resulting when a test is administered under distracting conditions or with a longer or shorter time limit than that specified in the manual. Timing errors and serious distractions can be empirically eliminated from the testing situation; hence, it is not necessary to report special reliability coefficients corresponding to "distraction variance" or "timing variance." We need only to make certain that the prescribed procedures are carefully followed and adequately checked.

Scorer Reliability. One source of error variance that can be checked quite simply is scorer variance. Most tests provide such highly standardized procedures for administration and scoring that error variance attributable to these factors is negligible. Certain types of tests, however (notably tests of creativity and projective tests of personality), leave a good deal to the judgment of the scorer. With such instruments, there is as much need for a measure of scorer reliability as there is for the more usual reliability coefficients. Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each test taker are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability. With clinical instruments employed in intensive individual examinations, there is evidence of considerable examiner variance. Through special experimental designs, it is possible to separate this variance from that attributable to temporal fluctuations in the test taker's condition or to the use of alternate test forms.
It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test (Cronbach, 1951).5 Thus, in the 50-item example described above, the odd and even scores could theoretically agree quite closely even though the Kuder-Richardson coefficient is low. On a personality inventory, the respondent's numerical score on an item may depend on whether he or she checks "usually," "sometimes," "rarely," or "never"; for such multiple-scored items, coefficient alpha rather than Kuder-Richardson formula 20 is the appropriate measure of interitem consistency (Kaiser & Michael, 1975; Novick & Lewis, 1967). The negligible scorer variance of highly standardized instruments is particularly characteristic of group tests designed for mass testing and computer scoring.

5 This is strictly true only when the split-half coefficients are found by the Rulon formula (based on the variance of the differences between the two half-scores), not when they are found by correlation of halves and the Spearman-Brown formula (Novick & Lewis, 1967).
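Both formulas can be sketched directly from their definitions (Python; the function names and the small 0/1 response matrix are hypothetical). For right/wrong items the variance of an item's scores equals p(1 - p), so the two coefficients agree, as the final lines illustrate:

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson formula 20 for right/wrong (0/1) item scores.
    `responses` is a list of per-person lists, one 0/1 entry per item."""
    n = len(responses[0])                       # number of items
    totals = [sum(person) for person in responses]
    sd_t_sq = pvariance(totals)                 # variance of total scores
    pq = 0.0
    for i in range(n):
        p = sum(person[i] for person in responses) / len(responses)
        pq += p * (1 - p)
    return (n / (n - 1)) * (sd_t_sq - pq) / sd_t_sq

def coefficient_alpha(responses):
    """Coefficient alpha: sum(pq) is replaced by the sum of item-score
    variances, so items need not be scored 0/1."""
    n = len(responses[0])
    totals = [sum(person) for person in responses]
    sd_t_sq = pvariance(totals)
    item_var = sum(pvariance([person[i] for person in responses])
                   for i in range(n))
    return (n / (n - 1)) * (sd_t_sq - item_var) / sd_t_sq

items = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
print(round(kr20(items), 3))               # 0.8
print(round(coefficient_alpha(items), 3))  # 0.8, identical for 0/1 items
```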
This type of reliability is commonly computed when subjectively scored instruments are employed in research. Test manuals should also report it when appropriate.

Overview. The different types of reliability coefficients discussed in this section are summarized in Tables 4-2 and 4-3. In Table 4-2, the operations followed in obtaining each type of reliability are classified with regard to the number of test forms and the number of testing sessions required. Table 4-3 shows the sources of variance treated as error variance by each procedure.

[Table 4-2: Techniques for Measuring Reliability, in Relation to Test Form and Testing Sessions Required (one or two test forms; one or two testing sessions)]

[Table 4-3: Sources of Error Variance in Relation to Reliability Coefficients]

Any reliability coefficient may be interpreted directly in terms of the percentage of score variance attributable to different sources. Thus, a reliability coefficient of .85 signifies that 85% of the variance in test scores depends on true variance in the trait measured, and 15% depends on error variance (as operationally defined by the specific procedure followed). The statistically sophisticated reader may recall that it is the square of a correlation coefficient that represents proportion of common variance. Actually, the proportion of true variance in test scores is the square of the correlation between scores on a single form of the test and true scores, which are free from chance errors. This correlation, known as the index of reliability,6 is equal to the square root of the reliability coefficient (√rtt). When the index of reliability is itself squared, the result is the original reliability coefficient (rtt), which can therefore be interpreted directly as the percentage of true variance for a designated test use.

Experimental designs that yield more than one type of reliability coefficient for the same group permit the analysis of total score variance into different components. Let us consider the following hypothetical example. Forms A and B of a creativity test have been administered with a two-month interval to 100 sixth-grade children. The resulting alternate-form reliability is .70. From the responses on either form, a split-half reliability coefficient can also be computed;7 stepped up by the Spearman-Brown formula, this coefficient is .80. Finally, a second scorer has rescored a random sample of 50 papers; the correlation between the two scorers' sets of scores is a measure of scorer reliability, from which a scorer reliability of .92 is obtained.

The three reliability coefficients can now be analyzed to yield the error variances shown in Table 4-4 and Figure 4-4. It will be noted that by subtracting the error variance attributable to content sampling alone (split-half reliability) from the error variance attributable to both content and time sampling (alternate-form reliability), we find that .10 of the variance can be attributed to time sampling alone. Adding the error variances attributable to content sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total error variance of .38 and hence a true variance of .62. These proportions, expressed in the more familiar percentage terms, are shown graphically in Figure 4-4.

[Table 4-4: Analysis of Sources of Error Variance in a Hypothetical Test: content sampling, .20; time sampling, .10; interscorer difference, .08; total error variance, .38; true variance, .62]

6 Derivations of the index of reliability, based on two different sets of assumptions, are given by Gulliksen (1950, chaps. 2 and 3).

7 For a better estimate of the coefficient of internal consistency, split-half correlations could be computed for each form and the two coefficients averaged by the appropriate statistical procedures (e.g., Fisher z-transformation).
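The variance breakdown in this hypothetical example can be reproduced arithmetically; the sketch below (plain Python; the variable names are mine) follows the subtraction-and-addition steps just described:

```python
# Error-variance breakdown for the hypothetical creativity test.
alternate_form = 0.70   # content + time sampling
split_half     = 0.80   # content sampling only (Spearman-Brown stepped up)
scorer         = 0.92   # interscorer agreement

content_error = 1 - split_half                        # 0.20
time_error    = (1 - alternate_form) - content_error  # 0.10
scorer_error  = 1 - scorer                            # 0.08

total_error   = content_error + time_error + scorer_error
true_variance = 1 - total_error
print(f"total error variance: {total_error:.2f}")    # 0.38
print(f"true variance:        {true_variance:.2f}")  # 0.62
```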
This sorting out of sources of variance is the essence of the so-called generalizability theory of reliability. Complex experimental designs permitting simultaneous assessment of more sources of score variance, and of the interactions among them, can be found in detailed topical treatments, such as Brennan (1984), Cronbach et al. (1972), Feldt and Brennan (1989), and Shavelson and Webb (1991).

[Figure 4-4: Percentage Distribution of Score Variance in a Hypothetical Test]

RELIABILITY OF SPEEDED TESTS

Both in test construction and in the interpretation of test scores, an important distinction is that between the measurement of speed and of power. A pure speed test is one in which individual differences depend entirely on speed of performance. Such a test is constructed from items of uniformly low difficulty, all of which are well within the ability level of the persons for whom the test is designed. The time limit is made so short that no one can finish all the items. Under these conditions, each person's score reflects only the speed with which he or she worked. A pure power test, on the other hand, has a time limit long enough to permit everyone to attempt all items. The difficulty of the items is steeply graded, and the test includes some items too difficult for anyone to solve, so that no one can get a perfect score.

It will be noted that both speed and power tests are designed to prevent the achievement of perfect scores. The reason for such a precaution is that perfect scores are indeterminate, since it is impossible to know how much higher the individual's score would have been if more items, or more difficult items, had been included. To enable each individual to show fully what he or she is able to accomplish, the test must provide adequate ceiling, either in number of items or in difficulty level. An exception to this rule is found in mastery testing, as illustrated by the domain-referenced tests discussed in chapter 3. The purpose of such testing is not to establish the limits of what the individual can do but to determine whether a preestablished performance level has or has not been reached.

In actual practice, the distinction between speed and power tests is one of degree, most tests depending on both power and speed in varying proportions. Information about these proportions is needed for each test, not only to understand what the test measures but also to choose the proper procedures for evaluating its reliability. To the extent that individual differences in test scores depend on speed of performance, the measure of reliability must obviously be based on consistency in speed of work. An examination of the procedures followed in finding both split-half and Kuder-Richardson reliability will show that both are based on the consistency in number of errors made by the examinee. If, however, individual differences in test scores depend on speed rather than on errors, reliability coefficients found by these methods will be spuriously high.

An extreme example will help to clarify this point. Let us suppose that a 50-item test depends entirely on speed, so that individual differences in score are based wholly on number of items attempted, rather than on errors. Then, if individual A obtains a score of 44, he or she will obviously have 22 correct odd items and 22 correct even items. Similarly, individual B, with a score of 34, will have odd and even scores of 17 and 17, respectively. Consequently, except for accidental careless errors on a few items, the correlation between odd and even scores would be perfect, or +1.00. Such a correlation, however, is entirely spurious and provides no information about the reliability of the test. If the test is not a pure speed test, the single-trial reliability coefficient will fall below 1.00, but it will still be spuriously high. As long as individual differences in test scores are appreciably affected by speed, single-trial reliability coefficients, such as those found by odd-even or Kuder-Richardson techniques, cannot be properly interpreted and are inapplicable to speeded tests.

What alternative procedures are available to determine the reliability of significantly speeded tests? If the test-retest technique is applicable, it would be appropriate. Similarly, equivalent-form reliability may be properly employed with speed tests. Split-half techniques may also be used, provided that the split is made in terms of time rather than in terms of items; in other words, the half-scores must be based on separately timed parts of the test. One way of effecting such a split is to administer two equivalent halves of the test with separate time limits. For example, the odd and even items may be separately printed on different pages, and each set of items given with one-half the time limit of the entire test. Such a procedure is tantamount to administering two equivalent forms of the test in immediate succession. Each form, however, is half as long as the test proper, while the test taker's scores are normally based on the whole test; hence, either the Spearman-Brown or some other appropriate formula should be used to find the reliability of the whole test. If it is not feasible to administer the two half-tests separately, an alternative procedure is to divide the total time into quarters and to find a score for each of the four quarters.
The number of items correctly completed within the first and fourth quarters can then be combined to represent one half-score, while those in the second and third quarters can be combined to yield the other half-score. Such a combination of quarters tends to balance out the cumulative effects of practice, fatigue, and other factors. This method is especially satisfactory when the items are not steeply graded in difficulty level. The quarter scores can easily be obtained by having the test takers mark the item on which they are working whenever the examiner gives a prearranged signal.

When is a test appreciably speeded? Under what conditions must the special precautions discussed in this section be observed? Obviously, the mere employment of a time limit does not signify a speed test. If all test takers finish within the given time limit, speed of work plays no part in determining the scores. Percentage of persons who fail to complete the test might be taken as a crude index of speed versus power, but even when no one finishes the test, the role of speed may be negligible: if everyone completes exactly 40 items of a 50-item test, there are no individual differences in number of items completed (SD = 0), although no one had time to attempt all the items. The essential question, of course, is: "To what extent are individual differences in test scores attributable to speed?" In more technical terms, we want to know what proportion of the total variance in test scores is speed variance. This proportion can be estimated roughly by finding the variance of number of items completed by different persons and dividing it by the variance of total test scores (SDc²/SDt²). In a pure power test, this index would equal zero, since the numerator of the fraction is zero. If, on the other hand, the total test variance is attributable to individual differences in speed, as in a pure speed test, the two variances will be equal and the ratio will be 1.00. In the example cited above, in which every individual finishes 40 items, individual differences with regard to speed are entirely absent. Several more refined procedures have been developed for determining this proportion, but their detailed consideration falls beyond the scope of this book.

An example of the effect of speed on single-trial reliability coefficients is provided by data collected in an investigation of the first edition of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (Anastasi & Drake, 1954). In this study, the reliability of each test was first determined by the usual odd-even procedure; these coefficients are given in the first row of Table 4-5. Reliability coefficients were then computed by correlating scores on separately timed halves; these coefficients are shown in the second row of Table 4-5. Calculation of speed indexes showed that the Verbal Meaning test was primarily a power test, while the Reasoning test was somewhat more dependent on speed. The Space and Number tests proved to be highly speeded. It will be noted in Table 4-5 that, when properly computed from separately timed halves, the reliability of the Space test is .75, in contrast to a spuriously high odd-even coefficient of .90. Similarly, the reliability of the Reasoning test drops from .96 to .87, and that of the Number test drops from .92 to .83. The reliability of the relatively unspeeded Verbal Meaning test, on the other hand, shows a negligible difference when computed by the two methods.

[Table 4-5: Reliability Coefficients of Four of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (1st Edition), found by single-trial odd-even correlation and by separately timed halves, for the Verbal Meaning, Reasoning, Space, and Number tests. Data from Anastasi & Drake, 1954.]

DEPENDENCE OF RELIABILITY COEFFICIENTS ON THE SAMPLE TESTED

Variability. An important condition influencing the size of a reliability coefficient is the nature of the group on which reliability is measured. In the first place, any correlation coefficient is affected by the range of individual differences in the group. If every member of a group were nearly alike in spelling ability, then the correlation of spelling with any other ability would be close to zero in that group. It would obviously be impossible, within such a group, to predict an individual's standing in any other ability from a knowledge of her or his spelling score.

Another, less extreme, example is provided by the correlation between two aptitude tests, such as a verbal comprehension and an arithmetic reasoning test. If these tests were administered to a highly homogeneous sample, such as a group of 300 college sophomores, the correlation between the two would probably be very low; within such a selected sample of college students, there is little relationship between any individual's verbal ability and her or his numerical reasoning ability. On the other hand, were the tests to be given to a heterogeneous sample of 300 persons, ranging from mentally retarded persons to college graduates, a high correlation would undoubtedly be obtained between the two tests: the mentally retarded examinees would obtain poorer scores than the college graduates on both tests, and similar relationships would hold for other subgroups within this highly heterogeneous sample.

Examination of the hypothetical scatter diagram given in Figure 4-5 will further illustrate the dependence of correlation coefficients on the variability, or extent of individual differences, within the group. This scatter diagram shows a high positive correlation in the entire, heterogeneous group, since the entries are closely clustered about the diagonal extending from the lower left- to the upper right-hand corners. If, now, we consider only the subgroup falling within the small rectangle in the upper right-hand portion of the diagram, it is apparent that the correlation between the two variables is close to zero. Individuals falling within this restricted range in both variables represent a highly homogeneous group, as did the college sophomores mentioned previously.
[Figure 4-5: The Effect of Restricted Range upon a Correlation Coefficient]

Like all correlation coefficients, reliability coefficients depend on the variability of the sample within which they are found. It is therefore apparent that every reliability coefficient should be accompanied by a full description of the type of group on which it was determined. Special attention should be given to the variability and the ability level of the sample. The reported reliability coefficient is applicable only to samples similar to that on which it was computed. A desirable and growing practice in test construction is to fractionate the standardization sample into more homogeneous subgroups, with regard to age, sex, grade level, occupation, and the like, and to report separate reliability coefficients for each subgroup. Under these conditions, the reliability coefficients are more likely to be applicable to the samples with which the test is to be used in actual practice.

When a test is to be used to discriminate individual differences within a more homogeneous sample than the standardization group, the reliability coefficient should be redetermined on such a sample. Formulas for estimating the reliability coefficient to be expected when the standard deviation of the group is increased or decreased are available in elementary statistics textbooks. It is preferable, however, to recompute the reliability coefficient empirically on a group comparable to that with which the test is to be used. For example, if the reliability coefficient reported in a test manual was calculated for a group ranging from fourth-grade children to high school students, it cannot be assumed that the reliability would be equally high within, let us say, an eighth-grade sample.

Ability Level. Not only does the reliability coefficient vary with the extent of individual differences in the sample, but it may also vary between groups differing in average ability level. Such differences in the reliability of a single test may arise in part from the fact that a slightly different combination of abilities is measured at different difficulty levels of the test. In tests designed to cover a wide range of age or ability, the upper and lower extremes may not provide enough items at the appropriate difficulty level to permit individuals to demonstrate adequately what they are able to do (ceiling or floor effects). Or the length of the test may vary at different age levels. Even when the number of available items is the same, reliability may be relatively low for the younger and less able groups, since their scores are unduly influenced by guessing. Such differences cannot usually be predicted or estimated by any statistical formula but can be discovered only by empirical tryout of the test on groups differing in age or ability level. Hence, for such tests, the test manual should report separate reliability coefficients for relatively homogeneous subgroups within the standardization sample.

STANDARD ERROR OF MEASUREMENT

The reliability of a test may be expressed in terms of the standard error of measurement (SEM), also called the standard error of a score. This measure is particularly well suited to the interpretation of individual scores; for many testing purposes, it is therefore more useful than the reliability coefficient. The standard error of measurement can be easily computed from the reliability coefficient of the test, by the following formula:

SEM = SDt √(1 - rtt)

in which SDt is the standard deviation of the test scores and rtt is the reliability coefficient, both computed on the same group. For example, if deviation IQs on a particular intelligence test have a standard deviation of 15 and a reliability coefficient of .89, the SEM of an IQ on this test is 15√(1 - .89) = 15√.11 = 15(.33) ≈ 5.
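As a quick arithmetic check on this example (a minimal Python sketch; the function name is mine):

```python
import math

def sem(sd, r):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - r)

# Deviation IQs: SD = 15, reliability coefficient = .89
print(round(sem(15, 0.89), 2))   # 4.97, i.e. about 5 IQ points
```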
The standard error of measurement and the reliability coefficient are obviously alternative ways of expressing test reliability. Unlike the reliability coefficient, the error of measurement is independent of the variability of the group on which it is computed. Expressed in terms of individual scores, it remains unchanged when found in a homogeneous or a heterogeneous group. On the other hand, being reported in score units, the error of measurement will not be directly comparable from test to test. The usual problems of comparability of units would thus arise when errors of measurement are reported in terms of arithmetic problems, words in a vocabulary test, and the like. Hence, if we want to compare the reliability of different tests, the reliability coefficient is the better measure; to interpret individual scores, the standard error of measurement is more appropriate. For example, if the scores on a particular intelligence test have a standard deviation of 15 and a reliability coefficient of .89, the SEM of an IQ on this test is: 15√(1 − .89) = 15√.11 = 15(.33) = 5.

To understand what the SEM tells us about a score, let us suppose that we had a set of 100 IQs obtained with the above test by a single child, Janet. Because of the types of chance errors discussed in this chapter, these scores vary, falling into a normal distribution around Janet's true score. The mean of this distribution of 100 scores can be taken as the "true score" for a specified test use, and the standard deviation of the distribution can be taken as the SEM. Like any standard deviation, this standard error can be interpreted in terms of the normal curve frequencies discussed in chapter 3 (see Fig. 3-3). It will be recalled that between the mean and ±1σ there are approximately 68% of the cases in a normal curve. Thus, we can conclude that the chances are roughly 2:1 (or 68:32) that Janet's IQ on this test will fluctuate between ±1 SEM, or 5 points, on either side of her true IQ. If her true IQ is 110, we would expect her to score between 105 and 115 about two-thirds (68%) of the time.

If we want to be more certain of our prediction, we can choose higher odds than 2:1. Reference to Figure 3-3 (chap. 3) shows that ±3σ covers 99.7% of the cases. It can be ascertained from normal curve frequency tables that a distance of 2.58σ on either side of the mean includes exactly 99% of the cases. Hence, the chances are 99:1 that Janet's IQ will fall within 2.58 SEM, or (2.58)(5) ≈ 13 points, on either side of her true IQ. If Janet were given 100 equivalent tests, her IQ would fall outside this band of values only once.

In actual practice, of course, we do not have the true scores, but only the scores obtained in a single test administration. Under these circumstances, we can apply the above reasoning in the reverse direction. If an individual's obtained score is unlikely to deviate by more than 2.58 SEM from her true score, we could argue that her true score must lie within 2.58 SEM of her obtained score. Although we cannot assign a probability to this statement for any given obtained score, we can say that the statement would be correct for 99% of all the cases. On the basis of this reasoning, Gulliksen (1950, pp. 17-20) proposed that the standard error of measurement be used as illustrated above, in order to estimate the reasonable limits of the true score for persons with any given obtained score. It is in terms of such "reasonable limits" that the error of measurement is customarily interpreted in psychological testing, and it will be so interpreted in this book.⁸ We can thus state at the 99% confidence level (with only one chance of error out of 100) that Janet's IQ on any single administration of the test will lie between 97 and 123 (110 − 13 and 110 + 13).

⁸ Other procedures have been proposed that use an estimated "true" score as the center of the confidence interval (Dudek, 1979; Glutting, McDermott, & Stanley, 1987). If the reliability coefficient is high, this procedure has little effect; if it is low, both true score and size of confidence interval are computed from the same fallible reliability coefficient. Moreover, the optimal procedure varies with the particular purpose for which the test scores are to be used (e.g., for long-term prediction or current performance assessment).

The SEM (or some other index of measurement accuracy) provides a safeguard against placing undue emphasis on a single numerical score. So important is this application of the SEM that an increasing number of published tests now report scores, not as a single number, but as a score band within which the individual's true score is likely to fall. The College Board provides data on the SEM and an explanation of its use, not only in materials distributed to high school and college counselors, but also in the individual score reports on the SAT sent to test takers. Information on SEMs is also provided for use in interpreting scores on the Graduate Record Examinations (GRE 1995-96 guide). The SEM is likewise covered in instructional materials for use in orienting students regarding the meaning of their test scores.

Neither reliability coefficients nor errors of measurement, however, can be assumed to remain constant when ability level varies widely. The differences in reliability coefficients discussed in the preceding section remain when errors of measurement are computed at different levels of the same test. A comprehensive solution for this problem is provided by the IRT techniques of item analysis cited in chapter 3. Spanning a wide ability range, these techniques offer a means of expressing the measurement accuracy of a test as a function of the level of ability. The procedure yields a test information curve that depends only on the items included in the test and permits an estimation of the error of measurement at each ability level. Further discussion of these techniques is given later in the book.

Interpretation of Score Differences. It is particularly important to consider test reliability and errors of measurement when evaluating the differences between two scores. Thinking in terms of the range within which each score may fluctuate serves as a check against overemphasizing small differences between scores. Such caution is desirable both when comparing test scores of different persons and when comparing the scores of the same individual in different abilities. Similarly, changes in scores following instruction or other experimental variables need to be interpreted in the light of errors of measurement.

A frequent question about test scores concerns the individual's relative standing in different areas. Is Doris more able along verbal than along numerical lines? Does Tom have more aptitude for mechanical than for verbal activities? If Doris scored higher on the verbal than on the numerical subtests of an aptitude battery, and Tom scored higher on the mechanical than on the verbal, how sure can we be that they would still do so on a retest with another form of the battery? In other words, could the score differences have resulted merely from the chance selection of specific items in the particular verbal, numerical, and mechanical tests employed? These questions are especially relevant to the proper interpretation of scores on multiscore batteries in both abilities and personality traits (Anastasi, 1985a).
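The arithmetic of the Janet example can be sketched in a few lines of code. This is only an illustration of the calculation described above (the function names are mine, not from any testing package): the SEM follows from the test's standard deviation and reliability, and the "reasonable limits" are obtained by stepping out a chosen number of SEMs from the obtained score.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_band(obtained, sd, reliability, z=2.58):
    """Reasonable limits for the true score: obtained score +/- z * SEM.
    z = 1.00 corresponds to roughly 2:1 (68%) odds; z = 2.58 to 99%."""
    margin = z * sem(sd, reliability)
    return obtained - margin, obtained + margin

# The example from the text: SD = 15, reliability = .89, obtained IQ = 110
s = sem(15, 0.89)                        # about 5 IQ points
lo, hi = confidence_band(110, 15, 0.89)  # about 97 to 123 at the 99% level
```

Note that the text rounds the SEM to 5 before multiplying by 2.58, which is why it reports a band of exactly ±13 points; carrying full precision gives a band a fraction of a point narrower.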
Because of the growing interest in the interpretation of score profiles, test publishers have been developing report forms that permit the evaluation of scores in terms of their errors of measurement. An example is the individual report form for use with the Differential Aptitude Tests, which includes the sort of information illustrated in Figure 4-6. On this form, percentile scores on each subtest of the battery are plotted as percentile bands, with the obtained percentile at the center. Each percentile bar corresponds to a distance of 1 SEM on either side of the obtained score; hence the probability that the individual's "true" score falls within the bar is approximately 2 to 1 (.68 to .32). In interpreting the profiles, test users are advised not to attach importance to differences between scores whose percentile bars overlap, especially if they overlap by more than half their length. In the profile illustrated in Figure 4-6, the difference between the Verbal Reasoning and Numerical Reasoning scores probably reflects a genuine difference in ability level; that between Numerical Reasoning and Abstract Reasoning probably does not; and the difference between Abstract Reasoning and Mechanical Reasoning is in the doubtful range.

Figure 4-6. Score Profile on the Differential Aptitude Tests, Illustrating Use of Percentile Bands. (Data from Individual Report, Differential Aptitude Tests, 5th ed. Copyright © 1990 by The Psychological Corporation. Reproduced by permission.)

It is good to bear in mind that the standard error of the difference between two scores is larger than the error of measurement of either of the two scores. This follows from the fact that this difference is affected by the chance errors present in both scores. The standard error of the difference between two scores can be found from the standard errors of measurement of the two scores by the following formula:⁹

SEdiff = √(SEM₁² + SEM₂²)

in which SEdiff is the standard error of the difference between the two scores, and SEM₁ and SEM₂ are the standard errors of measurement of the separate scores. By substituting SD√(1 − r₁₁) for SEM₁ and SD√(1 − r₂₂) for SEM₂, we may rewrite the formula directly in terms of reliability coefficients, as follows:

SEdiff = SD√(2 − r₁₁ − r₂₂)

In this substitution, the same SD was used for tests 1 and 2, since their scores would have to be expressed in terms of the same scale before they could be compared.

⁹ This formula should not be confused with the formula for the standard error of a difference between two group means, which includes a correlation term when the two variables to be compared are correlated. Errors of measurement in two variables are random errors and hence are assumed to be uncorrelated.

We may illustrate the above procedure with the Verbal and Performance IQs on the Wechsler Adult Intelligence Scale-Revised (WAIS-R). The split-half reliabilities of these scores are .97 and .93, respectively. WAIS-R deviation IQs are expressed on a scale with a mean of 100 and an SD of 15. Hence the standard error of the difference between these two scores can be found as follows:

SEdiff = 15√(2 − .97 − .93) = 15√.10 = 4.74

To determine how large a score difference could be obtained by chance at the .05 level, we multiply the standard error of the difference (4.74) by 1.96. The result is 9.29, or approximately 10 points. Thus, the difference between an individual's WAIS-R Verbal and Performance IQ should be at least 10 points to be significant at the .05 level.¹⁰

¹⁰ More precise estimates can be obtained by using the actual reliabilities and SDs found within each age group. When thus computed, the minimum significant Verbal-Performance difference at the .05 level, as reported in the test manual, ranges from 8.83 to 12.04. Most of the values, however, are close to 10.

Examples and further discussion of the problems to be considered in interpreting a person's score profile on such batteries can be found in chapters 8 and 10 (for ability tests) and chapter 13 (for personality tests).
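The WAIS-R computation above can likewise be expressed as a short calculation. The sketch below implements the formula SEdiff = SD√(2 − r₁₁ − r₂₂) from the text; the function names are illustrative, not part of any published scoring software.

```python
import math

def se_diff(sd, r11, r22):
    """Standard error of the difference between two scores expressed
    on the same scale: SD * sqrt(2 - r11 - r22)."""
    return sd * math.sqrt(2 - r11 - r22)

def min_significant_diff(sd, r11, r22, z=1.96):
    """Smallest difference between the two scores that is unlikely
    to arise by chance; z = 1.96 corresponds to the .05 level."""
    return z * se_diff(sd, r11, r22)

# WAIS-R example from the text: SD = 15, split-half reliabilities .97 and .93
sed = se_diff(15, 0.97, 0.93)                      # about 4.74
threshold = min_significant_diff(15, 0.97, 0.93)   # about 9.29, i.e. roughly 10 IQ points
```

Because errors of measurement in the two scores are assumed to be uncorrelated, no correlation term appears here, in contrast to the formula for the difference between two group means mentioned in the footnote.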
RELIABILITY APPLIED TO MASTERY TESTING AND CUTOFF SCORES

It will be recalled from chapter 3 that domain-referenced tests usually (but not necessarily) evaluate performance in terms of mastery rather than degree of achievement. A major statistical implication of mastery testing is a reduction in the variability of scores among persons. Theoretically, if everyone continues training until the skill is mastered, variability is reduced to zero. In an earlier section of this chapter, we saw that any correlation, including reliability coefficients, is affected by the variability of the group in which it is computed. As the variability of the sample decreases, so does the correlation coefficient. Obviously, under these conditions, even a highly stable and internally consistent test could yield a reliability coefficient near zero.

This apparent difficulty in the assessment of reliability arises from a failure to consider what domain-referenced tests are designed to measure. The specific purpose for which the test is administered may vary widely, from obtaining a driver's license or assignment to an occupational specialty to advancing to the next unit in an individualized instructional program or admission to a particular course of study. Nevertheless, in all such situations, the fact that a test is used at all implies the expectation of variability in performance among individuals. A major portion of this variability reflects individual differences in amount of prior training in the relevant functions. These tests are used essentially to differentiate between those persons who have and those who have not acquired the skills and knowledge required for a designated activity. Hence, it would be inappropriate to assess the reliability of most domain-referenced tests by applying the usual procedures to a group of persons after they have reached the preestablished mastery level.

More than a dozen different techniques have been specifically designed to evaluate the reliability of domain-referenced tests (Berk, 1984a, 1984b; Brennan, 1984; Subkoviak, 1984; Feldt & Brennan, 1989). Some of these techniques are appropriate for simple mastery-nonmastery decisions, in which all classification errors are considered equally serious regardless of their distance from the cutoff score. In such cases, test and retest with parallel forms can be used to find the percentage of persons for whom the same decision is reached on both occasions. These data can be further analyzed by computing appropriate indexes of agreement and significance values. Other procedures take into account the actual scores obtained on the two occasions and provide indexes that reflect each person's deviation above or below any given cutoff score. The choice of a particular procedure should take into account the nature and uses of the test, the position of cutoff scores, and other psychometric features of the test. Relevant considerations have been extensively discussed in the technical literature (see Berk, 1984a, 1984b).
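The simple mastery-nonmastery approach described above, percentage of identical decisions across two parallel-form administrations, can be sketched as follows. This is only an illustration of the general agreement idea, not any one of the specific published procedures the text cites; the data and the use of Cohen's kappa as the chance-corrected agreement index are my own choices for the example.

```python
def decision_consistency(form_a, form_b, cutoff):
    """Proportion of persons classified the same way (master vs. nonmaster)
    on two parallel forms, plus a chance-corrected agreement index (kappa)."""
    n = len(form_a)
    pass_a = [score >= cutoff for score in form_a]
    pass_b = [score >= cutoff for score in form_b]
    # raw agreement: fraction of persons with the same decision both times
    agree = sum(a == b for a, b in zip(pass_a, pass_b)) / n
    # chance agreement expected from the marginal passing rates alone
    pa, pb = sum(pass_a) / n, sum(pass_b) / n
    chance = pa * pb + (1 - pa) * (1 - pb)
    kappa = (agree - chance) / (1 - chance) if chance < 1 else 1.0
    return agree, kappa

# Hypothetical scores for 8 examinees on two parallel forms, cutoff = 70
a = [85, 72, 60, 90, 68, 75, 55, 80]
b = [82, 69, 58, 93, 71, 78, 52, 84]
agreement, kappa = decision_consistency(a, b, 70)  # 6 of 8 decisions agree
```

Note that this index ignores how far each score falls from the cutoff; the "other procedures" mentioned in the text weight deviations above or below the cutoff rather than treating all classification errors as equally serious.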