Introduction to
Measurement
Theory

Mary J. Allen
Wendy M. Yen


PRESS, INC.
Long Grove, Illinois
7

Transforming and Equating Test Scores

7.1 Introduction

David took a test and received a score of 32. How well did he do? The raw score (that is, the observed score) does not carry enough information to answer this question. However, if we knew something about the distribution of raw scores, we could get a better idea of what David's score means. For example, if we knew that David's score fell at the 90th percentile (meaning that he scored as well as or better than 90% of the examinees who took the test), we would have a much better idea about how well he did. By dealing with a percentile rather than a raw score, we have made a raw-score transformation. In this chapter we will discuss several different types of raw-score transformations that aid in the interpretation of test scores. Chapter 8 discusses the levels of measurement (nominal, ordinal, interval, and ratio) of the scores produced by these transformations.

Because there are a number of disadvantages in retaining the raw scores on any test, most scores are transformed before being reported. A major problem with raw scores was just mentioned: a raw score by itself is difficult to interpret. Even if raw scores are accompanied by information about the number of items on the test, an isolated raw score does not give any information about how one examinee's performance is related to the performance of the other examinees. If a raw-score distribution is irregular (for example, if it is skewed or bimodal), common statistical techniques that require normality cannot be applied reasonably. Another problem with the use of raw scores is that raw scores may not be comparable across tests. For example, if Gene has a raw score of 60 on Test 1 and a raw score of 20 on Test 2, we cannot easily assess his relative performance on the two tests, particularly if the distributions of scores for the two tests have quite different shapes. All of these problems have led to the development and use of common transformations that result in more easily interpretable reported scores. Common forms of expressing transformed scores are: (1) percentiles, (2) age and grade scores, (3) expectancy tables, (4) standard and standardized scores, (5) normalized scores, (6) formula scores, and (7) equal-interval scales.

Transformations of scores are of two basic types: linear and nonlinear. A linear transformation can be defined by a linear equation of the form Y = aX + b, where a and b are constants, X is the raw score, and Y is the transformed score. In making this transformation, every examinee's X is transformed to a Y using the linear rule. If the transformation equation is known, the transformed score corresponding to any raw score can be calculated easily. For example, if Y = 3X - 2, the transformed score corresponding to a raw score of 12 is 3(12) - 2 = 34. When raw scores are linearly transformed, the shape of the distribution of the transformed scores is the same as the shape of the distribution of the raw scores. For example, if the raw-score distribution is skewed to the right, the linearly-transformed-score distribution also will be skewed to the right. Furthermore, linear transformations do not alter the size of correlations (see Section 2.8). The construction of standard and standardized scores and some formula scores involves a linear transformation of raw scores.

A nonlinear transformation cannot be expressed in the form of a linear equation. For example, Y = X^2 is a nonlinear transformation of X. In general, nonlinear transformations will change correlations and the shape of the score distribution, so that the transformed-score distribution can be very different from the raw-score distribution. All of the seven transformations listed previously, except for standard and standardized scores and some formula scores, involve nonlinear transformations.

The raw-score transformations (except for some formula scores) that are discussed in this chapter are all monotonic transformations of the raw scores. This means that, if one examinee's raw score is greater than another examinee's raw score, the first examinee will have a greater transformed score than the second examinee. In other words, monotonic transformations will not alter an examinee's rank order in the sample. If the scores are to be used in a ranking or sorting situation, there is nothing to be gained by using these transformations. If, however, we want to communicate the meaning of a score more effectively or to perform special statistical manipulations with the scores, one or another of the transformations can be useful.

The raw-score transformations discussed in this chapter transform scores in
order to make them more meaningful. In most cases, this meaning is derived by comparing an examinee's performance to the performance of other examinees; that is, the scores derive meaning through reference to a norm group. This technique is called a norm-referenced approach. An alternative approach, criterion-referencing, is discussed in Section 10.5. In criterion-referenced testing, an attempt is made to determine whether the examinee has reached a certain specific criterion performance or mastered a specific task (for example, can the examinee subtract single-digit numbers?). In criterion-referenced testing, a raw score (number-right score) can be meaningful and does not require transformation.

7.2 Percentiles

Percentiles are defined with respect to a norm, or reference, group. A norm group is a specified sample of examinees, for example, a certain sample of sixth-grade students randomly chosen from schools across the United States. If it were possible to determine the actual trait values for a continuous trait (for example, intelligence or hyperactivity) for each person in a norm group, then we could determine the percentage of people with trait values less than or equal to any specified value. The percentile rank (or percentile or percentile score) of a trait value is defined as the percentage of people in a norm group who have trait values less than or equal to that particular trait value. For example, if 75 out of 100 people in the norm group have spelling abilities less than or equal to 17.3, then 17.3 is assigned a percentile of 75. In practice, we cannot obtain trait values, but we can obtain observed test scores. We can assume that each test score represents a range of trait values. For example, an observed test score of 17 represents a range of trait values from 16.5 to 17.5. We can further assume that every trait value in the range represented by an observed test score is equally likely to occur. For example, for an observed test score of 17, all trait values between 16.5 and 17.5 are equally likely to occur. Thus, if six examinees receive an observed test score of 17, three examinees are assumed to have trait values less than 17, and three are assumed to have trait values greater than 17.

Using these assumptions, we can use frequency distributions of observed test scores to estimate the percentiles associated with various trait values. Consider the score distribution in Table 7.1. For each score, the frequency (number) of examinees obtaining the score appears in the middle column. The next column on the right contains the cumulative frequency of each score, which is the number of examinees who have scores less than or equal to that score. For example, the cumulative frequency of a score of 116 is the frequency of a score of 116 (8) plus the frequency of a score of 115 (2), which equals 10. Using this observed test-score distribution, let's estimate the percentile rank for a trait value of 117. Remembering our assumptions, we recognize that the observed test score of 117 represents a range of trait values from 116.5 to 117.5 and that the six examinees who got a test score of 117 are evenly distributed from 116.5 to 117.5. Therefore, we conclude that 13 examinees had trait values at or below 117: the ten examinees below 116.5 plus the three examinees between 116.5 and 117. (Since we assume the trait distribution is continuous, no one falls at exactly 117. Whether 13 examinees fell at and below or just below 117 is philosophically, but not practically, significant.) Since there are 20 examinees in all, 13/20 is the desired proportion. Since 13/20 = .65, the percentile rank for a trait value of 117 is estimated to be 65.

Table 7.1. A Score Frequency Distribution

Score    Frequency    Cumulative Frequency
118      4            20
117      6            16
116      8            10
115      2            2

Using similar logic to calculate the percentile rank for a trait value of 116, we see that there are six examinees at or below a trait value of 116 (two below 115.5 and four between 115.5 and 116). The proportion at or below 116 is 6/20 = .30, so the percentile rank for a trait value of 116 is estimated to be 30. We also see that 18 examinees fall at or below a trait value of 118 (16 below 117.5 plus two between 117.5 and 118), so the percentile rank for a trait value of 118 is estimated to be (100)(16 + 4/2)/20 = 90. Similarly, the percentile rank for a trait value of 115 is estimated to be (100)(0 + 2/2)/20 = 5.

Table 7.2. Information Table for Calculation of a Trait Value Corresponding to a Percentile

Trait Value    Cumulative Frequency
117.5          16
?              15
116.5          10

Calculations to find trait values corresponding to given percentiles are slightly more tedious but conceptually similar. To estimate the trait value falling at the 75th percentile rank, we need to find the trait value that 75% of the examinees fall at or below. For the data in Table 7.1, there are 20 examinees in all, so the trait value at the 75th percentile must have .75(20) = 15 examinees at or below it. Looking at the cumulative frequencies of the scores and remembering the assumptions, we see that 16 examinees have trait values at or below 117.5 and that ten examinees have trait values at or below 116.5; therefore, the trait value we want must be somewhere between 116.5 and 117.5. To make the calculations easier, we can construct a table like Table 7.2. We see that a trait value of 117.5 corresponds to a cumulative frequency of 16 and that a trait value of 116.5 corresponds to a cumulative frequency
of 10; we want to know the trait value corresponding to a cumulative frequency of 15. We find this trait value using linear interpolation: 15 is 5/6 of the way between 10 and 16, so the trait value we need must be 5/6 of the way between 116.5 and 117.5. The desired trait value is approximately 116.5 + .83 = 117.33. This calculation and its logic are illustrated in Figure 7.1. On the right side of Figure 7.1, we see that we must go up 5/6 of the distance between 10 and 16. We must go up proportionately the same amount on the left side (between 116.5 and 117.5) or 5/6 times 1, which is about .83. This produces the answer of 117.33.

Trait Value    Cumulative Frequency
117.5          16
?              15
116.5          10

116.5 + [(15 - 10)/(16 - 10)](117.5 - 116.5) = 116.5 + (5/6)(1) ≈ 117.33

Figure 7.1. Illustration of the calculation of a trait value corresponding to a percentile

Occasionally frequency distributions are grouped in terms of score intervals. For example, we may know that 27 examinees earned scores from 1 to 5 without knowing how many examinees earned a score of 1, how many earned a score of 2, and so on. In such a case, we assume that the score interval from 1 to 5 covers trait values from .5 to 5.5 and that every trait value within this interval is equally likely to occur. If this last assumption is false, the percentile calculated from grouped data may differ markedly from the percentile calculated from ungrouped data. For example, when the percentile rank for a trait value of 6 is calculated for the ungrouped data in Table 7.3, it is approximately 28.6. However, for the grouped distribution the calculated percentile rank is approximately 44.6.

Table 7.3. Ungrouped and Grouped Frequency Distributions

Ungrouped             Grouped
Score    Frequency    Score    Frequency
8        12           5-8      31
7        8            1-4      4
6        10
5        1
4        2
3        1
2        1
1        0

Table 7.4. A Grouped Frequency Distribution

Score    Frequency    Cumulative Frequency
50-59    6            50
40-49    12           44
30-39    17           32
20-29    10           15
10-19    2            5
0-9      3            3

Trait Value    Cumulative Frequency
39.5           32
37             ?
29.5           15

15 + [(37 - 29.5)/(39.5 - 29.5)](32 - 15) = 15 + (7.5/10)(17) = 27.75
27.75/50 = .555
Percentile = .555(100) = 55.5

Figure 7.2. Calculation of the percentile rank for a trait value of 37

When dealing with grouped frequency distributions, the logic of calculating percentiles is the same as it is with ungrouped frequency distributions. For the distribution in Table 7.4, we will calculate the percentile rank for a trait value of 37 and the trait value that falls at the 70th percentile. These calculations are illustrated in Figures 7.2 and 7.3, respectively. As shown in Figure 7.2, there are 15 examinees at or below a trait value of 29.5 and 32 examinees at or below a trait value of 39.5. For a trait value of 37, we go up 7.5 units of a distance of 10 units on the left side of Figure 7.2, so on the right side we must go up 7.5/10 of a distance of 17. We conclude that there are 27.75 examinees at or below a trait value of 37. Since there are 50
examinees in all, 27.75/50 = .555 of the examinees fall at or below a trait value of 37, which corresponds to a percentile rank of 55.5.

In Figure 7.3 we need to find the trait value that 70% of 50 examinees fall at or below, that is, the trait value with a cumulative frequency of 35. There are 44 examinees with trait values at or below 49.5, and there are 32 examinees with trait values at or below 39.5. Therefore, our answer is somewhere between 39.5 and 49.5. On the right side we go up 3/12 of the way, so on the left side we must go up 3/12 of the distance to a score of 42. Figures 7.4 and 7.5 illustrate similar calculations; we find that the percentile rank for a trait value of 26 is 23 and that the trait value at the 20th percentile is 24.5.

Trait Value    Cumulative Frequency
49.5           44
?              35
39.5           32

.70(50) = 35
39.5 + [(35 - 32)/(44 - 32)](49.5 - 39.5) = 39.5 + (3/12)(10) = 42

Figure 7.3. Calculation of the trait value at the 70th percentile

Trait Value    Cumulative Frequency
29.5           15
26             ?
19.5           5

5 + [(26 - 19.5)/(29.5 - 19.5)](15 - 5) = 5 + (6.5/10)(10) = 11.5
11.5/50 = .23
Percentile = .23(100) = 23

Figure 7.4. Calculation of the percentile rank for a trait value of 26

Trait Value    Cumulative Frequency
29.5           15
?              10
19.5           5

.20(50) = 10
19.5 + [(10 - 5)/(15 - 5)](29.5 - 19.5) = 19.5 + (5/10)(10) = 24.5

Figure 7.5. Calculation of the trait value at the 20th percentile

Percentiles, like any transformed scores, have some advantages and some disadvantages. The primary advantages of percentiles are that they are straightforward to calculate, regardless of the shape of the distribution of observed scores, and that they are easy to interpret. For communicating with people who have little background in statistics, percentiles are probably the most popular and meaningful transformed scores.

There are a number of limitations of percentile scores. Percentiles can be assumed to form ordinal scales (see Sections 2.2 and 8.2); thus, arithmetical manipulations of percentiles (for example, the calculation of means and variances of percentiles) can produce misleading results. Suppose that five seniors from two high schools have taken a college entrance exam. The scores for this exam are constructed on an equal-interval scale with a national mean of 420 and a standard deviation of 40. The scores and national percentiles for the two groups appear in Table 7.5. It is clear that the equal-interval scores from the two schools have the same variances. However, the variance of percentiles from School B is much smaller than the variance of percentiles from School A. Use of the variances of the percentiles could lead to inaccurate conclusions about the relative variance in performance for the two schools.

Keep in mind, also, that the distribution of percentiles within the norm group is rectangular, not normal, since by definition 1% of the examinees are at each percentile. Unlike the bell-shaped normal curve, a rectangular distribution curve looks like a horizontal line. Therefore, researchers who desire to use common statistical techniques that assume normal distributions should avoid the use of percentiles.

A third problem with percentile scores is that they may lead to exaggerated interpretations of small differences, especially when the test is short. Consider a test
with an approximately normal score distribution. What happens to percentiles for examinees scoring near the center of the curve? As illustrated in Figure 7.6, small score differences near the center of the distribution may lead to large percentile differences. Suppose four examinees (P1, P2, P3, and P4) receive scores of 10, 12, 17, and 19, respectively, on a test. Only two score points separate P1 from P2 and P3 from P4, but as illustrated in Figure 7.6, there will be a much larger difference in the percentiles for P1 and P2 than for P3 and P4. For example, the four percentiles for the examinees might be 25, 50, 97, and 99. Someone examining these percentiles would probably conclude that P1 is subnormal, P2 is average, and P3 and P4 are both extremely high on the trait, without realizing that only two items marked incorrectly separate P1 from P2 and P3 from P4. This problem is not restricted to normal distributions; it will occur whenever a large proportion of examinees get the same or similar observed scores, causing a one- or two-point score difference to result in a large percentile difference. Since this problem is more likely to occur on short tests, on which only a limited number of scores are possible, the test user should exercise great caution when dealing with percentiles based on short tests.

Table 7.5. Scores and Percentiles for Students from Two Schools

            School A                School B
            Score    Percentile     Score    Percentile
            400      31             450      77
            410      40             460      84
            420      50             470      89
            430      60             480      93
            440      69             490      96
Mean        420      50             470      88
Variance    200      184            200      45

[Figure 7.6. An equivalent raw-score difference generating very different percentile differences. Horizontal axis: raw score, 2 to 20.]

7.3 Age and Grade Scores

Often test scores are reported in terms of age or grade scores, sometimes called age or grade equivalents. For example, a third-grader may be said to read at the fifth-grade level or have the mental ability of a 10-year-old. To calculate an age or grade score, the median raw score for examinees of a particular age (or grade) is determined, and this median raw score is transformed to be that age (or grade) score. For example, if the median raw score on a reading test for fourth-graders is 17.2, the raw score of 17.2 is transformed to a grade score of "grade 4." Sometimes mean raw scores for a group are used instead of median scores to determine age or grade scores. In either case, interpolation is used to fill in the gaps between scores. For example, if the median grade-3 raw score is 13.3 and the median grade-4 raw score is 17.2, a raw score of 16 is transformed to a grade 3.7, or about .7 of the way between grades 3 and 4. This calculation is illustrated in Figures 7.7 and 7.8.

Grade Score    Raw Score
4              17.2
?              16
3              13.3

3 + [(16 - 13.3)/(17.2 - 13.3)](4 - 3) = 3 + (2.7/3.9)(1) ≈ 3.69

Figure 7.7. Interpolation of grade scores

[Figure 7.8. Representation of grade scores. A raw-score axis (13 to 18) is aligned with a grade-score axis marked at 3, 3.7, and 4.]

Despite their popularity, age and grade scores have a number of serious limitations. Because they are assumed to form ordinal scores (Section 8.2), arithmetical manipulations of these scores can lead to misleading results. Also, the interpretation of these scores is not as straightforward as it appears. For example, one might infer that two children with the same mental age or age score think similarly, which is generally not true. There are enormous cognitive differences between a 5-year-old with a mental age of 8 and an 8-year-old with a mental age of 8. There will be large differences between the children in background knowledge and experiences, maturity of value systems, interests, cognitive styles, and reasoning ability. Similarly, a 5-year-old who is average and a 10-year-old with a mental age of 5 are very different, despite their similar mental ages. This limitation is also true for grade scores. A third-grader who is at the fifth-grade level on a science-achievement exam probably knows very different items of information and has a very different perception of the physical world than the average fifth-grader. Children with the same age or grade scores, especially when widely disparate in age, probably got different answers correct, used different test-taking strategies or styles, and are prepared for different types of subsequent training. These facts suggest that the use of age or grade scores could mislead most interpreters to perceive equalities that are not real.

A third problem with age and grade scores is that score distributions for adjacent grades typically tend to have increasing overlap as grade level increases (see Figure 7.9). Consider reading-level grade norms. A first-grader who is reading one grade level ahead may be in the 85th percentile among first-graders, but a fifth-grader who is one year ahead may only be at the 65th percentile among fifth-graders. Even though both students are one grade level ahead, the interpretation of their reading levels is quite different. The situation with age scores is the same. A 5-year-old who is two years ahead would be truly exceptional; for an 11-year-old, being two years ahead in mental age would not be as extraordinary. This problem also makes it difficult to compare examinees' performances at different age or grade levels. If two third-graders score at grade 3 and grade 4 on a mathematics exam, the one who scores higher might be very superior to the other. But, if two seventh-graders score at grade 7 and grade 8, they might be very similar to each other in mathematical competence. A new teaching method that improves performance by one grade level in only a few months' time would be marvelous if it worked for third-graders but less impressive if it worked for seventh-graders. Similar problems occur in age scores. A change of one year in cognitive or social development is an enormous leap for an infant; it is a much less dramatic change for a 16-year-old.

[Figure 7.9. Example of increasing overlap of within-grade-score distributions as grade increases. Two panels plot the Grade-1 and Grade-2 distributions and the Grade-5 and Grade-6 distributions against test score, with each grade mean marked.]

Another fact that is often forgotten is that, when we consider all those children with the same chronological age (or the same grade), about half of them will be above average, and the other half will be below average. For example, about half of the third-graders read below the grade-3 norm. Therefore, probably very few children will fall where one might expect them to on the basis of their age or grade level.

Another problem, especially for grade scores, is that schools may differ in their curricula and may introduce topics at different rates. Thus, a whole school may be below average in arithmetic and above average in history. In a case such as this, the grade scores might suggest that the individual students are retarded or gifted in these areas, when actually it is differential exposure that accounts for their score patterns.

Obviously, the use of age or grade scores is only reasonable when the trait being measured increases (or decreases) monotonically with age or grade. The test scores illustrated in Figure 7.10 should not be transformed to age scores, because the test does not discriminate between 2-year-olds and 4- or 5-year-olds. Only when the relationship between mean or median score and age or grade is monotonically increasing or decreasing, as illustrated in Figure 7.11, should age or grade scores be considered.

[Figure 7.10. A relationship between test scores and age that is not appropriate for the development of age scores. Axes: test score against age, 2 to 6.]

A final problem with age and grade scores is that interpolation between tests may be inaccurate. School achievement exams are usually age graded. For example, there might be different mathematics-achievement tests for grades 3 to 6 and for grades 7 to 12. Unless these different tests can be equated (see Section 7.9), a fourth-grader who scores at the eighth-grade level on the lower level of the test probably will not score at the eighth-grade level on the higher level of the test. In summary, age and grade scores, despite their seeming appropriateness for use with children, have a large number of serious drawbacks that make meaningful interpretation extremely difficult.
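The grade-score interpolation of Figure 7.7 uses the same linear interpolation that Section 7.2 applies to percentile ranks and to trait values at a given percentile. The following Python sketch is an illustration only, not part of the original text. The function name `interpolate` and its argument names are hypothetical. The numbers are taken from Figures 7.1, 7.2, 7.3, and 7.7.

```python
def interpolate(x, x_lo, x_hi, y_lo, y_hi):
    """Linearly interpolate the y value at x between two bracketing points.

    (x_lo, y_lo) and (x_hi, y_hi) are the known lower and upper points.
    This is a hypothetical helper, not a function from any library.
    """
    fraction = (x - x_lo) / (x_hi - x_lo)  # how far x lies between x_lo and x_hi
    return y_lo + fraction * (y_hi - y_lo)


# Grade score for a raw score of 16 (Figure 7.7): the median raw scores are
# 13.3 for grade 3 and 17.2 for grade 4.
grade_score = interpolate(16, 13.3, 17.2, 3, 4)  # about 3.69

# Percentile rank for a trait value of 37 (Figure 7.2): the cumulative
# frequency is 15 at 29.5 and 32 at 39.5, and there are 50 examinees in all.
cum_freq_at_37 = interpolate(37, 29.5, 39.5, 15, 32)  # 27.75
percentile_rank_37 = 100 * cum_freq_at_37 / 50        # 55.5

# Trait value at the 70th percentile (Figure 7.3): the roles of the two
# columns are swapped, so we interpolate from cumulative frequency to trait value.
trait_at_70th = interpolate(0.70 * 50, 32, 44, 39.5, 49.5)  # 42.0

# Trait value at the 75th percentile for Table 7.1 (Figure 7.1).
trait_at_75th = interpolate(0.75 * 20, 10, 16, 116.5, 117.5)  # about 117.33

print(round(grade_score, 2), percentile_rank_37, trait_at_70th, round(trait_at_75th, 2))
```

Every calculation reduces to the same two steps: find the fraction of the distance covered on one scale, then go up the same fraction on the other scale. The sketch assumes, as the text does, that values are spread evenly between the two bracketing points.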
[Figure 7.11. Relationships between test scores and age that are appropriate for the development of age or grade scores. Panels (a) and (b); horizontal axis: age or grade.]

7.4 Expectancy Tables

Expectancy tables are probably the best transformations available for tests that can be tied to a reasonable criterion. An expectancy table gives the conditional distribution (see Section 2.13) of criterion scores for different test scores. Table 7.6 gives a hypothetical expectancy table relating pre-course test scores to performance in a statistics class. Of the people in the norm group who scored between 25 and 30 on the pre-course test, 50% earned "A's," 30% earned "B's," and so on. For the people scoring between 5 and 9, none earned "A's," 2% earned "B's," and so on. A counselor would advise a student with a high pre-course test score to take the course, perhaps with a warning that a few students from this test-score level still did poorly in the course. When advising a person with a low test score, the counselor might suggest remedial work for the student before he or she enrolled in the class. The expectancy table illustrates the probabilistic nature of psychological prediction. Although most high scorers did well in the class, some did not, making it clear that a student with a high pre-course test score is not guaranteed an "A."

Table 7.6. Expectancy Table Relating Pre-Course Test Scores to Course Grades

Test                       Grade
Score    A (4)    B (3)    C (2)    D (1)    F (0)
25-30    .50      .30      .15      .04      .01
20-24    .35      .34      .20      .10      .01
15-19    .10      .15      .58      .15      .02
10-14    .05      .06      .59      .25      .05
5-9      0        .02      .48      .40      .10
0-4      0        .01      .27      .42      .30

The main advantages of expectancy tables are especially apparent in counseling applications. The expectancy table makes clear the relationship between the test score and the criterion and the probabilistic nature of this relationship. Expectancy tables built on local norms for schools, mental-health clinics, factories, and so on are conceptually easy to develop and use.

There are, however, some disadvantages to expectancy tables. These tables often cannot be developed, either because of time or monetary considerations or because a clear-cut criterion is not available. The norm group used to develop expectancy tables should be large enough to ensure that the probabilities in the table are reasonably stable. For a test with wide applications, a large number of expectancy tables may be necessary to relate the test scores to an array of criteria. Local norms, rather than norms based on a national sample, may be necessary for specific school programs, therapy situations, or careers. These problems with expectancy tables are practical rather than theoretical. When specific criteria and reasonable sample sizes for the norm groups are available, norms displayed in expectancy-table form are one of the most useful transformations possible.

An alternative to the expectancy table that can be used in similar situations involves the regression of the criterion on the test (see Sections 2.9 and 2.14). A predicted criterion score is provided for each test score or range of test scores. In the preceding example, the predicted (expected) course grade for every score on the test could be provided. The advantages and disadvantages of the transformation using regression techniques are very similar to those for expectancy tables.

7.5 Standard and Standardized Scores

Section 2.7 demonstrated how to calculate standard scores, often called Z scores, where Z = (X - μ_X)/σ_X. To get a standard score corresponding to any raw score, the mean of the raw scores is subtracted from the raw score, and the resulting number is divided by the standard deviation of the raw scores. A standard score indicates how many standard deviations from the mean a score lies. For example, if Z = +1, the raw score lies one standard deviation above the mean; if Z = -2, the raw score is two standard deviations below the mean. Since the standard-score transformation equation is linear, the shape of the standard-score distribution will be the same as the shape of the raw-score distribution, and correlations will not be affected. Contrary to popular belief, not all standard scores have a normal distribution. If the raw scores are skewed or bimodal, the standard-score distribution will have these same properties.

Standard scores always have a mean of 0 and a standard deviation of 1. One major disadvantage of standard scores is that about half the scores are negative. Most people prefer not to deal with negative numbers, because transcription and mathematical errors are more common (negative signs are easily lost) and because examinees dislike having negative scores. For these reasons, standard scores generally are not used in reporting scores.

Standardized scores are linear transformations of raw scores (or their standard-score equivalents) that eliminate the problems involved with negative numbers. Any set of standard scores can be transformed to have an arbitrary mean, μ*, and standard deviation, σ*, by applying the formula Y = σ*Z + μ*, where Z is
the standard score and $Y$ is the standardized score. For example, if you want to have $\mu^* = 100$ and $\sigma^* = 16$, the transformation from $Z$ to $Y$ would be $Y = 16Z + 100$. An examinee with a standard score of 2 would have a standardized score of 132. This formula can also be easily applied to the raw scores. Since $Z = (X - \mu_X)/\sigma_X$,

$$Y = \sigma^* \left( \frac{X - \mu_X}{\sigma_X} \right) + \mu^*. \tag{7.1}$$

For example, if the raw-score mean and standard deviation are $\mu_X = 36$ and $\sigma_X = 3$, and you want to transform to standardized scores with $\mu^* = 50$ and $\sigma^* = 5$, the equation would be

$$Y = 5 \left( \frac{X - 36}{3} \right) + 50.$$

For a raw score, $X$, of 30, the transformed standardized score, $Y$, would be $Y = 5(30 - 36)/3 + 50 = 40$.

There are several standardized scale transformations in common use. The Army General Classification Test (AGCT) scores, developed in World War II, are standardized scores with $\mu^* = 100$, $\sigma^* = 20$. The College Entrance Exam Board (CEEB) test scores have $\mu^* = 500$, $\sigma^* = 100$. The subtest scores on the Wechsler tests (WISC and WAIS) are standardized to have $\mu^* = 10$, $\sigma^* = 3$. The Stanford-Binet IQ test score is standardized to have $\mu^* = 100$, $\sigma^* = 16$. Some personality-test scores, such as the California Psychological Inventory (CPI) and Minnesota Multiphasic Personality Inventory (MMPI) scales, have $\mu^* = 50$, $\sigma^* = 10$.

For people with a basic statistics background, standard and standardized scores are relatively simple to understand. Since they are linear transformations of the raw scores, these transformed scores will have a distribution with the same shape as the raw-score distribution. If the raw-score distribution is approximately normal, it is fairly easy to transform the standard or standardized scores to approximate percentiles, such as those given in Table 7.7. (This table was created using the standard normal table in the Appendix.) For example, if an examinee's standard score is 1.9, you can guess that the examinee is at about the 95th percentile (actually, from the table in the Appendix, it is the 97th percentile). If the standardized scores have a mean of 100 and a standard deviation of 10, a score of 90 is one standard deviation below the mean, or at about the 16th percentile.

One disadvantage of standard or standardized scores is that they may be difficult for those unfamiliar with statistics to understand fully. Another disadvantage is that, since these transformations are linear, the distribution of transformed scores will contain any irregularities found in the raw-score distribution. Irregular "bumps" in this distribution, usually due to sampling irregularities, will be preserved by the transformation. A third problem with standard or standardized scores is more subtle. Suppose Juan had two standard scores, .19 on Test A and .14 on Test B, that lead us to the conclusion that he did about equally well on the two tests. However, if the shapes of the two score distributions are quite different, our conclusion would be wrong. Figure 7.12 illustrates one possible pair of standard-score distributions. The distribution for Test A is skewed to the right, and the distribution for Test B is skewed to the left. The standard score of .19 is at the 64th percentile for Test A, and the standard score of .14 is at the 50th percentile for Test B. Similar standard or standardized scores can lead to different interpretations of relative merit. Thus, unless the test interpreter knows that two frequency distributions have the same shape, it is difficult to compare scores on two standardized scales, and it is particularly risky to interpret small differences in standard scores.
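As a minimal sketch, the linear transformation of Equation 7.1 can be written in a few lines of Python. The function names `z_score`, `standardized_from_z`, and `standardized_from_raw` are illustrative choices and do not come from the text; the two calls at the end reproduce the worked examples given with Equation 7.1.

```python
def z_score(x, raw_mean, raw_sd):
    """Standard score: Z = (X - mu_X) / sigma_X."""
    return (x - raw_mean) / raw_sd


def standardized_from_z(z, new_mean, new_sd):
    """Standardized score: Y = sigma* Z + mu*."""
    return new_sd * z + new_mean


def standardized_from_raw(x, raw_mean, raw_sd, new_mean, new_sd):
    """Equation 7.1: Y = sigma* (X - mu_X) / sigma_X + mu*."""
    return standardized_from_z(z_score(x, raw_mean, raw_sd), new_mean, new_sd)


# Z = 2 on a scale with mean 100 and standard deviation 16.
print(standardized_from_z(2, new_mean=100, new_sd=16))  # 132

# Raw score 30 when mu_X = 36, sigma_X = 3, target mean 50 and SD 5.
print(standardized_from_raw(30, raw_mean=36, raw_sd=3, new_mean=50, new_sd=5))  # 40.0
```

Because the transformation is linear, it changes only the mean and standard deviation of the scores; any skew or irregular "bumps" in the raw-score distribution carry over unchanged.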
Table 7.7. Z Scores and Their Approximate Percentiles in a Normal Distribution

 Z     Approximate Percentile
-2            2
-1           16
 0           50
+1           84
+2           98

Figure 7.12. Similar standard scores fall at different percentile ranks for frequency distributions with different shapes.

7.6 Normalized Scores

The transformation to normalized scores involves forcing the distribution of transformed scores to be as close as possible to a normal distribution by smoothing out, stretching, or condensing irregularities and departures from normality in the
raw-score distribution. This transformation can be reasonably applied if the test developer believes that the underlying trait has a normal distribution and that the nonnormality of the raw-score distribution represents error due to sampling or test-construction problems.

The normalization process involves several steps:

1. Transform the raw scores to percentiles.
2. Find the standard score in the normal distribution corresponding to each percentile.
3. (Optional) Transform these standard scores to standardized scores with a desired mean and standard deviation.

To illustrate this process, we will normalize the raw scores given in Table 7.8 to have a transformed-score mean of 100 and standard deviation of 10. The column labeled Z gives the score in the standard normal distribution corresponding to each percentile, obtained by referring to the Appendix; Z is a normalized score with a mean of 0 and a standard deviation of 1. The last column is obtained by the formula $Y = 10Z + 100$, which gives the desired normalized and standardized scores. In the first row, a raw score of 118 has been transformed to a normalized and standardized score of 114.1.

Table 7.8. Calculation of Normalized Scores

  X                                                    Z              Y
 Raw                  Cumulative   Percentile     Normalized    Normalized and
Score    Frequency    Frequency    for Trait    Standard Score  Standardized Score
 118         4           24           92            1.41            114.1
 117        10           20           62             .31            103.1
 116         8           10           25            -.67             93.3
 115         2            2            4           -1.75             82.5

Two normalized scores are in common use: T scores and stanines. T scores are normalized scores with $\mu = 50$ and $\sigma = 10$. Nonnormalized, standardized scores with $\mu = 50$ and $\sigma = 10$ are called T scores by some test publishers, so the test user should read the manual carefully to determine whether the "T scores" are normalized. If the raw-score distribution is approximately normal, then the normalized and nonnormalized standardized "T scores" will be approximately equal.

Stanines are one-digit normalized scores. They have a mean of 5 and a standard deviation of approximately 2; consequently, the difference between adjacent stanine scores is approximately one half of a standard deviation. The main advantages to the use of stanines are that their distributions are approximately normal and that each stanine involves only one number. Stanine scores are useful for rough screening of examinees, but because there are only nine different stanine scores, fine discriminations among examinees can't be made with them. The percentile ranks and the percentile ranges for each stanine are given in Table 7.9. Transformations to stanines can be done by calculating the percentile rank for any raw score and then referring to Table 7.9. For example, a score at the 83rd percentile would be transformed to a stanine of 7, and a score at the 12th percentile would be transformed to a stanine of 3.

Table 7.9. Percentile Ranks and Ranges Corresponding to Stanines

Stanine   Percentile Rank   Percentile Range
   9           98               96-100
   8           94.5             89-96
   7           83               77-89
   6           68.5             60-77
   5           50               40-60
   4           31.5             23-40
   3           17               11-23
   2            5.5              4-11
   1            2                0-4

The advantage of transforming to normalized scores is that the transformed distribution has a well-known form that is easily interpretable and is amenable to common statistical manipulations. Scores on different tests, if normalized and converted to the same mean and standard deviation, become directly comparable, thus avoiding the complications involved when frequency distributions have different shapes. It is also easy to convert any normalized score to its equivalent percentile.

The use of normalized scores may not be reasonable if the underlying trait has a very nonnormal distribution. For example, if a score distribution is bimodal due to the presence of two disparate types of examinees (see Figures 6.12 and 6.13), it would not make sense to normalize the distribution. Also remember that normalized scores do not have a truly normal distribution. The normal distribution is continuous from negative infinity to positive infinity. However, normalized scores, because they are based on raw scores, are discrete and generally will fall within three standard deviations of the mean. Usually the problem of discreteness is not serious, especially if there are a large number of scores and the normalized distribution is fairly well approximated by the continuous normal curve. However, if the raw-score distribution is highly skewed, small raw-score differences between extreme scores may be exaggerated or compressed by the normalization. A last problem is that the transformed scores, with their approximately normal distribution, may lead the test user to believe that the test yields "perfect, normal" scores. Normalized scores based on a poor test (for example, a test with an inappropriate difficulty level or with poor discrimination among examinees) will not be very useful, despite their apparently pleasing, approximately normal distribution.
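The three normalization steps and the stanine lookup can be sketched in Python as follows. This is only an illustration under stated assumptions: the percentile of a raw score is taken as the percentage of examinees below it plus half of those at it, rounded to a whole number (a convention that reproduces the Percentile column of Table 7.8); percentiles are clamped to 1-99 so the inverse normal lookup stays finite; and a percentile rank that falls exactly on a boundary in Table 7.9 is assigned to the higher stanine. The names `normalize`, `stanine`, and `STANINE_CUTOFFS` are hypothetical, and `statistics.NormalDist` stands in for the printed normal table in the Appendix.

```python
from bisect import bisect_right
from statistics import NormalDist

# Boundaries between adjacent percentile ranges in Table 7.9.
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]


def normalize(freqs, new_mean=100.0, new_sd=10.0):
    """Normalize raw scores given as {raw_score: frequency}.

    Returns rows (raw, percentile, z, y), lowest raw score first.
    """
    n = sum(freqs.values())
    below = 0
    rows = []
    for raw in sorted(freqs):
        f = freqs[raw]
        # Step 1: percentile at the midpoint of the raw score (assumption).
        # Python's round() sends 62.5 to 62, matching Table 7.8.
        pct = round(100 * (below + f / 2) / n)
        pct = min(max(pct, 1), 99)  # keep inv_cdf finite (assumption)
        # Step 2: standard score with that percentile in a normal distribution.
        z = NormalDist().inv_cdf(pct / 100)
        # Step 3 (optional): rescale to the desired mean and SD.
        y = new_sd * z + new_mean
        rows.append((raw, pct, round(z, 2), round(y, 1)))
        below += f
    return rows


def stanine(percentile_rank):
    """Look up the stanine for a percentile rank using Table 7.9."""
    return bisect_right(STANINE_CUTOFFS, percentile_rank) + 1


# Frequencies from Table 7.8.
for row in normalize({118: 4, 117: 10, 116: 8, 115: 2}):
    print(row)

print(stanine(83), stanine(12))  # 7 3
```

Run on the Table 7.8 frequencies, the sketch gives the same percentiles, Z values, and Y values as the table, and the two stanine calls match the examples in the text.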
7.7 Corrections for Guessing and Omissions

Transformations of scores can also be made to adjust for the effects of guessing and the effects of omitting items. These transformations can aid in the interpretation of an examinee's performance and in the comparison of the performances of different examinees. Transformations that take into account guessing or omissions traditionally have been called formula scores.

On multiple-choice tests, examinees can get an item correct, without knowing the right answer, simply by guessing. Guessing is not a problem on personality or attitude tests, where there are no right answers, but it can cause concern on aptitude, performance, or achievement tests. If there are three possible answers for each question on a test, an examinee who simply guesses at random has a probability of 1/3 of getting each item correct. On a 30-item test, an examinee who answered randomly would be expected, on the average, to obtain a score of (30)(1/3) = 10.

It is possible to estimate the effects that guessing has on a test score and correct or adjust the observed score accordingly. This procedure involves estimating the number of items that the individual would have gotten correct if he or she had not guessed. Suppose there are $A$ options (answer choices) for each item on an $N$-item multiple-choice test, and assume that the probability of guessing correctly is $1/A$. If an examinee guesses on $G$ of the items, the expected number of items guessed correctly is $G/A$, and the expected number of items guessed incorrectly is $G - G/A$. If we assume that an examinee gets an item wrong only through incorrect guessing, then the number of wrong responses, $W$, equals $G - G/A$, so $G = AW/(A - 1)$. The number of items that are guessed correctly is $G/A = W/(A - 1)$. If an examinee got $X$ items correct on a test, we can estimate that he or she got $F_1$ items correct without guessing, where

$$F_1 = X - \frac{G}{A} = X - \frac{W}{A - 1}. \tag{7.2}$$

$F_1$ is the first formula score discussed in this section, and $W/(A - 1)$ is the correction for guessing. If the examinee answers all items, $F_1$ is linearly related to the number of right responses, $X$, by the formula

$$F_1 = X - \frac{N - X}{A - 1} = \left( \frac{A}{A - 1} \right) X - \frac{N}{A - 1}. \tag{7.3}$$

When there are no omitted items, $F_1$ is a linear function of $X$, is perfectly correlated with $X$, and has the same reliability and validity as $X$. In other words, the correction for guessing has no important effect when all the items are answered. However, when there are omitted items, $F_1$ and $X$ are not perfectly correlated.

Suppose that an examinee gets 16 items correct and 24 items wrong on a 40-item, five-option multiple-choice test. We can estimate that this examinee got $W/(A - 1) = 24/4 = 6$ items correct by guessing and $16 - 6 = 10$ items correct without guessing. Notice that this is just an estimate of the number of items correctly answered without guessing. Some examinees may be "lucky" or "unlucky" and may guess correctly more or less than $1/A$ of the time. Given limited information and a simple model, the formula score $F_1$ uses the best estimate available for assessing the effects of guessing.

Because examinees sometimes don't answer all the items on a test, we need a formula score that takes item omissions into account. Otherwise, examinees who randomly guess on some items can obtain higher scores than those examinees who omit those items. Let $N$ be the total number of items in a test and $B$ be the number of items that are blank or not answered. $X$, $W$, and $A$ are defined as before. Then,

$$N = X + W + B; \tag{7.4}$$

the total number of items in the test equals the number of items that are right plus the number of items that are wrong plus the number of items that are blank or omitted. The number-right scores, $X$, for examinees who answer all items are comparable to $F_2$ scores for examinees who omit items, where

$$F_2 = X + \frac{B}{A}. \tag{7.5}$$

The second formula score, $F_2$, is the estimated number of items that would be correct if every blank item were replaced by a random guess. For example, if an examinee had 10 right answers and 2 blanks on a test with four options per item, that examinee's formula score would be $10 + 2/4 = 10.5$.

The scoring formula $F_2$ is a linear function of the scoring formula $F_1$ (Equation 7.2), since

$$F_2 = X + \frac{N - X - W}{A} = \left( \frac{A - 1}{A} \right) X - \frac{W}{A} + \frac{N}{A} = \left( \frac{A - 1}{A} \right) F_1 + \frac{N}{A}.$$

Scores based on scoring formula $F_2$ are perfectly correlated with and have the same reliability and validity as scores based on $F_1$. $F_2$, like $F_1$, is perfectly correlated with the number-right score, $X$, only if all of the examinees answer all of the items. That is, if $B = 0$, then $F_1$, $F_2$, and $X$ are perfectly correlated. When there are omissions, $F_1$ and $F_2$ are not perfectly correlated with $X$.

There has been a long controversy about the propriety and usefulness of formula scores. In most cases, examinees don't guess randomly on items they don't know. Usually some of the possible answers can be eliminated as clearly impossible or untrue; therefore, the probability of guessing correctly among the remaining alternatives is greater than $1/A$. Also, examinees may differ in their tendencies to guess on or omit items. If the test directions clearly state that examinees are to omit items only when they feel that they would have to guess randomly, and if the
examinees follow these directions, then a formula score is appropriate (Lord, 1975). However, a formula score is not appropriate theoretically when the test directions either give no instructions about omissions or state that an examinee's score is the number of questions answered correctly. If the theoretical results about the effects of formula scores on a test's reliability and validity are based on the assumption of random guessing, then these results are suspect unless it can be verified that examinees do omit those items on which they would have to guess randomly.

Empirical examinations of the reliability and validity of formula scores do not present a clear case either for or against formula scoring. Sabers and Feldt (1968) altered test-taking directions with respect to guessing by giving an admonition against guessing versus giving no instructions about guessing. They found that the directions and formula scoring did not alter the reliability of selected mathematics tests. Traub, Hambleton, and Singh (1969) found that changes in test directions and formula scoring had small effects on test reliability. Diamond and Evans (1973) reviewed a number of studies on formula scoring and concluded that, if it has any consistent effect, formula scoring tends to slightly increase test validity. In short, evidence about the usefulness of formula scoring is not clear cut. Formula scores may help, hinder, or not affect test reliability and validity.

The value of formula scoring depends on many factors: the difficulty of the test, the probability of correctly guessing answers, the variability of examinees' tendencies to omit items, and the reliability and validity of the tendency to omit items. Appropriate evaluation of a formula score requires that the test directions be compatible with the scoring formula and that the examinees understand the implications of the test directions and act accordingly. For example, if the test directions say that the examinees should not guess but these directions aren't followed, then the meaning of omissions and formula scores becomes unclear. Similarly, if the examinees are directed not to omit items but they omit them anyway, then the meaning of the number-right scores becomes unclear. The test user must make a careful evaluation of the meaning of omissions for the test at hand and score it accordingly. In most cases, the effect of formula scoring on the reliability and validity of a test must be evaluated empirically.

7.8 Equal-Interval Scales

In Section 2.2, the concept of equal intervals was introduced. A set of scores has equal intervals if any given difference between scores always represents the same amount of difference in the trait being measured. For example, if a one-point difference in spelling-test scores between the scores of 100 and 101, 101 and 102, and so on always represents the same amount of increase in spelling ability, the test has equal intervals. A test whose scores form equal intervals is particularly useful for measuring growth or change in a trait or behavior. Typically, a test's raw scores do not display equal intervals, but sometimes the raw scores can be transformed into a set of scores (a scale) that does have equal intervals.

In order to form an equal-interval scale (usually called an interval scale) from a test's raw scores, the test must measure a trait that has equal intervals. The scale developer first makes predictions about how the trait is related to test performance and then examines the accuracy of these predictions. If the predictions are accurate, the scale developer can transform the raw scores into an interval scale. This section examines one commonly used method for constructing an interval scale: Thurstone's absolute scaling method. Chapters 8 and 11 describe a number of other methods for forming interval (and ratio) scales.

Thurstone's absolute scaling method (which is described by Gulliksen, 1950) hypothesizes that the continuous trait being measured by a test has a normal distribution in some specified population. It also hypothesizes that raw scores on the test are monotonically related to trait values. (A monotonic relationship is one in which every increase in the raw score reflects an increase in the trait value.) If these hypotheses are true and the raw scores are normalized, then the normalized scores have equal intervals. In order to examine the accuracy of these hypotheses, the test is administered to two samples of examinees, each of which is assumed to have a normal distribution of trait values. For example, a spelling test might be administered to a national sample of seventh-grade students and a national sample of eighth-grade students. The resulting raw-score distributions are normalized within each sample. Each raw score is thereby transformed into one normalized score in one sample and another normalized score in the other sample. If these two sets of normalized scores are linearly related, then the normalized test scores obtained from either sample form an equal-interval scale.

Table 7.10. Testing Results for a Hypothetical Spelling Test

                Seventh Grade                       Eighth Grade
 Raw
Score   Frequency  Percentile  Z Score    Frequency  Percentile  Z Score
  0         4          1        -2.3          4          1        -2.3
  1        10          5        -1.7          6          4        -1.8
  2        18         12        -1.2          8          7        -1.5
  3        30         24         -.7         14         13        -1.1
  4        38         41         -.2         18         21         -.8
  5        38         60         +.2         24         31         -.5
  6        30         77         +.7         26         44         -.2
  7        18         89        +1.2         26         57         +.2
  8        10         96        +1.7         24         69         +.5
  9         2         99        +2.3         18         80         +.8
 10         2         99.5      +2.6         14         88        +1.2
 11         0        100         +∞           8         93        +1.5
 12         0        100         +∞          10         98        +2.0
 13         0        100         +∞           0        100         +∞

Thurstone's absolute scaling method can be illustrated with a simple example. Table 7.10 contains raw-score frequency distributions, percentiles, and normalized scores (Z scores) for a hypothetical spelling test administered to seventh- and eighth-grade students. Figure 7.13 displays a plot of the Z scores obtained from
the two samples. (The plot does not include points containing $+\infty$.) Since the Z scores for the two grades are linearly related, the Z scores for either grade can be used as an equal-interval scale. Of course, the Z scores also could be standardized to have any desired mean and standard deviation.

Figure 7.13. Relationship between Z scores from Table 7.10. (Horizontal axis: Z Score for Seventh Grade.)
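The text judges linearity by inspecting the plot. As a rough numerical companion, and only as an assumption about what a reasonable check might look like, the sketch below fits a least-squares line to the paired Z scores of Table 7.10 and reports the Pearson correlation. The helper `linear_fit` and the two list names are hypothetical; rows containing $+\infty$ are left out, as in the plot.

```python
from math import sqrt

# Z scores from Table 7.10 for raw scores 0-10 (rows with +infinity omitted).
Z_SEVENTH = [-2.3, -1.7, -1.2, -0.7, -0.2, 0.2, 0.7, 1.2, 1.7, 2.3, 2.6]
Z_EIGHTH = [-2.3, -1.8, -1.5, -1.1, -0.8, -0.5, -0.2, 0.2, 0.5, 0.8, 1.2]


def linear_fit(xs, ys):
    """Return (slope, intercept, r) for the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)  # Pearson correlation
    return slope, intercept, r


slope, intercept, r = linear_fit(Z_SEVENTH, Z_EIGHTH)
print(f"eighth-grade Z = {slope:.2f} * seventh-grade Z + {intercept:.2f}; r = {r:.3f}")
```

A correlation very close to 1 is consistent with the linear relationship described above. A high correlation alone does not rule out mild curvature, so the plot remains the primary evidence.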
Equal-interval scales can be used in arithmetical manipulations, such as the
calculation of means, standard deviations, and so on. The major disadvantage of
equal-interval scales is that the process involved in deriving them is so complex that
some test users may be unable to understand the scores or to communicate their
understanding.
Figure 7.14 displays how an equal-interval scale obtained from Thurstone's absolute scaling method might be related to other scores.

7.9 Equating Test Scores

When alternate forms of a test are constructed, it is often desired to transform the test scores to make them equivalent, if possible. This procedure is called test equating. Lord (1977b) defines test forms as being equated when it would be a matter of indifference to each examinee which test form he or she takes. Thus, for tests to be equated they must measure the same trait, and every level of the trait must be measured with equal accuracy by the two tests. Unequally reliable tests cannot be equated. When test forms are successfully equated, the examinees' scores will not be affected by the particular forms administered to them; consequently, the forms can be used interchangeably. Similarly, it is often desired to equate different levels of a test so that an examinee will get the same score regardless of whether an easier or harder level of the test is administered. This is sometimes called articulation of test levels or vertical equating, in contrast to the horizontal equating of test forms within a specified difficulty level. Several procedures for equating tests will be very briefly described. These procedures determine which scores on two tests can be said to be equivalent; the problems of unequal reliabilities of the tests will not be discussed here. More detailed information about test equating methods is available from Angoff (1971) and Lord (1977b).

The two test forms to be equated can be administered to the same sample of examinees. The major disadvantage of this equating method is that the scores on the test that is administered second can be affected by practice or fatigue. Such effects can be examined by reversing the order of administration of the tests for a random half of the sample. If the raw scores from the two tests are linearly related (which can be determined by examining scatterplots of the scores), the score distributions for the two tests will have very similar shapes, and standard scores for the two tests will be equivalent. This results in a linear equating of the two tests. If the scores for the two tests are nonlinearly related, which is typically the case, standard scores for the tests will not be equivalent. In such a case, a table or graph would be used to show how a score on one of the tests would be transformed to a score on the other test.

Two tests can also be equated by administering them to different, but
equivalent, samples. Equivalent samples can be obtained by very carefully controlled sampling procedures. Since the samples are equivalent, scores that are at the same percentile in the two samples are said to be equivalent. For example, suppose that a score of 10 on Test A is at the 25th percentile for Sample 1, a score of 12 on Test B is at the 22nd percentile for Sample 2, and a score of 13 on Test B is at the 28th percentile for Sample 2. Using linear interpolation (the 25th percentile lies halfway between the 22nd and the 28th), a score of 10 on Test A is said to be equivalent to a score of 12.5 on Test B. This is called the equipercentile method of equating. If the two distributions of scores differ only in terms of their means and standard deviations, the equipercentile method will produce a linear equating of the scores. If the two distributions differ in more than their means and standard deviations, a nonlinear equating will result.

The advantage of the equipercentile method of equating is that each examinee only needs to take one test instead of two. The main disadvantage of the equipercentile method is the need to obtain equivalent samples. Also, the equipercentile method does not determine whether the two tests should be equated; it simply equates them. Using this method, it is possible to appear to equate tests that measure very different traits (for example, vocabulary and mathematical computation).

Thurstone's method of absolute scaling (Section 7.8) can also be used to equate tests. To use this method, one must assume that the trait being measured by the two tests is normally distributed within each of two equivalent samples. Thurstone's method has the advantage of involving the administration of only one test to each examinee, and it allows an examination of the degree to which the two tests are equatable. The major disadvantage of this method is the necessity of obtaining samples in which the trait is normally distributed; this method typically involves expensive sampling procedures.

In another method of equating, the two tests are administered to different and not necessarily equivalent samples. In addition, both groups take a test called an anchor test that measures the same trait as the tests being equated. Scores on the anchor test are used to control for differences in the two groups that could affect the equating. The latent-trait models described in Chapter 11 are particularly useful for this type of equating. The advantage of this method is not having to obtain equivalent samples; the disadvantages are the need to administer the anchor test and the need to deal with the statistical complexities of controlling for differences between the equating groups.

Vertical equating of tests can be more difficult than horizontal equating, because different levels of a test typically involve different types of items. For example, Level 1 of a mathematics-achievement test may involve addition of single-digit numbers, but Level 2 may involve addition of two-digit numbers as well. It may be difficult to equate these tests, and, even if the test scores appear to equate statistically, there are conceptual problems in interpreting "equivalent" scores from the two tests. Also, in vertical equating it is common to link many levels together to form one scale. Level 1 is equated to Level 2, Level 2 is equated to Level 3, and so forth. Scores on Level 1 are equated to scores on Level 3 only indirectly, through Level 2. It may be impossible to equate Levels 1 and 3 directly, because it may be that few examinees who perform reasonably well on Level 1 can answer any of the questions on Level 3, whereas most of the examinees who do reasonably well on Level 3 get perfect or nearly perfect scores on Level 1. Thus, it may be impossible to double-check the equating of scores from Levels 1 and 3. When vertical equating is accomplished by the successive equating of adjacent test levels, the use of scores from a wide variety of test levels in one data analysis may be inappropriate or may produce results that are difficult to interpret.

7.10 Summary

Most raw scores are transformed to other numbers so that they can be more easily interpreted. There are two main types of transformations of raw scores: linear and nonlinear. Linear transformations do not alter the basic shapes of distributions or the size of correlations. Nonlinear transformations can alter both of these properties of scores.

A popular nonlinear transformation is the percentile rank. The percentile rank of any trait value is the percentage of examinees in a norm group who obtain scores at or below the trait value in question. Percentiles are a popular transformation and are relatively easy to calculate and interpret. However, it can be misleading to deal with means or standard deviations for percentiles. Moreover, since the distribution of percentile ranks is rectangular in shape, common statistical techniques that assume normality should not be used with them. If the test is short, or if a large number of examinees obtain the same raw score, the transformation to percentile ranks may lead to exaggerated interpretations of small differences in raw scores.

Other types of nonlinear transformations result in age and grade scores. Age and grade scores are common transformations with serious limitations. They are ordinal scales, and arithmetical operations with them (for example, calculations of means) can be misleading. It often is forgotten that half of the norm group scores above its nominal age or grade level, and half scores below. Age and grade scores can reflect curriculum or environmental differences, and comparisons using age and grade scores based on different tests may cause misunderstandings. Age and grade scores can lead users to perceive equalities that aren't real.

Another nonlinear transformation is the expectancy table. Expectancy tables give the conditional distribution of criterion scores for various levels of test performance. If an explicit criterion and reasonable sample sizes are available, expectancy tables are easy to construct and especially useful for counseling applications. As an alternative, regression techniques offer similar advantages.

Standard and standardized scores are frequently used linear transformations of raw scores. If the distributions of two raw scores are different in shape, comparisons of their corresponding standard or standardized scores can be misleading.

Normalized scores are nonlinear transformations created by smoothing the raw-score distribution into an approximately normal shape. T scores and stanines are common normalized scores. Normalization transforms scores to a known, easily
interpretable distribution that is required by many common statistical tests. This transformation is not reasonable if the underlying trait distribution is not normal, and it is not a cure for a poorly designed test.

Formula scores ($F_1$ and $F_2$) are designed to improve the information provided by raw scores about the performance of examinees who guess on or omit items. $F_1$ estimates the number-right score if correct answers due to guessing are eliminated. $F_2$ estimates what the number-right scores would have been if examinees had randomly guessed on all omitted items. Experimental evidence for the value of calculating such scores is contradictory; consequently, the test developer must evaluate each of these transformations empirically for use with specific tests for specific purposes.

Equal-interval scales reflect the trait being measured on an interval level of measurement. If the required assumptions are reasonable, equal-interval scales are particularly useful in the assessment of growth or change. Thurstone's absolute scaling method is a common method of obtaining an equal-interval scale.

Tests are equated so that their scores can be used interchangeably. Two tests can be equated by administering both of them to one sample of examinees and determining their relationship. If they are linearly related, standard scores based on the tests are equivalent. If they are nonlinearly related, a table or graph can be used to show which scores are equivalent. Tests can also be equated by administering different tests to each of two equivalent samples and equating scores with the same percentiles in the two groups (the equipercentile method). Finally, tests can be equated using Thurstone's absolute scaling method or by the use of an anchor test.

7.11 Vocabulary

age equivalent
age scores
anchor test
articulation of test levels
correction for guessing
criterion-referenced testing
equal-interval scale
equipercentile equating
expectancy table
$F_1$
$F_2$
formula score
grade equivalent
grade scores
horizontal equating
interval scale
linear transformation
mental age
monotonic relationship
monotonic transformation
nonlinear transformation
normalized scores
norm group
norm-referenced testing
percentile
percentile rank
percentile score
raw score
rectangular distribution
reference group
standardized score
standard score
stanine
test equating
Thurstone's absolute scaling
transformed score
T scores
vertical equating

7.12 Study Questions

1. Why are raw scores transformed?
2. What are the differences between linear and nonlinear transformations?
3. Explain how to calculate the percentile for a given trait value and the trait value corresponding to any given percentile. What are the advantages and disadvantages of the percentile transformation?
4. How are age and grade scores calculated? What are the advantages and disadvantages of age and grade scores?
5. How are expectancy tables constructed? What are the advantages and disadvantages of expectancy tables?
6. How can a regression equation be used to provide information similar to that provided by an expectancy table?
7. Explain how to calculate standard and standardized scores. What are the advantages and disadvantages of standard and standardized scores?
8. Explain how to normalize a set of scores. What are the advantages and disadvantages of normalized scores?
9. Assuming that the raw-score distribution in the following table is approximately normal with $\mu_X = 48$ and $\sigma_X = 2$, verify the column entries.

 Raw
Score      Z     Percentile    T    Stanine
  54      3.0       99+       80       9
  53      2.5       99        75       9
  52      2.0       98        70       9
  51      1.5       93        65       8
  50      1.0       84        60       7
  49       .5       69        55       6
  48      0         50        50       5
  47      -.5       31        45       4
  46     -1.0       16        40       3
  45     -1.5        7        35       2
  44     -2.0        2        30       1
  43     -2.5        1        25       1
  42     -3.0        1-       20       1

10. Describe the two formula scores developed as corrections for guessing and omissions.
11. What problems are involved in the use of formula scores? How useful are formula scores?
12. What is an equal-interval scale? How can equal-interval scales be constructed?
13. Describe four procedures for equating test scores. What are the advantages and disadvantages of each procedure?

7.13 Computational Problems

1. Using the data in the following table:
   a. Calculate the percentile ranks for trait values of 86, 92, and 107.
   b. Calculate the trait values at the 3rd, 22nd, and ?5th percentiles.

Score       Frequency
110-119        10
100-109         5
90-99           8
80-89          12
70-79          18

2. On a cognitive-maturity test, the median score for six-year-olds is 23 points and
the median score for seven-year-olds is 32 points. Estimate the mental age (age
score) of a child earning 28 points on this test.
3. Two potential students for a statistics class take a pre-course exam. Micah gets
21 points, and Becky gets 13 points. Using the expectancy table (Table 7.6),
what advice would you give each student?
4. (Optional) Using the definition of regression contained in Section 2.14, obtain
the expected course grade for each of the six pre-course test-score ranges in
Table 7.6.
5. The raw scores for a test have a mean of 83 and a standard deviation of 12.
a. What is the standard score corresponding to a raw score of 89?
b. What is the standardized score for a raw score of 80 if the scores are
standardized to be comparable to CEEB scores?
6. Transform the following scores to T scores and then to stanines.

Raw Score   Frequency
   37            6
   36          124
   35           58
   34           10
   33            2

7. Carey obtains 20 correct answers on a 39-item four-option multiple-choice test.
a. Estimate the number of items Carey got correct by guessing.
b. Estimate the number of items Carey got correct without guessing.
c. What assumptions were used in making these estimates?
8. Two examinees take a true/false test with 50 items. Examinee A gets 40 items
correct and omits four items. Examinee B gets 35 items correct and omits 12.
Estimate the number of items each examinee would have gotten correct if their
omissions were replaced with random guesses. Evaluate the relative performance
of the two examinees.
9. a. Using the test-score frequency distributions in the following table, create an
equal-interval scale using Thurstone's absolute scaling method.
b. Do the data appear to form a good equal-interval scale?

             Frequency
Score     Women     Men
  0         1        1
  1         2        5
  2         4        7
  3         5        5
  4         4        1
  5         3        1
  6         1        0

10. Assume the data in the following table are the result of administering two
different tests to two different, but equivalent, samples.
a. Equate the test scores (giving the Test B score equivalent to each Test A
score) using the equipercentile method.
b. Equate the test scores using Thurstone's absolute scaling method.
c. Compare the results from a. and b.

      Test A                  Test B
Score   Frequency       Score   Frequency
 10         5            105        10
 11        10            115        15
 12        20            125        15
 13        20            135        30
 14        25            145        20
 15        10            155         5
 16        10            165         5
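
Hand computations of standard scores, T scores, percentiles, and stanines (as in Study Question 9, where the raw scores are taken to be approximately normal with a mean of 48 and a standard deviation of 2) can be checked numerically. The following Python sketch is one way to do so and is not part of the original text. The function names are illustrative, the percentile is read from the normal curve (which is only appropriate under the normality assumption stated in that question), and the stanine rule assumes the usual half-standard-deviation bands centered on a stanine of 5.

```python
import math


def z_score(raw, mean, sd):
    """Standard score: distance from the mean in standard-deviation units."""
    return (raw - mean) / sd


def t_score(z):
    """T score: linear transformation of z to mean 50, standard deviation 10."""
    return 50 + 10 * z


def normal_percentile(z):
    """Percent of a normal distribution falling below z (assumes normality)."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))


def stanine(z):
    """Stanine from z, assuming half-SD-wide bands centered on 5, clipped to 1..9."""
    return max(1, min(9, math.floor(2 * z + 5.5)))


# Values taken from Study Question 9: mean 48, standard deviation 2.
MEAN, SD = 48, 2
for raw in range(54, 41, -1):
    z = z_score(raw, MEAN, SD)
    print(raw, z, round(normal_percentile(z)), t_score(z), stanine(z))
```

Rounded to whole numbers, the most extreme rows print percentiles of 100 and 0, which a table would typically report as 99+ and 1-. For a skewed or otherwise non-normal distribution, percentiles should be computed from the observed frequencies rather than from the normal curve.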