Introduction to Measurement Theory

Mary J. Allen
Wendy M. Yen

PRESS, INC.
Long Grove, Illinois

7 Transforming and Equating Test Scores

7.1 Introduction

David took a test and received a score of 32. How well did he do? The raw score (that is, the observed score) does not carry enough information to answer this question. However, if we knew something about the distribution of raw scores, we could get a better idea of what David's score means. For example, if we knew that David's score fell at the 90th percentile (meaning that he scored as well as or better than 90% of the examinees who took the test), we would have a much better idea about how well he did. By dealing with a percentile rather than a raw score, we have made a raw-score transformation. In this chapter we will discuss several different types of raw-score transformations that aid in the interpretation of test scores. Chapter 8 discusses the levels of measurement (nominal, ordinal, interval, and ratio) of the scores produced by these transformations.

Because there are a number of disadvantages in retaining the raw scores on any test, most scores are transformed before being reported. A major problem with raw scores was just mentioned: a raw score by itself is difficult to interpret. Even if raw scores are accompanied by information about the number of items on the test, an isolated raw score does not give any information about how one examinee's performance is related to the performance of the other examinees. If a raw-score distribution is irregular (for example, if it is skewed or bimodal), common statistical techniques that require normality cannot be applied reasonably. Another problem with the use of raw scores is that raw scores may not be comparable across tests. For example, if Gene has a raw score of 60 on Test 1 and a raw score of 20 on Test 2, we cannot easily assess his relative performance on the two tests, particularly if the distributions of scores for the two tests have quite different shapes. All of these problems have led to the development and use of common transformations that result in more easily interpretable reported scores. Common forms of expressing transformed scores are: (1) percentiles, (2) age and grade scores, (3) expectancy tables, (4) standard and standardized scores, (5) normalized scores, (6) formula scores, and (7) equal-interval scales.

Transformations of scores are of two basic types: linear and nonlinear. A linear transformation can be defined by a linear equation of the form Y = aX + b, where a and b are constants, X is the raw score, and Y is the transformed score. In making this transformation, every examinee's X is transformed to a Y using the linear rule. If the transformation equation is known, the transformed score corresponding to any raw score can be calculated easily. For example, if Y = 3X - 2, the transformed score corresponding to a raw score of 12 is 3(12) - 2 = 34. When raw scores are linearly transformed, the shape of the distribution of the transformed scores is the same as the shape of the distribution of the raw scores. For example, if the raw-score distribution is skewed to the right, the linearly-transformed-score distribution also will be skewed to the right. Furthermore, linear transformations do not alter the size of correlations (see Section 2.8). The construction of standard and standardized scores and some formula scores involves a linear transformation of raw scores.

A nonlinear transformation cannot be expressed in the form of a linear equation. For example, Y = X² is a nonlinear transformation of X. In general, nonlinear transformations will change correlations and the shape of the score distribution, so that the transformed-score distribution can be very different from the raw-score distribution. All of the seven transformations listed previously, except for standard and standardized scores and some formula scores, involve nonlinear transformations.

The raw-score transformations (except for some formula scores) that are discussed in this chapter are all monotonic transformations of the raw scores. This means that, if one examinee's raw score is greater than another examinee's raw score, the first examinee will have a greater transformed score than the second examinee. In other words, monotonic transformations will not alter an examinee's rank order in the sample. If the scores are to be used in a ranking or sorting situation, there is nothing to be gained by using these transformations. If, however, we want to communicate the meaning of a score more effectively or to perform special statistical manipulations with the scores, one or another of the transformations can be useful.
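The properties of linear, nonlinear, and monotonic transformations described above can be checked numerically. The following is a minimal Python sketch, not part of the original text. The score lists `raw` and `other` are hypothetical, and `pearson` and `rank_order` are illustrative helper names. It applies the linear example Y = 3X - 2 and the nonlinear example Y = X² and compares correlations and rank order.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_order(scores):
    """Examinee positions sorted from lowest to highest score (assumes no ties)."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical data: raw scores on one test and scores on a second measure.
raw = [12, 15, 9, 20, 17]
other = [30, 41, 28, 39, 45]

linear = [3 * x - 2 for x in raw]  # Y = 3X - 2, the linear example from the text
squared = [x ** 2 for x in raw]    # Y = X^2, the nonlinear example from the text

print(linear[0])  # raw score 12 -> 34
print(round(pearson(raw, other), 4), round(pearson(linear, other), 4))
print(round(pearson(squared, other), 4))
print(rank_order(raw) == rank_order(linear) == rank_order(squared))
```

With these made-up scores, the correlation with the second measure is the same before and after the linear transformation, while squaring changes it. Both transformations are monotonic for positive scores, so the rank order of the examinees is unchanged in each case.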

The raw-score transformations discussed in this chapter transform scores in order to make them more meaningful. In most cases, this meaning is derived by comparing an examinee's performance to the performance of other examinees; that is, the scores derive meaning through reference to a norm group. This technique is called a norm-referenced approach. An alternative approach, criterion-referencing, is discussed in Section 10.5. In criterion-referenced testing, an attempt is made to determine whether the examinee has reached a certain specific criterion performance or mastered a specific task (for example, can the examinee subtract single-digit numbers?). In criterion-referenced testing, a raw score (number-right score) can be meaningful and does not require transformation.

7.2 Percentiles

Percentiles are defined with respect to a norm, or reference, group. A norm group is a specified sample of examinees, for example, a certain sample of sixth-grade students randomly chosen from schools across the United States. If it were possible to determine the actual trait values for a continuous trait (for example, intelligence or hyperactivity) for each person in a norm group, then we could determine the percentage of people with trait values less than or equal to any specified value. The percentile rank (or percentile or percentile score) of a trait value is defined as the percentage of people in a norm group who have trait values less than or equal to that particular trait value. For example, if 75 out of 100 people in the norm group have spelling abilities less than or equal to 17.3, then 17.3 is assigned a percentile of 75. In practice, we cannot obtain trait values, but we can obtain observed test scores. We can assume that each test score represents a range of trait values. For example, an observed test score of 17 represents a range of trait values from 16.5 to 17.5. We can further assume that every trait value in the range represented by an observed test score is equally likely to occur. For example, for an observed test score of 17, all trait values between 16.5 and 17.5 are equally likely to occur. Thus, if six examinees receive an observed test score of 17, three examinees are assumed to have trait values less than 17, and three are assumed to have trait values greater than 17.

Using these assumptions, we can use frequency distributions of observed test scores to estimate the percentiles associated with various trait values. Consider the score distribution in Table 7.1. For each score, the frequency (number) of examinees obtaining the score appears in the middle column. The next column on the right contains the cumulative frequency of each score, which is the number of examinees who have scores less than or equal to that score. For example, the cumulative frequency of a score of 116 is the frequency of a score of 116 (8) plus the frequency of a score of 115 (2), which equals 10. Using this observed test-score distribution, let's estimate the percentile rank for a trait value of 117. Remembering our assumptions, we recognize that the observed test score of 117 represents a range of trait values from 116.5 to 117.5 and that the six examinees who got a test score of 117 are evenly distributed from 116.5 to 117.5. Therefore, we conclude that 13 examinees had trait values at or below 117: the ten examinees below 116.5 plus the three examinees between 116.5 and 117. (Since we assume the trait distribution is continuous, no one falls at exactly 117. Whether 13 examinees fell at and below or just below 117 is philosophically, but not practically, significant.) Since there are 20 examinees in all, 13/20 is the desired proportion. Since 13/20 = .65, the percentile rank for a trait value of 117 is estimated to be 65.

Using similar logic to calculate the percentile rank for a trait value of 116, we see that there are six examinees at or below a trait value of 116 (two below 115.5 and four between 115.5 and 116). The proportion at or below 116 is 6/20 = .30, so the percentile rank for a trait value of 116 is estimated to be 30. We also see that 18 examinees fall at or below a trait value of 118 (16 below 117.5 plus two between 117.5 and 118), so the percentile rank for a trait value of 118 is estimated to be (100)(16 + 4/2)/20 = 90. Similarly, the percentile rank for a trait value of 115 is estimated to be (100)(0 + 2/2)/20 = 5.

Table 7.1. A Score Frequency Distribution

Score    Frequency    Cumulative Frequency
118      4            20
117      6            16
116      8            10
115      2             2

Table 7.2. Information Table for Calculation of a Trait Value Corresponding to a Percentile

Trait Value    Cumulative Frequency
117.5          16
?              15
116.5          10

Calculations to find trait values corresponding to given percentiles are slightly more tedious but conceptually similar. To estimate the trait value falling at the 75th percentile rank, we need to find the trait value that 75% of the examinees fall at or below. For the data in Table 7.1, there are 20 examinees in all, so the trait value at the 75th percentile must have .75(20) = 15 examinees at or below it. Looking at the cumulative frequencies of the scores and remembering the assumptions, we see that 16 examinees have trait values at or below 117.5 and that ten examinees have trait values at or below 116.5; therefore, the trait value we want must be somewhere between 116.5 and 117.5. To make the calculations easier, we can construct a table like Table 7.2. We see that a trait value of 117.5 corresponds to a cumulative frequency of 16 and that a trait value of 116.5 corresponds to a cumulative frequency

of 10; we want to know the trait value corresponding to a cumulative frequency of 15. We find this trait value using linear interpolation: 15 is 5/6 of the way between 10 and 16, so the trait value we need must be 5/6 of the way between 116.5 and 117.5. The desired trait value is approximately 116.5 + .83 = 117.33. This calculation and its logic are illustrated in Figure 7.1. On the right side of Figure 7.1, we see that we must go up 5/6 of the distance between 10 and 16. We must go up proportionately the same amount on the left side (between 116.5 and 117.5), or 5/6 times 1, which is about .83. This produces the answer of 117.33.

Trait Value    Cumulative Frequency
117.5          16
?              15
116.5          10

$116.5 + \frac{15 - 10}{16 - 10}(1) = 116.5 + \frac{5}{6}(1) \approx 117.33$

Figure 7.1. Illustration of the calculation of a trait value corresponding to a percentile

Occasionally frequency distributions are grouped in terms of score intervals. For example, we may know that 27 examinees earned scores from 1 to 5 without knowing how many examinees earned a score of 1, how many earned a score of 2, and so on. In such a case, we assume that the score interval from 1 to 5 covers trait values from .5 to 5.5 and that every trait value within this interval is equally likely to occur. If this last assumption is false, the percentile calculated from grouped data may differ markedly from the percentile calculated from ungrouped data. For example, when the percentile rank for a trait value of 6 is calculated for the ungrouped data in Table 7.3, it is approximately 28.6. However, for the grouped distribution the calculated percentile rank is approximately 44.6.

Table 7.3. Ungrouped and Grouped Frequency Distributions

Ungrouped               Grouped
Score    Frequency      Score    Frequency
8        12             5-8      31
7         8             1-4       4
6        10
5         1
4         2
3         1
2         1
1         0

Table 7.4. A Grouped Frequency Distribution

Score    Frequency    Cumulative Frequency
50-59     6           50
40-49    12           44
30-39    17           32
20-29    10           15
10-19     2            5
0-9       3            3

Trait Value    Cumulative Frequency
39.5           32
37             ?
29.5           15

$15 + \frac{37 - 29.5}{39.5 - 29.5}(32 - 15) = 27.75$

$\frac{27.75}{50} = .555$

Percentile = .555(100) = 55.5

Figure 7.2. Calculation of the percentile rank for a trait value of 37

When dealing with grouped frequency distributions, the logic of calculating percentiles is the same as it is with ungrouped frequency distributions. For the distribution in Table 7.4, we will calculate the percentile rank for a trait value of 37 and the trait value that falls at the 70th percentile. These calculations are illustrated in Figures 7.2 and 7.3, respectively. As shown in Figure 7.2, there are 15 examinees at or below a trait value of 29.5 and 32 examinees at or below a trait value of 39.5. For a trait value of 37, we go up 7.5 units of a distance of 10 units on the left side of Figure 7.2, so on the right side we must go up 7.5/10 of a distance of 17. We conclude that there are 27.75 examinees at or below a trait value of 37. Since there are 50

examinees in all, 27.75/50 = .555 of the examinees fall at or below a trait value of 37, which corresponds to a percentile rank of 55.5.

In Figure 7.3 we need to find the trait value that 70% of 50 examinees fall at or below, that is, the trait value with a cumulative frequency of 35. There are 44 examinees with trait values at or below 49.5, and there are 32 examinees with trait values at or below 39.5. Therefore, our answer is somewhere between 39.5 and 49.5. On the right side we go up 3/12 of the way, so on the left side we must go up 3/12 of the distance to a score of 42. Figures 7.4 and 7.5 illustrate similar calculations; we find that the percentile rank for a trait value of 26 is 23 and that the trait value at the 20th percentile is 24.5.

Trait Value    Cumulative Frequency
49.5           44
?              35
39.5           32

.70(50) = 35

$39.5 + \frac{35 - 32}{44 - 32}(49.5 - 39.5) = 39.5 + \frac{3}{12}(10) = 42$

Figure 7.3. Calculation of the trait value at the 70th percentile

Trait Value    Cumulative Frequency
29.5           15
26             ?
19.5           5

$5 + \frac{26 - 19.5}{29.5 - 19.5}(15 - 5) = 5 + \frac{6.5}{10}(10) = 11.5$

$\frac{11.5}{50} = .23$

Percentile = .23(100) = 23

Figure 7.4. Calculation of the percentile rank for a trait value of 26

Trait Value    Cumulative Frequency
29.5           15
?              10
19.5           5

.20(50) = 10

$19.5 + \frac{10 - 5}{15 - 5}(29.5 - 19.5) = 19.5 + \frac{5}{10}(10) = 24.5$

Figure 7.5. Calculation of the trait value at the 20th percentile

Percentiles, like any transformed scores, have some advantages and some disadvantages. The primary advantages of percentiles are that they are straightforward to calculate, regardless of the shape of the distribution of observed scores, and that they are easy to interpret. For communicating with people who have little background in statistics, percentiles are probably the most popular and meaningful transformed scores.

There are a number of limitations of percentile scores. Percentiles can be assumed to form ordinal scales (see Sections 2.2 and 8.2); thus, arithmetical manipulations of percentiles (for example, the calculation of means and variances of percentiles) can produce misleading results. Suppose that five seniors from each of two high schools have taken a college entrance exam. The scores for this exam are constructed on an equal-interval scale with a national mean of 420 and a standard deviation of 40. The scores and national percentiles for the two groups appear in Table 7.5. It is clear that the equal-interval scores from the two schools have the same variances. However, the variance of percentiles from School B is much smaller than the variance of percentiles from School A. Use of the variances of the percentiles could lead to inaccurate conclusions about the relative variance in performance for the two schools.

Keep in mind, also, that the distribution of percentiles within the norm group is rectangular, not normal, since by definition 1% of the examinees are at each percentile. Unlike the bell-shaped normal curve, a rectangular distribution curve looks like a horizontal line. Therefore, researchers who desire to use common statistical techniques that assume normal distributions should avoid the use of percentiles.

A third problem with percentile scores is that they may lead to exaggerated interpretations of small differences, especially when the test is short. Consider a test

with an approximately normal score distribution. What happens to percentiles for examinees scoring near the center of the curve? As illustrated in Figure 7.6, small score differences near the center of the distribution may lead to large percentile differences. Suppose four examinees ($P_1$, $P_2$, $P_3$, and $P_4$) receive scores of 10, 12, 17, and 19, respectively, on a test. Only two score points separate $P_1$ from $P_2$ and $P_3$ from $P_4$, but as illustrated in Figure 7.6, there will be a much larger difference in the percentiles for $P_1$ and $P_2$ than for $P_3$ and $P_4$. For example, the four percentiles for the examinees might be 25, 50, 97, and 99. Someone examining these percentiles would probably conclude that $P_1$ is subnormal, $P_2$ is average, and $P_3$ and $P_4$ are both extremely high on the trait, without realizing that only two items marked incorrectly separate $P_1$ from $P_2$ and $P_3$ from $P_4$. This problem is not restricted to normal distributions; it will occur whenever a large proportion of examinees get the same or similar observed scores, causing a one- or two-point score difference to result in a large percentile difference. Since this problem is more likely to occur on short tests, on which only a limited number of scores are possible, the test user should exercise great caution when dealing with percentiles based on short tests.

Table 7.5. Scores and Percentiles for Students from Two Schools

            School A               School B
            Score    Percentile    Score    Percentile
            400      31            450      77
            410      40            460      84
            420      50            470      89
            430      60            480      93
            440      69            490      96
Mean        420      50            470      88
Variance    200      184           200      45

Figure 7.6. An equivalent raw-score difference generating very different percentile differences (horizontal axis: Raw Score, 2 to 20)

7.3 Age and Grade Scores

Often test scores are reported in terms of age or grade scores, sometimes called age or grade equivalents. For example, a third-grader may be said to read at the fifth-grade level or have the mental ability of a 10-year-old. To calculate an age or grade score, the median raw score for examinees of a particular age (or grade) is determined, and this median raw score is transformed to be that age (or grade) score. For example, if the median raw score on a reading test for fourth-graders is 17.2, the raw score of 17.2 is transformed to a grade score of "grade 4." Sometimes mean raw scores for a group are used instead of median scores to determine age or grade scores. In either case, interpolation is used to fill in the gaps between scores. For example, if the median grade-3 raw score is 13.3 and the median grade-4 raw score is 17.2, a raw score of 16 is transformed to a grade score of 3.7, or about 0.7 of the way between grades 3 and 4. This calculation is illustrated in Figures 7.7 and 7.8.

Grade Score    Raw Score
4              17.2
?              16
3              13.3

$3 + \frac{16 - 13.3}{17.2 - 13.3}(4 - 3) = 3 + \frac{2.7}{3.9}(1) \approx 3.7$

Figure 7.7. Interpolation of grade scores

Figure 7.8. Representation of grade scores (horizontal axis: Raw Score, 13 to 18)

Despite their popularity, age and grade scores have a number of serious limitations. Because they are assumed to form ordinal scores (Section 8.2), arithmetical manipulations of these scores can lead to misleading results. Also, the interpretation of these scores is not as straightforward as it appears. For example, one might infer that two children with the same mental age or age score think similarly, which is generally not true. There are enormous cognitive differences between a 5-year-old with a mental age of 8 and an 8-year-old with a mental age of 8. There will be large differences between the children in background knowledge and experiences, maturity of value systems, interests, cognitive styles, and reasoning ability. Similarly, a 5-year-old who is average and a 10-year-old with a mental age of 5 are very different, despite their similar mental ages. This limitation is also true for grade scores. A third-grader who is at the fifth-grade level on a science-achievement exam probably knows very different items of information and has a very different perception of the physical world than the average fifth-grader. Children with the same age or grade scores, especially when widely disparate in age, probably got different answers correct, used different test-taking strategies or styles, and are prepared for different types of subsequent training. These facts suggest that the use of age or grade scores could mislead most interpreters to perceive equalities that are not real.

A third problem with age and grade scores is that score distributions for adjacent grades typically tend to have increasing overlap as grade level increases (see Figure 7.9). Consider reading-level grade norms. A first-grader who is reading one grade level ahead may be in the 85th percentile among first-graders, but a fifth-grader who is one year ahead may only be at the 65th percentile among fifth-graders. Even though both students are one grade level ahead, the interpretation of their reading levels is quite different. The situation with age scores is the same. A 5-year-old who is two years ahead would be truly exceptional; for an 11-year-old, being two years ahead in mental age would not be as extraordinary. This problem also makes it difficult to compare examinees' performances at different age or grade levels. If two third-graders score at grade 3 and grade 4 on a mathematics exam, the one who scores higher might be very superior to the other. But, if two seventh-graders score at grade 7 and grade 8, they might be very similar to each other in mathematical competence. A new teaching method that improves performance by one grade level in only a few months' time would be marvelous if it worked for third-graders but less impressive if it worked for seventh-graders. Similar problems occur in age scores. A change of one year in cognitive or social development is an enormous leap for an infant; it is a much less dramatic change for a 16-year-old.

Figure 7.9. Example of increasing overlap of within-grade-score distributions as grade increases (horizontal axis: Test Score; distributions centered at the grade-1, grade-2, grade-5, and grade-6 means)

Another fact that is often forgotten is that, when we consider all those children with the same chronological age (or the same grade), about half of them will be above average, and the other half will be below average. For example, about half of the third-graders read below the grade-3 norm. Therefore, probably very few children will fall where one might expect them to on the basis of their age or grade level.

Another problem, especially for grade scores, is that schools may differ in their curricula and may introduce topics at different rates. Thus, a whole school may be below average in arithmetic and above average in history. In a case such as this, the grade scores might suggest that the individual students are retarded or gifted in these areas, when actually it is differential exposure that accounts for their score patterns.

Obviously, the use of age or grade scores is only reasonable when the trait being measured increases (or decreases) monotonically with age or grade. The test scores illustrated in Figure 7.10 should not be transformed to age scores, because the test does not discriminate between 2-year-olds and 4- or 5-year-olds. Only when the relationship between mean or median score and age or grade is monotonically increasing or decreasing, as illustrated in Figure 7.11, should age or grade scores be considered.

Figure 7.10. A relationship between test scores and age that is not appropriate for the development of age scores (horizontal axis: Age, 2 to 6)

A final problem with age and grade scores is that interpolation between tests may be inaccurate. School achievement exams are usually age graded. For example, there might be different mathematics-achievement tests for grades 3 to 6 and for grades 7 to 12. Unless these different tests can be equated (see Section 7.9), a fourth-grader who scores at the eighth-grade level on the lower level of the test probably will not score at the eighth-grade level on the higher level of the test. In summary, age and grade scores, despite their seeming appropriateness for use with children, have a large number of serious drawbacks that make meaningful interpretation extremely difficult.
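The grade-score interpolation described in Section 7.3 can be written as a short function. The following Python sketch is illustrative only: the function name `interpolate_grade_score` and its parameter names are assumptions, not from the original text. It reproduces the example in which the grade-3 and grade-4 median raw scores are 13.3 and 17.2.

```python
def interpolate_grade_score(raw, lower_grade, lower_median, upper_grade, upper_median):
    """Linearly interpolate a grade score between two adjacent grade medians.

    Assumes the median raw score increases with grade and that `raw` lies
    between the two medians; no extrapolation is attempted.
    """
    if raw > upper_median or lower_median > raw:
        raise ValueError("raw score must lie between the two median raw scores")
    # Fraction of the raw-score distance covered between the two medians.
    fraction = (raw - lower_median) / (upper_median - lower_median)
    # Move the same fraction of the way between the two grade levels.
    return lower_grade + fraction * (upper_grade - lower_grade)


# Values from the text: grade-3 median 13.3, grade-4 median 17.2, raw score 16.
grade = interpolate_grade_score(16, 3, 13.3, 4, 17.2)
print(round(grade, 1))  # 3.7
```

The same proportional reasoning underlies the percentile interpolations in Section 7.2. The limitations discussed above still apply: the resulting grade score is at best ordinal, however precise the arithmetic looks.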

Figure 7.11. Relationships between test scores and age that are appropriate for the development of age or grade scores (panels (a) and (b); horizontal axis: Age or Grade)

7.4 Expectancy Tables

Expectancy tables are probably the best transformations available for tests that can be tied to a reasonable criterion. An expectancy table gives the conditional distribution (see Section 2.13) of criterion scores for different test scores. Table 7.6 gives a hypothetical expectancy table relating pre-course test scores to performance in a statistics class. Of the people in the norm group who scored between 25 and 30 on the pre-course test, 50% earned "A's," 30% earned "B's," and so on. For the people scoring between 5 and 9, none earned "A's," 2% earned "B's," and so on. A counselor would advise a student with a high pre-course test score to take the course, perhaps with a warning that a few students from this test-score level still did poorly in the course. When advising a person with a low test score, the counselor might suggest remedial work for the student before he or she enrolled in the class. The expectancy table illustrates the probabilistic nature of psychological prediction. Although most high scorers did well in the class, some did not, making it clear that a student with a high pre-course test score is not guaranteed an "A."

Table 7.6. Expectancy Table Relating Pre-Course Test Scores to Course Grades

Test                        Grade
Score     A (4)    B (3)    C (2)    D (1)    F (0)
25-30     .50      .30      .15      .04      .01
20-24     .35      .34      .20      .10      .01
15-19     .10      .15      .58      .15      .02
10-14     .05      .06      .59      .25      .05
5-9       0        .02      .48      .40      .10
0-4       0        .01      .27      .42      .30

The main advantages of expectancy tables are especially apparent in counseling applications. The expectancy table makes clear the relationship between the test score and the criterion and the probabilistic nature of this relationship. Expectancy tables built on local norms for schools, mental-health clinics, factories, and so on are conceptually easy to develop and use.

There are, however, some disadvantages to expectancy tables. These tables often cannot be developed, either because of time or monetary considerations or because a clear-cut criterion is not available. The norm group used to develop expectancy tables should be large enough to ensure that the probabilities in the table are reasonably stable. For a test with wide applications, a large number of expectancy tables may be necessary to relate the test scores to an array of criteria. Local norms, rather than norms based on a national sample, may be necessary for specific school programs, therapy situations, or careers. These problems with expectancy tables are practical rather than theoretical. When specific criteria and reasonable sample sizes for the norm groups are available, norms displayed in expectancy-table form are one of the most useful transformations possible.

An alternative to the expectancy table that can be used in similar situations involves the regression of the criterion on the test (see Sections 2.9 and 2.14). A predicted criterion score is provided for each test score or range of test scores. In the preceding example, the predicted (expected) course grade for every score on the test could be provided. The advantages and disadvantages of the transformation using regression techniques are very similar to those for expectancy tables.

7.5 Standard and Standardized Scores

Section 2.7 demonstrated how to calculate standard scores, often called Z scores, where $Z = (X - \mu_X)/\sigma_X$. To get a standard score corresponding to any raw score, the mean of the raw scores is subtracted from the raw score, and the resulting number is divided by the standard deviation of the raw scores. A standard score indicates how many standard deviations from the mean a score lies. For example, if Z = +1, the raw score lies one standard deviation above the mean; if Z = -2, the raw score is two standard deviations below the mean. Since the standard-score transformation equation is linear, the shape of the standard-score distribution will be the same as the shape of the raw-score distribution, and correlations will not be affected. Contrary to popular belief, not all standard scores have a normal distribution. If the raw scores are skewed or bimodal, the standard-score distribution will have these same properties.

Standard scores always have a mean of 0 and a standard deviation of 1. One major disadvantage of standard scores is that about half the scores are negative. Most people prefer not to deal with negative numbers, because transcription and mathematical errors are more common (negative signs are easily lost) and because examinees dislike having negative scores. For these reasons, standard scores generally are not used in reporting scores.

Standardized scores are linear transformations of raw scores (or their standard-score equivalents) that eliminate the problems involved with negative numbers. Any set of standard scores can be transformed to have an arbitrary mean, $\mu^*$, and standard deviation, $\sigma^*$, by applying the formula $Y = \sigma^* Z + \mu^*$, where Z is

the standard score and Y is the standardized score. For example, if you want to have $\mu^* = 100$ and $\sigma^* = 16$, the transformation from Z to Y would be Y = 16Z + 100. An examinee with a standard score of 2 would have a standardized score of 132. This formula can also be easily applied to the raw scores. Since $Z = (X - \mu_X)/\sigma_X$,

$$Y = \sigma^* \left( \frac{X - \mu_X}{\sigma_X} \right) + \mu^*.$$

For example, if the raw-score mean and standard deviation are $\mu_X = 36$ and $\sigma_X = 3$, and you want to transform to standardized scores with $\mu^* = 50$ and $\sigma^* = 5$, the equation would be

$$Y = 5 \left( \frac{X - 36}{3} \right) + 50.$$

For a raw score, X, of 30, the transformed standardized score, Y, would be Y = 5(30 - 36)/3 + 50 = 40.

There are several standardized scale transformations in common use. The Army General Classification Test (AGCT) scores, developed in World War II, are standardized scores with $\mu^* = 100$, $\sigma^* = 20$. The College Entrance Exam Board (CEEB) test scores have $\mu^* = 500$, $\sigma^* = 100$. The subtest scores on the Wechsler tests (WISC and WAIS) are standardized to have $\mu^* = 10$, $\sigma^* = 3$. The Stanford-Binet IQ test score is standardized to have $\mu^* = 100$, $\sigma^* = 16$. Some personality-test scores, such as the California Psychological Inventory (CPI) and Minnesota Multiphasic Personality Inventory (MMPI) scales, have $\mu^* = 50$, $\sigma^* = 10$.

For people with a basic statistics background, standard and standardized scores are relatively simple to understand. Since they are linear transformations of the raw scores, these transformed scores will have a distribution with the same shape as the raw-score distribution. If the raw-score distribution is approximately normal, it is fairly easy to transform the standard or standardized scores to approximate percentiles, such as those given in Table 7.7. (This table was created using the standard normal table in the Appendix.) For example, if an examinee's standard score is 1.9, you can guess that the examinee is at about the 95th percentile (actually, from the table in the Appendix, it is the 97th percentile). If the standardized scores have a mean of 100 and a standard deviation of 10, a score of 90 is one standard deviation below the mean, or at about the 16th percentile.

Table 7.7. Z Scores and Their Approximate Percentiles in a Normal Distribution

         Approximate
  Z      Percentile
 -2           2
 -1          16
  0          50
 +1          84
 +2          98

One disadvantage of standard or standardized scores is that they may be difficult for those unfamiliar with statistics to understand fully. Another disadvantage is that, since these transformations are linear, the distribution of transformed scores will contain any irregularities found in the raw-score distribution. Irregular "bumps" in this distribution, usually due to sampling irregularities, will be preserved by the transformation. A third problem with standard or standardized scores is more subtle. Suppose Juan had two standard scores, .19 on Test A and .14 on Test B, that lead us to the conclusion that he did about equally well on the two tests. However, if the shapes of the two score distributions are quite different, our conclusion would be wrong. Figure 7.12 illustrates one possible pair of standard-score distributions. The distribution for Test A is skewed to the right, and the distribution for Test B is skewed to the left. The standard score of .19 is at the 64th percentile for Test A, and the standard score of .14 is at the 50th percentile for Test B. Similar standard or standardized scores can lead to different interpretations of relative merit. Thus, unless the test interpreter knows that two frequency distributions have the same shape, it is difficult to compare scores on two standardized scales, and it is particularly risky to interpret small differences in standard scores.

[Figure 7.12. Similar standard scores fall at different percentile ranks for frequency distributions with different shapes. Horizontal axis: Standard Score, from -2 to +3.]

7.6 Normalized Scores

The transformation to normalized scores involves forcing the distribution of transformed scores to be as close as possible to a normal distribution by smoothing out, stretching, or condensing irregularities and departures from normality in the

raw-score distribution. This transformation can be reasonably applied if the test developer believes that the underlying trait has a normal distribution and that the nonnormality of the raw-score distribution represents error due to sampling or test-construction problems.

The normalization process involves several steps:

1. Transform the raw scores to percentiles.
2. Find the standard score in the normal distribution corresponding to each percentile.
3. (Optional) Transform these standard scores to standardized scores with a desired mean and standard deviation.

To illustrate this process, we will normalize the raw scores given in Table 7.8 to have a transformed-score mean of 100 and standard deviation of 10. The column labeled Z gives the score in the standard normal distribution corresponding to each percentile, obtained by referring to the Appendix; Z is a normalized score with a mean of 0 and a standard deviation of 1. The last column is obtained by the formula Y = 10Z + 100, which gives the desired normalized and standardized scores. In the first row, a raw score of 118 has been transformed to a normalized and standardized score of 114.1.

Table 7.8. Calculation of Normalized Scores

                                                    Z              Y
  X                                             Normalized    Normalized and
 Raw                  Cumulative   Percentile    Standard      Standardized
Score    Frequency    Frequency    for Trait      Score           Score
 118         4           24           92           1.41           114.1
 117        10           20           62            .31           103.1
 116         8           10           25           -.67            93.3
 115         2            2            4          -1.75            82.5

Two normalized scores are in common use: T scores and stanines. T scores are normalized scores with $\mu^* = 50$ and $\sigma^* = 10$. Nonnormalized, standardized scores with $\mu^* = 50$ and $\sigma^* = 10$ are called T scores by some test publishers, so the test user should read the manual carefully to determine whether the "T scores" are normalized. If the raw-score distribution is approximately normal, then the normalized and nonnormalized standardized "T scores" will be approximately equal.

Stanines are one-digit normalized scores. They have a mean of 5 and a standard deviation of approximately 2; consequently, the difference between adjacent stanine scores is approximately one half of a standard deviation. The main advantages to the use of stanines are that their distributions are approximately normal and that each stanine involves only one number. Stanine scores are useful for rough screening of examinees, but because there are only nine different stanine scores, fine discriminations among examinees can't be made with them. The percentile ranks and the percentile ranges for each stanine are given in Table 7.9. Transformations to stanines can be done by calculating the percentile rank for any raw score and then referring to Table 7.9. For example, a score at the 83rd percentile would be transformed to a stanine of 7, and a score at the 12th percentile would be transformed to a stanine of 3.

Table 7.9. Percentile Ranks and Ranges Corresponding to Stanines

            Percentile    Percentile
Stanine        Rank          Range
   9           98           96-100
   8           94.5         89-96
   7           83           77-89
   6           68.5         60-77
   5           50           40-60
   4           31.5         23-40
   3           17           11-23
   2            5.5          4-11
   1            2            0-4

The advantage of transforming to normalized scores is that the transformed distribution has a well-known form that is easily interpretable and is amenable to common statistical manipulations. Scores on different tests, if normalized and converted to the same mean and standard deviation, become directly comparable, thus avoiding the complications involved when frequency distributions have different shapes. It is also easy to convert any normalized score to its equivalent percentile.

The use of normalized scores may not be reasonable if the underlying trait has a very nonnormal distribution. For example, if a score distribution is bimodal due to the presence of two disparate types of examinees (see Figures 6.12 and 6.13), it would not make sense to normalize the distribution. Also remember that normalized scores do not have a truly normal distribution. The normal distribution is continuous from negative infinity to positive infinity. However, normalized scores, because they are based on raw scores, are discrete and generally will fall within three standard deviations of the mean. Usually the problem of discreteness is not serious, especially if there are a large number of scores and the normalized distribution is fairly well approximated by the continuous normal curve. However, if the raw-score distribution is highly skewed, small raw-score differences between extreme scores may be exaggerated or compressed by the normalization. A last problem is that the transformed scores, with their approximately normal distribution, may lead the test user to believe that the test yields "perfect, normal" scores. Normalized scores based on a poor test (for example, a test with an inappropriate difficulty level or with poor discrimination among examinees) will not be very useful, despite their apparently pleasing, approximately normal distribution.
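As a rough sketch of the three normalization steps, the Python fragment below recomputes Table 7.8 and looks up stanines from the ranges in Table 7.9. Several details are assumptions rather than statements from the text: the percentile of a raw score is taken as the percentage of examinees below it plus half of those at it, rounded to a whole number (a rule inferred from the table's entries); the Appendix table is replaced by `statistics.NormalDist().inv_cdf`; a percentile falling exactly on a stanine boundary is assigned to the higher stanine; and the names `normalize`, `stanine`, and `STANINE_BOUNDS` are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist

# Raw-score frequencies from Table 7.8 (raw score -> frequency).
frequencies = {115: 2, 116: 8, 117: 10, 118: 4}

# Upper percentile bounds of stanines 1 through 8, read from Table 7.9.
# Assumption: a percentile exactly on a bound goes to the higher stanine.
STANINE_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def normalize(freqs, mean=100.0, sd=10.0):
    """Return {raw score: (percentile, z, y)} for a frequency distribution."""
    total = sum(freqs.values())
    below = 0  # number of examinees scoring below the current raw score
    result = {}
    for score in sorted(freqs):
        f = freqs[score]
        # Step 1: percentile, counting half of the examinees at this score
        # (assumed rule), rounded to a whole number.
        percentile = round(100 * (below + f / 2) / total)
        # Step 2: standard score at that percentile of the normal distribution.
        z = NormalDist().inv_cdf(percentile / 100)
        # Step 3: rescale to the desired mean and standard deviation.
        y = sd * z + mean
        result[score] = (percentile, z, y)
        below += f
    return result


def stanine(percentile):
    """Look up the stanine for a percentile using the Table 7.9 ranges."""
    return 1 + bisect_right(STANINE_BOUNDS, percentile)


table = normalize(frequencies)
for score in sorted(table, reverse=True):
    percentile, z, y = table[score]
    print(score, percentile, round(z, 2), round(y, 1))

# The two stanine examples given in the text.
print(stanine(83), stanine(12))
```

The printed Z and Y values should agree with Table 7.8 to the rounding shown there. A percentile of exactly 0 or 100 would make `inv_cdf` raise an error, which corresponds to the infinite Z score such a percentile implies.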

7.7 Corrections for Guessing and Omissions

Transformations of scores can also be made to adjust for the effects of guessing and the effects of omitting items. These transformations can aid in the interpretation of an examinee's performance and in the comparison of the performances of different examinees. Transformations that take into account guessing or omissions traditionally have been called formula scores.

On multiple-choice tests, examinees can get an item correct, without knowing the right answer, simply by guessing. Guessing is not a problem on personality or attitude tests, where there are no right answers, but it can cause concern on aptitude, performance, or achievement tests. If there are three possible answers for each question on a test, an examinee who simply guesses at random has a probability of 1/3 of getting each item correct. On a 30-item test, an examinee who answered randomly would be expected, on the average, to obtain a score of (30)(1/3) = 10.

It is possible to estimate the effects that guessing has on a test score and correct or adjust the observed score accordingly. This procedure involves estimating the number of items that the individual would have gotten correct if he or she had not guessed. Suppose there are A options (answer choices) for each item on an N-item multiple-choice test, and assume that the probability of guessing correctly is $1/A$. If an examinee guesses on G of the items, the expected number of items guessed correctly is $G/A$, and the expected number of items guessed incorrectly is $G - G/A$. If we assume that an examinee gets an item wrong only through incorrect guessing, then the number of wrong responses, W, equals $G - G/A$, so $G = AW/(A - 1)$. The number of items that are guessed correctly is $G/A = W/(A - 1)$. If an examinee got X items correct on a test, we can estimate that he or she got $F_1$ items correct without guessing, where

$$F_1 = X - G/A = X - \frac{W}{A - 1}. \tag{7.2}$$

$F_1$ is the first formula score discussed in this section, and $W/(A - 1)$ is the correction for guessing. If the examinee answers all items, $F_1$ is linearly related to the number of right responses, X, by the formula

$$F_1 = X - \frac{N - X}{A - 1} = \left( \frac{A}{A - 1} \right) X - \frac{N}{A - 1}. \tag{7.3}$$

When there are no omitted items, $F_1$ is a linear function of X, is perfectly correlated with X, and has the same reliability and validity as X. In other words, the correction for guessing has no important effect when all the items are answered. However, when there are omitted items, $F_1$ and X are not perfectly correlated.

Suppose that an examinee gets 16 items correct and 24 items wrong on a 40-item, five-option multiple-choice test. We can estimate that this examinee got $W/(A - 1) = 24/4 = 6$ items correct by guessing and 16 - 6 = 10 items correct without guessing. Notice that this is just an estimate of the number of items correctly answered without guessing. Some examinees may be "lucky" or "unlucky" and may guess correctly more or less than $1/A$ of the time. Given limited information and a simple model, the formula score $F_1$ uses the best estimate available for assessing the effects of guessing.

Because examinees sometimes don't answer all the items on a test, we need a formula score that takes item omissions into account. Otherwise, examinees who randomly guess on some items can obtain higher scores than those examinees who omit those items. Let N be the total number of items in a test and B be the number of items that are blank or not answered. X, W, and A are defined as before. Then,

$$N = X + W + B; \tag{7.4}$$

the total number of items in the test equals the number of items that are right plus the number of items that are wrong plus the number of items that are blank or omitted. The number-right scores, X, for examinees who answer all items are comparable to $F_2$ scores for examinees who omit items, where

$$F_2 = X + B/A. \tag{7.5}$$

The second formula score, $F_2$, is the estimated number of items that would be correct if every blank item were replaced by a random guess. For example, if an examinee had 10 right answers and 2 blanks on a test with four options per item, that examinee's formula score would be 10 + 2/4 = 10.5.

The scoring formula $F_2$ is a linear function of the scoring formula $F_1$ (Equation 7.2), since

$$F_2 = X + \frac{N - X - W}{A} = \frac{(A - 1)X}{A} - \frac{W}{A} + \frac{N}{A} = \left( \frac{A - 1}{A} \right) F_1 + \frac{N}{A}.$$

Scores based on scoring formula $F_2$ are perfectly correlated with and have the same reliability and validity as scores based on $F_1$. $F_2$, like $F_1$, is perfectly correlated with the number-right score, X, only if all of the examinees answer all of the items. That is, if B = 0, then $F_1$, $F_2$, and X are perfectly correlated. When there are omissions, $F_1$ and $F_2$ are not perfectly correlated with X.

There has been a long controversy about the propriety and usefulness of formula scores. In most cases, examinees don't guess randomly on items they don't know. Usually some of the possible answers can be eliminated as clearly impossible or untrue; therefore, the probability of guessing correctly among the remaining alternatives is greater than $1/A$. Also examinees may differ in their tendencies to guess on or omit items. If the test directions clearly state that examinees are to omit items only when they feel that they would have to guess randomly, and if the

examinees follow these directions, then a formula score is appropriate (Lord, 1975). However, a formula score is not appropriate theoretically when the test directions either give no instructions about omissions or state that an examinee's score is the number of questions answered correctly. If the theoretical results about the effects of formula scores on a test's reliability and validity are based on the assumption of random guessing, then these results are suspect unless it can be verified that examinees do omit those items on which they would have to guess randomly.

Empirical examinations of the reliability and validity of formula scores do not present a clear case either for or against formula scoring. Sabers and Feldt (1968) altered test-taking directions with respect to guessing by giving an admonition against guessing versus giving no instructions about guessing. They found that the directions and formula scoring did not alter the reliability of selected mathematics tests. Traub, Hambleton, and Singh (1969) found that changes in test directions and formula scoring had small effects on test reliability. Diamond and Evans (1973) reviewed a number of studies on formula scoring and concluded that, if it has any consistent effect, formula scoring tends to slightly increase test validity. In short, evidence about the usefulness of formula scoring is not clear cut. Formula scores may help, hinder, or not affect test reliability and validity.

The value of formula scoring depends on many factors: the difficulty of the test, the probability of correctly guessing answers, the variability of examinees' tendencies to omit items, and the reliability and validity of the tendency to omit items. Appropriate evaluation of a formula score requires that the test directions be compatible with the scoring formula and that the examinees understand the implications of the test directions and act accordingly. For example, if the test directions say that the examinees should not guess but these directions aren't followed, then the meaning of omissions and formula scores becomes unclear. Similarly, if the examinees are directed not to omit items but they omit them anyway, then the meaning of the number-right scores becomes unclear. The test user must make a careful evaluation of the meaning of omissions for the test at hand and score it accordingly. In most cases, the effect of formula scoring on the reliability and validity of a test must be evaluated empirically.

7.8 Equal-Interval Scales

In Section 2.2, the concept of equal intervals was introduced. A set of scores has equal intervals if any given difference between scores always represents the same amount of difference in the trait being measured. For example, if a one-point difference in spelling-test scores between the scores of 100 and 101, 101 and 102, and so on always represents the same amount of increase in spelling ability, the test has equal intervals. A test whose scores form equal intervals is particularly useful for measuring growth or change in a trait or behavior. Typically, a test's raw scores do not display equal intervals, but sometimes the raw scores can be transformed into a set of scores (a scale) that does have equal intervals.

In order to form an equal-interval scale (usually called an interval scale) from a test's raw scores, the test must measure a trait that has equal intervals. The scale developer first makes predictions about how the trait is related to test performance and then examines the accuracy of these predictions. If the predictions are accurate, the scale developer can transform the raw scores into an interval scale. This section examines one commonly used method for constructing an interval scale: Thurstone's absolute scaling method. Chapters 8 and 11 describe a number of other methods for forming interval (and ratio) scales.

Thurstone's absolute scaling method (which is described by Gulliksen, 1950) hypothesizes that the continuous trait being measured by a test has a normal distribution in some specified population. It also hypothesizes that raw scores on the test are monotonically related to trait values. (A monotonic relationship is one in which every increase in the raw score reflects an increase in the trait value.) If these hypotheses are true and the raw scores are normalized, then the normalized scores have equal intervals. In order to examine the accuracy of these hypotheses, the test is administered to two samples of examinees, each of which is assumed to have a normal distribution of trait values. For example, a spelling test might be administered to a national sample of seventh-grade students and a national sample of eighth-grade students. The resulting raw-score distributions are normalized within each sample. Each raw score is thereby transformed into one normalized score in one sample and another normalized score in the other sample. If these two sets of normalized scores are linearly related, then the normalized test scores obtained from either sample form an equal-interval scale.

Thurstone's absolute scaling method can be illustrated with a simple example. Table 7.10 contains raw-score frequency distributions, percentiles, and normalized scores (Z scores) for a hypothetical spelling test administered to seventh- and eighth-grade students. Figure 7.13 displays a plot of the Z scores obtained from the two samples. (The plot does not include points containing +∞.) Since the Z scores for the two grades are linearly related, the Z scores for either grade can be used as an equal-interval scale. Of course, the Z scores also could be standardized to have any desired mean and standard deviation.

Table 7.10. Testing Results for a Hypothetical Spelling Test

                 Seventh Grade                       Eighth Grade
 Raw
Score   Frequency  Percentile  Z Score     Frequency  Percentile  Z Score
   0        4          1        -2.3           4          1        -2.3
   1       10          5        -1.7           6          4        -1.8
   2       18         12        -1.2           8          7        -1.5
   3       30         24         -.7          14         13        -1.1
   4       38         41         -.2          18         21         -.8
   5       38         60         +.2          24         31         -.5
   6       30         77         +.7          26         44         -.2
   7       18         89        +1.2          26         57         +.2
   8       10         96        +1.7          24         69         +.5
   9        2         99        +2.3          18         80         +.8
  10        2         99.5      +2.6          14         88        +1.2
  11        0        100         +∞            8         93        +1.5
  12        0        100         +∞           10         98        +2.0
  13        0        100         +∞            0        100         +∞

[Figure 7.13. Plot of the Z scores from the two samples. Horizontal axis: Z Score for Seventh Grade, from -3 to +3; vertical axis ticks from -2 to +2.]

Equal-interval scales can be used in arithmetical manipulations, such as the calculation of means, standard deviations, and so on. The major disadvantage of equal-interval scales is that the process involved in deriving them is so complex that some test users may be unable to understand the scores or to communicate their understanding.
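The linearity check in the spelling-test example can be sketched numerically as well as graphically. The Python fragment below fits a least-squares line to the Z scores of Table 7.10 for raw scores 0 through 10 and reports the correlation. Using a correlation coefficient as the indicator of linearity is an assumption of this sketch (the text relies on inspecting the plot), and the function name `linear_fit` is hypothetical.

```python
from math import sqrt

# Z scores from Table 7.10 for raw scores 0 through 10. Raw scores 11-13 are
# left out because at least one grade has an infinite Z score there.
z_seventh = [-2.3, -1.7, -1.2, -0.7, -0.2, 0.2, 0.7, 1.2, 1.7, 2.3, 2.6]
z_eighth = [-2.3, -1.8, -1.5, -1.1, -0.8, -0.5, -0.2, 0.2, 0.5, 0.8, 1.2]


def linear_fit(x, y):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation between the two Z scales
    return slope, intercept, r


slope, intercept, r = linear_fit(z_seventh, z_eighth)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```

A correlation very close to 1 is consistent with the linear relationship described for Figure 7.13, although a scatterplot remains the more direct check, since a high correlation alone does not rule out a mild curve.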

Figure 7.14 displays how an equal-interval scale obtained from Thurstone's absolute scaling method might be related to other scores.

7.9 Equating Test Scores

When alternate forms of a test are constructed, it is often desired to transform the test scores to make them equivalent, if possible. This procedure is called test equating. Lord (1977b) defines test forms as being equated when it would be a matter of indifference to each examinee which test form he or she takes. Thus, for tests to be equated they must measure the same trait, and every level of the trait must be measured with equal accuracy by the two tests. Unequally reliable tests cannot be equated. When test forms are successfully equated, the examinees' scores will not be affected by the particular forms administered to them; consequently, the forms can be used interchangeably. Similarly, it is often desired to equate different levels of a test so that an examinee will get the same score regardless of whether an easier or harder level of the test is administered. This is sometimes called articulation of test levels or vertical equating, in contrast to the horizontal equating of test forms within a specified difficulty level. Several procedures for equating tests will be very briefly described. These procedures determine which scores on two tests can be said to be equivalent; the problems of unequal reliabilities of the tests will not be discussed here. More detailed information about test equating methods is available from Angoff (1971) and Lord (1977b).

The two test forms to be equated can be administered to the same sample of examinees. The major disadvantage of this equating method is that the scores on the test that is administered second can be affected by practice or fatigue. Such effects can be examined by reversing the order of administration of the tests for a random half of the sample. If the raw scores from the two tests are linearly related (which can be determined by examining scatterplots of the scores), the score distributions for the two tests will have very similar shapes, and standard scores for the two tests will be equivalent. This results in a linear equating of the two tests. If the scores for the two tests are nonlinearly related, which is typically the case, standard scores for the tests will not be equivalent. In such a case, a table or graph would be used to show how a score on one of the tests would be transformed to a score on the other test.

Two tests can also be equated by administering them to different, but

equivalent, samples. Equivalent samples can be obtained by very carefully controlled sampling procedures. Since the samples are equivalent, scores that are at the same percentile in the two samples are said to be equivalent. For example, suppose that a score of 10 on Test A is at the 25th percentile for Sample 1, a score of 12 on Test B is at the 22nd percentile for Sample 2, and a score of 13 on Test B is at the 28th percentile for Sample 2. Using linear interpolation, a score of 10 on Test A is said to be equivalent to a score of 12.5 on Test B. This is called the equipercentile method of equating. If the two distributions of scores differ only in terms of their means and standard deviations, the equipercentile method will produce a linear equating of the scores. If the two distributions differ in more than their means and standard deviations, a nonlinear equating will result.

The advantage of the equipercentile method of equating is that each examinee only needs to take one test instead of two. The main disadvantage of the equipercentile method is the need to obtain equivalent samples. Also, the equipercentile method does not determine whether the two tests should be equated; it simply equates them. Using this method, it is possible to appear to equate tests that measure very different traits (for example, vocabulary and mathematical computation).

Thurstone's method of absolute scaling (Section 7.8) can also be used to equate tests. To use this method, one must assume that the trait being measured by the two tests is normally distributed within each of two equivalent samples. Thurstone's method has the advantage of involving the administration of only one test to each examinee, and it allows an examination of the degree to which the two tests are equatable. The major disadvantage of this method is the necessity of obtaining samples in which the trait is normally distributed; this method typically involves expensive sampling procedures.

In another method of equating, the two tests are administered to different and not necessarily equivalent samples. In addition, both groups take a test called an anchor test that measures the same trait as the tests being equated. Scores on the anchor test are used to control for differences in the two groups that could affect the equating. The latent-trait models described in Chapter 11 are particularly useful for this type of equating. The advantage of this method is not having to obtain equivalent samples; the disadvantages are the need to administer the anchor test and the need to deal with the statistical complexities of controlling for differences between the equating groups.

Vertical equating of tests can be more difficult than horizontal equating, because different levels of a test typically involve different types of items. For example, Level 1 of a mathematics-achievement test may involve addition of single-digit numbers, but Level 2 may involve addition of two-digit numbers as well. It may be difficult to equate these tests, and, even if the test scores appear to equate statistically, there are conceptual problems in interpreting "equivalent" scores from the two tests. Also, in vertical equating it is common to link many levels together to form one scale. Level 1 is equated to Level 2, Level 2 is equated to Level 3, and so forth. Scores on Level 1 are equated to scores on Level 3 only indirectly, through Level 2. It may be impossible to equate Levels 1 and 3 directly, because it may be that few examinees who perform reasonably well on Level 1 can answer any of the questions on Level 3, whereas most of the examinees who do reasonably well on Level 3 get perfect or nearly perfect scores on Level 1. Thus, it may be impossible to double-check the equating of scores from Levels 1 and 3. When vertical equating is accomplished by the successive equating of adjacent test levels, the use of scores from a wide variety of test levels in one data analysis may be inappropriate or may produce results that are difficult to interpret.

7.10 Summary

Most raw scores are transformed to other numbers so that they can be more easily interpreted. There are two main types of transformations of raw scores: linear and nonlinear. Linear transformations do not alter the basic shapes of distributions or the size of correlations. Nonlinear transformations can alter both of these properties of scores.

A popular nonlinear transformation is the percentile rank. The percentile rank of any trait value is the percentage of examinees in a norm group who obtain scores at or below the trait value in question. Percentiles are a popular transformation and are relatively easy to calculate and interpret. However, it can be misleading to deal with means or standard deviations for percentiles. Moreover, since the distribution of percentile ranks is rectangular in shape, common statistical techniques that assume normality should not be used with them. If the test is short, or if a large number of examinees obtain the same raw score, the transformation to percentile ranks may lead to exaggerated interpretations of small differences in raw scores.

Other types of nonlinear transformations result in age and grade scores. Age and grade scores are common transformations with serious limitations. They are ordinal scales, and arithmetical operations with them (for example, calculations of means) can be misleading. It often is forgotten that half of the norm group scores above its nominal age or grade level, and half scores below. Age and grade scores can reflect curriculum or environmental differences, and comparisons using age and grade scores based on different tests may cause misunderstandings. Age and grade scores can lead users to perceive equalities that aren't real.

Another nonlinear transformation is the expectancy table. Expectancy tables give the conditional distribution of criterion scores for various levels of test performance. If an explicit criterion and reasonable sample sizes are available, expectancy tables are easy to construct and especially useful for counseling applications. As an alternative, regression techniques offer similar advantages.

Standard and standardized scores are frequently used linear transformations of raw scores. If the distributions of two raw scores are different in shape, comparisons of their corresponding standard or standardized scores can be misleading.

Normalized scores are nonlinear transformations created by smoothing the raw-score distribution into an approximately normal shape. T scores and stanines are common normalized scores. Normalization transforms scores to a known, easily

interpretable distribution that is required by many common statistical tests. This transformation is not reasonable if the underlying trait distribution is not normal, and it is not a cure for a poorly designed test.

Formula scores ($F_1$ and $F_2$) are designed to improve the information provided by raw scores about the performance of examinees who guess on or omit items. $F_1$ estimates the number-right score if correct answers due to guessing are eliminated. $F_2$ estimates what the number-right scores would have been if examinees had randomly guessed on all omitted items. Experimental evidence for the value of calculating such scores is contradictory; consequently, the test developer must evaluate each of these transformations empirically for use with specific tests for specific purposes.

Equal-interval scales reflect the trait being measured on an interval level of measurement. If the required assumptions are reasonable, equal-interval scales are particularly useful in the assessment of growth or change. Thurstone's absolute scaling method is a common method of obtaining an equal-interval scale.

Tests are equated so that their scores can be used interchangeably. Two tests can be equated by administering both of them to one sample of examinees and determining their relationship. If they are linearly related, standard scores based on the tests are equivalent. If they are nonlinearly related, a table or graph can be used to show which scores are equivalent. Tests can also be equated by administering different tests to each of two equivalent samples and equating scores with the same percentiles in the two groups (the equipercentile method). Finally, tests can be equated using Thurstone's absolute scaling method or by the use of an anchor test.

7.11 Vocabulary

age equivalent                   monotonic transformation
age scores                       nonlinear transformation
anchor test                      normalized scores
articulation of test levels      norm group
correction for guessing          norm-referenced testing
criterion-referenced testing     percentile

7.12 Study Questions

1. Why are raw scores transformed?
2. What are the differences between linear and nonlinear transformations?
3. Explain how to calculate the percentile for a given trait value and the trait value corresponding to any given percentile. What are the advantages and disadvantages of the percentile transformation?
4. How are age and grade scores calculated? What are the advantages and disadvantages of age and grade scores?
5. How are expectancy tables constructed? What are the advantages and disadvantages of expectancy tables?
6. How can a regression equation be used to provide information similar to that provided by an expectancy table?
7. Explain how to calculate standard and standardized scores. What are the advantages and disadvantages of standard and standardized scores?
8. Explain how to normalize a set of scores. What are the advantages and disadvantages of normalized scores?
9. Assuming that the raw-score distribution in the following table is approximately normal with $\mu_X = 48$ and $\sigma_X = 2$, verify the column entries.

 Raw
Score      Z      Percentile     T     Stanine
  54      3.0        99+         80       9
  53      2.5        99          75       9
  52      2.0        98          70       9
  51      1.5        93          65       8
  50      1.0        84          60       7
  49       .5        69          55       6
  48      0          50          50       5
  47      -.5        31          45       4
  46     -1.0        16          40       3
  45     -1.5         7          35       2
  44     -2.0         2          30       1
  43     -2.5         1          25       1
  42     -3.0         1-         20       1

equal-interval scale percentile rank

percentile score 10. De~c~be the two formula scores developed as corrections for guessing and

equipercentile equating

raw score om1ss1ons.

expectancy table

rectangular distribution 11. What problems are involved in the use of formula scores? How useful are

F, formula scores?

F, reference group

standardized score 12. What is an equal-interval scale? How can equal-interval scales be constructed?

formula score

standard score 13. Describe four procedures for equating test scores: What are the advantages and

grade equivalent

stanine disadvantages of each procedure?

grade scores

horizontal equating test equating

Thurstone 's absolute scaling

7 .13 Computational Problems

interval scale

linear transformation transformed score l. Using the data in the following table:

mental age Tscores a. Calculate the percentile ranks for trait values of 86, 92, and I 07.

monotonic relationship vertical equating b. Calculate the trait values at the 3rd, 22nd, and ?5th percentiles.

100-109      5
90-99        8
80-89       12
70-79       18

2. On a cognitive-maturity test, the median score for six-year-olds is 23 points and the median score for seven-year-olds is 32 points. Estimate the mental age (age score) of a child earning 28 points on this test.

3. Two potential students for a statistics class take a pre-course exam. Micah gets 21 points, and Becky gets 13 points. Using the expectancy table (Table 7.6), what advice would you give each student?

4. (Optional) Using the definition of regression contained in Section 2.14, obtain the expected course grade for each of the six pre-course test-score ranges in Table 7.6.

5. The raw scores for a test have a mean of 83 and a standard deviation of 12.
   a. What is the standard score corresponding to a raw score of 89?
   b. What is the standardized score for a raw score of 80 if the scores are standardized to be comparable to CEEB scores?

6. Transform the following scores to T scores and then to stanines.

Raw Score   Frequency
   37           6
   36         124
   35          58
   34          10
   33           2

7. a. Estimate the number of items Carey got correct by guessing.
   b. Estimate the number of items Carey got correct without guessing.
   c. What assumptions were used in making these estimates?

8. Two examinees take a true/false test with 50 items. Examinee A gets 40 items correct and omits four items. Examinee B gets 35 items correct and omits 12. Estimate the number of items each examinee would have gotten correct if their omissions were replaced with random guesses. Evaluate the relative performance of the two examinees.

9. a. Using the test-score frequency distributions in the following table, create an equal-interval scale using Thurstone's absolute scaling method.
   b. Do the data appear to form a good equal-interval scale?

0    1    1
1    2    5
2    4    7
3    5    5
4    4    1
5    3    1
6    1    0

10. Assume the data in the following table are the result of administering two different tests to two different, but equivalent, samples.
    a. Equate the test scores (giving the Test B score equivalent to each Test A score) using the equipercentile method.
    b. Equate the test scores using Thurstone's absolute scaling method.
    c. Compare the results from a. and b.

      Test A                 Test B
Score   Frequency      Score   Frequency
 10         5           105       10
 11        10           115       15
 12        20           125       15
 13        20           135       30
 14        25           145       20
 15        10           155        5
 16        10           165        5
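The transformations that these problems and study question 9 exercise can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' procedure. The function names are hypothetical. The z-to-percentile step assumes an approximately normal raw-score distribution, as study question 9 does. The stanine rule (mean 5, standard deviation 2, clipped to 1 through 9) and the formula-score expressions F1 = R - W/(k - 1) and F2 = R + O/k are the usual textbook forms, assumed here to match the chapter's definitions (R = number right, W = number wrong, O = number omitted, k = options per item).

```python
import math


def z_score(raw, mean, sd):
    """Standard score: distance from the mean in standard-deviation units."""
    return (raw - mean) / sd


def percentile_from_z(z):
    """Percent of a normal distribution falling below z (normal CDF times 100)."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def t_score(z):
    """T score: rescales z to a mean of 50 and a standard deviation of 10."""
    return 50.0 + 10.0 * z


def stanine(z):
    """Stanine: 2z + 5 rounded half-up, then clipped to the range 1..9."""
    return int(min(9, max(1, math.floor(2.0 * z + 5.0 + 0.5))))


def formula_score_f1(right, wrong, options):
    """Assumed form F1 = R - W/(k - 1): subtracts the expected lucky guesses."""
    return right - wrong / (options - 1)


def formula_score_f2(right, omitted, options):
    """Assumed form F2 = R + O/k: credits omitted items at the chance rate."""
    return right + omitted / options


if __name__ == "__main__":
    # Rows of the study question 9 table: mean 48, standard deviation 2.
    print("raw     z   pct     T  stanine")
    for raw in range(54, 41, -1):
        z = z_score(raw, 48, 2)
        print(f"{raw:3d} {z:5.1f} {percentile_from_z(z):5.1f} "
              f"{t_score(z):5.0f} {stanine(z):8d}")

    # Hypothetical examinee (not from the text): 30 right, 8 wrong,
    # 2 omitted, on a test with 4 options per item.
    print(formula_score_f1(30, 8, 4))
    print(formula_score_f2(30, 2, 4))
```

The percentile column is printed to one decimal place because the extreme rows of the table are reported as "99+" and "1-" rather than as rounded whole numbers. For problem 6, where the raw-score distribution is given by frequencies rather than assumed normal, the z values would instead have to come from the percentile ranks of the scores, which this sketch does not attempt.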
