
Percentile   z    T    Deviation IQ   Stanine   Sten   Descriptor

             -4   10   40                               Very low
0.1          -3   20   55                               Low
2.5          -2   30   70             1         1.5     Below average
16           -1   40   85             3         3.5
50            0   50   100            5         5.5     Average
84            1   60   115            7         7.5
97.5          2   70   130            9         9.5     Above average
99.9          3   80   145                              High
              4   90   160                              Very high

Understanding Scores

Overview

• Review types of scores
• Consider standards for comparison

What types of scores are there?

1. Raw Scores
2. Linear Transformed Scores
3. Area Transformed Scores
4. Stanines

What good are they?

The Raw Score

The Raw Score is a simple count of the test behavior (e.g., # of free throws). It isn't usually meaningful for psychological measures. If I say you have a score of '4,' is that good? Bad? Indifferent?

You can't know anything from a simple total. It may sound like a lot or a little, but without knowing how it relates to other information, it is useless.

So we transform it.

Transformation

1. Doesn't change the score, it just uses different units. (Sounds contradictory? Wait and see how it works.)
2. Transformation includes or takes into account information not in the raw score (such as the # of items on the test).
3. More informative/interpretable than a raw score.

Linear Transformation

Uses the linear equation for the transformation.

Examples:

• z score: mean = 0, sd = 1
• Useful to psychometricians but not for interpretation to the lay public, because half the scores are negative and often fractional. The public thinks a z of -.01 is an awful score when really it is average.
• So transform the z scores again to...
  o T scores: mean = 50, sd = 10
  o IQ scores: mean = 100, sd = 15
  o You can pick the mean and sd you like and convert the z to whatever scale you want. But the T and the IQ score are the most common.

So how are these used?

Well, if you know an IQ is 100, you know that person earned a score at the mean of
everyone who takes the test.

If the IQ score is 115, you know the person is noticeably above the mean (1
standard deviation= 15. see above).

If 130, much more noticeably above the mean--definitely an unusual score because
it is 2 standard deviations from the mean.

How unusual a score is 70? 70 is as unusual as which, 115 or 130?

130-- they're both 2 standard deviations from the mean.

Suddenly, by using a linear transformation, the number now has meaning! It tells the unusualness/commonness of a score just by knowing the chosen mean/sd and the score.
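To make the arithmetic concrete, here is a small sketch in Python (mine, not part of the lecture) of how a z score can be re-expressed on any scale you choose; the means and SDs for T and IQ are the ones given above.

    def z_to_scale(z, mean, sd):
        """Re-express a z score on a scale with the chosen mean and SD."""
        return mean + z * sd

    # An IQ of 130 is 2 SDs above the IQ mean of 100 (SD = 15)...
    z = (130 - 100) / 15              # 2.0
    print(z_to_scale(z, 50, 10))      # 70.0  (the same person's T score)
    print(z_to_scale(-2.0, 100, 15))  # 70.0  (an IQ of 70: just as unusual as 130)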

Area Transformation

Uses the normal curve to create transformations. An area transformation reflects where on the curve the score falls, e.g., percentiles (NOT equivalent to percent or percentage).

55th percentile: the person earned a score equal to or greater than 55 percent of the people who took the test. Notice that phrase 'equal to or greater than.' It has to be exactly that phrase to be accurate.

Percentiles have uneven intervals: greatly compressed in the middle, stretched at the limits. The difference between the 50th and 55th percentile is tiny, but the difference between the 90th and 95th percentile is huge. Notice that you have to understand the normal distribution to correctly interpret percentile scores.

Standard Nine aka Stanine

• Popular in education.
• The stanine divides the normal curve into 9 segments (middle 20% of scorers = 5; next 17% = 4 or 6; next 12% = 3 and 7; next 7% = 2 and 8; extreme 4% = 1 and 9). [Don't try to memorize that.] A short sketch of this mapping follows this list.
• Not all that useful other than that it reduces overinterpretation of scores. In other words, if you tell someone their child had a 72%ile, they want to know why they didn't have a 76%ile, or want to claim their Johnny is much smarter than Billy who got a 69%ile -- the test probably does not support that fine a distinction. Tell them the child had a stanine of 6 and that seems to satisfy them.
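As a rough sketch (not from the lecture), the segment percentages above can be turned into cumulative percentile cutoffs and used to look up a stanine:

    import bisect

    # Cumulative upper bounds for stanines 1-8 (stanine 9 is everything above 96),
    # built from the segment widths 4, 7, 12, 17, 20, 17, 12, 7, 4 percent.
    CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]

    def stanine(percentile):
        """Map a percentile rank (0-100) to a stanine (1-9)."""
        return bisect.bisect_right(CUTOFFS, percentile) + 1

    print(stanine(72), stanine(76), stanine(69))  # 6 6 6 -- the children in the example above
    print(stanine(92))                            # 8

Note that 72, 76, and 69 all land in the same stanine, which is exactly the point of reporting the coarser score.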

But what does the number mean in real life?

Depends on what you are using for the standard or comparison. That is, a score is
meaningful relative to the comparison on which it is based.

Options for Standards

Personal performance -
improvement compared to self
over time. You may say "I lift
more weight now than I used to."

Useful in clinical and some educational work. People like this standard when they are learning--it emphasizes progress made away from the starting point.

External or criterion-referenced.
An absolute level of performance
must be met to be successful.
External agencies often set
criterion-referenced standards.

This one can seem hard from a learner's point of view. It doesn't matter how hard you tried, how nice you are, or how much you studied. All that matters is whether you can remove an appendix without killing the patient or can land a plane without crashing.

Norm groups- Find out how a
group of people do on the test
and compare all scores to their
performance. This one is used a
lot for psychological measures
because there often is no
external criterion.

How much is enough learning? The norm group answer is "As much as everyone else is doing." Learners often like this one because it gives them the security of fitting in. However, they may also rebel at being compared to everyone else.

Since norm groups are so common in psychology, let's explore them a bit further.

Notice that norm groups are also called "standardization samples". Do you see
where the concept of standard comes into play?

What group should be used for the norm comparison?

Depends on what you want to know. Do you want to know how the person is doing compared to a sample of peers, e.g., from the same school? Or compared to the US population, e.g., a national group? Do you want to know how they do relative to others their age? Or how they'd fit into an older group?

The sample typically is representative:

• of the US population
• of a particular age
• of a particular grade, or
• gender, etc.

But it doesn't have to fit one of the above categories. If you have a specific research or clinical question, you need a norm group that reflects your question. My mother had a right hemisphere stroke and I want to know how she is doing. I may want to know relative to other persons with right hemisphere strokes so we can estimate whether her pace of improvement is "normal", "slower", or "faster" and predict rehabilitation and financial needs.

[Note in this case that there is also an external comparison to be made and
considered-- she can either feed herself or not, toilet herself or not. Meeting those
standards is necessary for independent functioning. It doesn't matter if she is doing
better than the average right hemisphere patient. And relative to herself she has
made tremendous improvement, from being confined to bed to being able to walk
100 feet with help. This is great, but it may not be enough for that hard external
standard.]

Let's examine the most popular norms.

Age norms

Many psychological traits of interest (e.g., intelligence) change with age so you
need a comparison group of the same age.

Crude example: let's say that 5 year olds can define "apple" but 4 year olds can't.

If the subject defines "apple" then she is performing at a "5 year old level"

Grade Norms

Done similarly, only with grades. They may say a child is performing similar to the beginning of 3rd grade or similar to the end of 3rd grade.

Typically, they just get a beginning and end of year sample, unless it is first or 2nd
grade, then they may have a middle. (They limit their samples because they'd have
to have a full sample of children for each segment -- it gets costly.)

Big problems with grade norms though

1. "Typical grade tasks" vary considerably w/schools-- so what does "grade"


mean anyway? What does it mean to be at the beginning of 3rd grade?
2. You absolutely may NOT interpolate (guess at scores between beginning and
end of year calculated ones). Why? Because development is not even or
regular over the course of the school year. So you can't say child is
performing similar to the middle of the year, unless you took a comparison
sample from the middle of the year.
3. Finally, there's a big problem with "of out of age performance"
interpretation�

Example: a first grader passes "3rd grade" items.

 This "score" does NOT mean the child is doing 3rd grade work- means the
child is at about 2 or 3 sds for first graders.
 Tendency to overgeneralize to other abilities than the tested abilities. ("Janey
is outstanding in Math, therefore we don't have to think about her English"--
wrong!)
 Generally, avoid using grade norms. Yuck phooey bad norms!

Other considerations

Remember! The particular question you are trying to answer (decision you are
trying to make) determines what is the appropriate comparison group. If you are
deciding who to take into the US Army, then the potential pool is everyone in the
USA. So any selection test standardization sample needs to reflect that. If you are
deciding whom to hire in your Bowling Green bakery, you'll be choosing from
amongst local Kentuckians and your selection process norm group should reflect
that.

Culture Issues

Case ex: I was asked to evaluate, for purposes of educational placement, an immigrant child who was not doing well in a school in the USA. She didn't speak English. The family's intention was to stay in the USA. What norms should be used to evaluate ability? What questions might you need to consider?

What if we changed the situation so that she was returning to her home country in
a month for the rest of her life and would need to know where to be placed when
she returned home?

To summarize

• Percentiles are commonly used for interpretation to the lay population.
• Standard scores (z, T, IQ) are favored by professionals because they are less likely to be misinterpreted.
• Occasionally use age norms.
• Avoid grade norms!

Lecture 3

IMPORTANT FORMULAS
Scaled Scores:

The raw score is an untransformed score from a measurement.

Raw Data
Statistical data in its original form, before any statistical techniques are used to refine, process, or summarize.
Ex. When a person gets 85 answers correct on a 100-item test, the raw score = 85.

Is 85/100 correct on an exam good? It depends on the evaluation system.

Raw scores are difficult to interpret without additional information.


"The curve" has preset percentage categories for evaluation.
For Ex. A score in the top 10% on an exam = an 'A'. It is a relative position to other test scores.
"The standard" has preset cutoffs (relative to cut off values).
For Ex. 100-90 points correct on an exam = an 'A'. With the standard you are up against a system, not
each other.

Percentile
One of the division points between 100 equal-sized pieces of the population when the population
is arranged in numerical order. The 78th percentile is the number such that 78% of the population
is smaller and 22% of the population is larger.

The percentile is transformed from a raw score. It will give you a relative position, for example, 1 to 99.
The numbers = the percentage of scores below your raw score.

Obtaining a percentile rank of 80 means that whatever your raw score was, 80% of the other raw scores
were below yours.

Raw Score   Percentile
78          50
85          60
90          70

For a linear transformation, the distances between scores on both sides (raw and transformed) must stay in proportion. The big problem with percentiles is that they are not linear transformations of raw scores; hence they are called "non-linear".

Centimeter   Inch
2.54         1
5.08         2

These are linear transformations.
As one side changes, the other changes in equal proportion.
Percentiles are great at a descriptive level.

Standard Scores are linear transformations.

Raw scores are transformed into standard scores.


Changing a raw score to a percentile score or percentile rank:

Given a set of scores: 52, 89, 42, 13, 88, 76, 44, 45,
22, 105
Find the percentile rank (PR) for 45.

Lecture 4

Scaled Scores:
Percentile rank
Standard score (a linear transformation)

Z = (raw score - mean of raw scores) / standard deviation of raw scores

The Z score tells you how far the raw score is away from the mean in terms of standard deviation units.
It does not change the shape of the distribution!

Raw scores do not change into a bell-shaped curve when changed into standard scores. The numbers do not change physically; the transformation just changes the unit of measurement you are working with.

Percentile
Z score
mean of Z scores = 0
standard deviation of Z scores = 1
If the Z score is negative, this says that the raw score was below the mean of raw scores.
If the Z score is positive, this says that the raw score was above the mean of raw scores.
If the Z score is zero, this says that the raw score was equal to the mean of raw scores.
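A quick sketch (mine, not the lecture's) of the claims above: after the z transformation the mean is 0 and the SD is 1, but the relative positions of the scores, and therefore the shape of the distribution, are unchanged. It assumes the population SD (dividing by N), which is what the worked examples in Lecture 5 below use.

    def z_scores(xs):
        """Convert raw scores to z scores, using the population SD (divide by N)."""
        n = len(xs)
        m = sum(xs) / n
        sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
        return [(x - m) / sd for x in xs]

    raw = [52, 89, 42, 13, 88, 76, 44, 45, 22, 105]   # the score set from Lecture 3
    zs = z_scores(raw)
    mean_z = sum(zs) / len(zs)
    sd_z = (sum((z - mean_z) ** 2 for z in zs) / len(zs)) ** 0.5
    print(round(mean_z, 6), round(sd_z, 6))   # 0.0 1.0 (give or take a floating-point -0.0)
    # The ordering of the scores is untouched, so the shape is untouched:
    print([r for _, r in sorted(zip(zs, raw))] == sorted(raw))   # True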

"CEEB Score"
The College Examination Board (now called ETS) converts your raw score into a scale (different
unit of measurement) where the mean is 500 and the standard deviation is 100. This is done to
eliminate negative numbers.
SAT Lowest Highest Mean
Verbal 200 800 500
Quantatative 200 800 500

The Graduate Entrance Examination (GRE) is the same.

McCall T-Score Scale (Personality Tests)


The raw scores have been converted, resulting in a scale where the mean = 50 and the standard deviation = 10.

Ex. Given the raw scores 22, 17, 19, 37, 26 convert 26 to a scaled score on the McCall T-Score Scale.

Ex. -23, 18, -2, 5, 44, 39, 19, 18. What is the percentile rank for a raw score of 18 (X = 18)?

Tied Ranks are 2 numbers having the same values, for example '18' and '18'.

Rank   Score
1      -23
2      -2
3      5
4      18
5      18
6      19
7      39
8      44

Look at ranks 4 and 5, take the mean of these = 4.5.

Ex. 4, 29, 17, 29, 29, 30, 7, 11, 14. Convert 29 to a percentile.

Rank   Score
1      4
2      7
3      11
4      14
5      17
6      29
7      29
8      29
9      30

Dr. Lee may use different wording on your exams. Don't get confused. Percentile and Percentile Rank mean the same thing.

Lecture 5

Ex. Convert 46 to a centile rank.

X (raw scores)   Rank
100              7
80               6
29               2
46               5
18               1
45               4
34               3

1) Rank the numbers.
2) Convert the rank of 46 (R = 5 out of N = 7 scores) into a percentile rank: 100 (5 - 0.5) / 7 ≈ 64.

What does this mean?

It means that about 64% of the scores lie below 46.
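The notes never state the percentile-rank formula explicitly, but the worked answer of 64 (and the mean rank of 4.5 for the tied 18s above) is reproduced by the common convention PR = 100 (R - 0.5) / N, where R is the (mean) rank of the score and N is the number of scores. A sketch under that assumption:

    def percentile_rank(x, scores):
        """PR = 100 * (R - 0.5) / N, using the mean rank when x is tied."""
        ordered = sorted(scores)
        n = len(ordered)
        ranks = [i + 1 for i, v in enumerate(ordered) if v == x]   # 1-based ranks of x
        mean_rank = sum(ranks) / len(ranks)
        return 100 * (mean_rank - 0.5) / n

    print(round(percentile_rank(46, [100, 80, 29, 46, 18, 45, 34])))       # 64
    print(round(percentile_rank(18, [-23, 18, -2, 5, 44, 39, 19, 18])))    # 50
    print(round(percentile_rank(29, [4, 29, 17, 29, 29, 30, 7, 11, 14])))  # 72

The 50 and 72 are what the same convention gives for the two earlier exercises, whose answers the notes leave to the reader.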

Ex. Convert 80 to a Z score.

M = Mean
X-bar is also used, but not in this class, because it can be confused with X, which is a vector.

Convert 80 to a Z score.

1. Find the mean and standard deviation.

M = 50.285
SD = 27.154

Z = (80 - 50.285) / 27.154

Z = 1.094, which rounds to 1.09.
This says that the score of 80 lies over 1 standard deviation above the mean (50.285).

The following is very important:

Percentiles are represented as integers.
Z scores are carried to 2 decimal places.
To ensure accuracy at 2 decimal places, you must carry at least 3 decimal places in your calculations.

Ex. Convert 46 to a Z score.

Z = (46 - 50.285) / 27.154
Z = -0.16

46 is .16 standard deviations below the mean.

Ex. Convert all the scores to Z scores. What is the mean of the Z scores?
The mean = 0.

* The mean of the Z scores is zero.
* The standard deviation of the Z scores is 1.

Ex. Convert 80 to a CEEB scaled score.

Use this formula:
CEEB = 100 (Z score for 80) + 500

1. Find the Z score for 80.
2. Use the formula listed above.
100 (1.09) + 500 = 609.

Ex. Convert 80 to a McCall T-score scale.

T score = 10 (Z score for 80) + 50
10 (1.09) + 50 = 10.9 + 50 = 60.9.

Ex. Convert X = 80 to a scaled score where the mean = 100 and standard deviation = 16.

Scaled score = 16 (Z score for 80) + 100
Scaled score = 16 (1.09) + 100 = 117.44.
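A small sketch (not part of the lecture) pulling the Lecture 5 steps together. Like the notes, it uses the population SD and rounds z to two decimals before rescaling, which is what reproduces 609, 60.9, and 117.44.

    def scale(x, scores, new_mean, new_sd):
        """Convert raw score x to a scale with new_mean and new_sd: new_sd * z + new_mean."""
        n = len(scores)
        m = sum(scores) / n
        sd = (sum((v - m) ** 2 for v in scores) / n) ** 0.5   # population SD, as in the notes
        z = round((x - m) / sd, 2)                            # the notes carry z to 2 decimals
        return round(new_sd * z + new_mean, 2)

    data = [100, 80, 29, 46, 18, 45, 34]
    print(scale(80, data, 500, 100))   # 609.0   CEEB score
    print(scale(80, data, 50, 10))     # 60.9    McCall T score
    print(scale(80, data, 100, 16))    # 117.44  scale with mean 100, SD 16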

Deviation Intelligence Quotient (IQ)

The Deviation Intelligence Quotient (IQ) is a way of measuring an individual's generalized intelligence. It uses statistics to analyze a person's intelligence relative to their age. Deviation IQ is scored based on how an individual deviates from the average IQ of 100. It treats IQ as a normal distribution with an average of 100 and a standard deviation of 15.

This differs from the original way of measuring IQ, which used a ratio score comparing a person's "mental" age with their actual age. Deviation IQ scores are intended to be more accurate and to account for people who have very high scores on intelligence measures.

Normalized standard scores
(statistics)
A procedure in which each set of original scores is converted to some standard scale under the assumption that the distribution of scores approximates a normal distribution.

Standard Score

The standard score (more commonly referred to as a z-score) is a very useful statistic because it (a) allows us to calculate the probability of a score occurring within our normal distribution and (b) enables us to compare two scores that are from different normal distributions. The standard score does this by converting (in other words, standardizing) scores in a normal distribution to z-scores in what becomes a standard normal distribution.

Standard Normal Distribution and Standard Score (z-score)

When a frequency distribution is normally distributed, we can find out the probability of a score occurring by standardising the scores, known as standard scores (or z scores). The standard normal distribution simply converts the group of data in our frequency distribution such that the mean is 0 and the standard deviation is 1.

Z-scores are expressed in terms of standard deviations from their means. As a result, these z-scores have a distribution with a mean of 0 and a standard deviation of 1. The formula for calculating the standard score is:

z = (X - mean) / standard deviation

As the formula shows, the standard score is simply the score, minus the mean score, divided by the standard deviation.

MMPI-2 Scoring & Interpretation

Scores are converted to what are called normalized "T scores" on a scale ranging from 30 to 120. The "normal" range of T scores is from 50 to 65. Anything above 65 and anything below 50 is considered clinically significant and open for interpretation by the psychologist.

What is Classical Test Theory?

Classical Test Theory (CTT), sometimes called the true score model, is the mathematics behind
creating and answering tests and measurement scales. The goal of CTT is to improve tests,
particularly the reliability and validity of tests.

Reliability implies consistency: if you take the ACT five times, you should get roughly the same
results every time. A test is valid if it measures what it’s supposed to.

It’s called “classic” because Item Response Theory is a more modern framework.

True Scores

Classical Test Theory assumes that each person has an innate true score. It can be summed up
with an equation:

X = T + E,

Where:

 X is an observed score,

 T is the true score,

 E is random error.

For example, let’s assume you know exactly 70% of all the material covered in a statistics
course. This is your true score (T); A perfect end-of-semester test (which doesn’t exist) should
ideally reflect this true score. In reality, you’re likely to score around 65% to 75%. The 5%
discrepancy from your true score is the error (E).
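A toy simulation (mine, not from the text) of X = T + E: the examinee has a fixed true score, each administration adds random error, and averaging many hypothetical administrations recovers something close to T.

    import random

    random.seed(1)

    def observed(true_score, error_sd=5.0):
        """One administration: X = T + E, with E drawn as random error."""
        return true_score + random.gauss(0, error_sd)

    T = 70.0                                    # the examinee's (hypothetical) true score
    scores = [observed(T) for _ in range(10_000)]
    print(round(sum(scores) / len(scores), 1))  # close to 70.0: the random error averages out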

Classical test theory (CTT) has been the foundation for measurement theory for over 80 years. The conceptual foundations, assumptions, and extensions of the basic premises of CTT have allowed for the development of some excellent psychometrically sound scales. This chapter outlines the basic concepts of CTT as well as highlights its strengths and limitations. Because total test scores are most frequently used to make decisions or relate to other variables of interest, sifting through item-level statistics may seem tedious and boring. However, the total score is only as good as the sum of its parts, and that means its items. Several analyses are available to assess item characteristics. The approaches discussed in this chapter have stemmed from CTT.

Classical test theory is a bit of a misnomer; there are actually several types of CTTs. The foundation for them all rests on aspects of a total test score made up of multiple items. Most classical approaches assume that the raw score (X) obtained by any one individual is made up of a true component (T) and a random error (E) component: X = T + E. The true score of a person can be found by taking the mean score that the person would get on the same test if they had an infinite number of testing sessions. Because it is not possible to obtain an infinite number of test scores, T is a hypothetical, yet central, aspect of CTTs.

Domain sampling theory assumes that the items that have been selected for any one test are just a sample of items from an infinite domain of potential items. Domain sampling is the most common CTT used for practical purposes. Parallel test theory assumes that two or more tests with different domains sampled (i.e., each is made up of different but parallel items) will give similar true scores but have different error scores.

Regardless of the theory used, classical approaches to test theory (and subsequently test assessment) give rise to a number of assumptions and rules. In addition, the overriding concern of CTTs is to cope effectively with the random error portion (E) of the raw score. The less random error in the measure, the more the raw score reflects the true score. Thus, tests that have been developed and improved over the years have adhered to one or another of the classical theory approaches. By and large, these tests are well developed and quite worthy of the time and effort that have gone into them. There are, however, some drawbacks to CTTs and these will be outlined in this chapter as well.

I. Test construction: Introduction and Overview

A. Definition of Psychological Tests:

"an objective and standardized measure of a sample of behavior."

B. Standards of Test Construction and Test Use:

A good test should be reliable and valid. Concerns relating to standards include user
qualification, security of test content, confidentiality of test results, and the prevention of the
misuse of tests and results.

C. Test Characteristics and Response sets.

Characteristics include

i. a test of maximum performance (e.g., achievement test) which tells us what a person can do.

ii. a test of typical performance (e.g., personality test) which tells us what a person usually does.

iii. a speed test, in which response rate is assessed.

iv. a mastery test assesses whether or not the person can attain a pre-specified mastery level of performance.

Response sets include: social desirability (giving responses that are perceived to be socially acceptable), acquiescence (agreeing or disagreeing with everything), and deviation (giving unusual or uncommon responses).

All of the above can threaten the validity of a given set of results.

II. Reliability

A. Classical test theory states:

a test is reliable a) to the degree that it is free from error and provides information about
examinees’ "true" test scores and b) to the degree that it provides repeatable, consistent results.

B. Methods of estimating reliability

[we did not go over this as it involves calculating a reliability coefficient--a correlation
coefficient--which we have not discussed yet]

C. Standard error of measurement.

[we skipped over, as we are not ready to discuss yet]

D. Factors affecting the reliability coefficient

Any factor which reduces score variability or increases measurement error will also reduce the reliability coefficient. For example, all other things being equal, short tests are less reliable than long ones, very easy and very difficult tests are less reliable than moderately difficult tests, and tests where examinees' scores are affected by guessing (e.g., true-false) have lowered reliability coefficients.

III. Validity

Three major categories:

content, criterion-related, and construct validity

1) content validity:

A test has content validity if it measures knowledge of the content domain it was designed to measure. Another way of saying this is that content validity concerns, primarily, how adequately and representatively the test items sample the content area to be measured. For example, a comprehensive math achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of math (e.g., algebra). Content validity is primarily an issue for educational tests, certain industrial tests, and other tests of content knowledge like the Psychology Licensing Exam.

Expert judgement (not statistics) is the primary method used to determine whether a test has
content validity. Nevertheless, the test should have a high correlation w/other tests that purport
to sample the same content domain.

This is different from face validity: face validity is when a test appears valid to examinees who take it, personnel who administer it, and other untrained observers. Face validity is not validity in the technical sense; i.e., just because a test has face validity does not mean it will be valid in the technical sense of the word. "Just cause it looks valid doesn't mean it is."

2) criterion-related validity:

Criterion-related validity is a concern for tests that are designed to predict someone’s status on
an external criterion measure. A test has criterion-related validity if it is useful for predicting a
person’s behavior in a specified situation.

2a) Concurrent vs. predictive validation:

First the term "validation" refers to the procedures used to determine how valid a predictor is.
There are two types.

In concurrent validation, the predictor and criterion data are collected at or about the same time. This kind of validation is appropriate for tests designed to assess a person's current criterion status, e.g., diagnostic screening tests when you want to diagnose.

In predictive validation, the predictor scores are collected first and criterion data are collected at some later/future point. This is appropriate for tests designed to assess a person's future status on a criterion.

2b) Standard Error of estimate

The standard error of estimate (s est) is used to estimate the range in which a person’s true
score on a criterion is likely to fall, given his/her score as estimated by a predictor.

2c) Decision-Making

In many cases when using predictor tests, the goal is to predict whether or not a person will
meet or exceed a minimum standard of criterion performance — the criterion cutoff point. When
a predictor is to be used in this manner, the goal of the validation study is to set an optimal
predictor cutoff score; an examinee who scores at or above the predictor cutoff is predicted to
score at or above the criterion cutoff.

We then get

True positives (or valid acceptances): accurately identified by the predictor as meeting the criterion standard.

False positives (or false acceptances): incorrectly identified by the predictor as meeting the criterion standard.

True negatives (valid rejections): accurately identified by the predictor as not meeting the criterion standard.

False negatives (invalid rejections): the person meets the criterion standard, even though the predictor indicated s/he wouldn't.
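A sketch (not from the lecture) of the four outcomes, assuming we have a predictor score, a criterion score, and the two cutoffs; the numbers in the usage line are made up.

    def classify(predictor, criterion, pred_cut, crit_cut):
        """Label one examinee as a true/false positive/negative."""
        predicted_pass = predictor >= pred_cut
        actual_pass = criterion >= crit_cut
        if predicted_pass and actual_pass:
            return "true positive (valid acceptance)"
        if predicted_pass and not actual_pass:
            return "false positive (false acceptance)"
        if not predicted_pass and actual_pass:
            return "false negative (invalid rejection)"
        return "true negative (valid rejection)"

    # Hypothetical applicant: 75 on the predictor (cutoff 70), 55 on the criterion (cutoff 60).
    print(classify(75, 55, pred_cut=70, crit_cut=60))  # false positive (false acceptance)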

2d) Factors affecting the criterion-related validity coefficient:

This is about factors that potentially affect the magnitude of the criterion-related validity
coefficient. We will not go into this as it relates to correlation.

3) Construct validity

A test has construct validity if it accurately measures a theoretical, non-observable construct or trait. The construct validity of a test is worked out over a period of time on the basis of an accumulation of evidence. There are a number of ways to establish construct validity.

Two methods of establishing a test’s construct validity are convergent/divergent validation and
factor analysis.

3a) Convergent/divergent validation

A test has convergent validity if it has a high correlation with another test that measures the
same construct. By contrast, a test’s divergent validity is demonstrated through a low correlation
with a test that measures a different construct. Note this is the only case when a low correlation coefficient (between two tests that measure different traits) provides evidence of high validity.

The multitrait-multimethod matrix is one way to assess a test’s convergent and divergent
validity. We will not get into this right now.

3b) Factor analysis

Factor analysis is a complex statistical procedure which is conducted for a variety of purposes,
one of which is to assess the construct validity of a test or a number of tests. We will get there.

3c) Other methods of assessing construct validity:

We can assess the test's internal consistency. That is, if a test has construct validity, scores on the individual test items should correlate highly with the total test score. This is evidence that the test is measuring a single construct.

Also, developmental changes: tests measuring certain constructs can be shown to have construct validity if the scores on the tests show predictable developmental changes over time.

And experimental intervention: that is, if a test has construct validity, scores should change following an experimental manipulation, in the direction predicted by the theory underlying the construct.

4) Relationship between reliability and validity

If a test is unreliable, it cannot be valid.

For a test to be valid, it must be reliable.

However, just because a test is reliable does not mean it will be valid.

Reliability is a necessary but not sufficient condition for validity!

IV. Item Analysis

There are a variety of techniques for performing an item analysis, which is often used, for example, to determine which items will be kept for the final version of a test. Item analysis is used to help "build" reliability and validity into the test from the start. Item analysis can be both qualitative and quantitative. The former focuses on issues related to the content of the test, e.g., content validity. The latter primarily includes measurement of item difficulty and item discrimination.

item difficulty: an item’s difficulty level is usually measured in terms of the percentage of
examinees who answer the item correctly. This percentage is referred to as the item difficulty
index, or "p".

Item discrimination refers to the degree to which items differentiate among examinees in terms of the characteristic being measured (e.g., between high and low scorers). This can be measured in many ways. One method is to correlate item responses with the total test score; items with the highest correlation with the total score are retained for the final version of the test. This would be appropriate when a test measures only one attribute and internal consistency is important.

Another way is a discrimination index (D).
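A minimal sketch (not from the lecture) of the two item statistics just named: the difficulty index p and an item-total correlation used for discrimination. The 0/1 response data and totals below are made up for illustration.

    def difficulty(item_responses):
        """p = proportion of examinees answering the item correctly (1 = correct, 0 = incorrect)."""
        return sum(item_responses) / len(item_responses)

    def item_total_correlation(item_responses, total_scores):
        """Pearson correlation between an item (0/1) and the total test score."""
        n = len(item_responses)
        mi = sum(item_responses) / n
        mt = sum(total_scores) / n
        cov = sum((i - mi) * (t - mt) for i, t in zip(item_responses, total_scores)) / n
        sdi = (sum((i - mi) ** 2 for i in item_responses) / n) ** 0.5
        sdt = (sum((t - mt) ** 2 for t in total_scores) / n) ** 0.5
        return cov / (sdi * sdt)

    item = [1, 1, 1, 0, 0, 1]              # hypothetical responses of 6 examinees to one item
    totals = [48, 45, 40, 22, 30, 35]      # their total test scores
    print(round(difficulty(item), 2))                      # 0.67 -> a moderately easy item
    print(round(item_total_correlation(item, totals), 2))  # 0.85 -> the item discriminates well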

V. Interpretation of Test Scores

1. norm-referenced interpretation:

involves comparing an examinee's score to that of others in a normative sample. It provides an indication of where the examinee stands in relation to others who have taken the test. Norm-referenced scores include developmental scores (e.g., mental age scores and grade equivalents), which indicate how far along the normal developmental path an individual has progressed, and within-group norms, which provide a comparison to the scores of other individuals whom the examinee most resembles.

2. Criterion-referenced interpretation:

involves interpreting an examinee's test score in terms of an external pre-established standard of performance.

Inter-rater reliability is the extent to which two or more raters (or observers, coders, examiners)
agree. It addresses the issue of consistency of the implementation of a rating system. Inter-rater
reliability can be evaluated by using a number of different statistics. Some of the more common
statistics include: percentage agreement, kappa, product–moment correlation, and intraclass
correlation coefficient. High inter-rater reliability values refer to a high degree of agreement
between two examiners. Low inter-rater reliability values refer to a low degree of agreement
between two examiners.

Chapter 3: Understanding Test Quality-Concepts of Reliability and Validity

Test reliability and validity are two technical properties of a test that indicate the quality and
usefulness of the test. These are the two most important features of a test. You should examine
these features when evaluating the suitability of the test for your use. This chapter provides a
simplified explanation of these two complex ideas. These explanations will help you to
understand reliability and validity information reported in test manuals and reviews and use that
information to evaluate the suitability of a test for your use.

Chapter Highlights

1. What makes a good test?

2. Test reliability

3. Interpretation of reliability information from test manuals and reviews

4. Types of reliability estimates

5. Standard error of measurement

6. Test validity

7. Methods for conducting validation studies

8. Using validity evidence from outside studies

9. How to interpret validity information from test manuals and independent reviews.

Principles of Assessment Discussed

Use only reliable assessment instruments and procedures. Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used. Use assessment tools that are appropriate for the target population.

What makes a good test?

An employment test is considered "good" if the following can be said about it:

• The test measures what it claims to measure consistently or reliably. This means that if a
person were to take the test again, the person would get a similar test score.

• The test measures what it claims to measure. For example, a test of mental ability does
in fact measure mental ability, and not some other characteristic.

• The test is job-relevant. In other words, the test measures one or more characteristics
that are important to the job.

• By using the test, more effective employment decisions can be made about individuals.
For example, an arithmetic test may help you to select qualified workers for a job that
requires knowledge of arithmetic operations.

The degree to which a test has these qualities is indicated by two technical
properties: reliability and validity.

Test reliability

Reliability refers to how dependably or consistently a test measures a characteristic. If a person takes the test again, will he or she get a similar test score, or a much different score? A test that yields similar scores for a person who repeats the test is said to measure a characteristic reliably.

How do we account for an individual who does not get exactly the same test score every time
he or she takes the test? Some possible reasons are the following:

• Test taker's temporary psychological or physical state. Test performance can be influenced by a person's psychological or physical state at the time of testing. For example, differing levels of anxiety, fatigue, or motivation may affect the applicant's test results.

• Environmental factors. Differences in the testing environment, such as room temperature, lighting, noise, or even the test administrator, can influence an individual's test performance.

• Test form. Many tests have more than one version or form. Items differ on each form,
but each form is supposed to measure the same thing. Different forms of a test are
known as parallel forms or alternate forms. These forms are designed to have similar
measurement characteristics, but they contain different items. Because the forms are not
exactly the same, a test taker might do better on one form than on another.

• Multiple raters. In certain tests, scoring is determined by a rater's judgments of the test
taker's performance or responses. Differences in training, experience, and frame of
reference among raters can produce different test scores for the test taker.

Principle of Assessment: Use only reliable assessment instruments and procedures. In other
words, use only assessment tools that provide dependable and consistent information.

These factors are sources of chance or random measurement error in the assessment process.
If there were no random errors of measurement, the individual would get the same test score,
the individual's "true" score, each time. The degree to which test scores are unaffected by
measurement errors is an indication of the reliability of the test.

Reliable assessment tools produce dependable, repeatable, and consistent information about
people. In order to meaningfully interpret test scores and make useful employment or career-
related decisions, you need reliable tools. This brings us to the next principle of assessment.

Interpretation of reliability information from test manuals and reviews

Test manuals and independent review of tests provide information on test reliability. The
following discussion will help you interpret the reliability information about any test.

The reliability of a test is indicated by the reliability coefficient. It is denoted by the letter "r,"
and is expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability,
and r = 1.00 indicating perfect reliability. Do not expect to find a test with perfect reliability.
Generally, you will see the reliability of a test as a decimal, for example, r = .80 or r = .93. The
larger the reliability coefficient, the more repeatable or reliable the test scores. Table 1 serves
as a general guideline for interpreting test reliability. However, do not select or reject a test
solely based on the size of its reliability coefficient. To evaluate a test's reliability, you should
consider the type of test, the type of reliability estimate reported, and the context in which the
test will be used.

Table 1. General Guidelines for Interpreting Reliability Coefficients

Reliability coefficient value   Interpretation
.90 and up                      excellent
.80 - .89                       good
.70 - .79                       adequate
below .70                       may have limited applicability

Types of reliability estimates

There are several types of reliability estimates, each influenced by different sources of
measurement error. Test developers have the responsibility of reporting the reliability estimates
that are relevant for a particular test. Before deciding to use a test, read the test manual and any
independent reviews to determine if its reliability is acceptable. The acceptable level of reliability
will differ depending on the type of test and the reliability estimate used.

The discussion in Table 2 should help you develop some familiarity with the different kinds of
reliability estimates reported in test manuals and reviews.

Table 2. Types of Reliability Estimates

• Test-retest reliability indicates the repeatability of test scores with the passage of time.
This estimate also reflects the stability of the characteristic or construct being measured
by the test.

Some constructs are more stable than others. For example, an individual's reading
ability is more stable over a particular period of time than that individual's anxiety level.
Therefore, you would expect a higher test-retest reliability coefficient on a reading test
than you would on a test that measures anxiety. For constructs that are expected to vary
over time, an acceptable test-retest reliability coefficient may be lower than is suggested
in Table 1.

• Alternate or parallel form reliability indicates how consistent test scores are likely to
be if a person takes two or more forms of a test.

A high parallel form reliability coefficient indicates that the different forms of the test are
very similar which means that it makes virtually no difference which version of the test a
person takes. On the other hand, a low parallel form reliability coefficient suggests that
the different forms are probably not comparable; they may be measuring different things
and therefore cannot be used interchangeably.

• Inter-rater reliability indicates how consistent test scores are likely to be if the test is
scored by two or more raters.

On some tests, raters evaluate responses to questions and determine the score.
Differences in judgments among raters are likely to produce variations in test scores. A
high inter-rater reliability coefficient indicates that the judgment process is stable and the
resulting scores are reliable.

Inter-rater reliability coefficients are typically lower than other types of reliability
estimates. However, it is possible to obtain higher levels of inter-rater reliabilities if raters
are appropriately trained.

• Internal consistency reliability indicates the extent to which items on a test measure
the same thing.

A high internal consistency reliability coefficient for a test indicates that the items on the
test are very similar to each other in content (homogeneous). It is important to note that
the length of a test can affect internal consistency reliability. For example, a very lengthy
test can spuriously inflate the reliability coefficient.

Tests that measure multiple characteristics are usually divided into distinct components.
Manuals for such tests typically report a separate internal consistency reliability
coefficient for each component in addition to one for the whole test.

Test manuals and reviews report several kinds of internal consistency reliability
estimates. Each type of estimate is appropriate under certain circumstances. The test
manual should explain why a particular estimate is reported.

Standard error of measurement

Test manuals report a statistic called the standard error of measurement (SEM). It gives the
margin of error that you should expect in an individual test score because of imperfect reliability
of the test. The SEM represents the degree of confidence that a person's "true" score lies within
a particular range of scores. For example, an SEM of "2" indicates that a test taker's "true" score
probably lies within 2 points in either direction of the score he or she receives on the test. This
means that if an individual receives a 91 on the test, there is a good chance that the person's
"true" score lies somewhere between 89 and 93.

The SEM is a useful measure of the accuracy of individual test scores. The smaller the SEM,
the more accurate the measurements.
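The chapter treats the SEM as something the manual reports. When a manual gives only the test's SD and reliability, the standard relation SEM = SD × sqrt(1 − r) (not stated in the text above) can be used; a sketch:

    import math

    def sem(sd, reliability):
        """Standard error of measurement: SD * sqrt(1 - r)."""
        return sd * math.sqrt(1 - reliability)

    def score_band(observed, sem_value, width=1):
        """Band of width * SEM on either side of an observed score."""
        return observed - width * sem_value, observed + width * sem_value

    print(score_band(91, 2))        # (89, 93) -- the example from the text
    print(round(sem(10, 0.91), 1))  # 3.0 for a T-scaled test (SD = 10) with r = .91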

When evaluating the reliability coefficients of a test, it is important to review the explanations
provided in the manual for the following:

• Types of reliability used. The manual should indicate why a certain type of reliability
coefficient was reported. The manual should also discuss sources of random
measurement error that are relevant for the test.

• How reliability studies were conducted. The manual should indicate the conditions
under which the data were obtained, such as the length of time that passed between
administrations of a test in a test-retest reliability study. In general, reliabilities tend to
drop as the time between test administrations increases.

• The characteristics of the sample group. The manual should indicate the important
characteristics of the group used in gathering reliability information, such as education
level, occupation, etc. This will allow you to compare the characteristics of the people
you want to test with the sample group. If they are sufficiently similar, then the reported
reliability estimates will probably hold true for your population as well.

For more information on reliability, consult the APA Standards, the SIOP Principles, or any
major textbook on psychometrics or employment testing. Appendix A lists some possible
sources.

Test validity

Validity is the most important issue in selecting a test. Validity refers to what characteristic the
test measures and how well the test measures that characteristic.

• Validity tells you if the characteristic being measured by a test is related to job
qualifications and requirements.

• Validity gives meaning to the test scores. Validity evidence indicates that there is linkage
between test performance and job performance. It can tell you what you may conclude
or predict about someone from his or her score on the test. If a test has been
demonstrated to be a valid predictor of performance on a specific job, you can conclude
that persons scoring high on the test are more likely to perform well on the job than
persons who score low on the test, all else being equal.

• Validity also describes the degree to which you can make specific conclusions or
predictions about people based on their test scores. In other words, it indicates the
usefulness of the test.

Principle of Assessment: Use only assessment procedures and instruments that have been
demonstrated to be valid for the specific purpose for which they are being used.

It is important to understand the differences between reliability and validity. Validity will tell you how good a test is for a particular situation; reliability will tell you how trustworthy a score on that test will be. You cannot draw valid conclusions from a test score unless you are sure that the test is reliable. Even when a test is reliable, it may not be valid. You should be careful that any test you select is both reliable and valid for your situation.

A test's validity is established in reference to a specific purpose; the test may not be valid for
different purposes. For example, the test you use to make valid predictions about someone's
technical proficiency on the job may not be valid for predicting his or her leadership skills or
absenteeism rate. This leads to the next principle of assessment.

Similarly, a test's validity is established in reference to specific groups. These groups are called
the reference groups. The test may not be valid for different groups. For example, a test
designed to predict the performance of managers in situations requiring problem solving may
not allow you to make valid or meaningful predictions about the performance of clerical
employees. If, for example, the kind of problem-solving ability required for the two positions is
different, or the reading level of the test is not suitable for clerical applicants, the test results
may be valid for managers, but not for clerical employees.

Test developers have the responsibility of describing the reference groups used to develop the test. The manual should describe the groups for whom the test is valid, and the interpretation of scores for individuals belonging to each of these groups. You must determine if the test can be used appropriately with the particular type of people you want to test. This group of people is called your target population or target group.

Principle of Assessment: Use assessment tools that are appropriate for the target population.

Your target group and the reference group do not have to match on all factors; they must be
sufficiently similar so that the test will yield meaningful scores for your group. For example, a
writing ability test developed for use with college seniors may be appropriate for measuring the
writing ability of white-collar professionals or managers, even though these groups do not have
identical characteristics. In determining the appropriateness of a test for your target groups,
consider factors such as occupation, reading level, cultural differences, and language barriers.

Recall that the Uniform Guidelines require assessment tools to have adequate supporting evidence for the conclusions you reach with them in the event adverse impact occurs. A valid personnel tool is one that measures an important characteristic of the job you are interested in. Use of valid tools will, on average, enable you to make better employment-related decisions. Both from business-efficiency and legal viewpoints, it is essential to only use tests that are valid for your intended use.

In order to be certain an employment test is useful and valid, evidence must be collected
relating the test to a job. The process of establishing the job relatedness of a test is
called validation.

Methods for conducting validation studies

The Uniform Guidelines discuss the following three methods of conducting validation studies. The Guidelines describe conditions under which each type of validation strategy is appropriate. They do not express a preference for any one strategy to demonstrate the job-relatedness of a test.

• Criterion-related validation requires demonstration of a correlation or other statistical relationship between test performance and job performance. In other words, individuals who score high on the test tend to perform better on the job than those who score low on the test. If the criterion is obtained at the same time the test is given, it is called concurrent validity; if the criterion is obtained at a later time, it is called predictive validity.

• Content-related validation requires a demonstration that the content of the test represents important job-related behaviors. In other words, test items should be relevant to and directly measure important requirements and qualifications for the job.

• Construct-related validation requires a demonstration that the test measures the construct or characteristic it claims to measure, and that this characteristic is important to successful performance on the job.

The three methods of validity (criterion-related, content, and construct) should be used to provide validation support depending on the situation. These three general methods often overlap, and, depending on the situation, one or more may be appropriate. French (1990) offers situational examples of when each method of validity may be applied.

First, as an example of criterion-related validity, take the position of millwright. Employees' scores (predictors) on a test designed to measure mechanical skill could be correlated with their performance in servicing machines (criterion) in the mill. If the correlation is high, it can be said that the test has a high degree of validation support, and its use as a selection tool would be appropriate.

Second, the content validation method may be used when you want to determine if there is a
relationship between behaviors measured by a test and behaviors involved in the job. For
example, a typing test would provide high validation support for a secretarial position, assuming
much typing is required each day. If, however, the job required only minimal typing, then the
same test would have little content validity. Content validity does not apply to tests measuring
learning ability or general problem-solving skills (French, 1990).

Finally, the third method is construct validity. This method often pertains to tests that may
measure abstract traits of an applicant. For example, construct validity may be used when a
bank desires to test its applicants for "numerical aptitude." In this case, an aptitude is not an
observable behavior, but a concept created to explain possible future behaviors. To
demonstrate that the test possesses construct validation support, ". . . the bank would need to
show (1) that the test did indeed measure the desired trait and (2) that this trait corresponded to
success on the job" (French, 1990, p. 260).

Professionally developed tests should come with reports on validity evidence, including detailed
explanations of how validation studies were conducted. If you develop your own tests or
procedures, you will need to conduct your own validation studies. As the test user, you have the
ultimate responsibility for making sure that validity evidence exists for the conclusions you reach
using the tests. This applies to all tests and procedures you use, whether they have been
bought off-the-shelf, developed externally, or developed in-house.

Validity evidence is especially critical for tests that have adverse impact. When a test has
adverse impact, the Uniform Guidelines require that validity evidence for that specific
employment decision be provided.

The particular job for which a test is selected should be very similar to the job for which the test
was originally developed. Determining the degree of similarity will require a job analysis. Job
analysis is a systematic process used to identify the tasks, duties, responsibilities and working
conditions associated with a job and the knowledge, skills, abilities, and other characteristics
required to perform that job.

Job analysis information may be gathered by direct observation of people currently in the job, interviews with experienced supervisors and job incumbents, questionnaires, personnel and
equipment records, and work manuals. In order to meet the requirements of the Uniform
Guidelines, it is advisable that the job analysis be conducted by a qualified professional, for
example, an industrial and organizational psychologist or other professional well trained in job
analysis techniques. Job analysis information is central in deciding what to test for and which
tests to use.

Using validity evidence from outside studies

Conducting your own validation study is expensive, and, in many cases, you may not have
enough employees in a relevant job category to make it feasible to conduct a study. Therefore,
you may find it advantageous to use professionally developed assessment tools and procedures
for which documentation on validity already exists. However, care must be taken to make sure
that validity evidence obtained for an "outside" test study can be suitably "transported" to your
particular situation.

The Uniform Guidelines, the Standards, and the SIOP Principles state that evidence of transportability is required. Consider the following when using outside tests:

• Validity evidence. The validation procedures used in the studies must be consistent
with accepted standards.

• Job similarity. A job analysis should be performed to verify that your job and the
original job are substantially similar in terms of ability requirements and work behavior.

• Fairness evidence. Reports of test fairness from outside studies must be considered for
each protected group that is part of your labor market. Where this information is not
available for an otherwise qualified test, an internal study of test fairness should be
conducted, if feasible.

• Other significant variables. These include the type of performance measures and
standards used, the essential work activities performed, the similarity of your target
group to the reference samples, as well as all other situational factors that might affect
the applicability of the outside test for your use.

To ensure that the outside test you purchase or obtain meets professional and legal standards,
you should consult with testing professionals. See Chapter 5 for information on locating
consultants.

How to interpret validity information from test manuals and independent reviews

To determine if a particular test is valid for your intended use, consult the test manual and
available independent reviews. (Chapter 5 offers sources for test reviews.) The information
below can help you interpret the validity evidence reported in these publications.

In evaluating validity information, it is important to determine whether the test can be used in the
specific way you intended, and whether your target group is similar to the test reference group.

Test manuals and reviews should describe

• Available validation evidence supporting use of the test for specific purposes. The
manual should include a thorough description of the procedures used in the validation
studies and the results of those studies.

• The possible valid uses of the test. The purposes for which the test can legitimately be
used should be described, as well as the performance criteria that can validly be
predicted.

• The sample group(s) on which the test was developed. For example, was the test
developed on a sample of high school graduates, managers, or clerical workers? What
was the racial, ethnic, age, and gender mix of the sample?

• The group(s) for which the test may be used.

The criterion-related validity of a test is measured by the validity coefficient. It is reported as a number between 0 and 1.00 that indicates the magnitude of the relationship, "r," between the test and a measure of job performance (criterion). The larger the validity coefficient, the more confidence you can have in predictions made from the test scores. However, a single test can never fully predict job performance because success on the job depends on so many varied factors. Therefore, validity coefficients, unlike reliability coefficients, rarely exceed r = .40.

Table 3. General Guidelines for Interpreting Validity Coefficients

Validity coefficient value   Interpretation
above .35                    very beneficial
.21 - .35                    likely to be useful
.11 - .20                    depends on circumstances
below .11                    unlikely to be useful

As a general rule, the higher the validity coefficient the more beneficial it is to use the test. Validity coefficients of r = .21 to r = .35 are typical for a single test. Validities for selection systems that use multiple tests will probably be higher because you are using different tools to measure/predict different aspects of performance, where a single test is more likely to measure or predict fewer aspects of total performance. Table 3 serves as a general guideline for interpreting test validity for a single test.
Evaluating test validity is a sophisticated task, and you might require the services of a testing
expert. In addition to the magnitude of the validity coefficient, you should also consider at a
minimum the following factors:

• level of adverse impact associated with your assessment tool

• selection ratio (number of applicants versus the number of openings)

• cost of a hiring error

• cost of the selection tool

• probability of hiring a qualified applicant based on chance alone.

Here are three scenarios illustrating why you should consider these factors, individually and in
combination with one another, when evaluating validity coefficients:

Scenario One
You are in the process of hiring applicants where you have a high selection ratio and are filling
positions that do not require a great deal of skill. In this situation, you might be willing to accept
a selection tool that has validity considered "likely to be useful" or even "depends on
circumstances" because you need to fill the positions, you do not have many applicants to
choose from, and the level of skill required is not that high.

Now, let's change the situation.

Scenario Two
You are recruiting for jobs that require a high level of accuracy, and a mistake made by a worker
could be dangerous and costly. With these additional factors, a slightly lower validity coefficient
would probably not be acceptable to you because hiring an unqualified worker would be too
much of a risk. In this case you would probably want to use a selection tool that reported
validities considered to be "very beneficial" because a hiring error would be too costly to your
company.

Here is another scenario that shows why you need to consider multiple factors when evaluating
the validity of assessment tools.

Scenario Three
A company you are working for is considering using a very costly selection system that results
in fairly high levels of adverse impact. You decide to implement the selection tool because the
assessment tools you found with lower adverse impact had substantially lower validity, were just
as costly, and making mistakes in hiring decisions would be too much of a risk for your
company. Your company decided to implement the assessment given the difficulty in hiring for
the particular positions, the "very beneficial" validity of the assessment and your failed attempts
to find alternative instruments with less adverse impact. However, your company will continue
efforts to find ways of reducing the adverse impact of the system.

Again, these examples demonstrate the complexity of evaluating the validity of assessments.
Multiple factors need to be considered in most situations. You might want to seek the assistance
of a testing expert (for example, an industrial/organizational psychologist) to evaluate the
appropriateness of particular assessments for your employment situation.

When properly applied, the use of valid and reliable assessment instruments will help you make better decisions. Additionally, by using a variety of assessment tools as part of an assessment program, you can more fully assess the skills and capabilities of people, while reducing the effects of errors associated with any one tool on your decision making.
