
NORMS & RELIABILITY

Chrissa Toh Xin Ying (MPP201036)
Tey Miao Hui (MPP202022)
TABLE OF CONTENTS

01 Norms and Test Standardization
a. Raw Scores
b. Essential Statistical Concepts
c. Raw Score Transformations
d. Selecting a Norm Group
e. Criterion-Referenced Tests

02 Concepts of Reliability
a. Concept of Reliability
b. Classical Test Theory
c. Sources of Measurement Error
d. Unsystematic Measurement Error
e. Measurement Error and Reliability
f. The Reliability Coefficient & the Correlation Coefficient
g. Reliability as Temporal Stability
h. Reliability as Internal Consistency
i. Item Response Theory
j. Special Circumstances in the Estimation of Reliability
k. Reliability and the Standard Error of Measurement; the Standard Error of the Difference
Norms and Test Standardization

1. Norms and Test Standardization reviews the process of standardizing a test against an appropriate norm group.

2. A norm group consists of a sample of examinees who are representative of the population for whom the test is intended.

3. For example, a population of the same age, grade, sex, and so on.
01 Raw Scores

● The most basic level of information provided by a psychological test, for example, the number of questions answered correctly.

● A raw score becomes meaningful mainly in relation to norms, which are derived by administering the test to a large and representative sample of persons.

[Figures: sample raw scores from a personality test and an ability test]
02 Essential Statistical Concepts

➔ Statistics summarize, condense, and organize a collection of quantitative data into meaningful patterns.
➔ Approaches to organizing and summarizing quantitative data:
1. Frequency Distributions
2. Measures of Central Tendency
3. Measures of Variability
4. The Normal Distribution
5. Skewness
a. Frequency Distributions

A frequency distribution is prepared by specifying a small number of usually equal-sized class intervals and then tallying how many scores fall within each interval.

It is common for frequency distributions to include between 5 and 15 class intervals.
Histogram: the height of a column indicates the number of scores occurring within that interval.

Frequency Polygon: similar to a histogram, except that the frequency of each class interval is represented by a single point rather than a column.
b. Measures of Central Tendency

Mean (M): the arithmetic average of the scores; one such measure of central tendency.
Median: the middlemost score when all the scores have been ranked.
Mode: the most frequently occurring score.

If a distribution of scores is skewed, the median is a better index of central tendency than the mean.
c. Measures of Variability

1. To describe the degree of dispersion, we need a statistical index that expresses the variability of scores in the distribution.
2. The most commonly used statistical index of variability in a group of scores is the standard deviation, designated as s or abbreviated as SD.
3. The standard deviation is the preferred measure of variability in psychological testing because of its direct relevance to the normal distribution.
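
To make the computation concrete, here is a minimal sketch (not from the original slides) that computes the standard deviation of a set of hypothetical raw scores using the population formula; sample estimates divide by N − 1 instead of N.

```python
import math

def standard_deviation(scores):
    """Population standard deviation: the square root of the
    average squared deviation from the mean."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / n
    return math.sqrt(variance)

raw_scores = [12, 15, 17, 17, 20, 23, 24, 28]  # hypothetical raw scores
print(round(standard_deviation(raw_scores), 2))
```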
d. Skewness

Skewness refers to the symmetry or asymmetry of a frequency distribution.

Positive Skew: the test probably contains too few easy items to make effective discriminations at this end of the scale.

Negative Skew: the test probably contains too few hard items to make effective discriminations at this end of the scale.
03 Raw Score Transformations

➔ Raw scores are transformed into more interpretable and useful forms of information.
➔ Several kinds of raw score transformations:
1. Percentiles and Percentile Ranks
2. Standard Scores
3. T Scores and Other Standardized Scores
4. Normalizing Standard Scores
5. Stanines, Stens, and C Scale
a. Percentiles and Percentile Ranks

1. A percentile expresses the percentage of persons in the standardization sample who scored below a specific raw score. For example, a percentile of 94 is denoted as P94.
2. Higher percentiles indicate higher scores.
3. A percentile indicates only how an examinee compares to the standardization sample.
4. Percentiles are the most common type of raw score transformation encountered in psychological testing (for example, in reporting the IQ scores obtained from intelligence tests).
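
A percentile rank can be illustrated with a few lines of code; the normative scores below are hypothetical.

```python
def percentile_rank(score, norm_sample):
    """Percentage of the normative sample scoring below the given raw score."""
    below = sum(1 for s in norm_sample if s < score)
    return 100 * below / len(norm_sample)

norms = [48, 52, 55, 57, 60, 61, 63, 66, 70, 75]  # hypothetical normative sample
print(percentile_rank(63, norms))  # 60.0 -> P60
```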
b. Standard Scores

1. A standard score uses the standard deviation of the total distribution of raw scores as the fundamental unit of measurement.
2. A standard score expresses the distance from the mean in standard deviation units: z = (X − M) / SD, where X is the raw score, M the mean, and SD the standard deviation.
3. A standard score expresses not only the magnitude of deviation from the mean but also the direction of departure (positive or negative).
c. T Scores and Other Standardized Scores

1. Standardized scores are identical to standard scores in the sense that both kinds of scores contain exactly the same information. The difference is that standardized scores are always expressed as positive whole numbers (no decimal fractions or negative signs).
2. One popular kind of standardized score is the T score, which has a mean of 50 and a standard deviation of 10 (T = 50 + 10z). T score scales are especially common with personality tests.
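
The two transformations can be combined in a short sketch; the scale mean and SD below are hypothetical norms, not values from the slides.

```python
def z_score(raw, mean, sd):
    """Distance from the mean in standard deviation units."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """Linear T transformation: mean 50, standard deviation 10."""
    return 50 + 10 * z_score(raw, mean, sd)

# Hypothetical norms: mean 40, SD 8 on some personality scale
print(z_score(52, 40, 8))   # 1.5 SDs above the mean
print(t_score(52, 40, 8))   # T = 65.0
```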
d. Normalizing Standard Scores

1. A normalized standard score is obtained by converting each raw score to a percentile and then locating the standard score that would correspond to that percentile in a normal distribution.
2. In practice, normalized standard scores are used sparingly. Such transformations are appropriate only when the normative sample is large and representative and the raw score distribution is only mildly nonnormal.
e. Stanines, Stens, and C Scale

1. In a stanine scale, all raw scores are converted to a single-digit system of scores ranging from 1 to 9. The mean of stanine scores is always 5, and the standard deviation is approximately 2.
2. The transformation from raw scores to stanines is simple: the scores are ranked from lowest to highest, the bottom 4 percent of scores convert to a stanine of 1, the next 7 percent convert to a stanine of 2, and so on, as sketched below.
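
A minimal sketch of this percentage-based conversion, assuming the standard stanine percentages (4, 7, 12, 17, 20, 17, 12, 7, 4 from lowest to highest):

```python
from bisect import bisect_left

# Cumulative percentage cut points for stanines 1..9
CUM = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a stanine of 1-9."""
    return bisect_left(CUM, percentile_rank) + 1

print(stanine(3))    # 1
print(stanine(50))   # 5
print(stanine(98))   # 9
```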
04 Selecting a Norm Group

● A norm group is the sample of test-takers who are representative of the population for whom the test is intended.
● Simple random sampling: everyone has an equal chance of being chosen to take the test.
● Stratified random sampling: consists of stratifying, or classifying, the target population on important background variables (e.g., age, sex, race, social class, educational level) and then selecting an appropriate percentage of persons at random from each stratum, as in the sketch below.
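
A rough sketch of stratified random sampling, assuming hypothetical person records keyed by one background variable (educational level):

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction):
    """Draw the same fraction at random from each stratum.

    population: list of person records
    stratum_of: function mapping a person to a stratum label
    fraction:   proportion to sample within each stratum
    """
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(random.sample(members, k))
    return sample

# Hypothetical population of 1,000 people keyed by educational level
people = [{"id": i, "education": random.choice(["primary", "secondary", "tertiary"])}
          for i in range(1000)]
norm_group = stratified_sample(people, lambda p: p["education"], 0.10)
print(len(norm_group))
```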
Age and Grade Norms

a. An age norm depicts the level of test performance for each separate age group in the normative sample. The purpose of age norms is to facilitate same-aged comparisons.

b. A grade norm depicts the level of test performance for each separate grade in the normative sample. These norms are especially useful in school settings when reporting the achievement levels of schoolchildren (e.g., academic achievement).

Local and Subgroup Norms

a. Local norms are derived from representative local examinees, as opposed to a national sample.

b. Subgroup norms consist of the scores obtained from an identified subgroup (e.g., African Americans, Hispanics, females), as opposed to a diversified national sample. Subgroups can be formed with respect to sex, ethnic background, geographical region, urban versus rural environment, or socioeconomic level.
Expectancy Tables

1. An expectancy table portrays the established relationship between test scores and expected outcome on a relevant task.
2. Expectancy tables are always based on the previous predictor and criterion results for large samples of examinees.
3. Based on 7,835 previous examinees who subsequently attended a major university, the expectancy table in Table 3.6 provides the probability of achieving certain first-year college grades as a function of score on the American College Testing (ACT) examination.

● Expectancy tables do not foreordain how new examinees will do on the criterion.
● Expectancy tables provide a broadly accurate preview of criterion performance.
05 Criterion-Referenced Tests

● Criterion-referenced tests are used to compare examinees' accomplishments to a predefined performance standard.
● Criterion-referenced tests focus on what the test taker can do rather than on comparisons to the performance levels of others.
● For example, they are used to evaluate how well students have mastered the academic skills expected at each grade level, to evaluate the curriculum, and to determine how well individual schools are teaching the curriculum. They may serve specific classroom objectives (e.g., meeting a minimal level of proficiency in spelling for sixth graders) or more far-reaching, high-stakes purposes such as determining graduation from high school.
Distinctive Characteristics of Criterion-Referenced and Norm-Referenced Tests

Purpose
- Criterion-referenced: compare examinees' performance to a standard.
- Norm-referenced: compare examinees' performance to one another.

Item Content
- Criterion-referenced: narrow domain of skills with real-world relevance.
- Norm-referenced: broad domain of skills with indirect relevance.

Item Selection
- Criterion-referenced: most items of similar difficulty level.
- Norm-referenced: items vary widely in difficulty level.

Interpretation of Scores
- Criterion-referenced: scores usually expressed as a percentage, with the passing level predetermined.
- Norm-referenced: scores usually expressed as a standard score, percentile, or grade equivalent.

Examples
- Criterion-referenced: testing of basic academic skills (e.g., reading level, computation skill); determining graduation from high school; specific classroom objectives.
- Norm-referenced: diagnosing students' strengths and weaknesses; the diagnosis of learning disabilities; selecting students for specific programs (SAT (Scholastic Assessment Test) and ACT (American College Test)).
02 RELIABILITY
the attribute of consistency in measurement

a. Concept of Reliability

Reliability falls along a continuum ranging from minimal consistency of measurement (e.g., simple reaction time) to near-perfect repeatability of results (e.g., weight). Most psychological tests fall somewhere in between these two extremes: in the short run, measures of weight are highly consistent, intellectual test scores are moderately stable, but simple reaction time is somewhat erratic.
b. Classical Test Theory

Classical test theory was the basis for test development throughout most of the twentieth century. Its basic starting point is the idea that test scores result from the influence of two kinds of factors:

01 Factors that contribute to consistency. These consist entirely of the stable attributes of the individual, which the examiner is trying to measure. They represent the true amount of the attribute in question.

02 Factors that contribute to inconsistency. These are characteristics of the individual, test, or situation that have nothing to do with the attribute being measured, but that nonetheless affect test scores. They represent the unavoidable nuisance of error factors that contribute to inaccuracies of measurement.
b. Classical Test Theory

X = T + e

where X is the obtained score, T is the true score, and e represents errors of measurement.

If e is positive, the obtained score X will be higher than the true score T. Conversely, if e is negative, the obtained score will be lower than the true score.

Note: The true score is never known. We can obtain a probability that the true score resides within a certain interval, and we can also derive a best estimate of the true score. However, we can never know the value of a true score with certainty.
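
One way to build intuition for X = T + e is a small simulation (an illustration, not part of the classical theory itself): generate hypothetical true scores, add random error, and note that the obtained-score variance is approximately the sum of true-score and error variance.

```python
import random

random.seed(1)

# Hypothetical true scores (IQ-like: mean 100, SD 15)
true_scores = [random.gauss(100, 15) for _ in range(10_000)]
# Unsystematic measurement error: mean 0, SD 5
obtained = [t + random.gauss(0, 5) for t in true_scores]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Obtained-score variance approximately equals true variance plus error variance
print(variance(obtained))                          # ~ 15**2 + 5**2 = 250
print(variance(true_scores) / variance(obtained))  # ~ 0.90, the reliability
```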
c. Sources of Measurement Error

Item Selection: One source of measurement error is the instrument itself. A test developer must settle on a finite number of items from a potentially infinite pool of test questions. The psychometrician can minimize this unwanted nuisance by attending carefully to issues of test construction.

Test Administration: Factors that directly or indirectly influence the accuracy of measurement include: uncomfortable room temperature, dim lighting, and excessive noise; momentary fluctuations in the anxiety, motivation, attention, and fatigue level of the test taker; and the examiner, who may contribute to measurement error in the process of test administration.

Test Scoring: Some degree of judgment is required to assign points to answers on essay questions or projective tests. Nunnally (1978) points out that the projective tester might undergo an evolutionary change in scoring criteria over time, coming to regard a particular type of response as more and more pathological with each encounter.
d. Unsystematic Measurement Error

X = T + e_s + e_u

where X is the obtained score, T is the true score, e_s is the systematic error due to the anxiety subcomponent, and e_u is the collective effect of the unsystematic measurement errors previously outlined.

Because by definition their presence is initially undetected, systematic measurement errors may constitute a significant problem in the development of psychological tests.

Example: A scale to measure social introversion also inadvertently taps anxiety in a consistent fashion.
e. Measurement Error & Reliability

Measurement error reduces the reliability or repeatability of psychological test results. The variance of obtained scores is the variance of true scores plus the variance of errors of measurement:

σ²_X = σ²_T + σ²_e

Assumptions of classical theory:
1. Measurement errors are random.
2. The mean error of measurement is zero.
3. True scores and errors are uncorrelated: r_Te = 0.
4. Errors on different tests are uncorrelated: r_12 = 0.

The preceding formula demonstrates that test scores vary as the result of two factors: variability in true scores, and variability due to measurement error.
f. The Reliability Coefficient & The Correlation Coefficient

• The reliability coefficient is the ratio of true-score variance to the total variance of test scores: r_XX = σ²_T / σ²_X.

• A correlation coefficient (r) expresses the degree of linear relationship between two sets of scores obtained from the same persons.
• Correlation coefficients can take on values ranging from -1.00 to +1.00.
• The correlation coefficient is independent of the mean.
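
For illustration, a minimal hand-rolled Pearson r on hypothetical paired scores:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: degree of linear relationship between paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

test = [10, 12, 13, 15, 18, 20]    # hypothetical first administration
retest = [11, 13, 12, 16, 17, 21]  # hypothetical second administration
print(round(pearson_r(test, retest), 3))
```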
g. Reliability as Temporal Stability

Test-Retest Reliability
The most straightforward method for determining the reliability of test scores is to administer the identical test twice to the same group of heterogeneous and representative subjects. If the test is perfectly reliable, each person's second score will be completely predictable from his or her first score.

Alternate-Forms Reliability
Alternate forms of a test:
• incorporate similar content and cover the same range and level of difficulty in items.
• possess similar statistical and normative properties.
• introduce item-sampling differences as an additional source of error variance.
• are expensive to develop.
h. Reliability as Internal Consistency

Split-Half Reliability
Split-half reliability is obtained by correlating the pairs of scores obtained from equivalent halves of a test administered only once to a representative sample of examinees. The challenge is to divide the test into two nearly equivalent halves.

The Spearman-Brown Formula
The split-half method gives us an estimate of reliability for an instrument half as long as the full test. In comparison to the shorter test, the longer test embodies equivalent content and similar item difficulty. Thus, the Pearson r between two halves of a test will usually underestimate the reliability of the full instrument. The Spearman-Brown formula provides the appropriate adjustment:

r_SB = 2r_hh / (1 + r_hh)

where r_hh is the correlation between the two half-tests.
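
In code, the correction is one line; the generalized form for lengthening a test by a factor of n is shown, with n = 2 giving the classic split-half adjustment:

```python
def spearman_brown(r_half, n=2):
    """Project the reliability of a test lengthened by a factor of n.
    With n=2 this is the classic split-half correction."""
    return n * r_half / (1 + (n - 1) * r_half)

print(round(spearman_brown(0.70), 3))  # half-test r of .70 -> full-test ~.824
```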
h. Reliability as Internal Consistency

Coefficient Alpha
Coefficient alpha may be thought of as the mean of all possible split-half coefficients, corrected by the Spearman-Brown formula. The formula for coefficient alpha is

α = (N / (N − 1)) × (1 − Σσ²_i / σ²)

where N is the number of items, σ²_i is the variance of item i, and σ² is the variance of total test scores.

Coefficient alpha is an index of the internal consistency of the items, that is, their tendency to correlate positively with one another.
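
A minimal sketch of the computation on a hypothetical three-item scale:

```python
def variance(xs):
    """Population variance (divides by N)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def coefficient_alpha(item_scores):
    """item_scores: one list per item, each holding every examinee's score on that item."""
    n_items = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return (n_items / (n_items - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical 3-item scale, 5 examinees (columns = examinees)
items = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 5, 5],
]
print(round(coefficient_alpha(items), 3))  # ~0.949 for these made-up data
```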
h. Reliability as Internal Consistency

The Kuder-Richardson Estimate of Reliability
The KR-20 is used for items that have varying difficulty. For example, some items might be very easy, others more challenging. It should only be used if there is a single correct answer for each question; it should not be used for questions where partial credit is possible or for scales like the Likert scale.

The scores for KR-20 range from 0 to 1, where 0 is no reliability and 1 is perfect reliability. The closer the score is to 1, the more reliable the test. Just what constitutes an "acceptable" KR-20 score depends on the type of test. In general, a score above .5 is usually considered reasonable.
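
Because the KR-20 is coefficient alpha specialized to dichotomously scored (0/1) items, it can be sketched the same way; the response matrix below is hypothetical:

```python
def kr20(responses):
    """responses: one list per examinee of 0/1 item scores."""
    k = len(responses[0])  # number of items
    n = len(responses)     # number of examinees
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq_sum = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n  # proportion passing item i
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

data = [  # hypothetical answers: rows = examinees, columns = items
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
print(round(kr20(data), 3))  # 0.75 for these made-up data
```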
h. Reliability as Internal Consistency

Interscorer Reliability
Because the scorer can be a major factor in the reliability of these instruments, a report of interscorer reliability is imperative. Computing it is a very straightforward procedure: a sample of tests is independently scored by two or more examiners, and the scores for pairs of examiners are then correlated. Test manuals typically report the training and experience required of examiners and then list representative interscorer correlation coefficients.

Note: Interscorer reliability supplements other reliability estimates but does not replace them.
h. Reliability as Internal Consistency

Summary of Reliability Estimation Methods

Test-retest: 1 form, 2 sessions. Source of error variance: changes over time. Appropriate for tests designed to be administered to individuals more than once, where the test should demonstrate reliability across time.

Alternate forms (immediate): 2 forms, 1 session. Source of error variance: item sampling.

Alternate forms (delayed): 2 forms, 2 sessions. Sources of error variance: item sampling and changes over time.

Split-half: 1 form, 1 session. Sources of error variance: item sampling and the nature of the split. Works well for instruments that have items carefully ordered according to difficulty level.

Coefficient alpha: 1 form, 1 session. Sources of error variance: item sampling and test heterogeneity. Not suitable for all tests; applies only to measures that are designed to assess a single factor.

Interscorer: 1 form, 1 session. Source of error variance: scorer differences. Appropriate for any test that involves subjectivity of scoring.
i. Item Response Theory

Item response theory (IRT):
• a collection of mathematical models and statistical tools with widespread uses.
• The applications of IRT include analyzing items and scales, developing homogeneous psychological measures, measuring individuals on psychological constructs (e.g., depression, intelligence, leadership), and administering psychological tests by computer.
• The foundational elements of IRT include item response functions (IRFs), information functions, and the assumption of invariance.
• The IRT model concludes that test scores are more reliable for individuals of average ability and increasingly less reliable for those with very high or very low ability.

According to classical test theory, the difficulty level of an item is equivalent to the proportion of examinees in a standardization sample who pass the item. In contrast, according to IRT, difficulty is indexed by how much of the trait is needed to answer the item correctly.
i. Item Response Theory

Example of Item Response Functions
• Item A has the lowest difficulty: almost everyone, even examinees possessing only a small amount of the trait in question, can answer it correctly.
• Items B and C are equal in difficulty level: for example, individuals with an average trait level (a score of 0) have a 50 percent chance of answering these items correctly.
• Item D has the highest difficulty level: only those with high amounts of the trait typically answer it correctly.
• Item C, with its steeper curve, possesses better discrimination: it is better able to differentiate among individuals at this level of the trait (those with less of the trait are less likely to answer correctly; those with more are more likely). A sketch of such curves follows.
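
Curves of this kind are commonly modeled with a two-parameter logistic function; the sketch below is illustrative (the a and b values are assumptions, not taken from the figure):

```python
import math

def irf(theta, a, b):
    """Two-parameter logistic item response function.
    a: discrimination (slope); b: difficulty (trait level at P = .50)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# Items like B and C: equal difficulty (b = 0), the second more discriminating (larger a)
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(irf(theta, a=0.8, b=0.0), 2), round(irf(theta, a=2.0, b=0.0), 2))
```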
i. Item Response Theory

Item Information Acquired from IRT
A test item typically provides a different level of information at each level of the trait in question.

• Item A: useful only for testing individuals low on the relevant trait. At higher levels, everyone answers correctly, and no information is gained. It would be pointless to administer this item to individuals at the higher end of the trait spectrum because it is certain they will answer correctly.

• Item D: useful only for individuals with high trait levels. At lower trait levels, it is certain that everyone fails the item and, likewise, no information is gained.
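
For the two-parameter logistic model, the item information function has the closed form I(θ) = a² × P(θ) × (1 − P(θ)), which peaks at the item's difficulty b. A brief sketch with assumed parameter values:

```python
import math

def item_information(theta, a, b):
    """Information from a 2PL item at trait level theta: a^2 * P * (1 - P).
    Peaks where P(theta) = .50, i.e., at theta = b."""
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

# An easy item (b = -2) is informative only at low trait levels
for theta in (-3, -2, 0, 2):
    print(theta, round(item_information(theta, a=1.5, b=-2.0), 3))
```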
j. Special Circumstances in the Estimation of Reliability

Unstable Characteristics
For example, because the true amount of emotional reactivity changes so quickly, test and retest must be nearly instantaneous in order to provide an accurate index of reliability for unstable characteristics such as an electrodermal measure of emotional reactivity.

Restriction of Range
Test-retest reliability will be spuriously low if it is based on a sample of homogeneous subjects for whom there is a restriction of range on the characteristic being measured. For example, it would be inappropriate to estimate the reliability of an intelligence test by administering it twice to a sample of college students.

Speed and Power Tests
A speeded test has a restrictive time limit that guarantees few subjects complete the entire test, while a power test allows enough time for test takers to attempt all items but is constructed so that no test taker is able to obtain a perfect score. The reliability of a speed test should be based on the test-retest method or on split-half reliability from two separately timed half-tests, after which the Spearman-Brown correction is applied.

Reliability of Criterion-Referenced Tests
The structure of criterion-referenced tests is such that the variability of scores among examinees is typically quite minimal. The key issue here is not whether test and retest scores are close to one another, but whether the classification ("can do/can't do") is the same in both instances, that is, the percentage of persons for whom the same decision is reached on both occasions. The closer to 100 percent, the better.
k. Reliability and the Standard Error of Measurement

The standard error of measurement (SEM) estimates how repeated measures of a person on the same instrument tend to be distributed around his or her "true" score. The true score is always unknown because no measure can be constructed that provides a perfect reflection of the true score. The SEM is directly related to the reliability of a test: the larger the SEM, the lower the reliability of the test and the less precision there is in the measures taken and scores obtained. It is computed as SEM = SD√(1 − r), where SD is the standard deviation of test scores and r is the reliability coefficient.
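
Using the formula above, with hypothetical values for an IQ-like scale:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical scale: SD = 15, reliability = .95
print(round(sem(15, 0.95), 2))  # ~3.35 points
```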
k. Standard Error of the Difference

The standard error of the difference is a statistical measure that can help a test user determine whether a difference between scores is significant. The standard error of the difference between two scores can be computed from the SEMs of the individual tests by the following formula:

SEdiff = √(SEM1² + SEM2²)

In particular, suppose an examinee obtains a Verbal IQ of 112 and a Performance IQ of 105 on the WAIS-R. Is 7 IQ points a significant difference?

Differences that are approximately twice as large as SEdiff (i.e., 1.96 × 4.74 ≈ 9.3) can be considered significant in the sense that they will occur by chance only 5 percent of the time. We may conclude, then, that differences of about 9 points or more between Verbal and Performance IQ likely reflect real differences in scores rather than chance contributions from errors of measurement. Thus, more likely than not, a difference of merely 7 IQ points does not signify a bona fide, significant difference between verbal and performance intelligence.
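
The computation behind the example, with illustrative SEM values chosen so that SEdiff matches the 4.74 used in the text (the actual WAIS-R SEMs may differ):

```python
import math

def se_diff(sem1, sem2):
    """Standard error of the difference between two scores."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

# Illustrative: two subscales each with SEM ~3.35 (e.g., SD 15, reliability .95)
sediff = se_diff(3.35, 3.35)
print(round(sediff, 2))         # ~4.74
print(round(1.96 * sediff, 1))  # ~9.3 -> differences of ~9+ points are significant
```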
THANK YOU
