&
RELIABILITY
Chrissa Toh Xin Ying
MPP201036
Concepts of Reliability
02
a. Concept of Reliability
b. Classical Test Theory
c. Sources of Measurement Error
d. Unsystematic Measurement Error
e. Measurement Error and Reliability
f. The Reliability Coefficient & the Correlation Coefficient
g. Reliability as Temporal Stability
h. Reliability as Internal Consistency
i. Item Response Theory
j. Special Circumstances in the Estimation of Reliability
k. Reliability and the Standard Error of Measurement and the Difference
Norms and Test
Standardization
1. Norms and Test Standardization reviews the
process of standardizing a test against an
appropriate norm group.
01
● Raw score: the most basic level of information provided by a psychological test, for example, the number of questions answered correctly.
➔ Summarize, condense, and organize a collection of quantitative data into
meaningful patterns.
➔ Approaches to organizing and summarizing quantitative data:
1. Frequency Distributions
2. Measures of Central Tendency
3. Measures of Variability
4. The Normal Distribution
5. Skewness
a. Frequency Distribution
A frequency distribution is prepared by specifying a small number of (usually equal-sized) class intervals and then tallying how many scores fall within each interval.
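The tallying procedure above can be sketched in a few lines. The scores and the interval width of 5 are made-up illustrative values, not data from the text.

```python
# Sketch of tallying raw scores into equal-sized class intervals.
from collections import Counter

scores = [62, 67, 71, 74, 74, 78, 81, 83, 85, 85, 88, 90, 94]
width = 5  # class-interval size (hypothetical choice)

# Map each score to the lower bound of its interval, then tally.
tally = Counter((s // width) * width for s in scores)

for lower in sorted(tally):
    print(f"{lower}-{lower + width - 1}: {tally[lower]}")
```

Each printed line is one class interval with its frequency; together they form the frequency distribution.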
➔ Transforming the raw scores into more interpretable and useful forms of information.
04
● Norm group: the sample of test-takers who are representative of the population for whom the test is intended.
● Simple random sampling: everyone has an equal chance of being chosen to take the test.
● Stratified random sampling: consists of stratifying, or classifying, the target population on important background variables (e.g., age, sex, race, social class, educational level) and then selecting an appropriate percentage of persons at random from each stratum.
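A minimal sketch of the stratified approach: draw the same percentage at random from each stratum. The two strata, their sizes, and the 10% sampling fraction are hypothetical.

```python
# Stratified random sampling sketch: sample an equal fraction
# from each stratum of a hypothetical population.
import random

population = {
    "urban": [f"urban_{i}" for i in range(80)],
    "rural": [f"rural_{i}" for i in range(20)],
}
fraction = 0.10  # take 10% of each stratum

random.seed(0)  # fixed seed so the sketch is repeatable
sample = []
for stratum, members in population.items():
    k = round(len(members) * fraction)
    sample.extend(random.sample(members, k))

print(len(sample))  # 8 urban + 2 rural = 10 examinees
```

Because each stratum contributes in proportion to its size, the sample preserves the population's 80/20 urban-rural split, which simple random sampling would only match on average.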
Age and Grade Norms
a. An age norm depicts the level of test performance for each separate age group in the normative sample. The purpose of age norms is to facilitate same-aged comparisons.
b. A grade norm depicts the level of test performance for each separate grade in the normative sample. These norms are especially useful in school settings when reporting the achievement levels of schoolchildren, e.g., academic achievement.

Local and Subgroup Norms
a. Local norms are derived from representative local examinees, as opposed to a national sample.
b. Subgroup norms consist of the scores obtained from an identified subgroup (e.g., African Americans, Hispanics, females), as opposed to a diversified national sample. Subgroups can be formed with respect to sex, ethnic background, geographical region, urban versus rural environment, or socioeconomic level.
Expectancy Tables
1. An expectancy table portrays the established relationship between test scores and expected
outcome on a relevant task.
2. Expectancy tables are always based on the previous predictor and criterion results for large
samples of examinees.
3. Based on 7,835 previous examinees who subsequently attended a major university, the expectancy
table in Table 3.6 provides the probability of achieving certain first-year college grades as a function
of score on the American College Testing (ACT) examination.
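The construction of such a table can be sketched as a conditional-probability tally. The score bands, outcomes, and tiny record set below are illustrative stand-ins, not the ACT data from Table 3.6.

```python
# Hedged sketch: building an expectancy table from hypothetical
# paired predictor (score band) and criterion (outcome) records.
from collections import Counter, defaultdict

# (score_band, outcome) pairs for previous examinees -- made-up data.
records = [
    ("high", "pass"), ("high", "pass"), ("high", "fail"),
    ("low", "pass"), ("low", "fail"), ("low", "fail"),
]

by_band = defaultdict(Counter)
for band, outcome in records:
    by_band[band][outcome] += 1

# Each cell of the expectancy table: P(outcome | score band).
for band in sorted(by_band):
    total = sum(by_band[band].values())
    for outcome, n in sorted(by_band[band].items()):
        print(f"P({outcome} | {band} score) = {n / total:.2f}")
```

With a large sample of previous examinees, these conditional probabilities become stable enough to serve as expectations for new test takers.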
05
● Criterion-referenced tests are used to compare examinees’ accomplishments to
a predefined performance standard.
● Criterion-referenced tests focus on what the test taker can do rather than on
comparisons to the performance levels of others.
● For example, criterion-referenced tests are used to evaluate how well students have mastered the academic skills expected at each grade level, to evaluate the curriculum, and to determine how well individual schools are teaching it. They are used for specific classroom objectives (e.g., meeting a minimal level of proficiency in spelling for sixth graders) or for more far-reaching, high-stakes purposes such as determining graduation from high school.
Distinctive Characteristics of Criterion-Referenced and Norm-Referenced Tests

| Dimension | Criterion-Referenced Tests | Norm-Referenced Tests |
| Purpose | Compare examinees' performance to a standard | Compare examinees' performance to one another |
| Item Content | Narrow domain of skills with real-world relevance | Broad domain of skills with indirect relevance |
| Item Selection | Most items of similar difficulty level | Items vary widely in difficulty level |
| Interpretation of Scores | Scores usually expressed as a percentage, with passing level predetermined | Scores usually expressed as a standard score, percentile, or grade equivalent |
| Examples | 1. Testing of basic academic skills (e.g., reading level, computation skill) 2. Determining graduation from high school 3. Specific classroom objectives | 1. Diagnosing students' strengths and weaknesses 2. The diagnosis of learning disabilities 3. Selecting students for specific programs (SAT (Scholastic Assessment Test) and ACT (American College Test)) |
02
RELIABILITY
the attribute of consistency in measurement

Reliability is a continuum ranging from minimal consistency of measurement (e.g., simple reaction time) to near-perfect repeatability of results (e.g., weight). Most psychological tests fall somewhere in between these two extremes. In the short run, measures of weight are highly consistent and simple reaction time is somewhat erratic; intellectual test scores fall between these extremes.
b. Classical Test Theory
the basis for test development throughout most of the
twentieth century
The basic starting point of the classical theory of measurement is the idea that test scores result from the influence of two factors:

01 Consistency
These consist entirely of the stable attributes of the individual, which the examiner is trying to measure. They represent the true amount of the attribute in question.

02 Inconsistency
Characteristics of the individual, test, or situation that have nothing to do with the attribute being measured, but that nonetheless affect test scores. They represent the unavoidable nuisance of error factors that contribute to inaccuracies of measurement.
b. Classical Test Theory
X = T + e
where X is the obtained score, T is the true score, and e represents errors of measurement.
If e is positive, the obtained score X will be higher than the true score
T. Conversely, if e is negative, the obtained score will be lower than
the true score.
Note:
The true score is never known. We can obtain a probability that the true
score resides within a certain interval and we can also derive a best
estimate of the true score. However, we can never know the value of a
true score with certainty.
c. Sources of Measurement Error
Item Selection
One source of measurement error is the instrument itself. A test developer must settle on a finite number of items from a potentially infinite pool of test questions. The psychometrician can minimize this unwanted nuisance by attending carefully to issues of test construction.

Test Administration
Factors that directly or indirectly influence the accuracy of measurement include:
- uncomfortable room temperature, dim lighting, and excessive noise
- momentary fluctuations in anxiety, motivation, attention, and fatigue level of the test taker
- the examiner, who may contribute to measurement error in the process of test administration

Test Scoring
Some degree of judgment is required to assign points to answers on essay questions or projective tests. Nunnally (1978) points out that the projective tester might undergo an evolutionary change in scoring criteria over time, coming to regard a particular type of response as more and more pathological with each encounter.
d. Unsystematic Measurement Error
X = T + e_s + e_u
where X is the obtained score, T is the true score, e_s is the systematic error due to the anxiety subcomponent, and e_u is the collective effect of the unsystematic measurement errors previously outlined.
Example:
A scale to measure social introversion also inadvertently taps anxiety in a
consistent fashion.
e. Measurement Error & Reliability
Coefficient alpha is an index of the internal consistency of the items, that is, their
tendency to correlate positively with one another.
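Coefficient alpha can be computed as k/(k-1) multiplied by (1 - sum of item variances / variance of total scores). The 4-item, 5-examinee data below are made up purely for illustration.

```python
# Sketch of coefficient (Cronbach's) alpha for a small made-up dataset.
import statistics

# Rows = examinees, columns = items (hypothetical ratings).
data = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
]
k = len(data[0])  # number of items

# Population variance of each item, and of the examinees' total scores.
item_vars = [statistics.pvariance(col) for col in zip(*data)]
total_var = statistics.pvariance([sum(row) for row in data])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"alpha = {alpha:.2f}")
```

A high alpha here reflects that the items tend to rise and fall together across examinees, i.e., they correlate positively with one another.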
h. Reliability as Internal Consistency
The Kuder-Richardson Estimate of Reliability
The KR-20 is used for items that vary in difficulty. For example, some items might be very easy, others more challenging. It should only be used if there is a single correct answer for each question; it shouldn't be used for questions where partial credit is possible or for scales like the Likert scale.
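For dichotomous (right/wrong) items, KR-20 is k/(k-1) multiplied by (1 - sum of p*q across items / variance of total scores), where p is each item's proportion correct and q = 1 - p. The 0/1 response matrix below is a made-up example.

```python
# Sketch of the Kuder-Richardson formula 20 (KR-20) for made-up
# dichotomous item responses: 1 = correct, 0 = wrong.
import statistics

# Rows = examinees, columns = items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
k = len(responses[0])  # number of items

p = [sum(col) / len(col) for col in zip(*responses)]  # proportion passing each item
pq_sum = sum(pi * (1 - pi) for pi in p)               # sum of item variances p*q
total_var = statistics.pvariance([sum(row) for row in responses])

kr20 = (k / (k - 1)) * (1 - pq_sum / total_var)
print(f"KR-20 = {kr20:.2f}")
```

Note that p*q is exactly the variance of a 0/1 item, so KR-20 is the special case of coefficient alpha for dichotomously scored tests.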
Note: Interscorer reliability supplements other reliability estimates but does not replace them.
h. Reliability as Internal Consistency
| Method | No. Forms | No. Sessions | Sources of Error Variance | Note |
| Test-retest | 1 | 2 | Changes over time | Appropriate for tests designed to be administered to individuals more than once, where the test should demonstrate reliability across time |
| Alternate Forms (immediate) | 2 | 1 | Item sampling | |
| Alternate Forms (delayed) | 2 | 2 | Item sampling; changes over time | |
| Split-Half | 1 | 1 | Item sampling; nature of split | Works well for instruments that have items carefully ordered according to difficulty level |
| Coefficient Alpha | 1 | 1 | Item sampling; test heterogeneity | Not suitable for all tests; applies only to measures designed to assess a single factor |
| Interscorer | 1 | 1 | Scorer differences | Appropriate for any test that involves subjectivity of scoring |
i. Item Response Theory
item response theory (IRT)
• A collection of mathematical models and statistical tools with widespread uses.
• The IRT model concludes that test scores are more reliable for individuals of average ability and increasingly less reliable for those with very high or very low ability.
• According to classical test theory, the difficulty level of an item is equivalent to the proportion of examinees in a standardization sample who pass the item. In contrast, according to IRT, difficulty is indexed by how much of the trait is needed to answer the item correctly.
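The IRT notion of difficulty can be sketched with a one-parameter logistic item response function, P(correct | theta) = 1 / (1 + exp(-(theta - b))), where theta is the examinee's trait level and b is the item's difficulty on the same scale. The difficulty values below are hypothetical.

```python
# Sketch of a one-parameter (Rasch-type) item response function.
import math

def irf(theta: float, b: float) -> float:
    """Probability of a correct response at trait level theta
    for an item of difficulty b (same latent scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

easy, hard = -2.0, 2.0  # hypothetical item difficulties
for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  easy item: {irf(theta, easy):.2f}  "
          f"hard item: {irf(theta, hard):.2f}")
```

Under this model, difficulty b is the trait level at which the probability of success is exactly .50, rather than the proportion of a standardization sample passing the item.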
i. Item Response Theory
Example of Item Response Functions
• Item A: The lowest difficulty; almost everyone, even examinees possessing only a small amount of the trait in question, can answer the item correctly.