Test Theory Chapter 4 - Reliability


Four Important Distinctions
• Difference between reliability and validity: validity concerns whether a test measures what it is intended to measure; reliability concerns the consistency of the measurement.
• Difference between the everyday use of the word "reliability" and its technical meaning (important for us).
• Distinction between real change in a trait and temporary fluctuations, such as luck of the draw.
• Distinction between constant and unsystematic errors: a constant error leads a person to score systematically higher or lower. Reliability deals only with unsystematic errors.
Review of Statistics
• The relationship of two variables can be presented in a bivariate distribution, known as a scatterplot (examples p. 111). The correlation coefficient (r) provides a numerical summary of the relationship; its range lies between -1 and +1 (formula on p. 111).
• Regression line is the term for the best-fitting straight line through such a scatterplot.
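A minimal sketch of computing Pearson's r directly from its definition; the data values below are invented purely for illustration.

```python
# Pearson correlation from its definition:
# r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
import math

x = [2, 4, 5, 7, 9]   # made-up scores on variable X
y = [1, 3, 6, 8, 10]  # made-up scores on variable Y

mx = sum(x) / len(x)
my = sum(y) / len(y)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y))

r = cov / (sx * sy)
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```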
Factors Affecting Correlation Coefficients
• The Pearson correlation coefficient accounts only for linear relationships and will underestimate the strength of a nonlinear relationship. The assumption that the scatter of points around the regression line is equal across the whole range of scores is called homoscedasticity (equal scatter); a scatterplot can instead show heteroscedasticity (unequal scatter). Correlation is strictly a matter of relative position within each group. Variability is often called heterogeneity (difference) and its opposite homogeneity (sameness).
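A small illustration of the linearity point, using made-up data: here y is perfectly determined by x, but the relationship is curvilinear, so Pearson's r comes out as zero.

```python
# A perfect but curvilinear relationship yields r = 0, because
# Pearson's r only captures linear trend. Data are made up.
import math

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # perfectly determined by x, but U-shaped

mx = sum(x) / len(x)
my = sum(y) / len(y)
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
print(f"r = {cov / (sx * sy):.3f}")  # 0.000 for this symmetric case
```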
Major Sources of Unreliability
• Anything that results in unsystematic variation in test scores is a source of unreliability.
• Variation in test scoring is also a source of unreliability.
• Test content also has a major influence on reliability. Imagine some of us studied chapters 1-10 extensively for the exam while others studied only the last few chapters. If the exam contained many questions from the early chapters, the first group would score higher than the others.
• Test administration conditions also have a great influence on reliability. Time limits and physical arrangements for administration are important to keep an eye on.
• Obviously, personal conditions also have a great impact on reliability. Being in a bad mood or being sick will influence your score on a test.
Conceptual Framework: True Score Theory
• Three different theoretical contexts: classical test theory (CTT), item response theory (IRT) and generalizability theory (GT).
• Classical test theory distinguishes between the observed score (O), the true score (T) and the error score (E). The true score is the score a person would get if all sources of unreliability were removed; we never obtain it, only the observed score. The defining relationship is O = T + E, where the error E may be positive or negative.
• We never know a person's true score, but we wish to.
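A hedged simulation of the O = T + E decomposition (all numbers are simulated, not from the book): observed scores are generated as true scores plus random error, and the reliability of the simulated test equals the share of observed-score variance that is true-score variance.

```python
# Simulation sketch of classical test theory: O = T + E.
# In CTT, reliability equals var(T) / var(O); all data are simulated.
import random

random.seed(0)
n = 10_000
true_scores = [random.gauss(50, 10) for _ in range(n)]   # T
errors = [random.gauss(0, 5) for _ in range(n)]          # E, mean 0
observed = [t + e for t, e in zip(true_scores, errors)]  # O = T + E

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

reliability = variance(true_scores) / variance(observed)
print(f"simulated reliability = {reliability:.2f}")  # ~ 100 / (100 + 25) = 0.80
```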
Methods of Determining Reliability
• Test-retest reliability. As the name says, the test is administered twice and the two sets of scores are compared. The reliability coefficient is simply the correlation between scores on the first and second testing. This method has three drawbacks, however: 1. It does not take into account unsystematic error due to variations in test content. 2. It is a nuisance to obtain test-retest reliability for simple and short tests. 3. There is the obvious concern that the first testing will influence the second, because the content is the same and could easily be remembered. Moreover, timing is important: the interval between the tests should not be too short, but also not so long that the trait measured undergoes real change.
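A minimal sketch of computing a test-retest coefficient, assuming the same examinees' scores from two sessions are available (the scores are invented):

```python
# Test-retest reliability: correlate the same examinees' scores
# from two administrations. Scores are invented for illustration.
from statistics import correlation  # available in Python 3.10+

session_1 = [85, 72, 90, 64, 78, 81, 69]
session_2 = [83, 75, 88, 66, 80, 79, 72]

r_tt = correlation(session_1, session_2)
print(f"test-retest reliability = {r_tt:.2f}")
```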
Inter-Scorer Reliability
• It assesses unsystematic variation due to the people who score the test, not those who take it. Sometimes it is also called inter-observer or inter-rater reliability. To obtain such a coefficient, the Pearson correlation between the scores assigned by the first and second scorers is used. It is important that the scorers work independently, to avoid contamination.
Alternate Form Reliability
• Also known as parallel or equivalent form reliability. It requires two forms of the test, which should be very similar in terms of items, time limits and content specifications. Because it is difficult to create such forms, this method is not used very often. Three broad categories of reliability coefficients are recognized: 1. Coefficients derived from the administration of alternate forms in independent testing sessions (alternate-form coefficients). 2. Coefficients obtained by administering the same form on separate occasions (test-retest reliability). 3. Coefficients based on the relationship among scores derived from individual items or subsets of items (internal consistency coefficients).
Internal Consistency Reliability
• The most frequently used method. There are three variants: split-half, the Kuder-Richardson formulas and coefficient alpha.
Split-Half Reliability
• The test is split in half and the scores on the two halves are correlated. The most frequent way to split a test in half is by odd- and even-numbered questions; the result is then referred to as odd-even reliability. Because the correlation is based on half-length tests, a correction is needed to estimate the reliability of the full-length test: the Spearman-Brown correction.
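A small sketch of the Spearman-Brown step; the half-test correlation of .70 is a made-up value.

```python
# Spearman-Brown correction: estimates full-test reliability from the
# correlation between two half-tests. r_half is a made-up value.
def spearman_brown(r_half: float) -> float:
    return (2 * r_half) / (1 + r_half)

r_half = 0.70  # correlation between odd-item and even-item half scores
print(f"corrected split-half reliability = {spearman_brown(r_half):.2f}")  # 0.82
```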
Kuder-Richardson Formulas
• Kuder and Richardson established two formulas for tests with dichotomously scored (right/wrong) items, known as KR-20 (the most popular) and KR-21 (for more information see p. 131).
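A hedged sketch of the KR-20 computation; the 0/1 item matrix below is invented for illustration.

```python
# KR-20 for dichotomous (0/1) items:
# KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var(total scores))
items = [  # rows = examinees, columns = k items (1 = correct)
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]
n, k = len(items), len(items[0])

totals = [sum(row) for row in items]
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n  # population variance

p = [sum(row[j] for row in items) / n for j in range(k)]  # proportion correct
sum_pq = sum(pj * (1 - pj) for pj in p)

kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)
print(f"KR-20 = {kr20:.2f}")
```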
Coefficient Alpha
• No such restriction: items can have any type of continuous score. Also called Cronbach's alpha. Alpha indicates the extent to which the items in a test measure the same construct or trait; therefore it is also called a measure of item homogeneity.
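A hedged sketch of coefficient alpha; the item scores (e.g. 1-5 ratings) are invented.

```python
# Cronbach's alpha:
# alpha = (k / (k - 1)) * (1 - sum(item variances) / var(total scores))
items = [  # rows = examinees, columns = k items
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
]
n, k = len(items), len(items[0])

def pvar(xs):  # population variance
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

item_vars = [pvar([row[j] for row in items]) for j in range(k)]
total_var = pvar([sum(row) for row in items])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"coefficient alpha = {alpha:.2f}")
```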
Three Important Conclusions
• Length is important: the number of items appears in all the formulas we have seen so far, and the longer the test, the more reliable it is. Reliability is also maximized when the percentage of examinees responding correctly (in a cognitive test) or in a particular direction (in a noncognitive test) is near .50. Finally, correlation among items is important: to get good internal consistency, use items that measure a well-defined trait.
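A hedged illustration of the length effect, using the general form of the Spearman-Brown formula; the starting reliability of .60 is a made-up value.

```python
# General Spearman-Brown formula: predicted reliability when a test
# is lengthened by a factor k. The starting reliability is made up.
def lengthened_reliability(r: float, k: float) -> float:
    return (k * r) / (1 + (k - 1) * r)

r = 0.60
for k in (1, 2, 3):  # original length, doubled, tripled
    print(f"k = {k}: predicted reliability = {lengthened_reliability(r, k):.2f}")
# k = 1: 0.60, k = 2: 0.75, k = 3: 0.82
```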
The Standard Error of Measurement
• To interpret information about a test, not only a reliability coefficient is needed but also the standard error of measurement (SEM). The SEM is the standard deviation of a hypothetically infinite number of obtained scores distributed around the person's true score. It can be computed as SEM = SD × √(1 − r), where SD is the standard deviation of the test's scores and r its reliability.
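A small sketch of the SEM computation; both input values are made up.

```python
# SEM = SD * sqrt(1 - r); the SD and reliability values are made up.
import math

sd = 15.0           # standard deviation of the test's scores (assumed)
reliability = 0.91  # reliability coefficient (assumed)

sem = sd * math.sqrt(1 - reliability)
print(f"SEM = {sem:.1f}")  # 15 * sqrt(0.09) = 4.5
```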
Confidence Bands
• The SEM can be used to create confidence intervals around observed scores; these are also called confidence bands. A common choice is the 95% band, observed score ± 1.96 × SEM. Computer-generated score reports often use these confidence bands.
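Continuing the SEM sketch above (SEM = 4.5), a 95% confidence band can be formed as follows; the observed score is invented.

```python
# 95% confidence band around an observed score, using SEM = 4.5
# from the sketch above. The observed score is made up.
sem = 4.5
observed = 110
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"95% confidence band: {low:.0f} to {high:.0f}")  # roughly 101 to 119
```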
Appropriate Units for SEM
• The standard error of measurement should be expressed in the score units used for interpretation. If the interpretation employs normed scores, the raw-score SEM needs to be converted.
Standard Errors: Three Types
• The SEM is different from the standard error of the mean and the standard error of estimate. The standard error of measurement is the standard deviation of a hypothetical population of observed scores distributed around the true score of an individual. The standard error of the mean is the standard deviation of a hypothetical population of sample means around the population mean. The standard error of estimate is the standard deviation of actual Y scores around the predicted Y scores when Y is predicted from X.
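A sketch contrasting the three standard errors via their usual formulas; all input values are made up.

```python
# Three different standard errors; all inputs are invented.
import math

sd, reliability = 15.0, 0.91  # test SD and reliability coefficient
n = 100                       # sample size
sd_y, r_xy = 10.0, 0.60       # SD of Y and correlation of X with Y

se_measurement = sd * math.sqrt(1 - reliability)  # scores around true score
se_mean = sd / math.sqrt(n)                       # sample means around mu
se_estimate = sd_y * math.sqrt(1 - r_xy ** 2)     # Y around predicted Y

print(f"SE of measurement = {se_measurement:.2f}")  # 4.50
print(f"SE of the mean    = {se_mean:.2f}")         # 1.50
print(f"SE of estimate    = {se_estimate:.2f}")     # 8.00
```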
Some Special Issues in Reliability
Reliability in Interpretive Reports
• Narrative reports are not readily adapted to the tools of reliability analysis. This could give the impression that reliability is not an issue, but it always is. As the reader of a narrative report, you must make sure you are familiar with the reliability information available for the test. Finally, every narrative report should incorporate the concept of the SEM.
Reliability of Subscores and Individual Items
• Reliability information must be provided for the score that is actually being measured. One cannot assume, for example, that individual items have the same reliability as the total score of a test.
Reliability in Item Response Theory
• The standard error in IRT is often referred to as an index of the precision of measurement. Unlike the SEM in CTT, it is not dependent on the homogeneity or heterogeneity of the test items.
Generalizability Theory
• GT is an attempt to assess many sources of unreliability at the same time. In GT the true score is referred to as a universe score or domain score.
The person's universe score is the average score across all occasions, forms and scorers. Generalizability theory distinguishes G-studies and D-studies. The G-study analyses the components of variance, including interactions. The D-study uses the results of the G-study to decide how the measurement might be improved by changing one of the components. GT offers an exceptionally useful framework for thinking about the reliability of measures, but it is not widely used in practice because generalizability studies are complicated and time-consuming to conduct.
Factors Affecting Reliability Coefficients
• The fact that correlation is a matter of relative position rather than absolute scores is not a significant concern for reliability. Curvilinearity is also not an issue for reliability data. Heteroscedasticity, however, is very much a problem for the SEM. Group variability is also often a problem when interpreting reliability data: other things being equal, a more heterogeneous group yields a higher reliability coefficient.
How High Should Reliability Be?
• In summary, reliability is always important. However, validity is even more important than reliability: it is possible for a test to have high reliability and yet not be valid at all.
