
CHAPTER 5: Reliability

Reliability
- Synonym for dependability or consistency.
- In the psychometric sense, it refers only to something that is consistent—not necessarily consistently good or bad, but simply consistent.
- Not an all-or-none matter; a test may be reliable in one context and unreliable in another. There are different types and degrees of reliability.

Reliability Coefficient
- An index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance.

The Concept of Reliability
- A score on an ability test reflects not only the testtaker's true score on the ability being measured but also error.
- Error refers to the component of the observed test score that does not have to do with the testtaker's ability.
- X = T + E
  o X = observed score
  o T = true score
  o E = error
- A statistic useful in describing sources of test score variability is the variance (σ²).
- Variance from true differences is true variance, and variance from irrelevant, random sources is error variance.
- σ² = σ²tr + σ²e
  o In this equation, the total variance in an observed distribution of test scores (σ²) equals the sum of the true variance (σ²tr) plus the error variance (σ²e).
- The term reliability refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test. (See the sketch below.)
  o True differences are assumed to be stable—they will yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests.
  o Because error variance may increase or decrease a test score by varying amounts, consistency of the test score—and thus the reliability—can be affected.
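To make the decomposition concrete, here is a minimal Python sketch (invented numbers, not an example from the chapter) that simulates observed scores as X = T + E and recovers reliability as the proportion of total variance that is true variance:

    # Minimal sketch: simulate X = T + E, then compute reliability as
    # true variance divided by total (observed) variance.
    import numpy as np

    rng = np.random.default_rng(0)
    true_scores = rng.normal(loc=100, scale=15, size=10_000)  # T (true differences)
    error = rng.normal(loc=0, scale=5, size=10_000)           # E (random error)
    observed = true_scores + error                            # X = T + E

    reliability = true_scores.var() / observed.var()  # sigma^2_tr / sigma^2
    print(round(reliability, 3))  # close to 15**2 / (15**2 + 5**2) = 0.90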
- The term measurement error refers to all of the factors associated with the process of measuring some variable, other than the variable being measured. (Ex: administering an English-language test on the subject of 12th-grade algebra to students from China)
  o Can be categorized as being either random or systematic.
  o Random error – a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
    ▪ Sometimes referred to as "noise," this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores. (Ex: unanticipated events in the vicinity of the test environment, like a rally or a lightning strike; unanticipated physical events happening within the test taker, like a surge in blood pressure.)
  o Systematic error – a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
    ▪ Once a systematic error becomes known, it becomes predictable—as well as fixable. Note also that a systematic source of error does not affect score consistency. (Ex: a systematic error source would not change the variability of the distribution or affect the measured reliability of the instrument. In the end, the individual crowned "the biggest loser" would indeed be the contestant who lost the most weight—it's just that he or she would actually weigh 5 pounds more than the weight measured by the show's official scale.)
Sources of Error Variance
- Include test construction, administration, scoring, and/or interpretation.
- Test construction – One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests.
  o The extent to which a test taker's score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance.
  o From the perspective of a test creator, a challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.
- Test administration – Sources of error variance that occur during test administration may influence the test taker's attention or motivation. The test taker's reactions to those influences are the source of one kind of error variance.
  o Examples of untoward influences during administration of a test include factors related to the test environment: room temperature, level of lighting, and amount of ventilation and noise, for instance. A relentless fly may develop a tenacious attraction to an examinee's face.
  o External to the test environment in a global sense, the events of the day may also serve as a source of error. So, for example, test results may vary depending upon whether the test taker's country is at war or at peace (Gil et al., 2016). A variable of interest when evaluating a patient's general level of suspiciousness or fear is the patient's home neighborhood and lifestyle.
  o Other potential sources of error variance during test administration are test taker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of test taker-related error variance.
  o Examiner-related variables are potential sources of error variance. The examiner's physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions. Clearly, the level of professionalism exhibited by examiners is a source of error variance.
- Test scoring and interpretation – In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences. However, not all tests can be scored from grids blackened by no. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.
  o Scorers and scoring systems are potential sources of error variance. A test may employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch might contaminate the data.
  o If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance. Indeed, despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiner/scorers occasionally still are confronted by situations where an examinee's response lies in a gray area.
- Other sources of error – Surveys and polls are two tools of assessment commonly used by researchers who study public opinion. In the political arena, for example, researchers trying to predict who will win an election may sample opinions from representative voters and then draw conclusions based on their data.
  o The error in such research may be a result of sampling error—the extent to which the population of voters in the study actually was representative of voters in the election.
  o Alternatively, the researchers may have gotten such factors right but simply did not include enough people in their sample to draw the conclusions that they did. This brings us to another type of error, called methodological error.
  o Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error. For example, consider assessing the extent of agreement between partners regarding the quality and quantity of physical and psychological abuse in their relationship.
    ▪ A number of studies (O'Leary & Arias, 1988; Riggs et al., 1989; Straus, 1979) have suggested that underreporting or over-reporting of perpetration of abuse also may contribute to systematic error. Females, for example, may underreport abuse because of fear, shame, or social desirability factors and over-report abuse if they are seeking help.

Reliability Estimates

Test-Retest Reliability Estimates
- One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time. In psychometric parlance, this approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.
- Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, then there would be little sense in assessing the reliability of the test using the test-retest method.
- It is generally the case (although there are exceptions) that, as the time interval between administrations of the same test increases, the correlation between the scores obtained on each testing decreases. The passage of time can be a source of error variance. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
- When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
  o An estimate of test-retest reliability from a math test might be low if the test takers took a math tutorial before the second test was administered. An estimate of test-retest reliability from a personality profile might be low if the test taker suffered some emotional trauma or received counseling during the intervening period.
- A low estimate of test-retest reliability might be found even when the interval between testings is relatively brief. This may well be the case when the testings occur during a time of great developmental change with respect to the variables they are designed to assess.
- An estimate of test-retest reliability may be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (including discriminations of brightness, loudness, or taste).
  o However, even in measuring variables such as these, and even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability.
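As a quick hypothetical sketch of the computation (the scores below are invented), the test-retest estimate is simply the Pearson r between the two administrations:

    # Sketch: test-retest reliability = Pearson r between two administrations
    # of the same test to the same people.
    import numpy as np

    time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])   # first administration
    time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])  # same people, retest

    r_test_retest = np.corrcoef(time1, time2)[0, 1]     # Pearson r
    print(round(r_test_retest, 2))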
*Psychology's Replicability Crisis
- A low replication rate helped confirm that science indeed had a problem with replicability, the seriousness of which is reflected in the term replicability crisis.
- Contributing factors: (1) a general lack of published replication attempts in the professional literature, (2) editorial preferences for positive over negative findings, and (3) questionable research practices on the part of authors of published studies.
  o Replication by independent parties provides for confidence in a finding, reducing the likelihood of experimenter bias and statistical anomaly. Indeed, had scientists been as focused on replication as they were on hunting down novel results, the field would likely not be in crisis now.
  o Positive findings typically entail a rejection of the null hypothesis. In essence, from the perspective of most journals, rejecting the null hypothesis as a result of a research study is a newsworthy event. By contrast, accepting the null hypothesis might just amount to "old news."
- Moreover, replication efforts—beyond even that of the Open Science Collaboration—are becoming more common (Klein et al., 2013). Overall, it appears that most scientists now recognize replicability as a concern that needs to be addressed with meaningful changes to what has constituted "business-as-usual" for so many years.
Parallel-Forms and Alternate-Forms Reliability Estimates
- The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
- Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. In theory, the means of scores obtained on parallel forms correlate equally with the true score. More practically, scores obtained on parallel tests correlate equally with other measures.
  o The term parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
- Alternate forms are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation "parallel," alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty.
  o The term alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
- Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability: (1) two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice).
- An additional source of error variance, item sampling, is inherent in the computation of an alternate- or parallel-forms reliability coefficient.
- Developing alternate forms of tests can be time-consuming and expensive. Imagine what might be involved in trying to create sets of equivalent items and then getting the same people to sit for repeated administrations of an experimental test! On the other hand, once an alternate or parallel form of a test has been developed, it is advantageous to the test user in several ways.
- An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an internal consistency estimate of reliability or as an estimate of inter-item consistency.
Split-Half Reliability Estimates
- An estimate of split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).
  o Step 1. Divide the test into equivalent halves.
  o Step 2. Calculate a Pearson r between scores on the two halves of the test.
  o Step 3. Adjust the half-test reliability using the Spearman–Brown formula (discussed shortly).
- Simply dividing the test in the middle is not recommended because it is likely that this procedure would spuriously raise or lower the reliability coefficient.
- One acceptable way to split a test is to randomly assign items to one or the other half of the test.
- Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as odd-even reliability.
- Yet another way to split a test is to divide the test by content so that each half contains items equivalent with respect to content and difficulty.
- In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called "mini-parallel-forms," with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.
- The third step requires the computation of the Spearman–Brown formula, which allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items.
- A formula is necessary for estimating the reliability of a test that has been shortened or lengthened. The general Spearman–Brown (rSB) formula is:

    rSB = n(rxy) / [1 + (n − 1)rxy]

  o where rSB is equal to the reliability adjusted by the Spearman–Brown formula, rxy is equal to the Pearson r in the original-length test, and n is equal to the number of items in the revised version divided by the number of items in the original version.
- By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of a whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability. The symbol rhh stands for the Pearson r of scores in the two half tests:

    rSB = 2(rhh) / (1 + rhh)

- Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items. Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test.
- If test developers or users wish to shorten a test, the Spearman–Brown formula may be used to estimate the effect of the shortening on the test's reliability. Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations.
- A Spearman–Brown formula could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.
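The three steps, plus the general formula, can be sketched in a few lines of Python (the toy data and the odd-even split are illustrative assumptions, not chapter examples):

    # Sketch of the split-half procedure with a Spearman-Brown adjustment.
    import numpy as np

    # rows = test takers, columns = items (toy 0/1 data)
    scores = np.array([
        [1, 1, 0, 1, 1, 0],
        [1, 0, 1, 1, 0, 1],
        [0, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1, 0],
    ])

    # Step 1: split into equivalent halves (here, odd-even split)
    odd_half = scores[:, 0::2].sum(axis=1)
    even_half = scores[:, 1::2].sum(axis=1)

    # Step 2: Pearson r between scores on the two halves (r_hh)
    r_hh = np.corrcoef(odd_half, even_half)[0, 1]

    # Step 3: Spearman-Brown adjustment; the whole test is twice as long, so n = 2
    def spearman_brown(r_xy, n):
        # general formula: r_SB = n * r_xy / (1 + (n - 1) * r_xy)
        return (n * r_xy) / (1 + (n - 1) * r_xy)

    r_sb = spearman_brown(r_hh, n=2)

    # The general formula, solved for n, estimates how much longer an
    # equivalent-item test would need to be to reach a desired reliability:
    r_desired = 0.90
    n_needed = (r_desired * (1 - r_sb)) / (r_sb * (1 - r_desired))

    print(round(r_hh, 2), round(r_sb, 2), round(n_needed, 2))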
Other Methods of Estimating Internal Consistency
- Other formulas for estimating internal consistency were developed by Kuder and Richardson (1937) and Cronbach (1951).

Inter-item consistency
- Refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test.
  o Tests are said to be homogeneous if they contain items that measure a single trait. As an adjective used to describe test items, homogeneity (derived from the Greek words homos, meaning "same," and genos, meaning "kind") is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.
  o In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait.
- The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a homogeneous test samples a relatively narrow content area, it can be expected to contain more inter-item consistency than a heterogeneous test. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.
  o Homogeneity is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality. One way to circumvent this potential source of difficulty has been to administer a series of homogeneous tests, each designed to measure some component of a heterogeneous variable.

The Kuder-Richardson Formulas
- Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson to develop their own measures for estimating reliability.
- The most widely known of the many formulas they collaborated on is their Kuder–Richardson formula 20, or KR-20, so named because it was the 20th formula developed in a series.
  o KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items).
  o KR-20 will yield a lower reliability estimate than the split-half method if the test items are more heterogeneous.
  o KR-20 is computed as follows (a worked sketch appears below):

    rKR20 = [k / (k − 1)] × [1 − (Σpq / σ²)]

    rKR20 = Kuder–Richardson formula 20 reliability coefficient
    k = number of test items
    σ² = variance of total test scores
    p = proportion of test takers who pass the item
    q = proportion of people who fail the item
    Σpq = sum of the pq products over all items
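A short sketch of the computation on invented dichotomous data (rows are test takers, columns are items scored 1 = pass, 0 = fail); population variance (ddof = 0) is assumed here:

    # Sketch: KR-20 from a person-by-item matrix of right/wrong scores.
    import numpy as np

    x = np.array([          # 1 = pass, 0 = fail
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
    ])

    k = x.shape[1]                   # number of test items
    p = x.mean(axis=0)               # proportion passing each item
    q = 1 - p                        # proportion failing each item
    sigma2 = x.sum(axis=1).var()     # variance of total test scores (ddof = 0)

    r_kr20 = (k / (k - 1)) * (1 - (p * q).sum() / sigma2)
    print(round(r_kr20, 2))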
- The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same degree of difficulty.
  o Formula KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations.
- The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha. You may even hear it referred to as coefficient α-20.

Coefficient Alpha
- Developed by Cronbach (1951) and subsequently elaborated on by others, it may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula.
- Appropriate for use on tests containing nondichotomous items. It is computed as follows (a worked sketch appears below):

    ra = [k / (k − 1)] × [1 − (Σσi² / σ²)]

    ra = coefficient alpha
    k = number of items
    σi² = variance of one item
    Σσi² = sum of the variances of each item
    σ² = variance of the total test scores
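The same person-by-item layout works for a sketch of coefficient alpha on invented nondichotomous (rated) items:

    # Sketch: coefficient alpha from a person-by-item matrix of ratings.
    import numpy as np

    x = np.array([          # e.g., 1-to-5 ratings; rows = people, columns = items
        [3, 4, 3, 2],
        [2, 2, 3, 1],
        [4, 4, 4, 3],
        [1, 2, 1, 1],
        [3, 3, 2, 2],
    ])

    k = x.shape[1]                       # number of items
    sum_item_vars = x.var(axis=0).sum()  # sum of the variances of each item
    sigma2 = x.sum(axis=1).var()         # variance of the total test scores

    r_a = (k / (k - 1)) * (1 - sum_item_vars / sigma2)
    print(round(r_a, 2))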
- Preferred statistic for obtaining an estimate of internal consistency reliability. A variation of the formula has been developed for use in obtaining an estimate of test-retest reliability (Green, 2003).
- It is widely used as a measure of reliability, in part because it requires only one administration of the test.
- Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are.
- Still, because negative values of alpha are theoretically impossible, it is recommended under such rare circumstances that the alpha coefficient be reported as zero (Henson, 2001).
- Also, a myth about alpha is that "bigger is always better." As Streiner (2003b) pointed out, a value of alpha above .90 may be "too high" and indicate redundancy in the items.

Average Proportional Distance (APD)
- Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach's alpha), the APD is a measure that focuses on the degree of difference that exists between item scores.
- Accordingly, we define the average proportional distance method as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
- The general "rule of thumb" for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency, and that a value between .2 and .25 is in the acceptable range. A calculated APD above .25 is suggestive of problems with the internal consistency of the test.
- One potential advantage of the APD method over using Cronbach's alpha is that the APD index is not connected to the number of items on a measure. Cronbach's alpha will be higher when a measure has more than 25 items (Cortina, 1993).
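These notes describe the APD only conceptually, so the sketch below is one plausible reading of it, not the textbook's worked procedure: average the absolute differences between scores on every pair of items, then express that average as a proportion of the response scale. The 7-point scale and the data are assumptions.

    # Hedged sketch of the APD idea: average absolute difference between
    # item scores, scaled to the response range (scaling step is an assumption).
    import numpy as np
    from itertools import combinations

    x = np.array([          # rows = people, columns = items, 1-to-7 ratings
        [5, 6, 5, 6],
        [4, 4, 5, 4],
        [6, 7, 6, 6],
        [2, 3, 2, 3],
    ])
    scale_points = 7        # assumed number of response options

    # mean absolute difference between item scores, across all item pairs
    pairs = combinations(range(x.shape[1]), 2)
    avg_distance = np.mean([np.abs(x[:, i] - x[:, j]).mean() for i, j in pairs])

    apd = avg_distance / (scale_points - 1)  # expressed as a proportion
    print(round(apd, 2))  # per the rule of thumb above, <= .2 would be excellent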
Measures of Inter-Scorer Reliability
- Unfortunately, in some types of tests under some conditions, the score may be more a function of the scorer than of anything else.
- Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
- Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy (Smith, 1986).
- Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.

The Importance of the Method Used for Estimating Reliability
- A published study by Chmielewski et al. (2015) highlighted the substantial influence that differences in method can have on estimates of inter-rater reliability.
- Diagnostic reliability must be acceptably high in order to accurately identify risk factors for a disorder that are common to subjects in a research study.
- The utility and validity of a particular diagnosis itself can be called into question if expert diagnosticians cannot, for whatever reason, consistently agree on who should and should not be so diagnosed.
- Results suggest that the reliability of diagnoses is far lower than commonly believed. Moreover, the results demonstrate the substantial influence that method has on estimates of diagnostic reliability even when other factors are held constant.

Using and Interpreting a Coefficient of Reliability
- Reliability is a mandatory attribute in all tests we use.

The Purpose of the Reliability Coefficient
- Transient error – a source of error attributable to variations in the test taker's feelings, moods, or mental state over time. Then again, this 5% of the error may be due to other factors that are yet to be identified.
The Nature of the Test
- Closely related to considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself. Included here are considerations such as whether (1) the test items are homogeneous or heterogeneous in nature; (2) the characteristic, ability, or trait being measured is presumed to be dynamic or static; (3) the range of test scores is or is not restricted; (4) the test is a speed or a power test; and (5) the test is or is not criterion-referenced.
- The child tested just before and again just after a developmental advance may perform very differently on the two testings. In such cases, a marked change in test score might be attributed to error when in reality it reflects a genuine change in the test taker's skills.
- Homogeneity versus heterogeneity of the test items. Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency.
- Dynamic versus static characteristics. A static characteristic is relatively unchanging, such as intelligence, whereas a dynamic characteristic fluctuates over time. Unlike in measuring a static characteristic, a test-retest measure would be of little help in gauging the reliability of an instrument that measures a dynamic characteristic.
- Restriction or inflation of range. Referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance); see the sketch below.
  o If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
  o If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
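An illustrative sketch (not from the chapter) of the restriction-of-range effect: sampling only the high scorers on one variable visibly shrinks the obtained correlation.

    # Sketch: restricting the range of one variable lowers the obtained r.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=50_000)
    y = 0.7 * x + rng.normal(scale=0.5, size=50_000)   # built-in correlation

    r_full = np.corrcoef(x, y)[0, 1]                   # full-range correlation

    keep = x > 1.0                                     # sample only high scorers
    r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

    print(round(r_full, 2), round(r_restricted, 2))    # restricted r is lower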
- Speed tests versus power tests. A reliability estimate of a speed test should be based on performance from two independent testing periods, using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests. If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman–Brown formula.
- Criterion-referenced tests. Designed to provide an indication of where a test taker stands with respect to some variable or criterion, such as an educational or a vocational objective.
  o Some traditional procedures of estimating reliability are usually not appropriate for use with criterion-referenced tests.
- In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. In fact, individual differences between examinees on total test scores may be minimal. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.

The True Score Model of Measurement and Alternatives to It
- Classical Test Theory (CTT) is also known as the true score (or classical) model of measurement.
  o One of the reasons it has remained the most widely used model has to do with its simplicity, especially when one considers the complexity of other proposed models of measurement.
- A true score is a value that, according to classical test theory, genuinely reflects an individual's ability (or trait) level as measured by a particular test.
- One's true score on one test of extraversion, for example, may not bear much resemblance to one's true score on another test of extraversion.
  o Comparing a test taker's scores on two different tests purporting to measure the same thing requires a sophisticated knowledge of the properties of each of the two tests, as well as some rather complicated statistical procedures designed to equate the scores.
- For starters, one problem with CTT has to do with its assumption concerning the equivalence of all items on a test:
  o All items are presumed to contribute equally to the score total.
- Another problem has to do with the length of tests that are developed using a CTT model. Whereas test developers favor shorter rather than longer tests (as do most test takers), the assumptions inherent in CTT favor the development of longer rather than shorter tests.
- For these reasons, as well as others, alternative measurement models have been developed.
Domain sampling theory
- The 1950s saw the development of a viable alternative to CTT. It was originally referred to as domain sampling theory and is better known today in one of its many modified forms as generalizability theory.
- Rebels against the concept of a true score existing with respect to the measurement of psychological constructs.
- Proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score (the portion of a test score that CTT would simply attribute to error).
- Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.

Generalizability theory
- In one modification of domain sampling theory called generalizability theory, a "universe score" replaces that of a "true score" (Shavelson et al., 1989).
- Based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score.
  o This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
- According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.
- Cronbach (1970): "The person will ordinarily have a different universe score for each universe. Mary's universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May. . . . Some testers call the average over a large number of comparable observations a 'true score'; e.g., 'Mary's true typing rate on 3-minute tests.' Instead, we speak of a 'universe score' to emphasize that what score is desired depends on the universe being considered. For any measure there are many 'true scores,' each corresponding to a different universe."
- When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is "accurate," or "reliable," or "generalizable."
- There is a different degree of generalizability for each universe.
- A generalizability study examines how generalizable scores from a particular test are if the test is administered in different situations.
  o Examines how much of an impact different facets of the universe have on the test score.
  o Is the test score affected by the time of day in which the test is administered? The influence of particular facets on the test score is represented by coefficients of generalizability.
- After this, it is recommended that the test developers do a decision study.
  o In the decision study, developers examine the usefulness of test scores in helping the test user make decisions.
  o Designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use.
  o Taking a better measure improves the sensitivity of an experiment in the same way that increasing the number of subjects does.
- From the perspective of generalizability theory, a test's reliability is very much a function of the circumstances under which the test is developed, administered, and interpreted.

Item Response Theory
- Provides a way to model the probability that a person with X ability will be able to perform at a level of Y.
- Also referred to as latent-trait theory, because the ability or trait being modeled is latent, that is, physically unobservable.
- Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item's level of discrimination.
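These notes give no formula, so the sketch below uses one standard IRT formulation, the two-parameter logistic (2PL) model, in which an item's difficulty (b) and discrimination (a) determine the probability of a correct response for a person of ability theta. The parameter values are illustrative assumptions.

    # Sketch of the two-parameter logistic (2PL) IRT model:
    # P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))
    import math

    def p_correct(theta, a, b):
        # a = item discrimination (how sharply P changes near b)
        # b = item difficulty (the ability level where P = .50)
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # an easy item (b = -1) vs. a hard item (b = +2) for an average person
    print(round(p_correct(0.0, a=1.5, b=-1.0), 2))   # high probability
    print(round(p_correct(0.0, a=1.5, b=2.0), 2))    # low probability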