SCHOOL OF PSYCHOLOGY
Measurement component
By
Dame Abera
January 2014
1. Measurement
1.1. Definition of basic terms:
Measurement:
It uses an instrument (such as a ruler, thermometer, scale, or test) to determine the
quantity of something
It mostly describes a given behavior in terms of test scores, amount, size, or quantity
Evaluation
It refers to the process of arriving at judgments about abstract entities such as programs,
curricula, organizations, and institutions
o For example, systemic evaluations (e.g., national assessments) are conducted to
ascertain how well an education system is functioning
In most education contexts, assessments are a vital component of any evaluation
It involves making a value judgment; the highest level of intellectual skill according to
Bloom; the ability to render judgments about the value of methods or materials for
specific purposes, making use of external or internal criteria
Psychometrics
Theories of psychometrics
Classical Test Theory (CTT) aims at studying the reliability of a (real-valued) test score
variable (measurement, test) that maps a crucial aspect of qualitative or quantitative
observations into the set of real numbers
Aside from determining the reliability of a test score variable itself, CTT allows
answering questions such as:
o How do two random variables correlate once the measurement error is filtered
out (correction for attenuation)?
o How dependable is a measurement in characterizing an attribute of an individual
unit, i.e., which is the confidence interval for the true score of that individual
with respect to the measurement considered?
o How reliable is an aggregated measurement consisting of the average (or sum)
of several measurements of the same unit or object (Spearman-Brown formula
for test length)?
o How reliable is a difference, e.g., between a pretest and posttest?
Basic Concepts of Classical Test Theory
Primitives
In the framework of CTT:
o Each measurement (test score) is considered to be a value of a random variable Y
consisting of two components:
A true score and
an error score
o Two levels, or more precisely, two random experiments may be distinguished:
Sampling an observational unit (e.g., a person) and
Sampling a score within a given unit
Within a given unit:
o The true score is a parameter, i.e., a given but unknown number characterizing
the attribute of the unit
o Whereas the error is a random variable with an unknown distribution
Assumptions of CTT
Theory of true scores
Most classical approaches assume that the raw score (X) obtained by any one individual
is made up of a true component (T) and a random error (E) component: X = T + E.
o The true score of a person can be found by taking the mean score that the person
would get on the same test if they had an infinite number of testing sessions
Because it is not possible to obtain an infinite number of test scores, T is a
hypothetical, yet central, aspect of CTT (illustrated in the sketch below)
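To make the X = T + E decomposition concrete, here is a minimal simulation sketch; the variance values, sample sizes, and the numpy-based setup are illustrative assumptions rather than anything prescribed by CTT itself:

```python
# A minimal sketch of the CTT decomposition X = T + E (all numbers are
# arbitrary, illustrative assumptions)
import numpy as np

rng = np.random.default_rng(0)

n_persons = 1000
n_replications = 50  # stand-in for the (hypothetically infinite) testing sessions

true_scores = rng.normal(loc=50, scale=10, size=n_persons)             # T
errors = rng.normal(loc=0, scale=5, size=(n_persons, n_replications))  # E
observed = true_scores[:, None] + errors                               # X = T + E

# Averaging a person's observed scores over many replications approaches
# that person's true score, which is why T is defined as such a mean
print(np.abs(observed.mean(axis=1) - true_scores).mean())  # small average gap
```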
Assumptions about the true scores
Domain sampling theory assumes that the items that have been selected for any one test
are just a sample of items from an infinite domain of potential items
The parallel test theory assumes that two or more tests made up of different but parallel
items sampled from the same domain will give similar true scores but have different
error scores
o Two tests Yi and Yj are defined to be parallel:
if they are τ-equivalent (Ti = Tj),
if their error variables are uncorrelated (Cov(Ei, Ej) = 0, i ≠ j), and
if they have identical error variances (Var(Ei) = Var(Ej))
o Multiple forms of a test (e.g., Form A and Form B) are considered to be parallel if
their means, variances and reliabilities are equal
True scores in the population are assumed to be:
o Measured at the interval level and
o normally distributed
When these assumptions are not met, test developers convert scores,
combine scales, and do a variety of other things to the data to ensure that
this assumption is met
o The standard deviation of the distribution of random errors around the true score
is called the standard error of measurement
The lower it is, the more tightly packed around the true score the random
errors will be
Rules of Classical Test Theory
The first is that the standard error of measurement of a test is consistent across an entire
population
o That is, the standard error does not differ from person to person but is instead
generated by large numbers of individuals taking the test, and it is subsequently
generalized to the population of potential test takers
o In addition, regardless of the raw test score (high, medium, or low), the standard
error for each score is the same
The second is that as tests become longer, they become increasingly reliable (an effect
formalized by the Spearman-Brown formula; see the sketch after this list)
o Recall that in domain sampling, the sample of test items that makes up a single
test comes from an infinite population of items
o Also recall that larger numbers of items better sample the universe of items and
statistics generated by them (such as mean test scores) are more stable if they are
based on more items
The third is that the important statistics about test items (e.g., their difficulty) depend on
the sample of respondents being representative of the population
o The interpretation of a test score is meaningless without the context of normative
information
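The second rule is usually made concrete with the Spearman-Brown prophecy formula mentioned earlier. The sketch below is a minimal illustration; the function name and the example numbers are assumptions chosen for demonstration:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`
    (e.g., 2.0 doubles the number of parallel items)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling a test whose reliability is .70 predicts a reliability of about .82
print(round(spearman_brown(0.70, 2.0), 2))  # 0.82
```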
The major advantage of CTT
The major advantage of CTT is its relatively weak set of theoretical assumptions, which
makes CTT easy to apply in many testing situations
o Relatively weak theoretical assumptions not only characterize CTT but also its
extensions (e.g., generalizability theory)
Although CTT’s major focus is on test-level information, item statistics (i.e., item
difficulty and item discrimination) are also an important part of the CTT model
o CTT does not invoke a complex theoretical model to relate an examinee’s ability
to success on a particular item
o Instead, CTT collectively considers a pool of examinees and empirically
examines their success rate on an item (assuming it is dichotomously scored)
This success rate of a particular pool of examinees on an item, well known
as the p value of the item, is used as the index for the item difficulty
o The ability of an item to discriminate between higher ability examinees and
lower ability examinees is known as item discrimination, which is often
expressed statistically as the Pearson product-moment correlation coefficient
between the scores on the item (e.g., 0 and 1 on an item scored right-wrong) and
the scores on the total test
When an item is dichotomously scored, this estimate is often computed as
a point-biserial correlation coefficient
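As a concrete sketch of these two item statistics, the snippet below computes p values and item-total correlations for a small hypothetical response matrix; note that, in practice, the item is often removed from the total before correlating (the corrected item-total correlation), which this sketch omits:

```python
import numpy as np

# Hypothetical responses: rows = examinees, columns = dichotomous items (1 = correct)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])

p_values = responses.mean(axis=0)  # item difficulty: proportion answering correctly
totals = responses.sum(axis=1)     # total test score for each examinee

# Item discrimination: point-biserial correlation between item (0/1) and total score
discrimination = [np.corrcoef(responses[:, i], totals)[0, 1]
                  for i in range(responses.shape[1])]
print(p_values, np.round(discrimination, 2))
```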
Limitations of CTT
If item responses are dichotomous, CTT suggests that they should not be subjected to
factor analysis
o This poses problems in establishing the validity for many tests of cognitive
ability, where answers are coded as correct or incorrect
Once the item stems are created and subjected to content analysis by the experts, they
often disappear from the analytical process
o Individuals may claim that a particular item stem is biased or unclear, but no
statistical procedures allow for comparisons of the item content, or stimulus, in
CTT
CTT and its models are not really adequate for modeling answers to individual items in a
questionnaire
CTT focuses exclusively on random measurement error
The major limitation of CTT can be summarized as circular dependency
o The person statistic (i.e., observed score) is (item) sample dependent, and
o the item statistics (i.e., item difficulty and item discrimination) are (examinee)
sample dependent
This circular dependency poses some theoretical difficulties in CTT’s
application in some measurement situations (e.g., test equating,
computerized adaptive testing)
Item Response Theory
Models of item response theory (IRT) specify how the probability of answering in a
specific category of an item depends on the attribute to be measured, i.e., on the value of
a latent variable
IRT is more theory grounded and models the probabilistic distribution of examinees’
success at the item level
As its name indicates, IRT primarily focuses on the item-level information in contrast to
the CTT’s primary focus on test-level information
The IRT framework encompasses a group of models, and the applicability of each model
in a particular situation depends on:
o The nature of the test items and
o The viability of different theoretical assumptions about the test items
For test items that are dichotomously scored, there are three IRT models, known as
three-, two-, and one-parameter IRT models
The IRT three-parameter model:
In the IRT three-parameter model, the probability that an examinee of ability θ answers
item i correctly is given by:
Pi(θ) = ci + (1 − ci)/(1 + exp[−D ai (θ − bi)])
Here ci represents the guessing factor, ai represents the item discrimination parameter
commonly known as item slope, bi represents the item difficulty parameter commonly
known as the item location parameter, D represents an arbitrary constant (normally,
D = 1.7), and θ represents the ability level of a particular examinee
The item location parameter is on the same scale as ability θ, and takes the value of θ at
the point at which an examinee has a 50/50 probability of answering the item correctly
(assuming no guessing)
The item discrimination parameter is the slope of the tangent line of the item
characteristic curve at the point of the location parameter
The two-parameter model:
When the guessing factor is assumed or constrained to be zero (ci = 0), the three-
parameter model reduces to the two-parameter model, for which only:
o item location and
o item slope parameters need to be estimated (see the sketch below)
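A minimal sketch of this item response function follows; the parameter values in the example are arbitrary assumptions, and setting c = 0 shows the reduction to the two-parameter case:

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float, D: float = 1.7) -> float:
    """Three-parameter IRT model:
    P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# With c = 0 (the two-parameter case), an examinee at theta == b has exactly
# a .50 probability of success; a nonzero guessing factor raises that floor
print(round(p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.0), 2))  # 0.5
print(round(p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2), 2))  # 0.6
```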
1.2. Scales of measurement
Normally, when one hears the term measurement, they may think in terms of measuring
the length of something (the length of a piece of wood) or measuring a quantity of
something (a cup of flour)
This represents a limited use of the term measurement
In statistics, the term measurement is used more broadly and is more appropriately
termed scales of measurement
Scales of measurement refer to the ways in which variables/numbers are defined and
categorized
Each scale of measurement has certain properties, which in turn determine the
appropriateness of certain statistical analyses
Nominal
Categorical data and numbers that are simply used as identifiers or names represent a
nominal scale of measurement
Numbers on the back of a football jersey/sport shirt and your social security number are
examples of nominal data
If a researcher conducts a study that includes gender as a variable and codes Female as 1
and Male as 2 (or vice versa) when entering data into the computer, he/she is
using the numbers 1 and 2 to represent categories of data
Ordinal
An ordinal scale represents rank order (1st, 2nd, 3rd, and so on) without assuming equal
distances between adjacent ranks
o The order of finishers in a race is ordinal data: it shows who was faster, but not
by how much
Interval
A scale which represents quantity and has equal units, but for which zero represents
simply an additional point of measurement, is an interval scale
The Fahrenheit scale is a clear example of the interval scale of measurement
Thus, 60 degrees Fahrenheit or -10 degrees Fahrenheit are interval data
Measurement of sea level is another example of an interval scale
With each of these scales there is a direct, measurable quantity with equality of units
In addition, zero does not represent the absolute lowest value
Rather, it is a point on the scale with numbers both above and below it (for example, -10
degrees Fahrenheit)
Ratio
The ratio scale of measurement is similar to the interval scale in that it also represents
quantity and has equality of units
However, this scale also has an absolute zero (no numbers exist below the zero)
Very often, physical measures will represent ratio data (for example, height and weight)
If one is measuring the length of a piece of wood in centimeters, there is quantity, equal
units, and that measure cannot go below zero centimeters
A negative length is not possible
Generally:
Interval and ratio data are sometimes referred to as parametric, while nominal and
ordinal data are referred to as nonparametric
o Parametric means that it meets certain requirements with respect to parameters
of the population (for example, the data will be normal - the distribution
parallels the normal or bell curve)
o In addition, it means that numbers can be added, subtracted, multiplied, and
divided
o Parametric data are analyzed using statistical techniques identified as
Parametric Statistics
o As a rule, there are more statistical technique options for the analysis of
parametric data and parametric statistics are considered more powerful than
nonparametric statistics
Nonparametric data lack those same parameters and cannot be added, subtracted,
multiplied, or divided
o For example, it does not make sense to add Social Security numbers to get a
third person
o Nonparametric data are analyzed by using Nonparametric Statistics
1.3.1. Introduction
Psychological tests are standardized instruments, with established methods of
interpretation, that are used to measure characteristics such as:
o personality traits
o attitudes
o intelligence, or
o emotional concerns
The primary impetus for the development of the major tests used today was the need for
practical guidelines for solving social problems
The first useful intelligence test was prepared in 1905 by the French psychologists
Alfred Binet and Théodore Simon
o The two developed a 30-item scale to ensure that no child could be denied
instruction in the Paris school system without formal examination
In 1916 the American psychologist Lewis Terman produced the first Stanford Revision
of the Binet-Simon scale to provide comparison standards for Americans from age three
to adulthood
o The test was further revised in 1937 and 1960, and today the Stanford-Binet
remains one of the most widely used intelligence tests
The need to classify soldiers during World War I resulted in the development of two
group intelligence tests:
o Army Alpha and
o Army Beta
To help detect soldiers who might break down in combat, the American psychologist
Robert Woodworth designed the Personal Data Sheet, a forerunner of the modern
personality inventory
During the 1930s controversies over the nature of intelligence led to the development of
the Wechsler-Bellevue Intelligence Scale, which not only provided an index of general
mental ability but also revealed patterns of intellectual strengths and weaknesses
o The Wechsler tests now extend from the preschool through the adult age range
and are at least as prominent as the Stanford-Binet
As interest in the newly emerging field of psychoanalysis grew in the 1930s, two
important projective techniques introduced systematic ways to study unconscious
motivation: the Rorschach inkblot test and the Thematic Apperception Test
During World War II the need for improved methods of personnel selection led to the
expansion of large-scale programs involving multiple methods of personality assessment
Since the late 1960s increased awareness and criticism from both the
public and professional sectors have led to greater efforts to establish
legal controls and more explicit safeguards against misuse of testing
materials
The first vocational tests were published during the 1940s, beginning with the General Aptitude
Test Battery
In educational settings:
o Administrators
o Teachers
o school psychologists, and
o career counselors use tests to make educational decisions including:
admissions
grading
improving instruction and curriculum planning and
career decisions
In clinical settings:
o clinical psychologists
o psychiatrists
o social workers, and
o other health care professionals use tests to make:
diagnostic decisions (identify developmental, visual, and auditory
problems)
determine interventions, and
assess the outcome of treatment programs
In organizational settings:
o human resources professionals and
o industrial/organizational psychologists use tests to make decisions such as:
whom to hire for a particular position (selection),
what training individuals need (classification), and
what performance rating an individual will receive
Achievement Tests
These tests are designed to assess current performance or learned abilities in an academic
area
o Because achievement is viewed as an indicator of previous learning, it is often used
to predict future academic success
o An achievement test administered in a public school setting would typically include
separate measures of:
Vocabulary
language skills and reading comprehension
arithmetic computation and problem solving
science, and
social studies
o Individual achievement is determined by comparison of results with average scores
derived from large representative national or local samples
o Scores may be expressed in terms of “grade-level equivalents”; for example, an advanced
third-grade pupil may be reading on a level equivalent to that of the average fourth-grade
student
Aptitude Tests
These tests predict future performance in an area in which the individual is not currently
trained
They are intended to predict the ability of a person to learn new skills
Schools, businesses, and government agencies often use aptitude tests when assigning
individuals to specific positions
Vocational guidance counseling may involve aptitude testing to help clarify individual
career goals
o If a person's score is similar to scores of others already working in a given
occupation, likelihood of success in that field is predicted
Some aptitude tests cover a broad range of skills pertinent to many different occupations
o The General Aptitude Test Battery, for example, not only measures general
reasoning ability but also includes:
form perception
clerical perception
motor coordination, and
finger and manual dexterity
o Other tests may focus on a single area, such as art, engineering, or modern
languages
Intelligence Tests
Test scores are generally known as intelligence quotients, or IQs, although the various
tests are constructed quite differently
o The Stanford-Binet is heavily weighted with items involving verbal abilities
o the Wechsler scales consist of two separate verbal and performance subscales,
each with its own IQ
o There are also specialized infant intelligence tests, tests that do not require the
use of language, and tests that are designed for group administration
Interest Inventories
These tests measure a person's preferences for various activities and fields of interest
o They are not intended to predict success, but only to offer a framework for
narrowing career possibilities
o Items are commonly grouped into broad interest areas, such as:
– Outdoors
– Mechanical
– Computational
– Scientific
– Persuasive
– Artistic
– Literary
– Musical
– Clerical
For each item, the subject indicates which of three activities is best or
least liked
Personality Inventories
These tests measure social and emotional adjustment and are used to identify the need for
psychological counseling
Items that briefly describe feelings, attitudes, and behaviors are grouped into subscales,
each representing a separate personality or style, such as:
o social extroversion or
o depression
Research has shown that the MMPI (Minnesota Multiphasic Personality
Inventory) may also be used to describe differences among normal personality types
Projective Techniques
Some personality tests are based on the phenomenon of projection, a mental process
described by Sigmund Freud as the tendency to attribute to others personal feelings or
characteristics that are too painful to acknowledge
Because projective techniques are relatively unstructured and offer minimal cues to aid
in defining responses, they tend to elicit concerns that are highly personal and significant
o The best-known projective tests are the Rorschach test, popularly known as the
inkblot test, and the Thematic Apperception Test
o others include:
word-association techniques
Diagnosis
Once a test has been administered, the results may be evaluated to identify deficiencies or
weaknesses that need improvement or require special attention, that is, to make a diagnosis
Prediction
Specialized tests (entrance exams for colleges and universities, personality inventories,
and skin fold measurements) have long been used for the prediction of future events on
the basis of past or present data
Motivation
Specialized tests have been used to provide feedback, marks, and grades on one's
performance for further improvement
Achievement
Specialized tests have been used to measure the level of ability or mastery of subjects
Program Evaluation
Some psychological tests are used for evaluating whether or not a particular program has
successfully achieved its objectives
Establishing norms
Standardized tests often provide such information as one’s level of achievement relative
to a clearly defined subgroup (people of the same age, sex, or class)
Decision
Generally, psychological tests can be used to make:
Three models for scoring are the cumulative model, the categorical model, and the ipsative
model (in which the individual serves as his or her own norm against which performance
is measured)
After completing the test plan, the test developer is ready to begin writing the actual test
questions and administration instructions
After writing the test, the developer conducts a pilot test followed by other studies that
provide the necessary data for validation and norming
3.1.8. Interpretation of results
The most important aspect of psychological testing involves the interpretation of test
results
Scoring
The raw score is the simple numerical count of responses, such as the number of correct
answers on an intelligence test
o The usefulness of the raw score is limited, however, because it does not convey
how well someone does in comparison with others taking the same test
o Percentile scores, standard scores, and norms are all devices for making this
comparison
– Normative data are derived from studies in which the test has been
administered to a large, representative group of people
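A minimal sketch of this comparison follows; the norm-group scores are invented for illustration, and the two helper functions are hypothetical names rather than a standard API:

```python
import numpy as np

# Hypothetical norm-group raw scores (invented values for illustration)
norm_scores = np.array([12, 15, 18, 20, 22, 25, 25, 28, 30, 33])

def z_score(raw: float) -> float:
    """Standard score: distance from the norm-group mean in SD units."""
    return (raw - norm_scores.mean()) / norm_scores.std(ddof=1)

def percentile_rank(raw: float) -> float:
    """Percentage of the norm group scoring at or below the raw score."""
    return 100.0 * np.mean(norm_scores <= raw)

print(round(z_score(28), 2), percentile_rank(28))  # 0.78 80.0
```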
Reliability
Before validity can be demonstrated, a test must first yield consistent, reliable
measurements
o Reliability refers to the degree to which a test score is free from errors of measurement
A measure of how much an individual’s observed score is likely to differ
from the individual’s true score is provided by standard error of
measurement (SEM)
Standard deviation (SD), on the other hand, provides a measure of how much
each individual score varies from the mean
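The standard CTT relationship between these two quantities is SEM = SD × √(1 − reliability); the sketch below applies it, with the test statistics chosen as illustrative assumptions:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Standard CTT formula: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD = 15, reliability = .91  ->  SEM = 4.5
sem = standard_error_of_measurement(15, 0.91)

# Approximate 95% confidence interval for the true score around an observed 110
print(110 - 1.96 * sem, 110 + 1.96 * sem)  # roughly 101.2 to 118.8
```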
Fundamental to the proper evaluation of a test are:
o identification of the major sources of measurement error
o determining the size of the errors resulting from these sources
o indicating the degree of reliability to be expected and
o generalizability of results across:
items
forms
raters
administrations, and
other measurement facets
In measurement, random error arises from:
o student variables
o task sampling
o item calibration/standardization
o Scaling
o measurement error
o item and test design
o administration, and
o scoring protocols
Derived from this theory, reliability is then defined as the ratio of the true score's
variance to the observed score's variance:
o R = Var(T)/Var(X)
The variance of the raw scores is the sum of the variance of the true scores and the
variance of the random error; the equation for this process is as follows:
VAR(X) = VAR(T) + VAR(E)
Given this, it can be shown that the proportion of the variance of the observed scores
VAR(X) that is due to true score variance VAR(T) provides the reliability index of the
test:
VAR(T)/VAR(X) = R
When the variance of true scores is high relative to the variance of the observed scores,
the reliability (R) of the measure will be high (e.g., 50/60 = 0.83), whereas if the variance
of true scores is low relative to the variance of the observed scores, the reliability (R) of
the measure will be low (e.g., 20/60 = 0.33). Reliability values range from 0.00 to 1.00
Rearranging the terms from the above equations, it can be shown that:
R = 1 – [VAR (E)/VAR(X)]
That is, the reliability is equal to 1 – the ratio of random error variance to total score
variance
Further, there are analyses that allow for an estimation of R (reliability), and, of course,
calculating the observed variance of a set of scores is a straightforward process
o Because R and VAR(X) can be calculated, VAR (T) can be solved for with the
following equation: VAR (T) = VAR (X) × R
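One classical analysis of this kind correlates two parallel measurements. The small simulation below (with assumed variance values) illustrates that this correlation recovers Var(T)/Var(X):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

T = rng.normal(0, np.sqrt(40), n)   # true scores, Var(T) = 40
E1 = rng.normal(0, np.sqrt(20), n)  # independent errors, Var(E) = 20
E2 = rng.normal(0, np.sqrt(20), n)
X1, X2 = T + E1, T + E2             # two parallel measurements, Var(X) = 60

print(np.corrcoef(X1, X2)[0, 1])    # ~ 40/60 = 0.67, i.e., Var(T)/Var(X)
```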
It is worth reiterating here that CTT is largely interested in modeling the random error
component of a raw score
o Some error is not random; it is systematic
Estimating reliability
Under the split-half method, the test is divided into two halves and the Spearman-Brown
formula is applied to the correlation between them
o If the correlation between the two halves is .94, then the estimated reliability of
the full-length test is: 2(.94)/(1 + .94) = 1.88/1.94 = .97, where r = .94 is the
correlation between the two halves
Specifically, coefficient alpha is typically used during scale development with items
that have several response options (e.g., 1 = strongly disagree to 5 = strongly agree)
Cronbach alpha is appropriately applied to norm-referenced tests and norm-
referenced decisions (e.g., admissions and placement decisions), but not to criterion-
referenced tests and criterion-referenced decisions (e.g., diagnostic and achievement
decisions)
If the average inter-item correlation is high (e.g., > .6), one can standardize the items
and add them together as an index; from this formulation it can be seen that alpha
measures true variance over total variance
Cronbach introduced the coefficient, and the symbol α, in 1951
The formula for Cronbach alpha is given as:
α = [N/(N − 1)] × [1 − Σ Var(itemi)/Var(X)]
Kuder-Richardson formula 20
For dichotomously scored (0/1) items, coefficient alpha reduces to the Kuder-Richardson
formula 20 (KR-20):
KR-20 = [N/(N − 1)] × [1 − Σ(piqi)/Var(X)]
where:
N = number of items
Σ(piqi) = sum, across items, of the product of the proportion answering correctly (pi)
and the proportion answering incorrectly (qi = 1 − pi), and
Var(X) = composite (total score) variance
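A minimal sketch of the computation follows; the data matrix is hypothetical, and for 0/1 items the result coincides with KR-20 because p × q is then exactly the item variance:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = [N/(N-1)] * (1 - sum of item variances / total score variance)."""
    n = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=0).sum()  # for 0/1 items this is sum(p*q)
    total_var = items.sum(axis=1).var(ddof=0)
    return (n / (n - 1)) * (1 - item_var_sum / total_var)

# Hypothetical dichotomous responses: rows = examinees, columns = items
data = np.array([[1, 1, 0, 1],
                 [1, 0, 0, 0],
                 [1, 1, 1, 1],
                 [0, 1, 0, 1],
                 [1, 1, 0, 0]])
print(round(cronbach_alpha(data), 2))  # 0.41 for this toy matrix
```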
Inter-rater reliability
It refers to the level of agreement between assessments made by two or more assessors of
the same material at a single point in time
It is measured using a family of intraclass correlation coefficients (ICCs; range 0 to 1;
an ICC of at least 0.7 is generally considered acceptable)
Another statistic used to measure inter-rater agreement is Cohen's kappa
Cohen's kappa is the proportion of observed agreement, corrected for chance, divided
by the maximum possible agreement, corrected for chance:
o κ = (Po − Pe)/(1 − Pe), where Po is the observed proportion of agreement and Pe
is the proportion of agreement expected by chance
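A minimal sketch of this computation, using invented ratings for two hypothetical raters:

```python
def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """kappa = (Po - Pe) / (1 - Pe): observed agreement corrected for the
    agreement expected by chance alone."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_e = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

rater_1 = ["yes", "yes", "no", "yes", "no", "no",  "yes", "no"]
rater_2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(rater_1, rater_2))  # 0.5: raw agreement .75, chance agreement .50
```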
Intra-rater reliability
It refers to the level of agreement between assessments made by the same assessor of the
same material presented at two or more times
Factors influencing Reliability
Length of the test: in general, the longer the test, the higher the reliability
Level of test difficulty: norm-referenced tests which are too easy or too difficult will tend
to provide scores of low reliability
Spread of scores: other things being equal, the higher the spread of scores, the higher the
estimate of reliability
Objectivity: the higher the objectivity of a test, the higher its reliability
o The higher the reliability, the lesser the error of measurement
Methods of estimating reliability: in general, the size of the reliability coefficient is
related to the method of estimating reliability
Validity
Validity is the degree of truthfulness of a test score, referring to the extent to which a test
measures what it proposes to measure
Validity is not a property of the test or assessment, but rather of the meaning of the test
scores
Face validity:
A test has face validity if, on inspection, it appears to measure what it claims to measure
Content validity:
A test has content validity if the sample of items in the test is representative of all the
relevant items that might have been used
o In short, face validity can be established by one person, but content validity
should be checked by a panel of subject-matter experts (SMEs)
o Content validity is sample-oriented rather than sign-oriented
o Words included in a spelling test, for example, should cover a wide range of
difficulty
Criterion-related validity
It refers to the extent to which test scores relate systematically to one or more external
criterion measures of performance
Factors influencing validity
Factors in the test itself:
Unclear directions
Difficult vocabulary
Complex sentence structure
Inappropriate level of test item difficulty
poorly constructed test items
Ambiguity
Test too short
Improper arrangement of items
unintended clues
Identifiable pattern of answers
Specific determiners
Grammatical inconsistencies
Verbal associations
Factors in the pupil's response:
What a test measures is influenced by such factors as age, sex, ability level, educational
background, and cultural background of the particular group tested
Other things being equal, the greater the similarity between the performance measured by
the test and the performance represented in the criterion, the larger the validity coefficient
The major psychological testing controversies stem from two interrelated issues:
Factor Analysis
Researchers use factor analysis to identify underlying variables or factors that contribute
to a construct or an overall test score
o Confirmatory factor analyses that confirm predicted underlying variables
provide evidence of construct validity
o Exploratory factor analyses take a broad look at the test data to determine the
maximum number of underlying structures
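As a closing sketch, the snippet below runs an exploratory factor analysis on simulated item scores; the data, the loading matrix, and the choice of scikit-learn's FactorAnalysis (which, unlike dedicated psychometric packages, applies no factor rotation) are all illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 examinees whose 6 item scores are driven by 2 latent factors
latent = rng.normal(size=(200, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                     [0.0, 0.8], [0.1, 0.9], [0.0, 0.7]])
items = latent @ loadings.T + rng.normal(scale=0.3, size=(200, 6))

# Exploratory step: estimate loadings without specifying them in advance
fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))  # estimated factor-item loadings
```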