
Measurement

Joseph Stevens, Ph.D.


© 2005
 Measurement
 Process of assigning quantitative or qualitative
descriptions to some attribute
 Operational Definitions
 Assessment
 Collection of measurement information
 Interpretation
 Synthesis
 Use
 Evaluation
 Value added to assessment information (e.g.
good, poor, “ought”, “needs improvement”)
Assessment Decisions/Purposes
 Instructional
 Curricular
 Treatment/Intervention
 Placement/Classification
 Selection/Admission
 Administration/Policy-making
 Personal/Individual
 Personnel Evaluation
Scaling
 Process of systematically translating
empirical observations into a
measurement scale
 Origin
 Units
 Information
 Types of scales
Score Interpretation
 Direct interpretation
 Need for analysis, relative
interpretation
 Normative interpretation
 Anchoring/Standards
Frames of Reference for
Interpretation
 Current versus future performance
 Typical versus maximum or potential
 Standard of comparison
 To self
 To others
 To standard
 Formative versus summative
Domains
 Cognitive
 Ability/Aptitude
 Achievement
 Memory, perception, etc.
 Affective
 Beliefs
 Attitudes
 Feelings, interests, preferences,
emotions
 Behavior
Cognitive Level

 Knowledge
 Comprehension
 Application
 Analysis/Synthesis
 Evaluation
Assessment Tasks
 Selected Response – MC, T-F, matching
 Restricted Response – cloze, fill-in,
completion
 Constructed Response - essay
 Free Response/Performance Assessments
 Products
 Performances
 Rating
 Ranking
 Magnitude Estimation
CRT versus NRT
 Criterion Referenced Tests (CRT)
 Comparison to a criterion/standard
 Items that represent the domain
 Relevance
 Representativeness
 Norm Referenced Tests
 Comparison to a group
 Items that discriminate one person from
another
Kinds of Scores
 Raw
 Standard scores
 Developmental Standard Scores
 Percentile Ranks (PR)
 Normal Curve Equivalent (NCE)
 Grade Equivalent (GE)
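A minimal sketch (with made-up raw scores) of how a few of these score types relate: linear z scores, empirical percentile ranks, and NCEs on the usual scale (mean 50, SD 21.06), with the NCE taken from the normalized z implied by each percentile rank.

```python
import numpy as np
from scipy import stats

# Made-up raw scores for illustration
raw = np.array([12, 15, 15, 18, 20, 22, 22, 25, 27, 30])

z = (raw - raw.mean()) / raw.std(ddof=1)          # linear standard (z) scores
pr = np.array([stats.percentileofscore(raw, x, kind="mean") for x in raw])  # percentile ranks
z_norm = stats.norm.ppf(pr / 100)                 # normalized z implied by each PR
nce = 50 + 21.06 * z_norm                         # NCE scale: mean 50, SD 21.06

for r, zz, p, n in zip(raw, z, pr, nce):
    print(f"raw={r:3d}  z={zz:+.2f}  PR={p:5.1f}  NCE={n:5.1f}")
```

Grade equivalents and developmental standard scores depend on published norms tables, so they are not sketched here.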
Scoring Methods
 Objective
 Subjective
 Holistic
 Analytic
[Figure: bar chart of Percent (0 to 100) by Standard category (Did Not Meet vs. Met).]
Aggregating Scores
 Total scores
 Summated scores
 Composite scores

 Issues
 Intercorrelation of components
 Variance
 Reliability
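A small sketch (with made-up component scores) of why the intercorrelation of components matters: the variance of a summated composite equals the sum of all component variances and covariances, so strongly intercorrelated components inflate the composite's variance and drive its reliability.

```python
import numpy as np

# Made-up scores on three components for five examinees
comp = np.array([[70, 65, 80],
                 [55, 60, 58],
                 [90, 85, 88],
                 [62, 70, 66],
                 [75, 72, 79]], dtype=float)

total = comp.sum(axis=1)                 # simple summated composite
cov = np.cov(comp, rowvar=False)         # component variances and covariances

# Var(total) = sum of component variances + 2 * sum of covariances
print(np.isclose(total.var(ddof=1), cov.sum()))                              # True
print(round(np.diag(cov).sum(), 1), round(cov.sum() - np.diag(cov).sum(), 1))  # variance vs. covariance share
```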
Theories of Measurement
 Classical Test Theory (CTT)
X=T+E

 Item Response Theory (IRT)


http://work.psych.uiuc.edu/irt/tutorial.asp

P(θ) = e^x / (1 + e^x), where x = a(θ - b)
[Figure: Item Characteristic Curve 2 (a = 0.725, b = -1.367). Probability of a correct response plotted against ability (θ from -3 to 3); logistic response model, Item 2.]

The parameter a is the item discriminating power, the reciprocal (1/a) is the item dispersion, and the parameter b is an item location parameter.
[Figure: Item Characteristic Curve 3 (a = 0.885, b = -0.281). Probability of a correct response plotted against ability (θ from -3 to 3); logistic response model, Item 3.]


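A minimal sketch of the logistic model behind these item characteristic curves, evaluated at the a and b values shown for Items 2 and 3 above (assuming the two-parameter form with x = a(θ - b) and no scaling constant).

```python
import numpy as np

def icc(theta, a, b):
    """Probability of a correct response under the logistic model,
    with discrimination a and location (difficulty) b."""
    x = a * (theta - b)
    return np.exp(x) / (1 + np.exp(x))

theta = np.linspace(-3, 3, 7)   # ability values spanning the plots' x-axis
for label, (a, b) in {"Item 2": (0.725, -1.367), "Item 3": (0.885, -0.281)}.items():
    print(label, np.round(icc(theta, a, b), 2))
```

At θ = b the predicted probability is .50, which is why b is read as the item's location on the ability scale.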
Reliability
 Consistency
 Consistency of Decisions
 Prerequisite to validity
 Errors in measurement
Reliability
 Sources of errors
 Variations in physical and mental condition of
person measured
 Changes in physical or environmental conditions
 Tasks/Items
 Administration conditions
 Time
 Skill to skill
 Raters/judges
 Test forms
Estimating Reliability
 Reliability versus standard error of
measurement (SEM)
 Internal Consistency
 Cronbach’s alpha (see sketch below)
 Split-half
 Example
 Test-Retest
 Inter-rater
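A minimal sketch of the Cronbach's alpha calculation noted above, using the standard formula alpha = k/(k - 1) × (1 - Σσ²_item / σ²_total); the item data are made up.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_persons x k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up 5-person x 4-item data
scores = [[1, 0, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 1, 0, 0],
          [1, 1, 1, 0]]
print(round(cronbach_alpha(scores), 3))
```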
Estimating Reliability
 Correlations, rank order versus exact
agreement
 Percent Agreement
 Exact versus close
 (number of agreements / total number of scores) × 100
 Problem of chance agreements
Estimating Reliability
 Kappa Coefficient
 Takes chance agreements into account
 Calculate expected frequencies and subtract
 Kappa ≥ .70 acceptable
 Examine pattern of disagreements
 Example
 Percent agreement = 63.8%
 r = .509
 Kappa = .451
            Below   Meets   Exceeds   Total
Below          9       3        1       13
Meets          4       8        2       14
Exceeds        2       1        6        9
Total         15      12        9       36
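A short sketch that reproduces the agreement statistics for the table above (rows and columns are the two sets of ratings; which rater is which is not specified on the slide).

```python
import numpy as np

# Agreement table from the slide (rows and columns are the two ratings)
table = np.array([[9, 3, 1],    # Below
                  [4, 8, 2],    # Meets
                  [2, 1, 6]])   # Exceeds

n = table.sum()
po = np.trace(table) / n                                     # observed exact agreement
pe = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2    # agreement expected by chance
kappa = (po - pe) / (1 - pe)                                 # chance-corrected agreement

print(f"percent agreement = {100 * po:.1f}%")   # about 64%
print(f"kappa = {kappa:.3f}")                   # about .45, as reported above
```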
Estimating Reliability
 Spearman-Brown prophecy formula
 More is better
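The prophecy formula is r_new = k·r / (1 + (k - 1)·r), where k is the factor by which the test is lengthened; a tiny sketch:

```python
def spearman_brown(r, k):
    """Projected reliability when test length changes by a factor of k."""
    return k * r / (1 + (k - 1) * r)

# A half-test reliability of .60 projected to the full-length test (k = 2)
print(round(spearman_brown(0.60, 2), 3))   # 0.75
```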
Reliability as error
 Systematic error
 Random error
 SEM
SEM = SDx × √(1 - rxx)
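For example, with SDx = 15 and rxx = .91 (illustrative values), SEM = 15 × √(1 - .91) = 15 × .30 = 4.5 score points.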
Factors affecting reliability
 Time limits
 Test length
 Item characteristics
 Difficulty
 Discrimination
 Heterogeneity of sample
 Number of raters, quality of
subjective scoring
Validity
 Accuracy
 Unified View (Messick)
 Use and Interpretation
 Evidential basis
 Content
 Criterion
 Concurrent-Discriminant
 Construct
 Consequential basis
Validity
 Internal, structural
 Multitrait-Multimethod (Campbell &
Fiske)
 Predictive
Test Development
 Construct Representation
 Content analysis
 Review of research
 Direct observation
 Expert judgment (panels, ratings,
Delphi)
 Instructional objectives
Test Development
 Blueprint
 Content X Process
 Domain sampling
 Item frames
 Matching item type and response format
to purpose
 Item writing
 Item Review (grammar, readability,
cueing, sensitivity)
Test Development
 Writing instructions
 Form design (NAEP brown ink)
 Field and pilot testing
 Item analysis
 Review and revision
Equating
 Need to link across forms, people, or
occasions
 Horizontal equating
 Vertical equating
 Designs
 Common item
 Common persons
Equating
 Equipercentile
 Linear
 IRT
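A minimal sketch of the linear method under made-up form statistics: a Form X score is placed on the Form Y scale by matching the two forms' means and standard deviations. Equipercentile and IRT equating need the full score distributions or item parameters, so they are not shown.

```python
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Linear equating: a Form X score expressed on the Form Y scale by
    matching means and standard deviations."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Hypothetical form statistics
print(round(linear_equate(30, mean_x=28.0, sd_x=6.0, mean_y=31.0, sd_y=5.0), 2))  # 32.67
```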
Bias and Sensitivity
 Sensitivity in item and test
development
 Differential results versus bias
 Differential Item Functioning (DIF)
 Importance of matching, legal versus
psychometric
 Understanding diversity and individual
differences
Item Analysis
 Difficulty, p
 Means and standard deviations
 Discrimination, r-point biserial
 Omits
 Removing or revising “bad” items
 Example
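A small sketch of the basic item statistics on made-up 0/1 responses: difficulty as the proportion answering correctly, and discrimination as the item-total point-biserial correlation (uncorrected; a corrected version would drop the item from the total before correlating).

```python
import numpy as np

# Made-up 0/1 item responses: rows = examinees, columns = items
resp = np.array([[1, 1, 0, 1],
                 [1, 0, 0, 0],
                 [1, 1, 1, 1],
                 [0, 1, 0, 1],
                 [1, 1, 0, 0],
                 [0, 0, 0, 1]])

total = resp.sum(axis=1)
p = resp.mean(axis=0)   # item difficulty: proportion answering correctly

# Discrimination: point-biserial correlation of each item with the total score
r_pb = [np.corrcoef(resp[:, j], total)[0, 1] for j in range(resp.shape[1])]

for j, (pj, rj) in enumerate(zip(p, r_pb), start=1):
    print(f"item {j}: p = {pj:.2f}, r_pbis = {rj:.2f}")
```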
Factor Analysis
 Method of evaluating structural
validity and reliability
 Exploratory (EFA) example
 Confirmatory (CFA) example
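A minimal exploratory sketch using simulated single-factor data and scikit-learn's FactorAnalysis (assumed available); a confirmatory analysis would normally be specified in a structural equation modeling package instead.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 examinees on 6 items driven by one common factor (illustrative)
factor = rng.normal(size=(200, 1))
loadings = np.array([[0.8, 0.7, 0.6, 0.75, 0.65, 0.7]])
items = factor @ loadings + rng.normal(scale=0.5, size=(200, 6))

efa = FactorAnalysis(n_components=1).fit(items)
print(np.round(efa.components_, 2))   # estimated loadings (sign is arbitrary)
```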
