• Definitions
– Correlation, Reliability, Validity, Measurement error
• Theories of Reliability
• Types of Reliability
– Standard Error of Measurement
• Types of Validity
• Article
• Exercise
Definitions
• Correlation
– Reflects direction (+/-) & strength (0 to 1) of the relation between two variables
• Variance explained
– Reflects the strength of relation of two variables
• Square of correlation
• Varies from 0 to 1
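These definitions can be checked numerically; a minimal sketch with made-up height/weight values (all numbers are hypothetical):

```python
import numpy as np

# Hypothetical height (cm) and weight (pounds) values for illustration
height = np.array([165, 170, 175, 180, 185, 190])
weight = np.array([130, 150, 145, 170, 180, 210])

r = np.corrcoef(height, weight)[0, 1]  # correlation: direction & strength
variance_explained = r ** 2            # square of the correlation, 0 to 1

print(f"r = {r:.2f}, r^2 = {variance_explained:.0%}")
```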
[Scatter plot: Height (cm) vs. Weight (pounds); points labeled Vince Carter, Julia Roberts, Calista Flockhart]
[Scatter plot: Height (cm) vs. Weight (pounds), same data; r = .76, r² = 58%]
Effect of Measurement Error on Correlations
[Scatter plot: Height (cm) vs. Height (cm); r = 1.00, r² = 100%]
[Scatter plot: Objective Height (cm) vs. Self-Reported Height (cm); r = .98, r² = 96%]
[Scatter plot: Objective Weight (pounds) vs. Self-Reported Weight (pounds); r = .92, r² = 85%]
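The pattern in these plots follows the classical attenuation formula: unreliability in either measure shrinks the observed correlation toward zero. A minimal sketch (the correlation and reliability values below are made up for illustration):

```python
import math

def attenuated_r(r_true, rel_x, rel_y):
    """Observed r = true-score r scaled by the square roots of the
    two measures' reliabilities (classical attenuation formula)."""
    return r_true * math.sqrt(rel_x * rel_y)

# A true correlation of .80 is observed as something smaller when
# the self-report measures are imperfectly reliable
observed = attenuated_r(0.80, 0.96, 0.85)
print(round(observed, 2))  # below .80
```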
Definitions
• Reliability
• Consistency & stability of measurement
• Reliability is necessary but not sufficient for
validity
• E.g., A measuring tape is not a valid way to measure weight, even though the tape reliably measures height and height correlates with weight
• Validity
• Accuracy/meaning of measurement
• Test-retest
• Consistency across time
• Parallel forms
• Consistency across versions
• Internal
• Consistency across items
• Scorer (inter-rater)
• Consistency across raters/judges
Example: The Satisfaction with Life Scale (SWLS)
1. In most ways my life is close to ideal.
2. The conditions of my life are excellent.
3. I am satisfied with my life.
4. So far I have gotten the important things I want in my life.
5. If I could live my life over, I would change almost nothing.
Response scale: 1 (Strongly Disagree) to 7 (Strongly Agree)
Types of Reliability
• Test-retest reliability
• Correlation of scores on the same measure taken at two different times
• Time interval should be long enough to avoid memory/learning effects
• Parallel-forms
• Correlation of scores on similar versions of the measure
• Forms equivalent on mean, standard deviation, & inter-correlations
• Can include a time interval between administrations of the two forms
Types of Reliability
Test-retest Reliability
          Time 1                   Time 2
     I1   I2   I3   AvgT1    I1   I2   I3   AvgT2
P1   __   __   __    __      __   __   __    __
P2   __   __   __    __      __   __   __    __
P3   __   __   __    __      __   __   __    __
Correlate AvgT1 to AvgT2 to get reliability
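Under the scheme above, the reliability estimate is just a Pearson correlation between the two columns of averages; a minimal numpy sketch with invented scores for five participants:

```python
import numpy as np

# Hypothetical SWLS averages for five participants at two times
avg_t1 = np.array([5.2, 3.8, 6.0, 4.4, 2.9])
avg_t2 = np.array([5.0, 4.1, 5.8, 4.6, 3.2])

# Test-retest reliability = correlation of Time 1 and Time 2 averages
retest_r = np.corrcoef(avg_t1, avg_t2)[0, 1]
print(f"test-retest r = {retest_r:.2f}")
```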
[Scatter plot: SWLS Time 1 (Beginning of Semester) vs. SWLS Time 2 (End of Semester)]
Test-retest reliability of SWLS
• Good test-retest reliability
– Participants have similar scores at Time 1 (beginning of semester) and at Time 2 (end of semester).
– Retest reliability is useful for constructs assumed to be stable.
– Current mood (e.g., how you feel right now) shows low retest correlations, but that does not mean the mood measure is unreliable.
Types of Reliability
• Internal Consistency
• Correlation of scores on two halves of the measure
• Longer measures tend to be more reliable
• Inter-rater
• Correlation of raters’ scores
• E.g., Scores on structured job interview
• Can also include time interval
– e.g., ratings of the worth of jobs across time & across judges
Types of Reliability
Internal Reliability
          Half 1                   Half 2
     I1   I2   I3   AvgH1    I4   I5   I6   AvgH2
P1   __   __   __    __      __   __   __    __
P2   __   __   __    __      __   __   __    __
P3   __   __   __    __      __   __   __    __
Correlate AvgH1 to AvgH2
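A sketch of the split-half computation, with a Spearman–Brown step-up to full-test length (a standard correction, since the half-test correlation underestimates the full measure's reliability); the item scores are invented:

```python
import numpy as np

# Hypothetical item scores (rows = participants, columns = 6 items)
items = np.array([
    [5, 6, 5, 6, 5, 6],
    [3, 2, 3, 3, 2, 3],
    [7, 6, 7, 6, 7, 6],
    [4, 4, 5, 4, 4, 5],
])

half1 = items[:, :3].mean(axis=1)   # average of items 1-3
half2 = items[:, 3:].mean(axis=1)   # average of items 4-6
r_half = np.corrcoef(half1, half2)[0, 1]

# Spearman-Brown correction: project the half-test correlation
# to the reliability of the full-length measure
full_rel = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected = {full_rel:.2f}")
```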
Inter-rater Reliability
          Rater 1                  Rater 2
     I1   I2   I3   AvgR1    I1   I2   I3   AvgR2
P1   __   __   __    __      __   __   __    __
P2   __   __   __    __      __   __   __    __
P3   __   __   __    __      __   __   __    __
Correlate AvgR1 to AvgR2
[Scatter plot: SWLS Items 1 & 2 vs. SWLS Items 3, 4, & 5; r = .70, r² = 49%]
Internal consistency of SWLS
• Satisfactory internal consistency
• Participants respond similarly to items that are supposed to measure the same variable.
• Internal consistency should be .70 or higher.
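In practice, internal consistency is usually reported as Cronbach's alpha rather than a single split-half correlation; a minimal sketch with invented SWLS-style responses:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical SWLS responses (rows = participants, columns = 5 items)
data = [
    [6, 5, 6, 5, 6],
    [2, 3, 2, 3, 2],
    [7, 6, 7, 7, 6],
    [4, 4, 5, 4, 4],
]
alpha = cronbach_alpha(data)
print(f"alpha = {alpha:.2f}")  # compare against the .70 rule of thumb
```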
• Test-retest
• Parallel forms
• Internal
• Scorer (inter-rater)
Standard Error of Measurement
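The standard error of measurement follows directly from reliability: SEM = SD × √(1 − reliability), the expected spread of observed scores around a person's true score. A minimal sketch (the SD and reliability values are illustrative):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Illustrative numbers: test SD = 15, reliability = .91
print(sem(15, 0.91))  # likely band of measurement error around a score
```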
Validity
Evidence that a measure assesses the construct
Reasons for Invalid Measures
• Different understanding of items
• Different use of the scale (Response Styles)
• Intentionally presenting false information (socially desirable responding, other-deception)
• Unintentionally presenting false information (self-deception)
Types of Validity
• Content Validity
• Extent to which items on the measure are a good representation of the construct
• e.g., Is your job interview based on what is required for the job?
• Content validity ratio based on judges’
assessments of a measure’s content
• e.g., Expert (supervisors, incumbents) rating of job
relevance of interview questions
Types of Validity
• Criterion-related Validity
• Extent to which a new measure relates to another
known measure
• Validity coefficient = size of the relation (i.e., the correlation) between the new measure (predictor) and the known measure (criterion)
• e.g., do scores on your job interview predict
performance evaluation scores?
Types of Criterion Validity
• Concurrent
• Scores on predictor and criterion are collected
simultaneously (e.g., police officer study)
• Distinguishes between participants in sample who
are already known to be different from each other
• Weaknesses
• Range restriction
– Sample excludes those who were not hired, as well as those later fired or promoted
• Differences in test-taking motivation (employees vs.
applicants)
• Experience with job can affect scores on criterion
Types of Criterion Validity
• Predictive
• Scores on predictor (e.g., selection test) collected
some time before scores on criterion (e.g., job
performance)
• Able to differentiate individuals on a criterion
assessed in the future
• Weaknesses
• Due to management pressures, applicants can be chosen
based on scores on predictor (can have range restriction,
but this can be corrected)
• Often, special measures of job performance are
developed for validation study
Correction for range restriction
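One standard correction (Thorndike's Case II, for direct restriction on the predictor) scales the restricted correlation back up using the ratio of unrestricted to restricted standard deviations; a minimal sketch with invented numbers:

```python
import math

def correct_range_restriction(r, sd_restricted, sd_unrestricted):
    """Thorndike's Case II correction for direct range restriction."""
    u = sd_unrestricted / sd_restricted
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

# Illustrative: r = .30 among hires; applicant SD is twice the hire SD
corrected = correct_range_restriction(0.30, 5.0, 10.0)
print(round(corrected, 2))
```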
• Construct Validity
• Extent to which hypotheses about construct are
supported by data
1. Define construct, generate hypotheses about
construct’s relation to other constructs
2. Develop comprehensive measure of construct & assess
its reliability
3. Examine relationship of measure of construct to other,
similar and dissimilar constructs
• Validation Study
– Incremental validity
• Additional variance explained (LSOM vs LSI)
DV                           LSOM   LSI
Subjective assessment         .15    .01
Interactional instruction     .21    .04
Informational instruction     .06    .00
In-class Exercise
• Brainstorm constructs to develop measures
• E.g. Dimensions of CIR professor effectiveness, CIR
student effectiveness
• Choose two constructs that can be measured
similarly and be defined clearly
• Example measures
– Self-report (rating scales)
– Peer/informant reports
– Observation
– Archival measures
– Trace measures, etc.
In-class Exercise
• Form two-person groups to
• Generate items for each of the two measures for each of the two constructs
• Appointed person collects all items for both
measures for both constructs
• Compiles & distributes measures to class
• Class gathers data on both measures & both
constructs
• Class enters data into SPSS format
• Compute reliabilities, means, & correlations
Fill in the correlations
         C1-M1   C1-M2   C2-M1   C2-M2
C1-M1    1.00
C1-M2     __     1.00
C2-M1     __      __     1.00
C2-M2     __      __      __     1.00
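Once the class data are collected, the grid can be filled in by correlating every construct–measure column with every other; a minimal numpy sketch with invented scores (the labels stand in for the class's actual constructs and measures):

```python
import numpy as np

# Hypothetical scores: columns are construct-by-measure combinations,
# rows are participants
scores = np.array([
    [5.0, 3.2, 4.8, 3.0],
    [2.1, 4.5, 2.4, 4.2],
    [6.3, 2.8, 6.0, 3.1],
    [4.4, 5.1, 4.1, 5.3],
])
labels = ["C1-M1", "C1-M2", "C2-M1", "C2-M2"]

# Full correlation matrix across the four columns
corr = np.corrcoef(scores, rowvar=False)
for row_label, row in zip(labels, corr):
    print(row_label, np.round(row, 2))
```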