Measurement

LEARNING OBJECTIVES
A year from now, you should still be able to:
1. Interrogate the construct validity of a study's variables.
2. Describe the kinds of evidence that support the construct validity of a measured variable.

WHETHER STUDYING THE NUMBER of polar bears left in the Arctic Circle, the strength of a bar of steel, the number of steps people take each day, or the level of human happiness, every scientist faces the challenge of measurement. When researchers test theories or pursue empirical questions, they have to systematically observe the phenomena by collecting data. Such systematic observations require measurements, and these measurements must be good ones, or else they are useless.

Measurement in psychological research can be particularly challenging. Many of the phenomena psychologists are interested in, such as motivation, emotion, thinking, and reasoning, are difficult to measure directly. Happiness, the topic of much research, is a good example of a construct that could be hard to assess. Is it really possible to quantify how happy people are? Are the measurements accurate? Before testing, for example, whether people who make more money are happier, we might ask whether we can really measure happiness. Maybe people misrepresent their level of well-being, or maybe people aren't aware of how happy they are. How do we evaluate who is really happy and who isn't? This chapter explains how to ask questions about the quality of a study's measures: the construct validity of quantifications of things like happiness, gratitude, or wealth. Construct validity, remember, refers to how well a study's variables are measured or manipulated.

Construct validity is a crucial piece of any psychological research study, whether it supports a frequency, association, or causal claim. This chapter focuses on the construct validity of measured variables. You will learn, first, about different ways researchers operationalize their variables.
Then you'll learn how to interrogate the construct validity of those measurements. (The construct validity of manipulated variables is covered in Chapter 10; for a review of measured and manipulated variables, see Chapter 3, pp. 58-59.)

WAYS TO MEASURE VARIABLES

The process of measuring variables involves two decisions. Researchers decide how they should operationalize each variable, choosing among three common types of measures: self-report, observational, and physiological. They also decide what scale of measurement to use for each variable they plan to investigate.

More About Conceptual and Operational Variables

In Chapter 3, you learned about conceptual and operational definitions of variables. The conceptual definition, or construct, is the researcher's definition of the variable in question at a theoretical level. The operational definition represents the researcher's specific decision about how to measure the conceptual variable.

Happiness is a good example. One research team, led by Ed Diener, began the process of measuring happiness by developing a precise conceptual definition. Because the word happiness might have a variety of meanings, Diener's team narrowed their interest to "subjective well-being" (well-being from the person's own perspective). After defining happiness this way, the team created an operational definition: They operationalized subjective well-being using five statements in a questionnaire format (Pavot & Diener, 1993). They decided a self-report questionnaire was appropriate because subjective well-being is, in part, defined by each person's own criteria for what constitutes a good life. They worded their questions to ask people about their satisfaction with life, and they asked people to respond on a 7-point scale, on which 1 corresponded to "strongly disagree" and 7 corresponded to "strongly agree":

1. In most ways my life is close to my ideal.
2. The conditions of my life are excellent.
3. I am satisfied with my life.
4. So far I have gotten the important things I want in life.
5. If I could live my life over, I would change almost nothing.

The unhappiest people would get a total score of 5 on this self-report scale because they would answer "strongly disagree," or 1, to all five items (1 + 1 + 1 + 1 + 1 = 5). The happiest people would get a total score of 35 on this scale because they would answer "strongly agree," or 7, to all five items (7 + 7 + 7 + 7 + 7 = 35). Those at the neutral point would score 20, right in between satisfied and dissatisfied (4 + 4 + 4 + 4 + 4 = 20). Diener and Diener (1996) reported some data based on this scale, concluding that most people are happy, meaning most people scored above 20. For example, 63% of high school and college students scored above 20 in one study, and 72% of disabled adults scored above 20 in another study.

In choosing this operational definition of subjective well-being, the research team started with only one possible measure, even though there are many other ways to study this concept. Another way to measure happiness is to use a single question called the Ladder of Life (Cantril, 1965). The question goes like this:

Imagine a ladder with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally stand at this time?

On this measure, participants respond by giving a value between 0 and 10. The Gallup polling organization uses the Ladder of Life scale in its daily Gallup-Healthways Well-Being Index.

You might be thinking one of these operational definitions seems like a better measure of happiness than the other. Which one do you think is best? We'll see that they both do a good job of measuring the construct. Diener's research team and Gallup have both learned their measures of happiness are accurate because they have collected data on them, as we'll see later in this chapter.

OPERATIONALIZING OTHER CONCEPTUAL VARIABLES

To study conceptual variables other than happiness, researchers follow a similar process: They start by stating a definition of their construct (the conceptual variable) and then create an operational definition. For example, to measure the association between wealth and happiness, researchers need to measure not only happiness, but also wealth. They might operationally define wealth by asking about salary in dollars, by asking for bank account balances, or even by observing the kind of car people drive.

Consider another variable that has been studied in research on relationships: gratitude toward one's partner. Researchers who measure gratitude toward a relationship partner might operationalize it by asking people how often they thank their partner for something they did. Or they might ask people how appreciative they usually feel. Even a simple variable such as gender must be operationalized, and each conceptual variable can be operationalized in a number of ways, as Table 5.1 shows. In fact, operationalization is one place where a researcher's creativity comes into the research process, as researchers work to develop new and better measures of their constructs.

Three Common Types of Measures

The types of measures psychological scientists typically use to operationalize variables generally fall into three categories: self-report, observational, and physiological.

SELF-REPORT MEASURES

A self-report measure operationalizes a variable by recording people's answers to questions about themselves. Asking people how much they appreciate their partner and asking them about their life satisfaction are both self-report measures. If stress were the variable being studied, researchers might ask people to self-report on the frequency of specific events they've experienced in the past year, such as marriage, divorce, or moving (e.g., Holmes & Rahe, 1967). In research on children, self-reports are often replaced with parent reports or teacher reports. These measures ask parents or teachers to respond to a series of questions, such as describing the words the child knows.

PHYSIOLOGICAL MEASURES

A physiological measure operationalizes a variable by recording biological data, with the use of equipment to amplify, record, and analyze it. For example, moment-to-moment happiness has been measured using facial electromyography (EMG), a way of electronically recording tiny movements in the muscles in the face. Facial EMG can be said to detect a happy facial expression because people who are smiling show particular patterns of muscle movement around the eyes and cheeks.

Other constructs might be measured using a brain scanning technique called functional magnetic resonance imaging, or fMRI. In a typical fMRI study, people engage in a carefully structured series of psychological tasks (such as looking at three types of photos or playing a series of rock-paper-scissors games) while lying in an MRI machine. The MRI equipment records and codes the relative changes in blood flow in particular regions of the brain, as shown in Figure 5.1.

FIGURE 5.1 Wins vs. Losses contrast in Rock, Paper, Scissors. Images from fMRI scans showing brain activity. In this study of how people respond to rewards and losses, the researchers tracked blood flow patterns in the brain when people had either won, lost, or tied a rock-paper-scissors game played with a computer. They found that many regions of the brain respond more to wins than to losses, as indicated by the highlighted regions. (Source: Vickery, Chun, & Lee, 2011.)
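The 5-to-35 scoring of Diener's five-item scale, described earlier, amounts to a simple sum of the item responses. A minimal sketch in Python (the function name and response values are ours, for illustration only):

```python
def swl_total(responses):
    """Sum five 1-7 ratings into a Satisfaction with Life total (5-35)."""
    assert len(responses) == 5 and all(1 <= r <= 7 for r in responses)
    return sum(responses)

print(swl_total([1, 1, 1, 1, 1]))  # 5: the unhappiest possible score
print(swl_total([4, 4, 4, 4, 4]))  # 20: the neutral point
print(swl_total([7, 7, 7, 7, 7]))  # 35: the happiest possible score
```

Any response profile summing above 20 would count as "happy" in the Diener and Diener sense described above.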
Some people erroneously believe physiological measures are the most accurate, but physiological measures, like self-report and observational ones, have to be validated by using other measures. For instance, as mentioned above, researchers used fMRI to study how patterns of brain activity are related to level of intelligence. But how did they know what the fMRI patterns meant in the first place? Before doing the fMRI scans, the researchers measured each participant's intelligence with a paper-and-pencil IQ test, an observational measure (Deary et al., 2010). Similarly, the only way a researcher could know that some pattern of brain activity was associated with happiness is by asking each person how happy he or she felt, via self-report, at the time the brain scan was being done. As we'll learn later in this chapter, it's best when self-report, observational, and physiological measures show similar patterns of results.

Scales of Measurement

All variables must have at least two levels (see Chapter 3). The levels of operational variables, however, can be coded using different scales of measurement.

CATEGORICAL VS. QUANTITATIVE VARIABLES

Operational variables are primarily classified as categorical or quantitative. The levels of categorical variables, as the name suggests, are categories. Examples are sex, whose levels are male and female, and species, whose levels in a particular study might be rhesus macaque, chimpanzee, and bonobo. A researcher might decide to assign numbers to the levels of a categorical variable (e.g., "1" for rhesus macaques, "2" for chimpanzees), but the numbers stand only for category membership. Quantitative variables, by contrast, are coded with meaningful numerals, using one of three scales of measurement: ordinal, interval, or ratio.

An ordinal scale of measurement applies when the numerals of a quantitative variable represent a rank order. Suppose a professor stacks exams in the order in which students turned them in, to see how fast students worked. This represents ordinal data because the fastest exams are on the bottom of the pile, ranked 1. However, this variable has not quantified how much faster each exam was turned in, compared with the others.

An interval scale of measurement applies to the numerals of a quantitative variable that meet two conditions: First, the numerals represent equal intervals (distances) between levels, and second, there is no "true zero" (a person can get a score of 0, but the 0 does not really mean "nothing"). An IQ test is an interval scale: the distance between IQ scores of 100 and 105 represents the same as the distance between IQ scores of 105 and 110. However, a score of 0 on an IQ test does not mean a person has "no intelligence." Body temperature in degrees Celsius is another example of an interval scale; the intervals between levels are equal, but a temperature of 0 degrees does not mean a person has "no temperature." Most researchers assume questionnaire scales like Diener's (scored from 1 = strongly disagree to 7 = strongly agree) are interval scales. They do not have a true zero, but we assume the distances between numerals, from 1 to 7, are equivalent. Because they do not have a true zero, interval scales cannot allow a researcher to say things like "twice as hot" or "three times happier."

Finally, a ratio scale of measurement applies when the numerals of a quantitative variable have equal intervals and when the value of 0 truly means "none" or "nothing" of the variable being measured. On a knowledge test, a researcher might measure how many items people answer correctly. If people get a 0, it truly represents "nothing correct" (0 answers correct). A researcher might also measure how frequently people blink their eyes in a stressful situation; the number of eyeblinks is likewise a ratio scale, because 0 blinks truly means no blinks.
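The "twice as hot" caveat can be made concrete with a quick calculation. Celsius is an interval scale, while Kelvin, which has a true zero, is a ratio scale; the sketch below (our own illustration, not from the chapter) shows why a ratio of Celsius values is misleading:

```python
def celsius_to_kelvin(c):
    """Convert an interval-scale Celsius reading to ratio-scale Kelvin."""
    return c + 273.15

naive_ratio = 20 / 10  # looks like 20 degrees C is "twice as hot" as 10 degrees C
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # 1.035: only about 3.5% hotter in ratio terms
```

The same reasoning explains why a Diener score of 14 does not mean someone is "twice as happy" as someone scoring 7.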
RELIABILITY OF MEASUREMENT: ARE THE SCORES CONSISTENT?

A reliable measure yields consistent scores. Consistency can also apply within a scale: when people who agree with one of a scale's items also tend to agree with its other items, the scale has internal reliability.

To find out whether a measure is reliable, researchers collect data and examine the association between one set of scores and another: between scores collected at an earlier time and a later time, or between the scores given by one coder and another. Here's an example of how they might go about documenting reliability. Suppose you measured the head circumference, in centimeters, of everyone in a classroom, and then measured everyone a second time (a test-retest comparison), or had someone else measure everyone as well (interrater reliability). Figure 5.2 shows how the results of such a measurement might look, in the form of a data table and a scatterplot.

FIGURE 5.2 Two measurements of head circumference. (A) The data for four participants in table form. (B) The same data presented in a scatterplot, with the first measurement (cm) on one axis and the second measurement (cm) on the other.
We would expect the two measurements of head circumference to be about the same for each person, so in the scatterplot the dots should fall almost exactly on the sloping line that would indicate perfect agreement. The two measures won't always be exactly the same, however: slight variations in the placement of the tape measure will lead to slightly different scores for each trial, so the dots may sit near, rather than exactly on, the line.

SCATTERPLOTS CAN SHOW INTERRATER AGREEMENT

Scatterplots can also reveal interrater agreement. Suppose two observers, Mark and Matt, are rating how happy each of a group of children appears to be as the children play, and the researchers want to know how well the two observers' ratings agree. From their notes, they create a scatterplot, plotting Observer Mark's ratings on the x-axis and Observer Matt's ratings on the y-axis.

If the data looked like those in Figure 5.3A, the ratings would have high interrater reliability. Both Mark and Matt rate Jay's happiness as 9, one of the happiest kids on the playground. Observer Mark rates Jackie a 2, one of the least happy kids; Observer Matt agrees because he rates her 3, and so on. The two observers do not show perfect agreement, but there are no great disagreements about the happiest and least happy kids. Again, the points are scattered around the plot a bit, but they hover close to the sloping line that would indicate perfect agreement.

In contrast, suppose the data looked like Figure 5.3B, which shows much less agreement. Here, the two observers are Mark and Peter, and they are watching the same children at the same time, but Mark gives Jay a rating of 9 and Peter rates him only a 6. Mark considers Jackie's behavior to be shy and withdrawn and rates her a 2, but Peter thinks she seems calm and content and rates her a 7. Here the interrater reliability would be considered unacceptably low. One reason could be that the observers did not have a clear enough operational definition of "happiness" to work with. Another reason could be that one or both of the coders has not been trained well enough yet.

A scatterplot can thus be a helpful tool for visualizing the agreement between two administrations of the same measurement (test-retest reliability) or between two coders (interrater reliability). Using a scatterplot, you can see whether the two sets of scores line up.
USING THE CORRELATION COEFFICIENT r

Scatterplots can illustrate agreement, but researchers more commonly summarize it with a single statistic, usually referred to as the correlation coefficient, or r. A scatterplot conveys two things about a relationship: the direction of its slope (positive, negative, or flat) and how close the dots are to a straight line drawn through them. The correlation coefficient r captures the same two things. The sign of r indicates the direction of the slope, and the size of r reflects how spread out the dots are: when the dots are close to the line, r is strong (close to 1, or close to -1 for a negative slope); when the dots are spread out, the relationship is weak and r is close to 0. Figure 5.4 shows scatterplots with a variety of correlations.

FIGURE 5.4 Scatterplots showing correlations of different directions and strengths: r = .56, r = .93, r = -.59, and r = .01.

When r is used to assess test-retest reliability, we do not expect the test-retest correlation to be perfect, but it should be strong and positive. Inspecting correlations can also shed light on internal reliability: items that measure the same construct should correlate with one another. In one set of items, for example, Items 1 and 2 might correlate well since they are similar to each other, but Items 1 and 3 are probably not correlated, and Item 4 doesn't seem to go with any other item, either. But how could we quantify a scale's internal reliability overall?
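Interrater agreement of the kind plotted in Figures 5.3A and 5.3B can be quantified with r directly. A minimal sketch (the ratings below are invented for illustration, loosely echoing the Mark/Matt and Mark/Peter examples):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Invented ratings (1-10) of the same six children by three observers.
mark = [9, 2, 7, 5, 8, 3]
matt = [9, 3, 6, 5, 8, 4]    # tracks Mark closely, so r should be near 1
peter = [6, 7, 4, 8, 5, 6]   # disagrees with Mark, so r should be low

print(round(pearson_r(mark, matt), 2))
print(round(pearson_r(mark, peter), 2))
```

The first pair of observers yields a strong positive r; the second pair does not, mirroring the high- and low-agreement scatterplots.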
INTERRATER RELIABILITY

If the correlation between two observers' ratings is strong and positive, we can trust the ratings: interrater reliability is good. If r is positive but weak, we could not trust the observers' ratings; we would retrain the coders or refine our operational definition. A negative correlation would indicate a big problem. In the daycare example, it would mean that when one observer considered a child very happy, the other considered the same child unhappy, and so on. When we're assessing interrater reliability, a negative correlation is rare and undesirable. (Other statistics are used to assess interrater reliability when the observers are sorting targets into the same categories rather than rating them on a scale.)

INTERNAL RELIABILITY

Internal reliability is relevant for measures that use more than one item to get at the same construct, such as Diener's five-item subjective well-being scale, which poses the same basic question phrased in multiple ways. Researchers rephrase the items because any one way of wording a question might introduce measurement error, and they predict any such errors will cancel each other out when the items are summed up to form each person's score. Before combining the items on a self-report scale, however, researchers need to assess the scale's internal reliability: whether people responded consistently across the items.

Researchers compute Cronbach's alpha (or coefficient alpha) to see if their measurement scales have internal reliability. First, they collect data on the scale from a large sample of participants. Then alpha returns one number, computed from the average of the inter-item correlations and the number of items in the scale. The closer Cronbach's alpha is to 1.0, the better the scale's reliability. (For self-report measures, researchers are looking for a Cronbach's alpha of .70 or higher.) If Cronbach's alpha is high, there is good internal reliability and researchers can sum all the items together. If Cronbach's alpha is less than .70, internal reliability is poor and the researchers are not justified in combining all the items into one scale. They have to go back and revise the items, or they might select only those items that were found to correlate strongly with one another.

In empirical articles, researchers report evidence for the reliability of the measures they are using. One example of such evidence is in Figure 5.5, which shows how the subjective well-being scale, called Satisfaction with Life (SWL), was used in six studies. The table shows the internal reliability (labeled as coefficient alpha) from each of these studies, as well as test-retest reliability for each one. The table did not present interrater reliability because the scale is a self-report measure, and interrater reliability is relevant only when two or more observers are doing the ratings. Based on the evidence in this table, we can conclude the subjective well-being scale has excellent internal reliability and excellent test-retest reliability. You'll see another example of how reliability is discussed in a journal article in the Working It Through section at the end of this chapter.
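The alpha computation described above can be sketched as follows. The sketch uses the standard variance-based form of Cronbach's alpha, and the response data are invented for illustration:

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of participant scores.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
    """
    k = len(items)
    n = len(items[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[p] for item in items) for p in range(n)]
    return (k / (k - 1)) * (1 - sum(variance(it) for it in items) / variance(totals))

# Invented responses from five participants to a three-item, 7-point scale.
item1 = [7, 5, 2, 6, 3]
item2 = [6, 5, 1, 7, 3]
item3 = [7, 4, 2, 6, 2]  # the items track one another, so alpha should exceed .70

print(round(cronbach_alpha([item1, item2, item3]), 2))
```

Because these hypothetical items rise and fall together across participants, alpha comes out well above the .70 benchmark, so summing them into one scale score would be justified.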
FIGURE 5.5 Reliability evidence for the Satisfaction with Life scale as used in six studies: each study's coefficient alpha (internal reliability) and test-retest correlation.

CHECK YOUR UNDERSTANDING

1. Reliability is about consistency. Define the three kinds of reliability, using the word consistent in each of your definitions.
2. For each of the three common types of operationalizations (self-report, observational, and physiological), indicate which type(s) of reliability would be relevant.
3. Which of the following correlations is the strongest: r = .25, r = -.65, r = -.01, or r = .43?

VALIDITY OF MEASUREMENT: DOES IT MEASURE WHAT IT'S SUPPOSED TO MEASURE?

Activity monitors such as pedometers record how many steps people take each day (Figure 5.7). How can you know for sure these pedometers are accurate? Of course, it's straightforward to evaluate the validity of a pedometer: You'd simply walk around, counting your steps while wearing one, then compare your own count to that of your device. If you're sure you walked 200 steps and your pedometer says you walked 200, then your device is valid. Similarly, if your pedometer counted the correct distance after you've walked around a track or some other path with a known mileage, it's probably a valid monitor.

A scale that weighs a person at 50 pounds (22.7 kg) every time he steps on it is certainly reliable, but it is not valid.

FIGURE 5.7 Are activity monitors valid? A friend wore a pedometer during a hike and recorded these values. What data could you collect to know whether or not it accurately counted his steps?

In the case of an activity monitor, we are lucky to have concrete, straightforward standards for accurate measurement. But psychological scientists often want to measure abstract constructs such as happiness, intelligence, stress, or self-esteem, which we can't simply count (Cronbach & Meehl, 1955; Smith, 2005a, 2005b). Construct validity is therefore important in psychological research, especially when a construct is not directly observable. Take happiness: We have no means of directly measuring how happy a person is. We could estimate it in a number of ways, such as scores on a well-being inventory, daily smile rate, blood pressure, stress hormone levels, or even the activity levels of certain brain regions. Yet each of these measures of happiness is indirect. For some abstract constructs, there is no single, direct measure. And that is the challenge: How can we know if indirect operational measures of a construct are really measuring happiness and not something else?

We know by collecting a variety of data. Before using particular operationalizations in a study, researchers not only check to be sure the measures are reliable; they also want to be sure the measures get at the conceptual variables they intended. In other words, they evaluate each measure's construct validity, either by collecting their own data or by reading about the data collected by others. Furthermore, the evidence for construct validity is always a matter of degree. Psychologists do not say a particular measure is or is not valid. Instead, they ask: What is the weight of evidence in favor of this measure's validity? There are a number of kinds of evidence that can convince a researcher, and we'll discuss them below.
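The pedometer check described above boils down to comparing the device's count against a criterion you trust, such as your own count or a path of known length. A minimal sketch with invented numbers:

```python
def step_count_error(true_steps, device_steps):
    """Percent error of a device's count relative to a trusted criterion count."""
    return abs(device_steps - true_steps) / true_steps * 100

# Invented check: you counted 200 steps yourself while wearing the device.
print(step_count_error(200, 200))  # 0.0 percent error: consistent with validity
print(step_count_error(200, 150))  # 25.0 percent error: evidence against validity
```

For abstract constructs like happiness there is no such trusted criterion count, which is exactly why the varieties of validity evidence discussed next are needed.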
Interrater reliability: Two coders' ratings of a set of targets are consistent with each other. Criterion validity: Your measure is strongly associated with a key behavioral outcome. Convergent validity: Your self-report measure is more strongly correlated with measures of similar constructs. Discriminant validity: Your self-report measure is less strongly correlated with measures of dissimilar constructs.

Face Validity and Content Validity: Does It Look Like a Good Measure?

A measure has face validity when it looks, on its surface, like a plausible operationalization of the conceptual variable. A measure has content validity when its items capture all parts of the construct, according to the conceptual definition. Both are subjective judgments about the content of the measure.

Criterion Validity: Does It Correlate with Key Behaviors?

To evaluate a measurement's validity, face and content validity are a good place to start, but most psychologists rely on more than a subjective judgment: They prefer to see empirical evidence. There are several ways to collect data on a measure, but in all cases, the point is to make sure the measurement is associated with something it theoretically should be associated with. In some cases, such relationships can be illustrated by using scatterplots and correlation coefficients. They can be illustrated with other kinds of evidence, too, such as comparisons of known groups.

Suppose you work for a company that wants to predict how well job applicants would perform as salespeople. Which aptitude test should the company use? You have two choices, which we'll call Aptitude Test A and Aptitude Test B. Both have items that look good in terms of face validity: they ask about a candidate's motivation, optimism, and interest in sales. But do the test scores predict actual selling ability? Suppose the data show that future sales performance is correlated with scores on Aptitude Test A, while scores on Test B are a poorer indicator of future sales performance. You would conclude that Aptitude Test A has better criterion validity for selecting salespeople; Test B, in contrast, has worse criterion validity as a measure of sales aptitude. Criterion validity is especially important for self-report measures, because the correlation can indicate how well people's self-reports predict their actual behavior.

Another way to gather criterion validity evidence is a known-groups test, in which researchers check whether the measure can distinguish among groups whose standing on the variable is already confirmed. Lie detection provides an example: polygraph readings are supposed to indicate which of a person's statements are truthful and which are lies. If skin conductance and heart rate are valid measures of lying, we could conduct a known-groups test in which we know in advance which of a person's statements are true and which are false. The physiological measures should be elevated only for the lies, not for the true statements. (For a review of the mixed evidence on lie detection, see Saxe, 1991.)

The known-groups method can also be used to validate self-report measures. Psychiatrist Aaron Beck and his colleagues developed the Beck Depression Inventory (BDI), a 21-item self-report scale with items that ask about major symptoms of depression.
For each BDI item, respondents choose one of four statements. For example:

0 I do not feel sad.
1 I feel sad.
2 I am sad all the time and I can't snap out of it.
3 I am so sad or unhappy that I can't stand it.

0 I have not lost interest in other people.
1 I am less interested in other people than I used to be.
2 I have lost most of my interest in other people.
3 I have lost all of my interest in other people.

A clinical scientist adds up the scores on each of the 21 items for a total BDI score, which can range from a low of 0 (not at all depressed) to a high of 63.

To test the criterion validity of the BDI, Beck and his colleagues administered the self-report scale to two known groups of people. They knew one group was suffering from clinical depression and the other group was not, because they had asked psychiatrists to conduct clinical interviews and diagnose each person. The researchers computed the mean BDI scores of the two groups and created a bar graph, shown in Figure 5.10. The graph shows the expected result: the average BDI score of the known depressed group was higher than the average score of the known group who were not depressed. The evidence supports the criterion validity of the BDI. Because its criterion validity was established in this way, the BDI is still widely used today when researchers need a quick and valid way to identify new people who are vulnerable to depression, without having a psychiatrist interview each one.

Beck also used the known-groups paradigm to calibrate low, medium, and high scores on the BDI. When the psychiatrists interviewed the people in the sample, they evaluated not only whether each person was depressed, but also the level of depression: none, mild, moderate, or severe. As expected, the BDI scores of the groups rose as their level of depression (assessed by psychiatrists) grew more severe (Figure 5.11). This result was even clearer evidence that the BDI was a valid measure of depression, so clinicians and researchers can confidently use specific ranges of BDI scores to categorize how severe a person's depression is.

FIGURE 5.11 BDI scores of four known groups, rated by psychiatrists as having no, mild, moderate, or severe depression. This pattern of results means it is valid to use BDI cutoff scores to decide if a person has mild, moderate, or severe depression. (Source: Adapted from Beck et al., 1961.)

Known-groups evidence also supports the subjective well-being (SWB) scale. In one comparison, Korean university students had a lower mean score on the scale than Canadian college students, who averaged much higher, as indicated by the M column in Table 5.3. Such known-groups patterns provide strong evidence for the criterion validity of the SWB scale. Researchers can use this scale in their studies with confidence.

TABLE 5.3 Subjective Well-Being (SWB) Scores for Known Groups from Several Studies

SAMPLE CHARACTERISTICS | N | M | SD | REFERENCE
American college students | 244 | 23.7 | 6.4 | Pavot & Diener (1993)
French Canadian college students (male) | 355 | 25.8 | 6.1 | Blais et al. (1989)
Korean university students | 413 | 19.8 | 5.8 | Suh (1993)
Printing trade workers | 304 | 24.2 | 6.0 | George (1991)
Veterans Affairs hospital inpatients | 52 | 11.8 | 5.6 | Frisch (1991)
Abused women | 70 | 20.7 | | Fisher (1991)

What about the Ladder of Life scale, the measure of happiness used in the Gallup-Healthways Well-Being Index? This measure also has some known-groups evidence to support its criterion validity. For one, Gallup reported that Americans' well-being was especially low in 2008 and 2009, a period corresponding to a significant downturn in the U.S. economy. Well-being is a little bit higher in American summer months, as well. These results fit what we would expect if the Ladder of Life is a valid measure of well-being.

Convergent Validity and Discriminant Validity: Does the Pattern Make Sense?

Criterion validity examines whether a measure correlates with key behavioral outcomes. Another form of validity evidence is whether there is a meaningful pattern of similarities and differences among self-report measures. A self-report measure should correlate more strongly with self-report measures of similar constructs than it does with those of dissimilar constructs. The patterns of correlation with measures of theoretically similar and dissimilar constructs are called convergent validity and discriminant validity (or divergent validity), respectively. (For a review of descriptive statistics, see Table 8.4.)

CONVERGENT VALIDITY

To document the convergent validity of the BDI, researchers examined whether BDI scores were correlated with scores on another self-report measure of depression, the CES-D: people who score high on the BDI also tend to score high on the CES-D, and people who score low on one score low on the other. This strong correlation between similar self-report measures of the same construct is good evidence for the convergent validity of the BDI.

FIGURE 5.12 Evidence supporting the convergent validity of the BDI. As expected, the BDI is strongly correlated with the CES-D, another self-report measure of depression. (Source: Segal et al., 2008.)

Testing for convergent validity raises a question: the comparison measure's own construct validity must be established, too. The researchers might next have to validate the CES-D with evidence of its own. Eventually, however, they might be satisfied that a measure's construct validity can be evaluated in light of the full weight and pattern of the evidence. Many researchers are convinced when measures are shown to predict relevant outcomes (criterion validity), but no single kind of evidence is definitive (Smith, 2005a).

DISCRIMINANT VALIDITY

The BDI should not correlate strongly with measures of constructs that are very different from depression; it should show discriminant validity with them. For example, depression is not the same as a person's perception of his or her overall physical health. Although mental health problems, including depression, do overlap somewhat with physical health problems, we would not expect the BDI to be strongly correlated with a measure of perceived physical health problems. More importantly, we would expect the BDI to be more strongly correlated with the CES-D and well-being than it is with physical health problems. Sure enough, Segal and his colleagues found a correlation of only r = .16 between the BDI and a measure of perceived physical health problems. This weak correlation shows that the BDI is different from people's perceptions of their physical health, so we can say that the BDI has discriminant validity with physical health problems. Figure 5.13 shows a scatterplot of the results.

FIGURE 5.13 Evidence supporting the discriminant validity of the BDI. As expected, the BDI is only weakly correlated with perceived health problems (r = .16), providing evidence for discriminant validity. (Source: Segal et al., 2008.)
Notice also that most of the dots in Figure 5.13 fall in the lower left portion of the scatterplot, because most of the people in this sample were not depressed: they scored low on both the BDI and the perceived health problems scale.

Discriminant validity is most important to establish between constructs that could plausibly be confused. Many developmental disorders, for example, have similar symptoms, so it is important to specify what a screening measure does and does not detect: a screening test for one disorder should show discriminant validity with measures of the others, so that it does not identify the same child as having, say, a language delay when the construct it is meant to detect is something else. By contrast, it is usually not necessary to establish discriminant validity between a measure and constructs that are obviously unrelated to it.

The Relationship Between Reliability and Validity

One essential point is worth reiterating: The validity of a measure is not the same as its reliability. A measure may be reliable but not valid. A measurement of head circumference, for example, might be extremely consistent from one testing to the next, so it is reliable, but it still may not be valid for assessing intelligence.

As another example, suppose you used your pedometer to count how many steps there are in your daily walk from your parking spot to your building. If the pedometer reading is very different day to day, then the measure is unreliable, and of course it also cannot be valid, because the true distance of your walk has not changed. Therefore, reliability is necessary (but not sufficient) for validity.

A nurse taking a patient's blood pressure wants to know the cuff she's using is reliable and accurate. Similarly, before conducting a study, researchers want to be sure the measures they plan to use are reliable and valid ones. When you read a research study, you should be asking: Did the researchers collect evidence that the measures they are using have construct validity? If they didn't do it themselves, did they review construct validity evidence provided by others? In empirical journal articles, you'll usually find reliability and validity information reported; the Working It Through section at the end of this chapter shows how such information might be presented, using a study of appreciation in relationships.
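The day-to-day pedometer example can be made concrete: a fixed walk should produce a tight cluster of readings, and a wide spread by itself is enough to rule out validity. A small sketch with invented readings:

```python
import statistics

# Invented pedometer readings for the same fixed walk on five days.
consistent_readings = [415, 423, 418, 421, 417]   # tight cluster: reliable
erratic_readings = [300, 510, 405, 260, 480]      # wide spread: unreliable

# The walk's true length never changes, so a wide spread alone shows the
# device cannot be valid, whatever the true step count happens to be.
print(round(statistics.stdev(consistent_readings), 1))
print(round(statistics.stdev(erratic_readings), 1))
```

Note that the reverse inference does not hold: the tight cluster is consistent with validity but does not establish it, since the device could be reliably wrong, like the scale stuck at 50 pounds.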
FIGURE 5.14 Items in the Appreciation in Relationships (AIR) scale. These items were used by the researchers to measure how much people appreciate their relationship partner. Do you think these items have face validity as a measure of appreciation? (Source: Gordon et al., 2012.)