
Module 5: Reliability and Validity in Measurement

Learning Outcomes
- Recognize and explain how the reliability and validity of psychological measures
determine the utility of these measures for testing hypotheses and for practical
applications
- Understand the concept of measurement error and distinguish between random and
systematic measurement error
- Differentiate between reliability and validity as indices of the quality of psychological
measures
- Identify key methods for assessing the reliability of psychological measures such as
testing parallel forms, temporal stability, internal consistency, and inter-rater reliability
- Recognize what face validity is and explain why it is an inadequate method for assessing
the validity of psychological measures
- Identify key methods for assessing the validity of psychological measures, such as
criterion validity testing and construct validity testing
- Recognize the need to establish convergent and discriminant validity of psychological
measures

Overview

- Measure: any observational tool that can be used to quantify a variable


- Underlying Variable (Latent Variable): the underlying construct that the researcher seeks
to study but which cannot be observed directly
The Parable of the Blind Men and the Elephant
- An ancient parable tells the story of 6 blind men who wanted to know what an elephant
was
- One man touches the elephant's trunk and says that an elephant is like a snake.
A second man touches the side of the elephant and says that an elephant is like
a wall. A third man touches the elephant's ear and says that an elephant is like a
fan. A fourth man touches the elephant's leg and says that an elephant is like a
tree. A fifth man touches the elephant's tail and says that an elephant is like a
rope. And a sixth man touches its tusk and says that an elephant is like a spear.
- In most versions of the story, the men then argue about their different impressions
of what an elephant is like
- Other versions of the story say a wise man encourages them to listen to each
other and when they take his advice they come to the insight that the elephant
must be a large animal and they recognize that each of their observations
provided information about a part of the whole creature
- This parable illustrates how individual observations are prone to error and, on their own,
can lead to a misunderstanding of the underlying entity that is being studied
- However, if a collection of observations is pooled together this can help to sort out the
unique elements of each observation and identify the convergent elements that together
give a more accurate impression of the underlying entity
- If a visually impaired individual set out to learn what an elephant is like, they would
touch many parts of the elephant in order to learn about the animal as a whole
- the impression formed from this collection of multiple observations would be
similar to another visually impaired individual's impression, given that this
other individual also sampled or touched many parts of the elephant
- **one of the most important principles of psychological measurement emphasizes that a
researcher's estimates of the value of a variable tend to be significantly more reliable if
they are based on aggregating many distinct observations and they tend to be much less
reliable if they are based on just a single observation or just a few observations**
- This parable is a useful metaphor for the process of psychological measurement -
in psychological measurement we usually cannot directly observe the variables
that we seek to study (just as blind men cannot see the whole elephant)
- We can directly observe outward manifestations that reveal aspects of the
underlying variable that we are interested in just as blind men were able to
touch parts of the elephant’s body
The Societal Impact of Psychological Measures
- At least 20,000 psychological measures have been introduced - these measures are
used not just for research but for a wide variety of practical purposes including clinical
diagnosis, forensic assessment, educational testing, job screening, marketing, and
product evaluation
- Psychological measures are used to make decisions that affect important outcomes in
millions of people’s lives - given the widespread societal impact of psychological
measures, we all have a stake in ensuring that these measures are well-designed and
yield accurate results
- Understanding the methods that are used to evaluate the quality of psychological
measures is thus important not only for individuals who seek to become
well-trained researchers but for anyone who wants to be equipped to critically
evaluate measures that they are exposed to
- Ex. the personality tests that prospective employers may use during a job
screening, academic ability tests that a school system might administer to
their children for purposes of educational tracking, diagnostic tests that a
clinician might use to assess a family member for dementia
- If more people are equipped to ask critical questions about these and other
psychological measures that are used to make important decisions, then this could
provide a quality control check as these measures infiltrate more and more domains of
our lives
Example of Psychological Measure to Illustrate Reliability and Validity
- Reading the Mind in the Eyes test / Mind-Eyes test example to illustrate the principles
and procedures for testing the reliability and validity of a psychological measure:
- This is a performance-based measure that is designed to assess individual differences in
adults’ cognitive empathy skills (aka cold empathy)
- Cognitive empathy: assesses an individual’s ability to accurately infer what
another person might be thinking or feeling
- If you have a low score, you have low cognitive empathy; if you have a high score,
you have high cognitive empathy
- One should not draw strong conclusions from this test alone - it is a very broad measure
- This is an example of a latent variable
Latent Variables
- Measure: is any observational tool that can be used to quantify a variable
- Psychologists employ a wide variety of types of measures in their research including
self-report surveys, performance tests, behavioural observations and psychological
assessments
- These measures assess outwardly detectable manifestations of a latent variable
- Latent variable: underlying construct that the researcher seeks to study but which cannot
be observed directly
- To the extent that this latent variable has certain reliable manifestations that can
be directly observed, recording and studying those manifestations can provide
information about the latent variable
Measurement Principles and Measurement Error
- Measured value (ie. score): the observed value or score that a participant obtains on a
measure
- Measurement error: the discrepancy between the participant’s measured value and their
true value
- Random measurement error (or noise): random, unpredictable discrepancies between
the measured value and the true value
- Reliability: the degree to which a measure yields consistent, dependable results when it
is used repeatedly to measure variability within a sample
- Systematic measurement error (or measurement bias): measured values that differ from
the true value in a predictable direction. Ex. consistently overestimate the true value or
consistently underestimate the true value
- True value: the participant’s actual value on the latent variable that the measure is
designed to assess
- Validity: the degree to which a measure assesses the specific variable that it was
designed to measure as opposed to measuring some other variable
Reliability and Validity
- For a measure to be useful to test hypotheses about a given variable researchers must
demonstrate that it is a reliable and valid measure of that variable
- Reliability of a measure assesses whether the measure gives consistent and
dependable results when it is used repeatedly to measure variability within a sample
- If a measure is reliable, then we know that it is measuring something as opposed
to just assessing random variation
- Reliability is the critical first step in evaluating the quality of a measure - if a
measure is not reliable then it is not measuring anything in particular and thus it
also cannot be valid
- Ex. a questionnaire that is designed to measure individual differences in
creativity produces inconsistent results such that the individuals who get
above-average creativity scores in one testing session get below-average
scores on another occasion
- Validity of measure assesses whether it measures the specific variable that it was
designed to measure as opposed to measuring some other variable
- Validity depends on reliability
- Just because a measure is reliable doesn't mean that it is valid, because the
measure could be measuring something other than the specific variable that it
was designed to measure. Ex. a measure designed to measure individual differences
in creativity might actually only measure individual differences in verbal fluency
Measured Values are a Mixture of Two Things
- Psychological measures are designed to measure variability in participants’ levels of
some particular latent variable
- Ex. a measure of depression is designed to measure individual differences in
levels of depression
- True differences in levels of the latent variable are not the only thing that determines the
measured values (ie. scores) that participants obtain on that measure
- The measured value actually reflects 2 factors:
1. The individual’s true value on the latent variable that the measure was designed to
assess
2. Measurement error that causes the measured value to deviate from this true value
Measured Value (MV) = True Value (TV) + Measurement Error (ME)

- There are 2 major types of measurement error that cause scores on a measure to differ
from the true value that the measure is designed to estimate: random measurement
error and systematic measurement error
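
This decomposition can be made concrete with a small simulation. The sketch below is a minimal illustration (the values and error magnitudes are invented, not from the module); it shows how random error scatters measured values unpredictably around the true value while systematic error shifts them in a consistent direction.

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 50.0                                # TV: the participant's true score

random_error = rng.normal(0, 5, size=10)         # noise: mean zero, unpredictable
systematic_error = 3.0                           # bias: consistent overestimation

measured = true_value + random_error + systematic_error  # MV = TV + ME

print(measured.round(1))         # values scatter around 53, not 50
print(measured.mean().round(1))  # averaging cancels the noise but not the +3 bias
```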
Random Measurement Error
- The first major type of measurement error is random measurement error (aka noise)
- It causes scores on the measure to deviate from the true value in random, unpredictable
ways during the process of measurement
- The reliability of a measure is inversely related to the level of random measurement error
in that measure
- Increasing the number of distinct observations that get averaged together into an
estimate will reduce random measurement error because random errors tend to cancel
each other out when they are averaged
- The amount of random measurement error in a measure is assessed through
tests of that measure’s reliability
- In psychological measurement there are a variety of possible sources of random
measurement error that may cause the measured value to deviate randomly from the
true value
Common Sources of Random Measurement Error in Psychological Measurement
1. Random variability in the psychological states of participants or respondents
- These instabilities in the tested individual include such factors as fluctuation in their
mood, states of fatigue, and attentiveness during the testing process

2. Random variability in the research situation


- Changes in the research situation include changes in the researcher's demeanor or
variability in exposure to distractions like random fluctuations in the noise levels in the
lab or other uncontrolled elements of the research environment
3. Random variability in the measuring instrument or process
- Random factors may also affect the research tools such as survey materials or
laboratory equipment that researchers use to record data
- Ambiguities in the research tasks or materials can lead to unpredictable shifts in how
different participants interpret and respond to those tasks and instabilities in the
functioning of research equipment can contribute to random differences in how research
stimuli are experienced by participants

4. Random errors occurring when the research personnel record data


- Many psychology studies involve researchers or their assistants observing and recording
participants’ behaviour as the participants respond to some situation or as the
participants complete some assigned task
- If the observers' levels of attentiveness vary randomly over time due to things like
fatigue or incidental distractions, then these fluctuations of attention could introduce
some random variability into the data that they record
Reducing Random Measurement Error
- There are a number of things that researchers can do to try to reduce the levels of
random measurement error
- Standardize the conditions in which observations are made
- Increase the clarity of measures or procedures so that random differences in
interpretation across participants are less likely to arise
- Invest more time and practice to train research personnel to be consistent in how
they interact with participants and how they observe and record data
- Add safeguards against errors in data coding and entry
Systematic Measurement Error
- The second major type of measurement error is systematic measurement error (aka
bias)
- Systematic errors are observations that differ from the true value in a predictable
direction
- Ex. consistently overestimate the true value or consistently underestimate the
true value
- Ex. when responding to a relationship satisfaction questionnaire people might be
biased to report higher levels of satisfaction than they actually feel, perhaps
because they want to convey a favourable impression to the researcher
- The amount of systematic measurement error in a measure is assessed through
tests of that measure’s validity
Reliability

- Random variance: variance in a measure that is unsystematic and does not exhibit
consistent patterns across repeated measures or modes of assessment
- Systematic variance: variance in a measure that exhibits discernible patterns including
properties such as stability of results across repeated measures and consistency of
results across modes of assessment
Useful Measures Must be Reliable Measures
- For a psychological measure to be useful it needs to be reliable, meaning that the
measure would likely yield similar results if it was used repeatedly to quantify a set of
behaviours or psychological experiences
- Reliable measures are needed in order to produce dependable results when researchers
test their hypotheses - for this reason psychologists have developed a variety of methods
to evaluate the reliability of their measures
- The reliability of a measure is inversely related to the level of measurement error
associated with that measure
- The more error-prone a measure is the less reliable the results of that measure
will tend to be
- You can get a sense of what a reliable measure is like by considering common synonyms
for reliability:
- Consistency: a reliable measure is a measure that produces a consistent pattern
of results whenever it is applied to record a given set of behaviours or
psychological experience
- Stability: the scores that a reliable measure yields tend to be more stable across
time and across settings
- Dependability: a reliable measure is dependable in the sense that researchers
can depend on their results replicating if these results are based on reliable
measures
- Random error undermines the consistency, stability and dependability of measures
because when random error is high the results produced by a measure are likely to vary
in unpredictable ways
Systematic Variance: The Signal Researchers Aim to Detect
- Reliability assesses the proportion of the total measured variance in a sample that is
systematic variance, meaning that this variance shows discernible and stable patterns,
versus the variance that is due to random measurement error
- Signal-to-noise ratio is high -> we may be able to accurately detect the signal
- Signal-to-noise ratio is low -> it can be very hard to detect any signal amidst the high
background noise
- Low signal-to-noise ratio: trying to have a conversation with a friend in a noisy
environment, such as a busy restaurant
- High signal-to-noise ratio: conversing as you stroll down a quiet street
- *signal you are trying to detect is what your friend is saying
- *noise is the background sounds that are not part of your conversation
- Analogously in psychological measurement it is harder to estimate the value of some
latent variable when the measure that is used to assess that value contains a high level
of random measurement error or noise
- There are a variety of strategies that psychologists use to assess how well a
psychological measure captures systematic variance as opposed to measuring random
variability
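
In classical test theory, this signal-versus-noise idea has a standard formal expression (a well-known result, included here for reference rather than taken from the module): reliability is the proportion of the total observed variance that is systematic (true) variance:

$$\text{reliability} = \frac{\sigma^2_{\text{systematic}}}{\sigma^2_{\text{systematic}} + \sigma^2_{\text{random error}}}$$

When random error variance is near zero the ratio approaches 1 (all signal); as random error variance grows, the ratio shrinks toward 0 (all noise).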
Procedures for Assessing Reliability: Parallel Forms

- Parallel forms: 2 distinct forms of the same psychological measure that have the same
overall structure and format but differ in the specific items that they contain
- Parallel forms reliability: technique that estimates the reliability of a measure by
assessing the magnitude of the correlation between scores on parallel forms of that
measure
- Test bank: a corpus of test items that are designed to assess some psychological
construct(s) or knowledge of some topic(s)
Creating Two Forms of a Measure to Assess Reliability
- One method for assessing the reliability of a measure involves constructing 2 parallel
forms of the measure that each have distinct content, with no identical items across the 2
forms of the measure
- Parallel forms reliability can be assessed if both of these parallel forms of the measure
are administered to the same sample of participants and the degree to which these 2
measures correlate is observed
- If there is a strong positive correlation in participants’ scores on the 2 parallel forms of
the measure then this indicates that the measures are assessing the same latent
variable

- A strong positive correlation would mean that participants who scored higher than the
sample average on one of the measures also scored higher than the sample average on
the parallel measure and participants who scored close to the sample average on one of
the measures also tended to score close to average on the parallel measure
- Participants who scored lower than the sample average on one of the measures also
tended to score lower than the sample average on the parallel measure
- If there is a strong positive correlation between the scores on the parallel measures it
doesn’t necessarily mean that individual participants get the exact same score on both
measures
- However, it means that the rank of an individual's score relative to other
participants in the sample is nearly the same for each measure of the construct
- Ex. if there is a high correlation between scores on the parallel measures
then a participant whose score was in the 95th percentile on one measure
will tend to be in the 95th percentile on the parallel measure
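
As a minimal sketch of the computation (the scores below are invented for illustration), parallel forms reliability is simply the correlation between the two sets of scores:

```python
import numpy as np

# hypothetical scores for the same 8 participants on two parallel forms
form_a = np.array([12, 18, 15, 22, 9, 17, 20, 14])
form_b = np.array([13, 19, 14, 21, 10, 18, 19, 15])

r = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel forms reliability: r = {r:.2f}")  # near 1 -> same latent variable
```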

Procedures for Assessing Reliability: Temporal Stability

- Temporal stability (aka test-retest reliability): technique that estimates the reliability of a
measure by administering that same measure to the same sample of participants across
two or more sessions separated by some meaningful interval of time - the magnitude of
the correlation in participants’ scores across these sessions indicates the reliability of the
measure
Stability of a Measure Over Time
- Stability is also referred to as test-retest reliability
- To assess the stability of a measure researchers administer the same measure to a
sample of participants during an initial session then wait some interval of time before
administering the measure to the same sample again in a subsequent session
- The researchers then compute the correlation between participants’ scores
across these separate testing sessions
- If there is a strong positive correlation between the first set of scores and the
second set of scores then this indicates that there is high stability of participants'
scores on the measure
- Strong positive correlation between scores at the different time points would mean
that participants who scored higher than the sample average in the first session also
scored higher than the sample average in the second session and participants who
scored close to the sample average in the first session also tended to score close to
average in the second session, and participants who scored lower than the sample
average in the first session also tended to score lower than the sample average in the
second session
- If a measure is highly stable it doesn’t necessarily mean that individual participants get
the exact same score each time that the measure is administered to them. However, it
means that the rank of an individual's score relative to other participants in the sample is
nearly the same each time that the measure is administered to that sample.
- For example, if a measure is highly stable then a participant whose score was in
the 95th percentile during the first session will tend to be in the 95th percentile
again in the second session. Returning to the Mind-Eyes example, if this
measure is highly stable we would expect to see participants' rankings, based on
their score, to be consistent across different time points. In other words,
participants that ranked high at time 1, relative to other participants, should rank
high at time 2, though their specific score may vary slightly. Conversely, if this
measure is not stable or inconsistent over time, individual participants' rankings
should vary across time points

- To the extent that participants’ scores on a given measure are highly stable, in terms of
how they rank relative to other scores in the sample, across testing sessions then this
indicates that the measure is reliably discriminating individual differences in something
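
Since temporal stability concerns the consistency of participants' relative rankings across sessions, a rank-order correlation is a natural way to index it. A minimal sketch, assuming invented scores from two sessions:

```python
import numpy as np
from scipy.stats import spearmanr

scores_t1 = np.array([31, 25, 40, 18, 36, 22, 29, 34])  # session 1
scores_t2 = np.array([33, 24, 38, 20, 35, 23, 28, 36])  # session 2, weeks later

rho, p = spearmanr(scores_t1, scores_t2)  # correlation between rank orderings
print(f"test-retest rank stability: rho = {rho:.2f}")
```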
Some Authentic Change May be Expected Over Time
- Using the temporal stability of a measure to assess reliability assumes that any change
in the relative ranking of participants’ scores within the sample is due to random
measurement error (ie. low reliability) rather than authentic change in whatever
psychological variable the measure is assessing
- This assumption is more plausible if there is not an extended interval between
the administrations of the measure
- The length of the interval during which authentic change might plausibly occur
depends on the nature of the variable that is being assessed
- Usually personality traits, fundamental values, and core abilities would not be expected
to change across long stretches of time - months or even years
- For assessing the temporal stability of measure of these types of variables
researchers should schedule the testing sessions to be separated by at least
several weeks or even months
- Shorter intervals between testing sessions should be scheduled when
assessing the temporal stability of measures of variables during periods of
more rapid change (ex. childhood, early adolescence)
- Other variables (person’s attitudes, preferences, and feelings) might be expected
to undergo authentic change in relatively shorter intervals of just a few weeks
- If longer intervals are used to test the stability of measure of these more rapidly changing
variables then it would be ambiguous whether low correlations between the testing
sessions indicate random measurement error or authentic change
Determining an Interval that is ‘Just Right’
- Researchers who are testing the temporal stability of a measure need to be careful not
to schedule too long an interval between testing sessions because this makes it
ambiguous whether low correlations in scores are due to measurement error or authentic
change
- However, researchers also need to be careful not to schedule too short an
interval between testing sessions because in this case it would make it
ambiguous whether high correlations in scores are due to reliability of the
measure or due to repeated testing effects
- If there is a short interval between testing sessions then participants may
be able to recall their responses from the first session and they may try to
give a similar response in the second session just to appear consistent
- This inflates the estimates of the stability of the measure
- Goldilocks problem in determining the optimal interval between sessions to test the
temporal stability of a measure
- If the interval between testing sessions is too long the reliability of the measure
may be underestimated
- If interval between the testing sessions is too short then the reliability of the
measure may be overestimated
- There is no one-size-fits-all rule for scheduling the interval between testing sessions
- For most variables it’s customary to separate the testing sessions by at least 2
weeks to balance the cross-pressures of problems due to repeated testing effects
on the one hand versus authentic change in the variable on the other hand
Think and Respond

https://skepticalinquirer.org/2002/01/snaring_the_fowler_mark_twain_debunks_phrenology/

Procedures for Assessing Reliability: Internal Consistency


- Cronbach's alpha coefficient: technique that estimates the internal consistency of a
measure using a formula that relates the variance of each of the individual items in the
measure to the total variance in the combined measure to estimate how tightly the items
interrelate with each other
- Internal consistency: technique for assessing the reliability of a measure by assessing
how strongly the component observations within that measure (ex. The individual items
on a test) are correlated with one another
- Item-total correlation: technique that estimates the internal consistency of a measure by
computing the correlation between participants' scores on each individual item and their
total scores on the remaining items of the measure
- Split-half correlation: technique that estimates the internal consistency of a measure by
splitting the items into 2 different subscales of approximately equal length and computing
the correlation between participants’ scores on these subscales
The Importance of Aggregating
- Most psychological measures involve aggregating multiple distinct observations that are
believed to share some common relation to the underlying latent construct that the
measure was designed to assess

- When multiple distinct observations are aggregated through some process of summing
or averaging, the resulting aggregated measure tends to be more reliable than any of the
individual component observations are
- This is because any given observation contains some amount of measurement
error
- As more distinct observations are aggregated together their individual errors
should tend to cancel each other out and the signal for whatever content these
observations share in common should by consequence become clearer following
aggregation
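
A small simulation can make this cancellation visible. The sketch below is illustrative only (the sample size and noise level are assumptions): each simulated item is the true score plus independent random error, and the correlation between the aggregated score and the true score rises as more items are averaged together.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 1000
true_score = rng.normal(0, 1, n_people)            # latent variable

def aggregate_reliability(n_items, noise_sd=2.0):
    # each item = true score + independent random error
    items = true_score[:, None] + rng.normal(0, noise_sd, (n_people, n_items))
    return np.corrcoef(items.mean(axis=1), true_score)[0, 1]

for k in (1, 5, 20):
    print(f"{k:>2} items: r(aggregate, true) = {aggregate_reliability(k):.2f}")
# errors cancel out as more items are averaged, so r climbs toward 1
```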

Internally Consistent Observations “Hang Well Together”


- The process of aggregating distinct observations together assumes that these
observations share some variance in common, which means that these observations are
each related to the same underlying latent variable
- To test the assumption of shared variance for the component observations in an
aggregate measure it is necessary to measure how strongly correlated the individual
observations are with one another, which is referred to as the internal consistency of the
measure
- If the component observations in a measure are positively correlated with each other and
these correlations are moderate or large in magnitude then the measure has strong
internal consistency
- When the internal consistency of a measure is strong a researcher can be more
confident that these observations share something in common and are each
measuring the same underlying latent variable
- This strong internal consistency is often described as indicating that the
components “hang well together”
- If the component observations in a measure are only weakly correlated with each other
and/or if there are some negative correlations among them then the measure will have
weak internal consistency - in this case a researcher will likely conclude that the
observations are not measuring the same underlying latent variable
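
As a sketch of how two of the internal consistency techniques defined above can be computed from a participants-by-items score matrix (this uses the standard Cronbach's alpha formula; the code itself is illustrative, not from the module):

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_participants, k_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()  # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)        # variance of the total score
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

def split_half(items):
    """Correlate totals on two halves of the scale (here: odd vs. even items)."""
    items = np.asarray(items, dtype=float)
    odd, even = items[:, 0::2].sum(axis=1), items[:, 1::2].sum(axis=1)
    return np.corrcoef(odd, even)[0, 1]
```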
Procedures for Assessing Reliability: Inter-rater Agreement

- Behavioural checklist: a list of target behaviours to record in an observational study


- Behavioural coding scheme: a set of criteria for sorting observed behaviours into defined
categories or rating those behaviours on defined dimensions
- Inter-rater agreement: technique that estimates the reliability of an observational
measure by computing the correlation between the ratings that 2 or more independent
observers give to the same sample of observations
- Motivated skepticism: bias to be especially skeptical of evidence or opinions that have
unfavourable implications and relatively credulous towards evidence or opinions that
have favourable implications
Agreement Between Raters
- Another commonly used procedure to assess the reliability of a psychological measure
involves testing inter-rater agreement
- If a measure is reliable then different observers who use that measure should
tend to show adequate levels of agreement in the scores that they assign to a
sample of participants
- Inter-rater agreement is relevant to assessing the reliability of measures that
entail observers recording and/or coding a sample of target behaviours
- The following examples illustrate the variety of contexts in which psychologists might
assess inter-rater agreement to evaluate the reliability of a measure:
- Developmental psychologists might test inter-rater agreement in recording and
coding children's attachment style during critical interactions with their caregivers.
- Comparative psychologists might test the inter-rater agreement in using a
checklist to record dominance-related behaviours in a primate colony.
- Personality psychologists might test inter-rater agreement in using a scheme to
code achievement-related themes in essays that participants write in response to
a narrative prompt.
- Cognitive psychologists might test inter-rater agreement in using a coding
scheme to code critical details in participants' reported memories for some target
event.
- To test inter-rater agreement 2 or more observers use a measurement instrument, such
as a behavioural checklist or a more extensive behavioural coding scheme, to
independently record their rating of the same sample of targets’ behaviours
- The consistency between these observers’ independent ratings is then assessed to
determine the overall level of agreement in their use of that measure
- If these independent observers’ ratings of the same behaviours exhibit a strong
positive association with each other then this would indicate that the measure is
reliable
- If observers’ ratings are only weakly associated or if they show a negative
association in their independent ratings of some of the same behaviour, then this
would indicate a lack of agreement and an unreliable measure
- Thus, the consistency in how independent observers rate the same variables can be an
important indicator of the reliability of that measure
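
As a minimal sketch (the ratings below are invented for illustration), agreement between two observers' ratings of the same targets can be indexed with a simple correlation:

```python
import numpy as np

# two observers' independent 1-5 ratings of the same 8 target behaviours
rater_1 = np.array([3, 4, 2, 5, 1, 4, 3, 2])
rater_2 = np.array([3, 5, 2, 4, 1, 4, 2, 2])

r = np.corrcoef(rater_1, rater_2)[0, 1]
print(f"inter-rater agreement: r = {r:.2f}")  # strong positive -> reliable measure
```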
- In general measures that attempt to assess relatively concrete, well-specified behaviours
tend to achieve higher levels of inter-rater agreement than behaviours that are more
abstract or open to alternative interpretations
- Ex. it should be easier to achieve a consensus between observers’ ratings of
whether or not a target person yawned than it is to achieve consensus on whether a
target person appears to be bored
- So if a psychological measure shows low levels of inter-rater agreement when it is pilot
tested then inter-rater reliability might be improved by revising the
content of the measure to focus on more concrete and specific operationalizations of the
behaviours of interest
- If an observational measure needs to assess more abstract behaviours or ambiguous
behaviours that are potentially open to different interpretations then inter-rater
agreement can be improved by investing more time in training the observers to use a
common set of criteria for applying the behavioural coding scheme
- With extensive training, instruction and practice it is often possible to achieve sufficient
levels of inter-rater agreement even for measures of fairly ambiguous or complicated
behaviour patterns
Approaches for Enhancing the Reliability of Measures

What Can be Done About Low Reliability: Two Key Strategies


- If a psychological measure is found to have low reliability then there are some useful
strategies that might be used to enhance the reliability of that measure
1. Increasing Reliability by Taking Additional Measurements
- Measurement is in some respects a kind of sampling
- Ex. asking questions on a survey can be thought of as sampling a person's
attitude
- Items on a test can be thought of as sampling a person's knowledge or ability
- Behaviours that are inventoried in an observational study can be thought of as
sampling an individual's behaviour patterns
- Deviation of a sample estimate from the population’s true value is affected by random
sampling error (like we noted in a previous module)
- Similarly, when we are attempting to measure a given variable the deviation of the
measured values from the true values is influenced by random measurement error
- Increasing one’s sample size can be valuable because a larger sample reduces
the impact of random sampling error
- All else being equal, you will have a larger random sampling error if you draw a
small sample from a population than you would have if you drew a larger sample
from that population

- Analogously, random measurement error might be reduced by increasing the number of
measures that are taken to estimate a target's true value
- If we are constructing a questionnaire to measure a person’s attitude then the
questionnaire’s random measurement error will tend to be reduced and its reliability will
be enhanced if we include some additional items that are similar in content to the
existing items
- If we are designing a test to measure a person’s knowledge or aptitude then
adding more items to that test that are similar to the existing items will tend to
reduce the random measurement error and improve the reliability of those test
scores
- Or if we are using a code system to measure targets’ behaviour patterns then we
will tend to reduce the random measurement error and increase the reliability of
the coding system if we include more observers and aggregate their coding
results with those of the existing coders

- So, as a general rule increasing the number of measurements taken will tend to reduce
random measurement error and improve the overall reliability of an aggregate measure
- However, there are some important considerations to bear in mind when deciding
whether to include additional items to a scale or test or additional observers in an
observational study.
To Add Items or Not to Add Measurement Items? Further Considerations
1. When considering whether to add measurements to improve the reliability of a
measurement system it is important to take care to add measurements that will share
some content in common with the existing measurements
- Adding more measurements to an aggregate measure will only improve the reliability of
that measure if the added measurements overlap somewhat in their content with the
original measurements
- If the added measurements involve unique content that diverges strongly from the
content of the existing measurements then adding these additional measurements will
not improve the reliability of that measurement system
- If the added items are poorly correlated with the existing items, they may not improve the
reliability of the questionnaire, and there is also some risk that the added items will lower its reliability
- Likewise, adding observers whose codings are poorly correlated with the existing observers'
codings will not improve the inter-rater reliability of a coding system
2. Even if the added measurements share something in common with the existing
measurements there are diminishing returns to adding more and more measurements
to enhance an aggregate measure’s reliability
- Ex. you will get a larger boost in reliability by adding 5 items to a 10-item scale (50%
increase in items) than you will get from adding 5 items to a 50-item scale (just a 10%
increase in items) - see the Spearman-Brown sketch after this list
3. Additional measurements do not reduce the impact of systematic measurement error, ie.
the level of bias in measurement
- If a measure contains systematic biases in estimating some target value (ex. A systematic
bias to overestimate the value of the variable), then adding additional measurements will not
counteract such systematic biases
- Ex. if people are biased to report higher levels of satisfaction with their romantic
relationship than they actually feel, perhaps because they want to convey a favourable
impression to the researcher, then adding more items to a relationship satisfaction scale
will not correct this bias in self-reporting
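
The diminishing-returns point in consideration 2 is commonly quantified with the Spearman-Brown prophecy formula, a standard psychometric result (presented here as an aside, not taken from the module). A minimal sketch:

```python
def spearman_brown(r, factor):
    """Predicted reliability when a test with reliability r is lengthened
    by the given factor (e.g., factor=1.5 means 50% more items)."""
    return factor * r / (1 + (factor - 1) * r)

base = 0.70
print(f"10 -> 15 items: {spearman_brown(base, 1.5):.2f}")  # ~0.78, a clear boost
print(f"50 -> 55 items: {spearman_brown(base, 1.1):.2f}")  # ~0.72, a small boost
```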
2. Increasing Reliability by Removing Weakly Associated Measurements
- Another approach for enhancing the reliability of a measure involves identifying any
items that show very low correlations with the other items
- Items that are not correlated with the other items will be contributing to the noise in the
measure and this noise will make it harder to detect the signal from the other items
- Removing these poorly performing items from the measure could help to improve that
measure’s reliability metrics
- With this approach, researchers could use the patterns of intercorrelation among the
items within a measure to select the most consistently performing items and refine the
measure
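
A minimal sketch of how weakly associated items could be flagged, assuming a participants-by-items score matrix (the 0.20 cutoff is an illustrative convention, not a rule from the module):

```python
import numpy as np

def item_total_correlations(items):
    """Correlate each item with the total of the remaining items."""
    items = np.asarray(items, dtype=float)
    r = []
    for j in range(items.shape[1]):
        rest_total = np.delete(items, j, axis=1).sum(axis=1)
        r.append(np.corrcoef(items[:, j], rest_total)[0, 1])
    return np.array(r)

# items with, say, r < 0.20 are candidates for removal from the scale
```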

Overview of Measurement Reliability


- https://www.psychologytoday.com/ca/blog/cui-bono/201710/measurement-reliability-explained-in-simple-language
- The greater the error, the lower the reliability of measurement
- The same thing is true for psychological tests. I think there is a tendency by
psychologists to think of reliability as a "property" of a test or questionnaire. But any time
that tests are administered, the results can be affected by the behavior of the person
administering the test—by their tone of voice and body language, even when standard
instructions are being followed. And when tests are administered on the Internet, who
knows how the conditions in a person's immediate environment (noise level, distractions
from other people) and their own state of mind (whether they are attentive, sleepy, or
drunk) are affecting the reliability of the test.
Think and Respond

- Dr. Johnson makes the insightful observation that when experimenters examine the
consistency of people’s behaviour across just two experimental situations this is
essentially like looking at the test-retest reliability of a single-item measure. As we
reviewed in the module, measures that consist of a low number of items tend
not to be very reliable. So, when researchers assess a single behaviour across two
situations, we should not be surprised that there tends to be relatively low stability in
those behaviours. This low stability likely represents the unreliability of a single-item
measure rather than a lack of consistency in personality expression. To provide a more
reliable assessment of the consistency of personality what researchers should do is
observe the behaviour of a sample of individuals in multiple situations, and compute the
average of each individual’s behaviour in a randomly selected half of those situations
and the average of their behaviour in the other half of the situations and then compute
the correlation between those two averages. If multiple observations go into each of
these averages then each average should be a relatively reliable measure of the
individuals’ behavioural tendencies and thus we would expect the correlation between
these two behavioural averages to be relatively large. For example, suppose that you
observed a sample of individuals across 10 situations. You could then randomly divide
these 10 situations into two sets of 5 situations. You would then calculate each
individual’s average behavior score in each of these sets of 5 situations and compute the
correlation between those two averages. This procedure should result in a much higher
consistency in behaviour than you would get if you computed the correlations in
individuals’ behaviours across only 2 situations, as was typically done in the past
experimental research that questioned the consistency of personality.
Validation of Measures

- Criterion validity: technique for assessing the validity of a measure by examining how
well it predicts some key outcome that it was specifically designed to predict
- Face validity: the extent to which the content of the measure appears to resemble
whatever latent construct it was designed to measure
- Measurement confound: variability in the measured value that can be attributed to a
source that is not the latent construct that the measure was designed to assess
Does the Measure Actually Get at the Variable of Interest?
- Establishing the reliability of a psychological measure is the critical first step in
determining whether it will be a useful instrument for testing hypotheses about the
variable of interest
- When a measure is reliable this tells us that it is measuring something that is
consistent and stable enough to yield replicable results
- For a measure to be useful it is not sufficient to demonstrate that it is measuring
something
- We must establish that the psychological measure is measuring the specific
variable that it was designed to measure, as opposed to something else
- This critical next step entails testing the validity of the measure
- There is always some risk that a measure that was designed to measure a particular
variable might measure variance in other unintended variables
- If some variability in a measure can be attributed to sources other than the variable that
it was designed to measure then this is referred to as a measurement confound
- To establish that a measure will be useful for research purposes a researcher
needs to demonstrate that most of the variability in that measure is related to the
variable that it was designed to measure and relatively little of its variability is due
to some measurement confound
- Much of the work in assessing the validity of a measure focuses on testing
whether potential measurement confounds can be ruled out
Face Validity
- Face validity is the most simplistic and least scientifically compelling evidence for the
validity of a measure
- A measure is said to be face valid if most people who examine just the content of the
measure would agree that it appears to be measuring what it is designed to measure
- Face validity is thus equivalent to the proverbial “duck test”
- “If it looks like a duck and it quacks like a duck, then it must be a duck”
- Similar to the duck test, a researcher might claim that a measure is a face valid
measure of relationship satisfaction because the items consist of statements that
sound like the kinds of things a satisfied partner would endorse
- Although face validity is intuitively compelling it is not scientifically compelling evidence
of a measure's validity because experienced researchers know that things often
are not as simple and straightforward as they may appear
- Some people might endorse a statement such as “I enjoy spending time with my
partner” not because they are actually satisfied with their relationship but
because they fear that they will look bad if they don’t say positive things like this
about their partner
- So a measure that on the surface appears to be a face valid measure of
relationship satisfaction may actually be tapping into a person's motivation
to convey a positive impression to the experimenter
- The principle that we should be suspicious of seemingly face valid evidence
extends beyond research to many practical domains of life
- “You can’t judge a book by its cover”
- Some of the most compelling examples of fallibility of face validity come from the
criminal justice system
- Ex. people naively assume that a suspect’s confession to a crime is face
valid evidence of that person’s guilt - there have been such a large
number of confessions that were proven to be false and were overturned
based on definitive forensic evidence that we should be extremely
skeptical of leaping to the conclusion that a confession proves a person’s
guilt
Criterion Validity
- Some measures are designed to assess variability in a psychological construct in order
to predict specific criterion outcomes of interest
- A criterion outcome is a specific, definitive outcome that usually has some real
world significance or practical value
- In these cases the criterion outcome serves as a “gold standard” for
indexing the validity of the measure in question

- Although some psychological measures are designed to predict a specific criterion
outcome, such as using the SAT to predict academic performance, most psychological
measures are designed to measure more general psychological characteristics that are
relevant to assessing a broad range of potential outcomes and behaviours
- So, while the standard of criterion validity can be useful in some special cases, in
most cases of psychological measurement a different approach that involves
assessing how the measure is related to a broader variety of other relevant
measures and outcomes is needed to assess the validity of measures

- This example illustrates that criterion validity is restricted to measures that are
designed for a quite narrow purpose and is not relevant to most psychological
measures that are designed to assess broader constructs. To validate such broader
measures psychologists need to engage in a more extensive form of validity testing,
called construct validation
Construct Validity: Convergent Validity

- Construct validity: approach for estimating the validity of a measure by testing a network
of predictions about what patterns the measure should exhibit in relation to other
measures and outcomes, and in relation to meaningful groups
- Convergent validity: technique for estimating the validity of a measure by assessing
correlations between that target measure and other existing measures that were
designed to assess the same construct or same closely related construct
- Halo effect: source of systematic measurement error in observational measures that
occurs when observers are biased to attribute positive qualities to targets who are
physically attractive or who make a favourable initial impression compared to targets
who are less attractive or make a less favourable initial impression
- Known groups validation: technique for estimating the validity of a measure by
administering that measure to two or more groups of participants that, according to the
researchers’ theory of the construct, are predicted to differ in their levels of the
psychological variable of interest
- Shared method variance: correspondence between measures that is due to
methodological elements that they share in common rather than due to their convergent
measurement of the same psychological construct
- Social desirability response bias: source of systematic measurement error in self-report
measures that occurs when participants are motivated to respond to self-report items in
a way that will promote a favourable impression
- Theory of the construct: the researcher’s theoretical assumptions about the nature,
scope, and properties of the latent construct that they are studying
Validity Derived From Accumulation of Many Successful Predictions
- Construct validity assesses the validity of a measure by testing a network of
predictions about what patterns the measure should exhibit in relation to other measures
and outcomes and in relation to meaningful groups
- These predictions are derived from the researcher’s theoretical understanding of
whatever latent variable the measure was designed to assess
- This theory of the latent variable is referred to as the researcher’s theory of the
construct, which is why this approach is known as construct validity
- The accumulation of many successful predictions involving the measure and diverse
observations relevant to the theory of the construct enhances the research community’s
confidence in the construct validity of that measure
- Establishing a measure’s construct validity is thus an ongoing process that is never
definitively resolved and that may need to be revisited as the field’s theory of the
construct evolves
Testing Construct Validity: Known Groups Validation
- One useful approach for testing the construct validity of a measure involves
administering the measure to two or more groups of participants that, according to the
researchers’ theory of the construct, are predicted to differ in their levels of the
psychological variable of interest
- This is referred to as known groups validation
- These groups could be a group that has a clinical condition that entails atypically high or
atypically low levels of the variable of interest and a comparison group of participants
who should have typical levels of the variable of interest

- Known groups testing of a measure might also involve administering the measure to a
group of participants who work in a field that requires above-average levels of the variable
of interest and a comparison group of participants who work in a field that does not
require particularly high levels of that variable
- Known groups validation has been used to test the construct validity of the Mind-Eyes
test
- As you will recall this measure was designed to assess the latent variable,
cognitive empathy
- The researcher’s theory of cognitive empathy led them to hypothesize that people who
have an autism spectrum diagnosis (ASD) should be lower in cognitive empathy than
people who were not on the autism spectrum
- This was predicted because ASD is theorized to involve a core deficit in the
individuals’ theory of mind, which is the ability to make accurate inferences about
other people's cognitive states
- Thus, if the Mind-Eyes test is a valid measure of cognitive empathy then the
researchers predicted that a sample of participants who had ASD should get
significantly lower scores on the Mind-Eyes test than a comparison sample of
participants who did not have ASD
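
A known-groups prediction like this is typically checked with a simple two-group comparison. A minimal sketch, using invented score arrays rather than data from the actual validation studies:

```python
import numpy as np
from scipy.stats import ttest_ind

asd_scores = np.array([18, 21, 17, 22, 19, 20, 16, 23])         # hypothetical
comparison_scores = np.array([26, 24, 28, 25, 27, 23, 29, 26])  # hypothetical

t, p = ttest_ind(asd_scores, comparison_scores)
print(f"t = {t:.2f}, p = {p:.4f}")  # significantly lower ASD scores support validity
```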

Testing Construct Validity: Convergent Validity


- Another major approach for testing the construct validity of a target measure involves
assessing how that measure relates to other measures that have been designed to
measure the same construct as well as other measures that were designed to measure
a different construct
- If the measure that is being tested is valid then it should show moderate to strong
positive correlations with other measures of the same construct and it should have
relatively low correlations or be uncorrelated with measures of a different construct
- The correlations between the target measure and other measures of the same construct
assess the target measure’s convergent validity
- To have good convergent validity the target measure needs to show relatively
high positive correlations with other existing measures that used a different
methodology to measure the same construct or some closely related construct
- Ex. if a new measure of a latent variable, such as agreeableness, is authentically
capturing variance in people’s levels of this trait then we should predict that when
this new measure and other existing agreeableness measures are administered
to the same sample of participants those participants’ scores on the new
measure should be at least moderately positively correlated with the other
agreeableness measures
- Convergent validity is critical because it shows that the measure that is being tested
overlaps with other operationalizations (ex. Other questionnaires or behavioural tasks) of
the same construct
- This overlap between these measures indicates that they may be converging on
the same latent variable
How do Convergent Validity and Parallel Forms Differ?
- Convergent validity might at first sound similar to parallel forms reliability but these two
concepts are actually quite different
- To assess a measure’s parallel-forms reliability researchers test within a
given sample the correlation between participants’ scores on two versions of
the same measure, which have the same overall structure and format but
merely differ in the particular content of their items
- Ex. researchers test the correlation between participants’ scores on two
different versions of the Mind-Eyes test, where the images of the eyes
and the response options are a little different
- To assess a measure’s convergent validity researchers test within a given
sample the correlation between participants’ scores on two or more distinct
measures that use different methods to assess the same underlying construct
- Ex. researchers test the correlation of participants' scores on the
Mind-Eyes test with their scores on a distinct measure, such as the
empathy quotient questionnaire, which was also designed to assess
variability in cognitive empathy
Strong Evidence of Convergent Validity Using Distinct Measures
- Evidence of convergent validity is especially impressive when different measures of the
same construct utilize quite distinct methodologies to measure the variable
- Ex. it is more impressive to show that a self-report measure of agreeableness
correlates significantly with a behavioural measure of agreeableness (2 very
distinct measures) than it is to show that two self-report measures of agreeableness
are significantly correlated (2 similar measures)
- When two different measures of the same construct use the same general methodology
(ex. Both are self-report measures) then some of their correlation with each other is likely
due to shared method variance rather than due to their convergent measurement of
the same psychological construct
- Shared method variance is an example of systematic measurement error
because it biases a measure in some consistent direction away from the true
value that the measure is designed to estimate
Testing Construct Validity: Discriminant Validity

- Discriminant validity: technique for estimating the validity of a measure by assessing
whether that target measure is unrelated to measures that it should be distinct from
according to the researcher's theory of the construct
according to the researcher’s theory of the construct
Testing Construct Validity: Discriminant Validity
- The correlations between the target measure and measures of different constructs assess
the target measure's discriminant validity
- To have good discriminant validity the target measure needs to show relatively
low correlations with measures of other variables that it should be distinct from
according to the researcher's theory of the construct
- A target measure will be judged to have overall good construct validity if its correlations
with other measures of the same construct are consistently higher than its correlations
with measures of constructs that it should theoretically be distinct from
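
A minimal simulated sketch of this overall pattern, using the module's Mind-Eyes example (the data-generating assumptions below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
empathy = rng.normal(0, 1, n)       # latent construct of interest
verbal_skill = rng.normal(0, 1, n)  # theoretically distinct construct

mind_eyes = empathy + rng.normal(0, 0.7, n)         # target measure
empathy_quotient = empathy + rng.normal(0, 0.7, n)  # other measure, same construct
vocab_test = verbal_skill + rng.normal(0, 0.7, n)   # measure of a different construct

r_convergent = np.corrcoef(mind_eyes, empathy_quotient)[0, 1]
r_discriminant = np.corrcoef(mind_eyes, vocab_test)[0, 1]
print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
# good construct validity: convergent r clearly exceeds discriminant r
```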
Overview of Measurement Validity
- https://www.psychologytoday.com/ca/blog/cui-bono/201711/measurement-validity-explained-in-simple-language
Think and Respond

- Dr. Johnson notes that items such as "I feel a kinship with other people” assess social
adjustment. Social adjustment is one of the outcomes that spirituality should be
correlated with according to theories of spirituality, but social adjustment is not one of the
defining characteristics of spirituality. If items that assess social adjustment are included
in a measure of spirituality then this would result in spirituality being confounded with
social adjustment in that measure. We know from other research that social adjustment
tends to predict positive physical and psychological outcomes. So, the confounding of
spirituality and social adjustment in a spirituality measure makes it difficult to determine
whether the positive correlation between that spirituality measure and
physical/psychological outcomes is due to the spirituality-relevant items in the spirituality
measure or due to the social adjustment-related items within that measure. Thus, when
researchers are designing a new measure they need to be careful to ensure that the
contents of the measure only include items that are relevant to the core defining
characteristics of whatever construct the measure is being designed to assess and
exclude any items that are specific to other variables that the researchers may want to
use their measure to predict.

Summary

- Testing psychological hypotheses about the relations between variables depends upon
using measures that accurately represent those variables. Thus, researchers need to
demonstrate that the measures they use to operationalize their hypothesized variables
accurately represent those variables. For a measure to be considered an accurate
representation of a variable that measure must be shown to have adequate reliability
and validity
- A measure is reliable to the extent that it yields consistent, dependable results when it is
used repeatedly to measure variability within a sample. The reliability of a measure is
inversely related to the amount of random measurement error in that measure. A
variety of methods are used to assess the reliability of psychological measures including
the temporal stability of scores on that measure, the consistency between scores on
parallel forms of the same measure, and the internal consistency of the scores on
individual items within that measure. For observational measures, reliability is assessed
by estimating inter-rater agreement in the ratings that independent observers give to
the same sample of observations. When a measure is found to have low reliability
researchers may seek to improve its reliability by incorporating more items into the
aggregated measure, by removing items that are weakly or inconsistently associated
with other items in the measure, and by clarifying the items in the measure and
standardizing the conditions of administering or scoring the measure
- A measure is valid to the extent that it assesses the specific variable that it was designed
to measure as opposed to measuring some other confounded variable. To establish
that a measure will be useful for research purposes a researcher needs to demonstrate
that most of the variability in that measure is related to the variable that it was designed
to measure and relatively little of its variability is due to some measurement confound.
Researchers may be tempted to assume that a measure is valid simply because the
content of the measure appears to resemble the variable that it was designed to
measure. This is known as face validity and it is not considered to be an adequate
demonstration of a measure's validity. When a measure is designed to predict some
specific criterion outcome then the validity of a measure can be assessed by testing how
well the measure predicts that outcome, which is known as criterion validity. For most
psychological measures, which are designed to predict a broad range of psychological
states and behaviours, validity testing usually involves extensive research that tests the
relations between the measure of interest and a variety of other measures and
outcomes, which is known as construct validity. One method of assessing a measure's
construct validity involves administering that measure to two or more groups of
participants that, according to the researchers' theory of the construct, are predicted to
differ in their levels of the psychological variable of interest, which is referred to as
testing known groups validity. Another technique for testing the construct validity of a
measure involves assessing correlations between that target measure and other existing
measures that were designed to assess the same construct or some closely related
construct, which is known as testing the measure's convergent validity. Another
important technique for estimating the construct validity of a measure involves testing
whether that target measure is unrelated to measures that it should be distinct from
according to the researcher's theory of the construct, which is known as testing the
measure's discriminant validity.
