
Week 2 Lectures, Part 1: Validity

Introduction to Validity
 What is validity?
 When we talk about validity, we are asking:
 Does a test measure what it claims to measure?
 Does the achievement test measure how well an individual has mastered the content of a course or
training program?
 Does the aptitude test measure a person’s ability to perform a task or activity?
 Does the test predict what it claims to predict?
 Does an employment test predict future performance on the job?
 What is the relationship between Validity and Reliability?
 If a test is unreliable, it can’t be valid
 Reliability is necessary but not sufficient for validity (places a ceiling on validity)
 How to Assess Validity?
 Validating a test refers to accumulating empirical data and logical arguments to show that the inferences
are indeed appropriate:
 Analyze the content of the test
 Relate the scores to specific criteria
 Examine the psychological constructs measured by the test
 What are the types of validity?
 Content Validity
 How well do the items on a test represent everything the test attempts to measure (i.e., how well has the construct been operationalized)?
 Requires a well-defined trait/construct
 This can be difficult to do with abstract/complex constructs
 “Expert” judges can be used to determine content validity.
 Type of Content Validity = Face Validity:
 Does a test superficially measure the construct
 Tells us nothing about what a test really measures
 Non-statistical
 Established by non-experts
 Criterion Validity
 Extent to which the results relate to a certain criterion
 Types of Criterion Validity
 Concurrent validity: correlation between results and a pre-existing criterion.
 Extent to which test scores accurately estimate an individual’s present position on relevant
criteria.
 Appropriate for validating clinical tests that diagnose behavioral, emotional, or mental disorders
(e.g., personality inventories).
 Often used in time sensitive areas, or with short forms of a measure.
 Predictive validity: predictive power of results on a future outcome (regression).
 Test scores are used to estimate outcome measures obtained at a later date.
 E.g., entrance examinations and employment tests (e.g., GRE)
 The good ol’ regression equation:
 Y = b0 + b1X
 Y = the predicted score on the criterion
 b0 = the intercept
 b1 = the slope
 X = the score the individual made on the predictor test
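 A minimal sketch in Python (hypothetical scores, for illustration only) of fitting the regression equation above and predicting a criterion score from a predictor test score:

```python
# Hypothetical predictor (test) and criterion scores, for illustration only.
import numpy as np

X = np.array([500, 550, 600, 650, 700, 750])   # e.g., entrance-exam scores
Y = np.array([2.4, 2.7, 2.9, 3.1, 3.4, 3.6])   # e.g., later GPA (the criterion)

b1, b0 = np.polyfit(X, Y, 1)       # slope (b1) and intercept (b0) of Y = b0 + b1*X
y_hat = b0 + b1 * 625              # predicted criterion score for a new test score
print(f"b0 = {b0:.2f}, b1 = {b1:.4f}, predicted Y = {y_hat:.2f}")
```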
 Criterion Contamination: artificial inflation of the strength of relationship between a test and criterion.
 Max validity is set by the reliability of the test and the criterion.
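 A short sketch of that ceiling (the classic attenuation bound); the reliabilities below are hypothetical:

```python
# Observed validity cannot exceed sqrt(reliability of test * reliability of criterion).
import math

r_xx = 0.81   # hypothetical reliability of the test
r_yy = 0.64   # hypothetical reliability of the criterion
print(f"Maximum possible validity coefficient: {math.sqrt(r_xx * r_yy):.2f}")   # 0.72
```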
 Construct Validity
 Extent to which a test measures the construct it claims to measure
 Construct: an intangible quality in which individuals differ, e.g., depression, intelligence, etc.
 Considerations:
 1)Test homogeneity
 2) Appropriate developmental changes
 3) Theory consistent group differences
 4) Theory consistent intervention effects
 Types of Construct Validity
 Convergent validity – when a test highly correlates with other related tests or variables.
 Discriminant validity – when a test does not correlate with unrelated variables.
 Ways to establish evidence
 C & D Validation
 When do you use which validation strategy?
 Content validity: for tests that measure concrete attributes (observable and measurable
behaviors).
 Criterion-related validity: for tests that predict outcomes.
 Construct validity: for tests that measure abstract constructs.
 How do you calculate and evaluate validity coefficients?
 Calculating Validity Coefficients
 Correlation coefficient – quantitative estimate of the relationship between two variables
 Validity coefficient – the correlation coefficient between test scores and scores on the criterion
 Example: The correlation between SAT scores and college performance is 0.40. How much of the variation in college performance is explained by SAT scores?
 r2 = 0.16, so 16% of the variance is explained (and so 84% is not explained).
 There are 2 methods for evaluating validity coefficients
 Tests of significance
 Expectations for strong relationship not as high as with reliability coefficients
 Answers question “How likely is it that the correlation between the test and the
criterion resulted from chance or sampling error?”
 If p<.05 we have evidence that the test and criterion are related
 Coefficient of determination
 Answers the question, “What amount of variance do the test and criterion share?”
 Equals validity coefficient squared (r2)
 E.g., if r = .50, then r2 = .25 (25% shared variance)
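 A minimal sketch (hypothetical data) of computing a validity coefficient and its coefficient of determination:

```python
# Pearson correlation between test scores and criterion scores, then r squared.
import numpy as np

test = np.array([45, 52, 60, 61, 70, 75, 80])                # hypothetical test scores
criterion = np.array([2.1, 2.5, 2.6, 3.0, 3.2, 3.3, 3.8])    # hypothetical criterion scores

r = np.corrcoef(test, criterion)[0, 1]   # validity coefficient
print(f"r = {r:.2f}, shared variance (r^2) = {r**2:.0%}")
```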
 Multitrait-multimethod matrix
 Background Information
 A way of representing the relationship between multiple traits (constructs) and multiple
methods for measuring those traits.
 Visually displays information about both reliability and validity in a succinct way.
 Basic principles of the MTMM:
 Coefficients in the reliability diagonal should consistently be the highest in the matrix.
 A trait should be more highly correlated with itself than with anything else
 Coefficients in the validity diagonals should be significantly different from zero and high enough to warrant further investigation.
 This is evidence of convergent validity
 A validity coefficient should be higher than values lying in its column and row in the
same heteromethod block.
 A validity coefficient should be higher than all coefficients in the heterotrait-
monomethod triangles.
 Trait factors should be stronger than methods factors
 The same pattern of trait interrelationship should be seen in all triangles.
 The Matrix
 Reliability diagonal: Test-retest or internal consistency reliabilities
 These tell you how reliably you can measure each construct (A, B, C) with each method (= monotrait-monomethod correlations)
 Heterotrait monomethod triangles: These show different constructs measured by the
same method
 Correlations of the same trait measured by the same and different measures (Validity
diagonals) should be greater than correlations of a different trait measured by the same
and different measures (Heterotrait monomethod triangles)
 If not, what is going on?
 Validity diagonals: Tell you how well you can measure the same construct using different
methods (monotrait-heteromethod diagonals)
 Each entry shows the correlation between two different methods used to measure the
same construct
 We hope these will be highly correlated = convergent validity
 Heterotrait heteromethod triangles: These show different constructs measured by
different methods
 Because they share neither trait nor method, they should be expected to be low
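 A small illustration (hypothetical correlations) of the Campbell & Fiske checks above, using labels such as A1 = trait A measured by method 1:

```python
# Hypothetical MTMM entries for two traits (A, B) and two methods (1, 2).
r_A1_A2 = 0.62   # validity diagonal: same trait, different methods (convergent)
r_A1_B1 = 0.30   # heterotrait-monomethod: different traits, same method
r_A1_B2 = 0.15   # heterotrait-heteromethod: different traits, different methods

# Expected ordering for convergent and discriminant validity evidence:
assert r_A1_A2 > r_A1_B1   # validity coefficient exceeds heterotrait-monomethod values
assert r_A1_A2 > r_A1_B2   # and exceeds heterotrait-heteromethod values
print("Pattern consistent with convergent and discriminant validity")
```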
 Factor Analysis
 Factors—the underlying commonalities of tests or test questions that measure a construct
 Exploratory Factor Analysis (EFA)
 Identifies underlying structure (factors) of a group of items/variables.
 Confirmatory Factor Analysis (CFA)
 Determines whether or not theoretically proposed factor(s) exist within a group of
items/variables.
 Steps to Complete Factor Analysis:
 enter a matrix of raw data (usually individual answers to test questions)
 software program calculates a correlation matrix of all variables or test questions
 software program uses the correlation matrix to calculate a factor solution based on a geometric representation of each test question’s relationship to the other test questions
 as test questions group together, they form factors—underlying dimensions of questions that
measure the same trait or attribute
 researcher examines the questions in each factor and names the factor based on the content
of the questions that are most highly correlated with that factor
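 A minimal sketch of an exploratory factor analysis on simulated item data (assuming scikit-learn is available; real test development would use larger samples and examine rotated loadings, scree plots, etc.):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
verbal = rng.normal(size=n)    # latent "verbal" trait (simulated)
spatial = rng.normal(size=n)   # latent "spatial" trait (simulated)

# Six hypothetical items: the first three load on verbal, the last three on spatial.
items = np.column_stack(
    [verbal + rng.normal(scale=0.5, size=n) for _ in range(3)]
    + [spatial + rng.normal(scale=0.5, size=n) for _ in range(3)]
)

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))   # loadings: rows = factors, columns = items
```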

 What Affects Test Validity?


 Moderator variables
 Those characteristics that define groups, such as sex, age, personality type etc.
 A test that is well-validated on one group may be less valid for another
 Validity is usually better with more heterogeneous groups, because the range of behaviours and test
scores is larger
 Base rates
 Tests are less effective when base rates are very high or very low (that is, whenever they are skewed
from 50/50)
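 A quick worked sketch (hypothetical accuracy figures) of why skewed base rates hurt: with a rare condition, most positive results are false positives even for a fairly accurate test:

```python
sensitivity = 0.90   # P(test positive | condition present) -- hypothetical
specificity = 0.90   # P(test negative | condition absent) -- hypothetical

for base_rate in (0.50, 0.05):
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    ppv = true_pos / (true_pos + false_pos)   # P(condition present | test positive)
    print(f"base rate {base_rate:.0%}: positive predictive value = {ppv:.0%}")
# base rate 50% -> PPV = 90%; base rate 5% -> PPV is only about 32%
```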
 Test length
 Longer tests tend to be more reliably related to the criterion than shorter tests (this depends on the
questions being independent--every question increases information)
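 One classical way to see the test-length effect is the Spearman-Brown prophecy formula for reliability (which, per the ceiling noted earlier, also bounds validity); a minimal sketch:

```python
def spearman_brown(r, k):
    """Predicted reliability when a test is lengthened by a factor k,
    assuming the added items are comparable and independent."""
    return (k * r) / (1 + (k - 1) * r)

print(round(spearman_brown(0.70, 2), 2))   # doubling a test with r = .70 -> .82
```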
 Nature of the sample
 Homogeneous: not good (restricted range of scores)
 Heterogeneous: good (wider range of scores)
 Inadequate understanding of constructs
 Solutions:
 Think through the concepts better--exhaustive examination of literature
 Use methods such as concept mapping to articulate concepts
 Get experts to critique operationalizations
 Mono-method bias
 Using a single measure may not tap the full construct
 Solution:
 Implement multiple versions—do they behave as you theoretically expect them to?
 Criterion contamination
 Too much overlap between predictor and criterion
 Social Threats to Validity
 Hypothesis guessing
 Evaluation apprehension
 If evaluation makes them perform poorly
 Tendency to “look good”
 Or sometimes to “look bad”
 Tester expectancies
 Both conscious and unconscious communication
 False Positives vs. False Negatives
 Consider the consequences of false positives and false negatives for the following test-based decisions:
 whether a law enforcement candidate is prone to using excessive force
 whether a client is suicidal
 whether a third grader is learning disabled
 whether an employed aircraft mechanic is a safety risk to the company
 In each case, which is worse—a false positive or a false negative? Which should we try to minimize?
Test Construction
 Review: What are the 4 levels of measurement?
 Nominal: #s = categories (e.g., 1 = male; 2 = female).
 Ordinal: rank order without clear quantitative difference between ranks (e.g., military ranks).
 Interval: clear metric to differentiate values (e.g., temperature in Celsius).
 Ratio: same as interval, but includes a “true zero” (e.g., weight).
 What are the different types of scales?
 Likert Scale:
 Rating scale that assigns numeric values to qualitative ratings (technically ordinal, but often treated as
an artificial interval).
 Guttman Scale:
 ordered sequence of endorsement (selected response implies agreement with all preceding responses).
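 A tiny sketch of what a Guttman-consistent response pattern looks like (hypothetical items ordered from easiest to hardest to endorse):

```python
pattern = [1, 1, 1, 0, 0]   # 1 = endorsed; endorsing a later item implies all earlier ones

def is_guttman_consistent(p):
    # Once an item is not endorsed, no later (harder) item should be endorsed.
    return all(not (a == 0 and b == 1) for a, b in zip(p, p[1:]))

print(is_guttman_consistent(pattern))        # True
print(is_guttman_consistent([1, 0, 1, 0]))   # False: violates the ordered sequence
```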
 Method of Empirical Keying:
 how well test items differentiate criterion group from control group.
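 A minimal sketch of empirical keying (hypothetical endorsement rates): keep items whose endorsement differs substantially between the criterion group and the control group:

```python
items = {                       # proportion endorsing each item (hypothetical)
    "item_1": {"criterion": 0.80, "control": 0.35},
    "item_2": {"criterion": 0.52, "control": 0.48},
    "item_3": {"criterion": 0.20, "control": 0.60},
}

threshold = 0.20   # arbitrary cutoff, for illustration only
keyed = {name: round(p["criterion"] - p["control"], 2)
         for name, p in items.items()
         if abs(p["criterion"] - p["control"]) >= threshold}
print(keyed)   # items retained for the key, with the direction of scoring
```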
 Flowchart of the Test Development Process
 Steps to the Test Development Process
 Step 1: Define test universe, target audience, and test purpose
 First step in test development includes:
 Defining the Testing Universe
 Testing universe - body of knowledge or behaviors that the test represents
 Involves preparing a working definition of the construct the test will measure
 Requires a thorough review of literature to identify how construct can be defined and
other tests that measure similar construct
 Defining the Target Audience
 Target audience - group of individuals who will take the test.
 Make list of characteristics
 Consider appropriate reading level
 Consider disabilities requiring special test administration or interpretation
 Consider whether test takers will be motivated to answer honestly
 Defining the Test Purpose
 Provides foundation for all other development activities
 Step 2: Develop a test plan
 Test plan - specifies the characteristics of the test, including an operational definition of the
construct and the content to be measured (the testing universe), the format for the questions, and the
administration and scoring of the test.
 Choosing the Test Format:
 Objective test formats – have one response designated as “correct”
 Subjective test formats – interpretation left to judgment of person who scores/interprets test
 Specifying Administration Formats
 Some questions to consider:
 Will the test be administered in writing, orally, or by computer?
 How much time will the test takers have to complete the test?
 Will the test be administered to groups or individuals?
 Will the test be scored by a test publisher, the test administrator, or the test takers?
 What type of data is the test expected to yield?
 Scoring Methods
 Cumulative model – the more the test taker responds in a particular fashion (either with
“correct” answers or ones that are consistent with a particular attribute), the more the test taker
exhibits the attribute being measured (e.g., multiple-choice questions)
 Categorical model – places test takers in a particular group (e.g., a diagnosis of a psychological
disorder)
 Ipsative model – requires test taker to choose among the constructs the test measures (e.g.,
forced choice)
 EX: Personality items present job-applicants with options equal in desirability so that their
choice cannot be influenced by social desirability.
 Job-applicants are asked to indicate which items are ‘most true’ of them and which are ‘least
true’ of them
 Using Reverse Scoring to Balance Positive and Negative Items
 Method for offsetting effects of acquiescence
 Balance positive with negative statements
 Given the following 5-point scale:
 1 = rarely true
 2 = sometimes true
 3 = neither true nor false
 4 = very often true
 5 = almost always true
 Test scorer reverses the response numbers of negative items and uses the cumulative model of scoring
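 A minimal sketch of reverse scoring on the 5-point scale above (hypothetical responses): for negatively worded items, 1 becomes 5, 2 becomes 4, and so on, before summing under the cumulative model:

```python
responses = {"item_1": 4, "item_2": 2, "item_3": 5}   # hypothetical responses (1-5)
negative_items = {"item_2", "item_3"}                 # negatively worded items

scored = {item: (6 - r if item in negative_items else r)   # 6 - r reverses a 1-5 scale
          for item, r in responses.items()}
print(scored, "total =", sum(scored.values()))   # {'item_1': 4, 'item_2': 4, 'item_3': 1} total = 9
```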
 Step 3: Compose the test items
 Response Bias
 Response styles/sets – patterns of responding to test items that result in false or misleading
information
 Sample response sets:
 Social desirability
 Acquiescence
 Random responding
 Faking
 Writing Effective Items
 General suggestions for writing effective items:
 Write each item in a clear and direct manner
 Use vocabulary and language appropriate for target audience
 Avoid using slang or colloquial language
 Make all items independent
 Ask someone else to review items to reduce unintended ambiguity and inaccuracies
 Writing multiple choice and true/false items:
 Avoid using negative stems and responses
 Make all responses similar in detail and length
 Make sure each item has only one correct / best answer
 Avoid using words like “always” and “never”
 Avoid overlapping responses
 Avoid using inclusive distractors such as “all of the above”
 Use random assignment to position correct response
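 A small sketch of randomly assigning the position of the correct response (hypothetical item):

```python
import random

options = ["Paris", "London", "Berlin", "Madrid"]   # "Paris" is the keyed answer
random.shuffle(options)                             # randomize the option order
print(options, "correct position:", options.index("Paris") + 1)
```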
 Step 4: Write administration instructions
 Must write three sets of instructions:
 Administrator Instructions
 Whether test should be administered in group or individually
 Specific requirements for test administration location
 Required equipment
 Time limitations
 Script for administrator
 Credentials or training required
 Instructions for the Test Taker
 Test administrator typically delivers instructions by reading a prepared script.
 Instructions also typically appear in writing.
 Test taker needs to:
 Know where to respond
 Know how to respond
 Have specific directions for each type of item
 Scoring Instructions
 Ensure each person who scores tests follows same process.
 Must explain how test scores relate to construct measured.
 Step 5: Conduct pilot test(s)
 Step 6: Conduct item analysis
 Step 7: Revise the test
 Step 8: Validate the test
 Step 9: Developing norms
 Step 10: Compile test manual
Week 2, Part 1: Intelligence and Achievement
 Outline
 Defining Intelligence
 Theories of Intelligence
 Wechsler tests (WAIS-IV, WISC-V)
 Stanford Binet Intelligence scales
 Achievement tests
 Learning Disabilities

 What is Intelligence?
 Standardized test performance?
 Ability to perform certain mental operations quickly and accurately?
 Memory?
 Intelligence:
 An internal capacity hypothesized to explain people’s ability to solve problems, learn about new
materials, and adapt to new situations. 
 How accurate are we in judging intelligence?
 Anything missing in the definition?

 Theories of Intelligence
 Sir Francis Galton
 Spearman
 Thurstone
 Vernon
 Cattell + Cattell-Horn-Carroll
 Guilford
 Biological Theory
 Information Processing
 Gardner
 Sternberg

 Spearman’s Work in Psychometrics


 Developed factor analysis
 A procedure that groups together related items on tests by analyzing correlations among them
 Scores that reflect a single underlying skill should correlate
 Argued that a single factor, “g” (for general intelligence), underlies performance on a variety of mental
tests
 A separate factor, “s” (for specific intelligence) is unique to each particular test
 Explains why correlations among tests aren’t perfect
 Spearman’s General & Specific Factors
 Another Psychometric Approach: Hierarchical Models
 Thurstone argued for several kinds of primary
mental abilities instead of just one ‘g’
 Examples: verbal comprehension, verbal
fluency, numerical ability, spatial ability,
memory, perceptual speed, reasoning
 Vernon (1950)
 Hierarchical idea: g exists, but is made up of
subfactors (abilities) that may operate
independently from one another
 Fluid and Crystallized Intelligence
 Raymond Cattell: made the distinction between two components of general intelligence:
 Fluid intelligence: Ability to solve problems, reason and remember
 Relatively uninfluenced by experience, schooling
 Crystallized intelligence: Knowledge and abilities acquired as a result of experience
 Reflects schooling, cultural background
 Bio Basis: Speed of Neural Transmission
 If there is such a thing as general intelligence, what is its biological basis?
 One possibility: Individual differences in how fast neurons can communicate with one another
 Evidence that speed of neural transmission is important
 Electrical response of the brain to a visual stimulus correlates with intelligence test scores
 Note: research was correlational and could not explain all of the variability in intelligence test
scores
 Metacognitive Strategies
 How we think about our own thinking or our awareness of our thinking processes can influence our
abilities 
 E.g., use of  metacognitive strategies can enhance performance on ability tests
 Multiple Intelligences
 Gardner’s Multiple Intelligences
 Idea that people possess a set of separate and independent “intelligences” ranging from musical to
linguistic to interpersonal ability
 Focus on exceptional individuals with particular abilities or talents
 Breadth is a strength of the theory, although some kinds of intelligence can be hard to test
 9 Types of Intelligence
 Logical-mathematical intelligence
 Linguistic intelligence
 Spatial intelligence
 Musical intelligence
 Bodily-kinesthetic intelligence
 Interpersonal intelligence (emotional)
 Intrapersonal intelligence
 Naturalistic intelligence
 Existential intelligence
 Sternberg’s Triarchic Theory
 Combines Gardner’s broad conception of intelligence with a concern for the mental operations that
underlie each part of intelligence
 Three parts:
 Analytic intelligence – “book smart”
 Creative intelligence – invention and application
 Practical intelligence – “street smart”
 Intelligence Quotient
 How Valid is IQ?
 Does IQ testing measure what it is supposed to measure (intelligence)?
 Different specific IQ tests: WAIS, WISC, Stanford-Binet
 These tend to correlate well with school performance, but not as well with broader ideas of how a
person adapts to environment
 Labelling effects: Does being labelled as high-IQ or low-IQ tend to affect educational
opportunities?
 If so, IQ can become a self-fulfilling prophecy
 Stability of IQ
 Does IQ change significantly over a person’s life?
 Results of longitudinal studies suggest it does not
 Longitudinal studies: Test the same people repeatedly as they age
 IQ declines little until age 60; usually no drastic changes even in old age
 Crystallized intelligence declines less than fluid intelligence
 http://www.youtube.com/watch?v=v5knhWYvmL8

 Wechsler tests (WAIS-IV, WISC-V)


 Wechsler Adult Intelligence Scale
 Created by David Wechsler in the 1930s while testing psychiatric patients
 Largely inspired by the Binet scales and the Army Alpha & Beta tests; first named the Wechsler-Bellevue
 WAIS-III improvements included two new subtests and expanded coverage of cognitive-functioning domains through index scores; the WAIS-IV later removed the Verbal IQ and Performance IQ
 Mean IQ and index scores are 100; standard deviation is 15
 WAIS-IV
 15 subtests; 10 are used for obtaining the IQ score, and the others are used to look at individual strengths and weaknesses; index scores for:
 Verbal Comprehension (VCI) Similarities, Vocabulary, Information

 Verbal Comprehension subtests – proposed abilities measured:
 Similarities: Abstract verbal reasoning
 Vocabulary: The degree to which one has learned, been able to comprehend, and verbally express vocabulary
 Information: Degree of general information acquired from culture
 (Comprehension): Ability to deal with abstract social conventions, rules, and expressions

 Perceptual Reasoning (PRI) Block Design, Matrix reasoning, Visual puzzles


 Perceptual Reasoning subtests – proposed abilities measured:
 Block Design: Spatial perception, visual abstract processing, and problem solving
 Matrix Reasoning: Nonverbal abstract problem solving, inductive reasoning, spatial reasoning
 Visual Puzzles: Spatial reasoning
 (Picture Completion): Ability to quickly perceive visual details
 (Figure Weights): Quantitative and analogical reasoning
 Working Memory (WMI) Digit Span, Arithmetic

 Working Memory subtests – proposed abilities measured:
 Digit Span: Attention, concentration, mental control
 Arithmetic: Concentration while manipulating mental mathematical problems
 (Letter-Number Sequencing): Attention, concentration, mental control

 Processing Speed (PSI) Symbol search, Coding

 Processing Speed subtests – proposed abilities measured:
 Symbol Search: Visual perception/analysis, scanning speed
 Coding: Visual-motor coordination, motor and mental speed, visual working memory
 (Cancellation): Visual-perceptual speed

 WAIS-IV Standardization
 Used a census-based standardization sample of 2,200 adults, stratified to match the population
 13 age bands (16-17, 18-19, etc.)
 Reliability:
 Split-half reliability for the full scale is .98
 Supported for use with specialized populations
 SEM of about 2.6 points; 95% of the time, the obtained score falls within roughly +/- 4 points of the true IQ (see the sketch below)
 Individual subtests are weaker (less reliable)
 Validity
 Good content validity; criterion-related validity of .94 with the WAIS-III; appropriate convergent and divergent validity; construct validity confirmed
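 A small sketch of how an SEM and a 95% band can be derived from a reliability coefficient (illustrative values only: SD = 15 and the full-scale split-half of .98; the published manual reports the exact figures):

```python
import math

sd, reliability = 15, 0.98                 # illustrative values, not manual figures
sem = sd * math.sqrt(1 - reliability)      # standard error of measurement
band = 1.96 * sem                          # half-width of a 95% confidence band
print(f"SEM = {sem:.1f} points, 95% band = +/- {band:.1f} points")   # ~2.1 and ~4.2
```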

 Stanford Binet Intelligence scales


 SB5
 Oldest individual intelligence test
 Five factors (Fluid Reasoning, Knowledge, Quantitative Reasoning, Visual-Spatial Processing, Working Memory), each assessed in both the Verbal and Nonverbal domains
 IQ and Factor scores normed to mean of 100 and SD of 15
 Full scale IQ, Verbal IQ and Nonverbal IQ
 Standardization
 Used a sample of 4,800 individuals aged 2 to 85
 Stratified sample based on gender, ethnic background, region, and education level, based on the 2000 US census
 Reliability in the .90s; subtest reliabilities .70 to .85
 Validity: strong criterion-related validity (both predictive and concurrent)
 Can assess children as young as two
 Excellent for high-range and low-range scorers
 First test to consider religious traditions during test development

 Individual Tests for Achievement


 Appraise what the person has learned in school
 Kaufman Test of Educational Achievement (KTEA-II)
 Diagnostic Achievement Battery-3 (DAB-3)
 Wechsler Individual Achievement Test (WIAT-II)
 Woodcock-Johnson III Tests of Achievement (WJ III)
 Wide Range Achievement Test (WRAT-4)

 Learning Disabilities
 Previous method
 Difference of 1+ SDs between individual intelligence test and individual achievement test in one or
more areas
 Contemporarily
 The National Joint Committee on Learning Disabilities (NJCLD) abandoned the ability-achievement discrepancy approach
 Social and emotional difficulties common
 Operationalizing LD
 1) Discrepancy between general ability and specific achievements
 2) Related considerations
 Examine psychosocial skills, physical and sensory abilities
 Types: Verbal vs Non-Verbal
 3) Alternative Explanations
 Rule out non LD explanations for learning difficulties (hearing loss, depression, ADHD, etc.) 
