
Week 1 Lectures, Part 1: Introduction to Psychometrics

Introduction – Skepticism and Psychometrics


 What do you know and how do you know you know it?
 Epistemology:
 What is knowledge?
 Belief + Truth + Justification…
 How do we get knowledge?

 What are the limits of knowledge?
 Global Skepticism

 Single-Fact Skepticism

 The Problem of Skepticism (video)

 Assuming we’re not a brain in a vat…
 Sources of knowledge:
 INTUITIVE : based on belief, intuition, gut feeling, etc.
 AUTHORITATIVE: based on trusted sources.
 LOGICAL: based on reasoning (given x, then y).
 EMPIRICAL: based on objective and observable evidence.
 *Which sources of knowledge apply to psychological assessment?

Psychometrics
 Why should we question what we know?
 Psychological testing/assessment seeks to “know” something about someone. This “knowledge” can have
significant consequences, good or bad.
 What we can extrapolate from test results is limited by numerous factors.
 What is your experience using/exposure to psychological tests, in your placement or elsewhere?

 What is Psychometrics?
 Theory and technique of measurement.
 Based on the assumption that tests measure individual differences in the magnitude of traits or characteristics that exist.
 “Whatever exists at all, exists in some amount” Thorndike (1918).
 A test is a standardized procedure for sampling behaviour and describing it with categories or scores.
 What is a Psychological Test? – Five characteristics of norms-based tests
 Standardized procedures
 Uniform from one examiner to the next
 Direction from administrators
 Precise instructions for each item
 E.g., Digit span on the WAIS/WISC
 Behaviour sample
 Only a snapshot of behaviour, targeted and well defined.
 Inferences about total domain of relevant behaviours.
 Score or category
 What does a score mean?
 The purpose of testing is to measure the trait or quality that is defined by the construct and minimize the
amount of error.
 Measured using the following:
 X = T + e
 Norms or standards
 Selection and testing of a standardization sample is crucial for the usefulness of a test.
 Norms allow the tester to determine the degree to which a score deviates from what is expected
 Help identify limits and outlying scores.
 *Note: Norm-referenced vs. criterion-referenced
 Prediction of behaviour
 Predict additional behaviours (other than those directly sampled by test).
 Validation research after test is released.
 Peer reviewed articles.
 Revisions to norms (e.g., extending norms to more populations).
 What is the difference between a test and an assessment?

 What are the different types of Psychological Tests?
 Ability test:
 Achievement
 Intelligence
 Aptitude
 Personality test:
 Objective
 Projective
 Creativity test
 Interest inventories
 Behavioural Procedures
 Neuropsychological test
 What are the different uses of tests?
 Classification:
 Placement
 Screening
 Certification
 Selection
 Diagnosis and Treatment planning
 Self Knowledge
 Program Evaluation
 Research
 What are the broad ethical principles related to psychological testing?
 Assessments done in the best interests of the client
 Informed Consent
 Protect the confidentiality of the test results
 Examiner's Competence and Experience
 Respecting current standards of care, e.g.:
  Communication of test results
  Consider individual differences
  Responsible report writing
 What factors affect test scores (i.e., outcomes of a psychological test)?
 Demand Characteristics
 Response Bias
 Characteristics of the population being assessed
 Social Desirability
 Misperception of Items
 Instruction Format
 Response Format
 Setting Variables
 Previous Testing experiences
 Reactive Effects

History of Psychological Assessment


 Historical Perspective
 ~2200 BC: China
 ~350 BC: Physiognomy (Aristotle)
 1810: Phrenology (Franz Joseph Gall)
 1880s: Galton
 “Hereditary Genius” (influenced by his cousin, Charles Darwin).
 Reaction time and sensory discrimination.
 1890s: Cattell
 “Mental Tests”
 Wundt and Experimental Psychology
 1905: Alfred Binet
 First modern intelligence test.
 Identify intellectually “weak” kids, in order to provide
appropriate education.
 1908/1911: Binet-Simon Scale (“mental age”)
 Intelligence is not a fixed quality.
 1916: Terman’s Stanford-Binet (IQ)
 Intelligence studied through higher psychological
processes rather than sensory processes.
 IQ tests used in eugenics movement.
 1910: Goddard
 Classifications: idiot, imbecile, feeble-minded
 “moron”
 Ellis Island
 1917: WWI – Army Alpha and Army Beta (US)
 ~1920-40: Leta Hollingworth
 Modern Intelligence Testing
 1939: David Wechsler
 Scores based on normal distribution.
 Verbal and performance scales.
 Less emphasis on verbal ability.
 Tests for adults and children.
 1917: Woodworth Personal Data Sheet
 First objective personality test.
 Developed during WWI to screen soldiers.
 Yes/no questions about physical symptoms, fears, worries, etc.
 Assumed: 1) people are aware of their own problems, and 2) they respond accurately.
 Personality Testing
 Projective Tests
 1921: Rorschach Inkblot Test
 1935: Thematic Apperception Test
 Responses to ambiguous stimuli reveal emotions, thought processes, etc. at an unconscious level.
 Objective Tests
 1942: MMPI (Hathaway & McKinley)
 Objective information:
 Independent of the clinician's view
 With validity scales (L, F, & K)
 Empirical keying
Week 1 Lectures, Part 2: Norms and Test Standardization
Norms
 Obtained by administering the test to a sample of people (presumably representative of a population)
 Distribution of scores obtained for that sample (norm group)
 Used to estimate a person’s standing on a test relative to similar others
 Provides standards for interpreting scores
 Requires careful use: must use correct norm group!
 Can take different forms:
 % rank
 Age equivalents
 Grade equivalents
 Standard score
 Such norm-referenced tests are different from criterion-referenced tests

Using the Wrong Norm Group…


Alan is a graduate student in philosophy at Athol University. Alan answered 219
items correctly, out of a total of 300 items, on a hypothetical test. Depending on
which norm group you use as the standard for comparison, we can say that Alan
did as well as or better than:
 99% of seventh graders in the Malone Public Schools
 92% of high school seniors at Athol High School
 91% of high school graduates in Worcester Academy
 85% of entering freshmen at Patricia Junior College
 70% of Lamia College’s philosophy majors
 55% of the University of Thessaloniki graduating seniors
 40% of graduate assistants at American College of Athens
 15% of English professors at the University College of London
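To make the comparison concrete, here is a minimal Python sketch of how one raw score earns a different percentile rank against different norm groups. All norm-group parameters below are invented for illustration.

    # Minimal sketch: one raw score, many percentile ranks.
    # The norm-group means/SDs are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical norm groups: label -> (mean, SD) of raw scores on the test
    norm_groups = {
        "seventh graders": (150, 30),
        "high school seniors": (185, 30),
        "philosophy majors": (210, 28),
        "English professors": (255, 22),
    }

    alan = 219  # Alan's raw score out of 300

    for label, (mean, sd) in norm_groups.items():
        sample = rng.normal(mean, sd, size=10_000)  # simulated norm sample
        pct = (sample < alan).mean() * 100          # % scoring below Alan
        print(f"Alan did as well as or better than {pct:.0f}% of {label}")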

 What are the Procedures for Interpreting Test Scores?


 Raw scores
  Most basic score
  Meaningless in isolation
 Making sense of raw scores
  Descriptive Statistics (numerical take-away)
   Central Tendency
    Describe the middle of the distribution
    Mean: average score
    Mode: most common score
    Median: middle score in the distribution
   Range & Standard Deviation (variability)
    Describe the spread in the distribution
    Range: high score minus low score
    Variance: average squared distance from the mean
    Standard deviation: average distance from the mean
  Frequency Distributions
   Provide a visual image of group data
   Orderly arrangement of a group of scores
   Show the number or percentage of observations that fall into a certain category or range
  Normal Distribution (the bell curve)
  Percentiles
   The value of a variable below which a certain percent of observations fall
   The percentile rank of a score is the percentage of scores in its frequency distribution that are the same or lower than it
 Linear Transformations - change the unit of measurement but do not change the characteristics of the raw data
  Percentages
  Standard deviation units
  Z scores
  T scores
 Area Transformations - change the unit of measurement and the unit of reference (rely on the normal curve)
  Percentile - percentage scoring below a raw score
  Stanine - all raw scores converted into a single digit
 Standard Deviations and Z-Scores
  Used to calculate confidence intervals and percentile ranks
  Within 1 SD on either side of the mean: ~68%
  Within 2 SD on either side of the mean: ~95%
  1 SD above the mean: ~84% of scores fall below
 The Normal Curve
  Bell-shaped & symmetrical (50% above and below the mean)
  A distribution of probabilities
  Total area under the curve always equals 1
  Extends infinitely in both directions, approaching zero
  Most scores cluster in the center
  34.1%, 13.6%, 2.1% on each side of the mean (the “magic numbers”: 68%, 95%, 99.7%)
 Skewness
  Normal distribution is preferred.
  Negatively skewed (items too easy).
  Positively skewed (items too difficult).
 Standard Scores (see the sketch after this list)
  Universally understood units in testing
  Express the distance from the mean in SD units
  Z-score: mean = 0; SD = 1
  T-score: mean = 50; SD = 10 (eliminates fractions)
  Percentile: describes the percentage of people who score below a particular raw score
 Correlation
  Positive correlation
   As X goes up, Y goes up
   As X goes down, Y goes down
  Negative correlation
   As X goes up, Y goes down
   As X goes down, Y goes up
  No correlation
   X and Y are independent (not related)
  Recall… Correlation ≠ Causation
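A minimal sketch of the standard-score conversions above, on invented raw scores: z-scores, T-scores, and percentile ranks.

    # Minimal sketch: raw scores -> z-scores, T-scores, percentile ranks.
    # The raw data are invented for illustration.
    import numpy as np

    raw = np.array([12, 15, 15, 18, 20, 22, 25, 25, 27, 31])

    z = (raw - raw.mean()) / raw.std()  # z-scores: mean 0, SD 1
    t = 50 + 10 * z                     # T-scores: mean 50, SD 10

    # Percentile rank: percentage of scores at or below each raw score
    pct = np.array([(raw <= x).mean() * 100 for x in raw])

    for r_, z_, t_, p_ in zip(raw, z, t, pct):
        print(f"raw={r_:2d}  z={z_:+.2f}  T={t_:4.1f}  percentile rank={p_:5.1f}")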
Reliability
 What is reliability and how is it measured?
  Reliability = consistency of measurement
  Measured using: Inter-Rater Reliability, Test-Retest Reliability, Parallel Forms Reliability, and Internal Consistency (Split Half, Cronbach's Alpha, KR-20)
 Inter-Rater Reliability
  Amount of consistency among scorers’ judgments.
  Two or more individuals score the same test.
  Appropriate when scoring involves some level of subjectivity.
  On tests requiring evaluative judgments, different scorers may give different scores.
  Inter-rater reliability is the correlation between their scores.
  Measure: Cohen’s Kappa
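A minimal sketch of Cohen's kappa computed by hand from two raters' category judgments (labels invented for illustration):

    # Minimal sketch: Cohen's kappa for two raters' judgments.
    from collections import Counter

    rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
    rater2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n

    # Chance agreement: product of each rater's marginal proportions, summed
    c1, c2 = Counter(rater1), Counter(rater2)
    p_chance = sum((c1[c] / n) * (c2[c] / n) for c in c1 | c2)

    kappa = (p_observed - p_chance) / (1 - p_chance)
    print(f"observed={p_observed:.2f} chance={p_chance:.2f} kappa={kappa:.2f}")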
 Test-Retest Reliability
 Similarity of test scores of the same people on different administrations.
 Test is administered to the same group on two different occasions.
 Must consider impact of test on test taker and practice effects.
 Correlate scores of single test on two different occasions
 The r is called the test-retest coefficient or coefficient of stability
 Only expect high coefficient of stability for constructs that are stable over time
 E.g., IQ vs. clinical syndromes
 Beware practice effects
 Coefficient also related to the time interval between tests
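As a sketch, the coefficient of stability is just the Pearson r between the two administrations (scores invented for illustration):

    # Minimal sketch: test-retest reliability as a Pearson correlation.
    import numpy as np

    time1 = np.array([98, 105, 110, 120, 87, 101, 115, 93])
    time2 = np.array([101, 103, 112, 118, 90, 99, 117, 95])

    r_stability = np.corrcoef(time1, time2)[0, 1]  # coefficient of stability
    print(f"test-retest r = {r_stability:.2f}")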
 Parallel Forms Reliability
  Create two or more forms of a test
  The r is called the parallel forms coefficient or coefficient of equivalence
 The two forms must be equivalent
 Now you have error from two sources: time and different forms
 Use only if you intend to use both forms
 Useful for minimizing practice effects
 Use Form A for pretest, and Form B for posttest
 E.g., when memory is a factor
 Internal Consistency
 Similarity of item scores within the same test.
  Administer the test to one group and compare item values.
  Appropriate for homogeneous tests.
  Needed if using a composite score.
 Split Half
  The split-half method takes arbitrary halves (e.g., odd and even items)
 Spearman-Brown
  Reducing the number of items reduces the reliability of the test, and vice versa
  To get the reliability of the test as a whole (assuming equal means and variances), use the Spearman-Brown prophecy formula:
   predict the reliability of a test after changing the test length
   estimate how many more items are needed
   assumes that the items to be added are “as good” as the items already in the scale
 Example of Internal Consistency
  You have a 20-item scale with ρx = .50. How many items would need to be added to increase the scale reliability to .70?
  k = ρK(1 − ρx) / [ρx(1 − ρK)]
  k = .70(1 − .50) / [.50(1 − .70)] = 2.33
  To reach ρK (.70), we will need 20 × k = 20 × 2.33 ≈ 47 items
  Add 27 new items to the existing 20 items
  Please note: This use of the formula assumes that the items to be added are “as good” as the items already in the scale
 Using the Spearman-Brown Formula
  rxx = nr / (1 + (n − 1)r)
  rxx = the estimated reliability of the test
  n = the number of questions in the revised version divided by the number of questions in the original version of the test
  r = the calculated correlation coefficient for the two short forms of the test
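A minimal sketch of the formula in both directions, reproducing the worked example above (k = 2.33, about 47 items):

    # Minimal sketch of the Spearman-Brown prophecy formula, both directions:
    # predicting reliability after lengthening, and solving for the length
    # factor k needed to reach a target reliability.
    def spearman_brown(r, n):
        """Predicted reliability when the test is made n times as long."""
        return n * r / (1 + (n - 1) * r)

    def length_factor(r_current, r_target):
        """Factor k by which the test must be lengthened to reach r_target."""
        return r_target * (1 - r_current) / (r_current * (1 - r_target))

    k = length_factor(0.50, 0.70)                         # 2.33, as in the example
    print(f"k = {k:.2f} -> {round(20 * k)} items total")  # about 47 items
    print(f"check: {spearman_brown(0.50, k):.2f}")        # back to .70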
 Important notes on the Split-Half Method
 Works only for tests with homogeneous content
 Different arbitrary halves may give different r values: How can we know which half to take?
 Wherever we have uncertainty, measure many times and average (e.g., take several
arbitrary halves and average their correlation)
 A better method would be to take all possible split halves and average their values
 Cronbach’s Alpha
 Mathematically equivalent to the average of all possible split-half estimates
 Computes correlations between every pair of items
 The most common of the internal consistency measures
 ranges from 0 to 1
 Higher inter-item corr = higher α
 Need at least 2 items to compute α
 Formula:
  rα = [k / (k − 1)] × [1 − (Σσ²i) / σ²]
  rα = coefficient alpha estimate of reliability
  k = number of questions on the test
  σ²i = variance of the scores on one question (summed over all questions)
  σ² = variance of all the test scores
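A minimal sketch of the computation on an invented persons-by-items score matrix:

    # Minimal sketch of coefficient alpha (invented data: 5 people x 4 items).
    import numpy as np

    scores = np.array([
        [3, 4, 3, 4],
        [2, 2, 3, 2],
        [4, 5, 4, 4],
        [1, 2, 2, 1],
        [3, 3, 4, 3],
    ])  # rows = test takers, columns = items

    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores

    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(f"alpha = {alpha:.2f}")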
 KR-20
 Kuder-Richardson 20: Like Cronbach’s, only for dichotomous variables (e.g., true/false)
 Formula:
  rKR20 = [k / (k − 1)] × [1 − (Σpq) / σ²]
  rKR20 = Kuder–Richardson formula 20 reliability coefficient
  k = number of questions on the test
  p = proportion of test takers who gave the correct answer to a question
  q = proportion of test takers who gave an incorrect answer to that question (q = 1 − p)
  σ² = variance of all the test scores
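The same computation for dichotomous items, on invented 0/1 data:

    # Minimal sketch of KR-20 for dichotomous (0/1) items; invented data.
    import numpy as np

    answers = np.array([
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 1, 1],
        [0, 1, 0, 0, 0],
    ])  # rows = test takers, columns = items (1 = correct)

    k = answers.shape[1]
    p = answers.mean(axis=0)                     # proportion correct per item
    q = 1 - p                                    # proportion incorrect per item
    total_var = answers.sum(axis=1).var(ddof=1)  # variance of total scores

    r_kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
    print(f"KR-20 = {r_kr20:.2f}")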
 What is classical test theory?
 Classical Test Theory
 No instrument is perfectly reliable or consistent.
 The reliability of a test is the extent to which it can be relied upon to produce ‘true’ scores.
 All test scores contain some error (X = T + e)
 X = observed score
 T = true score
 e = random error
 Assumptions:
 Errors are random.
 The true score will not change with repeated measurement.
 The distribution of errors will be the same for everyone.
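A minimal simulation of X = T + e under these assumptions (parameters invented); it anticipates the point below that reliability is the proportion of true-score variance in observed scores:

    # Minimal sketch: simulate X = T + e and recover reliability as the
    # proportion of observed variance that is true-score variance.
    import numpy as np

    rng = np.random.default_rng(42)

    n = 10_000
    true = rng.normal(100, 15, n)   # T: stable true scores
    error = rng.normal(0, 5, n)     # e: random error, mean 0
    observed = true + error         # X = T + e

    reliability = true.var() / observed.var()            # var(T) / var(X)
    print(f"estimated reliability = {reliability:.2f}")  # ~ 225/250 = 0.90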
 What is error within test scores, and how does it relate to reliability?
 Error
 Random error: unexplained difference between the true (T) and obtained scores (X) (See Sources of
Error slides later on).
 Lowers reliability.
 Systematic error: error that increases or decreases the true score by same amount each time (e.g., test
of Extraversion also measures Anxiety).
 Represents a validity problem.
 Does not lower reliability - test is reliably inaccurate every time.
 How do you estimate reliability?
 Reliability is a ratio: Proportion of truth in the measure.
 Will always range between 0-1.
 E.g., Reliability of .8 = 80% true and 20% error.
 Can never be measured, only estimated.
 The Reliability Coefficient
 Correlation provides an index of the strength and direction of the linear relationship between two
variables.
 Correlation coefficient = r
 Reliability coefficient = rxx
 (varies from 0.0 if no reliability & lots of error; 1.0 if no error & perfect reliability)
 Standard Error of Measurement
  How much an observed test score is likely to differ from the true test score.
  Reliability allows us to estimate measurement error in standardized units.
  Inverse relation between reliability and error
   The lower the reliability rxx, the higher the SEM
  Direct relation between error and Sobserved
   The wider the distribution of scores you have, the larger the error on any individual score is going to be
 Calculating Standard Error of Measurement
  SEM = σ√(1 − rxx)
   SEM = standard error of measurement
   σ = standard deviation of one administration of the test scores
   rxx = reliability coefficient
 Calculating 95% Confidence Interval
  95% CI = X ± 1.96(SEM)
   95% CI = the 95% confidence interval
   X = an individual’s observed test score
   ±1.96 = the two points on the normal curve that include 95% of the scores
   SEM = the standard error of measurement for the test
 Example of Standard Error of Measurement
  IQ tests have a mean of 100 and an SD of 15. One test has a reliability of 0.89. You score 130 on that IQ test. What is the standard error of measurement? What is the 95% confidence interval on your IQ?
  SEM = 15√(1 − 0.89) = 4.97
  95% CI = 130 ± (1.96)(4.97) = 130 ± 9.74
  We can be 95% confident that your true IQ is between 120.26 (130 − 9.74) and 139.74 (130 + 9.74)
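A minimal sketch reproducing the worked example above:

    # Reproduce the IQ example: SEM and 95% CI.
    import math

    def sem(sd, r_xx):
        """Standard error of measurement: SD * sqrt(1 - reliability)."""
        return sd * math.sqrt(1 - r_xx)

    def ci95(score, sem_value):
        """95% confidence interval around an observed score."""
        margin = 1.96 * sem_value
        return score - margin, score + margin

    s = round(sem(15, 0.89), 2)   # 4.97, rounded as in the example
    low, high = ci95(130, s)      # 120.26 to 139.74
    print(f"SEM = {s}; 95% CI = ({low:.2f}, {high:.2f})")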
 Factors that Influence Reliability
 Sources of error that can increase/decrease reliability of test
 Test itself
 Too easy or too difficult
 Trick or ambiguous questions
 Not enough questions
 Poorly written questions
 Reading level higher than the reading level of target population
 Test administration
 Unstandardized administration (not following instructions)
 Disturbances during the test period
 Answering test takers’ questions inappropriately
 Other environmental factors, e.g., temperature
 Test scoring
 Not scoring according to instructions
 Errors in judgment
 Errors calculating test scores
 Test takers
 Fatigue
 Illness
 Prior exposure to test
 Guessing or Faking
 A million other things…
 Stability of Construct

 How reliable is reliable enough? It depends!
 Reliability is not fixed
 Tests will “perform” differently with different samples and populations.
 Type of test
 E.g., Personality tests tend to have lower alphas than achievement and aptitude tests.
 Type of reliability coefficient
 Test-retest and inter-rater tend to be lower than parallel forms and internal consistency.
 Purpose of test
 .7 is often cited as the lower bound of acceptable reliability, but .8+ is better and sometimes necessary, depending on test use.
