
Tests & Measures

Chapter 1: Introduction
Learning Objectives

- Define the basic terms pertaining to psychological and educational tests


- Distinguish between an individual test and a group test
- Define the terms achievement, aptitude, and intelligence and identify a concept that can encompass all three
terms
- Distinguish between ability tests and personality tests
- Define the term structured personality test
- Explain how structured personality tests differ from projective personality tests
- Explain what a normative or standardization sample is and why such a sample is important
- Identify the major developments in the history of psychological testing
- Explain the relevance of psychological tests in contemporary society

Standardized tests tend to disadvantage women, test takers whose parents have lower incomes and levels of education,
and ethnic minorities

TEST – a measurement device or technique used to quantify behaviour or aid in the understanding and prediction of
behaviour.

- A test measures only a sample of behaviour, and error is always associated with a sampling process.
- Test scores are not perfect measures of a behaviour or characteristic, but they do add significantly to the
prediction process
- The meaning of test scores can change dramatically, depending on how a well defined sample of individuals
scores on a test

ITEM – a specific stimulus to which a person responds overtly; this response can be scored or evaluated (classified,
graded on a scale, or counted)

- Items are the specific questions or problems that make up a test

PSYCHOLOGICAL TEST – or educational test is a set of items that are designed to measure characteristics of human beings
that pertain to behaviour.

- There are many types of behaviour


o Overt behaviour – an individual’s observable activity
▪ Some tests attempt to measure the extent to which someone might engage in or emit a
particular overt behaviour
o Covert behaviour – it takes place within an individual and cannot be directly observed (your feelings and
thoughts)
▪ Some tests attempt to measure this behaviour

TRAITS – enduring characteristics or tendencies to respond in a certain manner

SCALES – relate raw scores on test items to some defined theoretical or empirical distribution

INDIVIDUAL TESTS – those that can be given to only one person at a time (the same way a psychotherapist sees only one
person at a time)
GROUP TEST – can be administered to more than one person at a time by a single examiner (an instructor gives everyone in
the class a test at the same time)

TEST ADMINISTRATOR – the person giving the test

ABILITY TESTS – measure skills in terms of speed, accuracy, or both

Tests can also be categorized by the behaviour they measure. Historically experts distinguish among achievement,
aptitude, and intelligence as different types of ability.

- ACHIEVEMENT – previous learning, a test that measures or evaluates how many words you can spell correctly is
called a spelling achievement test
- APTITUDE – potential for learning or acquiring a specific skill, a spelling aptitude test measures how many words
you might be able to spell given a certain amount of training, etc.
- INTELLIGENCE – a person’s general potential to solve problems, adapt to changing circumstances, think abstractly,
and profit from experience

HUMAN ABILITY – encompasses the considerable overlap between achievement, aptitude, and intelligence

There is a clear distinction between ability and personality tests. ABILITY TESTS are related to capacity or potential

PERSONALITY TESTS – related to the overt and covert dispositions of the individual – measure typical behaviour

There are several types of personality tests

- STRUCTURED PERSONALITY TESTS – provide a statement, usually of the “self-report” variety, and require the subject
to choose between two or more alternative responses
- PROJECTIVE PERSONALITY TESTS – either the stimulus (test materials) or the required response – or both – are
ambiguous

Psychological testing – refers to all the possible uses, applications, and underlying concepts of psychological and
educational tests.

- The main use is to evaluate individual differences or variations among individuals


- Measure individual differences in ability and personality and assume that the differences shown on the test
reflect actual differences or variations among individuals

Principles of Psychological Testing


- The basic concepts and fundamental ideas that underlie all psychological and educational tests

RELIABILITY – refers to the accuracy, dependability, consistency, or repeatability of test results


- The degree to which test scores are free of measurement error

VALIDITY – refers to the meaning and usefulness of test results

- The degree to which a certain inference or interpretation based on a test is appropriate

Applications of Psychological Testing

INTERVIEW – a method of gathering information through verbal interaction, such as direct questions

- Major technique of gathering psychological information

Historical Perspective
- Most major developments occurred over the last century in the US
- The Chinese had a relatively sophisticated civil service testing program 4000 years ago – every third year, oral
examinations were given to help determine work evaluations and promotion decisions
- By the Han Dynasty (206 BCE-220 CE) test batteries were common
o TEST BATTERIES – two or more tests used in conjunction
- The English learned it from the Chinese and adopted it; the French followed, and the US established the American
Civil Service Commission in 1883

Charles Darwin published The Origin of Species in 1859

- an important step toward understanding individual differences


- Sir Francis Galton, a relative of Darwin, published Hereditary Genius in 1869
o Applied Darwin’s theories to humans and tried to show that some characteristics made some people
more fit than others
- Galton’s work was extended by James McKeen Cattell, who coined the term mental test

STANDARD CONDITIONS – precisely the same instructions and format

REPRESENTATIVE SAMPLE – one that comprises individuals similar to those for whom the test is to be used

Mental age – introduced in the revised 1908 Binet-Simon Scale; expresses a person's performance in terms of the age
group whose average abilities it matches, which may differ from chronological age

World War I - the Army needed testing to deal with the influx of recruits. Robert Yerkes, President of the American
Psychological Association, recruited psychologists and created two structured tests of human abilities: the Army Alpha,
which required reading ability, and the Army Beta, which measured the intelligence of illiterate adults

Woodworth Personal Data Sheet – first structured personality test

Rorschach inkblot test -

Thematic Apperception Test (TAT) – Henry Murray and Christina Morgan, used ambiguous pictures and required the
subject to make up a story about the scene (personality test)

MMPI 2 – Minnesota Multiphasic Personality Inventory – used empirical methods to determine the meaning of a test
response – currently the most widely used and referenced personality test

Factor Analysis – a method of finding the minimum number of dimensions, called factors, to account for a large number
of variables
Chapter 2: Norms and Basic Statistics for Testing
Learning Objectives

- Discuss three properties of scales of measurement


- Determine why properties of scale are important in the field of measurement
- Identify methods for displaying distribution scores
- Calculate the mean and standard deviation for a set of scores
- Define a Z score and explain how it is used
- Relate the concepts of mean, standard deviation, and Z score to the concept of a standard normal distribution
- Define quartiles, deciles, and stanines and explain how they are used
- Tell how norms are created
- Relate the notion of tracking to the establishment of norms

Tests are devices used to translate observations into numbers

Statistical methods serve two important purposes in the quest for scientific understanding

1. Used for purposes of description, numbers provide summaries


2. We can use it to make inferences

INFERENCES are logical deductions about events that cannot be observed directly

- Ex, can’t know how many people watched a movie but can use a sample to infer a percentage of people who
saw the film

Data gathering and analysis

1. EXPLORATORY DATA ANALYSIS – John Tukey, gathering and displaying clues


2. CONFIRMATORY DATA ANALYSIS – the clues are evaluated against rigid statistical rules (like judges and juries)

DESCRIPTIVE STATISTICS – are methods used to provide a concise description of a collection of quantitative information

INFERENTIAL STATISTICS – are methods used to make inferences from observations of a small group of people known as a
sample to a larger group of individuals known as a population

MEASUREMENT – the application of rules for assigning numbers to objects

Properties of Measurement Scales


MAGNITUDE – the property of “moreness”. A scale has the property of magnitude if we can say that a particular instance of
the attribute represents more, less, or equal amounts of the given quantity than does another instance. (e.g., if the
attribute is height and we can say John is taller than Fred, the scale has the property of magnitude)

EQUAL INTERVALS – a scale has the property of equal intervals if the difference between two points at any place in the
scale has the same meaning as the difference between two other points that differ by the same number of scale units

ABSOLUTE 0 (ZERO) – an absolute 0 is obtained when nothing of the property being measured exists, e.g. heart rate: you
can have a heart rate of 0 (dead) but you can’t have an intelligence of 0

- Extremely difficult if not impossible for many psychological qualities to define an absolute zero
Types of Scales
NOMINAL SCALES – not really scales at all, only purpose is to name objects (baseball jersey number)

- When attached to a category, most statistical procedures are not meaningful (ie, 1=red, 2=blue, what would a
mean of 1.87 signify?)

ORDINAL SCALES – a scale with a property of magnitude but not equal intervals or an absolute 0

- Allows you to rank but tells you nothing about the difference between the ranks
- For most problems in psychology the precision to measure the exact differences between intervals does not
exist – so most often ordinal scales are used

INTERVAL SCALE – has the property of magnitude and equal intervals but not absolute zero

- Example: temperature in degrees Fahrenheit – 0 degrees does not mean an absence of temperature

RATIO SCALE – has all 3 properties, magnitude, equal intervals, and an absolute 0

- Example, speed of travel, 0 km per hour is no speed, and 60km per hour is twice as fast as 30km per hour

- A single test score means more if one relates it to another test score.
- A distribution of scores summarizes the scores for a group of individuals

FREQUENCY DISTRIBUTIONS – displays scores on a variable or a measure to reflect how frequently each value was obtained

- Define all the possible scores and determine how many people obtained each of those scores
- Most distributions of scores are bell shaped
- Positive skew – the tail goes off toward the higher or positive end of the X axis
- CLASS INTERVAL – when you draw a frequency distribution you must decide on the width of the class intervals
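
The two steps above (tally how often each score occurs, then group scores into class intervals) can be sketched roughly as follows. The ten scores and the interval width of 5 are invented for illustration and do not come from the notes.

```python
from collections import Counter

# Hypothetical raw scores, invented for illustration only.
scores = [3, 5, 5, 6, 7, 7, 7, 8, 9, 12]

# Frequency distribution: how many people obtained each score.
frequencies = Counter(scores)

# Class intervals of width 5 (an arbitrary choice): each score is filed
# under the lower bound of its interval (0-4, 5-9, 10-14).
width = 5
interval_counts = Counter((s // width) * width for s in scores)

# Print a crude text histogram, one row per class interval.
for lower in sorted(interval_counts):
    print(f"{lower}-{lower + width - 1}: {'*' * interval_counts[lower]}")
```

Choosing a different width changes how coarse or fine the picture of the distribution looks, which is why the width has to be decided before drawing it.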

PERCENTILE RANKS – replace a simple rank when we want to adjust for the number of scores in a group

- What percentage of the scores fall below a particular score (cases below the case of interest)
- The formula is:
$$P_r = \frac{B}{N} \times 100 = \text{percentile rank of } X_i$$
where $B$ is the number of scores below $X_i$ (the score of interest) and $N$ is the total number of scores
- You form a ratio of the number of cases below the score of interest and the total number of scores
- The ratio B/N can never exceed 1, so the percentile rank can never exceed 100
- Measure of relative performance
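
A minimal sketch of the percentile rank calculation, assuming a hypothetical set of ten class marks (the function name and data are illustrative only):

```python
def percentile_rank(scores, score_of_interest):
    """Percentage of scores that fall below the score of interest (B / N * 100)."""
    below = sum(1 for s in scores if s < score_of_interest)  # B
    return below / len(scores) * 100                         # N = len(scores)

# Hypothetical class marks, invented for illustration only.
marks = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# Five of the ten marks are below 80, so the percentile rank is 50.0.
rank_of_80 = percentile_rank(marks, 80)
print(rank_of_80)
```

The same mark would earn a different percentile rank in a stronger or weaker group, which is what makes it a measure of relative performance.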

PERCENTILES are the specific scores or points within a distribution


- Percentiles divide the total frequency for a set of observations into hundredths
- Not the same as percentile ranks, instead, indicate the particular score below which a defined percentage of
scores falls

Describing Distributions
MEAN – arithmetic average

$$\bar{X} = \frac{\sum X}{N}$$

$\bar{X}$ = “X bar”, which is the mean

$\sum$ = Greek letter sigma, means sum or add the scores together

X = a variable that takes on different values

$X_i$ = each value represents a raw score, also called an obtained score

N = the number of cases

STANDARD DEVIATION – an approximation of the average deviation around the mean

- Mean doesn’t tell you anything about variability, mean can be the same for sets of scores but they could vary
greatly.
- One way to measure variability is to subtract the mean from each score, $(X - \bar{X})$, and then total the deviations;
each deviation is shown as lower case x, so $x = (X - \bar{X})$
o Sum of the deviations around the mean will always equal 0
- To avoid this you square all the deviations around the mean to get rid of negatives
- Then obtain the average squared deviation around the mean- variance

The square root of the variance is the standard deviation, σ; that is, the square root of the average squared deviation
around the mean

$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$
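
The steps from mean to deviations to variance to standard deviation can be traced with a small worked sketch. The eight scores are hypothetical, chosen so the arithmetic comes out evenly.

```python
from math import sqrt

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)

mean = sum(scores) / n                          # 40 / 8 = 5.0
deviations = [x - mean for x in scores]         # x = X - mean
sum_of_deviations = sum(deviations)             # always 0, so it says nothing

# Squaring removes the negatives; averaging gives the variance.
variance = sum(d ** 2 for d in deviations) / n  # 32 / 8 = 4.0
sd = sqrt(variance)                             # 2.0

print(mean, sum_of_deviations, variance, sd)
```

This divides by N, matching the "average squared deviation" in the notes; when estimating a population value from a sample, N - 1 is commonly used in the denominator instead.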

Z SCORE – the difference between a score and the mean, divided by the standard deviation

$$Z = \frac{X_i - \bar{X}}{S}$$
where S is the standard deviation of the scores

MCCALL’S T

Same as Z score but the mean is 50 rather than 0 and Standard Deviation is 10 rather than 1

T = 10Z + 50
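
A short sketch of both transformations, assuming a hypothetical distribution with a mean of 5 and a standard deviation of 2 (the function names are illustrative):

```python
def z_score(x, mean, sd):
    """Distance of a score from the mean, in standard deviation units."""
    return (x - mean) / sd

def mccall_t(z):
    """Rescale a Z score to a mean of 50 and a standard deviation of 10."""
    return 10 * z + 50

# Hypothetical distribution: mean 5, standard deviation 2.
z = z_score(9, mean=5.0, sd=2.0)  # two standard deviations above the mean
t = mccall_t(z)

print(z, t)  # 2.0 70.0
```

A score exactly at the mean gives Z = 0 and T = 50, which is the only difference between the two systems: the same information on a scale without negatives or small decimals.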

QUARTILES are points that divide the frequency distribution into equal fourths

- First is 25th percentile (Q1), the second quartile is the median or 50th percentile (Q2), and the 3rd is the 75th
percentile (Q3)
- The INTERQUARTILE RANGE is the interval of scores bounded by the 25th and 75th percentile

DECILES divide the distribution into tenths; thus D9 is the point below which 90% of the cases fall
STANINE SYSTEM – converts any set of scores into a transformed scale which ranges from 1-9
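
As a rough illustration of quartiles and deciles, the sketch below uses `statistics.quantiles` from the Python standard library on hypothetical scores 1 through 11. The function's default "exclusive" method is only one of several conventions for locating cut points, so other software may give slightly different values.

```python
from statistics import quantiles

# Hypothetical scores 1..11, invented for illustration only.
scores = list(range(1, 12))

# n=4 returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = quantiles(scores, n=4)

# Width of the interquartile range, bounded by the 25th and 75th percentiles.
iqr_width = q3 - q1

# n=10 returns nine cut points; the last one is D9.
d9 = quantiles(scores, n=10)[8]

print(q1, q2, q3, iqr_width, d9)
```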

NORMS – refer to the performances by defined groups on particular tests

- Obtained by administering the test to a sample of people


- Mean and 50th percentile are norms

NORM-REFERENCED TEST – compares each person with a norm

CRITERION-REFERENCED TEST – describes the specific types of skills, tasks, or knowledge that the test taker can
demonstrate such as mathematical skills
Chapter 3: Reliability

Abraham De Moivre – the basic notion of sampling error

Karl Pearson – the product moment correlation

Charles Spearman – advanced development of reliability assessment

- Reliability puts sampling error and product moment correlation together

Test-retest – consider the consistency of the test results when the test is administered on different occasions

Parallel forms (equivalent forms) – evaluate the test on different forms of the test that measure the same attribute

Internal consistency – how people perform on similar subsets of items selected from the same form of the measure

Split-half measure – a test is given and divided into halves that are scored separately

- Spearman-Brown formula – allows you to estimate what the correlation between the 2 halves would have been if each half
had been the length of the whole test
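
The notes do not write the formula out; the sketch below assumes the standard Spearman-Brown correction for a test split into two halves, r_full = 2r / (1 + r), with a made-up half-test correlation of .5.

```python
def spearman_brown(r_half):
    """Estimate full-length reliability from the correlation between two halves.

    Assumes the standard two-half form of the Spearman-Brown formula.
    """
    return 2 * r_half / (1 + r_half)

# Hypothetical correlation between the two halves.
full_length_estimate = spearman_brown(0.5)

print(round(full_length_estimate, 3))  # 0.667
```

The corrected value is higher than the half-test correlation because each half is only half as long as the real test, and shorter tests tend to be less reliable.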

KR20 – Kuder-Richardson formula 20, estimates the internal consistency of tests in which the items are scored as 0 or 1

Cronbach – coefficient alpha, estimates the internal consistency of tests in which the items are not scored as 0 or 1
Class Notes
(DOPE)

Determinism – the world is knowable

Order – the world has structure and patterns, operates by laws

- Sierpinski’s Triangle

Parsimony – simplest explanation is most likely correct (Occam’s Razor)

Empiricism – truth is revealed through observation, can’t trust your eyes, data is the only way to know

- Hypothesis – a hunch or best guess


- Theory – collection of interrelated statements, live and die with evidence (empiricism), must be falsifiable

Relativism – there is no ultimate truth just what is true for each perceiver (Protagoras)

- Socrates disagreed (can’t teach wisdom if there isn’t any)


- Heraclitus – river in flux, you can never step in the same river twice. Since everything is in flux you can never
know anything
- Xenophanes- if oxen and horses could sculpt their gods they would look like oxen and horses
- Gorgias – nothing exists and even if it did you can’t understand it, and even if you could you can’t communicate it

Platonic forms – Plato sought to fix relativism, we can know things – the number 6 exists, we can have 6 things

- Believed there is a universal standard of beauty, what exists in the world around us is just a shadow of a perfect
form that we can’t know

Measurement Theory
(X=T+E)

- Every measurement is to some degree imperfect, is there a master ruler template?


- True Score + Error, anything besides real difference is measurement error
- Philosophy, we will never know T
- We need the OPERATIONAL DEFINITION, we need to define things to be able to measure them. How you
measure something depends on how it is defined
- Systematic – same error happens every time the instrument is used
- Random – can’t predict, people come in with a different psychology that impacts the test and increases error
- (X = T + Es + Er)

Psychophysics was the starting point for turning philosophy into a science, which became psychology.

Statistics – 3 lessons to know – mean, variances, and the proportion of explained variance

- Mean
- Variance
- Explained variance- some differences are explainable, sometimes no change can be alarming (ie, we expect kids
to grow up), but some we just can’t explain

Errors
- Explainable: systematic factors that produce systematic changes in scores (height in kids)
o Learning, training, growth, fatigue
- Unexplainable: unsystematic factors that produce random changes in scores
o Marks in school are sometimes higher or lower, with no consistent pattern or order

Reliability (explained) vs. Unreliability (unexplained)

- Unreliability then represents the extent of unexplained or unsystematic variation in scores of a person on some
trait/ability when that trait/ability is repeatedly measured

o Not necessarily the same measurement device

o Reliability is the extent of systematic variation in scores of one person on some trait/ability

▪ Reliability is the ratio of the variance of T to the variance of X (true variance divided by test variance)

Test variance = true variance + error variance

$\sigma^2_{Obs} = \sigma^2_X = \sigma^2_T + \sigma^2_E$

$SD^2_{Obs} = SD^2_X = SD^2_T + SD^2_E$

when $SD^2_E = 0$, $SD^2_X = SD^2_T$

when $SD^2_E = 0$, $SD^2_T / SD^2_X = 1$ (range: 0 to 1)
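
A small numerical sketch of the decomposition. In practice T and E are never observed separately; the true scores and errors below are invented, and deliberately constructed to be uncorrelated so that the variances add exactly.

```python
def pvariance(values):
    """Average squared deviation around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical true scores and errors, built to be uncorrelated.
true_scores = [90, 90, 110, 110]
errors = [-5, 5, -5, 5]
observed = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

var_t = pvariance(true_scores)  # 100.0
var_e = pvariance(errors)       # 25.0
var_x = pvariance(observed)     # 125.0 = 100.0 + 25.0

reliability = var_t / var_x     # 0.8
print(var_t, var_e, var_x, reliability)
```

Shrinking the errors toward zero pushes the ratio toward 1; letting error variance dominate pushes it toward 0.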

Sigma – any Greek letter refers to the true value that’s out there that we won't know

Different ways of measuring reliability

- Internal consistency – analyze same instrument and look for comparable items
- Split-half reliability – divide a single test into two halves (odd/even – most common)

Cronbach’s Alpha (internal consistency)

- Standard deviation squared is variance


- .7 is good enough - .8 is great
- Can find misbehaving items, take them out and do it again
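
The notes do not give the formula, so the sketch below assumes the standard form of coefficient alpha, k/(k-1) × (1 − sum of item variances / variance of total scores). The response matrix is hypothetical: four respondents answering three items on a 1 to 5 scale.

```python
def pvariance(values):
    """Standard deviation squared: average squared deviation around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                                   # number of items
    item_columns = list(zip(*rows))                    # one tuple per item
    item_variance_sum = sum(pvariance(col) for col in item_columns)
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

# Hypothetical responses, invented for illustration only.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 4, 5],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))  # about 0.923
```

To look for a misbehaving item, one option is to drop each item column in turn and recompute alpha; an item whose removal raises alpha noticeably is a candidate for removal.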

Kuder-Richardson - Dichotomous Measures (having only 2 possible values, yes/no or t/f)

- KR20 - For each item what percentage of the people got it right or endorsed it in a particular direction, and the
percentage that got it wrong, and the overall variance of the test
- Uses each item
o Sum of p x q for each item (p1*q1+p2*q2+…)
- KR21 uses an average of each item (mean of p and mean of q)
- P is percentage correct and q is percentage incorrect (q is 1-p)
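
A sketch of both formulas, assuming their standard forms (the notes list the ingredients but not the full expressions): KR20 = k/(k-1) × (1 − Σpq / variance of total scores), and KR21 replaces Σpq with k × mean p × mean q. The 0/1 response matrix is hypothetical.

```python
def pvariance(values):
    """Average squared deviation around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def kr20(rows):
    """KR20: uses p * q for every item separately."""
    k, n = len(rows[0]), len(rows)
    p = [sum(col) / n for col in zip(*rows)]       # proportion correct per item
    sum_pq = sum(pi * (1 - pi) for pi in p)        # q = 1 - p
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum_pq / total_variance)

def kr21(rows):
    """KR21: uses the mean p and mean q across items instead."""
    k, n = len(rows[0]), len(rows)
    p_bar = sum(sum(row) for row in rows) / (n * k)
    q_bar = 1 - p_bar
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - k * p_bar * q_bar / total_variance)

# Hypothetical right/wrong answers: five people, four items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

kr20_value = kr20(answers)
kr21_value = kr21(answers)
print(round(kr20_value, 3), round(kr21_value, 3))  # about 0.8 and 0.667
```

KR21 comes out lower here because the items differ in difficulty; the two agree only when every item has the same p.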

Saupe’s Quickie

- $$\text{Reliability} = 1 - \frac{.19 \times \text{number of items}}{SD^2}$$
- .19 is roughly 1/5, so the numerator is about 1/5th of the number of items
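
A quick sketch of the estimate, assuming a hypothetical 50-item test with a standard deviation of 10 (the function name is illustrative):

```python
def saupe_quickie(number_of_items, sd):
    """Quick reliability estimate: 1 - (.19 * number of items) / SD squared."""
    return 1 - (0.19 * number_of_items) / sd ** 2

# Hypothetical test: 50 items, standard deviation of 10.
estimate = saupe_quickie(50, 10)

print(round(estimate, 3))  # 0.905
```

Only the test length and the spread of total scores are needed, which is what makes it quick; it is a shortcut, not a replacement for an item-level analysis.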

Standard error of measurement (the "give or take" in stats speak): how much wiggle there is around any point

- $SEM = SD_X \times \sqrt{1 - r_{xx}}$
- $r_{xx}$ is the reliability; 1 minus the reliability is the unreliability
- If the answer is 2.5, and the person scored 50, the true score is not likely to be less than 47.5 or more than 52.5
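
A worked sketch that reproduces a standard error of 2.5, assuming a hypothetical test with a standard deviation of 5 and a reliability of .75:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

# Hypothetical test: SD of 5, reliability of .75.
sem = standard_error_of_measurement(5, 0.75)  # 5 * sqrt(0.25) = 2.5

# Give-or-take band around an observed score of 50.
observed = 50
band = (observed - sem, observed + sem)

print(sem, band)  # 2.5 (47.5, 52.5)
```

A perfectly reliable test (reliability of 1) would have a standard error of 0, and a completely unreliable one would have a standard error equal to the full standard deviation.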

Item Response Theory

- Item difficulty = how much of the trait is needed to have a 50% chance of answering the item correctly
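
The notes do not name a specific model. As one possible illustration, the sketch below assumes a one-parameter logistic item response function, in which the probability of a correct answer depends only on the gap between the person's trait level and the item's difficulty; the parameter names are hypothetical.

```python
from math import exp

def probability_correct(trait_level, item_difficulty):
    """One-parameter logistic item response function (an assumed model)."""
    return 1 / (1 + exp(-(trait_level - item_difficulty)))

# When the trait level equals the item difficulty, the chance is exactly 50%.
at_difficulty = probability_correct(1.0, 1.0)

# More of the trait raises the chance; less of it lowers the chance.
above = probability_correct(2.0, 1.0)
below = probability_correct(0.0, 1.0)

print(at_difficulty, round(above, 3), round(below, 3))
```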

Validity
- Reliability is consistency – the property is repeatable
- Validity (what a score means; the extent to which a test measures what it claims to) is a much more difficult and
complex issue
- Much harder on pieces we can’t see – personality, leadership, etc.
- An invalid test can still be reliable, but an unreliable test can never be valid

Divided into 3 types (hinges on research)

- Content - Degree to which questions, tasks, items on a test are representative of the universe of behaviour the
test was designed to measure (grade 3 spelling); hard with an ill-defined trait
o Face Validity – items are valid if they look valid
- Criterion-related - A test is shown to be effective in estimating one’s performance on an outcome measure
o If the test is valid we can discriminate between those who will or won’t
o Concurrent: scores obtained simultaneously
o Predictive: criterion obtained months or years later. Test scores are used to estimate outcome measures
obtained at a later date, e.g. using high school marks to predict whether you will graduate from university,
which is used to determine admission
- Construct - The most difficult and elusive form of validity
o No single external referent is sufficient
o A network of interlocking suppositions can be derived from existing theory
o Boils down to rational argument: this is what I expect to measure, this is what it will show, and this is
what we expect to see, etc.

- Convergent Validity: test correlates with other tests with which it overlaps

o Two tests of intelligence should be correlated

- Divergent Validity: test does not correlate with tests from which it should differ

o Social interest and intelligence are unrelated
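
Convergent and divergent validity are usually checked with correlations. The sketch below computes a Pearson product moment correlation by hand on invented scores for five people: two hypothetical intelligence tests and one hypothetical social interest measure.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product moment correlation between two lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)

# Hypothetical scores for five people, invented for illustration only.
intelligence_test_a = [1, 2, 3, 4, 5]
intelligence_test_b = [2, 4, 5, 4, 5]
social_interest = [3, 1, 4, 1, 3]

convergent = pearson_r(intelligence_test_a, intelligence_test_b)  # high
divergent = pearson_r(intelligence_test_a, social_interest)       # near zero

print(round(convergent, 2), round(divergent, 2))
```

A pattern like this (a high correlation where overlap is expected, a near-zero one where it is not) is the kind of evidence that feeds the rational argument behind construct validity.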
