
CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT

TESTING AND ASSESSMENT
- Roots can be found in early twentieth-century France: in 1905, Alfred Binet published a test designed to help place Paris schoolchildren.
- WWI: the military used tests to screen large numbers of recruits quickly for intellectual and emotional problems.
- WWII: the military depended even more on tests to screen recruits for service.

PSYCHOLOGICAL TESTING vs. PSYCHOLOGICAL ASSESSMENT
DEFINITION
- Testing: the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.
- Assessment: the gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished through the use of tools of evaluation.
OBJECTIVE
- Testing: to obtain some gauge, usually numerical in nature.
- Assessment: to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation.
PROCESS
- Testing: may be individualized or group.
- Assessment: typically individualized.
ROLE OF EVALUATOR
- Testing: the tester is not key to the process and may be substituted.
- Assessment: the assessor is key in the process of selecting tests as well as in drawing conclusions.
SKILL OF EVALUATOR
- Testing: requires technician-like skills.
- Assessment: typically requires an educated selection of tools of evaluation and skill in evaluation.
OUTCOME
- Testing: typically yields a test score.
- Assessment: entails a logical problem-solving approach to answer the referral question.

3 FORMS OF ASSESSMENT:
1. COLLABORATIVE PSYCHOLOGICAL ASSESSMENT – assessor and assessee work as partners from initial contact through final feedback.
2. THERAPEUTIC PSYCHOLOGICAL ASSESSMENT – self-discovery and new understandings are encouraged throughout the assessment process.
3. DYNAMIC PSYCHOLOGICAL ASSESSMENT – follows a model of (a) evaluation, (b) intervention, (c) evaluation; provides a means for evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation.

Tools of Psychological Assessment

A. The Test (a measuring device or procedure)
1. psychological test: a device or procedure designed to measure variables related to psychology (intelligence, personality, aptitude, interests, attitudes, or values).
2. format: refers to the form, plan, structure, arrangement, and layout of test items as well as to related considerations such as time limits.
   a) also refers to the form in which a test is administered (paper and pencil, computer, etc.); computers can generate scenarios.
   b) the term is also used to denote the form or structure of other evaluative tools and processes, such as the guidelines for creating a portfolio work sample.
3. ways that tests differ from one another:
   a) administration procedures
      (1) some test administrators play an active, knowledgeable role
          (a) administration may involve demonstration of tasks
          (b) usually one-on-one
          (c) trained observation of the assessee's performance
      (2) some test administrators do not even have to be present
          (a) usually administered to larger groups
          (b) test takers complete tasks independently
   b) scoring and interpretation procedures
      (1) score: a code or summary statement, usually (but not necessarily) numerical in nature, that reflects an evaluation of performance on a test, task, interview, or some other sample of behavior.
      (2) scoring: the process of assigning such evaluative codes or statements to performance on tests, tasks, interviews, or other behavior samples.
      (3) different types of score:
          (a) cut score: a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications (see the sketch at the end of this section on tests).
              (i) sometimes reached without any formal method, "by eyeball," as when teachers decide for themselves what is passing and what is failing.
      (4) who scores it:
          (a) self-scored by the testtaker
          (b) computer
          (c) trained examiner
   c) psychometric soundness / technical quality
      (1) psychometrics: the science of psychological measurement; psychometric soundness refers to how consistently and how accurately a psychological test measures what it purports to measure.
      (2) utility: the usefulness or practical value that a test or other tool of assessment has for a particular purpose.
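A cut score is just a classification threshold, so it reduces to a one-line comparison in code. A minimal sketch in Python (the threshold of 70 and the score list are invented for illustration):

```python
# Minimal sketch: a cut score divides a set of scores into two classifications.
# The cut score of 70 and the scores are hypothetical.
CUT_SCORE = 70
scores = [55, 72, 88, 64, 70]
labels = ["pass" if s >= CUT_SCORE else "fail" for s in scores]
print(list(zip(scores, labels)))
# [(55, 'fail'), (72, 'pass'), (88, 'pass'), (64, 'fail'), (70, 'pass')]
```

Whether a score exactly at the cut point passes is itself a judgment call; this sketch keys "at or above" the cut score.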
B. The Interview: a method of gathering information through direct communication involving reciprocal exchange.
1. an interviewer in a face-to-face interview takes note of:
   a) verbal language
   b) nonverbal language
      (1) body language and movements
      (2) facial expressions in response to the interviewer
      (3) the extent of eye contact
      (4) apparent willingness to cooperate
   c) how the interviewee is dressed
      (1) neat vs. sloppy vs. inappropriate
2. an interviewer over the phone takes note of:
   a) changes in the interviewee's voice pitch
   b) long pauses
   c) signs of emotion in responses
3. ways that interviews differ from one another:
   a) length, purpose, and nature
   b) conducted to help make diagnostic, treatment, selection, and other decisions
4. panel interview: an interview conducted with one interviewee by more than one interviewer
C. The Portfolio
1. files of work products: paper, canvas, film, video, audio, etc.
2. samples of one's abilities and accomplishments
D. Case History Data: records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee.
1. sheds light on an individual's past and current adjustment as well as on events and circumstances that may have contributed to any changes in adjustment
2. provides information about neuropsychological functioning prior to the occurrence of a trauma or other event that results in a deficit
3. gives insight into current academic and behavioral standing
4. useful in making judgments about future class placements
5. case history study: a report or illustrative account concerning a person or an event, compiled on the basis of case history data
   a) might shed light on how one individual's personality and a particular set of environmental conditions combined to produce a successful world leader
   b) Groupthink: work on a social psychological phenomenon; contains rich case history material on collective decision making that did not always result in the best decisions
E. Behavioral Observation: monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions.
1. often used as a diagnostic aid in various settings: inpatient facilities, behavioral research laboratories, classrooms
2. naturalistic observation: behavioral observation that takes place in a naturally occurring setting (as opposed to a research laboratory) for the purpose of evaluation and information gathering
3. in practice, tends to be used most frequently by researchers in settings such as classrooms, clinics, and prisons
F. Role-Play Tests
1. role play: acting an improvised or partially improvised part in a simulated situation
2. role-play test: a tool of assessment wherein assessees are directed to act as if they were in a particular situation; assessees are then evaluated with regard to their expressed thoughts, behaviors, abilities, etc.
G. Computers as Tools
1. local processing: on-site computerized scoring, interpretation, or other conversion of raw test data; contrast with central processing and teleprocessing
2. central processing: computerized scoring, interpretation, or other conversion of raw test data that are physically transported from the same or other test sites; contrast with local processing and teleprocessing
3. teleprocessing: computerized scoring, interpretation, or other conversion of raw test data sent over telephone lines by modem from a test site to a central location for computer processing; contrast with central and local processing
4. simple score report: a type of scoring report that provides only a listing of scores
5. extended scoring report: a type of scoring report that provides a listing of scores AND statistical data
6. interpretive report: a formal or official computer-generated account of test performance presented in both numeric and narrative form and including an explanation of the findings
   a) the three varieties of interpretive report are (1) descriptive, (2) screening, and (3) consultative
   b) some contain relatively little interpretation and simply call attention to certain high, low, or unusual scores that need to be focused on
   c) consultative report: a type of interpretive report designed to provide expert and detailed analysis of test data that mimics the work of an expert consultant
   d) integrative report: a form of interpretive report of psychological assessment, usually computer-generated, in which data from behavioral, medical, administrative, and/or other sources are integrated
7. CAPA: computer-assisted psychological assessment (assistance to the test user, not the test taker)
   a) enables test developers to create psychometrically sound tests using complex mathematical procedures and calculations
   b) enables test users to construct tailor-made tests with built-in scoring and interpretive capabilities
   c) Pros:
      (1) test administrators have greater access to potential test users because of the global reach of the internet
      (2) scoring and interpretation of test data tend to be quicker than for paper-and-pencil tests
      (3) costs associated with internet testing tend to be lower than costs associated with paper-and-pencil tests
      (4) the internet facilitates the testing of otherwise isolated populations, as well as people with disabilities for whom getting to a test center might prove a hardship
      (5) greener: conserves paper, shipping materials, etc.
   d) Cons:
      (1) test client integrity
          (a) refers to the verification of the identity of the test taker when a test is administered online
          (b) also refers to the sometimes conflicting interests of the test taker and the test administrator: the test taker might have access to notes, aids, internet resources, etc.
          (c) internet testing is only testing, not assessment
8. CAT: computerized adaptive testing: an interactive, computer-administered test-taking process wherein the items presented to the test taker are based in part on the test taker's performance on previous items
   a) EX: on a computerized test of academic abilities, the computer might be programmed to switch from testing math skills to English skills after three consecutive failures on math items
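The branching logic of CAT can be made concrete with a short sketch. This is illustrative only: the item bank, questions, and switch-after-three-misses rule mirror the example above but are invented here; operational CAT systems select items using item response theory, which is not shown.

```python
# Minimal sketch of CAT branching: switch sections after 3 consecutive misses,
# mirroring the math-to-English example above. The item bank is hypothetical.
ITEM_BANK = {
    "math":    [("2+2?", "4"), ("3*3?", "9"), ("7-5?", "2")],
    "english": [("Plural of 'child'?", "children"), ("Past tense of 'go'?", "went")],
}

def administer(answers):
    """answers maps each question to the (simulated) testtaker's response."""
    section, misses, record = "math", 0, []
    queue = list(ITEM_BANK[section])
    while queue:
        question, key = queue.pop(0)
        correct = answers.get(question) == key
        record.append((section, question, correct))
        misses = 0 if correct else misses + 1
        if misses == 3 and section == "math":   # adaptive branch
            section, misses = "english", 0
            queue = list(ITEM_BANK["english"])
    return record

# Simulated testtaker who misses every math item: the session adapts by
# branching to the English items.
print(administer({"Plural of 'child'?": "children"}))
```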
H. Other Tools
1. DVD/video: how would you respond to the events that take place in the video?
   a) sexual harassment in the workplace
   b) responding to various types of emergencies
   c) forming a diagnosis/treatment plan for clients shown on videotape
2. thermometers, biofeedback equipment, etc.

TEST DEVELOPER
- Test developers create tests: they conceive, prepare, and develop them, and they find a way to disseminate them, publishing either commercially or through professional publications such as books or periodicals.

TEST USER
- Test users select or decide to take a specific test off the shelf and use it for some purpose. They may also participate in other roles, e.g., as examiners or scorers.

TEST TAKER
- Anyone who is the subject of an assessment.
- Test takers may vary on a continuum with respect to numerous variables, including:
  o the amount of test anxiety they experience and the degree to which that anxiety might affect the results
  o the extent to which they understand and agree with the rationale of the assessment
  o their capacity and willingness to cooperate
  o the amount of physical pain or emotional distress they are experiencing
  o the amount of physical discomfort
  o the extent to which they are alert and wide awake
  o the extent to which they are predisposed to agreeing or disagreeing when presented with stimulus statements
  o the extent to which they have received prior coaching
  o the importance they may attribute to portraying themselves in a good light
- Psychological autopsy: a reconstruction of a deceased individual's psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee.

TYPES OF SETTINGS
EDUCATIONAL SETTING
 o achievement test: an evaluation of accomplishment or the degree of learning that has taken place, usually with regard to an academic area
 o diagnosis: a description or conclusion reached on the basis of evidence and opinion through a process of distinguishing the nature of something and ruling out alternative conclusions
 o diagnostic test: a tool used to make a diagnosis, usually to identify areas of deficit to be targeted for intervention
 o informal evaluation: a typically nonsystematic, relatively brief, and "off the record" assessment leading to the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial context and not subject to the same ethics or standards as evaluation by a professional
CLINICAL SETTING
 o assessment tools are used to help screen for or diagnose behavior problems
 o group testing is used primarily for screening: identifying those individuals who require further diagnostic evaluation
COUNSELING SETTING
 o schools, prisons, and governmental or privately owned institutions
 o ultimate objective: the improvement of the assessee in terms of adjustment, productivity, or some related variable
GERIATRIC SETTING
 o quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support
BUSINESS AND MILITARY SETTINGS
GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING

How Are Assessments Conducted?
- protocol: the form, sheet, or booklet on which a testtaker's responses are entered; the term may also refer to a description of a set of test- or assessment-related procedures, as in the sentence "the examiner dutifully followed the complete protocol for the stress interview"
- rapport: the working relationship between the examiner and the examinee

ASSESSMENT OF PEOPLE WITH DISABILITIES
- Standards define who requires alternate assessment, how such assessments are to be conducted, and how meaningful inferences are to be drawn from the data derived from such assessments.
- accommodation: the adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs
  o ex.: translating a test into Braille and administering it in that form
- alternate assessment: an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods
- Consider these four variables when deciding which of many different types of accommodation should be employed:
  o the capabilities of the assessee
  o the purpose of the assessment
  o the meaning attached to test scores
  o the capabilities of the assessor

REFERENCE SOURCES
- TEST CATALOGUES – contain brief descriptions of tests
- TEST MANUALS – contain detailed information
- REFERENCE VOLUMES – "one-stop shopping": provide detailed information for each test listed, including test publisher, author, purpose, intended test population, and test administration time
- JOURNAL ARTICLES – contain reviews of tests
- ONLINE DATABASES – the most widely used bibliographic databases

TYPES OF TESTS
- INDIVIDUAL TEST – given to only one person at a time
- GROUP TEST – administered to more than one person at a time by a single examiner
- ABILITY TESTS:
  o ACHIEVEMENT TESTS – refer to previous learning (ex.: spelling)
  o APTITUDE/PROGNOSTIC TESTS – refer to the potential for learning or acquiring a specific skill
  o INTELLIGENCE TESTS – refer to a person's general potential to solve problems
- PERSONALITY TESTS: refer to overt and covert dispositions
  o OBJECTIVE/STRUCTURED TESTS – usually self-report; require the subject to choose between two or more alternative responses
  o PROJECTIVE/UNSTRUCTURED TESTS – the subject responds to ambiguous stimuli, projecting his or her own needs, fears, hopes, and motivations
  o INTEREST TESTS – measure a person's preferences, such as interest in particular activities or occupations
CHAPTER 2: HISTORICAL, CULTURAL AND LEGAL/ETHICAL CONSIDERATIONS
A HISTORICAL PERSPECTIVE
ANTIQUITY TO THE 19TH CENTURY
- Tests and testing programs first came into being in China.
- Testing was instituted as a means of selecting which of many applicants would obtain government jobs (civil service).
- Job applicants were tested on proficiency in endeavors such as music, archery, knowledge, and skill.
GRECO-ROMAN WRITINGS AND THE MIDDLE AGES
- Greco-Roman writings attributed personality to an abundance or deficiency in some bodily fluid, a view associated with Hippocrates and Galen.
- Thought in the Middle Ages was preoccupied with the world's evil, and deviance was explained in those terms.
RENAISSANCE
- Christian von Wolff anticipated psychology as a science and psychological measurement as a specialty within that science.
CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
- In "On the Origin of Species," Darwin argued that chance variation in species would be selected or rejected by nature according to adaptivity and survival value ("survival of the fittest").
- This spurred tests designed to measure individual differences in ability and personality among people.
FRANCIS GALTON
- Explored and quantified individual differences between people.
- Classified people "according to their natural gifts."
- Displayed the first anthropometric laboratory.
KARL PEARSON
- Developed the product-moment correlation technique.
- His work can be traced directly to Galton's.
WILHELM MAX WUNDT
- Founded the first experimental psychology laboratory, at the University of Leipzig.
- Focused more on how people are similar, not how they differ from each other.
JAMES McKEEN CATTELL
- Studied individual differences in reaction time.
- Coined the term "mental test."
CHARLES SPEARMAN
- Originated the concept of test reliability and built the mathematical framework for the statistical technique of factor analysis.
VICTOR HENRI
- Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure higher mental processes.
EMIL KRAEPELIN
- An early experimenter with the word association technique as a formal test.
LIGHTNER WITMER
- The "little-known founder of clinical psychology."
- Founded the first psychological clinic in the United States.
PSYCHE CATTELL
- Daughter of James McKeen Cattell.
- Developed the Cattell Infant Intelligence Scale (CIIS) and wrote The Measurement of Intelligence in Infants and Young Children.
RAYMOND CATTELL
- Believed in the lexical approach to defining personality, which examines human languages for descriptors of personality dimensions.
20TH CENTURY
- Birth of the first formal tests of intelligence.
- Testing shifted to be of more understandable relevance and meaning.
A. THE MEASUREMENT OF INTELLIGENCE
 o Binet created the first intelligence test to identify Paris schoolchildren with intellectual disabilities (individually administered).
 o The Binet-Simon Test has been revised again and again.
 o Group intelligence tests emerged with the need to screen the intellect of WWI recruits.
 o David Wechsler designed a test to measure adult intelligence.
    - For Wechsler, intelligence is the global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.
    - Wechsler-Bellevue Intelligence Scale.
    - The Wechsler Adult Intelligence Scale was revised several times, and the Wechsler family of tests extended the age range covered, from young children through senior adulthood.
B. THE MEASUREMENT OF PERSONALITY
 o Critics charged that the field of psychology was becoming too test oriented; clinical psychology had become synonymous with mental testing.
 o ROBERT WOODWORTH – developed a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits.
    - To disguise the true purpose of the test, the questionnaire was labeled the Personal Data Sheet.
    - He later called it the Woodworth Psychoneurotic Inventory – the first widely used self-report test of personality.
 o Self-report tests:
    - Advantage: respondents are the people best qualified to report on themselves.
    - Disadvantages: poor insight into oneself (one might honestly believe something about oneself that isn't true); unwillingness to report seemingly negative qualities.
 o Projective tests: the individual is assumed to project onto some ambiguous stimulus (inkblot, photo, etc.) his or her own unique needs, fears, hopes, and motivations.
    - Ex.: the Rorschach inkblot test.
C. ACADEMIC AND APPLIED TRADITIONS

Culture and Assessment
- Culture: "the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people."

Evolving Interest in Culture-Related Issues
- Goddard tested immigrants and found most to be "feebleminded."
  o The results were invalid; the testing overestimated mental deficiency, even in native English speakers.
- This led to the nature-nurture debate about what intelligence tests actually measure.
- The cultural variable needed to be "isolated."
- Culture-specific tests: tests designed for use with people from one culture but not from another.
  o Minorities still scored abnormally low (ex.: a loaf of bread vs. tortillas).
- Today, tests undergo many steps to ensure they are suitable for the nation in which they are used, taking testtakers' reactions into account.

Some Issues Regarding Culture and Assessment
- Verbal communication
  o The examiner and examinee must speak the same language.
  o Especially tricky with infrequently used vocabulary or unusual idioms.
  o A translator may lose the nuances of translation or give unintentional hints toward the more desirable answer.
  o Translation also requires an understanding of the culture.
- Nonverbal communication and behavior
  o Differs between cultures (ex.: the meaning of not making eye contact).
  o Body movement could even have a physical cause.
  o Psychoanalysis: Freud's theory of personality and psychological treatment, which holds that symbolic significance is assigned to many nonverbal acts.
  o Timed tests may be inappropriate in cultures not preoccupied with speed.
  o A lack of speaking could reflect reverence for elders.
- Standards of evaluation
  o Acceptable roles for women differ across cultures.
  o "Judgments as to who might be the best employee, manager, or leader may differ as a function of culture, as might judgments regarding intelligence, wisdom, courage, and other psychological variables."
  o One must ask, "How appropriate are the norms or other standards that will be used to make this evaluation?"

Tests and Group Membership
- Ex.: a minimum height of 5'4" for police officers excludes cultural groups of shorter average stature.
- Ex.: a Jewish lifestyle may be deemed not well suited to corporate America.
- Affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal opportunity in education and employment for all.

Psychology, Tests, and Public Policy

Legal and Ethical Considerations
- Code of professional ethics: defines the standard of care expected of members of a given profession.

The Concerns of the Public
- Beginning in World War I, there was fear that tests were testing only the ability to take tests.

Legislation
- Minimum competency testing programs: formal testing programs designed to be used in decisions regarding various aspects of students' educations.
- Truth-in-testing legislation: state laws that provide testtakers with a means of learning the criteria by which they are being judged.

Litigation
- The Daubert ruling made federal judges the gatekeepers in determining what expert testimony is admitted.
- This overrode the Frye standard, which admitted only scientific testimony that had won general acceptance in the scientific community.

The Concerns of the Profession
- Test-user qualifications: who should be allowed to use psychological tests?
  o Level A: tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working.
  o Level B: tests or aids that require some technical knowledge of test construction and use, and of supporting psychological and educational fields.
  o Level C: tests and aids requiring substantial understanding of testing and supporting psychological fields, together with supervised experience.
- Testing people with disabilities poses difficulties in:
  o transforming the test into a form that can be taken by the testtaker
  o transforming the testtaker's responses into a scorable form
  o meaningfully interpreting the test data
- Computerized test administration, scoring, and interpretation
  o simple and convenient, but easily copied and duplicated
  o insufficient research comparing computerized versions with paper-and-pencil versions
  o the value of computerized interpretation is questionable
  o unprofessional, unregulated "psychological testing" is offered online

The Rights of Testtakers
- The right of informed consent
  o Testtakers have the right to know why they are being evaluated, how the test data will be used, and what information will be released to whom.
  o Consent may be obtained from a parent or legal representative.
  o Consent must be in written form and specify: the general purpose of the testing, the specific reason it is being undertaken, and the general type of instruments to be administered.
  o If revealing this information before the test would contaminate the results, deception may be used, but only if absolutely necessary; deception must not be used if it will cause emotional distress, and participants must be fully debriefed afterward.
- The right to be informed of test findings
  o Formerly, test administrators were advised to give participants only positive information and to tell testtakers as little as possible about the nature of their performance on a particular test, so that the examinee would leave the test session feeling pleased and satisfied; now realistic information is required.
  o Testtakers also have the right to know what recommendations are being made as a consequence of the test data.
- The right to privacy and confidentiality
  o The right to privacy "recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions."
  o Privileged information: information protected by law from being disclosed in a legal proceeding; it protects clients from disclosure in judicial proceedings, and the privilege belongs to the client, not the psychologist.
  o Confidentiality concerns matters of communication outside the courtroom.
     - Safekeeping of test data: it is not good policy to maintain all records in perpetuity.
- The right to the least stigmatizing label
  o The standards advise that the least stigmatizing labels should always be assigned when reporting test results.
CHAPTER 3: A STATISTICS REFRESHER
Why We Need Statistics
- Statistics are important for purposes of education: numbers provide convenient summaries and allow us to evaluate some observations relative to others.
- We use statistics to make inferences: logical deductions about events that cannot be observed directly.
  o Exploratory data analysis: the detective work of gathering and displaying clues.
  o Confirmatory data analysis then tests the hypotheses that work suggests.
- Descriptive statistics: methods used to provide a concise description of a collection of quantitative information.
- Inferential statistics: methods used to make inferences from observations of a small group of people, known as a sample, to a larger group of individuals, known as a population.

SCALES OF MEASUREMENT
- MEASUREMENT: the act of assigning numbers or symbols to characteristics of things according to rules; the rules serve as a guideline for representing the magnitude of the attribute being measured. Measurement always involves error.
- SCALE: a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned.
- CONTINUOUS SCALE: interval or ratio; used to measure a continuous variable; always involves error.
- DISCRETE SCALE: nominal or ordinal; used to measure a discrete variable (ex.: female or male).
- ERROR: the collective influence of all factors on a test score beyond those the test is intended to measure.

PROPERTIES OF SCALES
- Three properties: magnitude, equal intervals, and an absolute 0.
- Magnitude
  o The property of "moreness": a scale has the property of magnitude if we can say that a particular instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance.
- Equal intervals
  o A scale has the property of equal intervals if the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units.
  o A psychological test rarely has the property of equal intervals.
  o When a scale has the property of equal intervals, the relationship between the measured units and some outcome can be described by a straight line, or a linear equation of the form Y = a + bX, showing that an increase of equal units on the scale reflects equal increases in the meaningful correlates of those units.
- Absolute 0
  o An absolute 0 is obtained when nothing of the property being measured exists.
  o This is extremely difficult or impossible to establish for many psychological qualities.

NOMINAL SCALE
- The simplest form of measurement: classification or categorization.
- Ex.: male or female; also includes test items with yes/no responses.
- Arithmetic operations cannot meaningfully be performed on nominal data; observations can only be counted.

ORDINAL SCALE
- Classifies in some kind of ranking order: individuals are compared to others and assigned a rank.
- Implies nothing about how much greater one ranking is than another; numbers or ranks do not indicate units of measurement.
- No absolute zero point.
- Binet believed that data derived from intelligence tests are ordinal in nature.

INTERVAL SCALE
- In addition to the features of nominal and ordinal scales, contains equal intervals between numbers.
- No absolute zero point.
- Averages can be taken.

RATIO SCALE
- In addition to all the properties of nominal, ordinal, and interval measurement, a ratio scale has a true zero point.
- Equal intervals between numbers.
- Ex.: measuring the amount of pressure a hand can exert.
- A true zero doesn't mean someone will receive a score of 0; it means that 0 has meaning.

NOTE: Permissible Operations
- Level of measurement is important because it defines which mathematical operations we can apply to numerical data.
- For nominal data, each observation can be placed in only one mutually exclusive category.
- Ordinal measurements can be manipulated using arithmetic.
- With interval data, one can apply any arithmetic operation to the differences between scores; these cannot be used to make statements about ratios.

DESCRIBING DATA
- Distribution: a set of scores arrayed for recording or study.
- Raw score: a straightforward, unmodified accounting of performance, usually numerical.

Frequency Distributions
- Frequency distribution: all scores listed alongside the number of times each score occurred; it displays scores on a variable or measure to reflect how frequently each value was obtained.
  o One defines all the possible scores and determines how many people obtained each of those scores.
  o A single test score means more if one relates it to other test scores; a distribution of scores summarizes the scores for a group of individuals.
- Grouped frequency distribution: test-score intervals (class intervals) replace the actual test scores; the highest and lowest class intervals are the upper and lower limits of the distribution.
  o Whenever you draw a frequency distribution or a frequency polygon, you must decide on the width of the class interval (ex.: for inches of rainfall, the class interval is the unit on the horizontal axis).
- Histogram: a graph with vertical lines drawn at the true limits of each test score (or class interval), forming TOUCHING rectangles, with the midpoint of each score in the center of its bar.
- Bar graph: the rectangles DON'T touch.
- Frequency polygon: data illustrated with a continuous line connecting the points where test scores or class intervals meet frequencies.
- Income is an example of a variable that has a positive skew.

Measures of Central Tendency
- Measure of central tendency: a statistic that indicates the average or midmost score between the extreme scores in a distribution (a minimal code sketch follows this section).
- The arithmetic mean ("X bar"): the sum of the observations divided by the number of observations, i.e., ΣX / n.
  o Used for interval or ratio data when distributions are relatively normal.
- The median: the middle score.
  o Used for ordinal, interval, and ratio data; especially useful when few scores fall at the extremes.
- The mode: the most frequently occurring score.
  o Bimodal distribution: two scores share the highest frequency.
  o The mode is the only measure of central tendency that can be used with nominal data.
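These three statistics, and the frequency distribution they summarize, are easy to compute. A minimal sketch in Python (the score data are invented; Python's built-in modules implement the formulas above):

```python
# Illustrative sketch: frequency distribution, mean, median, and mode for a
# small set of raw scores. The data are invented.
import statistics
from collections import Counter

scores = [85, 92, 76, 92, 88, 70, 92, 81]

freq = Counter(scores)               # frequency distribution: score -> count
mean = sum(scores) / len(scores)     # arithmetic mean: sum of X divided by n
median = statistics.median(scores)   # middle score of the sorted distribution
mode = statistics.mode(scores)       # most frequently occurring score

print(freq)                          # Counter({92: 3, 85: 1, ...})
print(f"mean={mean}, median={median}, mode={mode}")
# mean=84.5, median=86.5, mode=92
```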
Measures of Variability
- Variability: an indication of how scores in a distribution are scattered or dispersed.
- The range
  o The difference between the highest and lowest scores; a quick but gross description of the spread of scores.
- The interquartile and semi-interquartile ranges
  o A distribution is split by three quartiles into four quarters, each representing 25% of the scores; Q2 = the median.
  o Interquartile range: a measure of variability equal to the difference between Q3 and Q1.
  o Semi-interquartile range: the interquartile range divided by 2.
- Quartiles and deciles
  o Quartiles are points that divide the frequency distribution into equal fourths: the first quartile is the 25th percentile; the second quartile is the median, or 50th percentile; the third quartile is the 75th percentile.
  o The interquartile range is bounded by the range of scores that represents the middle 50% of the distribution.
  o Deciles are similar but use points that mark 10% rather than 25% intervals.
- The average deviation
  o Deviation score: x = X − mean.
  o Average deviation = (sum of the absolute values of all deviation scores) / (total number of scores); the absolute value matters, because the raw deviations always sum to 0.
  o Tells us, on average, how far scores are from the mean.
- The standard deviation
  o Similar to the average deviation, but to overcome the plus/minus sign problem, each deviation is squared.
  o Standard deviation: a measure of variability equal to the square root of the average squared deviation about the mean; that is, the square root of the variance.
  o Variance: the mean of the squared differences between the scores in a distribution and their mean; found by squaring and summing all the deviation scores and then dividing by the total number of scores.
  o s = sample standard deviation; σ (sigma) = population standard deviation.

Skewness
- Skewness: the nature and extent to which symmetry is absent from a distribution.
- POSITIVE SKEW: ex., the test was too hard. NEGATIVE SKEW: ex., the test was too easy.
- Skewness can be gauged by examining the relative distances of the quartiles from the median.

Kurtosis
- The steepness of a distribution.
- Leptokurtic: relatively peaked; mesokurtic: somewhere in the middle; platykurtic: relatively flat.

The Normal Curve
- Normal curve: a bell-shaped, smooth, mathematically defined curve that is highest at its center; both sides taper as the curve approaches the x-axis asymptotically.
- It is symmetrical, so its mean, median, and mode are the same.

Area under the Normal Curve
- The area under the curve can be divided into the tails and the body.

Standard Scores
- Standard score: a raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation; used for comparison.
- Z-score
  o The conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution: the difference between a particular raw score and the mean, divided by the standard deviation.
  o Used to compare test scores from different scales.
- T-score
  o A standard score system composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above it (conventionally with a mean of 50 and a standard deviation of 10).
  o Has no negative values.
- Other standard scores: SAT and GRE scores.
- Linear transformation: a standard score that retains a direct numerical relationship to the original raw score.
- Nonlinear transformation: required when the data are not normally distributed yet comparisons with normal distributions need to be made.
  o Normalized standard scores: used when scores don't fall on a normal distribution; "normalizing a distribution involves 'stretching' the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale called a normalized standard score scale."
  o Stanine system: converts any set of scores into a transformed scale that ranges from 1 to 9.
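A short sketch tying the variability and standard-score formulas together (the raw scores are invented, and the T-score line assumes the conventional scale with mean 50 and standard deviation 10):

```python
# Illustrative sketch: variance, standard deviation, z-scores, and T-scores.
# Raw scores are invented; population formulas (divide by N) are used.
import math

scores = [10, 12, 14, 16, 18]
n = len(scores)
mean = sum(scores) / n

variance = sum((x - mean) ** 2 for x in scores) / n  # mean of squared deviations
sd = math.sqrt(variance)                             # standard deviation

z = [(x - mean) / sd for x in scores]                # z = (X - mean) / sd
t = [50 + 10 * zi for zi in z]                       # T = 50 + 10z; no negatives

print(f"mean={mean}, variance={variance}, sd={sd:.2f}")
print([f"{zi:.2f}" for zi in z])                     # -1.41 ... 1.41
print([f"{ti:.1f}" for ti in t])                     # 35.9 ... 64.1
```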
CHAPTER 4: OF TESTS AND TESTING
Some Assumptions About Psychological Testing and Assessment
- Assumption 1: Psychological Traits and States Exist
  o Trait: any distinguishable, relatively enduring way in which one individual varies from another.
  o States: also distinguish one person from another, but are relatively less enduring.
  o A trait is a term that an observer applies, along with the strength or magnitude of the trait presumed present, based on observing a sample of behavior.
  o Trait and state definitions also refer to individual variation: we make comparisons with respect to the hypothetical average person.
  o Samples of behavior include direct observation, analysis of self-report statements, and paper-and-pencil test answers.
  o "Psychological trait" covers a wide range of possible characteristics, ex.: intelligence, specific intellectual abilities, cognitive style, psychopathology.
  o There is controversy regarding how psychological traits exist: traits exist only as constructs — informed, scientific concepts developed or constructed to describe or explain behavior. We can't see, hear, or touch constructs; we infer their existence from overt behavior, i.e., an observable action or the product of an observable action, including test- or assessment-related responses.
  o Traits are not expected to be manifested in behavior 100% of the time, yet there seems to be rank-order stability in personality traits: relatively high correlations between trait scores at different time points.
  o Whether and to what degree a trait manifests itself depends on the strength and nature of the situation.
- Assumption 2: Psychological Traits and States Can Be Quantified and Measured
  o Once it is acknowledged that psychological traits and states exist, the specific traits and states to be measured need to be defined: what types of behaviors are assumed to be indicative of the trait? The test developer has to provide test users with a clear operational definition of the construct under study.
  o Once the construct is defined, the test developer considers the types of item content that would provide insight into it (ex.: behaviors indicative of the particular trait).
  o Should all questions be weighted the same? Weighting the comparative value of a test's items comes about as the result of a complex interplay among many factors: technical considerations, the way the construct has been defined for the particular test, and the value society (and the test developer) attaches to the behaviors evaluated.
  o Appropriate ways to score the test and interpret the results must also be found. Cumulative scoring: the test score is presumed to represent the strength of the targeted ability, trait, or state — the more the testtaker responds in a particular direction (as keyed by the test manual), the more of the targeted trait or ability the testtaker is presumed to possess (a minimal sketch follows below).
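Cumulative scoring reduces to summing keyed responses, as the following minimal sketch shows (the answer key and response vector are invented):

```python
# Minimal sketch of cumulative scoring: the total score is the number of
# responses matching the keyed direction, so a higher total means more of the
# targeted trait is presumed. The key and responses are hypothetical.
KEY       = ["T", "F", "T", "T", "F"]   # keyed direction for each item
responses = ["T", "F", "F", "T", "F"]   # one testtaker's answers

score = sum(1 for r, k in zip(responses, KEY) if r == k)
print(f"cumulative score: {score} / {len(KEY)}")   # -> 4 / 5
```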
- Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
  o The objective of the test is to provide some indication of some aspects of the examinee's behavior; tasks on some tests mimic the actual behaviors that the test user is attempting to understand.
  o The obtained behavior is usually used to predict future behavior; it can also be used to postdict behavior, to aid in the understanding of behavior that has already taken place.
  o Tools of assessment such as a diary or case history data might be of great value in such an evaluation.
- Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
  o Competent test users understand a lot about the tests they use: how each was developed, the circumstances under which it is appropriate to administer it, how it should be administered and to whom, and how its results should be interpreted.
  o They understand and appreciate the limitations of the tests they use.
- Assumption 5: Various Sources of Error Are Part of the Assessment Process
  o Everyday "error" means mistakes and miscalculations; in assessment, error reflects the long-standing assumption that factors other than what a test attempts to measure will influence performance on the test.
  o Error variance: the component of a test score attributable to sources other than the trait or ability measured; assessees themselves are sources of error variance.
  o Classical test theory (CTT), or true score theory: the assumption is made that each testtaker has a true score on a test that would be obtained but for the action of measurement error.
- Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
  o Court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner.
  o Publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual.
  o Fairness-related problems and questions remain, ex.: when the testtaker's culture differs from that of the people for whom the test was intended, and when politics intrude.
- Assumption 7: Testing and Assessment Benefit Society
  o Many critical decisions are based on testing and assessment procedures.

WHAT'S A "GOOD TEST"?
- Criteria
  o Clear instructions for administration, scoring, and interpretation.
- Reliability
  o A "good test" (measuring tool) is reliable. Reliability involves consistency: the precision with which the test measures and the extent to which error is present in measurements. Unreliable measurement must be avoided.
- Validity
  o A test is considered valid if it does indeed measure what it purports to measure.
  o If there is controversy over the definition of a construct, then the validity of any test measuring that construct is sure to be criticized as well.
  o Questions regarding validity focus on the items that collectively make up the test: do they adequately sample the range of areas that must be sampled to measure the construct, and do individual items contribute to or take away from the test's validity?
  o Validity may also be questioned on grounds related to the interpretation of test results.
- Other Considerations
  o A "good test" is one that trained examiners can administer, score, and interpret with a minimum of difficulty.
  o A good test is a useful test: it yields actionable results that will ultimately benefit individual testtakers or society at large.
  o A further purpose of a test is to compare the performance of a testtaker with the performance of other testtakers; a good test therefore contains adequate norms (normative data), which provide a standard against which measured results can be compared.

NORMS
- Norm-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker's score and comparing it to the scores of a group of testtakers.
- The meaning of an individual score is relative to the other scores on the same test.
- Norms (scholarly context): usual, average, normal, standard, expected, or typical.
- Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores.
- Normative sample: the group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers; testing them yields a distribution of scores.
- Norming: the process of deriving norms; may refer to a particular type of norm derivation.
  o Race norming: the controversial practice of norming on the basis of race or ethnic background.
- Norming a test can be very expensive; user norms (program norms) consist of descriptive statistics based on a group of testtakers in a given period of time rather than norms obtained by formal sampling methods.
- Sampling to Develop Norms
  o Standardization: the process of administering a test to a representative sample of testtakers for the purpose of establishing norms; a test is standardized when it has clear, specified procedures.
  o The test developer targets a defined group as the population for which the test is designed; all members share at least one common, observable characteristic.
  o To obtain a distribution of scores, the test may be administered to everyone in the targeted population or to a sample: a portion of the universe of people deemed to be representative of the whole population. Sampling is the process of selecting that portion.
  o Subgroups within a defined population may differ with respect to some characteristics, and it is sometimes essential to have those differences proportionately represented in the sample.
     - Stratified sampling: the sample reflects the statistics of the whole population; helps prevent sampling bias and ultimately aids in the interpretation of findings.
     - Purposive sampling: arbitrarily selecting a sample believed to be representative of the population.
     - Incidental (convenience) sampling: using a sample that is convenient or available; such samples may be very exclusive (contain exclusionary criteria).
- TYPES OF STANDARD ERROR:
  o STANDARD ERROR OF MEASUREMENT – estimates the extent to which an observed score deviates from a true score.
  o STANDARD ERROR OF ESTIMATE – in regression, an estimate of the degree of error involved in predicting the value of one variable from another.
  o STANDARD ERROR OF THE MEAN – a measure of sampling error.
  o STANDARD ERROR OF THE DIFFERENCE – estimates how large a difference between two scores should be before the difference is considered statistically significant.
- Developing norms for a standardized test
  o Establish a standard set of instructions and conditions under which the test is given; this makes the scores of the normative sample more comparable with the scores of future testtakers.
  o Once all the data have been collected and analyzed, the test developer summarizes them using descriptive statistics (measures of central tendency and variability).
  o The test developer needs to provide a precise description of the standardization sample itself; descriptions of normative samples vary widely in detail.

Tracking
- Comparisons are usually made with people of the same age.
- Children at the same age level tend to go through different growth patterns; pediatricians must know the child's percentile within a given age group.
- The tendency to stay at about the same level relative to one's peers is known as tracking (e.g., height and weight); diets may alter this "track."
- Criticisms: some believe the analogy between rates of physical growth and rates of intellectual growth is faulty; children learn at different rates, and the system may discriminate against some children.

TYPES OF NORMS
 o Norms may be classified, ex.: age, grade, national, local, percentile.
 o PERCENTILES (see the sketch below)
    - Median = the 2nd quartile: the point at or below which 50% of the scores fall and above which the remaining 50% fall.
    - One might instead divide the distribution of scores into deciles: 10 equal parts.
    - The Xth percentile is equal to the score at or below which X% of scores fall.
    - Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score.
    - Percentage correct: refers to the distribution of raw scores — the number of items answered correctly, multiplied by 100 and divided by the total number of items; NOT the same as a percentile, which is a converted score referring to a percentage of testtakers.
    - Percentiles are easily calculated and a popular way of organizing test-related data.
    - A problem with percentiles under a normal distribution: real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle (this worsens with highly skewed data).
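The percentile conversion described above is a one-line computation. A minimal sketch (the norm-group scores are invented), using the definition of a percentile as the percentage of scores falling below a given raw score:

```python
# Illustrative sketch: converting a raw score to a percentile rank, defined
# here as the percentage of norm-group scores that fall below the raw score.
# The norm-group data are invented.
def percentile_rank(raw_score, norm_scores):
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

norm_group = [55, 60, 62, 65, 68, 70, 73, 75, 80, 90]
print(percentile_rank(70, norm_group))   # 50.0 -> 70 is at the 50th percentile
```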
 o AGE NORMS
    - Age-equivalent scores (age norms): indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered.
    - Age norm tables exist for physical characteristics.
    - "Mental" age vs. physical age: the tester needs a way to identify mental age.
 o GRADE NORMS
    - Grade norms: designed to indicate the average test performance of testtakers in a given school grade.
    - Developed by administering the test to representative samples of children over a range of consecutive grades; the mean or median score for children at each grade level is calculated.
    - Grade norms have great intuitive appeal, but they do not provide information as to the content or type of items that a student could or could not answer correctly.
    - Developmental norms: a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life (grade norms and age norms are examples).
 o NATIONAL NORMS
    - National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
 o NATIONAL ANCHOR NORMS
    - Many different tests purport to measure the same human characteristics or abilities.
    - National anchor norms: equivalency tables for scores on tests that purport to measure the same thing; they provide a tool for comparisons and lend stability to test scores by anchoring them to other test scores.
    - The process begins with the computation of percentile norms for each test to be compared.
    - Equipercentile method: the equivalency of scores on different tests is calculated with reference to corresponding percentile scores.
 o SUBGROUP NORMS
    - A normative sample can be segmented by any criteria initially used in selecting subjects for the sample.
    - Subgroup norms: the result of such segmentation; more narrowly defined norms.
 o LOCAL NORMS
    - Local norms: provide normative information with respect to the local population's performance on some test; typically developed by test users themselves.
- Fixed Reference Group Scoring Systems
  o Norms provide a context for interpreting the meaning of a test score.
  o Fixed reference group scoring system: the distribution of scores obtained on the test from one group of testtakers (the fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test. Ex.: the SAT (first administered in 1926).

NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
- One way to derive meaning from a test score is to evaluate the test score in relation to other scores on the same test (norm-referenced evaluation).
- Criterion-referenced evaluation: deriving meaning from a test score by evaluating it on the basis of whether or not some criterion has been met.
  o Criterion: a standard on which a judgment or decision may be based.
- Criterion-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard (ex.: to drive, one must pass a driving test).
  o Derives from the values and standards of an individual or organization.
  o Also called domain-referenced or content-referenced testing and assessment.
  o Critique: if followed strictly, important information about an individual's performance relative to others can potentially be lost.

Culture and Inference
- Culture is a factor in test administration, scoring, and interpretation.
- The test user should research a test's available norms in advance to check how appropriate they are for the targeted testtaker population; it is also helpful to know about the culture of the testtaker.

CORRELATION
- Correlation: the degree and direction of correspondence between two things.
- Correlation coefficient (r): expresses a linear relationship between two continuous variables; a numerical index that tells us the extent to which X and Y are "co-related."
- Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X.
- Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa.
- No correlation: the variables are not related.
- r ranges from −1 to +1.
- Correlation does not imply causation (ex.: among weight, height, and intelligence).

PEARSON r
- The Pearson product-moment correlation coefficient, devised by Karl Pearson.
- Used when the relationship between the two variables is linear and both are continuous.
- Coefficient of determination (r²): an indication of how much variance is shared by the X and Y variables.

SPEARMAN RHO
- A rank-order correlation coefficient, developed by Charles Spearman.
- Used when the sample size is small and when both sets of measurements are in ordinal (ranking) form.

BISERIAL CORRELATION
- Expresses the relationship between a continuous variable and an artificial dichotomous variable.
  o If the dichotomous variable is a true dichotomy, use the point-biserial correlation.
  o When both variables are dichotomous and at least one of the dichotomies is true, the association between them can be estimated using the phi coefficient.
  o If both dichotomous variables are artificial, a special correlation coefficient, the tetrachoric correlation, may be used.

REGRESSION
- The analysis of relationships among variables for the purpose of understanding how one variable may predict another.
- SIMPLE REGRESSION: one independent variable (X) and one dependent variable (Y).
- Regression line: the best-fitting straight line through a set of points in a scatter diagram, found by using the principle of least squares, which minimizes the squared deviations around the regression line.
- Primary use: to predict one score or variable from another.
- Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of prediction and the smaller the standard error of estimate.
- MULTIPLE REGRESSION: the use of more than one score to predict Y.
- Regression coefficient (b): the slope of the regression line; the ratio of the sum of squares for the covariance to the sum of squares for X.
  o Sum of squares: the sum of the squared deviations around the mean.
  o Covariance: expresses how much two measures covary, or vary together.
  o The slope describes how much change is expected in Y each time X increases by one unit.
- Intercept (a): the value of Y when X is 0; the point at which the regression line crosses the Y axis.

THE BEST-FITTING LINE
- The difference between the observed and the predicted score (Y − Y′) is called the residual.
- The best-fitting line is most appropriately found by squaring each residual; it is the line that keeps these squared residuals as small as possible (the principle of least squares).
- Correlation is a special case of regression in which the scores for both variables are in standardized, or Z, units.
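The correlation and regression formulas above chain together naturally: Pearson r scales the covariance sum of squares by the X and Y sums of squares, the least-squares slope is b = SS_covariance / SS_X, and the intercept is a = mean(Y) − b·mean(X). A minimal sketch with invented paired scores:

```python
# Illustrative sketch: Pearson r and a least-squares regression line.
# b = sum of squares for the covariance / sum of squares for X; a = Ybar - b*Xbar.
# The paired scores are invented for the example.
import math

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

ss_x  = sum((x - xbar) ** 2 for x in X)                     # sum of squares for X
ss_y  = sum((y - ybar) ** 2 for y in Y)                     # sum of squares for Y
ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))  # covariance sum of squares

r = ss_xy / math.sqrt(ss_x * ss_y)   # Pearson product-moment correlation
b = ss_xy / ss_x                     # slope of the regression line
a = ybar - b * xbar                  # intercept: value of Y when X = 0

print(f"r={r:.2f}, r^2={r*r:.2f}, Y' = {a:.2f} + {b:.2f}X")
# r=0.77, r^2=0.60, Y' = 2.20 + 0.60X
```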
- In correlation, the intercept is always 0, because both variables are expressed in standardized units.
- The Pearson product-moment correlation coefficient is a ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable.

Testing the Statistical Significance of a Correlation Coefficient
- Begin with the null hypothesis that there is no relationship between the variables.
- The null hypothesis is rejected if there is evidence that the association between the two variables is significantly different from 0.
- The t distribution is not a single distribution but a family of distributions, each with its own degrees of freedom; here, degrees of freedom are defined as the sample size minus 2, or N − 2.
- A two-tailed test is used.
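Concretely, the test statistic is t = r·√(N − 2) / √(1 − r²), evaluated against a t distribution with N − 2 degrees of freedom. A small illustrative computation (the values of r and N are invented):

```python
# Illustrative sketch: t statistic for testing whether a correlation
# coefficient differs significantly from zero (two-tailed, df = N - 2).
# r and N are invented for the example.
import math

r, N = 0.60, 30
df = N - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

print(f"t({df}) = {t:.2f}")
# t(28) = 3.97, which exceeds the two-tailed .05 critical value of roughly
# 2.05 for 28 degrees of freedom, so the null hypothesis would be rejected.
```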
- Linear combination of variables is a weighted composite of the original
How to Interpret a Regression Plot
variables
- Regression plots are pictures that show the relationship between - Y’ = a+b1X1 + … bkXk
variables
- Common use of correlation is to determine the criterion validity
evidence for a test, or the relationship between a test score and some
well-defined criterion
- Middle level of enjoyableness because it is the one observed most
frequently – normative because it uses info gained from representative groups
- Using the test as a predictor is not as good as perfect prediction, but
it is still better than using the normative info
- A regression line such as in 3.9 shows that the test score tells us nothing
about the criterion beyond the normative info

TERMS AND ISSUES IN THE USE OF CORRELATION


Residual
- Difference between the predicted and the observed values is called the
residual
o Y-Y’
- Important property of residual is that the sum of the residuals always equals 0
- Sum of the squared residuals is the smallest value according to the principle
of least squares
Standard Error of Estimate
- Standard deviation of the residuals is the standard error of estimate
- A measure of the accuracy of prediction
- Prediction is most accurate when the standard error of estimate is
relatively small
Coefficient of Determination
- Correlation coefficient squared is known as the coefficient of
determination
- Tells us the proportion of the total variation in scores on Y that we know as
a function of information about X
Coefficient of Alienation
- Coefficient of alienation is a measure of nonassociation between two
variables
- Square root of 1-r2 –-- r is the coefficient of determination
- High value means there is a high degree of nonassociation between 2 variables
Shrinkage
- Tendency to overestimate the relationship, particularly if the sample of
subjects is small
- Shrinkage is the amount of decrease observed when a regression equation
is created for one population and then applied to another
Cross Validation
- Use the regression equation to predict performance in a group of subjects other than the ones on whom the equation was originally derived
- Standard error of estimate obtained for relationship between the values
predicted by the equation and the values actually observed – called cross
validation
The Correlation-Causation Problem
- Experiments are required to determine whether manipulation of one variable causes changes in another variable
- A correlation alone does not prove causality, although it might lead to other research that is designed to establish the causal relationships between variables
Third Variable Explanation
- A third variable, e.g. poor social adjustment, causes both TV viewing and aggression
- External influence is the third variable
Restricted Range
- Correlation and regression use variability on one variable to explain variability on a second variable
- Restricted range problem: correlation requires variability; if the variability is restricted, then significant correlations are difficult to find
Multivariate Analysis
- Multivariate analysis considers the relationship among combinations of three or more variables
General Approach
- Linear combination of variables is a weighted composite of the original variables
- Y' = a + b1X1 + ... + bkXk
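A sketch of fitting the weighted composite Y' = a + b1X1 + ... + bkXk by least squares, with two hypothetical predictors:

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)  # X1, X2
y = np.array([2.0, 2.5, 5.0, 5.5, 7.0])                              # criterion

design = np.column_stack([np.ones(len(y)), X])     # column of 1s for the intercept a
coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # solves for a, b1, b2
a, b1, b2 = coef
print(a, b1, b2)  # weights of the linear combination
```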
CHAPTER 5: RELIABILITY
RELIABILITY
- Dependability and consistency
- Error implies that there will always be some inaccuracy in our measurements
- Tests that are relatively free of measurement error are deemed to be reliable
- Reliability estimates in the range of .70 and .80 are good enough for most purposes in basic research
- Reliability coefficient: an index that indicates the ratio between the true score variance on a test and the total variance
- HISTORY OF RELIABILITY:
o Charles Spearman (1904): The Proof and Measurement of Association between Two Things
o Then Thorndike
o Item response theory has taken advantage of computer technology to advance psychological measurement
o Based on Spearman's ideas
- X = T + E (observed score = true score + error): CLASSICAL TEST THEORY
o assumes that each person has a true score that would be obtained if there were no errors in measurement
o Difference between the true score and the observed score results from measurement error
o Assumption here is that errors of measurement are random
o Basic sampling theory tells us that the distribution of random errors is bell-shaped
The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors
o Classical test theory assumes that the true score for an individual will not change with repeated applications of the same test
o Variance: standard deviation squared. It is useful because it can be broken into components:
o True variance: variance from true differences, assumed to be stable
o Error variance: random irrelevant sources
- Standard error of measurement: we assume that the distribution of random errors will be the same for all people; classical test theory uses the standard deviation of errors as the basic measure of error
o Standard error of measurement tells us, on the average, how much a score varies from the true score
o Standard deviation of the observed score and the reliability of the test are used to estimate the standard error of measurement
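A minimal sketch of that estimate, SEM = SD·√(1 − r), where r is the reliability coefficient; the numbers are illustrative:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r): average distance of observed scores from true scores."""
    return sd * math.sqrt(1 - reliability)

print(standard_error_of_measurement(sd=15, reliability=0.91))  # 4.5
```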
- Reliability: proportion of the total variance attributed to true variance.
o the greater the portion of total variance attributed to true variance, the more reliable the test
- Measurement error: refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured
o Random error: a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores
o Systematic Error: a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
Error is predictable and fixable
Does not affect score consistency
SOURCES OF ERROR VARIANCE
- TEST CONSTRUCTION
o Item sampling or content sampling – refer to variation among items within a test as well as to variation among items between tests
o The extent to which a testtaker's score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance
- TEST ADMINISTRATION
o may influence the testtaker's attention or motivation
o Environment variables, testtaker's variables, examiner variables, level of professionalism
- TEST SCORING AND INTERPRETATION
o Computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences
o However, other tools of assessment still require scoring by trained personnel
o If subjectivity is involved in scoring, then the scorer can be a source of error variance
o Despite rigorous scoring criteria set forth in many of the better known tests of intelligence, examiners occasionally still are confronted by situations where an examinee's response lies in a gray area
TEST-RETEST RELIABILITY
- Also known as time-sampling reliability
- Correlating pairs of scores from the same group on two different administrations of the same test
- Measure something that is relatively stable over time
- Sources of error variance:
o Passage of time: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
o Coefficient of stability: the estimate of test-retest reliability when the interval between testings is greater than 6 months
- Consider possibility of carryover effect: occurs when the first testing session influences scores from the second session
- If something affects all the testtakers equally, then the results are uniformly affected and no net error occurs
- Practice tests may make this effect happen
- Practice can also affect tests of manual dexterity
- Time interval between testing sessions must be selected and evaluated carefully
- Poor test-retest correlations do not always mean that a test is unreliable – may suggest that the characteristic under study has changed
PARALLEL-FORM OR ALTERNATE-FORMS RELIABILITY
- compares two equivalent forms of a test that measure the same attribute
- Two forms should be equally constructed, both format, etc.
- When two forms of the test are available, one can compare performance on one form versus the other – equivalent-forms or parallel-forms reliability
- Coefficient of equivalence: degree of relationship between various forms of a test, evaluated by means of an alternate-forms correlation
- Parallel forms: for each form of the test, the means and variances of observed test scores are equal
- Alternate forms: different versions of a test that have been constructed so as to be parallel
- (1) two test administrations with the same group are required
- (2) test scores may be affected by factors such as motivation etc.
- Problem: developing a new version of a test
INTERNAL CONSISTENCY
- How well does each item measure the content/construct under consideration
- How consistent the items are with one another
- Used when tests are administered once
- If all items on a test measure the same construct, then it has good internal consistency
- Split-half reliability, KR20, Cronbach Alpha
SPLIT-HALF RELIABILITY
- Correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- This is useful when it is impractical to assess reliability with two tests or to administer a test twice
- Results of one half of the test are then compared with the results of the other
- Rules in splitting forms into half:
o Do not divide the test in the middle because it would lower the reliability
o Different amounts of anxiety and differences in item difficulty shall also be considered
o Randomly assign items to one or the other half of the test
o use the odd-even system: where one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items
- To correct for half-length, apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test (see the sketch below)
o Also used if the test user wishes to shorten a test
o Used to determine the number of items needed to attain a desired level of reliability
- Reliability increases as the test length increases
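A sketch of the Spearman-Brown formula in its general form, r' = nr / (1 + (n − 1)r), where n is the factor by which test length changes:

```python
def spearman_brown(r, n):
    """Estimated reliability when a test's length is multiplied by n.
    n = 2: correct a half-test correlation up to full length;
    n < 1: estimate the effect of shortening the test."""
    return (n * r) / (1 + (n - 1) * r)

print(spearman_brown(r=0.70, n=2))    # half-test r of .70 -> about .82
print(spearman_brown(r=0.90, n=0.5))  # halving a .90 test -> about .82
```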
KUDER-RICHARDSON FORMULAS OR KR20/KR21
- Kuder-Richardson technique simultaneously considers all possible ways of splitting the items
- The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see p.114)
- KR21: a simplified version that replaces the sum of the pq products with an approximation based on the mean test score
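A sketch of KR-20 as described, for dichotomous items; rows are testtakers, columns are items, and the data are made up:

```python
import numpy as np

def kr20(items):
    """KR-20 = (k/(k-1)) * (1 - sum(p*q) / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)               # proportion answering each item correctly
    q = 1 - p
    total_var = items.sum(axis=1).var()  # variance of testtakers' total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

data = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
print(kr20(data))
```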
CRONBACH ALPHA
- Cronbach developed a formula that estimates the internal consistency of tests in which the items are not scored as 0 or 1 – a more general reliability estimate, which he called coefficient alpha
- Sum the individual item variances
o Most general method of finding estimates of reliability through internal consistency
- Most widely used as a measure of reliability because it requires only one administration of the test
- Ranges from 0 to 1; "bigger is always better"
- Domain sampling: define a domain that represents a single trait or characteristic, and each item is an individual sample of this general characteristic
- Factor analysis deals with the situation in which a test apparently measures several different characteristics
o Good for the process of test construction
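A sketch of coefficient alpha, which generalizes KR-20 to non-dichotomous items ("sum the individual item variances"); the ratings are hypothetical:

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = (k/(k-1)) * (1 - sum of item variances / total-score variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0).sum()  # sum the individual item variances
    total_var = items.sum(axis=1).var()
    return (k / (k - 1)) * (1 - item_vars / total_var)

ratings = [[3, 4, 3], [2, 2, 3], [5, 4, 4], [1, 2, 1], [4, 5, 4]]
print(cronbach_alpha(ratings))
```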
Other Methods of Estimating Internal Consistencies
- Inter-item consistency: refers to the degree of correlation among all the items on a scale
o A measure of inter-item consistency is calculated from a single administration of a single form of a test
o An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test
o Tests are said to be homogeneous if they contain items that measure a single trait
o Definition: the degree to which a test measures a single factor
o Heterogeneity: degree to which a test measures different factors
o Ex: homo = a test that assesses knowledge only of 3-D television repair skills vs. a general electronics repair test (hetero)
o The more homogeneous a test is, the more inter-item consistency it can be expected to have
o Test homogeneity is desirable because it allows relatively straightforward test-score interpretation
o Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested
o Testtakers with the same score on a heterogeneous test may have quite different abilities
o However, homogeneous testing is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality
Measures of Inter-Scorer Reliability
 In some types of tests under some conditions, the score may be more a function of the scorer than of anything else
 Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
 Coefficient of inter-scorer reliability: coefficient of correlation used to determine the degree of consistency among scorers in the scoring of a test
 Kappa statistic is the best method for assessing the level of agreement among several observers
o Indicates the actual agreement as a proportion of the potential agreement following the correction for chance agreement
o Cohen's Kappa – 2 raters
o Fleiss' Kappa – 3 or more raters
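A sketch of Cohen's kappa for the two-rater case described above; the ratings are hypothetical:

```python
def cohens_kappa(rater1, rater2, categories):
    """Observed agreement corrected for chance agreement."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n      # observed agreement
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n)    # chance agreement
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

r1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(cohens_kappa(r1, r2, ["pass", "fail"]))  # ~0.67
```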
HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
 Homogeneous items have a high degree of reliability
DYNAMIC VS. STATIC CHARACTERISTICS
 Dynamic: trait, state, ability presumed to be ever-changing as a function of situational and cognitive experiences
 Static: trait, state, ability relatively unchanging
RESTRICTION OR INFLATION OF RANGE
 If the range is restricted, reliability tends to be lower.
 If the range is inflated, reliability tends to be higher.
SPEED TESTS VS. POWER TESTS
 Speed test: items are homogeneous and easy, but the time limit is short
 Power test: few items, but more complex.
CRITERION-REFERENCED TESTS
 Provide an indication of where a testtaker stands with respect to some variable or criterion.
 Tend to contain material that has been mastered in hierarchical fashion.
 Scores here tend to be interpreted in pass-fail terms.
 Measure of reliability depends on the variability of the test scores: how different the scores are from one another.
The Domain Sampling Model
- This model considers the problems created by using a limited number of items to represent a larger and more complicated construct
- Our task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of your true ability
- Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test and the variance of the long-run true score
- Reliability can be estimated from the correlation of the observed test score with the true score
Item Response Theory
- Classical test theory requires that exactly the same test items be administered to each person – BAD
- Item response theory (IRT) is newer – the computer is used to focus on the range of item difficulty that helps assess an individual's ability level
o A more reliable estimate of ability is obtained using a shorter test with fewer items
o Takes a lot of items and effort
Generalizability theory
 based on the idea that a person’s test scores vary from testing to testing because
of variables in the testing situation
 Instead of conceiving of all variability in a person’s scores as error, Cronbach
encouraged test developers and researchers to describe the details of the
particular test situation or universe leading to a specific test score
 This universe is described in terms of its facets: which include things like the
number of items in the test, the amount of training the test scorers have had, and
the purpose of the test administration
 According to generalizability theory, given the exact same conditions of all
the facets in the universe, the exact same test score should be obtained
 Universe score: the test score obtained; it's analogous to a true score in the true score model
 Cronbach suggested that tests be developed with the aid of a
generalizability study followed by a decision study
 Generalizability study: examines how generalizable scores from a
particular test are if the test is administered in different situations
 How much of an impact different facets of the universe have on the test score
 Ex: is the test score affected by group as opposed to individual
administration
 Coefficients of generalizability: the influence of particular facets on the test score is
represented by this. These coefficients are similar to reliability coefficients in the
true score model
 Decision study: developers examine the usefulness of test scores in helping
the test user make decisions
 The decision study is designed to tell the test user how test scores should be used
and how dependable those scores are as a basis for decisions, depending on the
context of their use
What to Do About Low Reliability
- Two common approaches are to increase the length of the test and to throw
out items that run down the reliability
- Another procedure is to estimate what the true correlation would have
been if the test did not have measurement error
Increase the Number of Items
- The larger the sample, the more likely that the test will represent the true
characteristic
o This could entail a long and costly process however
- Prophecy formula: the Spearman-Brown formula, used to estimate how reliability changes as items are added
Factor and Item Analysis
- Reliability of a test depends on the extent to which all of the items
measure one common characteristic
- Factor analysis
o Tests are most reliable if they are unidimensional: one factor
should account for considerably more of the variance than
any other factor
- Or examine the correlation between each item and the total score for
the test called
o Discriminability analysis: when the correlation between
the performance on a single item and the total test score is
low, the item is probably measuring something different
from the other items on the test
Correction for Attenuation
- Potential correlations are attenuated, or diminished, by measurement error
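A sketch of the standard correction-for-attenuation formula, r_true = r_xy / √(r_xx · r_yy), which estimates what the correlation would be if both measures were perfectly reliable:

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated true-score correlation, disattenuated for unreliability
    (r_xx and r_yy are the reliabilities of the two measures)."""
    return r_xy / math.sqrt(r_xx * r_yy)

print(correct_for_attenuation(r_xy=0.30, r_xx=0.70, r_yy=0.80))  # ~0.40
```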
CHAPTER 6: VALIDITY
The Concept of Validity
- Validity: as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context
o Judgment based on evidence about the appropriateness of inferences drawn from test scores
o Validity of a test must be shown from time to time to account for culture and advancement
- Inference: a logical result or deduction
- "Acceptable" or "weak" validity of tests and test scores
- Validation: process of gathering and evaluating evidence about validity
o Test user and testtaker both have roles in validation of a test
o Test users may conduct their own validation studies: may yield insights regarding a particular population of testtakers as compared to the norming sample (in manual)
o Local validation studies: absolutely necessary when a test user plans to alter in some way the format, instructions, language, or content of the test
- Types of Validity (Trinitarian view) *not mutually exclusive; all contribute to a unified picture of a test's validity / critique: the approach is fragmented and incomplete
o Content validity: measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test
o Criterion-related validity: measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
o Construct validity: measure of validity that is arrived at by executing a comprehensive analysis of: (umbrella validity; every other variety of validity falls under it)
How scores on the test relate to other test scores and measures
How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
- Strategies: ways of approaching the process of test validation
o Content validation strategies
o Criterion-related validation strategies
o Construct validation strategies
- Face Validity
o Face validity: relates more to what a test appears to measure to the person being tested than to what the test actually measures
o Judgment concerning how relevant the test items appear to be, usually from the testtaker, not the test user
o Lack of face validity = lack of confidence in the perceived effectiveness of the test, which decreases the testtaker's motivation/cooperation *may still be useful
- Content validity
o Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test
o Test blueprint: structure of the evaluation; a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
Behavior observation is a technique frequently used in test blueprinting
o The quantification of content validity
Important in employment settings; tests used to hire and promote
One method: a method for gauging agreement among raters or judges regarding how essential a particular item is (C.H. Lawshe)
"Is the skill or knowledge measured by this item...
o Essential
o Useful but not essential
o Not necessary
To the performance of the job?"
Content validity ratio (CVR):
CVR = (ne – N/2) / (N/2)
o CVR = content validity ratio
o ne = number of panelists stating "essential"
o N = total number of panelists
CVR is calculated for each item (see the sketch below)
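A sketch of Lawshe's formula as given above; the panel counts are hypothetical:

```python
def content_validity_ratio(n_essential, n_total):
    """CVR = (ne - N/2) / (N/2): Lawshe's content validity ratio for one item."""
    return (n_essential - n_total / 2) / (n_total / 2)

print(content_validity_ratio(n_essential=9, n_total=10))  # 0.8
print(content_validity_ratio(n_essential=5, n_total=10))  # 0.0: half say "essential"
```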
o Culture and the relativity of content validity
Tests thought of as either valid or invalid
What constitutes historical fact depends to some extent on who is writing the history
Culture relativity
Politics (politically correct)
Criterion-Related Validity
- Criterion-related validity: judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest (the measure of interest being the criterion)
- 2 types:
o Concurrent validity: index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)
o Predictive validity: index of the degree to which a test score predicts some criterion measure
- What Is a Criterion?
o Criterion: a standard on which a judgment or decision may be based; standard against which a test or a test score is evaluated (criterion-related validity)
o Characteristics of a criterion
Relevancy: pertinent or applicable to the matter at hand
Validity (for the purpose for which it is being used)
Uncontaminated. Criterion contamination: term applied to a criterion measure that has been based, at least in part, on predictor measures
- Concurrent Validity
o Test scores are obtained at about the same time as the criterion measures are obtained; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity
o Indicate the extent to which test scores may be used to estimate an individual's present standing on a criterion
o Once validity of inference from test scores is established = faster, less expensive way to offer a diagnosis or a classification decision
o Concurrent validity of a test can be explored with respect to another test
Prior research must have satisfactorily demonstrated the 1st test's validity
1st test = validating criterion
- Predictive validity
o Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place
Intervening event: training, experience, therapy, medication, etc.
Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test (how accurately scores on the test predict some criterion measure)
o Ex: SAT test score and freshman GPA
o Judgments of criterion validity are based on 2 types of statistical evidence:
The validity coefficient
Validity coefficient: correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure
Ex: Pearson correlation coefficient used to determine validity between 2 measures (r)
Affected by restriction or inflation of range
Is the range of scores employed appropriate to the objective of the correlational analysis?
No rules regarding the validity coefficient (how high or low it should/could be for a test to be valid)
Incremental validity
o More than one predictor
o Incremental validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
Expectancy data
Expectancy data: provide info that can be used in evaluating the criterion-related validity of a test
Score obtained on expectancy tests/tables → likelihood the testtaker will score within some interval of scores on a criterion measure ("passing", "acceptable", etc.)
Expectancy table: shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion
o May be created from a scatterplot
o Shows relationships
Expectancy chart: graphic representation of an expectancy table
o The higher the initial rating, the greater the probability of job/academic success
Taylor-Russell Tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection
Selection ratio – relationship between the number of people to be hired and the number of people available to be hired
Base rate – percentage of people hired under the existing system for a particular position
Relationship between predictor and criterion must be linear
Naylor-Shine Tables – difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
o Decision theory and test utility
Base rate – extent to which a particular trait, behavior, characteristic or attribute exists in the population
Hit rate – the proportion of people a test accurately identifies as possessing or exhibiting a particular trait
Miss rate – proportion of people the test fails to identify as having or not having that attribute
False positive (Type I error) – predicted to possess the attribute but actually does not. Ex: scored above the cutoff score, hired, but failed the job.
False negative (Type II error) – predicted not to possess the attribute but actually does. Ex: scored below the cutoff score, not hired, but could have been successful in the job (see the sketch below).
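A sketch of these decision-theory rates, treating each (prediction, outcome) pair as one applicant; the data are hypothetical:

```python
def decision_rates(pairs):
    """pairs: (predicted_success, actual_success) per person.
    Hit = prediction matches outcome; false positive = selected but failed
    (Type I); false negative = rejected but would have succeeded (Type II)."""
    n = len(pairs)
    hits = sum(p == a for p, a in pairs) / n
    false_pos = sum(p and not a for p, a in pairs) / n
    false_neg = sum((not p) and a for p, a in pairs) / n
    return hits, false_pos, false_neg

pairs = [(True, True), (True, False), (False, False), (False, True), (True, True)]
print(decision_rates(pairs))  # (0.6, 0.2, 0.2)
```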
- Construct Validity
o Construct validity: judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct
Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior
Ex: intelligence, depression, motivation, personality, etc.
Unobservable, presupposed (underlying) traits that a test developer invokes to describe test behavior/criterion performance
Viewed as a unifying concept for all validity evidence
o Evidence of Construct Validity
Various techniques of construct validation that provide evidence:
Test is homogeneous, measures a single construct
Test scores increase/decrease as a function of age, passage of time, or experimental manipulation (as theoretically predicted)
Test scores obtained after some event or passage of time differ from pretest scores (as theoretically predicted)
Test scores obtained by people from distinct groups vary (as theoretically predicted)
Test scores correlate with scores on other tests (as theoretically predicted)
Evidence of homogeneity
Homogeneity: refers to how uniform a test is in measuring a single concept
Evidence: correlations between subtest scores and total test scores
Item-analysis procedures have been used in the quest for test homogeneity
Desirable but not necessary
Contributes no info about how the construct being measured relates to other constructs
Evidence of changes with age
If a test purports to measure a construct that changes over time, then the test scores, too, should show progressive changes to be considered a valid measurement of the construct
Does not in itself provide info about how the construct relates to other constructs
Evidence of pretest-posttest changes
Can be evidence of construct validity
Some typical intervening experiences responsible for changes in test scores are:
o Formal education
o Therapy/medication
o Any life experience
Evidence from distinct groups/method of contrasted groups
Method of contrasted groups: one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group
Rationale: if a test is a valid measure of a particular construct, test scores from groups of people who would be presumed to differ with respect to that construct should have correspondingly different test scores
Convergent evidence
Evidence for the construct validity of a particular test may converge from a number of sources, such as tests or measures designed to assess the same/similar construct
Convergent evidence: scores on a test undergoing construct validation correlate highly in the predicted direction with scores on older, more established and already validated tests designed to measure the same/similar construct
Discriminant evidence
Discriminant evidence: validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated
Provides evidence of construct validity
Multitrait-multimethod matrix: "two or more traits", "two or more methods"; matrix/table that results from correlating variables (traits) within and between methods
Factor analysis
Factor analysis: shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
Frequently used as a data reduction method in which several sets of scores and the correlations between them are analyzed
Exploratory factor analysis: researchers estimate or extract factors from the data; in confirmatory factor analysis, researchers test the degree to which a hypothetical model fits the actual data
o Factor loading: conveys information about the extent to which the factor determines the test score or scores
o Complex procedures
- Validity, Bias, and Fairness
o Test Bias
Bias: a factor inherent in a test that systematically prevents accurate, impartial measurement
Technical means to identify and remedy bias (mathematically)
Bias implies systematic variation
Criterion data may be influenced by the rater's knowledge of ratee race, gender, etc.
Rating error
Rating: a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptions, known as a rating scale
Rating error: judgment resulting from intentional or unintentional misuse of a rating scale
Leniency error/generosity error: error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading
Severity error: rater exhibits a general and systematic reluctance to giving ratings at either the positive or negative extreme
One way to overcome restriction-of-range rating errors is to use rankings: a procedure that requires the rater to measure individuals against one another instead of against an absolute scale
Rater is forced to select 1st, 2nd, 3rd, etc.
Halo effect: for some raters, some ratees can do no wrong
Tendency to give a particular ratee a higher rating than he or she objectively deserves
o Test fairness
Issues of fairness tend to be more difficult and involve values
Fairness: the extent to which a test is used in an impartial, just, and equitable way
Sources of misunderstanding
Discrimination
Group not included in standardization sample
Performance differences between identified groups
Relationship Between Reliability and Validity
- A test should not correlate more highly with any other variable than it correlates with itself
- A modest correlation between the true scores on two traits may be missed if the test for each of the traits is not highly reliable
- We can have reliability without validity
o It is impossible to demonstrate that an unreliable test is valid
CHAPTER 7: UTILITY
Utility: usefulness or practical value of testing to improve efficiency
Factors that Affect a Test's Utility
Psychometric Soundness
o Reliability and validity of a test
o Gives us the practical value of both the scores (reliability and validity)
o They tell us whether decisions are cost-effective
o A valid test is not always a useful test
especially if testtakers do not follow test directions
Costs
o Economic and non-economic
o Ex.) using a less expensive and therefore less stringent application process for airline personnel.
Benefits
o Profits, gains, advantages
o Ex.) more stringent hiring policy → more productive employees
o Ex.) maintaining a successful academic environment at a university
Utility Analysis
What is Utility Analysis?
- A family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment.
Utility Analysis: An Illustration
What's the company's goal?
Limit the cost of selection
o Don't use FERT
Ensure that qualified candidates are not rejected
o Set a cut score that yields the lowest false negative rate
Ensure that all candidates selected will prove to be qualified
o Lowest false positive rate
Ensure, to the extent possible, that qualified candidates will be selected and unqualified candidates will be rejected
o False positives are no better or worse than false negatives
o Highest hit rate and lowest miss rate
How Is a Utility Analysis Conducted?
- The objective dictates what sort of information will be required as well as the specific methods to be used
Expectancy Data
o Expectancy table provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure
o Used to measure costs vs. benefits
Brogden-Cronbach-Gleser formula
o Utility gain: estimate of the benefit of using a particular test or selection method
o Most simply, benefits minus costs
o Productivity gain: estimated increase in work output
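A sketch of one common statement of the Brogden-Cronbach-Gleser utility gain — a productivity benefit term minus the cost of testing. The parameter names and dollar figures here are illustrative assumptions, not the book's worked example:

```python
def utility_gain(n_hired, avg_tenure_yrs, validity, sd_performance_dollars,
                 mean_z_of_hired, n_tested, cost_per_applicant):
    """Utility gain = (N hired)(tenure)(r_xy)(SD_y in dollars)(mean z of hires)
    minus (number tested)(cost per applicant)."""
    benefit = (n_hired * avg_tenure_yrs * validity
               * sd_performance_dollars * mean_z_of_hired)
    cost = n_tested * cost_per_applicant
    return benefit - cost

print(utility_gain(10, 2, 0.35, 12000, 1.0, 60, 100))  # 84000 - 6000 = 78000
```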
Some Practical Considerations
The Pool of Job Applicants
o There is rarely a limitless supply of potential employees
o Dependent on many factors, including the economic environment
o We assume that top-scoring individuals will accept the job, but those individuals are more likely to be the ones being offered higher positions
The Complexity of the Job
o It is questionable whether the same utility analysis methods can be used to measure the eligibility of varying complexities of jobs
The Cut Score in Use
o Relative cut score: may be defined as a reference point based on norm-related considerations rather than on the relationship of test scores to a criterion
Also called a norm-referenced cut score
Ex.) top 10% of test scores get A's
o Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification.
Also called absolute cut scores
o Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of categorizing testtakers
Ex.) having cut scores that mark an A, B, C etc., all measuring the same predictor
o Multiple hurdles: for success, requires one individual to complete many tasks, with elimination at each level
Ex.) written application → group interview → personal interview etc.
o Compensatory model of selection: assumption is made that high scores on one attribute can compensate for low scores on another attribute
Methods for Setting Cut Scores
The Angoff Method
 Judgments of experts are averaged (see the sketch after this section)
The Known Groups Method
 Collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability
 Cut score is based on which test score best discriminates the two groups' performance
IRT-Based Methods
 Based on the testtaker's performance across all items on a test
 Some portion of test items must be correct
 Item-mapping method: determining the difficulty level reflected by the cut score
 Book-Mark method: test items are listed, one per page, in ascending level of difficulty. An expert places a bookmark to mark the divide which separates testtakers who have acquired minimal knowledge, skills, or abilities from those who have not.
 Problems include training of experts, possible floor and ceiling effects, and the optimal length of item booklets
Other Methods
 Discriminant analysis: family of statistical techniques used to shed light on the relationship between certain variables and two or more naturally occurring groups
ex.) the relationship between scores on tests and people judged to be successful or unsuccessful at a job
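A sketch of the Angoff averaging idea described above: each expert estimates, per item, the probability that a minimally competent testtaker answers correctly, and the averaged expected total score becomes the cut score (the probabilities are hypothetical):

```python
def angoff_cut_score(judgments):
    """judgments[j][i]: expert j's probability estimate for item i."""
    expected_scores = [sum(probs) for probs in judgments]  # per-expert expected total
    return sum(expected_scores) / len(expected_scores)     # average across experts

experts = [[0.6, 0.8, 0.5, 0.9],
           [0.7, 0.7, 0.4, 0.8]]
print(angoff_cut_score(experts))  # 2.7 out of 4 items
```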
CHAPTER 8: TEST DEVELOPMENT
TEST DEVELOPMENT
STEPS:
1. TEST CONCEPTUALIZATION
2. TEST CONSTRUCTION
3. TEST TRYOUT
4. ITEM ANALYSIS
5. TEST REVISION
TEST CONCEPTUALIZATION
- Thoughts or stimulus that could be almost anything.
- An emerging social phenomenon or pattern of behavior might serve as the stimulus for the development of a new test.
- Norm-referenced: an item for which high scorers on the test respond correctly; low scorers respond to that same item incorrectly
- Criterion-referenced: high scorers on the test get a particular item right whereas low scorers on the test get that same item wrong.
- Pilot work: pilot study or pilot research. To know whether some items should be included in the final form of the instrument.
o the test developer typically attempts to determine how best to measure a targeted construct
TEST CONSTRUCTION
- Scaling: process of setting rules for assigning numbers in measurement.
- L.L. Thurstone: credited for being at the forefront of efforts to develop methodologically sound scaling methods.
TYPES OF SCALES:
- Nominal, ordinal, interval or ratio
- Age-based scale
- Grade-based scale
- Stanine scale (raw score converted to 1-9)
- Unidimensional vs. multidimensional
o Unidimensional: measuring one construct
o Multidimensional: measuring more than one construct
- Comparative vs. categorical
o Comparative scaling: entails judgments of a stimulus in comparison with every other stimulus on the scale
o Categorical scaling: stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
- Rating Scale: can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
- Summative scale: the final score is obtained by summing the ratings across all the items
- Likert scale: each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum
- Method of paired comparisons: presented with two stimuli and asked to compare
- Comparative scaling: judging of a stimulus in comparison with every other stimulus on the scale
- Categorical scaling: testtaker places stimuli into a category; those categories differ quantitatively on a spectrum.
- Guttman scale (Scalogram analysis): items range from sequentially weaker to stronger expressions of attitude, belief, or feeling. A testtaker who agrees with the stronger statement is assumed to also agree with the milder statements
- Equal-appearing intervals (Thurstone): direct estimation because there is no need to transform the testtaker's response to another scale
WRITING ITEMS
- 3 questions of the test developer:
o What range of content should the items cover?
o Which of the many different types of item formats should be employed?
o How many items should be written in total and for each content area covered?
- Item pool: reservoir from which items will or will not be drawn for the final version of the test (should be about double the number of questions the final will have)
- Item format
o Item format: variables such as the form, plan, structure, arrangement and layout of individual test items
o 2 types:
1.) selected-response format: testtaker selects a response from a set of alternative responses
includes multiple choice, true-false, and matching
2.) constructed-response format: testtaker supplies or creates the correct answer
includes completion item, short answer and essay
- Writing items for computer administration
o Item bank: relatively large and easily accessible collection of test questions
o Computerized Adaptive Testing (CAT): interactive, computer-administered testtaking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items.
o Floor effect: the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
o Ceiling effect: diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or attribute being measured
o Item branching: ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items
SCORING ITEMS
- Cumulative scoring: testtakers earn cumulative credit with regard to a particular construct
- Class/category scoring: testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
- Ipsative scoring: comparing a testtaker's score on one scale within a test to another scale within that same test
o ex.) "John's need for achievement is higher than his need for affiliation"
ITEM WRITING (KAPLAN BOOK)
Item Writing
- Personality and intelligence tests require different sorts of responses
- Guidelines for item writing
o Define clearly what you want to measure
o Generate an item pool
o Avoid exceptionally long items
o Keep the level of reading difficulty appropriate for those who will complete the scale
o Avoid "double-barreled" items that convey two or more ideas at the same time
o Consider mixing positively and negatively worded items
- Must be sensitive to ethnic and cultural differences
- Items that retain their reliability are more likely to focus on skills, while those that lost reliability focused on more abstract concepts
Item Formats
- Simplest test uses the dichotomous format
The Dichotomous Format
- Dichotomous format offers two alternatives for each item
o i.e. true-false examination
- Advantages:
o Simplicity
o True-false items require absolute judgment
- Disadvantages:
o True-false encourages students to memorize material
o "truth" often comes in shades of gray
o mere chance of getting any item correct is 50%
- Yes-no format on personality tests
- Multiple-choice = polytomous
The Polytomous Format
- Polytomous format resembles the dichotomous format except that each item has more than two alternatives
o Multiple-choice exams
- Advantage:
o Little time for test takers to respond to a particular item because they do not have to write
- Incorrect choices are called distractors
- Disadvantages:
o How many distractors should a test have? → 3 or 4
o Distractors hurting reliability/validity of the test
o Three-alternative multiple-choice items may be better than five-alternative items because they retain the psychometric value but take less time to develop and administer
o Scoring of MC exams? → simply guessing should elicit some correctness
o Correcting for this, though, the expected score is 0 – as getting a question wrong loses you a point
- Guessing can be good if you can narrow down a couple of answers
- Students are more likely to guess when they anticipate a lower grade on a test than when they are more confident
- Guessing threshold describes the chances that a low-ability test taker will obtain each score
- True-false and MC tests are common to educational and achievement tests
- Likert format, category scale, and the Q-sort are used for personality-attitude tests
Likert Format
- Likert format: requires that a respondent indicate the degree of agreement with a particular attitudinal question
o Strongly disagree ....... Strongly agree
o For measurements of attitude
- Used to create Likert Scales: scales require assessment of item discriminability
- Familiar and easy – likely to remain popular in personality and attitude tests
Category Format
- Category format: uses more choices than Likert; 10-point rating scale
- Disadvantage: responses to items on 10-pt scales are affected by the groupings of the people or things being rated
- People change their ratings depending on context
o This problem can be avoided if the endpoints of the scale are clearly defined and the subjects are frequently reminded of the definitions of the endpoints
- Optimal number of points is 7?
o Number depends on the fineness of the discrimination that subjects are willing to make
o When people are highly involved with some issue, they will tend to respond best to a greater number of categories
- Increasing the number of response categories may not increase reliability and validity
- Visual analogue scale: respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints
o Measures self-rated health
Checklists and Q-Sorts
- Adjective Checklist: subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
o Requires subjects either to endorse such adjectives or not, thus allowing only two choices for each item
- Q-Sort: increases the number of categories
o Used to describe oneself or to provide ratings of others
Other Possibilities
- Forced-choice and Likert formats are clearly the most popular in contemporary tests and measures
- Checklists have fallen out of favor because they are more prone to error than are formats that require responses to every item on a test
- Frequent advice is to not use "all of the above" as a response option
TEST TRYOUT
What is a good item?
o Reliable and valid
o Helps to discriminate testtakers
ITEM ANALYSIS
- The Item-Difficulty Index
o Obtained by calculating the proportion of the total number of testtakers who answered the item correctly: "p"
o Higher p = easier item
o Difficulty can be replaced with endorsement in non-achievement tests
o The midpoint representing the optimal difficulty is obtained by summing the chance-of-success proportion and 1.00 and then dividing the sum by 2
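A sketch of the item-difficulty index p and the optimal-difficulty midpoint rule just described; the response matrix is hypothetical:

```python
import numpy as np

responses = np.array([   # rows = testtakers, columns = items; 1 = correct
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
])
p = responses.mean(axis=0)   # item-difficulty index per item; higher p = easier
print(p)                     # [0.75 0.75 0.25 0.75]

chance = 0.25                # chance success on a 4-option multiple-choice item
print((chance + 1.00) / 2)   # optimal difficulty midpoint: 0.625
```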
Item Reliability Index
o Indication of the internal consistency of a test
o Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
o Factor analysis and inter-item consistency
o Factor analysis determines whether items on a test appear to be measuring the same thing
The Item-Validity Index
o Statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure
o Requires: the item-score standard deviation and the correlation between the item score and the criterion score
The Item-Discrimination Index
o Measures how adequately an item separates or discriminates between high scorers and low scorers: "d"
o compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
o higher d means a greater number of high scorers answered the item correctly
o negative d means low-scoring examinees are more likely to answer the item correctly than high-scoring examinees
o Analysis of item alternatives
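A sketch of d using upper and lower scoring groups. Taking the top and bottom 27% is a common convention but an assumption here — the notes only say "upper and lower regions":

```python
import numpy as np

def discrimination_index(total_scores, item_scores, fraction=0.27):
    """d = proportion correct in the high group minus proportion in the low group."""
    order = np.argsort(total_scores)
    k = max(1, int(len(total_scores) * fraction))
    low, high = order[:k], order[-k:]
    return item_scores[high].mean() - item_scores[low].mean()

total = np.array([10, 14, 9, 18, 16, 7, 12, 19, 5, 15])  # hypothetical totals
item = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])          # one item's 0/1 scores
print(discrimination_index(total, item))  # positive d: high scorers do better
```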
Item-Characteristic Curves
o Graphic representation of item difficulty and discrimination
Other Considerations in Item Analysis
o Guessing
o Usually in some direction
o Depends on the individual's ability to take risks
o Item fairness
o Bias
o Speed tests
o Last items will appear to be more difficult because not everyone got to them
Qualitative Item Analysis
Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
Qualitative item analysis: various nonstatistical procedures designed to explore how individual test items work
o Through means like interviews and group discussions
"Think aloud" test administration
o approach to cognitive assessment that entails respondents vocalizing thoughts as they occur
o used to shed light on the testtaker's thought processes during the administration of a test
Expert panels
o Sensitivity review: study of test items in which they are examined for fairness to all prospective testtakers as well as for the presence of offensive language, stereotypes, or situations
ITEM ANALYSIS (KAPLAN BASED)
The Extreme Group Method
- Compares people who have done well with those who have done poorly
- Difference between these proportions is called the discrimination index
The Point Biserial Method
- Find the correlation between performance on the item and performance on the total test
- Correlation between a dichotomous variable and a continuous variable is called a point biserial correlation
- On tests with only a few items, using this is problematic because performance on the item contributes to the total test score
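A sketch of the point-biserial correlation, which is simply the Pearson r computed between a 0/1 item and the continuous total score:

```python
import numpy as np

def point_biserial(item_scores, total_scores):
    """Pearson r computed with a dichotomous (0/1) variable."""
    return np.corrcoef(item_scores, total_scores)[0, 1]

total = np.array([10, 14, 9, 18, 16, 7, 12, 19, 5, 15])  # hypothetical totals
item = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])          # one item's 0/1 scores
print(point_biserial(item, total))
```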
Pictures of Item Characteristics
- Valuable way to learn about items is to graph their characteristics, which you can do with the item characteristic curve
- Prepare a graph for each individual test item
o Total test score is used as an estimate of the amount of a "trait" possessed by individuals
- Relationship between performance on the item and performance on the test gives some info about how well the item is tapping the info we want
Drawing the Item Characteristic Curve
- To draw this, we need to define discrete categories of test performance
- If the test has been given to many people, we might choose to make each test score a single category
- Gradual positive slope of the line demonstrates that the proportion of people who pass the item gradually increases as test scores increase
o This means that the item successfully discriminates at all levels of test performance
- Ranges in which the curve changes suggest that the item is sensitive, while flat ranges suggest areas of low sensitivity
- Item analysis breaks the general rule that increasing the number of items makes a test more reliable
- When bad items are eliminated, the effects of chance responding can be eliminated and the test can become more efficient, reliable, and valid
Item Response Theory
- According to classical test theory, a score is derived from the sum of an individual's responses to various items, which are sampled from a larger domain that represents a specific trait or ability
- New approaches consider the chances of getting particular items right or wrong – item response theory – and make extensive use of item analysis
o With this, each item on a test has its own item characteristic curve that describes the probability of getting each particular item right or wrong given the ability level of each test taker
o Testers can make an ability judgment without subjecting the test taker to all of the test items
- Technical adv: builds on traditional models of item analysis and can provide info on item functioning, the value of specific items, and the reliability of a scale
- Two dimensions used are difficulty and discriminability
- Most attractive adv. is that one can easily adapt IRT tests for computer administration
o Computer can rapidly identify the specific items that are required to assess a particular ability level
- "peaked conventional"
- "rectangular conventional" – requires that test items be selected to create a wide range in level of difficulty
o problem: only a few items of the test are appropriate for individuals at each ability level; many test takers spend much of their time responding to items either considerably below their ability level or too difficult to solve
- IRT addresses traditional problems in test construction well
- IRT can identify respondents with unusual response patterns and offer insights into cognitive processes of the test taker
- May also reduce the biases against people who are slow in completing test problems
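A sketch of an item characteristic curve; the two-parameter logistic form used here is an assumption (the notes describe ICCs generically, not a specific IRT model):

```python
import numpy as np

def icc_2pl(theta, a, b):
    """P(correct | ability theta) with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
print(icc_2pl(abilities, a=1.2, b=0.0))  # probability rises with ability, as described above
```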
External Criteria
- Item analysis has been persistently plagued by researchers' continued dependence on internal criteria, or total test score, for evaluating items
Linking Uncommon Measures
- One challenge in test applications is how to determine linkages between two different measures
Items for Criterion-Referenced Tests
- Traditional use of tests requires that we determine how well someone has done on a test by comparing the person's performance to that of others
- Criterion-referenced tests compare performance with some clearly defined criterion for learning
o Popular approach in individualized instruction programs
o Regarded as diagnostic instruments
- First step in developing these tests involves clearly specifying the objectives by writing clear and precise statements about what the learning program is attempting to achieve
- To evaluate the items: one should give the test to two groups of students – one that has been exposed to the learning unit and one that has not
- Bottom of the V is the antimode – the least frequent score
- This point divides those who have been exposed to the unit from those who have not been exposed, and is usually taken as the cutting score or point, or what marks the point of decision
- When people get scores higher than the antimode, we assume that they have met the objective of the test
Limitations of Item Analysis
- Main problem: though statistical methods for item analysis tell the test constructor which items do a good job of separating students, they do not help the students learn
- Although the data are available to give the child feedback on the "bug" in their thinking, nothing in the testing procedure initiates this guidance
TEST REVISION
Test Revision in the Life Cycle of an Existing Test
Tests get old and need revision
Questions arise over the equivalence of two tests
Cross-validation and Co-validation
o Cross-validation: revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
o Validity shrinkage: decrease in item validities that inevitably occurs after cross-validation
o Co-validation: test validation process conducted on two or more tests using the same sample of testtakers
o Co-norming: when co-validation is used in conjunction with the creation of norms or the revision of existing norms
o Quality assurance during test revision
test givers must have some degree of qualification, training, and testing
anchor protocol: test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies
scoring drift: a discrepancy between the scoring in an anchor protocol and the scoring of another protocol
The Use of IRT in Building and Revising Tests (item response theory)
Evaluating the properties of existing tests and guiding test revision
Determining measurement equivalence across testtaker populations
o Differential item functioning (DIF): phenomenon wherein an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait
Developing item banks
o Items from other instruments → item pool scrutiny → preliminary item bank → psychometric testing → item bank
CHAPTER 9: INTELLIGENCE AND ITS MEASUREMENT
What is Intelligence?
Intelligence: a multifaceted capacity that manifests itself in different ways across the lifespan. Usually includes abilities to:
Acquire and apply knowledge
Reason logically
Plan effectively
Infer perceptively
Make judgments and solve problems
Grasp and visualize concepts
Pay attention
Be intuitive
Find the right words and thoughts with facility
Cope with, adjust to, and make the most of new situations
Intelligence Defined: Views of the Lay Public
Both social and academic
Intelligence Defined: Views of Scholars and Test Professionals
Francis Galton
o First to publish on heritability of intelligence
o Most intelligent persons were those with the best sensory abilities
Alfred Binet
o Made tests about intelligence, but didn't define it
o Components of intelligence: reasoning, judgment, memory, abstraction
o Added that the definition is complex; requires interaction of components
o He argued that when one solves a particular problem, the abilities used cannot be separated because they interact to produce the solution.
David Wechsler
o Best way to measure this global ability was by measuring aspects of several "qualitatively differentiable" abilities
o Complexity of intelligence
o Conceptualization as an "aggregate" or "global" capacity
Jean Piaget
o Studied children
o Believed order of maturation to be unchangeable
o With age, increased schema: organized action or mental structure that, when applied to the world, leads to knowing or understanding.
o Learning occurred through assimilation (actively organizing new information so that it fits in with what already is perceived and thought) and accommodation (changing what is already perceived or thought so that it fits with new information)
o Sensorimotor (0-2)
o Preoperational (2-6)
o Concrete Operational (7-12)
o Formal Operational (12 and older)

All share interactionism: complex concept by which heredity and Measuring Intelligence
environment are presumed to interact and influence the development of
one’s intelligence Types of Tasks Used in Intelligence Test
Factor-analytic theories: focus is squarely on identifying the Infants: test sensorimotor, interviews with parents
ability(ies) deemed to constitute intelligence Older child: verbal and performance abilities
Information-processing theories: focus is on identifying the specific Mental Age: index that refers to chronological age equivalent to
mental processes that constitute intelligence. one’s test performance
Adults: retention of general information, quantitative reasoning,
Factor-Analytic Theories of Intelligence: expressive language and memory, and socialjudgment
Charles Spearman: pioneered new techniques to measure Theory in Intelligence Test Development and Interpretation
intercorrelations between tests. Weschler made a dichotomous test (Performance and Verbal), but
o Existence of a general intellectual ability factor (g) that advocated multifaceted definition
tapped by all other mental abilities. Thorndike: intelligence = social, concrete, abstract
g representing the portion of the variance that all intelligence tests have in Putting theories into test are extremely hard
common and the remaining portions of the variance being accounted for
either by specific components (s) or by error components (e) Intelligence: Some Issues:
greater g = better test was thought to predict overall intelligence Nature vs. Nurture
Currently believed to be mix of two
Preformationism: all structures, including intelligence, are present at birth
and can’t be improved upon
Led to predeterminism: one’s abilities are predetermined by genetic
inheritance and no learning or intervention can enhance it
Interactionist: people inherit a certain intellectual potential
o There's a limit set by genetics (e.g., one can't ever have x-ray vision)
The Stability of Intelligence
Stable pretty much throughout one’s adult life
Cognitive abilities seem to decline with age
The Construct Validity of Tests of Intelligence
Having construct validity requires a unified understanding of what
intelligence is
Very difficult: Spearman says it's one thing; Guilford says it's many
The Thorndike approach is a sort of compromise
o Look for one central factor with three additional factors
representing social, concrete, and abstract intelligences
Other Issues
Flynn effect: measured IQ scores have risen steadily over the years, but
the rise is not coupled with a rise in "true intelligence"
Personality
o High IQ: Need for achievement, competition, curiosity,
confidence, emotional stability etc.
o Low IQ: passivity, dependence, maladjustment
o Temperament (used to describe infants)
Gender
o Men usually outscore women on visual spatialization tasks and on
intelligence scores
o Women tend to outscore men on language-skill tasks
o But differences can be bridged
Family Environment
o Divorce can have negative effects
o Begins with “maternal effects” in womb
Culture
o Provides specific models for thinking, acting and feeling
o Assumed that if cultural factors can be controlled then
differences between cultural groups will be lessened
o Assumed that culture can be removed by the reliance on
exclusively nonverbal tasks
Tend not to be very good at predicting success in
various academic and business settings
o Culture loading: the extent to which a test incorporates the
vocabulary, concepts, traditions, knowledge and feelings
associated with a particular culture
o No test can be culture free
o Culture-fair intelligence test: test/assessment process
designed to minimize the influence of culture with regard to
various aspects of evaluation procedure
o Another approach called for culture-specific intelligence tests
Ex.) BITCH measured streetwiseness
Lacked predictive validity and useful, practical
information
CHAPTER 10: TESTS OF INTELLIGENCE
The Stanford-Binet Intelligence Scales
- First to have detailed administration and scoring instructions
- First American test of IQ
- First to use alternate items (an item that can be used in place of another)
- Lacked minority group representation
- Ratio IQ = (mental age / chronological age) x 100 (worked example below)
- Deviation IQ/test composite: the performance of one individual compared
to the performance of others of the same age; has a mean of 100 and a
standard deviation of 16
- Age scale: items grouped by age
- Point scale: items organized by category
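A quick worked example of the two IQ metrics above (all numbers are made up):

```python
# Ratio IQ: mental age relative to chronological age (hypothetical child).
mental_age, chronological_age = 10, 8
ratio_iq = (mental_age / chronological_age) * 100   # 125.0

# Deviation IQ: standing relative to same-age peers, rescaled to
# mean 100, SD 16 (the older Stanford-Binet metric noted above).
z = 1.5                          # assume 1.5 SDs above the age-group mean
deviation_iq = 100 + 16 * z      # 124.0
print(ratio_iq, deviation_iq)
```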
The Stanford-Binet Intelligence Scales: Fifth Edition
- Measures fluid intelligence, crystallized knowledge, quantitative
knowledge, visual processing, and short-term (working) memory
- Utilizes adaptive testing: testing individually tailored to the testtaker to
ensure that items are neither too difficult (frustrating) nor too easy
(false hope)
- Examiner establishes rapport with the testtaker, then administers a
routing test to direct (route) the examinee to the test items most likely
to be at an optimal level of difficulty (toy sketch below)
- Teaching items: show the testtaker what is expected and how to do it
o Can be used for qualitative assessment, but not for scoring
- Subtests for the verbal and nonverbal tests share the same names but
involve different tasks
- Floor: the lowest-level items on a subtest
- Ceiling: the highest-level items on a subtest
- Basal level: a base-level criterion that must be met for testing on the
subtest to continue
- Ceiling level is met when the testtaker fails a certain number of items in
a row; the test is discontinued there
- Scores: raw, standard, and composite
- Extra-test behavior: behavioral observations
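A toy sketch of the routing idea (illustrative only; this is not the actual SB5 routing algorithm, and all names are hypothetical):

```python
def route(items_by_level, passes, start_level=3, n_items=5):
    """Step item difficulty up after each pass, down after each fail.

    items_by_level: dict mapping difficulty level -> item
    passes: callable(item) -> bool, the examinee's response
    """
    level, administered = start_level, []
    for _ in range(n_items):
        item = items_by_level[level]
        administered.append((level, item))
        if passes(item):
            level = min(level + 1, max(items_by_level))   # harder
        else:
            level = max(level - 1, min(items_by_level))   # easier
    return administered

# Hypothetical 5-level item bank; this examinee passes up through level 3,
# so administration oscillates around his or her optimal difficulty.
items = {1: "i1", 2: "i2", 3: "i3", 4: "i4", 5: "i5"}
print(route(items, passes=lambda item: item in ("i1", "i2", "i3")))
```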
The Wechsler Tests
- Commonality between all versions: all yield deviation IQs with a mean of
100 and a standard deviation of 15

Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV)
- Core subtests: administered to obtain a composite score
- Supplemental/optional subtests: provide additional clinical information or
extend the number of abilities or processes sampled
- Yields four index scores: a Verbal Comprehension Index, a Working
Memory Index, a Perceptual Reasoning Index, and a Processing Speed
Index

Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV)
- Process score: an index designed to help understand how testtakers
process various kinds of information
- Often compared with the SB5

Wechsler Preschool and Primary Scale of Intelligence – Third Edition
(WPPSI-III)
- New scale for children under 6
- First major intelligence test that adequately sampled the total population
of the United States
- Subtests labeled core, supplemental, or optional

Wechsler, Binet, and the Short Form
- Short form: a test that has been abbreviated in length to reduce the time
needed to administer, score, and interpret
- Used with caution, and only for screening
- Provides only estimates
- Reducing the number of items usually reduces reliability and thus validity
(see the Spearman-Brown sketch below)
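A standard way to quantify that loss is the Spearman-Brown formula (not derived in these notes; the numbers below are illustrative):

```latex
% Spearman-Brown: reliability of a test shortened to fraction n of its length
r_{\text{new}} = \frac{n\, r_{xx}}{1 + (n - 1)\, r_{xx}}
% Example: a full-length reliability of .90 cut to half length (n = 1/2):
r_{\text{new}} = \frac{.5 \times .90}{1 - .5 \times .90} = \frac{.45}{.55} \approx .82
```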
Wechsler Abbreviated Scale of Intelligence (WASI)
The Wechsler Test in Perspective
- Factor analysis:
o Exploratory factor analysis: used to summarize data when we are
not sure how many factors are present in our data
o Confirmatory factor analysis: used to test a highly specific
hypothesized factor structure
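A minimal numerical illustration of the exploratory side, using a principal-component extraction as a stand-in (the correlations are fabricated, not real test data):

```python
import numpy as np

# Toy correlation matrix for four ability subtests (fabricated values).
R = np.array([
    [1.00, 0.62, 0.55, 0.48],
    [0.62, 1.00, 0.58, 0.51],
    [0.55, 0.58, 1.00, 0.46],
    [0.48, 0.51, 0.46, 1.00],
])

# Eigendecomposition: the dominant component plays the role of a
# g-like general factor; loadings = eigenvector * sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(R)            # eigenvalues ascending
loadings = np.abs(eigvecs[:, -1]) * np.sqrt(eigvals[-1])
print(np.round(loadings, 2))                    # each subtest's loading
print(f"share of variance: {eigvals[-1] / R.shape[0]:.0%}")
```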
Other Measures of Intelligence
Tests Designed for Individual Administration
- Kaufman Adolescent and Adult Intelligence Test
- Kaufman Brief Intelligence Test
- Kaufman Assessment Battery for Children
o Moved away from information processing and toward a distinction
between sequential and simultaneous processing

Tests Designed for Group Administration
- Group testing in the military
o WWI created a need for the government to test intelligence as a
means of differentiating the "unfit" from those of "exceptionally
superior ability"
o Army Alpha Test: given to army recruits who could read; included
general information questions, analogies, and scrambled sentences
to reassemble
o Army Beta Test: given to foreign or illiterate recruits; included
mazes, coding, and picture completion
o After the war, the Alpha and Beta tests were used rampantly, and
oftentimes misused
o Screening tools: instruments or procedures used to identify a
particular trait or constellation of traits
o ASVAB (Armed Services Vocational Aptitude Battery): administered
to prospective recruits or to high school students looking for career
guidance
o 5 career areas: clerical, electronics, mechanical, skill-technical,
and combat operations
- Group testing in schools
o Useful in developing a child's profile – but cannot be the sole
indicator
o Groups of 10-15
o Starting in kindergarten
o Also called traditional group testing, because more modern forms
can utilize the computer; those are more aptly called individual
testing

Measures of Specific Intellectual Abilities
- Widely used intelligence tests sample only some of the many factors
that contribute to intelligence
- Ex.) Creativity
o Commonly thought to be composed of originality, fluency,
flexibility, and elaboration
o If the focus is too heavily on whether an answer is correct, there is
no allowance for creativity
o Achievement tests require convergent thinking: a deductive
reasoning process that entails recall and consideration of facts as
well as a series of logical judgments to narrow down solutions and
eventually arrive at one solution
o Divergent thinking: a reasoning process in which thought is free to
move in many different directions, making several solutions possible
o Ex.: associating words, listing uses of a rubber band, etc.
o Test-retest reliability for some of these tests is near unacceptable
CHAP. 11: OTHER INDIVIDUAL TESTS OF ABILITY IN EDUCATION AND SPECIAL EDUCATION
Alternative Individual Ability Tests Compared with the Binet and Wechsler Scales
- None of these is clearly superior from a psychometric standpoint
- Some are less stable; most are more limited in their documented validity
- Compare poorly to the Binet and Wechsler on all accounts
- They do not rely on a verbal response as much as the Binet and Wechsler
o Some use only pointing or Yes/No responses, and thus do not depend
on the complex integration of visual and motor functioning
- Contain a performance scale or subscale
- Their specificity often limits the range of functions or abilities that they
can measure
- Because they are designed for special populations, some alternatives can
be administered totally without verbal instructions

SPECIFIC INDIVIDUAL ABILITY TESTS
- The earliest individual tests were typically designed for specific purposes
or populations
- One of the first – the Seguin Form Board Test, from the 1800s – produced
only a single score
o Used primarily to evaluate mentally retarded adults; emphasized
speed and performance
- Later, the Healy-Fernald Test was developed as an exclusively nonverbal
test for adolescent delinquents
- Knox developed a battery of performance tests for non-English-speaking
adult immigrants to the US – administered without language; speed not
emphasized
- These early individual tests were designed for specific populations,
produced a single score, and had nonverbal performance scales
o Could be administered without verbal instructions and used with
children as well as adults

Infant Scales
- Where mental retardation or developmental delays are suspected, these
tests can supplement observation, genetic testing, and other medical
procedures

Brazelton Neonatal Assessment Scale (BNAS)
- Individual test for infants between 3 days and 4 weeks of age
- Purportedly provides an index of a newborn's competence
- Favorable reviews
- Considerable research base
- Relatively good psychometric properties
- Wide use as a research tool and as a diagnostic tool for special purposes
- Commonly used scale for the assessment of neonates
- Drawbacks:
o No norms are available
o More research is needed concerning the meaning and implication
of scores
o Poorly documented predictive and construct validity
o Test-retest reliability leaves much to be desired

Gesell Developmental Schedules (GDS)
- Infant intelligence measure
- Used as a research tool by those interested in assessing infant intellectual
development after exposure to mercury, diagnosis of abnormal brain
formation in utero, and assessment of infants with autism
- For children 2.3 months to 6.3 years
- Obtains normative data concerning various stages in maturation
- An individual's developmental quotient (DQ) is determined according to a
test score, which is evaluated by assessing the presence or absence of
behavior associated with maturation
- Provides an intelligence quotient like that of the Binet
o (developmental quotient / chronological age) x 100
- But falls short of acceptable psychometric standards
- Standardization sample not representative of the population
- No reliability or validity data
- Does appear to help uncover subtle deficits in infants

Bayley Scales of Infant and Toddler Development – Third Edition (BSID-III)
- Bases assessments on normative maturational developmental data
- Designed for infants between 1 and 42 months
- Assesses development across 5 domains: cognitive, language, motor,
socioemotional, and adaptive
- Motor scale: assumes that later mental functions depend on motor
development
- Excellent standardization
- Generally positive reviews
- Strong internal consistency
- More validity studies needed
- Widely used in research – children with Down syndrome, pervasive
developmental disorders, cerebral palsy, language impairment, etc.
- Most psychometrically sound test of its kind
- Predictive validity, though?

Cattell Infant Intelligence Scale (CIIS)
- Based on normative developmental data
- Downward extension of the Stanford-Binet scale for 2- to 30-month-olds
- Similar to the Gesell scale
- Rarely used today
- Sample is primarily based on children of parents from the lower and
middle classes and therefore does not represent the general population
- Unchanged for 60 years
- Psychometrically unsatisfactory

MAJOR TESTS FOR YOUNG CHILDREN
McCarthy Scales of Children's Abilities (MSCA)
- Measures ability in children between 2 and 8 years
- Presents a carefully constructed individual test of human ability
- Meager validity
- Produces a pattern of scores as well as a variety of composite scores
- General cognitive index (GCI): standard score with a mean of 100 and a
standard deviation of 16
o The index reflects how well the child has integrated prior learning
experiences and adapted them to the demands of the scales
- Reliability coefficients in the low .90s
- Used in research studies
- Good validity? Good assessment tool

Kaufman Assessment Battery for Children – Second Edition (KABC-II)
- Individual ability test for children between 3 and 18 years
- 18 subtests in 5 global scales called sequential processing, simultaneous
processing, learning, planning, and knowledge
- Intended for psychological, clinical, minority-group, preschool, and
neuropsychological assessment as well as research
- Sequential-simultaneous distinction:
o Sequential processing refers to a child's ability to solve problems by
mentally arranging input in sequential or serial order
o Simultaneous processing refers to a child's ability to synthesize
information from mental wholes in order to solve a problem
- Provides a nonverbal measure of ability too
- Well constructed and psychometrically sound
- Not much evidence of (good) validity
- Poorer predictive validity for school achievement – but smaller
differences between whites and minorities
- Test suffers from a noncorrespondence between its definition and its
measurement of intelligence

GENERAL INDIVIDUAL ABILITY TESTS FOR HANDICAPPED AND SPECIAL
POPULATIONS
Columbia Mental Maturity Scale – Third Edition (CMMS)
- Purports to evaluate ability in normal and variously handicapped children
from 3 to 12 years
- Requires neither a verbal response nor fine motor skills
- Requires the subject to discriminate similarities and differences by
indicating which drawing does not belong on a 6-by-9-inch card containing
3-5 drawings
- Multiple choice
- Standardization sample is impressive
- Vulnerable to random error
- Reliable instrument that is useful in assessing ability in many people with
sensory, physical, or language handicaps
- Good screening device

Peabody Picture Vocabulary Test – Fourth Edition (PPVT-IV)
- For ages 2 to 90 years
- Multiple-choice test that requires the subject to indicate Yes/No in some
manner
- Instructions are administered aloud (not for the deaf)
- Purports to measure hearing or receptive vocabulary, presumably
providing a nonverbal estimate of verbal intelligence
- Can be done in 15 minutes; requires no reading ability
- Good reliability and validity
- Should never be used as a substitute for a Wechsler or Binet IQ
- Important component in a test battery or used as a screening device
- Easy to administer and useful for a variety of groups
- BUT: a tendency to underestimate IQ scores, and the problems inherent
in the multiple-choice format, count against it

Leiter International Performance Scale – Revised (LIPS-R)
- Strictly a performance scale
- Aims at providing a nonverbal alternative to the Stanford-Binet scale for
2- to 18-year-olds
- For research and clinical settings, where it is still widely utilized to assess
the intellectual function of children with pervasive developmental disorders
- Purports to provide a nonverbal measure of general intelligence by
sampling a wide variety of functions from memory to nonverbal reasoning
- Can be applied to the deaf and language-disabled
- Untimed
- Good validity

Porteus Maze Test (PMT)
- Popular but poorly standardized nonverbal performance measure of
intelligence
- Individual ability test
- Consists of maze problems (12)
- Administered without verbal instruction, thus usable with a variety of
special populations
- Needs restandardization

TESTING LEARNING DISABILITIES
- Major concept: a child average in intelligence may fail in school because
of a specific deficit or disability that prevents learning
- Federal law entitles every eligible child with a disability to a free
appropriate public education and emphasizes special education and
related services designed to meet his or her unique needs and prepare
him or her for further education, employment, and independent living
- To qualify, a child must have a disability and educational performance
affected by it
- Educators today can find other ways to determine when a child needs
extra help
- Process called Response to Intervention (RTI): the premise is that early
intervening services can prevent academic failure for many students with
learning difficulties
- Signs of a learning problem:
o Disorganization
o Careless effort
o Forgetfulness
o Refusal to do schoolwork or homework
o Slow performance
o Poor attention
o Moodiness

Illinois Test of Psycholinguistic Abilities (ITPA-3)
- Assumes that failure to respond correctly to a stimulus can result not only
from a defective output system but also from a defective input or
information-processing system
o Stage 1: information must first be received by the senses before it
can be analyzed
o Stage 2: information is analyzed or processed
o Stage 3: with the processed information, the individual must make
a response
- Theorizes that the child may be impaired in one or more specific sensory
modalities
- 12 subtests measure an individual's ability to receive visual, auditory, or
tactile input independently of processing and output factors
- Purports to help isolate the specific site of a learning disability
- For children 2-10 years
- Early versions were hard to administer and had no reliability or validity
documentation
- Now, with revisions, the ITPA-3 is a psychometrically sound measure of
children's psycholinguistic abilities

Woodcock-Johnson III
- Evaluates learning disabilities
- Designed as a broad-range individually administered test to be used in
educational settings
- Assesses general intellectual ability, specific cognitive abilities, scholastic
aptitude, oral language, and achievement
- Based on the CHC three-stratum theory of intelligence
- Compares a child's score on cognitive ability with the score on
achievement – can evaluate possible learning problems
- Relatively good psychometric properties
- For learning disability tests, three conclusions seem warranted:
o 1. Test constructors appear to be responding to the same criticisms
that led to changes in the Binet and Wechsler scales and ultimately
to the development of the KABC
o 2. Much more empirical and theoretical research is needed
o 3. Users of learning disability tests should take great pains to
understand the weaknesses of these procedures and not
overinterpret results

VISIOGRAPHIC TESTS
- Require a subject to copy various designs

Benton Visual Retention Test – Fifth Edition (BVRT-V)
- Tests for brain damage are based on the concept of psychological deficit,
in which a poor performance on a specific task is related to or caused by
some underlying deficit
- Assumes that brain damage easily impairs visual memory ability
- For individuals 8 years and older
- Consists of geometric designs briefly presented and then removed
- A computerized version has been developed

Bender Visual Motor Gestalt Test (BVMGT)
- Consists of 9 geometric figures that the subject is simply asked to copy
- By age 9, any child of normal intelligence can copy the figures with only
one or two errors
- Errors occur for people whose mental age is less than 9, or who have
brain damage, nonverbal learning disabilities, or emotional problems
- Questionable reliability

Memory-for-Designs (MFD) Test
- Drawing test that involves perceptual-motor coordination
- Used for people 8-60 years
- Good split-half reliability
- Needs validity documentation
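A sketch of how a split-half coefficient like that one is computed, with the usual Spearman-Brown step-up to full length (the item data below are fabricated):

```python
import numpy as np

def split_half_reliability(item_scores):
    """item_scores: rows = examinees, columns = items (here 0/1)."""
    odd = item_scores[:, 0::2].sum(axis=1)    # odd-item half
    even = item_scores[:, 1::2].sum(axis=1)   # even-item half
    r_half = np.corrcoef(odd, even)[0, 1]
    # Correct the half-test correlation up to full test length.
    return 2 * r_half / (1 + r_half)

# Fabricated responses: 50 examinees, 20 items, driven by a latent ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))
scores = (rng.random((50, 20)) < 1 / (1 + np.exp(-ability))).astype(int)
print(round(split_half_reliability(scores), 2))
```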
- All of these visiographic tests are criticized because of their limitations
in reliability and validity documentation
- Good as screening devices, though

Creativity: Torrance Tests of Creative Thinking (TTCT)
- Measurement of creativity is underdeveloped in psychological testing
- Creativity: the ability to be original, to combine known facts in new ways,
or to find new relationships between known facts
- Evaluating creativity is a possible alternative to IQ testing
- Creativity tests are in early stages of development
- The Torrance tests separately measure aspects of creative thinking such
as fluency, originality, and flexibility
- Do not meet the Binet and Wechsler scales in terms of standardization,
reliability, or validity
- Unbiased indicator of giftedness
- Inconsistent tests, but the available data reflect the tests' merit and fine
potential

Individual Achievement Tests: Wide Range Achievement Test-4 (WRAT-4)
- Achievement tests measure what the person has actually acquired or
done with his or her potential
- Discrepancies between IQ and achievement have traditionally been the
main defining feature of a learning disability (toy example below)
- Most achievement tests are group tests
- The WRAT-4 purportedly permits an estimate of grade-level functioning in
word reading, spelling, math computation, and sentence comprehension
- Used for children 5 years and older
- Easy to administer
- Problems:
o Inaccuracy in evaluating grade-level reading ability
o Not proven to be psychometrically sound
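A toy version of the IQ-achievement discrepancy logic noted above (the cutoff and scores are hypothetical, not from any actual regulation):

```python
# Both scores on the same standard metric (mean 100, SD 15).
iq_standard = 108
reading_achievement = 82

discrepancy = iq_standard - reading_achievement      # 26 points
if discrepancy >= 1.5 * 15:                          # e.g., a 1.5-SD rule
    print("discrepancy consistent with a learning-disability referral")
else:
    print("no qualifying discrepancy")
```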
CHAP. 12: STANDARDIZED TESTS IN EDUCATION, CIVIL SERVICE, AND THE MILITARY
- When justifying the use of group standardized tests, test users often have
problems defining what exactly they are trying to predict, or what the
test criterion is

Comparison of Group and Individual Ability Tests
- Individual tests require a single examiner for a single subject
o Examiner provides instructions
o Subject responds; examiner records the response
o Examiner evaluates the response
o Examiner takes responsibility for eliciting a maximum performance
o Scoring requires considerable skill
- Those who use the results of group tests must assume that the subject
was cooperative and motivated
o Many subjects are tested at a time
o Subjects record their own responses
o Subjects are not praised for responding
o Low scores on group tests are often difficult to interpret
o No safeguards

Advantages of Individual Tests
- Provide information beyond the test score
- Allow the examiner to observe behavior in a standard setting
- Allow individualized interpretation of test scores

Advantages of Group Tests
- Cost-efficient
- Minimize professional time for administration and scoring
- Require less examiner skill and training
- Have more objective and more reliable scoring procedures
- Have especially broad application

OVERVIEW OF GROUP TESTS
Characteristics of Group Tests
- Characterized as paper-and-pencil or booklet-and-pencil tests because
the only materials needed are a printed booklet of test items, a test
manual, a scoring key, an answer sheet, and a pencil
- Computerized group testing is becoming more popular
- Most group tests are multiple choice – some are free response
- Group tests outnumber individual tests
o One major difference among them is whether the test is primarily
verbal, nonverbal, or a combination
- Group test scores can be converted to a variety of units

Selecting Group Tests
- The test user need never settle for anything but well-documented and
psychometrically sound tests

Using Group Tests
- The best group tests are as reliable and well standardized as the best
individual tests
- Validity data for some group tests are weak, meager, or contradictory
Use Results with Caution
- Never consider scores in isolation or as absolutes
- Be careful using tests for prediction

Summary of K-12 Group Tests
- All are sound, viable instruments

College Entrance Tests
- SAT Reasoning Test, Cooperative School and College Ability Tests, and
American College Test

SAT Reasoning Test
- Most widely used college entrance test
- Used by more than 1,000 private and public institutions
- Renorming of the SAT did not alter the standing of test takers relative to
one another in terms of percentile rank
- New scoring (2400) is likely to reduce interpretation errors, as
interpreters can no longer rely on comparisons with older versions
- 45 minutes longer – 3 hours and 45 minutes to administer
o May disadvantage students with disabilities such as ADD
- Verbal section now called "critical reading" – focus on reading
comprehension
- Math section eliminated much of the basic grammar-school math
questions
- Weakness: poor predictive power regarding the grades of students who
score in the middle ranges
- Little doubt that the SAT predicts first-year college GPA
o But African Americans and Latinos tend to obtain lower scores on
average
o Women score lower on the SAT but higher in GPA

Cooperative School and College Ability Tests
- Falling out of favor
- Developed in 1955; has not been updated
- Purports to measure school-learned abilities as well as an individual's
potential to undertake additional schooling
- Psychometric documentation is not strong
- Little empirical data support its major assumption – that previous success
in acquiring school-learned abilities can predict future success in
acquiring such abilities

American College Test (ACT)
- Updated in 2005; particularly useful for non-native speakers of English
- Produces specific content scores and a composite
- Makes use of the Iowa Test of Educational Development scale
- Compares with the SAT in terms of predicting college GPA alone or in
conjunction with high-school GPA (see the regression sketch below)
- Internal consistency coefficients are not as strong as the SAT's
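What "predicting college GPA alone or in conjunction with high-school GPA" amounts to is a regression like the sketch below (all numbers are fabricated):

```python
import numpy as np

# Fabricated records: entrance-test score, HS GPA, first-year college GPA.
test  = np.array([400, 450, 500, 550, 600, 650, 700], dtype=float)
hsgpa = np.array([2.4, 2.8, 3.0, 3.2, 3.4, 3.7, 3.9])
fygpa = np.array([2.2, 2.6, 2.7, 3.0, 3.1, 3.5, 3.8])

# Least-squares fit of college GPA on test score plus HS GPA.
X = np.column_stack([np.ones_like(test), test, hsgpa])
coef, *_ = np.linalg.lstsq(X, fygpa, rcond=None)
predicted = X @ coef
print(f"multiple R = {np.corrcoef(predicted, fygpa)[0, 1]:.2f}")
```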
GRADUATE AND PROFESSIONAL SCHOOL ENTRANCE TESTS
Graduate Record Examination (GRE) Aptitude Test
- Purports to measure general scholastic ability
- Most frequently used in conjunction with GPA, letters of recommendation,
and other academic factors
- General section yields verbal and quantitative scores
- Third section evaluates analytical reasoning – now in essay format
- Contains an advanced section that measures achievement in at least
20 majors
- New 130-170 scoring scale
- Standard mean score of 500 and SD of 100
- Normative sample is relatively small
- Psychometric adequacy is less than that of the SAT – validity and reliability
- Predictive validity is not great
o Overpredicts the achievement of younger students while
underpredicting the performance of older students
- Many schools have developed their own norms and psychometric
documentation and can use the GRE to predict success in their programs
- By looking at a GRE score in conjunction with GPA, graduate success can
be predicted with greater accuracy than without the GRE
- Graduate schools also frequently complain that grades no longer predict
scholastic ability well because of grade inflation – the phenomenon of
rising average college grades despite declines in average SAT scores
o Led to a corresponding restriction in the range of grades
- As the validity of grades and letters of recommendation becomes more
questionable, reliance on test scores increases
- Definite overall decline in verbal scores, while quantitative and analytical
scores are gradually rising

Miller Analogies Test (MAT)
- Designed to measure scholastic aptitude for graduate studies
- Strictly verbal
- 60 minutes
- Knowledge of specific content and a wide vocabulary are very useful
- Most important factors appear to be the ability to see relationships and a
knowledge of the various ways analogies can be formed
- Psychometric adequacy is reasonable
- Does not predict research ability, creativity, and other factors important
to graduate school

The Law School Admission Test (LSAT)
- Problems require almost no specific knowledge
- Extreme time pressure
- Three types of problems: reading comprehension, logical reasoning
(about half), and analytical reasoning
- Weight given to the LSAT score is openly published for each school
approved by the American Bar Association
- Entrance into schools is based on a weighted sum of LSAT score and GPA
- Psychometrically sound; reliability coefficients in the .90s
- Predicts first-year GPA in law school
- Content validity is exceptional
- Questions of bias for minority group members, as well as women

NONVERBAL GROUP ABILITY TESTS
Raven Progressive Matrices (RPM)
- One of the best known and most popular nonverbal group tests
- Suitable anytime one needs an estimate of an individual's general
intelligence
- For groups or individuals, ages 5 through adulthood
- Used throughout the modern world
- Uses matrices – nonverbal; administered with or without a time limit
- Research supports the RPM as a measure of general intelligence, or
Spearman's g
- Appears to minimize the effects of language and culture
- Tends to cut in half the selection bias that occurs with the Binet or
Wechsler

Goodenough-Harris Drawing Test (G-HDT)
- Nonverbal intelligence test, group or individual
- Quick, easy, and inexpensive
- Subject is instructed to draw a picture of a whole man and to do the best
job possible
- Details earn points
- One can determine mental ages by comparing scores with those of the
normative sample
- Raw scores can be converted to standard scores with a mean of 100 and
SD of 15 (conversion sketched below)
- Used extensively in test batteries
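The raw-to-standard conversion mentioned above is just a z-score rescaling; a sketch with made-up norm statistics:

```python
# Hypothetical raw-score norms for one age group of the G-HDT.
norm_mean, norm_sd = 34.0, 8.0

raw = 42.0
z = (raw - norm_mean) / norm_sd        # position within the norm group
standard_score = 100 + 15 * z          # rescaled to mean 100, SD 15 -> 115.0
print(standard_score)
```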
The Culture Fair Intelligence Test
- Designed to provide an estimate of intelligence relatively free of cultural
and language influences
- Paper-and-pencil procedure that covers three age groups
- Two parallel forms are available
- Acceptable measure of fluid intelligence

Standardized Tests Used in the US Civil Service System
- General Aptitude Test Battery (GATB): reading ability test that
purportedly measures aptitude for a variety of occupations
o Used to make employment decisions in government agencies
o Attempts to measure a wide range of aptitudes, from general
intelligence to manual dexterity
- Controversial because it used within-group norming prior to the passage
of the Civil Rights Act of 1991 (sketch below)
- Today, any kind of score adjustment through within-group norming in
employment practices is strictly forbidden by law
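What within-group norming did, in sketch form: the same raw score is converted to a percentile within the applicant's own group rather than within the whole applicant pool (the distributions below are fabricated):

```python
import numpy as np

def percentile_within(raw, group_scores):
    """Percentage of the comparison group scoring below `raw`."""
    return 100.0 * np.mean(np.asarray(group_scores) < raw)

group_a = [40, 45, 50, 55, 60, 65, 70]   # fabricated raw scores, group A
group_b = [30, 35, 40, 45, 50, 55, 60]   # fabricated raw scores, group B

raw = 55
print(percentile_within(raw, group_a))   # ~42.9: lower rank within group A
print(percentile_within(raw, group_b))   # ~71.4: higher rank within group B
```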
Standardized Tests in the US Military: The Armed Services Vocational
Aptitude Battery (ASVAB)
- Administered to more than 1.3 million people a year
- Designed for students in grades 11 and 12 and in postsecondary schools
- Yields scores that can help identify students who potentially qualify for
entry into the military and can recommend assignment to various military
occupational training programs
- Great psychometric qualities; reliability coefficients are excellent
- Through a computerized format, subjects can be tested adaptively,
meaning that the questions given each person can be based on his or her
unique ability
o This cuts testing time in half