
PSYCHOLOGICAL ASSESSMENT

HISTORICAL PERSPECTIVE
Historical Perspective
Early Antecedents
Antiquity to the 19th Century
Evidence suggests that the Chinese had a relatively sophisticated civil service testing program more than 4,000 years ago, as early as 2200 B.C. Every third year in China, oral examinations were given to help determine work evaluations and promotion decisions.
By the Han Dynasty (206 B.C.E. to 220 C.E.), the use of test
batteries (two or more tests used in conjunction) was quite
common. These early tests related to such diverse topics as
civil law, military affairs, agriculture, revenue, and
geography. Tests had become quite well developed by the
Ming Dynasty (1368–1644 C.E.).
Reports by British missionaries and diplomats encouraged
the English East India Company in 1832 to copy the Chinese
system as a method of selecting employees for overseas
duty. Because testing programs worked well for the
company, the British government adopted a similar system
of testing for its civil service in 1855.
The French and German governments followed suit. In 1883,
the U.S. government established the American Civil Service
Commission, which developed and administered
competitive examinations for certain government jobs. 
Ancient Greco-Roman writings are indicative of attempts to categorize people in terms of personality types (e.g., references to an abundance or deficiency of some bodily fluid, such as blood or phlegm).
Charles Darwin and Individual
Differences
To develop a measuring device, we must
understand what we want to measure.
An important step toward understanding
individual differences came with the publication
of Charles Darwin’s highly influential book, The
Origin of Species, in 1859.
Darwin spurred interest in individual differences. He argued that chance variation in species would be selected or rejected by nature according to adaptability and survival value, and that individual differences are of the highest importance, for they afford the materials for natural selection to act on.
Sir Francis Galton, a relative of Darwin’s, soon began applying
Darwin’s theories to the study of human beings
Given the concepts of survival of the fittest and individual
differences, Galton set out to show that some people possessed
characteristics that made them more fit than others, a theory he
articulated in his book Hereditary Genius, published in 1869. He
aspired to classify people "according to their natural gifts" and to ascertain their "deviation from the average."
He realized the need for measuring the characteristics of related and unrelated persons and focused on INDIVIDUAL DIFFERENCES.
Galton was instrumental in inducing a number of educational institutions to keep systematic ANTHROPOMETRIC RECORDS of their students. In 1884, Galton set up an anthropometric laboratory at the International Exposition, where visitors could be measured on variables such as height (standing and sitting), arm span, weight, breathing capacity, keenness of vision and hearing, strength of pull, strength of squeeze, swiftness of blow, memory of form, discrimination of color, hand steadiness, reaction time, and other simple sensorimotor functions. For these efforts, Galton is credited as primarily responsible for launching the testing movement. He pioneered the application of rating-scale and questionnaire methods (including self-report inventories) and was also responsible for developing statistical methods for the analysis of data on individual differences (e.g., the coefficient of correlation).
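Galton's correlational idea survives today as the Pearson product-moment correlation, later formalized by Karl Pearson. A minimal sketch of the computation, using invented anthropometric data (the variable names and values are purely illustrative):

    import statistics

    def pearson_r(x, y):
        """Pearson product-moment correlation between two equal-length samples."""
        mean_x, mean_y = statistics.mean(x), statistics.mean(y)
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
        sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
        return cov / (sd_x * sd_y)

    # Invented data: height (cm) and arm span (cm) for five visitors
    height = [160, 165, 170, 175, 180]
    arm_span = [158, 166, 172, 176, 183]
    print(round(pearson_r(height, arm_span), 3))  # near +1.0: strong positive association

A coefficient near +1 or -1 indicates a strong linear relationship; a coefficient near 0 indicates little or none.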
Psychologist James McKeen Cattell coined the term "mental test." Cattell's doctoral dissertation was based on Galton's work on individual differences in reaction time. As such, Cattell perpetuated and stimulated the forces that ultimately led to the development of modern tests. He became active in the spread of the testing movement and was instrumental in founding the Psychological Corporation.
1838: Esquirol
French physician whose two-volume work made the first explicit distinction between mentally retarded and insane individuals.
More than 100 pages of his work were devoted to "mental retardation."
Esquirol pointed out that there are many degrees of mental retardation.
The individual's use of language provides the most dependable criterion of his intellectual level.

Seguin
Another French physician.
Pioneered the training of mentally retarded persons.
1837: established the first school devoted to the education of mentally retarded children.
1848: migrated to the USA and made suggestions regarding the training of mentally retarded persons.
Some of the procedures developed by Seguin were eventually incorporated into performance or nonverbal tests of intelligence.
Experimental Psychology and
Psychophysical Measurement
J. E. Herbart developed mathematical models of the mind, which he eventually used as the basis for educational theories that strongly influenced 19th-century educational practices. Following Herbart, E. H. Weber attempted to demonstrate the existence of a psychological threshold, the minimum stimulus necessary to activate a sensory system. Then, following Weber, G. T. Fechner devised the law that the strength of a sensation grows as the logarithm of the stimulus intensity.
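In its standard formulation (with k an empirically fitted constant and I_0 the threshold intensity established in Weber's work), Fechner's law reads

    S = k \log \frac{I}{I_0}

so that equal ratios of stimulus intensity I produce equal increments in sensation strength S.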
Wilhelm Wundt, who set up a laboratory at the University of Leipzig in 1879, is credited with founding the science of psychology.
Wundt was succeeded by E. B. Titchener, whose student, G. Whipple, recruited L. L. Thurstone. Whipple provided the basis for immense changes in the field of testing by conducting a seminar at the Carnegie Institute in 1919 attended by Thurstone, E. Strong, and other early prominent U.S. psychologists. From this seminar came the Carnegie Interest Inventory and later the Strong Vocational Interest Blank.
Tests were also needed for classifying and identifying the mentally and emotionally handicapped. One of the earliest tests resembling current procedures, the Seguin Form Board Test, was developed in an effort to educate and evaluate the mentally disabled. Similarly, Kraepelin devised a series of examinations for evaluating emotionally impaired people.
The French minister of public instruction appointed a
commission to study ways of identifying intellectually
subnormal individuals in order to provide them with
appropriate educational experiences. One member of that
commission was Alfred Binet. Working in conjunction with the
French physician T. Simon, Binet developed the first major
general intelligence test.
The Evolution of Intelligence
and Standardized Achievement
Tests
The result, the Binet-Simon Scale, was published in 1905. This instrument contained 30 items of increasing difficulty and was designed to identify intellectually subnormal individuals.
A representative sample is one that comprises individuals
similar to those for whom the test is to be used. When the
test is used for the general population, a representative
sample must reflect all segments of the
population in proportion to their actual numbers.
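As a small illustrative computation of that proportionality requirement (the population segments and shares below are invented), proportional allocation simply scales each segment's share of the population to the planned sample size:

    # Invented population segments and their hypothetical shares
    population_shares = {"urban": 0.55, "suburban": 0.30, "rural": 0.15}
    sample_size = 1000

    # Each segment contributes in proportion to its actual numbers
    allocation = {group: round(share * sample_size)
                  for group, share in population_shares.items()}
    print(allocation)  # {'urban': 550, 'suburban': 300, 'rural': 150}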
The 1908 Binet-Simon Scale also determined a child's mental age, thereby introducing a historically significant concept. By 1916, L. M. Terman of Stanford University had revised the Binet test for use in the United States. Terman's revision became known as the Stanford-Binet Intelligence Scale.
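The mental-age concept is often summarized by the classic ratio IQ (proposed by William Stern and popularized by Terman's revision):

    \mathrm{IQ} = \frac{\text{mental age (MA)}}{\text{chronological age (CA)}} \times 100

For example, a child of chronological age 8 who passes items typical of 10-year-olds (MA = 10) earns a ratio IQ of (10/8) x 100 = 125.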
World War I
During World War I, the U.S. Army requested the
assistance of Robert Yerkes, who was then
the president of the American Psychological
Association. Yerkes headed a committee of
distinguished psychologists who soon
developed two structured group tests of
human abilities: the Army Alpha and the
Army Beta. The Army Alpha required reading
ability, whereas the Army Beta measured
the intelligence of illiterate
adults.
Achievement Tests
Standardized achievement tests provide
multiple-choice questions that are
standardized on a large sample to produce
norms against which the results of new
examinees can be compared.

In 1923, the development of standardized achievement tests culminated in the publication of the Stanford Achievement Test by T. L. Kelley, G. M. Ruch, and L. M. Terman.
Rising to the Challenge
In 1939, David Wechsler published the first version of the Wechsler intelligence scales, the Wechsler-Bellevue Intelligence Scale (W-B). The Wechsler-Bellevue scale contained several interesting innovations in intelligence testing; notably, it yielded a performance IQ in addition to a verbal IQ.
Personality Tests:
1920–1940
One of the basic goals of traditional personality tests is to measure traits. Traits are relatively enduring dispositions (tendencies to act, think, or feel in a certain manner in any given circumstance) that distinguish one individual from another.
The first structured personality test, the Woodworth Personal Data Sheet, was developed during World War I and was published in final form just after the war. Woodworth then developed a personality test for civilian use that was based on the Personal Data Sheet. He called it the Woodworth Psychoneurotic Inventory. This instrument was the first widely used self-report test of personality.

An early projective personality test was the Rorschach inkblot test, first published by Herman Rorschach of Switzerland in 1921.
It was followed by the Thematic Apperception Test (TAT), published by Henry Murray and Christiana Morgan in 1935. The TAT required the subject to make up a story about an ambiguous scene. The TAT purported to measure human needs and thus to ascertain individual differences in motivation.
The Emergence of New
Approaches to Personality
Testing
In 1943, the Minnesota Multiphasic Personality Inventory (MMPI) began a new era for structured personality tests. The MMPI used empirical methods to determine the meaning of a test response.
Factor analysis is a method of finding the
minimum number of dimensions (characteristics,
attributes), called factors, to account for a large
number of variables.
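As an illustrative sketch of the idea, here is a factor analysis run on fabricated questionnaire data using scikit-learn (the item structure, loadings, and sample size are all invented for demonstration):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    # Fabricate responses to 6 items driven by 2 latent traits (factors)
    latent = rng.normal(size=(300, 2))               # 300 respondents, 2 hidden factors
    loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                         [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
    items = latent @ loadings.T + 0.3 * rng.normal(size=(300, 6))

    fa = FactorAnalysis(n_components=2)
    fa.fit(items)
    # Rows correspond to factors; large absolute entries show which
    # items "load on" which factor, recovering the 2-factor structure
    print(fa.components_.round(2))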
In the early 1940s, J. P. Guilford made the first serious attempt to use factor analytic techniques in the development of a structured personality test. Later, R. B. Cattell introduced the Sixteen Personality Factor Questionnaire (16PF).
The Period of Rapid Changes
in the Status of Testing
By 1949, formal university training standards
had been developed and accepted, and
clinical psychology was born.
A position paper of the American
Psychological Association published 7 years
later (APA, 1954) affirmed that the domain of
the clinical psychologist included testing.
The Current Environment
Neuropsychologists use tests in hospitals
and other clinical settings to assess brain
injury. Health psychologists use tests and
surveys in a variety of medical settings.
Forensic psychologists use tests in the
legal system to assess mental state as it
relates to an insanity defense, competency to
stand trial or to be executed, and emotional
damages.
Child psychologists use tests to assess
childhood disorders.
Cultural and Legal/Ethical
Considerations
Some Issues Regarding Culture and
Assessment 
Communication between assessor and assessee
is a most basic part of assessment. Assessors
must be sensitive to any differences between the
language or dialect familiar to assessees and the
language in which the assessment is conducted

Verbal communication. Language, the means by which information is communicated, is a key yet sometimes overlooked variable in the assessment process.
Nonverbal communication and behavior. Humans communicate not only through verbal means but also through nonverbal means. Facial expressions, finger and hand signs, and shifts in one's position in space may all convey messages.

Standards of evaluation
Psychology, tests, and public policy
 
Tests and Group Membership
 
Legal and Ethical Considerations
 
Laws are rules that individuals must obey for the good of the
society as a whole—or rules thought to be for the good of society
as a whole.

When a code of professional ethics is recognized and accepted by members of a profession, it defines the standard of care expected of members of that profession.
The Concerns of the Public

The Concerns of the Profession


Test-user qualifications
Testing people with disabilities raises three challenges: (1) transforming the test into a form that can be taken by the testtaker, (2) transforming the responses of the testtaker so that they are scorable, and (3) meaningfully interpreting the test data.

The Rights of Testtakers


The right of informed consent. Testtakers have a right to know why they are being evaluated, how the test data will be used, and what (if any) information will be released to whom. The written form should specify (1) the general purpose of the testing, (2) the specific reason it is being undertaken in the present case, and (3) the general type of instruments to be administered.
The right to be informed of test findings
The right to privacy and confidentiality
The right to the least stigmatizing label
INTERVIEWING TECHNIQUES
Interviews

1. Directive
2. Nondirective

The Interview as a Test


Reciprocal Nature of Interviewing

Social facilitation: we tend to act like the models around us (see Augustine, 2011). If
the interviewer is tense, anxious, defensive, and aloof, then the interviewee tends to
respond in kind. Thus, if the interviewer wishes to create conditions of openness,
warmth, acceptance, comfort, calmness, and support, then he or she must exhibit
these qualities.
Principles of Effective Interviewing
The Proper Attitudes
The session received a good evaluation by both participants when the patient
saw the interviewer as warm, open, concerned, involved, committed, and
interested, regardless of subject matter or the type or severity of the problem.
On the other hand, independent of all other factors, when the interviewer was
seen as cold, defensive, uninterested, uninvolved, aloof, and bored, the session
was rated poorly. To appear effective and establish rapport, the interviewer
must display the proper attitudes
Good interviewing is actually more a matter of attitude than skill (Duan &
Kivlighan, 2002). Experiments in social psychology have shown that
interpersonal influence (the degree to which one person can influence another)
is related to interpersonal attraction (the degree to which people share a
feeling of understanding, mutual respect, similarity, and the like) (Dillard &
Marshall, 2003). Attitudes related to good interviewing skills include warmth,
genuineness, acceptance, understanding,
openness, honesty, and fairness.
Responses to Avoid
As a rule, however, making interviewees feel uncomfortable tends to place them on
guard, and guarded or anxious interviewees tend to reveal little information about
themselves. If the goal is to elicit as much information as possible or to receive a
good rating from the interviewee, then interviewers should avoid certain responses,
including judgmental or evaluative statements, probing statements, hostility, and
false reassurance.

Judgmental or evaluative statements are particularly likely to inhibit the interviewee. Being judgmental means evaluating the thoughts, feelings, or actions of another. When we use such terms as good, bad, excellent, terrible, disgusting, disgraceful, and stupid, we make evaluative statements.

- Most interviewers should also avoid probing statements. A common way to phrase a probing statement is to ask a question that begins with "Why?" Asking "Why?" tends to place others on the defensive.
- Replace probing statements with "Tell me" or "How?" statements.
The hostile statement directs anger toward the interviewee. Clearly, one
should avoid such responses unless one has a specific purpose, such as determining
how an interviewee responds to anger.
The reassuring statement attempts to comfort or support the interviewee: “Don’t worry.
Everything will be all right.” Though reassurance is sometimes appropriate, you should almost
always avoid false reassurance. For example, imagine a friend of yours flunks out of college,
loses her job, and gets kicked out of her home by her parents. You are lying to this person
when you say, “Don’t worry; no problem; it’s okay.” This false reassurance does nothing to help
your friend except perhaps make her realize that you are not going to help her.
Effective Responses
One major principle of effective interviewing is keeping the interaction flowing.
The interview is a two-way process; one person speaks first, then the other, and
so on
Responses to Keep the Interaction Flowing
To make such a response, the interviewer may use any of the following types of statements:
verbatim playback, paraphrasing, restatement, summarizing, clarifying, and understanding.
Measuring Understanding

Methods of measuring understanding or empathy originated with Carl Rogers's seminal research into the effects of client-centered therapy. This work culminated in a 5-point scoring system.

Each level in this system represents a degree of empathy. The levels range
from a response that bears little or no relationship to the previous statement
to a response that captures the precise meaning and feeling of the
statement.
Level-One Responses
Level-one responses bear little or no relationship to the interviewee's response.

Level-Two Responses
The level-two response communicates a superficial awareness of the meaning of
a statement. The individual who makes a level-two response never quite goes
beyond his or her own limited perspective. Level-two responses impede the flow
of communication.

Level-Three Responses
A level-three response is interchangeable with the interviewee’s statement.
According to Carkhuff and Berenson (1967), level three is the minimum level of
responding that can help the interviewee. Paraphrasing, verbatim playback,
clarification statements, and restatements are all examples of level-three
responses.
Level-Four and Level-Five Responses
Level-four and level-five responses not only provide accurate empathy but also go
beyond the statement given. In a level-four response, the interviewer adds
“noticeably” to the interviewee’s response

Active Listening
An impressive array of research has accumulated to document the power of the
understanding response. This type of responding, sometimes called active
listening, is the foundation of good interviewing
skills for many different types of interviews.
Types of Interviews

Evaluation Interview
A confrontation is a statement that points out a discrepancy or inconsistency. Though
confrontation is usually most appropriate in therapeutic interviews, all experienced
interviewers should have this technique at their disposal.

Carkhuff (1969) distinguished among three types of confrontation:
(1) a discrepancy between what the person is and what he or she wants to become,
(2) a discrepancy between what the person says about himself or herself and what he or she does, and
(3) a discrepancy between the person's perception of himself or herself and the interviewer's experience of the person.

Structured Clinical Interviews


A structured clinical interview is considered best practice for postpartum depression screening. Structured interviews lend themselves to scoring procedures from which norms can be developed and applied. Typically, cutoff scores are used so that a particular score indicates the presence or absence of a given condition.
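A minimal sketch of cutoff-based scoring (the item ratings and the cutoff of 10 are invented for illustration; real instruments publish their own validated cutoffs):

    def screen(item_scores, cutoff=10):
        """Sum structured-interview item ratings and compare to a cutoff."""
        total = sum(item_scores)
        return total, ("positive screen" if total >= cutoff else "negative screen")

    print(screen([2, 1, 3, 2, 3]))  # (11, 'positive screen')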
Case History Interview
To obtain a complete case
history—that is, a biographical sketch—one often needs to ask specific questions. Case
history data may include a chronology of major events in the person’s life, a work
history, a medical history, and a family history. A family history should include a
complete listing of the ages and genders of each member of the immediate family.
One should also note whether any family members—including parents, grandparents,
uncles, aunts, and siblings—have had difficulties similar to those of the interviewee.

Mental Status Examination


An important tool in psychiatric and neurological examinations, the mental status
examination is used primarily to diagnose psychosis, brain damage, and other major
mental health problems. Its purpose is to evaluate a person suspected of having
neurological or emotional problems in terms of variables known to be related to these
problems.
THE INTERVIEW
Assessment interview: one of the most basic techniques employed by the clinical psychologist for the purpose of answering a referral question. If administered skillfully, the assessment interview can provide insight into the problem and inform clinical decision making.

General Characteristics of Interviews

An Interaction. An interview is an interaction between at least two persons. Each participant contributes to the process, and each influences the responses of the other.
A clinical interview is initiated with a goal or set of goals in mind. The interviewer approaches the interaction purposefully, bearing the responsibility for keeping the interview on track and moving toward the goal.

Interviews Versus Tests


The hallmark of psychological testing is the collection of data under standardized conditions by means of explicit procedures. Most interviews, however, make provision for at least some flexibility. Thus, a unique characteristic of the interview method is the wider opportunity it provides for an individualized approach that will be effective in eliciting data from a particular person or patient.
Interviewing Essentials and Techniques:

1. The Physical Arrangements


Two of the most important considerations are privacy and protection from interruptions.

2. Note-Taking and Recording

All contacts with clients ultimately need to be documented. However, there is some debate
over whether notes should be taken during an interview.
A moderate amount of note-taking seems worthwhile.

3. Rapport
Rapport: a word often used to characterize the relationship between patient and clinician. In the context of the clinical interview, building good rapport involves establishing a comfortable atmosphere and sharing an understanding of the purpose of the interview.
4. Communication

Beginning a Session. A brief conversation designed to relax things before plunging into the patient's reasons for coming will usually facilitate a good interview.

Language. Some initial estimate of the patient's background, educational level, or general sophistication should be made. The kind of language employed should then reflect that judgment.
The Use of Questions. Maloney and Ward (1976) observed that the clinician's questions may become progressively more structured as the interview proceeds.
Silence. Silences can mean many things. The important point is to assess the meaning and function of silence in the context of the specific interview. The clinician's response to silence should be reasoned and responsive to the goals of the interview rather than to personal needs or insecurities. (Silence may indicate resistance, or that the client is organizing his or her thoughts.)

Listening. If we are to communicate effectively in the clinician's role, our communication must reflect understanding and acceptance. The skilled clinician is one who has learned when to be an active listener.

Gratification of Self. The clinical interview is not the time or the place for clinicians to work out their own problems. Clinicians must resist the temptation to shift the focus to themselves. Rather, their focus must remain on the patient.
The Impact of the Clinician
Each of us has a characteristic impact on others, both socially and professionally. As a
result, the same behavior in different clinicians is unlikely to provoke the same response
from a patient.

The Clinician's Values and Background. Clinicians must examine their own experiences and seek the bases for their own assumptions before making clinical judgments of others. What to the clinician may appear to be evidence of severe pathology may actually reflect the patient's culture (or gender sensitivity).
The Patient's Frame of Reference
How the patient views the first meeting (or the whole process) is important; it may affect the interview.
A patient may have an entirely distorted notion of the clinic and may even be ashamed of having to seek help. For many individuals, going to see a clinical psychologist arouses feelings of inadequacy.
There are also patients who start with a view of the clinician as a kind of savior.

The Clinician’s Frame of Reference


The clinician should have carefully gone over any existing records on the patient, checked the information provided by the person who arranged the appointment, and so on. "Be prepared."
The clinician should be perfectly clear about the purpose of the interview (the basis of the referral), and must remain focused.
Psychological Assessment
Basic Concepts in Psychometrics
Lesson 1: Basic Concepts
• Psychometrics Defined
• Types of Assessments
• Types of Measurement
• Qualities of Good Assessments
• Types of Psychological Tests
• Aptitude vs. Achievement Tests
• Performance Assessment
• Factors Influencing Performance
What is Psychometrics?
Psychometrics is a field of study that deals specifically with psychological
measurement. This measurement is done through testing.

There are various types of psychometric tests, but most are objective tests
designed to measure educational achievement, knowledge, attitudes, or
personality traits. In addition to the tests themselves, there is another part of
psychometrics that deals with statistical research on the measurements that
psychometric tests are attempting to obtain.

Validity and Reliability


When writing and administering psychometric tests, psychometrists have
to make sure the tests are both valid and reliable. Validity simply means
that the test measures what it's supposed to measure. In other words, the
Stanford-Binet IQ test wouldn't be valid if it measured personality traits or
attitudes rather than intelligence.

Reliability means that the psychometrist will get roughly the same result
from the same person each time the test is administered. In other words,
does the test reliably do what it is designed to do?

Psychometrists and test writers worry a great deal about reliability and
validity, which is why new psychometric tests undergo rigorous trials
and norming periods before they go on the market. Norming refers to a
means of 'testing' the test and developing baseline scores before it's used
to test the general population.
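As an illustrative sketch of what norming buys you (the norm-group scores below are fabricated), a new examinee's raw score can be re-expressed relative to the norm group as a z-score or a percentile rank:

    import statistics

    norm_sample = [85, 92, 78, 95, 88, 90, 70, 82, 98, 91]  # fabricated norming scores
    mu = statistics.mean(norm_sample)
    sigma = statistics.stdev(norm_sample)

    def z_score(raw):
        """Standing relative to the norm group, in standard-deviation units."""
        return (raw - mu) / sigma

    def percentile_rank(raw):
        """Percentage of the norm group scoring at or below the raw score."""
        return 100 * sum(s <= raw for s in norm_sample) / len(norm_sample)

    print(round(z_score(95), 2), percentile_rank(95))  # 0.96 90.0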
Types of Assessments Used in Psychology
Psychological Assessment

Psychological assessment involves tests that are designed to tell a person something about him or herself.

For example, a psychological assessment can tell you about your personality, like whether you are introverted or extroverted or whether you are a dreamer or more levelheaded. Psychological tests can also tell you about your skills and abilities, including your intelligence and what types of things you're good at. They can even point you to a career based on your personality and skills.

If it sounds like psychological assessments can tell you a lot of information, it's because they can! But one assessment is not going to be enough to tell you all of those things. There are many, many different types of psychological tests, and they all serve different purposes.

Let's look closer at some of the general categories of psychological assessments and what they can tell you about yourself.

Projective Tests

Projective tests involve showing images to a person and asking them to interpret the images. Remember the word 'project' to help you remember 'projective tests': it is like the person is projecting themselves onto the image.

You've probably seen examples of projective tests in movies or television. The two most common types of projective tests are the Rorschach inkblot test and the Thematic Apperception Test, also known as the TAT. You know in movies when a psychologist holds up a piece of paper with a random inkblot on it and asks their patient what the patient sees? That's the Rorschach test, and it is typical of a projective test.

One great thing about projective tests is that they can get at unconscious
aspects of a person. Patterns in thoughts and feelings can sometimes be
difficult to see when you are living with them, but through projective
tests, those patterns begin to emerge. For example, if you tend to see
death and destruction when you look at inkblots or the images in the TAT,
you may have an issue with depression. You may not even realize that it's
a part of the way you see the world day in and day out; it's just a part of
you. But the projective tests can get at that part of you in a way other
tests can't.

However, projective tests do have downsides. The most common issue with them is that the results are difficult to interpret. If you see a flower in an inkblot, what does that mean about you? The interpretation of the results is highly subjective. That is, you can say the same thing to two different psychologists and get two different analyses of what it means.

Inventory-Type Tests
To try to make a test that is more standardized and less subjective,
inventories, or inventory-type tests, were developed. These include
surveys that try to measure a person's characteristics or attitudes. They
might include things like true-false questions or questions that ask you to
rate an idea on a scale of one to five.

The most famous psychological inventories are the Minnesota Multiphasic Personality Inventory, or MMPI, and the Myers-Briggs Type Indicator. Both of these try to figure out what a person is like based on how they respond to questions about what they believe or think or feel.

As we mentioned, inventories have the advantage of being more standardized and objective than projective tests. If a person marks true to a statement about family being important, then it's pretty clear that family is important to them!

But inventories have their own problems. One common issue with
inventories is that they depend on a person answering questions about who
they are. Why is this an issue? Some people might lie, either intentionally or
not. For example, a person might not want to mark that family is not very
important to them because they might feel that they would then be judged
as a bad person.

So, while projective tests don't generally have an issue with people lying
about who they are, they are more subjective than inventories. And while
inventories are more objective and standardized, they do face the
possibility that the person taking them won't be completely honest, which
can throw off the results.

Aptitude Tests

So far, we've talked about two types of tests that try to answer the
question of what a person's personality is like. But there are other
questions in psychology, too, including the question, 'What are you good
at?'

Aptitude tests try to measure what you are capable of doing. Tests like the
SAT and ACT are aptitude tests: they want to see if you are capable of
handling college-level work. Aptitude tests often cover general skills, like
problem-solving, critical thinking and perceptual speed, which is just a fancy
way of saying how quickly you react to something. They can also sometimes
cover basic subject-area concepts, like math or English.
There are generally two types of aptitude tests:

1. Speed tests
These types of tests have easier, simpler questions, but they have many more of them. They want to see how many questions you can
answer correctly in the allotted time. The instructions for a speed aptitude
test will be something like, 'You can have 30 minutes to answer as
many questions correctly as you can.' As the name implies, doing well on
a speed test takes, well, speed - the faster you are able to correctly answer
questions, the better you will do.

2. Power tests
These types of aptitude tests have fewer questions than speed tests, but
the questions are more complex. These tests are concerned with how you
are able to figure out how to answer complex questions. The instructions
for a power test might be something along the lines of, 'Take your time and
try to answer each question correctly.' Power tests might also be timed, but
it's not about finishing each question quickly; it's about figuring out how to
get the correct answer.
Types of Measurement: Direct, Indirect, &
Constructs
Measurement

Imagine that you are a psychologist, and you decide to do a study to see if
people with red hair are more temperamental than those with brown or
blonde hair. But, how will you know if your subjects are temperamental? For
that matter, how will you know if they have red hair?

Psychological measurement is the process of assessing psychological traits, like temperament, perceptions, feelings, and thoughts. In our example, whether a person has a short temper or not is a psychological trait that we want to measure so that we will know which of our subjects is quick to anger and which have a slow fuse.

Of course, sometimes in psychology we have to measure non-psychological traits as well. Remember that we want to know about the difference between redheads and brunettes and blondes. Technically, hair color isn't a psychological trait. But, we want to measure it, just like we want to measure temperament, so that we will know which of our subjects have red hair.

Let's look at several types of measurement, including direct observation, indirect observation, and constructs.

Direct Observation

Observing someone's hair color is an example of direct observation, whereby a researcher can look at a person and see the trait they are measuring. Physical characteristics, like hair color, eye color, and body type, are all able to be directly observed.

But, what about temperament? Can we directly observe that? Well, sort
of. We can watch subjects interact with somebody and see who loses
their temper and who keeps calm. This might give us a clue about what
their temperament is like.

Notice, though, that when we observe someone's reaction to others, we are observing their reaction and nothing else. We are not actually observing their temperament; we are seeing their behavior in a situation and making inferences about their temperament.

This is why direct observation can sometimes be tricky in psychological measurement: how do you directly observe things like depression, eating disorders, or schizophrenia? The answer is that you can't. You always have to observe behavior and make inferences based on your observations.
Indirect Observation
Direct observation is a good start when it comes to psychological
measurement. We can look at our subjects and see which ones are
redheads and which aren't. But, what if we see someone with reddish-
brown hair? Is he a brunette or a redhead?

What if we asked him if he considers himself a redhead or a brunette? That would be an example of indirect observation, which is when a psychologist makes an observation based on the observation of another person.

If we ask our subjects to take a survey and check off if they have red hair,
brown hair, or blonde hair, we can observe their checkmarks. We are not
directly observing their hair color but are making assumptions based
on the subjects' own observations.

How do you know what happened during a board meeting of a company? You could look up the minutes of the meeting, which are recorded by someone in the room during the meeting. You aren't in the meeting yourself and therefore aren't directly observing events, but you can take someone else's observations about what happened, which is indirect observation.

Of course, there's a problem with indirect observation, too: how do you know
if someone's being honest? What if we ask our subjects to tell us whether
they have a short temper or not? That's seen as a negative trait, so some
people might not want to answer yes. As a result, they might not be
completely honest.

Constructs
Remember how we tried to directly observe temperament? We
watched our subjects' interactions with others to see who lost their
tempers. But, remember that we said we weren't actually observing
temperament; we were making an inference about temperament based
on the behavior we saw.

The reason we can't directly observe temperament is that it is a construct, or an abstract or theoretical idea that cannot be observed.

The human brain likes things to be quick and easy. We want to be able to
find patterns and communicate them to other human beings. So,
someone who acts angry and aggressive on a regular basis is said to be
short-tempered. This allows us to communicate with others. I can warn
my sister that her new boyfriend is 'short-tempered,' and she'll know
what I mean by that.

Constructs are constructed through direct or indirect observation. If I see my sister's boyfriend getting into fights or yelling at other people, I have directly observed him and might construct the idea that he has a short fuse. If my friend tells me that she has seen him do those things, I have indirectly observed that he is short-tempered.

Many of the traits psychologists are interested in studying are constructs. Whether it's depression or intelligence or prejudice, psychological traits are almost always constructs. Of course, there's no way to measure a construct directly, so we rely on both direct and indirect observation to try to measure them.
Qualities of Good Assessments:
Standardization, Practicality, Reliability &
Validity

Reliability

Reliability is defined as the extent to which an assessment yields consistent information about the knowledge, skills, or abilities being assessed. An assessment is considered reliable if the same results are yielded each time the test is administered.

For example, if we took a test in history today to assess our understanding of World War I and then took another test on World War I next week, we would expect to see similar scores on both tests. This would indicate the assessment was reliable.

Reliability in an assessment is important because assessments provide information about student achievement and progress.

There are many conditions that may impact reliability. They include:
• day-to-day changes in the student, such as energy level, motivation, emotional stress, and even hunger;
• the physical environment, which includes classroom temperature, outside noises, and distractions;
• administration of the assessment, which includes changes in test instructions and differences in how the teacher responds to questions about the test; and
• subjectivity of the test scorer.
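One common way to quantify consistency across administrations is test-retest reliability: the correlation between two sets of scores from the same examinees. A minimal sketch with fabricated scores (statistics.correlation requires Python 3.10+):

    import statistics

    week1 = [78, 85, 62, 90, 71]  # fabricated scores, first administration
    week2 = [80, 83, 65, 92, 69]  # same students retested one week later

    # Test-retest reliability: Pearson correlation between administrations
    r = statistics.correlation(week1, week2)
    print(round(r, 3))  # values near 1.0 indicate consistent scores across time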

Validity

Another quality of a good assessment is validity. Validity refers to the accuracy of the assessment. Specifically, validity addresses the question: does the assessment accurately measure what it is intended to measure?

An assessment can be reliable but not valid. For example, if you weigh
yourself on a scale, the scale should give you an accurate measurement of
your weight. If the scale tells you that you weigh 150 pounds every time you
step on it, it is reliable. However, if you actually weigh 135 pounds, then the
scale is not valid.

Similar to reliability, there are factors that impact the validity of an assessment, including students' reading ability, student self-efficacy, and student test anxiety level.
student test anxiety level.
Standardization

Another quality of a good assessment is standardization. We take many standardized tests in school that are for state or national assessments, but standardization is a good quality to have in classroom assessments as well. Standardization refers to the extent to which the assessment and procedures of administering the assessment are similar, and the assessment is scored similarly for each student.

Standardized assessments have several qualities that make them unique and standard. First, all students taking the particular assessment are given the same instructions and time limit. Second, the assessments contain the same or very similar questions. And third, the assessments are scored, or evaluated, with the same criteria.

Standardization in classroom assessments is beneficial for several reasons. First, standardization reduces the error in scoring, especially when the error is due to subjectivity by the scorer. Second, the more attempts to make the assessment standardized, the higher the reliability will be for that assessment. And finally, the assessment is more equitable as students are assessed under similar conditions.

Practicality
A final quality of a good assessment is practicality. Practicality refers to the extent to which an assessment or assessment procedure is easy to administer and score. Things to consider here are:

• How long will it take to develop and administer the assessment?
• How expensive are the assessment materials?
• How much time will the assessment take away from instruction?
Psychological Tests
Suppose that you are a psychologist. A new client walks into your office
reporting trouble concentrating, fatigue, feelings of guilt, loss of interest in
hobbies and loss of appetite. You automatically think that your client may be
describing symptoms of depression. However, you note that there are
several other disorders that also have similar symptoms. For example, your
client could be describing post-traumatic stress disorder (PTSD), insomnia
or a list of other psychological disorders. There are also some physical
conditions, such as diabetes or congestive heart failure, which could
result in the mental symptoms that your client is reporting.

So, how do you determine which diagnosis, if any, you give your client?
One tool that can help you is a psychological test. These are
instruments used to measure how much of a specific psychological
construct an individual has. Psychological tests are used to assess many
areas, including:

• Traits such as introversion and extroversion
• Certain conditions such as depression and anxiety
• Intelligence, aptitude and achievement such as verbal intelligence and
reading achievement
• Attitudes and feelings such as how individuals feel about the
treatment that they received from their therapists
• Interests such as the careers and activities that a person is interested
in
• Specific abilities, knowledge or skills such as cognitive ability, memory and
problem-solving skills

It is important to note that not everyone can administer a psychological test. Each test has its own requirements that a qualified professional must meet in order for a person to purchase and administer the test to someone else.

Psychological tests provide a way to formally and accurately measure different factors that can contribute to people's problems. Before a psychological test is administered, the individual being tested is usually interviewed. In addition, it is common for more than one psychological test to be administered in certain settings.

Let's look at an example involving a new client. You might decide that the
best way to narrow down your client's diagnosis is to administer the
Beck Depression Inventory (BDI), PTSD Symptom Scale Interview (PSSI)
and an insomnia questionnaire. You may be able to rule out a diagnosis or
two based on the test results. These assessments may be given to your
client in one visit, since they all take less than 20 minutes on average to
complete.
Types and Examples of Psychological Tests

Intelligence tests are used to measure intelligence, or your ability to understand your environment, interact with it, and learn from it. Intelligence tests include:

• Wechsler Adult Intelligence Scale (WAIS)
• Wechsler Intelligence Scale for Children (WISC)
• Stanford-Binet Intelligence Scale (SB)

Personality tests are used to measure personality style and traits. Personality tests are commonly used in research or to assist with clinical diagnoses. Examples of personality tests include:

• Minnesota Multiphasic Personality Inventory (MMPI)
• Thematic Apperception Test (TAT)
• Rorschach, also known as the 'inkblot test'

Attitude tests, such as the Likert Scale or the Thurstone Scale, are used
to measure how an individual feels about a particular event, place,
person or object.
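As a sketch of how a Likert-type attitude scale is typically scored (the items, ratings, and keying below are invented; 1 = strongly disagree through 5 = strongly agree):

    responses = {"q1": 4, "q2": 2, "q3": 5, "q4": 1}   # invented 5-point ratings
    reverse_keyed = {"q2", "q4"}                        # items worded negatively

    def score(item, rating, points=5):
        """Reverse-score negatively worded items so high always means favorable."""
        return (points + 1 - rating) if item in reverse_keyed else rating

    total = sum(score(item, r) for item, r in responses.items())
    print(total)  # 4 + (6-2) + 5 + (6-1) = 18 out of a possible 20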

Achievement tests are used to measure how well you understand a particular topic (i.e., mathematics achievement tests). Aptitude tests are used to measure your abilities in a specific area (i.e., clerical skills).

Achievement tests include:

• Wechsler Individual Achievement Test (WIAT)
• Peabody Individual Achievement Test (PIAT)

Aptitude tests include:
• Bloomberg Aptitude Test (BAT)
• Armed Services Vocational Aptitude Battery (ASVAB)

Neuropsychological tests are used to detect impairments in your cognitive functioning that are thought to be a result of brain damage. For example, if you were to have a stroke, you might have a neuropsychological test to see if there is any resulting cognitive damage (i.e., decreased ability to think due to damage in a brain pathway). One example of a neuropsychological test is the Halstead-Reitan Neuropsychological Test Battery. Other examples include:

• Wisconsin Card Sorting Test (WCST)
• Benton Visual Retention Test (BVRT)

Vocational tests, also referred to as career tests or occupational tests, are used to measure your interests, values, strengths, and weaknesses. This information is then used to determine which careers or occupational settings you are most suited for. Career psychologists and counselors most commonly use vocational assessments to help their clients make decisions about their future educational goals and career choices. Examples of vocational tests include:

• Jackson Vocational Interest Survey (JVIS)
• Strong Interest Inventory (SII)

Direct observation tests are measures in which test takers are observed
as they complete specific activities. It is common for this type of test to be
administered to families in their homes, in a clinical setting such as a
laboratory or in a classroom with children. They include:

• Parent-Child Interaction Assessment-II (PCIA-II)
• MacArthur Story Stem Battery (MSSB)
• Dyadic Parent-child Interaction Coding System-II (DPICS II)

There are also specific clinical tests that measure specific clinical
constructs, such as anxiety or PTSD. Some examples of specific clinical
tests include:

• Beck Depression Inventory (BDI)
• Beck Anxiety Inventory (BAI)
• Hopelessness Scale for Children (HSC)
Aptitude vs. Achievement Tests
Aptitude

There are a few kinds of frequently used tests, and each one of them is
focused on evaluating a different aspect of a person's cognitive abilities.

One of the most common is the aptitude test, which is designed to evaluate a person's ability to learn a skill or subject. That's what aptitude is: natural skill, talent, or capacity to learn.

An aptitude test, therefore, is not testing what you have learned or how well
you have responded to education or training. It's not even an evaluation of
intelligence or intellectual capacity. It's a measurement of a person's ability
to learn a specific subject.

Many training programs, from music and the arts to vocational and technical
programs, rely heavily on aptitude tests. These tests inform the instructors
whether or not a student is likely to succeed based on their personality,
talent, and potential for growth.

Many schools also use aptitude tests to determine how well a student is
likely to perform, which can be very useful in determining the best
educational style for them. Parts of exams like the SAT and GRE are
actually aptitude-based, meant to determine whether a candidate has the
skill and capacity to be successful at the next level of education.

Since aptitude tests are based on natural talent, skill, and ability to learn, you
don't study for them in the way you would other tests. You're not being
tested on knowledge or information you've been taught, but instead on your
capacity for learning new information. This being said, there are ways to
prepare for an aptitude test.

Aptitude can be broadened through routine mental exercise and exposure to skills. Frequent reading, trips to museums and art galleries, and even things like crosswords, puzzles, and vocabulary lists can help children and adults increase their potential for learning.

Achievement

Aptitude tests evaluate a distinct aspect of a person's cognitive abilities. That makes them different from another kind of evaluation: the achievement test. While aptitude is the potential to learn, achievement is learning itself.

An achievement test evaluates the information or skills a student has already learned. This is the sort of testing most students will be familiar with. Instructors use achievement tests to determine whether or not information was successfully learned and retained.

Standardized tests like the SAT II (Subject Tests) are achievement tests,
meant to determine whether students achieved a certain level of
education and mastery of subjects before moving on to college.

While aptitude tests are a matter of natural talent, achievement tests are about learning and retention, so studying is a good idea. Study skills that maximize long-term retention and in-depth understanding are the most successful at helping a student prepare for an achievement test.

In general, short yet frequent study sessions over a length of time (as
opposed to cramming it all at once the night before) result in higher levels of
retention and understanding.
Performance Assessments: Product vs.
Process
Performance assessments are assessments in which students demonstrate
their knowledge and skills in a non-written fashion. These assessments are
focused on demonstration versus written response. Playing a musical
instrument, identifying a chemical in a lab, creating a spreadsheet in computer
class, and giving an oral presentation are just a few examples of performance
assessments.

These types of assessments provide educators with an alternative method to assess students' knowledge and abilities, but they must be used with a specific purpose in mind. Let's discuss four concepts to aid our understanding in choosing appropriate performance-based assessment tasks.

Guidelines for Choosing Appropriate Tasks

The first guideline deals with products versus process. A product is a tangible creation by a student that could take the form of a poster, drawing, invention, etc. Performance assessments are useful in assessing these products in order to gauge a student's level of understanding and ability. For example, asking a student to create an invention for a science class that incorporates Newton's laws of gravity would be a way to assess a product and the student's knowledge of scientific principles.

Sometimes we don't have a product to assess and must assess a process. In situations with no tangible product, teachers must assess the process and the behaviors that students display. Giving an oral presentation, singing a song, or demonstrating a tennis swing are examples of processes that could be assessed.

When assessing a process, teachers may be interested in examining students' cognitive processes as well. Teachers can learn a great deal about a student's thinking by assigning a process task. For example, if a teacher wants to understand the thinking processes behind her students' knowledge of force and acceleration, she might assign students an activity in which they perform experiments to determine how fast objects will roll down an incline. In this example, the teacher would have students make predictions first, then complete the experiments. The student predictions allow the teacher to gauge their understanding of the scientific principles behind the experiment.

When considering using performance assessments, we must also consider individual versus group performance. Teachers have the ability to assign individual or group assessments. Group performance assessments allow teachers to assign complex projects and tasks that are best accomplished by many students. For example, a geography teacher who wants to assess his students' understanding of town planning may assign a project requiring the students to collect data, make maps, predict population growth, etc. Group performance projects also allow students to assess their peers, which provides a different level of assessment for the teacher.

Some performance tasks are relatively short in duration; this is referred to as restricted performance. These are tasks that involve a one-time performance of a particular skill or activity. For example, a PE instructor asks her students to perform a push-up. She wants to assess their form for this one particular exercise.

Alternatively, other performance tasks are extended. When teachers assess extended performance, they want to determine what the students are capable of over long periods of time. This method allows teachers to assess development at certain points over time. It also allows time for feedback and the opportunity for students to edit their work. For example, an English teacher might task the students with putting on a play at the end of a semester. The students may have specific goals to meet throughout the semester, which are also assessed, such as creating an outline, assigning roles, writing the script, and creating props. The play at the end of the semester concludes the process. While extended performance tasks allow for a more thorough assessment of student knowledge, they are very time-consuming to administer.

Another thing to consider is static versus dynamic assessment. Static assessments, which are the most common form of performance and paper-and-pencil assessments, focus on a student's existing abilities and knowledge. Dynamic assessments, on the other hand, systematically examine how a student's knowledge or reasoning may change as a result of learning or performing specific tasks. This concept is consistent with Vygotsky's concept of the Zone of Proximal Development and provides teachers with information about what students are likely to accomplish with appropriate structure and guidance.

Guidelines for Performance Assessments

For all types of performance assessment, the following guidelines should be followed:

1. Tasks should be defined clearly and unambiguously.
2. The teacher must specify the scoring criteria in advance.
3. The teacher must be sure to standardize administration
procedures as much as possible.
4. The teacher should encourage students to ask questions when
tasks are not clear.
Factors Influencing Performance
Controllable Inhibitors

Test-taking anxiety is probably the easiest one to see. It is intense nervousness in an examination situation. Test anxiety is extremely common and for some people never fully goes away.

The sympathetic response increases the heart rate and releases stress hormones into the bloodstream, effectively reducing cognitive performance in the long run. Dealing with test anxiety involves awareness of the anxiety source, countering active stress processes (for example, through deep breathing or stopping negative self-talk), and reducing caffeine and stress.

Cautiousness is excessively close attention and hesitant movement. Similar to those who suffer from test anxiety are people who spend five minutes per question second-guessing themselves, or people who need to complete some task, like putting things in a particular order, and spend most of their time checking and rechecking.

Like with test anxiety, awareness of the anxiety source and countering
the active stress process are good ways of effectively dealing with
cautiousness.

Timed tests, which are examinations stressing speed of answers, and motivation, which is an intrinsic desire to complete the task, can both make these conditions worse. Adding a timed element to a test is a good way to paralyze someone with excessive cautiousness and test anxiety. When it comes to motivation, we have all had tests, assignments, or work that we just don't want to do. Low motivation is a good way to kill a score, because unmotivated people don't do as well.

To counter the stress of timed tests, determine whether the timed element is necessary, because if it isn't, it is only adding stress for no reason. As for motivation, discussing reasons the examination or behavior is necessary is a good start. We all remember thinking that advanced calculus or a personality test wasn't necessary, so helping test takers find intrinsic reasons for wanting to learn is also a good thing.

Uncontrollable Inhibitors

Sometimes things are just beyond the control of the person doing the
examination. Cohort effects are group-wide bonding or understanding. A
group can be anything, such as an entire generation growing up knowing
only the War on Terror or a small town knowing what it is like to live through
a severe drought. When it comes to how this may influence testing results,
entire groups of people may be shifted.
For instance, if you're doing IQ tests and an entire school took an IQ test two months ago, the subjects would be more familiar with the test and thus score higher. Cohort effects can be related to educational effects, which are systemic changes produced by education. People with higher education tend to live healthier and different lives than those with lower education. When it comes to test-takers, people who take a lot of written tests simply know how to answer them better.

If a group of individuals entered a school at the same time, they would be a cohort. What happened to them would be part of their cohort effect. As
their education increased, they would have changes outside their
school, such as their diet, exercise, and lifestyle. When it comes to
performance, these are measurable and broad changes that occur to a
group.

In studying the effects of aging and disease, researchers find that education may also change the way the brain fends off disease. When testing individuals with cognitive diseases, cognitive reserve becomes an issue: resistance to a disease such as Alzheimer's increases or decreases based on lifestyle and education.

In studying the cognitive and behavioral components of people, you will find
that there is more resistance to decline in old age in people with higher
education and more intellectually challenging jobs. The old adage of 'use it
or lose it ' is basically what you need to remember. When it comes to
performance, you need to remember that people who think more tend to
think faster and better.

When studying the elderly and cohorts, we must be aware that as we age there is a certain point at which the individual breaks off from the mean and changes based on his or her own timetable. There appears to be a terminal drop: a drastic decline in cognitive abilities one to five years before death.

It seems that the brain and body begin to wobble and break down and that is
likely part of the reason the person dies soon after. If doing testing,
understanding this is crucial, as it is something beyond the researcher's
control. When it comes to performance, you may come to realize that this is
happening to someone and it will be up to you to discuss it with the family.
Lesson 2: Standardization and Norming
• Standardization
• Norms
• Standardized Assessments in Educational Settings
• Basic Statistics of Score Distribution
• Comparing Scores to a Larger Population
• Mean, Median, and Mode
• Standard Deviation and Bell Curve
• Norm-Referenced vs. Criterion-Referenced Tests
• Bias in Testing
Standardization and Norms of Psychological
Tests
Imagine that you are approaching the finish line of a race. Your heart is
pumping, and you can feel the adrenaline kicking in. As you cross the
finish line, you look up and see that you ran the race in 6 minutes and 43
seconds.

Did you do well on the race or not? How do you answer that question? Is it
based on the time it took you to finish, or based on whether you came in
first or last?

Believe it or not, the question of how well you did in your race is very
similar to the question of how well people do on intelligence tests, which
are meant to measure how much innate ability a person has. There are
many different intelligence tests, which are sometimes called IQ tests.
But there are a few things that they all have in common, including standardization and norms. Let's look closer at those things.

Standardization

Let's rewind for a moment. You're not approaching the finish line for the
race; now you're at the starting line for the race, getting ready. But when
you line up to race, you realize that something is wrong. Everyone has a
different starting point: some people are only a few feet from the finish
line, while others are far, far away. What gives?

Your race does not have standardization, which ensures that everything is the same for all participants. In the case of the race, this means that
every racer must run the same distance. In the case of intelligence tests,
it means that every test-taker must have the same circumstances.

You might be thinking, 'But a test isn't a race. How can it be different?' Think
about this: intelligence tests are being given out to people all over the
world, all the time. Also, there's not one person giving them but many, many
people.

So, imagine that you take an intelligence test that's given to you by Amy, a
nice woman who hands the test over, tells you that you have an hour to
take it, and then walks away. You are left to figure everything out on your
own.

But imagine that your friend takes that same test, but this time it's given by
someone named Rosa. Rosa notices when your friend starts to struggle
with a question, so she gives him a hint. When he really can't get an
answer, she lets him look the answers up online.

What if you score the same as your friend? Does that mean that you are
equally adept? No, because you didn't have standardization. That is, the
test you took was harder than your friend's test, even though it had the
same questions, just by virtue of the fact that you didn't have the same
help that he did.

As you can probably tell, standardization is very important in an


intelligence test and other psychological tests. Making sure that every
single person gets the test under standard conditions ensures that
everyone gets a fair shot at the test.

Norms

You might be wondering, though, why it matters what you got on the test
compared to your friend, anyway. Who cares? Let's go back to the race for
a second. You've just finished in 6:43. How well did you do?

If you're like most people, your answer is along the lines of, 'Well, it
depends on how well the other people did.' After all, 6:43 might mean that
you came in first place, or it might mean that you were four minutes behind
the next-to-last person in the race.

A normative test compares your answers to the answers of others in the same group as you. In your race, your finishing time is probably
compared with those of other racers, who are likely around the same
age and fitness as you are.

What does this have to do with intelligence tests? Answer this: if you get a
100 on an IQ test, how well did you do?

A raw score on an intelligence test doesn't tell you a lot. For some tests,
100 is average. In others, it could be very good or very bad. But the point
is it doesn't tell you how you did until you compare it to others in the same
age group as you.

Norms are results obtained by giving the exam to a sample of people who represent all test-takers. This means that the sample has to
include people of different races, genders, and economic status.

Norms allow you to compare your test scores with others. So, instead of just knowing that you got a 100 on the test, you could also be told that a score of 100 is at the 50th percentile. That tells you that roughly half of the people in the same group as you scored higher than you did, and half scored lower. You are average.

Like standardization, norms allow a person to understand how they measure up in comparison to their peers. This gives you a more complete
picture of what your strengths are.
Standardized Assessments in Educational Settings
Standardized assessments are assessments constructed by
experts and published for use in many different schools and classrooms.

Standardized assessments are very common and can be used for several
purposes.

• They can be used to evaluate a student's understanding and knowledge for a particular area.
• They can be used to evaluate admissions potential for a student in grade
school or college.
• They can be used to assess language and technical skill proficiency.
• Standardized assessments are often used to determine eligibility for psychological services.
• They are even used to evaluate aptitude for a career in the armed forces.

Standardized Assessment Qualities

Standardized assessments have several qualities that make them unique and standard. First, all students taking the particular assessment are given
the same instructions and time limit. Instructional manuals typically
accompany the assessment so teachers or proctors know exactly what to
say. Second, the assessments contain the same or very similar questions.
Third, the assessments are scored, or evaluated, with the same criteria.

Standardized Assessment Types

There are four main types of standardized assessments used by schools. They are achievement assessments, scholastic aptitude and
intelligence assessments, specific aptitude assessments and school
readiness assessments.

Achievement Assessments
Achievement assessments are designed to assess how much students
have learned from classroom instruction. Assessment items typically
reflect common curriculum used throughout schools across the state or
nation. For example, a history assessment might contain items that
focus on national history rather than history distinct to a particular state
or county.
There are advantages to achievement assessments. First, achievement
assessments provide information regarding how much a student has
learned about a subject. These assessments also provide information on
how well students in one classroom compare to other students. They also
provide a way to track student progress over time.

There are some disadvantages to achievement assessments as well. Achievement assessments do not indicate how much a student has
learned for a particular area within the subject. For example, the
assessment may indicate a relative understanding of math, but will not
indicate if a student knows how to use a particular equation taught in the
classroom.

Scholastic Aptitude Assessments

Scholastic aptitude assessments are designed to assess a general capacity to learn and are used to predict future academic achievement. Scholastic
aptitude assessments may assess what a student is presumed to have
learned in the past. These assessments include vocabulary terms
presumably encountered over the years and analogies intended to assess
how well students can recognize similarities among well- known
relationships. The most common type of these assessments is the SAT.

The advantages of scholastic aptitude assessments are that the same test allows for comparison of multiple students across schools and
states. There is also a disadvantage; some students develop test
anxiety and do not perform well on standardized scholastic
assessments, causing the results to be an inaccurate reflection of the
student's actual or potential academic abilities.

Specific Aptitude Assessments


Specific aptitude assessments are designed to predict future ability to
succeed in a particular content domain. Specific aptitude assessments
may be used by school personnel to select students for specific
instructional programs or remediation programs. They may also be
used for counseling students about future educational plans and career
choices. One commonly used assessment to evaluate one's aptitude for
a career in the armed forces is the ASVAB.

With specific aptitude assessments, usually, one's ability to learn in a specific discipline is stable, and therefore, these types of assessments
are an effective way to identify academic tendencies and weaknesses.
The disadvantage, however, is that the use of these assessments
encourages specific skill development in a few areas, as opposed to
encouraging the development of skills in a wide range of academic
disciplines and abilities.

School Readiness Assessments


Finally, we have the school readiness assessment. School readiness
assessments are designed to assess cognitive skills important for success
in a typical kindergarten or first grade curriculum. These assessments
are typically given six months before a child enters school.

The advantage with these assessments is that they provide information regarding developmental delays that need to be addressed immediately.
There are disadvantages to school readiness assessments as well.
First, the evaluation has been found to have low correlation with the
student's actual academic performance beyond the first few months of
school. Second, school readiness assessments usually only evaluate
cognitive development. However, social and emotional development are
critical to one's success in kindergarten and first grade.

Choosing Standardized Assessments

There are guidelines and considerations for choosing standardized assessments:

1. The school should choose an assessment that has high validity for the particular purpose of testing. That is, if the school wants to assess its students' science comprehension, it should choose an assessment that evaluates science knowledge and skills.

2. The school should make sure the group of students used to 'norm' the assessment is similar to the population of the school.

3. The school should take the student's age and developmental level into account before administering any standardized assessments.

Formative and Summative Assessments

There are other forms of assessment that are given at different times. These are referred to as formative and summative assessments.

Formative assessments are ongoing assessments, reviews, and observations used to evaluate students in the classroom. Teachers use
these types of assessments to continually improve instructional methods
and curriculum. Student feedback is also a type of formative assessment.
Examples include quizzes, lab reports and chapter exams.

Summative assessments are used to evaluate the effectiveness of instruction at the end of an academic year or the end of a class.
Summative assessments allow educators to evaluate student
comprehensive competency and final achievement in the subject or
discipline. Examples of these include final exams, statewide tests and
national tests.
Understanding Basic Statistics of Score
Distribution

Raw Score

A raw score is the score based solely on the number of correctly answered items on the assessment. This raw score will tell you how
many questions the student got right, but just the score itself won't tell
you much more. Let's now move onto how scores can be used to
compare one student's results to the results of other students.

Normal Distribution

Test scores typically fall along a normal distribution. A normal distribution is a pattern of educational characteristics or scores in which most scores
lie in the middle range and only a few lie at either extreme. To put it
simply, some scores will be low and some will be high, but most scores
will be moderate.

The normal distribution shows two things:

1. The variability or spread of the scores.

2. The midpoint of the normal distribution. This midpoint is found by calculating the mean of all of the scores, or, in other words, the mathematical average of a set of scores.

For example, if we had the following raw scores from your classroom - 57, 76, 89, 92, and 95 - the variability would range from 57 as the low score to 95 as the high score. Plotting these scores along a normal distribution would show us both the variability and the midpoint of the distribution.
Standard Deviation

The normal distribution curve helps us find the standard deviation of the
scores. Standard deviation is a useful measure of variability. It measures
the average deviation from the mean in standard units. Deviation, in this
case, is defined as the amount an assessment score differs from a fixed
value, such as the mean.

The mean and standard deviation can be used to divide the normal
distribution into several parts. The vertical line at the middle of the curve
shows the mean, and the lines to either side reflect the standard deviation.
A small standard deviation tells us that the scores are close together, and
a large number tells us that they are spread apart more. For example, a
set of classroom tests with a standard deviation of 10 tells us that the
individual scores were more similar than a set of classroom tests with a
standard deviation of 35.

In statistics, there is a rule called the 68-95-99.7 rule. This rule states that for a normal distribution, almost all values lie within one, two, or three standard deviations from the mean. Specifically, approximately 68% of all values lie within one standard deviation of the mean, approximately 95% lie within two standard deviations, and approximately 99.7% lie within three standard deviations.
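To see where the 68-95-99.7 figures come from, here is a quick sketch using Python's standard library class statistics.NormalDist (the variable names are just illustrative):

    from statistics import NormalDist

    nd = NormalDist(mu=0, sigma=1)  # standard normal: mean 0, SD 1

    # Fraction of values within k standard deviations of the mean.
    for k in (1, 2, 3):
        share = nd.cdf(k) - nd.cdf(-k)
        print(f"within {k} SD: {share:.1%}")

    # within 1 SD: 68.3%
    # within 2 SD: 95.4%
    # within 3 SD: 99.7%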
Comparing Test Scores to a Larger
Population

Standard Score, Stanines and Z-Score

A common method to transform raw scores (the score based solely on the number of correctly answered items on an assessment) in order to
make them more comparable to a larger population is to use a standard
score. A standard score is the score that indicates how far a student's
performance is from the mean with respect to standard deviation units.

In another lesson, we learned that standard deviation measures the average deviation from the mean in standard units. Deviation is defined as
the amount an assessment score differs from a fixed value. The standard
score is calculated by subtracting the mean from the raw score and
dividing by standard deviation.

There are two types of standard scores: stanine and Z-score.

Stanines are used to represent standardized test results by ranking student performance on an equal-interval scale of 1-9. A ranking of 5
is average, 6 is slightly above average and 4 is slightly below average.
Stanines have a mean of 5 and a standard deviation of 2.

Z-scores are used frequently by statisticians and have a mean of 0 and a standard deviation of 1. A Z-score tells us how many standard deviations
someone is above or below the mean.

To calculate a Z-score, subtract the mean from the raw score and divide by the standard deviation. For example, if we have a raw score of 85, a mean of 50, and a standard deviation of 10, we calculate a Z-score of (85 - 50) / 10 = 3.5.
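Here is a minimal Python sketch of both kinds of standard score, using the example numbers above. The stanine formula (stanine ≈ 2z + 5, rounded and clipped to 1-9) is one common convention that follows from stanines having a mean of 5 and an SD of 2, so treat it as an assumption rather than a universal rule:

    def z_score(raw, mean, sd):
        # How many standard deviations a raw score sits from the mean.
        return (raw - mean) / sd

    def stanine(z):
        # Assumed mapping: stanines have mean 5 and SD 2, clipped to 1-9.
        return max(1, min(9, round(2 * z + 5)))

    z = z_score(85, 50, 10)
    print(z)           # 3.5
    print(stanine(z))  # 9 (2 * 3.5 + 5 = 12, clipped to the top of the scale)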

Cumulative Percentage and Percentile Rank

Another method to convert a raw score into a meaningful comparison is through percentile ranks and cumulative percentages.

Percentile rank scores indicate the percentage of peers in the norm group with raw scores less than or equal to a specific student's raw
score. In this lesson, 'norm group' is defined as a reference group that
is used to compare one score against similar others' scores.
Cumulative percentages determine placement among a group of
scores. Cumulative percentages do not determine how much greater
one score is than another or how much less it is than another.
Cumulative percentages are ranked on an ordinal scale and are used
to determine order or rank only. Specifically, this means that the highest score in the group will be the top rank no matter what that score is.

For example, let's take a test score of 85, the raw score. If 85 were the
highest grade on this test, the cumulative percentage would be 100%.
Since the student scored at the 100th percentile, she did better than or the
same as everyone else in the class. That would mean that everyone else
made either an 85 or lower on the test.

Cumulative percentages and percentiles are ranked on a scale of 0%-100%. Changing raw scores to cumulative percentages is one way to standardize raw scores within a certain population.
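A percentile rank is easy to compute directly from its definition. This sketch uses a made-up norm group for illustration:

    def percentile_rank(score, norm_group):
        # Percent of scores in the norm group less than or equal to `score`.
        at_or_below = sum(1 for s in norm_group if s <= score)
        return 100 * at_or_below / len(norm_group)

    scores = [62, 70, 75, 80, 85]        # hypothetical norm group
    print(percentile_rank(85, scores))   # 100.0 -- the top score
    print(percentile_rank(75, scores))   # 60.0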
Using Mean, Median, and Mode for
Assessment

The Mean

Imagine you teach a class with 20 students, and they take a test with 20
multiple choice questions. Imagine that the grades you get back from
scoring their tests look like this:

Student #1: 20    Student #11: 10
Student #2: 17    Student #12: 10
Student #3: 16    Student #13: 10
Student #4: 14    Student #14: 8
Student #5: 14    Student #15: 8
Student #6: 12    Student #16: 8
Student #7: 12    Student #17: 6
Student #8: 12    Student #18: 6
Student #9: 10    Student #19: 4
Student #10: 10   Student #20: 3

So you can see we have student #1 through student #20, and you can see that the scores range quite a bit. Looking at these scores, you can see that one student, student #1, got a perfect score of 20 out of 20. Many students got scores somewhere in the middle, with five students getting half of the questions right - a score of 10 out of 20. A few of the students did pretty well, only missing a few questions, while a few students did pretty badly but at least got a few of the questions right. How can we be more precise in analyzing these scores using statistics?

Let's say the principal of your school wants a quick summary of how
your students did on their test. How would you summarize the results?
The most common type of statistic, in either the context of the
classroom assessment or in laboratory research projects, is
statistics of summary. There are various types of statistics of
summary, but in general their purpose is to quickly give a general
impression of the overall trend in results. So, just like you'd guess,
based on the term statistics of summary, these statistics just give you a
ballpark idea of what happened on the test. Let's go over three different
types of summary statistics.

The most well-known statistic of summary is called the mean, which is the term we use for the arithmetic average score. When most people use
the term 'average score,' what they're really referring to, technically, is what
we call the mean. How do we calculate the mean? We simply add up all of the individual results, get the total, and then divide by the number of students in the class. In our example, if you add up the scores of 20 + 17 + 16 and so on through all 20 students, you get a total score of 210. You divide 210 by 20 (the number of students), and you get a mean of 10.5. You can see that this score, 10.5, is a pretty representative middle score for this class, so it works nicely as a summary.
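The same arithmetic in Python, using the 20 scores from the table:

    scores = [20, 17, 16, 14, 14, 12, 12, 12, 10, 10,
              10, 10, 10, 8, 8, 8, 6, 6, 4, 3]

    total = sum(scores)          # 210
    mean = total / len(scores)   # 210 / 20 = 10.5
    print(total, mean)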

The Median

A different statistic of summary is called the median. A median is simply the score that falls exactly in the middle, such that half of the
people had higher scores, and half of the people had lower scores. To
find the median, you don't have to do any actual math. All you have to do
is put the scores in numerical order from highest to lowest, find out how
many people are in each half (so in this example, it's 10 in the top half and
10 in the bottom half), and count down until you're at the 10th score. The
10th score here is the student who got 10 questions right out of 20. So you
can use the median just as a nice way of knowing where the middle
student fell.

Why would you use the median instead of the mean? In this example, the two scores are pretty similar (a mean of 10.5 versus a median of 10). So here, it doesn't really make any difference which one you would pick. The
difference between the mean and the median only really matters if you have
extreme scores on one end or the other. Let's say you were curious to know
how many state capitals the children in your kindergarten class knew. Let's
say you have five kids in the class, and maybe you get scores like these:

Child 1: 1 capital

Child 2: 2 capitals

Child 3: 3 capitals

Child 4: 3 capitals

Child 5: 47 capitals

So child #1 only knew one capital, child 2 knew two capitals, and that's
basically average until you get to child 5, who actually knew 47 capitals. Here,
one of the scores (the child who knows 47 capitals) is extremely different
from the rest of the scores. When a score is extremely different from the rest
of the scores in a distribution, that score is called an outlier. If an outlier
exists in your data, it will have a huge effect on the mean. Here, the mean
would be:
(1 + 2 + 3 + 3 + 47) / 5 = 56 / 5 = 11.2

So the mean of 11.2 is not a very good representation of the actual average number of state capitals kindergartners in your class knew. This
number makes it look like they are much better at naming capitals than
they really are. So in the case of outliers, it's better to use the median. In
our example, the median is, again, just the middle number - so the median
here is 3. The number 3 is a much better representation of the basic level
of the class on this particular task. So in summary, use a median if you
have outliers, because the mean might not be a good summary number in
this case.
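The standard library makes the outlier effect easy to see with the kindergarten example:

    from statistics import mean, median

    capitals = [1, 2, 3, 3, 47]   # child 5 is the outlier
    print(mean(capitals))         # 11.2 -- dragged up by the outlier
    print(median(capitals))       # 3 -- a better summary here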

The Mode

The third and final statistic of summary is called the mode, which is simply the score obtained by the most people in the group. Let's go back to our original example of scores on the 20-question classroom test. When you look at the test scores again, what is the
most common score? The answer here is the score of 10. Five students
got that score, so the mode in our example is the score of 10. Again, in
this particular example, the mode is similar to both the mean and the
median.

So why would you use the mode instead of the mean or median? Usually
the mode is used for examples when scores are not in numerical form.
Remember, the mode is telling you what the most common answer is. So
modes are good when the data involved are categorical instead of
numerical.

Think about baseball teams. Who won the World Series last year? Do
you know the team that's won the World Series the most often ever
since it began? The answer is the New York Yankees. So it's accurate to
say that the mode team for winning the World Series is the Yankees,
because it's the most common answer.

Let's go over one more example. When you get a new car, your car insurance price is based on a lot of things, like your gender and age. It is often claimed that you pay more for insurance if you drive a red car. Why would that be? The idea is that the mode color of car that gets into accidents is red: in other words, that red cars get in more accidents than any other color, making red the mode car-accident color. It wouldn't make sense to try to use a mean or a median when talking about colors of cars, because there aren't any numbers involved. So for categories like colors or baseball teams, you have to use the mode if you want to create a statistic of summary.
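Python's statistics.mode handles both cases: numeric scores and categorical data like colors (the color list below is made up for illustration):

    from statistics import mode

    scores = [20, 17, 16, 14, 14, 12, 12, 12, 10, 10,
              10, 10, 10, 8, 8, 8, 6, 6, 4, 3]
    print(mode(scores))   # 10 -- five students got this score

    # The mode also works where mean and median cannot: categories.
    accident_colors = ["red", "blue", "red", "silver", "red", "black"]
    print(mode(accident_colors))  # 'red'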
Using Standard Deviation and Bell Curves for
Assessment

Standard Deviation

Let's use the students' scores again from the earlier table. Remember, there are 20 students in the example.

Now you want to know the basic variability within the classroom. So, did the
students' scores kind of clump up all together, meaning the students all
showed about the same amount of knowledge? Or did the scores vary
widely from each other, meaning some students did great whereas other
students failed the test?

The answer to this question can come very precisely from the standard deviation, which is a measurement that indicates how much a group of scores varies from the average.

Calculating Standard Deviation


The standard deviation calculation looks complicated at first, but it's really
quite simple if we take it step by step.

1. We'll start by finding the mean, or average, of all the scores. To do this,
we add up all the scores and divide by the total number of scores. This
gives us a mean of 10.5.

2. The next step is to take each score, subtract the mean from it, and square the difference. For example, looking at the top score of 20, we subtract 10.5 and then square the difference to get 90.25. We repeat this process for each score.
3. Now, we add up all our squared differences and divide by the total
number of scores. This gives us 353/20, or 17.65.

4. The final step is to take the square root of this number, which is 4.2.
This is the standard deviation of the scores.
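Here are the four steps in Python, checked against the standard library's population standard deviation:

    from math import sqrt
    from statistics import pstdev

    scores = [20, 17, 16, 14, 14, 12, 12, 12, 10, 10,
              10, 10, 10, 8, 8, 8, 6, 6, 4, 3]

    mean = sum(scores) / len(scores)                   # step 1: 10.5
    squared_diffs = [(s - mean) ** 2 for s in scores]  # step 2
    variance = sum(squared_diffs) / len(scores)        # step 3: 353 / 20 = 17.65
    sd = sqrt(variance)                                # step 4: about 4.2
    print(round(sd, 2))                                # 4.2
    print(round(pstdev(scores), 2))                    # 4.2 -- same answer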

Why Standard Deviation Matters

Now that we have our standard deviation of 4.2, what does that mean? Well, it just gives us an idea of how much the scores on the test clumped together. To understand this better, compare two possible distributions of scores.

In the first, the scores are all very similar to each other. Because the scores are all close together, the standard deviation is going to be very small. In the second, the scores are all pretty different from each other (lots of high scores on the test, but also lots of failing grades). For that distribution, we'd have a high number for our standard deviation.

So, why do we care about standard deviation at all? Well, a teacher would want to know this information because it might change how he or
she teaches the material or how he or she constructs the test. Let's say
that there's a small standard deviation because all of the scores
clustered together right around the top, meaning almost all of the
students got an A on the test. That would mean that the students all
demonstrated mastery of the material. Or, it could mean that the test
was just too easy! You could also get a small standard deviation if all of
the scores clumped together on the other end, meaning most of the
students failed the test. Again, this could be because the teacher did a
bad job explaining the material or it could mean that the test is too
difficult.

Most teachers want to get a relatively large standard deviation because it means that the scores on the test varied across the grade range. This
would indicate that a few students did really well, a few students failed, and
a lot of the students were somewhere in the middle. When you have a
large standard deviation, it usually means that the students got all the
different possible grades (like As, Bs, Cs, Ds, and Fs). So, the teacher can
know that he or she taught the material correctly (because at least some of the students got an A) and the test was neither too difficult nor too easy.
Normal Distribution and the Bell Curve

Let's plot the test scores in a graph. The x-axis is for the score received, and the y-axis is for the number of students who got that score. Still using our same example of 20 students who took a test with 20 questions, a clear pattern shows up on the graph:

There's a big bump in the middle, showing the five students who got the
middle score of 10. Then, the graph tapers off on each side, indicating that
fewer students got very high or very low scores. The shape of this
distribution, a large rounded peak tapering away at each end, is called a bell
curve.

Remember that we said that most teachers will want their students' scores
to look kind of like what we see here. We had a lot of scores that fell in the
middle (indicated by the big bump), which might be like a letter grade of a
C. We had a few students who did really well (which might be like the
grade of A) and a few students who did poorly (in other words, they got
an F). When you have a bell curve that looks like this one, with a bump in
the middle and little ends on each side, you know you have a normal
distribution. A normal distribution has this bell shape and is called
normal because it's the most common distribution that teachers see in a
classroom.
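Even without graphing software, a crude text histogram of the 20 scores shows the bell shape, with the bump at the score of 10:

    from collections import Counter

    scores = [20, 17, 16, 14, 14, 12, 12, 12, 10, 10,
              10, 10, 10, 8, 8, 8, 6, 6, 4, 3]

    # One row per distinct score; the stars are the count of students.
    for score, count in sorted(Counter(scores).items()):
        print(f"{score:>2}: {'*' * count}")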
Skewed Distribution

When a distribution is not normal and is instead weighted heavily toward one side, it's called a skewed distribution.

Imagine that most of the students got an A on the test. What would that distribution look like? The bump would fall along the right side of the graph (where the higher scores are), tapering off only on the left side, showing that most students got high scores and only a few got low scores. This is what you call a negatively skewed distribution.

The exact opposite would be true if most students got an F: the bump would fall on the left, with the tail stretching to the right. This is a positively skewed distribution.
Types of Tests:
Norm-Referenced vs. Criterion-Referenced

Norm-Referenced

A norm-referenced test scores a test by comparing a person's performance to others who are similar. You can remember norm-
referenced by thinking of the word 'normal.' The object of a norm-
referenced test is to compare a person's performance to what is normal
for other people like him or her.

Think of it kind of like a race. If a runner comes in third in a race, that doesn't tell us anything objectively about what the runner did. We don't
know if she finished in 30 seconds or 30 minutes; we only know that she
finished after two other runners and ahead of everyone else.

There are three types of norm-referenced scores. The first is age or grade equivalent. These scores compare students by age or grade.
Breaking this type down, we can see that age equivalent scores indicate
the approximate age level of students to whom an individual student's
performance is most similar, and grade equivalent scores indicate the
approximate grade level of students to whom an individual student's
performance is most similar.

These scores are useful when explaining assessment results to parents or people unfamiliar with standard scores. For example, let's look at your
child's raw score on a recent math standardized assessment. Looking at
the chart, we see that your child's raw score of 56 places him at an 8th
grade level and an approximate age of 13.

The potential disadvantage of using age or grade equivalent scores is that parents and some educators misinterpret the scores, especially
when scores indicate the student is below expected age or grade level.
The second type of norm-referenced scoring is percentile rank. These
scores indicate the percentage of peers in the norm group with raw
scores less than or equal to a specific student's raw score.

Percentile rank scores can sometimes overestimate differences of students with scores that fall near the mean of the normed group and
underestimate differences of students with scores that fall in the
extreme lower or upper range of the scores.

For example, let's look at your child's percentile score on a recent math standardized assessment. His percentile rank is 55. This means that he scored as well as or better than 55% of other students taking the same assessment.

The final type of norm-referenced scoring is standard score. These scores indicate how far a student's performance is from the mean with
respect to the normal distribution of scores (also referred to as standard
deviation units). While these scores are useful when describing a
student's performance compared to a larger group, they might be
confusing to understand without a basic knowledge of statistics
- which is covered in another lesson.

We see here from your son's score that he falls about one standard
deviation away from the mean (the average scores of the population
that took the same assessment). This information tells us that his score
is slightly above the scores of the other students.

Norm-referenced tests are a good way to compensate for any mistakes that might be made in designing the measurement tool. For example, what if the math test is too easy, and everybody aces it? If it is a norm-referenced
test, that's OK because you're not looking at the actual scores of the
students but how well they did in relation to students in the same age
group, grade, or class.

Criterion-Referenced

But norm-referenced tests aren't perfect. They aren't completely objective and make it hard to know anything other than how someone did in
comparison to others. But, what if we want to know about a person's
performance without comparing them to others?

A criterion-referenced test is scored on an absolute scale with no comparisons made. It is interested in one thing only: did you meet the
standards?

Let's go back to our race scenario. Saying that a runner came in third
place is norm-referenced because we are comparing her to the other
runners in the race. But, if we look at her time in the race, that's criterion-
referenced. Saying she finished the race in 58:42 is an objective measure
that is not a comparison to others.

Tests that are pass-fail are criterion-referenced, as are many tests for certifications. Any test where there's a certain score that you have to achieve to pass is criterion-referenced. So, for example, you could say that students have to get a 70% on a test to pass, which would make it a criterion-referenced test.
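The two scoring philosophies are easy to contrast in code. This is just a sketch with a made-up cutoff and made-up classmates' scores:

    def criterion_referenced(score, cutoff=70):
        # Pass/fail against an absolute standard -- no comparison to others.
        return "pass" if score >= cutoff else "fail"

    def norm_referenced(score, norm_group):
        # Standing relative to the group: percent scoring at or below you.
        return 100 * sum(1 for s in norm_group if s <= score) / len(norm_group)

    group = [55, 63, 68, 72, 80, 91]     # hypothetical classmates
    print(criterion_referenced(68))      # fail (below the 70 cutoff)
    print(norm_referenced(68, group))    # 50.0 -- middle of the group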

As we mentioned, criterion-referenced tests are good for giving an objective picture of how a person does. They are often seen as more fair than norm-referenced tests because how well the other people in
the group do on the test doesn't affect your score.

But it is difficult to measure certain things, and criterion-referenced tests run the risk of not giving a good picture of what people can and cannot do.
If the test is too difficult and the whole class fails, a criterion-referenced
score will not take into account that the test was difficult.

The potential drawback of criterion-referenced scores is that complex skills are difficult to assess through a single score on an assessment.

So, in short, both criterion-referenced and norm-referenced tests have positives and negatives. A researcher has to decide which is better for
their study: a measurement tool that offers information about how people
do in relation to others or one that looks at non-comparative data of how
students do.
Testing Bias, Cultural Bias, & Language
Differences in Assessments
A test that yields clear and systematic differences among the results of
the test-takers is biased. Typically, test biases are based on group
membership of the test-takers, such as gender, race and ethnicity.

Cultural Bias

A test is not considered biased simply because some students score higher than others. A test is considered biased when the scores of one group are significantly different from another group's and the test has higher predictive validity, which is the extent to which a score on an assessment predicts future performance, for one group than for the other.

Most test biases are considered cultural bias. Cultural bias is the
extent to which a test offends or penalizes some students based on
their ethnicity, gender or socioeconomic status.

Types of Test Bias

Researchers have identified multiple types of test bias that affect the
accuracy and usability of the test results.

Construct Bias

Construct bias occurs when the construct measured yields significantly different results for test-takers from the original culture for
which the test was developed and test-takers from a new culture. A
construct refers to an internal trait that cannot be directly observed but
must be inferred from consistent behavior observed in people. Self-
esteem, intelligence and motivation are all examples of a construct.

Basing an intelligence test on items from American culture would create bias against test-takers from another culture.

Method Bias

Another type of testing bias is method bias. Method bias refers to
results.
The testing environment, length of the test, and assistance provided by the teacher administering the test are all factors that may lead to method
bias. For example, if a student from one culture is used to, and expects to,
receive assistance on standardized tests, but is faced with a situation in
which the teacher is unable to provide any guidance, this may lead to
inaccurate test results.

Additionally, if the test-taker is used to a more relaxed testing environment, such as one that includes moving around the room
freely and taking breaks, then an American style of standardized
testing administration, where students are expected to sit quietly and
work until completion, is likely to cause difficulty in performance. Again,
this could yield results that may be an inaccurate representation of that
student's knowledge.

Item Bias

Item bias refers to problems that occur with individual items on the assessment. These biases may occur because of poor use of grammar, choice of cultural phrases, and poorly written assessment items.

For example, the use of phrases, such as 'bury the hatchet' to indicate
making peace with someone or 'the last straw' to indicate the thing that
makes one lose control, in test items would be difficult for a test-taker
from a different culture to interpret. The incorrect interpretation of
culturally biased phrases within test items would lead to inaccurate test
results.
Language Differences and Test Bias

In addition to biases within the test itself, language differences also affect performance on standardized testing, which causes bias against
non-native English test-takers. Non-native English test-takers may
struggle with reading comprehension, which hinders their ability to
understand questions and answers. They may also struggle with writing
samples, which are intended to assess writing ability and levels.

Bias and Test-Taker Differences

Biases in testing also occur due to social, cognitive, and behavioral differences among test-takers. Test-takers with cognitive or academic difficulties will often:

• Have poor listening, reading and writing skills
• Perform inconsistently on tests due to off-task behaviors, such as
daydreaming and doodling
• Have higher than average test anxiety

Test-takers with social or behavioral difficulties will often:

• Perform inconsistently on tests due to off-task behaviors
• Have lower than average motivation for testing

Test-takers with delays in cognitive processing will often:

• Have slow learning and cognitive processing
• Have limited reading and writing skills
• Have poor listening skills

Finally, test-takers with physical or sensory challenges will often:

• Have a tendency to get tired during a test
• Have less developed language skills
• Have poor listening skills
• Have slower learning and cognitive processing skills
Psychometricians should account for individual differences among test-
takers when administering tests and using results to predict future
performance and success.
