
Introduction to Psychological Testing
Defining Characteristics
A psychological test is anything that requires an individual to perform a behavior for the purpose of measuring some attribute, trait, or characteristic, or of predicting an outcome.

All good psychological tests have three characteristics in common:
Defining Characteristics (cont.)
Tests representatively sample the behaviors thought to measure an attribute or thought to predict
an outcome.

E.g., suppose we are interested in developing a test to measure your physical ability.

One option would be to evaluate performance in every sport you have ever played.

Another option would be to have you run the 50-meter dash.

Both of these options have drawbacks.


The first option would be very precise, but not very practical. Can you imagine how much time and
energy it would take to review how you performed in every sport you have ever played?
The second option is too narrow and unrepresentative. How fast you run the 50-meter dash does not
tell us much about your physical ability in general.
A better method would be to take a representative sample of performance in sports. E.g., we might require you to participate in some individual sports (e.g., running, tennis, gymnastics) and team sports (e.g., football, basketball) that involve different types of physical abilities (e.g., strength, endurance, precision). This option would include a more representative sample.
Defining Characteristics (cont.)

All good psychological tests include behavior samples that are obtained under standardized conditions. A test must be administered in the same way to all test takers, controlling the various factors that can affect the score besides the characteristic, attribute, or trait being measured.

Factors related to the environment (e.g., room temperature, lighting), the examiner (e.g., examiner attitude, how the instructions are read), the examinee (e.g., illness, fatigue), and the test (e.g., understandability of questions) can all affect your score.
Defining Characteristics (cont.)

If everyone is tested under the same conditions (e.g., the same environment), we can be more confident that
these factors will affect all test takers similarly. If all of
these factors affect test takers similarly, we can be
more certain that a person’s test score accurately
reflects the attribute being measured. Although it is
possible for test developers to standardize factors
related to the environment, the examiner, and the test,
it is difficult to standardize examinee factors. E.g., test
developers have little control over what test takers do
the night before they take a test.
Defining Characteristics (cont.)
All good psychological tests have established rules for scoring.
These rules ensure that all examiners will score the same set of
responses in the same way. E.g., teachers might award 1 point for
each multiple-choice question you answer correctly, and they
might award or deduct points based on what you include in your
response to an essay question. Teachers might then report your
overall exam score either as the number correct or as a
percentage of the number correct (the number of correct
answers divided by the total number of questions on the test).
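As a minimal illustration of such a scoring rule (the items and answers below are hypothetical, not from any actual exam), this short Python sketch counts the number correct and converts it to a percentage:

# Minimal sketch (hypothetical data): applying an explicit scoring rule
# and reporting the result as number correct and as a percentage.
answer_key = ["b", "d", "a", "c", "a"]   # correct options for 5 multiple-choice items
responses  = ["b", "d", "c", "c", "a"]   # one examinee's answers

num_correct = sum(1 for given, key in zip(responses, answer_key) if given == key)
percent_correct = 100 * num_correct / len(answer_key)

print(num_correct, percent_correct)      # 4 items correct -> 80.0%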
Defining Characteristics (cont.)
Although all psychological tests have these characteristics, not all
exhibit these characteristics to the same degree. For example,
some tests may include a more representative sample of
behavior than do others. Some tests, such as group-administered
tests, may be more conducive to administration under
standardized conditions than are individually administered tests.
Some tests may have well-defined rules for scoring, and others
might have general guidelines. Some tests may have very explicit
scoring rules, for example, “If Question 1 is marked true, then
deduct 2 points.” Other tests, such as those that include short
answers, may have less explicit rules for scoring, for example,
“Award 1 point for each concept noted and defined.”
Assumptions of Psychological Tests

1. Psychological tests measure what they purport to measure or predict what they are intended to predict.

2. An individual's behavior, and therefore test scores, will typically remain stable over time.

3. Individuals understand test items similarly.

4. Individuals will report accurately about themselves (for example, about their personalities, about their likes and dislikes, etc.).

5. Individuals will report their thoughts and feelings honestly.

6. The test score an individual receives is equal to his or her true ability plus some error, and this error may be attributable to the test itself, the examiner, the examinee, or the environment.
1. Psychological tests measure what they purport to
measure or predict what they are intended to predict.

In addition, any conclusions or inferences that are drawn about the test takers based on their test scores must be appropriate. This is also called test validity. If a test is designed to measure mechanical ability, we must assume that it does indeed measure mechanical ability. If a test is designed to predict performance on the job, then we must assume that it does indeed predict performance. This assumption must be based on a review of the test's validity data.
2. An individual’s behavior, and therefore test
scores, will typically remain stable over time.
This is also called test-retest reliability. If we administer a test at one point in time and then administer it again at a different point in time (for example, two weeks later), we must assume, depending on what we are measuring, that an individual will receive a similar score at both points in time. If we are measuring a relatively stable trait, this assumption matters a great deal. However, there are some attributes, such as mood, that are not expected to show high test-retest reliability.
3. Individuals understand test items similarly
(Wiggins, 1973).
For example, when asked to respond true or
false to a test item
Such as “I am almost always healthy,”
we must assume that all test takers understand
and interpret “almost always” similarly.
4. Individuals will report accurately about themselves
(for example, about their personalities, about their likes
& dislikes, etc.).
When we ask people to remember something or to tell
us how they feel about something, we must assume
that they will remember accurately and that they have
the ability to assess and report accurately on their
thoughts and feelings. For example, if we ask you to tell
us whether you agree or disagree with the statement “I
have always liked cats,” you must remember not only
how you feel about cats now but also how you felt
about cats previously.
5. Individuals will report their thoughts and
feelings honestly.
Even if people are able to report correctly about themselves,
they may choose not to do so. Sometimes people respond how
they think the tester wants them to respond, or they lie so that
the outcome benefits them. E.g., if we ask test takers whether
they have ever taken a vacation, they may tell us that they have
even if they really have not. Why? Because we expect most
individuals to occasionally take vacations, and therefore the test
takers think we would expect most individuals to answer yes to
this question. Criminals may respond to test questions in a way
that makes them appear neurotic or psychotic so that they can
claim they were insane when they committed crimes. When
people report about themselves, we must assume that they will
report their thoughts and feelings honestly, or we must build
validity checks into the test.
6. The test score an individual receives is equal to his or her true ability plus some error, and this error may
be attributable to the test itself, the examiner, the examinee, or the environment.

That is, a test taker’s score may reflect not only the
attribute being measured but also things such as
awkward question wording, errors in administration of
the test, examinee fatigue, and the temperature of the
room in which the test was taken. When evaluating an
individual’s score, we must assume that it will include
some error.
6. The test score an individual receives is equal to his or her true ability plus
some error, and this error may be attributable to the test itself, the
examiner, the examinee, or the environment (cont.)

Although we must accept some of these assumptions at


face value, we can increase our confidence in others by
following certain steps during test development. For
example, in Section III of this textbook, which covers
test construction, we talk about how to design test
questions that are understood universally. We also talk
about the techniques that are available to promote
honest answering. In Section II, which covers
psychometric principles, we discuss how to measure a
test’s reliability and validity.
Historical Development
• The history of psychological testing is fascinating and has abundant relevance to present-day practices.

• The history of psychological testing can be divided into three eras:
• (a) Pre-1800s
• (b) 1800s-1900s
• (c) 1900s onwards
Earliest Period
• 2200 BCE - Xia Dynasty
• 200-100 BCE - Han Dynasty
• 618-907 CE - T'ang Dynasty
• 1368-1644 - Ming Dynasty
• 1791 - France and Britain
Forms of testing in China in 2200 BC
• The Chinese Xia Dynasty emperor had his officials examined every third year to determine their fitness for office.

• Written exams were introduced in the Han Dynasty.

• Tests included topics like civil law, military affairs, etc.

• Suitability for civil service employment was also based upon penmanship.

• The examination system was abolished in 1906 by royal decree, in response to widespread discontent.
Han Dynasty
Ancient Chinese royal examinations began around 200-100 BC, in the late Qin (Ch'in) or early Han dynasty.
Five topics were tested:
(a) Civil law,
(b) Military affairs,
(c) Agriculture,
(d) Revenue,
(e) Geography
T’Ang Dynasty
Such examination systems seem to have been discontinued until the T'ang Dynasty, when their use increased significantly.
Ming Dynasty
During the Ming Dynasty the examinations became
more formal. There were different levels of
examinations:
1. Municipal,
2. Provincial, and
3. National

The results of examinations became associated with the granting of formal titles.
Europe
France initially began using this kind of
examination system in 1791. The system
adopted by France served as a model for a
British system started in 1833 to select trainees
for the Indian civil service.
Medieval Period
Psychiatric antecedents of psychological testing: the brass instrument era

1. German physician Hubert von Grashey (1885) developed an antecedent of the memory drum.

2. He tested brain-injured patients.

3. Subjects were shown words, symbols, or pictures through a slot in a sheet that moved slowly.

4. Patients could identify the stimuli in their totality but could not identify them when shown through the moving slot.
Brass Instruments era of Psychological Testing

• Experimental psychology flourished in continental Europe and Great Britain.

• Problem: psychologists mistook simple sensory processes for intelligence.

• They used assorted brass instruments to measure sensory thresholds and reaction times.

• Wilhelm Wundt, James McKeen Cattell, and Clark Wissler showed that it was possible to expose the mind to scientific scrutiny and measurement.
Wundt’s Contributions
• Founded the first psychological laboratory at the University of Leipzig, Germany, in 1879.

• Measured mental processes with his "thought meter."

• This device consisted of a pendulum and needles; the swinging pendulum struck the needles, sounding bells.

• It was used to determine the swiftness of thought of the observer.

• This was an empirical analysis that tried to explain individual differences.
Francis Galton’s Contribution
• Obsessed with measurement; believed that virtually anything was measurable.

• He continued the tradition of brass instruments but with an important difference: his procedures allowed data to be collected quickly from large numbers of people.

• Galton is considered the father of mental testing.

• Set up a laboratory in London in 1884. Things measured: physical - height, weight, head length, etc.; behavioral - strength of hand squeeze (dynamometer), vital capacity of the lungs (spirometer), and reaction time (RT) to both visual and auditory stimuli.
MODERN PERIOD
Binet, Simon, and testing for higher mental processes

• Revised scales and the advent of IQ (1905)

• The Stanford-Binet, the early mainstay of IQ (1916)

• Group tests and classification of WWI Army recruits (1917)

• The Wechsler-Bellevue Intelligence Scale and the Wechsler Adult Intelligence Scale (1930s onward)
Binet and testing of higher mental processes

• In 1904, the minister of public instruction in Paris appointed a commission to decide upon the educational measures that should be undertaken with children who could not profit from regular instruction.

• Binet and Simon developed a tool for this purpose (the 1905 scale).

• 1908: Binet and Simon published a revision of the 1905 scale.

• 1911: The third revision of the Binet-Simon scale appeared.
Stanford-Binet early mainstay of IQ
• Lewis Terman popularised IQ testing in the US with his 1916 revision of the Binet scales, which later became known as the Stanford-Binet test.

• William Stern suggested that a better measure of an individual's functioning relative to same-aged peers could be obtained with an intelligence quotient (IQ).
Group Tests and Classification of World War I Army Recruits

I. Army Alpha test (1917): for literate recruits (developed under the leadership of Prof. Robert Yerkes, President of the American Psychological Association).

II. Army Beta test (1917): for illiterate recruits, developed because the Army Alpha could not produce meaningful results for the many recruits who could not read.
Wechsler intelligence scale
Devised intelligence scales for different age groups:

(A) Wechsler Adult Intelligence Scale (WAIS, 1955)

(B) Wechsler Intelligence Scale for Children (WISC, 1949)

(C) Wechsler Preschool and Primary Scale of Intelligence (WPPSI, 1967)
Advent of Other Tests
1. Aptitude test: measures underlying potential

2. Personality test: measures specific individual characteristics

3. Projective test: the individual's inner thoughts are measured

4. Interest inventory: measures interests
Development of Personality Tests

• The earliest personality tests were structured, paper-and-pencil, group tests, mostly multiple choice.

• The first structured personality test was the Woodworth Personal Data Sheet. The motivation to develop this test was the need to screen military recruits. It was published just after World War I.
Decline of Personality Tests
Structured tests were criticized because
• in practice, the logical assumptions on which the tests were built often proved to be wrong.

• Because of these drawbacks, their use declined from the late 1930s to the early 1940s.

• Following World War II, new personality tests based on fewer or different assumptions were introduced.
Rise of Projective Techniques
• After the rise and fall of structured personality tests, interest in projective tests began to grow because their interpretation did not rest on the same questionable assumptions.

• Hermann Rorschach of Switzerland developed the Rorschach Inkblot Test in 1921. However, David Levy first introduced it in the U.S.A. around 1932.
Rise of Projective Techniques (cont.)

• In 1935, Henry Murray and Christiana Morgan developed the Thematic Apperception Test (TAT), a little more structured than the Rorschach.

• In this test, participants are asked to make up dramatic stories about ambiguous scenes and situations depicted in pictures.
New Era of Structure Personality Test
• The TAT and Rorschach were not universally accepted, so in 1943 the Minnesota Multiphasic Personality Inventory (MMPI), a new structured personality test, was developed by Starke R. Hathaway and J. C. McKinley.

• It overcame most of the problems of previous structured tests.

• In 1989-1990 came the MMPI-2, led by J. N. Butcher, which is now the most widely used and referenced personality test.
New Era of Structured Personality Test (cont.)

• About the time the MMPI appeared, personality tests based on the statistical procedure called factor analysis began to emerge.

• In the early 1940s, J. P. Guilford made the first serious attempt to use factor-analytic techniques in the development of structured personality tests.

• By the end of that decade, Raymond B. Cattell had introduced the Sixteen Personality Factor Questionnaire (16PF), which remains one of the most well-constructed structured personality tests.
Rapid Changes in the Status of Testing

• The 1940s saw not only a whole new technology in psychological testing but also the growth of applied aspects of psychology.

• By 1949, formal university training standards had been developed and accepted, and clinical psychology was born.

• Other applied branches of psychology, such as industrial, counseling, educational, and school psychology, soon began to blossom, and all of them used testing.
Rise of Testing in Clinical Psychology
• The Shakow et al. report of 1947 recommended that testing methods should be taught only to doctoral psychology students.

• In 1984, the American Psychological Association (APA) formally declared that clinical psychologists could conduct testing but NOT psychotherapy independently, without "true" collaboration with physicians.
Decline in Testing
• Psychologists working as testers aided physicians during the late 1940s and early 1950s.

• As more creative and talented people entered the field of psychology, they felt like technicians serving the medical profession and therefore rejected their secondary role as testers.

• Attacks on testing, prompted by its potentially intrusive nature and by its misuse, created public suspicion.

• Testing therefore underwent another sharp decline in status in the late 1950s that persisted until the 1970s.
Re-emergence in Testing
• During the 1980s and 1990s testing again grew in status and use as major branches of applied psychology emerged (neuropsychology, health, forensic, child), all of which made extensive use of testing.

 Health: surveys in a variety of medical settings.
 Child: assessment of childhood disorders.
 Forensic: used by the legal system to assess insanity, incompetency, and emotional damage.
Some of the Tests developed After WW-I
• 1939: The Wechsler-Bellevue Intelligence Scale is published; developed by David Wechsler, it was specially designed as an adult-oriented point scale of intelligence. It was replaced by the Wechsler Adult Intelligence Scale (WAIS) in 1955; the WAIS-R was published in 1981, and the WAIS-III in 1997.

• 1942: The Minnesota Multiphasic Personality Inventory (MMPI) is published.

• 1947: Differential Aptitude Test (DAT), developed by George K. Bennett, Harold G. Seashore, and Alexander G. Wesman. It was designed to measure different abilities of both boys and girls in grades eight through twelve for purposes of educational and vocational guidance.
Some of the Tests developed After WW-I
• 1954: The Standards for Educational and Psychological Testing, by the APA in collaboration with the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME), was published. Revised in 1966, 1974, and 1985.

• 1967: The Personality Research Form (PRF) was developed and published by Douglas Jackson. It represents a relatively sophisticated attempt to measure dimensions of normal personality.

• 1969: Nancy Bayley developed the Bayley Scales of Infant Development for children between the ages of 1 month and 3½ years. The scales are: a Mental Scale, covering sensory and perceptual acuities, memory, learning, problem solving, vocalization, and abstract thinking; a Motor Scale, covering sitting, standing, walking, and the manipulatory skills of the hands and fingers; and a Behavior Rating Scale, assessing aspects of personality such as emotional and social behavior, attention, arousal, and persistence. It uses a 5-point scoring system.
Some of the Tests developed After WW-I
• 1974: Rudolf Moos begins publishing the Social Climate Scales to assess different environments.

• 1978: Jane Mercer published the System of Multicultural Pluralistic Assessment (SOMPA), a test battery designed to reduce cultural discrimination.

• 1999: The APA and other groups published the revised Standards for Educational and Psychological Testing (SEPT).
Some Milestones in the Development of Tests
 1000 B.C.: Testing in the Chinese civil service

 1883 A.D.: The East India Company conducted examinations to recruit personnel for the company

 1850-1900: Civil service examinations in the U.S.

 1900-1920: Development of individual and group tests of cognitive ability; development of psychometric theory

 1920-1940: Development of factor analysis, projective tests, and standardized personality inventories

 1940-1960: Development of vocational interest measures and standardized measures of psychopathology

 1960-1980: Development of item response theory and neuropsychological testing

 1980-present: Large-scale implementation of computerized adaptive testing

Classification of Psychological Tests
A. On the Basis of the Criterion of Administration
1. Individual test
2. Group test

B. On the Basis of the Criterion of Scoring


1. Objective tests
2. Subjective tests
Classification of Psychological Tests
C. On the Basis of the Criterion of Time Limit:
1. Speed Test
2. Power Test

D. On the Basis of the Nature or Contents of the Items:-


1. Verbal test
2. Non-Verbal test
3. Performance test
Classification of Psychological Tests
E. On the basis of criterion of purpose or objective
1. Intelligence tests
2. Aptitude Tests
3. Achievement tests
4. Personality tests
5. Neuropsychological tests
6. Interest inventories
7. Creativity tests

F. On the basis of the criterion of standardization


1. Standardized tests
2. Non standardized tests
Classification of Psychological Tests
G. On the Basis of the Criterion of Fairness with Culture
A. Culture-fair tests
B. Culture-biased tests
Projective Personality Tests
Projective Tests: Basic Essential Features

• Individuals must impose their own meaningful structure
• Stimulus material is unstructured
• Indirect (disguised) method
• Freedom of response
• Interpretation is broad
Projective Tests
All projective methods share the assumption that when people try to make sense of ambiguous stimuli, their understanding is determined by their feelings, experiences, needs, and thought processes (this assumption has been called the projective hypothesis). In other words, the projective hypothesis assumes that a person will project something important about him- or herself onto an ambiguous stimulus.
Projective Tests
Projective methods are relatively unstructured.
The tasks allow for an almost inexhaustible
variety of possible responses.
Projective Tests
The purpose and procedures used in projective tests are usually disguised. That is, the client does not know how their responses will be analyzed and interpreted.
Projective Tests (cont.)
The client/participant is not limited in the way they respond, unlike on a questionnaire requiring a yes/no or other designated response format.
Projective Tests (cont.)
The clinician can make interpretations along multiple dimensions, unlike with objective tests, which usually provide a single score or a set of scores on a fixed number of scales.
Projective Tests (cont.)
• Rorschach Inkblot Test
• Thematic Apperception Test
Hermann Rorschach (1884-1922)
• Nicknamed “Kleck” or
inkblot
• Talented art student who
decided to study science
• Dream convinced him of
relationship between
perception and
unconscious
• 1921 published
Psychodiagnostik
• Died in 1922
Hermann Rorschach (1884-1922)

• When he was in high school, the inventor of the world-famous Rorschach Inkblot Test was called Kleck, or Inkblot, by his classmates. Like many other youngsters in Switzerland, he enjoyed Klecksography, the making of fanciful inkblot pictures.

• An art student and teacher like his father, he had great talent at painting. When graduating he could not decide between a career in art and one in science. He wrote to a famous biologist to ask for advice, who predictably suggested science.
Hermann Rorschach (1884-1922)
• While Sigmund Freud was studying the unconscious, the excitement over the topic circulating in scientific circles constantly reminded Rorschach of his childhood inkblots. He wondered why two people could see entirely different things in the same inkblot. Rorschach began showing the inkblots to children and examining their reactions.

• It was a dream that influenced his development of the test. One night, after witnessing an autopsy, he dreamt that his own body was being autopsied but that he could see and feel what was happening. This dream helped convince Rorschach that there was a strong tie between perception and the unconscious. He chose the topic of hallucinations for his dissertation.
Hermann Rorschach (1884-1922)
• In 1921 Rorschach published Psychodiagnostik, in which he described a method of diagnosing patients and describing characteristics of their personality based upon their responses to a set of ten inkblots. The book contained Rorschach's observations about the responses of hundreds of patients and non-patients to his inkblots.

• Among the many observations, Rorschach noted that the schizophrenic group responded much differently than others.
Hermann Rorschach (1884-1922)
• Unfortunately Rorschach died of complications of appendicitis in
1922 one year after the publication of his famous book, at the age of
37 years.

• There was not a strong response to his book during the short period
between its publication and his untimely death. Very few copies sold
and the original publisher declared bankruptcy not long after
Rorschach’s death (but not on account of the book).

• His work might have had little impact upon psychiatry or clinical
psychology had it not been for three of his close friends who
continued to teach Rorschach’s method and to keep it alive.
Rorschach Inkblot Test
Rorschach: Historical
5 Scoring Systems
• Adopted by 5 American psychologists with
very different theoretical backgrounds
• Shared common features (same blots were
used, response phase followed by inquiry)
• 5 different systems of administration, scoring
and interpretation emerged
• Two most popular (Beck, Klopfer)
Rorschach: Validity and Reliability
Poor psychometric reputation:
• Lack of standardized rules for administration
and scoring
• Poor inter-rater reliability
• Lack of adequate norms
• Unknown or weak validity
Rorschach: Contemporary Use
• John Exner established the Rorschach Research Foundation in 1968

• Integrated the five scoring and interpretation systems

• Established empirical support for the new system

• Provided a center for training
Contemporary Use: Administration

Association Phase: "What might this be?" Present all the cards, record each response verbatim, and note the location of each response.

Inquiry Phase: "I want you to help me see what you saw. I'm going to read what you said, and then I want you to show me where on the blot you saw it and what there is there that makes it look like that, so that I can see it too. I'd like to see it just like you did, so help me now."
Rorschach Inkblot Test
• A psychometrically sound test?
• An in-class exercise
Contemporary Use: Scoring
Exner scoring system: The Structural Summary

Location
• Location (W, D, Dd)
• Use of white space (S)

Determinants
• Form (good, poor, bad quality)
• Movement (active and passive)
• Color
• Texture
• Shading
Rorschach Inkblot Test
• A psychometrically sound test?
• Particularly useful in assessing thought
processes
Thematic Apperception Test (TAT)
• Developed by Henry Murray and Christiana Morgan at the Harvard Psychological Clinic in 1935.

• 31 TAT cards depicting people in a variety of ambiguous situations (one card is blank).

• The examinee is asked to create a story about each picture.
TAT: Administration
Now I want you to make up a story about each
of these pictures. Tell me who the people are,
what they are doing, what they are thinking or
feeling, what led up to the scene, and how it will
turn out.
TAT Scoring and Interpretation
Content analysis of themes that emerge from
the stories
TAT: Psychometric Criticisms
• Selection of cards is not standardized
• Lack of norms
• Clinicians rely on qualitative impressions
TAT
Used to assess:
• Locus of problems
• Nature of needs
• Quality of interpersonal relationships
Minnesota Multiphasic Personality Inventory
(MMPI)
• Developed in the 1930s
• Starke Hathaway, Ph.D., & J. Charnley McKinley, M.D.
• Needed a test to aid diagnosis
• Developed an item pool
• Identified a group of patients and nonpatients
• Resulting scale of 550 items (true/false/cannot say)
MMPI Clinical Scales
Scale # - Scale Name - Meaning of High Score
1 - Hypochondriasis - Concern about health
2 - Depression - Depression
3 - Hysteria - Somatic complaints; denial of psychological problems
4 - Psychopathic Deviate - Antisocial behavior
5 - Masculinity-Femininity - Nonstandard gender interests
6 - Paranoia - Suspiciousness
7 - Psychasthenia - Anxiety
8 - Schizophrenia - Disturbed thought
9 - Hypomania - Manic mood
10 - Social Introversion - Shy, socially inept
MMPI: Validity Scales
? (Cannot say)
• Unanswered items

L (Lie)
• Faking good

F (Infrequency)
• Faking bad

K (Defensiveness)
• Defensiveness in admitting to problems
Interpreting MMPI
• Validity Scales
• Single scales
• Profile analysis
MMPI: Shortcomings
• Unrepresentative normative sample
• Language of items was outdated (including
sexist language)
• Inadequately addressed difficulties such as
suicide or drug use
MMPI: Revision
• Assembled team of MMPI experts
• Rewrote some items
• Added new items
• Administered new item pool (n=704) to a
standardization sample (representative)
• Retained 567 items from the item pool
Continued problems
– failure of some items to reliably discriminate between groups
– dimensions based on a pre-conceived theory about the structure of personality
– scales correlate highly and thus provide redundant information
– scores are highly influenced by the examinee's state at the time of testing, so test-retest stability may be lower than desired (a problem for many, if not most, trait measures)
MMPI-2 Content Scales
• Anxiety
• Fears
• Obsessiveness
• Depression
• Health Problems
• Bizarre Thoughts
• Anger
• Cynicism
• Antisocial Practices
• Type A
• Low Self-Esteem
• Social Discomfort
• Family Concerns
• Work Interference
• Negative Treatment Indicators
MMPI-2 Content Scales (cont.)
• Content scales consist of groups of items
related to a specific content area that have been
empirically related to that content area, such as
the anger scale whose content contains
references to irritability, hot-headedness, and
other symptoms of anger-control problems.
Steps for Test Construction
In order to develop a test, one must choose strategies and materials and then make day-to-day research decisions that will affect the quality of the emerging instrument.

 Valid tests always emerge slowly from an evolutionary developmental process that builds in validity from the very beginning.
6 stages of test construction
• Defining the test
• Selecting the Scaling Method
• Constructing the Test Items
• Testing the Test Items/Items Analysis
• Revising the Test
• Publishing the Test
Defining the Test
• In order to construct a new test, the developer must have a clear idea of what the test is to measure and how it is to differ from existing instruments.

• The developer should determine the test's scope and purpose, which must be known before test construction can proceed.

• Contemporary research continually adds to our understanding of intelligence and impels us to seek new and more useful ways to measure this multifaceted construct.
Defining the Test (cont.)
• For instance, Kaufman and Kaufman (1983) provided a good model of the test-defining process. They proposed the Kaufman Assessment Battery for Children (K-ABC), a new test of general intelligence in children that offered a fresh focus for measuring intelligence.
Selecting a Scaling Method
Many psychological tests involve assigning numerals to observations or responses.

The test developer should carefully choose a scaling method suited to the trait being measured. For example, some tests measure a trait with qualitative variables; a test of introversion-extraversion might use qualitative descriptors such as shy, sociable, and outgoing.

An important concept here is the "scales of measurement" framework given by S. S. Stevens (1946).
Scales of Measurement
1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
Constructing the Items
Constructing the items is a painful and laborious procedure that taxes the
creativity of test developers. The item writer is confronted with a profusion of
initial questions:

1.Should item content be homogeneous or varied?

2. What range of difficulty should the items cover?

3. How many initial items should be constructed?

4. Which cognitive processes and item domains should be tapped?

5. What kind of test item should be used?


Initial Questions in Test Construction
1. Should item content be homogeneous or varied?

 The first question pertains to the homogeneity versus


heterogeneity of test item content. In large measure, whether
item content is homogeneous or varied is dictated by the
manner in which the test developer has defined the new
instrument. Consider a culture-reduced test of general
intelligence. Such an instrument might incorporate varied
items, so long as the questions do not presume specific
schooling. The test developer might seek to incorporate novel
problems equally unfamiliar to all examinees.
2. What range of difficulty should the test items cover?

The range of item difficulty must be sufficient to allow for


meaningful differentiation of examinees at both extremes. The
most useful tests, then, are those that include a graded series of
very easy items passed by almost everyone as well as a group of
incrementally more difficult items passed by virtually no one.

Ceiling Effect :- A ceiling effect is observed when significant


numbers of examinees obtain perfect or near perfect scores. The
problem with a ceiling effect is that distinctions between high
scoring examinees are not possible, even though these
examinees might differ substantially on the underlying trait
measured by the test.
Floor Effect:
A Floor Effect is observed when significant numbers of
examinees obtain scores at or near the bottom of the
scale. For example, the WAIS-R has a serious floor
effect in that it fails to discriminate between moderate,
severe, and profound levels of mental retardation – all
persons with significant developmental disabilities fail
to answer virtually every question.
3. How many initial items should be constructed?

Test developers expect that some initial items will


prove to make ineffectual contributions to the overall
measurement goal of their instrument. For this reason,
it is common practice to construct a first draft that
contains excess items, perhaps double the number of
questions desired on the final draft. For example, the
550-item MMPI originally consisted of more than 1,000
true-false personality statements.
4. Which cognitive processes and item domains should
be tapped?

 Before development of a test begins, item


writers usually receive a table of specifications. A
table of specifications enumerates the information
and cognitive tasks on which examinees are to be
assessed. Perhaps the most common specification
table is the content-by-process matrix, which lists
the exact number of items in relevant content
areas and details the precise composite of items
that exemplify different cognitive processes .
For example :
Consider a science achievement test suitable for high
school students. Such a test must cover many different
content areas and should require a mixture of cognitive
processes ranging from simple recall to inferential
reasoning. By providing a table of specifications prior to
the item-writing stage, the test developer can
guarantee that the resulting instrument contains a
proper balance of topical coverage and taps a desired
range of cognitive skills. A hypothetical but realistic
table of specifications is portrayed...
5. What kind of test item should be used? When it comes to the method by which psychological attributes are to be assessed, the test developer is confronted with dozens of choices. For example:

Multiple-choice questions:

For group-administered tests of intellect or achievement, the technique of choice is the multiple-choice question.
Proponents of multiple-choice methodology argue that properly constructed items can measure conceptual as well as factual knowledge. Multiple-choice tests also permit quick and objective machine scoring. Furthermore, the fairness of multiple-choice questions can be verified with very simple item-analysis procedures.

The major shortcomings of multiple-choice questions are, first, the difficulty of writing good distracter options and, second, the possibility that the presence of the response options may cue a half-knowledgeable respondent to the correct answer.
Matching Questions :
Matching questions are popular in classroom testing,
but suffer serious psychometric shortcomings.

The most serious problem with matching questions is


that responses are not independent – missing one
match usually compels the examinee to miss
another . Another problem is that the options in a
matching question must be very closely related or
the question will be too easy.
Individually Administered Tests
For individually administered tests, the procedure of
choice is the short-answer objective item. Indeed, the
simplest and most straightforward types of questions
often possess the best reliability and validity. A case in
point is the Vocabulary subtest from the WAIS-III, which
consists merely of asking the examinees to define
words. This subtest has very high reliability (.96) and is
usually considered the single best measure of overall
intelligence on the WAIS-III.
Testing the Items/ Item Analysis
Numerous test items from the original try-out pool will be
discarded or revised as test development proceeds. For
this reason, test developers initially produce many, many
excess items, perhaps double the number of items they
intend to use. So, how is the final sample of test
questions selected from the initial item pool? Test
developers use item analysis, a family of statistical
procedures, to identify the best items. In general the
purpose of item analysis is to determine which item
should be retained, which revised, and which thrown out.
Items – Difficulty Index
The item difficulty for a single test item is defined as the proportion of examinees in a large try-out sample who get that item correct. For any individual item i, the index of item difficulty is p_i, which varies from 0.0 to 1.0.

The item-difficulty index is a useful tool for identifying items that should be altered or discarded. Suppose an item has a difficulty index near 0.0, meaning that nearly everyone has answered it incorrectly. This item is psychometrically unproductive because it does not provide information about the differences between examinees. For most applications, the same can be said of an item with a difficulty index near 1.0, where virtually all subjects provide a correct answer.
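A minimal sketch of how the item-difficulty index might be computed, assuming a small hypothetical matrix of scored responses (1 = correct, 0 = incorrect); the data are illustrative only:

# Minimal sketch (hypothetical data): item difficulty p_i is the proportion
# of examinees in the try-out sample who answer item i correctly.
# Rows = examinees, columns = items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

n_examinees = len(responses)
n_items = len(responses[0])

p = [sum(row[i] for row in responses) / n_examinees for i in range(n_items)]
print(p)  # [0.75, 0.75, 0.25, 1.0]; the last item, passed by everyone, has a difficulty near 1.0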
What is the optimal level of item difficulty?

Generally, item difficulties that hover around .5, ranging between .3 and .7, maximize the information the test provides about differences between examinees. However, this rule of thumb is subject to one important qualification and one very significant exception.

 For true-false or multiple-choice items, the optimal level of item difficulty needs to be adjusted for the effects of guessing. For a true-false test, a difficulty level of .5 can result when examinees merely guess. Thus, the optimal level for such items would be .75 (halfway between .5 and 1.0).

Optimal level of item difficulty = (1.0 + g) / 2, where g is the chance success level.
Thus, for a four-option multiple-choice item, the chance success level is .25, and the optimal level of item difficulty would be (1.0 + .25)/2, or about .63.

 If a test is to be used for selecting an extreme group by means of a cutting score (a very high or very low score), it may be desirable to select items with difficulty levels outside the .3 to .7 range.

 For example:
University graduate admissions - very high cutting score.
Students eligible for a remedial education programme - very low cutting score.
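A small worked sketch of the optimal-difficulty adjustment above, using the chance success levels mentioned in the text (0.5 for true-false, 0.25 for four-option multiple choice); the values are illustrative:

# Minimal sketch: optimal item difficulty adjusted for guessing,
# using the formula given above, (1.0 + g) / 2, where g is the chance success level.
def optimal_difficulty(g):
    return (1.0 + g) / 2

print(optimal_difficulty(0.5))    # true-false item: 0.75
print(optimal_difficulty(0.25))   # four-option multiple choice: 0.625 (about .63)
print(optimal_difficulty(0.0))    # free-response item with no guessing: 0.5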
Item Characteristic Curve
Purpose: An Item Characteristic Curve (ICC) is a graphical display of the relationship between the probability of a correct response and the examinee's position on the underlying trait measured by the test. It can also be described as a mathematical idealization of the relationship between the probability of a correct response and the amount of the trait possessed by test respondents.

The theory behind ICCs is also known as item response theory or latent trait theory.
Rasch Model
The simplest ICC model is the Rasch Model, based upon the item-response theory of the Danish mathematician Georg Rasch (1960). It is the simplest model because it makes just two assumptions:

1. Test items are unidimensional and measure a continuum of difficulty levels.

2. In general, a good item has a positive ICC slope that resembles a normal ogive.

The desired shape of the ICC depends upon the purpose of the test.
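For illustration, the sketch below uses the one-parameter logistic form commonly associated with the Rasch model; the function and the sample values are assumptions for teaching purposes, not taken from the slides:

# Minimal sketch: probability of a correct response as a function of the
# examinee's trait level (theta) and the item's difficulty (b).
import math

def rasch_icc(theta, b):
    # P(correct) = exp(theta - b) / (1 + exp(theta - b))
    return math.exp(theta - b) / (1 + math.exp(theta - b))

# An examinee whose trait level equals the item difficulty has a 0.5 probability
# of answering correctly; the probability rises as theta - b increases.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(rasch_icc(theta, b=0.0), 3))
# -2 0.119, -1 0.269, 0 0.5, 1 0.731, 2 0.881 -> a monotonically increasing, ogive-like ICC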
Item Discrimination Index
Purpose: An item discrimination index is a statistical index of how efficiently an item discriminates between persons who obtain high and low scores on the entire test.

What are the upper range and lower range?
- The upper range generally includes a specific percentage of examinees who scored high on the entire test.
- The lower range generally includes the same percentage of examinees who scored low on the entire test.
- The upper-range and lower-range percentages can vary from 10% to 33%.
The item discrimination index is symbolised by the letter "d" and is calculated from the formula
d = (U - L)/N, where U is the number of examinees in the upper range who answered the item correctly, L is the number of examinees in the lower range who answered the item correctly, and N is the number of examinees in the upper (or lower) range.

For example, a test developer administers an exam to a try-out sample of 400 high school students. After scoring, he or she identifies the 25% of students with the highest scores and the 25% with the lowest scores.

 The purpose of comparing high scorers with low scorers is to identify the best items.

An ideal item is one that most of the high scorers pass and most of the low scorers fail; computing "d" makes this explicit.
"d" varies from -1.0 to +1.0.
- An item with d = -1.0 indicates that more of the low-scoring subjects answered the item correctly than did the high-scoring subjects.

- An item with d = 0.0 indicates that equal numbers of low- and high-scoring subjects answered the item correctly.

- An item with d = +1.0, or close to +1.0, is preferred.
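A minimal sketch of the d computation with hypothetical numbers; it assumes, as discussed above, that N is the number of examinees in each (upper or lower) range, which is what allows d to span -1.0 to +1.0:

# Minimal sketch (hypothetical numbers): item discrimination index d = (U - L) / N.
def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

# 400 examinees, top and bottom 25% -> 100 examinees in each range.
print(discrimination_index(upper_correct=90, lower_correct=20, group_size=100))  # 0.70, a good item
print(discrimination_index(upper_correct=50, lower_correct=50, group_size=100))  # 0.00, no discrimination
print(discrimination_index(upper_correct=10, lower_correct=60, group_size=100))  # -0.50, a flawed item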
Social Desirability
Social desirability can arise in two ways:
1. On the part of test construction, e.g., the inclusion of socially loaded words such as "Negro" or "homosexual."
2. On the part of the participant, e.g., an incomplete-sentence item such as "My brother ---" can push the participant toward socially desirable answers; just for the sake of giving a socially desirable answer, a person may fake his or her responses.
Revising the Test
Purpose: identify unproductive items in the preliminary test so that they can be revised, eliminated, or replaced.

 In the process of test development many items are dropped, others refined, and new items added. The result is that a new and slightly different test emerges. This revised test contains more discriminating items with higher reliability and greater predictive accuracy, but these improvements hold for the first try-out sample only.

 Therefore there is a need to collect new data from a second try-out sample. The purpose of collecting additional test data is to repeat the item analysis procedures anew.
Feedback From Examinees
Feedback from examinees is a potentially valuable source of information that is normally overlooked by test developers. We can illustrate this approach with the research by Nevo (1992).

Nevo developed the Examinee Feedback Questionnaire (EFeQ) to study the Inter-University Psychometric Entrance Examination. This exam is a group test consisting of five subtests: General Knowledge, Figural Reasoning, Comprehension, Mathematical Reasoning, and English.

Some assessed features were the behaviour of examinees, testing conditions, level of guessing, clarity of exam instructions, etc.
Publishing the Tests
Production of the testing materials

- The physical packaging of test materials must allow for quick and smooth administration.

- The test developer should simplify the duties of the examiner while leaving examinee task demands unchanged; the resulting instruments will have greater acceptability to potential users. For example, if the administration instructions can be summarised on the test form, the examiner can put the test manual aside while setting out the task for the examinee.
Technical Manual and User’s Manual
Technical Manual: This manual contains technical data about a new instrument. Here the user can find information about item analysis, scale reliabilities, etc.

 User's Manual: This manual gives instructions for administration and also provides guidelines for test interpretation.

 Some purposes of test manuals:
- Describe the rationale and recommended uses for the test.
- Provide specific cautions against anticipated misuses of the test.
- Cite quantitative relationships between test scores and criteria.
- Provide essential data on reliability and validity.
Testing in Big Business
• Test development is extraordinarily expensive, which means that publishers are inherently conservative about introducing new tests. For example, Jensen's (1980) view on this topic is that improving on an existing test can be a multimillion-dollar project requiring a large staff of test-construction experts working for several years, and obtaining copyrights and government subsidies can be very difficult.
Characteristics of Psychological Tests

• Objectivity
• Reliability
• Validity
• Norms
Objectivity
Objectivity is the tendency to base
decisions and perceptions on exterior
information instead of on subjective
aspects, like private emotions, beliefs,
and experiences.
Types of Reliability
 
From the perspective of classical test theory,
an examinee's obtained test score (X) is
composed of two components, a true score
component (T) and an error component (E):
 
X=T+E
 
Reliability
The true score component reflects the
examinee's status with regard to the attribute
that is measured by the test, while the error
component represents measurement error.

Measurement error is random error. It is due


to factors that are irrelevant to what is being
measured by the test and that have an
unpredictable (unsystematic) effect on an
examinee's test score.
Reliability
The score you obtain on a test is likely to be due
both to the knowledge you have about the
topics addressed by exam items (T) and the
effects of random factors (E) such as the way
test items are written, any alterations in anxiety,
attention, or motivation you experience while
taking the test, and the accuracy of your
"educated guesses."
 
Reliability
Whenever we administer a test to examinees,
we would like to know how much of their
scores reflects "truth" and how much reflects
error. It is a measure of reliability that
provides us with an estimate of the proportion
of variability in examinees' obtained scores
that is due to true differences among
examinees on the attribute(s) measured by
the test.
Reliability

When a test is reliable, it provides


dependable, consistent results and, for this
reason, the term consistency is often given as
a synonym for reliability (e.g., Anastasi, 1988).

Consistency = Reliability
The Reliability Coefficient
Ideally, a test's reliability would be calculated by dividing true score variance by the obtained (total) variance to derive a reliability index. This index would indicate the proportion of observed variability in test scores that reflects true score variability.

Reliability Index = True Score Variance / Total Variance
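A minimal simulation sketch (hypothetical true scores and random error; all values are assumptions chosen for illustration) showing that the ratio of true score variance to total variance behaves as the reliability index described above:

# Minimal sketch: under classical test theory X = T + E, the reliability index
# is the ratio of true score variance to total observed variance.
import random
random.seed(0)

n = 10_000
true_scores = [random.gauss(100, 15) for _ in range(n)]      # T
errors      = [random.gauss(0, 5)    for _ in range(n)]      # E, random measurement error
observed    = [t + e for t, e in zip(true_scores, errors)]   # X = T + E

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))   # close to 15**2 / (15**2 + 5**2) = 0.90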
The Reliability Coefficient
A test's true score variance is not known, however, and
reliability must be estimated rather than calculated directly.

There are several ways to estimate a test's reliability. Each


involves assessing the consistency of an examinee's scores
over time, across different content samples, or across
different scorers.

The common assumption for each of these reliability techniques is that consistent variability is true score variability, while variability that is inconsistent reflects random error.
The Reliability Coefficient
Most methods for estimating reliability
produce a reliability coefficient, which is a
correlation coefficient that ranges in value
from 0.0 to + 1.0. When a test's reliability
coefficient is 0.0, this means that all variability
in obtained test scores is due to measurement
error. Conversely, when a test's reliability
coefficient is + 1.0, this indicates that all
variability in scores reflects true score
variability.
The Reliability Coefficient
The reliability coefficient is symbolized with
the letter "r" and a subscript that contains two
of the same letters or numbers (e.g., ''rxx'').

The subscript indicates that the correlation


coefficient was calculated by correlating a test
with itself rather than with some other
measure.
The Reliability Coefficient
Regardless of the method used to calculate a reliability coefficient, the coefficient is interpreted directly as the proportion of variability in obtained test scores that reflects true score variability. For example, as depicted in Figure 1, a reliability coefficient of .84 indicates that 84% of variability in scores is due to true score differences among examinees, while the remaining 16% (1.00 - .84) is due to measurement error.

Figure 1. Proportion of variability in test scores: true score variability (84%) versus error (16%).
The Reliability Coefficient
Note that a reliability coefficient does not provide any
information about what is actually being measured by a
test!

A reliability coefficient only indicates whether the attribute


measured by the test - whatever it is - is being assessed in a
consistent, precise way.

Whether the test is actually assessing what it was designed


to measure is addressed by an analysis of the test's validity.
Methods to Estimate/Establish Reliability

The selection of a method for estimating


reliability depends on the nature of the test.

Each method not only entails different


procedures but is also affected by different
sources of error. For many tests, more than
one method should be used.
1. Test Re-Test Reliability
The test-retest method for estimating reliability
involves administering the same test to the
same group of examinees on two different
occasions and then correlating the two sets of
scores. When using this method, the reliability
coefficient indicates the degree of stability
(consistency) of examinees' scores over time
and is also known as the coefficient of stability.
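A minimal sketch with hypothetical scores: the test-retest (stability) coefficient is simply the Pearson correlation between the two administrations of the same test.

# Minimal sketch (hypothetical scores): test-retest reliability as the Pearson
# correlation between scores from two administrations of the same test.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 15, 9, 20, 17, 11]    # scores at the first administration
time2 = [13, 14, 10, 19, 18, 10]   # scores two weeks later
print(round(pearson_r(time1, time2), 2))   # coefficient of stability, here about 0.96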
Test Re-Test Reliability
The primary sources of measurement error for test-
retest reliability are any random factors related to the
time that passes between the two administrations of the
test.

These time sampling factors include random fluctuations


in examinees over time (e.g., changes in anxiety or
motivation) and random variations in the testing
situation.

Memory and practice also contribute to error when they


have random carryover effects; i.e., when they affect
many or all examinees but not in the same way.
Advantage of Test-Retest Reliability

Test-retest reliability is appropriate for determining


the reliability of tests designed to measure attributes
that are relatively stable over time and that are not
affected by repeated measurement.

It would be appropriate for a test of aptitude, which


is a stable characteristic, but not for a test of mood,
since mood fluctuates over time, or a test of
creativity, which might be affected by previous
exposure to test items.
2. Alternate (Equivalent, Parallel) Forms Reliability:

To assess a test's alternate forms reliability, two


equivalent forms of the test are administered to
the same group of examinees and the two sets
of scores are correlated.

Alternate forms reliability indicates the


consistency of responding to different item
samples (the two test forms) and, when the
forms are administered at different times, the
consistency of responding over time.
2. Alternate (Equivalent, Parallel) Forms Reliability

• The alternate forms reliability coefficient is


also called the coefficient of equivalence
when the two forms are administered at
about the same time;

• and the coefficient of equivalence and


stability when a relatively long period of
time separates administration of the two
forms.
Alternate (Equivalent, Parallel) Forms Reliability

The primary source of measurement error for


alternate forms reliability is content sampling,
or error introduced by an interaction between
different examinees' knowledge and the
different content assessed by the items
included in the two forms (e.g., Form A and
Form B).
Alternate (Equivalent, Parallel) Forms Reliability

The items in Form A might be a better match of one


examinee's knowledge than items in Form B, while the
opposite is true for another examinee.

In this situation, the two scores obtained by each


examinee will differ, which will lower the alternate
forms reliability coefficient.

When administration of the two forms is separated by


a period of time, time sampling factors also contribute
to error.
Alternate (Equivalent, Parallel) Forms Reliability

Like test-retest reliability, alternate forms


reliability is not appropriate when the
attribute measured by the test is likely to
fluctuate over time (and the forms will be
administered at different times) or when
scores are likely to be affected by repeated
measurement.
Alternate (Equivalent, Parallel) Forms Reliability

• If the same strategies required to solve problems on Form


A are used to solve problems on Form B, even if the
problems on the two forms are not identical, there are
likely to be practice effects.

• When these effects differ for different examinees (i.e., are


random), practice will serve as a source of measurement
error.

• Although alternate forms reliability is considered by some


experts to be the most rigorous (and best) method for
estimating reliability, it is not often assessed due to the
difficulty in developing forms that are truly equivalent.
3. Internal Consistency Reliability:
Reliability can also be estimated by measuring
the internal consistency of a test.

Split-half reliability and coefficient alpha are two


methods for evaluating internal consistency. Both
involve administering the test once to a single
group of examinees, and both yield a reliability
coefficient that is also known as the coefficient of
internal consistency. 
Internal Consistency Reliability
To determine a test's split-half reliability, the
test is split into halves so that each examinee
has two scores (one for each half of the test).

Scores on the two halves are then correlated.


Tests can be split in several ways, but probably
the most common way is to divide the test on
the basis of odd - versus even - numbered
items.
Internal Consistency Reliability
A problem with the split-half method is that it produces a
reliability coefficient that is based on test scores that were
derived from one-half of the entire length of the test.

If a test contains 30 items, each score is based on 15 items.


Because reliability tends to decrease as the length of a test
decreases, the split-half reliability coefficient usually
underestimates a test's true reliability.

For this reason, the split-half reliability coefficient is ordinarily


corrected using the Spearman-Brown prophecy formula, which
provides an estimate of what the reliability coefficient would
have been had it been based on the full length of the test.
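As an illustration of the whole split-half procedure, the Python sketch below scores a simulated 30-item test, splits it into odd- and even-numbered items, correlates the halves, and applies the Spearman-Brown correction (all data are simulated, not from any real instrument):

import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))                          # latent ability of 50 examinees
items = (rng.normal(size=(50, 30)) < ability).astype(int)   # 30 dichotomous (0/1) items

odd_half = items[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)     # Spearman-Brown correction to full test length
print(f"Half-test correlation: {r_half:.2f}   Corrected split-half reliability: {r_full:.2f}")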
Internal Consistency Reliability
Cronbach's coefficient alpha also involves administering the
test once to a single group of examinees. However, rather
than splitting the test in half, a special formula is used to
determine the average degree of inter-item consistency.

One way to interpret coefficient alpha is as the average


reliability that would be obtained from all possible splits of
the test. Coefficient alpha tends to be conservative and can be
considered the lower boundary of a test's reliability (Novick &
Lewis, 1967).

When test items are scored dichotomously (right or wrong), a


variation of coefficient alpha known as the Kuder-Richardson
Formula 20 (KR-20) can be used.
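A minimal sketch of both formulas, assuming an examinee-by-item score matrix (function and variable names are illustrative only):

import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha: average inter-item consistency for any scoring scheme."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr20(items):
    """Kuder-Richardson Formula 20 for dichotomously (0/1) scored items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                         # proportion answering each item correctly
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

For 0/1 items the two functions give nearly identical values; any small difference comes from the sample- versus population-variance convention used for the item variances.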
Internal Consistency Reliability
Content sampling is a source of error for both split-half
reliability and coefficient alpha.

• For split-half reliability, content sampling refers to the error


resulting from differences between the content of the two
halves of the test (i.e., the items included in one half may
better fit the knowledge of some examinees than items in
the other half);

• for coefficient alpha, content (item) sampling refers to


differences between individual test items rather than
between test halves.
Internal Consistency Reliability
Coefficient alpha also has, as a source of error,
the heterogeneity of the content domain.

A test is heterogeneous with regard to content


domain when its items measure several
different domains of knowledge or behavior.
Internal Consistency Reliability

The greater the heterogeneity of the content


domain, the lower the inter-item correlations and
the lower the magnitude of coefficient alpha.

Coefficient alpha could be expected to be smaller for


a 200-item test that contains items assessing
knowledge of test construction, statistics, ethics,
epidemiology, environmental health, social and
behavioral sciences, rehabilitation counseling, etc.
than for a 200-item test that contains questions on
test construction only.
Internal Consistency Reliability
The methods for assessing internal consistency reliability
are useful when a test is designed to measure a single
characteristic, when the characteristic measured by the
test fluctuates over time, or when scores are likely to be
affected by repeated exposure to the test.

They are not appropriate for assessing the reliability of


speed tests because, for these tests, they tend to
produce spuriously high coefficients. (For speed tests,
alternate forms reliability is usually the best choice.)
4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability:

Inter-rater reliability is of concern whenever test


scores depend on a rater's judgment.

A test constructor would want to make sure that an


essay test, a behavioral observation scale, or a
projective personality test has adequate inter-rater
reliability. This type of reliability is assessed either by
calculating a correlation coefficient (e.g., a kappa
coefficient or coefficient of concordance) or by
determining the percent agreement between two or
more raters.
4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability:

Although the latter technique is frequently used, it can lead to


erroneous conclusions since it does not take into account the
level of agreement that would have occurred by chance alone.

This is a particular problem for behavioral observation scales


that require raters to record the frequency of a specific
behavior.

In this situation, the degree of chance agreement is high


whenever the behavior has a high rate of occurrence, and
percent agreement will provide an inflated estimate of the
measure's reliability.
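The contrast between raw percent agreement and a chance-corrected index such as Cohen's kappa can be made concrete with a short Python sketch (the ratings are hypothetical and the behavior is deliberately given a high base rate):

import numpy as np

def percent_agreement(r1, r2):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return (r1 == r2).mean()

def cohens_kappa(r1, r2):
    """Agreement between two raters, corrected for chance agreement."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_o = (r1 == r2).mean()                               # observed agreement
    p_e = sum((r1 == c).mean() * (r2 == c).mean()         # agreement expected by chance
              for c in np.union1d(r1, r2))
    return (p_o - p_e) / (1 - p_e)

# Two raters record whether a frequent behavior occurred (1) in ten observation intervals.
rater1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
rater2 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
print(percent_agreement(rater1, rater2))   # 0.80 -- inflated by the high rate of occurrence
print(cohens_kappa(rater1, rater2))        # about 0.38 once chance agreement is removed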
4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability:

Sources of error for inter-rater reliability include


factors related to the raters such as lack of
motivation and rater biases and characteristics
of the measuring device.

An inter-rater reliability coefficient is likely to be


low, for instance, when rating categories are not
exhaustive (i.e., don't include all possible
responses or behaviors) and/or are not
mutually exclusive.
4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability:

The inter-rater reliability of a behavioral rating scale


can also be affected by consensual observer drift,
which occurs when two (or more) observers working
together influence each other's ratings so that they
both assign ratings in a similarly idiosyncratic way.

(Observer drift can also affect a single observer's


ratings when he or she assigns ratings in a
consistently deviant way.) Unlike other sources of
error, consensual observer drift tends to artificially
inflate inter-rater reliability.
4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability:

The reliability (and validity) of ratings can be


improved in several ways:
– Consensual observer drift can be eliminated by having
raters work independently or by alternating raters.
– Rating accuracy is also improved when raters are told that
their ratings will be checked.
– Overall, the best way to improve both inter- and intra-rater
accuracy is to provide raters with training that emphasizes
the distinction between observation and interpretation
(Aiken, 1985).
Points to Remember
• Study Tip: Remember the Spearman-Brown formula
is related to split-half reliability and KR-20 is related
to the coefficient alpha. Also know that alternate
forms reliability is considered the most rigorous method for
estimating reliability and that internal consistency
reliability is not appropriate for speed tests.
Factors Affecting Reliability
The magnitude of the reliability coefficient is affected
not only by the sources of error discussed earlier, but
also by the length of the test, the range of the test
scores, and the probability that the correct response
to items can be selected by guessing.

– Test Length
– Range of Test Scores
– Guessing
1. Test Length
The larger the sample of the attribute being
measured by a test, the less the relative effects of
measurement error and the more likely the sample
will provide dependable, consistent information.

Consequently, a general rule is that the longer the


test, the larger the test's reliability coefficient.
 
1. Test Length
The Spearman-Brown prophecy formula is most associated
with split-half reliability but can actually be used whenever a
test developer wants to estimate the effects of lengthening or
shortening a test on its reliability coefficient.

For instance, if a 100-item test has a reliability coefficient


of .84, the Spearman-Brown formula could be used to
estimate the effects of increasing the number of items to 150
or reducing the number to 50.

A problem with the Spearman-Brown formula is that it does


not always yield an accurate estimate of reliability: In general,
it tends to overestimate a test's true reliability (Gay, 1992).
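A small Python sketch of the general Spearman-Brown formula, applied to the 100-item example above (n is the factor by which the test is lengthened or shortened):

def spearman_brown(r_xx, n):
    """Estimated reliability when a test is lengthened by a factor of n."""
    return (n * r_xx) / (1 + (n - 1) * r_xx)

print(spearman_brown(0.84, 1.5))   # 100 -> 150 items: about .89
print(spearman_brown(0.84, 0.5))   # 100 -> 50 items: about .72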
Test Length
This is most likely to be the case when the added items
do not measure the same content domain as the original
items and/or are more susceptible to the effects of
measurement error.

Note that, when used to correct the split-half reliability


coefficient, the situation is more complex, and this
generalization does not always apply: When the two
halves are not equivalent in terms of their means and
standard deviations, the Spearman-Brown formula may
either over- or underestimate the test's actual reliability.
2. Range of Test Scores:
Since the reliability coefficient is a correlation
coefficient, it is maximized when the range of
scores is unrestricted.

The range is directly affected by the degree of


similarity of examinees with regard to the
attribute measured by the test.
2. Range of Test Scores (cont.):
When examinees are heterogeneous, the range of scores is
maximized.

The range is also affected by the difficulty level of the test


items.

When all items are either very difficult or very easy, all
examinees will obtain either low or high scores, resulting in a
restricted range.

Therefore, the best strategy is to choose items so that the


average difficulty level is in the mid-range (p = .50).
3. Guessing
A test's reliability coefficient is also affected by the
probability that examinees can guess the correct answers to
test items.

As the probability of correctly guessing answers increases,


the reliability coefficient decreases.

All other things being equal, a true/false test will have a


lower reliability coefficient than a four-alternative multiple-
choice test, which, in turn, will have a lower reliability
coefficient than a free-recall test.
Interpretation of Reliability
The interpretation of a test's reliability entails
considering its effects on the scores achieved
by a group of examinees as well as the score
obtained by a single examinee.
 
Interpretation of Reliability
The Reliability Coefficient: As discussed previously, a
reliability coefficient is interpreted directly as the proportion
of variability in a set of test scores that is attributable to true
score variability.

A reliability coefficient of .84 indicates that 84% of variability


in test scores is due to true score differences among
examinees, while the remaining 16% is due to measurement
error.

While different types of tests can be expected to have


different levels of reliability, for most tests in the social
sciences, reliability coefficients of .80 or larger are
considered acceptable.
Interpretation of Reliability
When interpreting a reliability coefficient, it is
important to keep in mind that there is no
single index of reliability for a given test.

Instead, a test's reliability coefficient can vary


from situation to situation and sample to
sample. Ability tests, for example, typically have
different reliability coefficients for groups of
individuals of different ages or ability levels.
Interpretation of Standard Error of Measurement

While the reliability coefficient is useful for


estimating the proportion of true score variability in
a set of test scores, it is not particularly helpful for
interpreting an individual examinee's obtained test
score.

When an examinee receives a score of 80 on a 100-


item test that has a reliability coefficient of .84, for
instance, we can only conclude that, since the test is
not perfectly reliable, the examinee's obtained score
might or might not be his or her true score.
Interpretation of Standard Error of Measurement

A common practice when interpreting an examinee’s


obtained score is to construct a confidence interval around
that score.

The confidence interval helps a test user estimate the range


within which an examinee's true score is likely to fall given his
or her obtained score.

This range is calculated using the standard error of


measurement, which is an index of the amount of error that
can be expected in obtained scores due to the unreliability of
the test. (When raw scores have been converted to percentile
ranks, the confidence interval is referred to as a percentile
band.)
Interpretation of Standard Error of Measurement

The following formula is used to estimate the standard error of


measurement:
 
Formula 1: Standard Error of Measurement

SEmeas = SDx × √(1 – rxx)

Where:
SEmeas = standard error of measurement
SDx = standard deviation of test scores
rxx = reliability coefficient
Interpretation of Standard Error of Measurement

As shown by the formula, the magnitude of the


standard error is affected by two factors:

– the standard deviation of the test scores (SDx), and

– the test's reliability coefficient (rxx).

The lower the test's standard deviation and the


higher its reliability coefficient, the smaller the
standard error of measurement (and vice versa).
Interpretation of Standard Error of Measurement

Because the standard error is a type of standard


deviation, it can be interpreted in terms of the areas
under the normal curve.

With regard to confidence intervals, this means that a


68% confidence interval is constructed by adding and
subtracting one standard error to an examinee's
obtained score; a 95% confidence interval is
constructed by adding and subtracting two standard
errors; and a 99% confidence interval is constructed
by adding and subtracting three standard errors.
Interpretation of Standard Error of Measurement

Example: A psychologist administers an interpersonal assertiveness test to


a sales applicant who receives a score of 80. Since the test's reliability is
less than 1.0, the psychologist knows that this score might be an
imprecise estimate of the applicant's true score and decides to use the
standard error of measurement to construct a 95% confidence interval.
Assuming that the test’s reliability coefficient is .84 and its standard
deviation is 10, the standard error of measurement is equal to 4.0:
 
SEmeas = SDx × √(1 – rxx) = 10 × √(1 – .84) = 10 × .4 = 4.0

The psychologist constructs the 95% confidence interval by adding and
subtracting two standard errors from the applicant's obtained score: 80 ±
2(4.0) = 72 to 88. This means that there is a 95% chance that the
applicant's true score falls somewhere between 72 and 88.
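The same calculation can be written as a short Python sketch (the values come from the example above; z = 2 is the usual rough approximation for a 95% interval):

import math

def sem(sd, r_xx):
    """Standard error of measurement: SD multiplied by the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - r_xx)

def confidence_interval(score, sd, r_xx, z=2.0):
    """Approximate confidence interval around an obtained score."""
    error = z * sem(sd, r_xx)
    return score - error, score + error

print(sem(10, 0.84))                      # 4.0
print(confidence_interval(80, 10, 0.84))  # (72.0, 88.0)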
Interpretation of Standard Error of Measurement

One problem with the standard error is that


measurement error is not usually equally distributed
throughout the range of test scores.

Use of the same standard error to construct confidence


intervals for all scores in a distribution can, therefore,
be somewhat misleading.

To overcome this problem, some test manuals report


different standard errors for different score intervals.
Estimating True Scores from Obtained Scores

As discussed earlier, because of the effects of


measurement error, obtained test scores tend
to be biased (inaccurate) estimates of true
scores.

More specifically, scores above the mean of a


distribution tend to overestimate true scores,
while scores below the mean tend to
underestimate true scores.
Estimating True Scores from Obtained Scores

Moreover, the farther from the mean an


obtained score is, the greater this bias.

Rather than constructing a confidence interval,


an alternative (but less used) method for
interpreting an examinee's obtained test score is
to estimate his/her true score using a formula
that takes into account this bias by adjusting the
obtained score using the mean of the
distribution and the test's reliability coefficient.
Estimating True Scores from Obtained Scores

For example, if an examinee obtains a score of 80 on a test


that has a mean of 70 and a reliability coefficient of .84, the
formula predicts that the examinee's true score is 78.4:

T′ = (1 – rxx)M + rxx X   (where M is the mean of the distribution and X is the obtained score)

T′ = (1 – .84) × 70 + .84 × 80
   = .16 × 70 + .84 × 80
   = 11.2 + 67.2 = 78.4
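Expressed as a short Python sketch (the same numbers as in the example above):

def estimated_true_score(obtained, mean, r_xx):
    """Regress the obtained score toward the group mean in proportion to the test's unreliability."""
    return r_xx * obtained + (1 - r_xx) * mean

print(estimated_true_score(80, 70, 0.84))   # 78.4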
The Reliability of Difference Scores
• A test user is sometimes interested in
comparing the performance of an examinee
on two different tests or subtests and,
therefore, computes a difference score. An
educational psychologist, for instance, might
calculate the difference between a child's
WISC-IV Verbal and Performance scores.
The Reliability of Difference Scores
When doing so, it is important to keep in mind that the
reliability coefficient for the difference scores can be no larger
than the average of the reliabilities of the two tests or subtests:

If Test A has a reliability coefficient of .95 and Test B has a


reliability coefficient of .85, this means that difference scores
calculated from the two tests will have a reliability coefficient of
.90 or less.

The exact size of the reliability coefficient for difference scores


depends on the degree of correlation between the two tests:
The more highly correlated the tests, the smaller the reliability
coefficient (and the larger the standard error of measurement).
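One common classical-test-theory expression of this relationship, assuming the two sets of scores are standardized to equal variances, is sketched below; the reliabilities are those of Test A and Test B from the example above, and the two inter-test correlations are made up for illustration:

def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of difference scores for two standardized tests X and Y."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

print(difference_score_reliability(0.95, 0.85, 0.50))  # 0.80
print(difference_score_reliability(0.95, 0.85, 0.80))  # 0.50 -- higher correlation, lower reliability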
Factors to Improve Reliability
1. Homogeneity of the test items,
2. Heterogeneity of the test takers,
3. Lengthening the test,
4. Moderate difficulty level of the test items,
5. Acceptable discrimination index.
Validity
Validity refers to a test's accuracy. A test is valid when it measures what it
is intended to measure. The intended uses for most tests fall into one of
three categories, and each category is associated with a different method
for establishing validity:
 
– The test is used to obtain information about an examinee's familiarity
with a particular content or behavior domain: content validity.
 
– The test is administered to determine the extent to which an
examinee possesses a particular hypothetical trait: construct validity.
 
– The test is used to estimate or predict an examinee's standing or
performance on an external criterion: criterion-related validity.
Validity
For some tests, it is necessary to demonstrate only one type
of validity; for others, it is desirable to establish more than
one type.

For example, if an arithmetic achievement test will be used to


assess the classroom learning of 8th grade students,
establishing the test's content validity would be sufficient. If
the same test will be used to predict the performance of 8th
grade students in an advanced high school math class, the
test's content and criterion-related validity will both be of
concern.
 
Note that, even when a test is found valid for a particular
purpose, it might not be valid for that purpose for all people.
It is quite possible for a test to be a valid measure of
intelligence or a valid predictor of job performance for one
group of people but not for another group.
Types of Validity
1. Face Validity
2. Content Validity
3. Construct Validity
i. Convergent Validity
ii. Discriminant Validity
4. Criterion-Related Validity
i. Predictive Validity
ii. Concurrent Validity
Face Validity
• A test can be said to have face validity if it "looks like"
it is going to (appears to) measure what it is
supposed to measure.

• For instance, if you prepare a test to measure


whether students can perform multiplication, and
the people you show it to all agree that it looks like a
good test of multiplication ability, you have shown
the face validity of your test.
Content Validity
A test has content validity to the extent that it adequately
samples the content or behavior domain that it is designed to
measure.

If test items are not a good sample, results of testing will be


misleading.

Although content validation is sometimes used to establish the


validity of personality, aptitude, and attitude tests, it is most
associated with achievement-type tests that measure
knowledge of one or more content domains and with tests
designed to assess a well-defined behavior domain.

Adequate content validity would be important for a statistics


test and for a work (job) sample test.
Content Validity
Content validity is usually "built into" a test as it is
constructed through a systematic, logical, and
qualitative process that involves clearly identifying the
content or behavior domain to be sampled and then
writing or selecting items that represent that domain.

Once a test has been developed, the establishment of


content validity relies primarily on the judgment of
subject matter experts.

If experts agree that test items are an adequate and


representative sample of the target domain, then the
test is said to have content validity.
Content Validity
Although content validation depends mainly on the
judgment of experts, supplemental quantitative
evidence can be obtained.

If a test has adequate content validity:


– a coefficient of internal consistency will be large;
– the test will correlate highly with other tests that purport
to measure the same domain; and
– pre-/post-test evaluations of a program designed to
increase familiarity with the domain will indicate
appropriate changes.
Content Validity
Don’t confuse Content validity with Face validity.

Content validity refers to the systematic evaluation of a test


by experts who determine whether or not test items
adequately sample the relevant domain, while face validity
refers simply to whether or not a test "looks like" it
measures what it is intended to measure.

Although face validity is not an actual type of validity, it is a


desirable feature for many tests. If a test lacks face validity,
examinees may not be motivated to respond to items in an
honest or accurate manner. A high degree of face validity
does not, however, indicate that a test has content validity.
Construct Validity
  When a test has been found to measure the hypothetical trait
(construct) it is intended to measure, the test is said to have
construct validity. A construct is an abstract characteristic
that cannot be observed directly but must be inferred by
observing its effects. Intelligence, mechanical aptitude, self-
esteem, and neuroticism are all constructs.
 
There is no single way to establish a test's construct validity.
Instead, construct validation entails a systematic
accumulation of evidence showing that the test actually
measures the construct it was designed to measure. The
various methods used to establish this type of validity each
answer a slightly different question about the construct and
include the following:
Construct Validity
– Assessing the test's internal consistency: Do scores on
individual test items correlate highly with the total test
score; i.e., are all of the test items measuring the same
construct?
 
– Studying group differences: Do scores on the test
accurately distinguish between people who are known to
have different levels of the construct?
 
– Conducting research to test hypotheses about the
construct: Do test scores change, following an
experimental manipulation, in the direction predicted by
the theory underlying the construct?
Construct Validity
– Assessing the test's convergent and discriminant
validity: Does the test have high correlations with
measures of the same trait (convergent validity)
and low correlations with measures of unrelated
traits (discriminant validity)?
 
– Assessing the test's factorial validity: Does the test
have the factorial composition it would be
expected to have; i.e., does it have factorial
validity?
Construct Validity
Construct validity is said to be the most
theory-laden of the methods of test
validation.

The developer of a test designed to measure a


construct begins with a theory about the
nature of the construct, which then guides the
test developer in selecting test items and in
choosing the methods for establishing the
test's validity.
Construct Validity
• For example, if the developer of a creativity test believes that
creativity is unrelated to general intelligence, that creativity is
an innate characteristic that cannot be learned, and that
creative people can be expected to generate more alternative
solutions to certain types of problems than non-creative
people, she would want to determine the correlation
between scores on the creativity test and a measure of
intelligence, to see if a course in creativity affects test scores,
and find out if test scores distinguish between people who
differ in the number of solutions they generate to relevant
problems.
Construct Validity
Note that some experts consider construct
validity to be the most basic form of validity
because the techniques involved in establishing
construct validity overlap those used to
determine if a test has content or criterion-
related validity.

Indeed, Cronbach argues that "all validation is


one, and in a sense all is construct validation."
Construct Validity
Convergent and Discriminant Validity:

As mentioned earlier one way to assess a test's construct


validity is to correlate test scores with scores on measures
that do and do not purport to assess the same trait.

High correlations with measures of the same trait provide


evidence of the test's convergent validity, while low
correlations with measures of unrelated characteristics
provide evidence of the test's discriminant (divergent)
validity.
Construct Validity
1. Convergent Validity
2. Discriminant Validity
Convergent Validity
• Convergent validity refers to the degree to which
two measures of constructs that theoretically should
be related, are in fact related.

• For example,  if a measure of general happiness has


convergent validity then constructs similar to
happiness (satisfaction, contentment, cheerfulness,
etc.) should relate closely to the measure of general
happiness.
Discriminant Validity
• Discriminant validity tests whether concepts or
measurements that are supposed to be unrelated
are, in fact, unrelated.

• For example, if a measure of general happiness has


discriminant validity, then constructs that are not
supposed to be related to general happiness
(sadness, depression, despair, etc.) should not relate
to the measure of general happiness.
Construct Validity
The multitrait-multimethod matrix (Campbell & Fiske, 1959)
is used to systematically organize the data collected when
assessing a test's convergent and discriminant validity.

The multitrait-multimethod matrix is a table of correlation


coefficients, and, as its name suggests, it provides information
about the degree of association between two or more traits
that have each been assessed using two or more methods.

When the correlations between different methods measuring


the same trait are larger than the correlations between the
same and different methods measuring different traits, the
matrix provides evidence of the test's convergent and
discriminant validity.
Multitrait-multimethod matrix
Example: To assess the construct validity of the interpersonal
assertiveness test, a psychologist administers four measures to a
group of salespeople: (1) the test of interpersonal assertiveness;
(2) a supervisor's rating of interpersonal assertiveness; (3) a test of
aggressiveness; and (4) a supervisor's rating of aggressiveness.

The psychologist has the minimum data needed to construct a


multitrait-multimethod matrix: She has measured two traits that she
believes are unrelated (assertiveness and aggressiveness), and each
trait has been measured by two different methods (a test and a
supervisor's rating). The psychologist calculates correlation
coefficients for all possible pairs of scores on the four measures and
constructs the following multitrait-multimethod matrix (the upper
half of the table has not been filled in because it would simply
duplicate the correlations in the lower half):
Multitrait-multimethod matrix
Multitrait-multimethod matrix
All multitrait-multimethod matrices contain four types
of correlation coefficients:
 
– Monotrait-monomethod coefficients ("same trait-same
method")
– Monotrait-heteromethod coefficients ("same trait-
different methods")
– Heterotrait-monomethod coefficients ("different traits-
same method")
– Heterotrait-heteromethod coefficients ("different traits-
different methods“)
Monotrait-monomethod coefficients
(or the "same trait-same method")

The monotrait-monomethod coefficients (coefficients


in parentheses in the previous matrix) are reliability
coefficients:

They indicate the correlation between a measure and


itself.

Although these coefficients are not directly relevant to


a test's convergent and discriminant validity, they
should be large in order for the matrix to provide
useful information.
Monotrait-heteromethod coefficients
(or "same trait-different methods"):

These coefficients (coefficients in rectangles)


indicate the correlation between different
measures of the same trait.

When these coefficients are large, they


provide evidence of convergent validity.
Heterotrait - monomethod coefficients
(or "different traits-same method"):

These coefficients (coefficients in ellipses) show


the correlation between different traits that
have been measured by the same method.

When the heterotrait - monomethod


coefficients are small, this indicates that a test
has discriminant validity.
Heterotrait-heteromethod coefficients
(or "different traits-different methods"):

The heterotrait-heteromethod coefficients


(underlined coefficients) indicate the
correlation between different traits that have
been measured by different methods.

These coefficients also provide evidence of


discriminant validity when they are small.
Multitrait-multimethod matrix
Also, the number of correlation coefficients
that can provide evidence of convergent and
discriminant validity depends on the number
of measures included in the matrix.

In the example, only four measures were


included (the minimum number), but there
could certainly have been more.
Multitrait-multimethod matrix
Note that, in a multitrait-multimethod matrix, only
those correlation coefficients that include the test
that is being validated are actually of interest.

In our example matrix, the correlation between the


rating of interpersonal assertiveness and the rating of
aggressiveness (r = .16) is a heterotrait-monomethod
coefficient, but it isn't of interest because it doesn't
provide information about the interpersonal
assertiveness test.
Multitrait-multimethod matrix
Example: Three of the correlations in our multitrait-
multimethod matrix are relevant to the construct
validity of the interpersonal assertiveness test.

The correlation between the assertiveness test and


the assertiveness rating (monotrait-heteromethod
coefficient) is .71. Since this is a relatively high
correlation, it suggests that the test has convergent
validity.
Multitrait-multimethod matrix
The correlation between the assertiveness test and the
aggressiveness test (heterotrait-monomethod coefficient)
is .13 and the correlation between the assertiveness test and
the aggressiveness rating (heterotrait-heteromethod
coefficient) is .04.

Because these two correlations are low, they confirm that the
assertiveness test has discriminant validity. This pattern of
correlation coefficients confirms that the assertiveness test
has construct validity.

Note that the monotrait-monomethod coefficient for the


assertiveness test is .93, which indicates that the test also has
adequate reliability. (The other correlations in the matrix are
not relevant to the psychologist's validation study because
they do not include the assertiveness test.)
Multitrait-multimethod matrix
                           A1             B1              A2             B2
                           Assertiveness  Aggressiveness  Assertiveness  Aggressiveness
                           Test           Test            Rating         Rating

A1 Assertiveness Test      (.93)
B1 Aggressiveness Test      .13           (.91)
A2 Assertiveness Rating     .71            .09            (.86)
B2 Aggressiveness Rating    .04            .68             .16           (.89)

(Reliability coefficients appear in parentheses on the diagonal.)
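For readers who prefer to work with the numbers directly, the sketch below simply stores the lower triangle shown above in a Python array and pulls out the coefficients that matter for validating the assertiveness test (the indices refer to this particular matrix, not to any standard API):

import numpy as np

# Rows/columns: A1 Assertiveness Test, B1 Aggressiveness Test,
#               A2 Assertiveness Rating, B2 Aggressiveness Rating.
mtmm = np.array([
    [0.93, np.nan, np.nan, np.nan],
    [0.13, 0.91,  np.nan, np.nan],
    [0.71, 0.09,  0.86,   np.nan],
    [0.04, 0.68,  0.16,   0.89],
])

print("Reliability (monotrait-monomethod):           ", mtmm[0, 0])  # .93
print("Convergent validity (monotrait-heteromethod): ", mtmm[2, 0])  # .71
print("Discriminant (heterotrait-monomethod):        ", mtmm[1, 0])  # .13
print("Discriminant (heterotrait-heteromethod):      ", mtmm[3, 0])  # .04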


Construct Validity
Factor Analysis: Factor analysis is used for several
reasons including identifying the minimum number
of common factors required to account for the
intercorrelations among a set of tests or test items,
evaluating a test’s internal consistency, and
assessing a test’s construct validity.

When factor analysis is used for the latter purpose, a


test is considered to have construct (factorial)
validity when it correlates highly only with the
factor(s) that it would be expected to correlate
with.
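As a rough illustration of factorial validity, the sketch below simulates six subtest scores driven by two latent factors and checks that each subtest loads mainly on the factor it is supposed to measure (the data are simulated; scikit-learn's FactorAnalysis is used only as a convenient estimator):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 500
verbal = rng.normal(size=n)      # latent factor 1
spatial = rng.normal(size=n)     # latent factor 2

# Six hypothetical subtests: the first three should load on "verbal", the last three on "spatial".
scores = np.column_stack([verbal + 0.5 * rng.normal(size=n) for _ in range(3)] +
                         [spatial + 0.5 * rng.normal(size=n) for _ in range(3)])

fa = FactorAnalysis(n_components=2, random_state=0).fit(scores)
print(np.round(fa.components_, 2))   # loadings: each subtest should load mainly on one factor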
Criterion Related Validity
• This type of validity is demonstrated when a test is shown
to be effective in estimating an examinee’s performance
on some outcome measure.
• In this context, the variable of primary interest is the
outcome measure, called a criterion.
• For instance, a college entrance examination that is
reasonably accurate in predicting the subsequent grade
point average of examinees would possess criterion-
related validity, e.g., the Scholastic Aptitude Test (SAT).
• It is further divided into two types:
• Predictive validity.
• Concurrent validity
Criterion Related Validity
1. Concurrent Validity
2. Predictive Validity
Concurrent Validity
• Concurrent validity is demonstrated when a test
correlates well with a previously validated measure
administered at about the same time.

• The two measures may be for the same construct, or


for different, but presumably related, constructs.

• For example, a newly constructed adult intelligence


test is validated concurrently against the WAIS.
Predictive Validity
• Predictive validity is the extent to which a score on
a scale or test predicts scores on some criterion
measure obtained at a later date.

• For example, the validity of a cognitive test for job


performance is the correlation between test scores
and, for example, supervisor performance ratings.
Such a cognitive test would have predictive validity if
the observed correlation were statistically significant.
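A minimal Python sketch of checking a validity coefficient and its statistical significance, using hypothetical test scores collected at hiring and supervisor ratings collected later:

from scipy.stats import pearsonr

test_scores = [55, 62, 48, 70, 66, 59, 73, 50, 68, 61]            # cognitive test scores at hiring
ratings = [3.1, 3.8, 2.9, 4.5, 4.0, 3.5, 4.7, 3.0, 4.2, 3.6]      # supervisor ratings a year later

r, p_value = pearsonr(test_scores, ratings)
print(f"Validity coefficient r = {r:.2f}, p = {p_value:.4f}")
# Evidence of predictive validity if r is meaningfully large and statistically significant.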
Standardization
• Tests can be either norm-referenced or criterion-referenced.

• Norm-referenced tests are based on norms established for the


test, and norms are simply a comparative frame of reference.

• Criterion-referenced tests are based not on norms but on a


criterion, which focuses only on what the test taker can do
rather than comparing with others' performance levels.

• The DAT is a norm-referenced test, while a driving test is a


criterion-referenced test.
Norms
• Norm is a comparative frame of reference in a standardized norm-
referenced test.

• For a test result to be meaningful, examiners must be able to


convert the initial score to some form of derived score upon
comparison to a standardization or group norm.

• To be effective, norms must be obtained with great care and


constructed according to well-known precepts.

• Furthermore, norms may become outmoded in just a few years, so


periodic renorming of tests should be the rule, not the exception.
Age Norms
• An age norm depicts the level of test performance for
each separate age group in a normative sample.

• With age norms, the performance of an examinee is


interpreted in relation to standardization subjects of
the same age.

• The age span for a normative age group can vary from a
month to a decade or more, depending on the degree to
which test performance is age dependent.
Grade Norms
• A Grade Norm depicts the level of test
performance for each separate grade in the
normative sample.
• These norms are particularly useful in school
settings when reporting the achievement levels
of schoolchildren.
• Grade-based comparison is preferred to age-
based comparison where content areas are
heavily dependent upon grade rather than age.
Local and Sub-Group Norms
• Local and subgroup norms are needed to suit the
specific purpose of a test.
• Local norms are derived from representative local
examinees, as opposed to a national sample.
• Subgroup norms consist of the scores obtained from an
identified subgroup (e.g., African Americans, females).
• As a general rule, whenever an identifiable subgroup
performs appreciably better or worse on a test than the
more broadly defined standardization sample, it may be
helpful to construct supplementary subgroup norms.
Issues in Psychological Testing
Social Issues
Professional Issues
Moral Issues
Social Issues
1. Dehumanization.
2. Usefulness of tests.
3. Access to psychological testing services.
Dehumanization
In the process of test administration and scoring, psychologists
can dehumanize participants, just as they can during test construction.

Dehumanization also exists while establishing test reliability.

Indirect, forceful obligations are made of participants.


Feelings of anxiety interfere with examinees' performance during


assessment.
Usefulness of Test
A key question is whether test results are actually used in an important
manner: when we finish administering a test we reach a conclusion,
but often nothing further is done with it.

The assumptions underlying today's tests may be fundamentally


incorrect, and the resulting test instruments are far from perfect.

Tests are useful only as long as they provide information leading to


better predictions and understanding.
Access to Psychological Testing Services

Testing is expensive.

The expensive test batteries for neurological


and psychiatric assessment are available to
those who can afford them and to those who
have enough insurance.
Professional Issues
Theoretical concerns

Clinical prediction vs. actuarial prediction

The adequacy of tests

Theoretical concerns
Dependability of test reliability.
Current practices: reliability vs. validity.
Prioritizing the purpose of the test.
Theories as the basis of any psychological test.
Clinical Prediction vs. Actuarial Prediction

Accuracy of predictions made: a set of rules vs.


the trained professional.
Computerized test interpretation
Reporting of results
Diagnosis of clients.
Appropriateness of the analysis.
The Adequacy of tests
• Psychometric qualities to warrant their use
Moral Issues
Human Rights
 Labeling
 Invasion of privacy
 Divided loyalties
Human Rights
Test interpreters have an ethical obligation to protect
human rights.

Test security should be protected but not at the


expense of an individual's right to know about the basis
of any decisions that affect their lives.

The increasing awareness among test users of their


rights, and their willingness to demand them, is an important
influence on testing.
Labelling
Invasion of Privacy
Occurs when information about the test taker
is used inappropriately.

Psychologist are ethically and legally bound to


maintain confidentiality.

Subjects must cooperate in order to be tested.


Divided Loyalty
Absence of a coherent set of ethical principles governing
all legitimate uses of testing.

Divided loyalties form the core of this problem.

The conflicting commitments of the psychologist who uses


tests.

Dual role of the psychologist: Maintaining the test security


as well as protecting his client’s right.
Present Trends of Psychological Testing

The present trends can be classified as follows:


A. Computer and internet application.
B. Proliferation of new tests.
C. Higher standards, improved technology, and
increased objectivity.
D. Greater public awareness and influence.
E. Controversies will continue
A. Computerization of Psychological Testing

 Nowadays, computers are being applied to testing on a


rapid and widespread basis.

 In adaptive computerized testing, different sets of test


questions are administered via computer to different
individuals, depending on each individual's status on
the trait being measured.

 Computers are also being used to administer, score,


and even interpret psychological tests.
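The adaptive idea can be caricatured in a few lines of Python. Real computerized adaptive tests select items with item response theory; the simple up/down rule and the item names below are made up purely for illustration:

def run_adaptive_test(item_bank, answer_item, length=5):
    """item_bank is ordered easiest -> hardest; answer_item(item) returns True if answered correctly."""
    level = len(item_bank) // 2                  # start in the middle of the difficulty range
    administered = []
    for _ in range(length):
        item = item_bank[level]
        administered.append(item)
        step = 1 if answer_item(item) else -1    # harder after a correct answer, easier after an error
        level = min(max(level + step, 0), len(item_bank) - 1)
    return administered

# Toy example: an examinee who can answer items up to difficulty 6.
bank = [f"item_{d}" for d in range(11)]          # difficulty 0 (easiest) to 10 (hardest)
print(run_adaptive_test(bank, lambda item: int(item.split("_")[1]) <= 6))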
A. Computerization of Psychological Testing (cont.)

Evolution of technology for the transformation


of paper-and-pencil (p & p) responses into test scores
These are the implementation of:
 Punch cards
 OMR (Optical Mark Reader)
 OCR (Optical Character Recognition)
 E-marking
A. Computerization of Psychological Testing
(cont.)
1. Punch cards were cards in which a pattern of holes was cut to
represent information, which was then read by a computer. This
method is not used any more.

2. Using the Optical Mark Reader (OMR), test takers answer the items by
filling in a so-called lozenge (a small area) on a form. The mark reader is
a machine that recognizes whether the lozenge has been filled in or not.
Then, this information can be transformed into test scores.

3. OCR (Optical Character Recognition) was an improvement on the OMR


system and produced more reliable test scores. OMR was created to
recognize if certain blocks on the form were filled in or not, while OCR
was developed to be able to recognize written characters and signs.
A. Computerization of Psychological Testing
(cont.)
Currently, so called e-marking is on the market
and makes it possible to recognize and record
on screen hand written answers by
candidates.
A. Computerization of Psychological Testing
(cont.)
• In 1966, a Rogerian therapist named Eliza marked the
beginning of a new phase in psychological testing and
assessment.

• With a great amount of empathy, Eliza encouraged her


clients to talk about their experiences and how these
experiences made them feel.

• Clients responded warmly and enjoyed the sense of empathy


resulting from their interaction and this came as a big
surprise to researcher Dr. Joseph Weizenbaum.
A. Computerization of Psychological Testing
(cont.)
• Eliza was his creation, a computer program developed to
emulate (imitate) the behavior of a psychotherapist.

• Weizenbaum had produced the program in an attempt to


show that human-computer interaction was superficial and
ineffective for therapy.

• He discovered that sessions with Eliza engendered (gave rise


to) positive emotions in the clients, who had actually enjoyed
the interaction and attributed human characteristics to
the program.
A. Computerization of Psychological Testing
(cont.)
• Weizenbaum's research lent support to the theory that
human-computer interaction may be
beneficial and opened the door for further
study.
B. Testing on Internet
• According to Crespin and Austin (2002), one of the most important
future applications of psychological testing is to its use on the
internet.

• The advent of computers and internet has revolutionized testing and


played a major role in the proliferation of modern techniques.

• Nowadays, any interested person can easily find free psychological


tests on the World Wide Web.
For example, Brain.com offers seven types of free IQ tests,
and another site offers 200 free tests about relationships, health, career,
IQ, and personality.
B. Testing on Internet (cont.)
• In 1966, Weizenbaum published a comparatively simple program called ELIZA, named after
Eliza Doolittle in George Bernard Shaw's Pygmalion, which
performed natural language processing. Driven by a script named DOCTOR, it was
capable of engaging humans in a conversation which bore a striking resemblance
to one with an empathic psychologist.

• Weizenbaum modeled its conversational style after Carl Rogers, who introduced


the use of open-ended questions to encourage patients to communicate more
effectively with therapists.

• The program applied pattern matching rules to statements to figure out its
replies. It is considered the forerunner of thinking machines. Weizenbaum was
shocked that his program was taken seriously by many users, who would open
their hearts to it. He started to think philosophically about the implications
of artificial intelligence and later became one of its leading critics.
C. Proliferation of New Tests
• New tests keep coming out all the time, with no
end in sight. There are many revised and
updated tests, and hundreds of new tests are
being published each year.
• The impetus for developing new tests comes
from:
1. Professional disagreement over the best
strategies for measuring human characteristics,
over the nature of these characteristics, and over
theories about the causes of human behavior.
C. Proliferation of New Tests
(cont.)
2. Public and professional pressure to use only fair,
accurate, and unbiased instruments.
3. Finally, if tests are used, then authors and publishers
of tests stand to profit financially. As long as
someone can make a profit publishing tests, then
new tests will be developed and marketed.

• Majority of new tests are based on the same


principles and underlying theories as the more
established tests. Indeed, most newly developed
tests are justified on the grounds that they are
psychometrically superior to the existing tests or
more specific and thus more appropriate for
particular problems.
C. Proliferation of New Tests
(cont.)
• Some of the new tests are based on models, theories, and
concepts that fundamentally differ from those that underlie
traditional tests. These nontraditional tests stem from
modern concepts and theories from learning, social,
physiological, and experimental psychology. Most of these
newer tests are rooted in empirically derived data.
• The proliferation of nontraditional tests is related to two
other trends in testing.
1. It reflects the role of the science of psychology in testing.
2. Efforts are being made to integrate tests with other aspects
of applied psychology.
C. Proliferation of New Tests
(cont.)
• Many psychologists, especially the behaviorally oriented,
have long regretted the poor relationship among
clinical assessment, traditional tests, and subsequent
treatment interventions. They prefer test results that
not only have direct relationship to treatment but
also can be used to assess the effectiveness of
treatment.
D. Higher Standards, Improved Technology,
and Increasing objectivity (cont.)
• Greater public awareness of the nature and use of
psychological tests has led to increasing external
influence on testing.

• Public awareness has led to an increased demand


for psychological services, including testing.

• The benefit of increased public awareness of tests has


been the extra focus on safeguarding human rights.
D. Higher Standards, Improved Technology,
and Increasing objectivity
• Part of the trend toward higher standards in psychological
testing is increasing objectivity in test
interpretation.

• The minimum acceptable standards for tests are


becoming higher.

• Higher standards of test construction have


encouraged better use of tests.
E. Greater Public Awareness and Influence

Greater public awareness of the nature and use of psychological tests has led to increasing external
influence on testing. At one time, the public knew little about psychological tests; psychologists played an
almost exclusive role in governing how tests were used. Public awareness has led to an increased demand
for psychological services, including testing services. This demand is balanced by the tendency toward
restrictive legislative and judicial regulations and policies such as the judicial decision that restricts the use
of standard intelligence tests in diagnosing mental retardation. These restrictions originate in real and
imagined public fears. In short, the public seems to be ambivalent about psychological testing,
simultaneously desiring the benefits yet fearing the power they attribute to tests.
As more individuals share the responsibility of encouraging the proper use of tests by becoming informed
of their rights and insisting on receiving them, the probability of misuse and abuse of tests will be reduced.
The commitment of the field of psychology to high ethical standards can be easily seen in the published
guidelines, position papers, and debates that have evolved during the relatively short period beginning in
1947 with the development of formal standards for training in clinical psychology (Shakow, Hilgard,
Kelly, Sanford, & Shaffer, 1947).
Practitioners of psychology, their instructors, and their supervisors show a deep concern for social values
and the dignity of the individual human being. However, the pressure of public interest in psychological
tests has led practitioners to an even greater awareness about safeguarding the rights and dignity of the
individual.
Interrelated with all of these issues is the trend toward greater protection for the public. Nearly every state
has laws that govern the use of psychological tests. Several factors give the public significant protection
against the inherent risks of testing: limiting testing to reduce the chance that unqualified people will use
psychological tests, sensitivity among practitioners to the rights of the individual, relevant court decisions,
and a clearly articulated set of ethical guidelines and published standards for proper test use.
Future Trends
A. Future Prospects for Testing Are Promising
B. The Proliferation of New and Improved Tests Will
Continue
C. Controversy, Disagreement, and Change Will
Continue
D. The Integration of Cognitive Science and Computer
Science Will Lead to Several Innovations in Testing
A. Future Prospects for Testing Are Promising
It is believed that testing has a promising future and that optimism on the integral role that testing has
played in the development and recognition of psychology. Testing is a multibillion-dollar industry, and
even relatively small testing companies can gross millions of dollars per year. With so much at stake,
testing is probably here to stay.
Despite division within psychology concerning the role and value of testing, it remains one of the few
unique functions of the professional psychologist. When one considers that psychological testing encompasses not only
traditional but also new and innovative uses, as in cognitive-behavioral assessment, psychophysiology,
evaluation research, organizational assessment, community assessment, and investigations into the
nature of human functioning, one can understand just how important tests are to psychologists. With
this fundamental tie to testing, psychologists remain the undisputed leaders in the field.
It is unlikely that attacks on and dissatisfaction with traditional psychological tests will suddenly
compel psychologists to abandon tests. Instead, psychologists will likely continue to take the lead in
this field to produce better and better tests, and such a direction will benefit psychologists, the field,
and society.
Testing corporations that publish and sell widely used high-stakes standardized tests will no doubt
continue to market their products aggressively. Moreover, tests are used in most institutions (schools,
colleges, hospitals, industry, business, the government, and so forth), and new applications and creative
uses continue to emerge in response to their demands. Tests will not suddenly disappear with nothing to
replace them. Current tests will continue to be used until they are replaced by still better tests, which of
course may be based on totally new ideas. Though current tests may gradually fade from the scene, we
believe psychological testing will not simply survive but will flourish through the 21st century.
B. The Proliferation of New and Improved
Tests Will Continue
The future may see the development of many more tests. For example, the Stanford-Binet and Wechsler tests,
two major intelligence scales, are probably about as technically adequate as they will ever be. They
can be improved through minor revisions to update test stimuli and to provide larger and even more
representative normative samples with special norms for particular groups and through additional
research to extend and support validity documentation.
In the coming days, major intelligence scales may or may not be challenged at least once or twice by
similar tests with superior standardization and normative data or with less bias against certain
minorities. A true challenge can come only from a test based on original concepts and a more
comprehensive theoretical rationale than that of the present scales.
A structured personality test like the MMPI-2 appears destined to be the premier test of the 21st century.
We had not anticipated the innovative approach of Butcher and colleagues in dealing with the
original MMPI’s inadequate normative sample. Thus, future prospects for the MMPI-2 are indeed
bright.
For certain tests like projective tests, the future may be uncertain. There is serious doubt whether
the Rorschach provides clinically useful information (Hunsley & Bailey, 1999). Some say the
Rorschach is no better than reading tea leaves. Although we would not go this far, it is clear that
proponents of the Rorschach are fighting an uphill battle (Exner, 1999; Weiner, 2003).
The future of the Thematic Apperception Test (TAT) is more difficult to predict. In a projective test,
outdated stimuli are not a devastating weakness because projective stimuli are by nature ambiguous.
Nevertheless, because the TAT stimuli have been revised (Ritzler, Sharkey, & Chudy, 1980) the TAT
may enjoy increased respectability as more data are acquired on the more recent versions.
C. Controversy, Disagreement, and Change
Will Continue
• It doesn’t matter whether the topic is testing or animal learning—disagreement and
controversy are second nature to psychologists. Disagreement brings with it new
data that may ultimately produce some clarification along with brand new
contradictions and battle lines. Psychologists will probably never agree that any
one test is perfect, and change will be a constant characteristic of the field. We
continue to be optimistic because we see the change as ultimately resulting in more
empirical data, better theories, continuing innovations and advances, and higher
standards.
D. The Integration of Cognitive Science and Computer
Science Will Lead to Several Innovations in Testing

Today, the integration of concepts from experimental cognitive psychology, computer


science, and psychometrics is rapidly shaping the field (Spear, 2007).
Computerized tests form the most recent cutting edge in the new generation of assessment
instruments. The test takers sit in front of a computer that presents realistically animated
situations with full color and sound. The program is both interactive and adaptive. The
computer screen freezes and asks the test taker to provide a response. If the response is
good, then a more difficult item is presented.
The test taker, who is applying for a manager's job, is given four choices for dealing with the
situation, which involves an offensive employee. If an effective choice is made, the computer moves on to an even more difficult
situation, such as a threat from the offensive employee.
The computer offers test developers unlimited scope in developing new technologies:
from interactive virtual reality games that measure and record minute responses
to social conflict within a digital world to virtual environments that are suitable for
measuring physiological responses while offering safe and effective systematic
desensitization experiences to individuals with phobias. As we noted at the outset, the
computer holds one of the major keys to the future of psychological testing.
