Test Development
Why Develop a New Test?

- meet the needs of a special group of test takers
- sample behaviours from a newly defined test domain
- improve the accuracy of test scores for their intended purpose
- tests need to be revised
First Four Steps

- Defining the test universe, audience, and purpose
- Developing a test plan
- Composing the test items
- Writing the administration instructions
Continued Steps of Test Construction

Diagram of Test Construction (p. 234)
- Constructing scales
- Piloting the test
- Standardizing the test
- Collecting norms
- Validation and reliability studies
- Manual writing
- Test revision
Defining the Test Universe, Audience, & Purpose

Defining the test universe.
- prepare a working definition of the construct
- locate studies that explain the construct
- locate current measures of the construct

Defining the target audience.
- make a list of characteristics of persons who will take the test, particularly those characteristics that will affect how test takers will respond to the test questions (e.g., reading level, disabilities, honesty, language)

Defining the purpose.
- includes not only what the test will measure, but also how scores will be used
- e.g., will scores be used to compare test takers (normative approach) or to indicate achievement (criterion approach)?
- e.g., will scores be used to test a theory or to provide information about an individual?
Developing a Test Plan

- A test plan includes a definition of the construct, the content to be measured (test domain), the format for the questions, and how the test will be administered and scored
Defining the Construct

- Define the construct after reviewing literature about the construct and any available measures
- Operationalize it in terms of observable and measurable behaviours
- The definition provides boundaries for the test domain (what should and shouldn't be included)
- Specify the approximate number of items needed
Choosing the Test Format

- Test format refers to the type of questions the test will contain (usually one format per test, for ease of test takers and scoring)
- Test formats have two elements:
  - stimulus (e.g., a question or phrase)
  - mechanism for response (e.g., multiple choice, true-false)
- The test format may be objective or subjective
Composing the Test Items

- test items are the stimuli presented to the test taker (they may or may not take the form of questions)
- the form chosen depends on decisions made in the test plan (e.g., purpose, audience, method of administration, scoring)
Test Types
Structured Response
- Multiple Choice
- True/False, Forced Choice
- Likert Scales

Free Response
- Essay, Short Answer
- Interview Questions
- Fill in the Blank
- Projective Techniques
Multiple Choice
- Multiple choice is the most common format in educational testing (and is also used in some personality and employment testing)
- consists of a stem and a number of responses; there should be only one right answer
- the wrong answers are called distractors because they may appear correct; they should be realistic enough to appeal to an uninformed test taker
- scoring is easy, but the downside is that test takers can get some items correct by guessing
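The guessing problem noted above can be put in numbers. The sketch below is only an illustration: it assumes blind guessing, one right answer per item, and equally attractive options, and the function name `chance_score` is made up for this example.

```python
def chance_score(num_items: int, num_options: int) -> float:
    """Expected number of items answered correctly by blind guessing.

    Assumes one correct answer per item and that every option is
    equally likely to be picked (an idealization).
    """
    if num_items < 0 or num_options < 1:
        raise ValueError("num_items must be >= 0 and num_options >= 1")
    return num_items / num_options


# A 40-item test with 2, 4, or 5 options per item.
for options in (2, 4, 5):
    print(options, chance_score(40, options))
# prints:
# 2 20.0
# 4 10.0
# 5 8.0
```

Under these assumptions, moving from two to five options cuts the expected chance score on a 40-item test from 20 to 8.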
Multiple Choice
Pros
- more answer options (4-5) reduce the chance of guessing the correct answer
- many items can aid in student comparison, reduce ambiguity, and increase reliability

Cons
- measures narrow facets of performance
- reading time increases with more answer options
- transparent clues (e.g., verb tenses or the use of "a" or "an") may encourage guessing
- difficult to write four or five reasonable choices
- takes more time to write questions

True/False
- True/False is also used in educational testing and some personality testing
- in educational testing the test taker can again gain some advantage by guessing
True/False (cont.)
- Ideally, a true/false question should be constructed so that an incorrect response indicates something about the student's misunderstanding of the learning objective.
- This may be a difficult task, especially when constructing a true statement.
Forced Choice Items

- Forced choice is similar to multiple choice but is used in personality and attitude tests (e.g., MBTI)
- the test taker must choose between unrelated but equally acceptable responses
Forced Choice Items (cont.)

Example: Place an X in the space to the left of the word in each pair that best describes your personality.
1. ____ Sunny ____ Friendly
2. ____ Outgoing ____ Loyal
Likert Scales
- Likert scales are usually reliable and highly popular (e.g., in personality and attitude tests)
- each item is presented with an array of response options (e.g., a 1 to 5 or 1 to 7 scale), usually on an agree/disagree or approve/disapprove continuum
Test Types
Structured Response
Advantages
- Great breadth
- Quick scoring
Disadvantages
- Limited depth
- Difficult to assess higher levels of skills
- Guessing/memorization vs. knowledge
Subjective Items
- subjective items are less easily scored but provide the test taker with fewer cues and open wider areas for response; they are often used in education
- essay questions: responses can vary in breadth and depth, and the scorer must determine to what extent the response is correct (often by examining the match with a predetermined correct response)
Essay Questions
- Provide a freedom of response that facilitates assessing higher cognitive behaviors (e.g., analysis and evaluation)
- Allow respondents to focus on what they have learned and do not limit them to specific questions
Interview Questions
- interview questions are often used in organizational settings; the interviewer decides what is a good or poor answer
- the test plan should be based on the knowledge, skills, abilities, and other characteristics required to perform the job
- this information can be obtained from a job description, a job analysis, or a current job incumbent
Projective Techniques
- Projective techniques are often employed in clinical settings
- they use a highly ambiguous stimulus to elicit an unstructured response (i.e., the test taker projects his or her perception and perspective onto a neutral stimulus)
- a variety of stimuli may be used (e.g., pictures, words), and responses may be verbal or drawn
Sentence Completion
- The sentence-completion format presents an incomplete sentence that the test taker completes (e.g., "I feel happiest when ____")
- subjective tests are at risk for judgment error, so inter-rater reliability is of particular importance; scoring keys and rater training are important
Test Types
Subjective Items
Advantages
- Can test higher cognitive skills
- Encourage test takers to organize and develop their thoughts
Disadvantages
- Difficult to grade
- Judgement error (e.g., interrater reliability)
- Require an objective scoring key prepared in advance
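Inter-rater reliability, raised above as a concern for subjective items, can be checked by having two raters score the same responses. The slides do not name a statistic; the sketch below uses Cohen's kappa, one common chance-corrected agreement index, and the ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of responses")
    n = len(rater_a)
    # Proportion of responses on which the two raters gave the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical essay ratings from two scorers.
scorer_1 = ["pass", "pass", "fail", "fail"]
scorer_2 = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(scorer_1, scorer_2))  # 0.5
```

A value well below 1 suggests the scoring key or the rater training needs attention before scores are used.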
Writing Good Items

- Items are the basic building blocks of test construction
- Little attention is given to writing items
- Item writing is an art that requires originality and creativity, combined with knowledge of the test domain and good item-writing practices
- not all items will perform as expected: they may be too easy or too difficult, may be misinterpreted, etc.
- rule of thumb: write at least twice as many items as you expect to use
- Broad vs. narrow items
Writing Good Items (cont.)

Suggestions:
- identify item topics by consulting the test plan (increases content validity)
- ensure that each item presents a central idea or problem
- write items drawn only from the testing universe
- write each item in a clear and direct manner
Writing Good Items (cont.)

Suggestions:
- use vocabulary and language appropriate for the target audience (e.g., age, culture)
- avoid sexist or racist language (e.g., mailman, fireman)
- make all items independent (e.g., one question per item)
- ask an expert to review items to reduce ambiguity and inaccuracy
Writing Administration Instructions

- specify the testing environment to decrease variation or error in test scores
- instructions should address:
  - group or individual administration
  - requirements for location (e.g., quiet)
  - required equipment
  - time limits or approximate completion time
  - script for administrator and answers to questions test takers may ask
Specifying Administration and Scoring Methods

- determine such things as how the test will be administered (e.g., orally, in writing, by computer; individually or in groups)
- determine the method of scoring, and also whether the test is scored by hand by the test administrator, accompanied by scoring software, or sent to the test publisher for scoring
Scoring Methods
Cumulative model: most common
- assumes that the more a test taker responds in a particular fashion, the more he/she has of the attribute being measured (e.g., more correct answers, or endorses higher numbers on a Likert scale)
- correct responses or responses on a Likert scale are summed
- yields interval data that can be interpreted with reference to norms
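A minimal sketch of cumulative scoring as described above: correct answers are counted against a key, or Likert ratings are summed. The function names and the sample data are illustrative assumptions, not part of any published scoring procedure.

```python
def cumulative_score(responses, answer_key):
    """Number of responses that match the answer key."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    return sum(given == correct for given, correct in zip(responses, answer_key))


def likert_total(ratings):
    """Sum of Likert ratings across the items of one scale."""
    return sum(ratings)


# Hypothetical four-item multiple-choice test and four-item Likert scale.
print(cumulative_score(["b", "a", "d", "c"], ["b", "c", "d", "c"]))  # 3
print(likert_total([4, 5, 3, 4]))  # 16
```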
Scoring Methods (cont.)

Categorical model: place test takers in a group
- e.g., a particular pattern of responses may suggest a diagnosis of a certain psychological disorder
- typically yields nominal data because it places test takers in categories
Scoring Methods (cont.)

Ipsative model: a test taker's scores are not compared to those of other test takers; instead, the scores on various scales are compared WITHIN the test taker (which scores are high and which are low)
- e.g., a test taker may complete a measure of interpersonal problems of various types, and the test administrator may want to determine which of the types the test taker feels is most problematic for him or her
Cumulative model may be combined with categorical or ipsative model
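One way to picture the ipsative comparison above is to express each scale score as a deviation from the test taker's own average, so that the highest and lowest areas stand out without reference to norms. This is a sketch under that assumption; the scale names and scores are hypothetical.

```python
def ipsative_profile(scale_scores):
    """Return each scale's deviation from the person's own mean, highest first."""
    if not scale_scores:
        raise ValueError("at least one scale score is required")
    own_mean = sum(scale_scores.values()) / len(scale_scores)
    deviations = {scale: score - own_mean for scale, score in scale_scores.items()}
    # Sort so the most elevated scale for this person comes first.
    return dict(sorted(deviations.items(), key=lambda pair: pair[1], reverse=True))


# Hypothetical cumulative scores on three interpersonal-problem scales.
profile = ipsative_profile({"assertiveness": 12, "sociability": 20, "intimacy": 16})
print(profile)  # {'sociability': 4.0, 'intimacy': 0.0, 'assertiveness': -4.0}
```

Each scale here is first scored cumulatively and then compared within the person, which is one way the cumulative and ipsative models can be combined.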
Response Bias
- In preparing an item review, each question can be evaluated from two perspectives: Is the item fair? Is the item biased?
- Tests are subject to error, and one form of error comes from the test takers
Response Sets/Styles
- Response sets are patterns of responding that result in misleading information and limit the accuracy and usefulness of the test scores
Reasons for misleading information:
1. Information requested is too personal
2. Test takers distort their responses
3. Test takers answer items carelessly
4. Test takers may feel coerced into completing the test
Response Style
- Some people always agree (acquiescence) or disagree (criticalness) with statements without attending to the actual content
- Usually occurs when items are ambiguous
- Solution: use both positively and negatively keyed items
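The effect of mixing positively and negatively keyed items can be sketched as follows. Negatively keyed items are reverse-scored before summing, so a test taker who agrees with everything ends up near the scale midpoint rather than at the top. The 1 to 5 scale and the item positions are assumptions made for the example.

```python
def reverse_key(rating, scale_min=1, scale_max=5):
    """Flip a rating on a Likert scale (e.g., 5 becomes 1 on a 1-5 scale)."""
    return scale_min + scale_max - rating


def score_scale(ratings, negatively_keyed, scale_min=1, scale_max=5):
    """Sum ratings after reverse-scoring the negatively keyed item positions."""
    total = 0
    for index, rating in enumerate(ratings):
        if not scale_min <= rating <= scale_max:
            raise ValueError(f"rating {rating} is outside the scale")
        if index in negatively_keyed:
            rating = reverse_key(rating, scale_min, scale_max)
        total += rating
    return total


# Items at positions 1 and 3 are negatively keyed (hypothetical scale).
print(score_scale([5, 5, 5, 5], {1, 3}))  # 12: agreeing with everything lands at the midpoint
print(score_scale([5, 1, 5, 1], {1, 3}))  # 20: a consistent high scorer
```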
Social Desirability
Some test takers choose socially acceptable answers or present themselves in a favourable light
- People often do not attend as much to the trait being measured as to the social acceptability of the statement
- This represents unwanted variance
Social Desirability (cont.)

Example items:
- Friends would call me spontaneous.
- People I know can count on me to finish what I start.
- I would rather work in a group than by myself.
- I often get stressed-out in many situations.
Faking
Faking: some test takers may respond in a particular way to cause a desired outcome
- may fake good (e.g., in employment settings) to create a favourable impression
- may fake bad (e.g., in clinical or forensic settings) as a cry for help or to appear mentally disturbed
- test developers may use some subtle questions that are difficult to fake because they aren't clearly face valid
Faking Bad
People try to look worse than they really are
- Common problem in clinical settings
Reasons:
- Cry for help
- Want to plead insanity in court
- Want to avoid being drafted into the military
- Want to show psychological damage
Most people who fake bad overdo it
Impression Management
Mitigating impression management (IM):
- Use positive and negative impression scales (endorsed by 10% of the population)
- Use lie scales to flag those who score high (e.g., "I get angry sometimes")
- Use inconsistency scales (e.g., two different responses to two similar questions)
- Use multiple assessment methods (other than self-report)
Random Responding
Random responding may occur when test takers are unwilling or unable to respond accurately.
- likely to occur when the test taker lacks the skills (e.g., reading), does not want to be evaluated, or lacks attention to the task
- try to detect it by embedding a scale that tends to yield clear results from the vast majority, such that a different result suggests the test taker wasn't cooperating
Random Responding
Detection:
- Duplicate items: "I love my mother." "I hate my mother."
- Infrequency scales: "I've never had hair on my head." "I have not seen a car in 10 years."
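The two detection ideas above can be sketched as simple counts: how many opposite-worded pairs were answered the same way, and how many infrequency items were endorsed. The item labels, the true/false response format, and the `max_flags` cutoff are hypothetical choices for illustration; a real instrument would set its cutoff empirically.

```python
def inconsistent_pairs(responses, opposite_pairs):
    """Count opposite-worded item pairs that received the same answer."""
    return sum(1 for first, second in opposite_pairs if responses[first] == responses[second])


def infrequency_hits(responses, infrequency_items):
    """Count endorsed items that almost no attentive test taker endorses."""
    return sum(1 for item in infrequency_items if responses[item])


def flag_protocol(responses, opposite_pairs, infrequency_items, max_flags=1):
    """Flag a protocol as possibly random when the flag count exceeds max_flags."""
    flags = inconsistent_pairs(responses, opposite_pairs)
    flags += infrequency_hits(responses, infrequency_items)
    return flags > max_flags


# True means the test taker endorsed the statement.
answers = {
    "love_mother": True,
    "hate_mother": True,          # same answer as its opposite: inconsistent
    "never_had_hair": True,       # infrequency item endorsed
    "no_car_in_10_years": False,
}
pairs = [("love_mother", "hate_mother")]
rare_items = ["never_had_hair", "no_car_in_10_years"]
print(flag_protocol(answers, pairs, rare_items))  # True
```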
Random Responding
May occur for several reasons:
- People are not motivated to participate
- Reading or language difficulties
- Do not understand instructions / item content
- Too confused or disturbed to respond appropriately
Piloting and Revising Tests

- can't assume the test will perform as expected
- the pilot test scientifically investigates the test's reliability and validity
- administer the test to a sample from the target audience
- analyze the data and revise the test to fix any problems uncovered; there are many aspects to consider
Setting Up the Pilot Test

- the test situation should match the actual circumstances in which the test will be used (e.g., in sample characteristics, setting)
- developers must follow the American Psychological Association's code of ethics (e.g., strict rules of confidentiality, and publishing only aggregate results)
Conducting the Pilot Test

- depth and breadth depend on the size and complexity of the target audience
- adhere strictly to the test procedures outlined in the test administration instructions
- generally requires a large sample
- may ask participants about the testing experience
Analyzing the Results

- can gather both quantitative and qualitative information
- use quantitative information for such things as item characteristics, internal consistency, convergent and discriminant validity, and in some instances predictive validity
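The quantitative analyses listed above can be illustrated on a tiny pilot data set. The sketch assumes items scored 0/1, uses the proportion correct as item difficulty, an upper-minus-lower group index as one common discrimination measure, and Cronbach's alpha as one common internal-consistency estimate. None of these specific statistics is named in the slides, and the data are invented.

```python
from statistics import variance


def item_difficulty(scores):
    """Proportion of test takers answering each item correctly."""
    n = len(scores)
    return [sum(row[j] for row in scores) / n for j in range(len(scores[0]))]


def item_discrimination(scores):
    """Upper-group minus lower-group proportion correct, per item."""
    ranked = sorted(scores, key=sum, reverse=True)  # highest total scores first
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    return [
        sum(row[j] for row in upper) / half - sum(row[j] for row in lower) / half
        for j in range(len(scores[0]))
    ]


def cronbach_alpha(scores):
    """Internal consistency from item variances and total-score variance."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Rows are pilot test takers, columns are items (1 = correct, 0 = incorrect).
pilot = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(item_difficulty(pilot))      # [0.75, 0.5, 0.25]
print(item_discrimination(pilot))  # [0.5, 1.0, 0.5]
print(round(cronbach_alpha(pilot), 2))  # 0.75
```

A real pilot would use a much larger sample, as noted earlier; four test takers are used here only to keep the arithmetic easy to follow.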
Revising the Test

- Choosing the final items requires weighing each item's content validity, item difficulty and discrimination, interitem correlation, and bias
- when new items need to be added or items need to be revised, the items must again be pilot tested to ensure that the changes produced the desired results
Validation and Cross-Validation

- Validation is the process of obtaining evidence that the test effectively measures what it is supposed to measure (i.e., reliability and validity)
- the first part, establishing content validity, is carried out as the test is developed; whether the test measures the construct (construct validity) and predicts an outside criterion is determined in subsequent data collection
Validation and Cross-Validation

- when the final revision of a test yields scores with sufficient evidence of reliability and validity, test developers then conduct cross-validation: a final round of test administration to another sample
- because of chance factors, the reliability and validity coefficients will likely be smaller in the new sample; this is referred to as shrinkage
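Shrinkage, as described above, is the drop in a coefficient from the original sample to the cross-validation sample. The sketch below computes a validity coefficient as a Pearson correlation between test scores and a criterion in each sample and takes the difference. All scores are invented for illustration, and the helper names are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


def shrinkage(r_original, r_cross_validation):
    """Drop in the coefficient when the test is given to a new sample."""
    return r_original - r_cross_validation


# Hypothetical test scores and criterion ratings in two samples.
original_test, original_criterion = [10, 12, 14, 16, 18], [3, 4, 4, 6, 7]
new_test, new_criterion = [11, 13, 15, 17, 19], [4, 3, 5, 5, 6]

r_original = pearson_r(original_test, original_criterion)
r_new = pearson_r(new_test, new_criterion)
print(shrinkage(r_original, r_new))
```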
