Designing Classroom Language Tests Chapter 3

YESDL Eskiehirin LP Markas
www.yesdil.com -20-
CHAPTER 3
DESIGNING CLASSROOM LANGUAGE TESTS

In this chapter, we will examine test types, and we will learn how to design tests and
revise existing ones. To start the process of designing tests, we will ask some critical
questions.
The following five questions should form the basis of your approach to designing tests
for your classroom.

Question 1:

What is the purpose of the test?
Why am I creating this test?
For an evaluation of overall proficiency? (Proficiency Test)
To place students into a course? (Placement Test)
To measure achievement within a course? (Achievement Test)

Once you have established the major purpose of a test, you can determine its
objectives.

Question 2:

What are the objectives of the test?
What specifically am I trying to find out?
What language abilities are to be assessed?

Question 3:

How will the test specifications reflect both the purpose and objectives?
When a test is designed, the objectives should be incorporated into a structure
that appropriately weights the various competencies being assessed.

Question 4:

How will the test tasks be selected and the separate items arranged?
The tasks need to be practical.
They should also achieve content validity by presenting tasks that mirror those
of the course being assessed.
They should be evaluated reliably by the teacher or scorer.
The tasks themselves should strive for authenticity, and the progression of tasks
ought to be biased for best performance.

Question 5:

What kind of scoring, grading, and/or feedback is expected?
Tests vary in the form and function of feedback, depending on their purpose.
For every test, the way results are reported is an important consideration.
Under some circumstances a letter grade or a holistic score may be
appropriate; other circumstances may require that a teacher offer substantive
washback to the learner.

http://www.yesdil.com/v1/pdf/testing_language_assessment.pdf
FOR EDUCATIONAL USE ONLY

TEST TYPES
Defining your purpose will help you choose the right kind of test, and it will also help
you to focus on the specific objectives of the test.

Below are the test types to be examined:

1. Language Aptitude Tests
2. Proficiency Tests
3. Placement Tests
4. Diagnostic Tests
5. Achievement Tests

1. Language Aptitude Tests

They predict a person's success prior to exposure to the second language.
A language aptitude test is designed to measure capacity or general ability to learn
a foreign language.
Language aptitude tests are ostensibly (apparently) designed to apply to the
classroom learning of any language.
Two standardized aptitude tests have been used in the US: the Modern Language
Aptitude Test (MLAT) and the Pimsleur Language Aptitude Battery (PLAB).
Tasks in the MLAT include number learning, phonetic script, spelling clues, words in
sentences, and paired associates.
There's no unequivocal (indisputable) evidence that language aptitude
tests predict communicative success in a language. (That is, although these test
results are generally consistent with students' language learning processes, the
tests cannot be said to be absolute.)
Any test that claims to predict success in learning a language is undoubtedly flawed
because we now know that with appropriate self-knowledge, and active
strategic involvement in learning, virtually everyone can succeed eventually.

2. Proficiency Tests

A proficiency test is not limited to any one course, curriculum, or single skill in
the language; rather, it tests overall ability.
It includes: standardized multiple choice items on grammar, vocabulary, reading
comprehension, and aural (listening) comprehension. Sometimes a sample of
writing is added, and more recent tests also include oral production.
Such tests often have content validity weaknesses.
Proficiency tests are almost always summative and norm-referenced.
They are usually not equipped to provide diagnostic feedback.
Their role is to accept or to deny someone's passage into the next stage of a
journey.
TOEFL is a typical standardized proficiency test.
Creating these tests and validating them with research is a time-consuming and
costly process. To choose one of a number of commercially available proficiency
tests is a far more practical method for classroom teachers.

3. Placement Tests

The ultimate objective of a placement test is to correctly place a student into a
course or level.
Certain proficiency tests can act in the role of placement tests.
A placement test usually includes a sampling of the material to be covered in the
various courses in a curriculum.

In a placement test, a student should find the test material neither too easy nor
too difficult but appropriately challenging.
The English as a Second Language Placement Test (ESLPT) at San Francisco State
University has three parts. Part 1: students read a short article and then write a
summary essay. Part 2: students write a composition in response to an article. Part
3: multiple-choice; students read an essay and identify grammar errors in it.
The ESLPT is more authentic but less practical, because human evaluators are
required for the first two parts.
Reliability problems are also present but are mitigated (lessened) by
conscientious training of all evaluators of the test.
What is lost in practicality and reliability is gained in the diagnostic information
that the ESLPT provides.

4. Diagnostic Tests

A diagnostic test is designed to diagnose specified aspects of a language.
A diagnostic test can help a student become aware of errors and encourage the
adoption of appropriate compensatory strategies.
A test of pronunciation, for example, might diagnose the phonological features of
English that are difficult for learners and should therefore become part of a
curriculum. Usually such tests offer a checklist of features for the administrator to use
in pinpointing difficulties.
Another example: a writing diagnostic would elicit a writing sample from students
that would allow the teacher to identify those rhetorical and linguistic features on
which the course needed to focus special attention.
A typical diagnostic test of oral production was created by Clifford Prator (1972) to
accompany a manual of English pronunciation. In the test:
Test-takers are directed to read a 150-word passage while they are
tape-recorded.
The test administrator then refers to an inventory (a catalogued list) of
phonological items for analyzing a learner's production.
After multiple listenings, the administrator produces a checklist for errors in
five separate categories.

o Stress and rhythm,
o Intonation,
o Vowels,
o Consonants, and
o Other factors.

This information can help a teacher make decisions about aspects of English phonology.

5. Achievement Tests

An achievement test is related directly to classroom lessons, units, or even a
total curriculum.
Achievement tests should be limited to particular material addressed in a
curriculum within a particular time frame and should be offered after a course
has focused on the objectives in question.
There's a fine line of difference between a diagnostic test and an achievement
test.
Achievement tests analyze the extent to which students have
acquired language features that have already been taught. (They
analyze the past.)
Diagnostic tests should elicit information on what students need to
work on in the future. (They analyze future needs.)

The primary role of an achievement test is to determine whether course
objectives have been met and appropriate knowledge and skills acquired by
the end of a period of instruction.
Achievement tests are often summative because they are administered at the end
of a unit or term of study. But effective achievement tests can serve as useful
washback by showing the errors of students and helping them analyse their
weaknesses and strengths. (A perfect example of washback.)
Achievement tests range from five- or ten-minute quizzes to three-hour final
examinations, with an almost infinite variety of item types and formats.

IMPORTANT:
New and innovative testing formats take a lot of effort to design and a long time to
refine through trial and error. Traditional testing techniques can, with a little
creativity, conform to the spirit of an interactive, communicative language curriculum.
Your best tack (course of action) as a new teacher is to work within the guidelines of
accepted, known, traditional testing techniques.
Slowly, with experience, you can get bolder in your attempts.

In that spirit, then, let us consider some practical steps in constructing classroom
tests:

A) Assessing Clear, Unambiguous Objectives

Before giving a test,
examine the objectives for the unit you're testing. Your first task in designing a
test, then, is to determine appropriate objectives. (As an objective, "Tag
questions" or "Students will learn tag questions" is incomplete because it is not
testable. For example, it is unclear whether students will learn them in spoken or
in written form. Moreover, the context is unclear: in conversation, in an essay, or
in an academic lecture?) A properly stated objective: "Students will recognize and
produce tag questions, with the correct grammatical form and final intonation
pattern, in simple social conversations." For more, see the original book, p. 50.

B) Drawing Up Test Specifications (Instructions)

Test specifications will simply comprise
a) a broad outline of the test
b) what skills you will test
c) what the items will look like

This is an example of test specifications based on the objective stated above:
Students will recognize and produce tag questions, with the correct grammatical form
and final intonation pattern, in simple social conversations.

Test specifications

Speaking (5 minutes per person, previous day)
Format: oral interview, T and S
Task: T asks questions to S

Listening (10 minutes)
Format: T makes audiotape in advance, with one other voice on it
Tasks: a. 5 minimal pair items, multiple choice
b. 5 interpretation items, multiple choice


Reading (10 minutes)
Format: cloze test items (10 total) in a story line
Tasks: fill in the blanks

Writing (10 minutes)
Format: prompt for a topic: why I liked/didn't like a recent TV sitcom
Task: writing a short opinion paragraph

These informal classroom-oriented specifications give you an indication of
the topics (objectives) you will cover
the implied elicitation and response formats for items
the number of items in each section
the time to be allocated for each
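
As a rough illustration, the informal specifications above can be written down as data so that timing and item counts are easy to check. This is only a sketch: the `Section` class, its field names, and the class size of 20 are assumptions made for the example, not part of the original notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    """One section of the informal test specification (hypothetical structure)."""
    skill: str
    minutes: int                  # time allotted (per person if per_person is True)
    response_format: str
    items: Optional[int] = None   # None where the notes give no item count
    per_person: bool = False      # True for the individual oral interview


# Values taken from the specification in the notes.
SPEC = [
    Section("speaking", 5, "oral interview", per_person=True),
    Section("listening", 10, "multiple choice", items=10),
    Section("reading", 10, "cloze, fill in the blanks", items=10),
    Section("writing", 10, "short opinion paragraph", items=1),
]


def in_class_minutes(spec):
    """Total time for the sections the whole class takes together."""
    return sum(s.minutes for s in spec if not s.per_person)


def interview_minutes(spec, class_size):
    """Total teacher time needed for the per-person sections."""
    return sum(s.minutes * class_size for s in spec if s.per_person)


print(in_class_minutes(SPEC))        # whole-class sitting time in minutes
print(interview_minutes(SPEC, 20))   # interview time for an assumed class of 20
```

Keeping the per-person interview separate from the whole-class sections mirrors the notes, where the speaking section is held on the previous day.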

C) Devising Test Tasks

As you devise your test items, consider such factors as
how students will perceive them (face validity)
the extent to which authentic language and contexts are present
potential difficulty caused by cultural schemata

In revising your draft, you should ask yourself some important questions:

1. Are the directions to each section absolutely clear?
2. Is there an example item for each section?
3. Does each item measure a specified objective?
4. Is each item stated in clear, simple language?
5. Does each multiple choice have appropriate distractors; that is, are the wrong items
clearly wrong and yet sufficiently alluring that they aren't ridiculously easy?
6. Is the difficulty of each item appropriate for your students?
7. Is the language of each item sufficiently authentic?
8. Do the sum of the items and the test as a whole adequately reflect the learning
objectives?

In the final revision of your test,
imagine that you are a student taking the test
go through each set of directions and all items slowly and deliberately. Time
yourself
if the test should be shortened or lengthened, make the necessary
adjustments
make sure your test is neat and uncluttered on the page, reflecting all the care
and precision you have put into its construction
if there is an audio component, make sure that the script is clear, that your
voice and any other voices are clear, and that the equipment is in working
order before starting the test.

D) Designing Multiple-Choice Test Items

There are a number of weaknesses in multiple-choice items:
The technique tests only recognition knowledge.
Guessing may have a considerable effect on test scores.
The technique severely restricts what can be tested.
It is very difficult to write successful items.
Washback may be harmful. (Thoughts such as "I can guess the answer anyway; even a
random guess might be right" can lead to negative washback.)
Cheating may be facilitated.

However,
The two principles that stand out in support of multiple-choice formats are, of
course, practicality and reliability.

Some important terminology for multiple-choice items:

Multiple-choice items are all receptive, or selective; that is, the test-taker
chooses from a set of responses rather than creating a response. Other
receptive item types include true-false questions and matching lists.
Every multiple-choice item has a stem (the question root), which presents a
stimulus, and several options (usually between three and five) or alternatives
to choose from.
One of those options, the key, is the correct response, while
the others serve as distractors.

IMPORTANT!!!

Consider the following four guidelines for designing multiple-choice items for both
classroom-based and large-scale situations:

1. Design each item to measure a specific objective. (For example, do not test
knowledge of modals and knowledge of articles in the same item.) See p. 56.

2. State both stem and options as simply and directly as possible. Do not use
superfluous (unnecessary) words; another rule of succinctness (being brief and to
the point) is to remove needless redundancy from your options. See p. 57.

3. Make certain that the intended answer is clearly the only correct one. Eliminating
unintended possible answers is often the most difficult problem of designing
multiple-choice items. With only a minimum of context in each stem, a wide variety of
responses may be perceived as correct.

4. Use item indices to accept, discard, or revise items. The appropriate
selection and arrangement of suitable multiple-choice items on a test can best be
accomplished by measuring items against three indices: a) item facility (IF), or item
difficulty; b) item discrimination (ID), or item differentiation; and c) distractor
analysis.

a) Item facility (IF) is the extent to which an item is easy or difficult for the proposed
group of test-takers. (Items that are far too easy or far too difficult do not help us
separate the strongest students from the weakest ones. This is why item facility is
an important parameter.)

If 13 out of 20 students answer an item correctly, IF = 13/20 = 0.65 (65%).
Although there is no absolute rule for what the value should be, IF values between
15% and 85% can be considered acceptable.
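
The item facility calculation can be sketched in a few lines of Python. The function names are made up for this illustration, and the 15% to 85% band is only the rough guideline given in these notes, not a fixed rule.

```python
def item_facility(num_correct, num_test_takers):
    """Proportion of test-takers who answered the item correctly."""
    if num_test_takers <= 0:
        raise ValueError("num_test_takers must be positive")
    if not 0 <= num_correct <= num_test_takers:
        raise ValueError("num_correct must be between 0 and num_test_takers")
    return num_correct / num_test_takers


def within_guideline(if_value, low=0.15, high=0.85):
    """Check an IF value against the rough 15%-85% guideline (an assumption)."""
    return low <= if_value <= high


# Example from the notes: 13 of 20 students answer correctly.
if_value = item_facility(13, 20)
print(f"IF = {if_value:.2f}")
print("within guideline:", within_guideline(if_value))
```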

Note:
Two good reasons for occasionally including a very easy item (85% or higher) are to
build in some affective feelings of success among lower-ability students and to serve
as warm-up items. And very difficult items can provide a challenge to the highest-
ability students.


b) Item discrimination (ID) is the extent to which an item differentiates between
high- and low-ability test-takers.
An item on which high-ability students and low-ability students score equally well
would have poor ID because it did not discriminate between the two groups.
An item that garners (collects) correct responses from most of the high-ability
group and incorrect responses from most of the low-ability group has good
discrimination power.

Rank 30 students from the highest score to the lowest and divide them into three
equal groups. Suppose that, on one item, the 10 highest-scoring students and the
10 lowest-scoring students respond as follows:

Item                                 Correct   Incorrect
High-ability students (top 10)          7          3
Low-ability students (bottom 10)        2          8

ID = (7 - 2) / 10 = 5 / 10 = 0.50. The result tells us that the item has a moderate
level of ID. High discriminating power would approach 1.0, and no discriminating
power at all would be zero.
In most cases, you would want to discard an item that scored near zero.
As with IF, no absolute rule governs the establishment of acceptable and
unacceptable ID indices.
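
The ID calculation can be written as a small Python sketch. The function name is hypothetical, and it assumes, as in the example above, that the high-ability and low-ability groups are the same size.

```python
def item_discrimination(high_correct, low_correct, group_size):
    """(correct in high group - correct in low group) / size of one group.

    Assumes both groups contain group_size test-takers.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (high_correct - low_correct) / group_size


# Example from the notes: 7 of the top 10 and 2 of the bottom 10 answer correctly.
print(item_discrimination(7, 2, 10))

# An item on which both groups score equally well does not discriminate.
print(item_discrimination(6, 6, 10))
```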

c) Distractor efficiency (DE) is the extent to which
the distractors lure a sufficient number of test-takers, especially lower-ability
ones, and
those responses are somewhat evenly distributed across all distractors.

Example:

Choices                        A    B    C*   D    E
High-ability students (10)     0    1    7    0    2
Low-ability students (10)      3    5    2    0    0

*Note: C is the correct response.
The item might be improved in two ways:

a) Distractor D doesn't fool anyone. Therefore it probably has no utility. A
revision might provide a distractor that actually attracts a response or
two.
b) Distractor E attracts more responses (2) from the high-ability group than
the low-ability group (0). Why are good students choosing this one?
Perhaps it includes a subtle reference that entices the high group but is
over the head of the low group, and therefore the latter students don't
even consider it.

The other two distractors (A and B) seem to be fulfilling their function of attracting some
attention from the lower-ability students.
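
The same reading of the table can be sketched as a small Python check. The function name, the three labels, and the simple rules behind them are assumptions chosen to mirror the discussion above; they are not a standard formula for distractor efficiency.

```python
KEY = "C"
HIGH = {"A": 0, "B": 1, "C": 7, "D": 0, "E": 2}   # high-ability group (10)
LOW = {"A": 3, "B": 5, "C": 2, "D": 0, "E": 0}    # low-ability group (10)


def review_distractors(high, low, key):
    """Label each distractor using two simple, assumed rules of thumb."""
    notes = {}
    for option in high:
        if option == key:
            continue  # the key is not a distractor
        h, l = high[option], low[option]
        if h + l == 0:
            notes[option] = "chosen by no one; consider replacing"
        elif h > l:
            notes[option] = "attracts more high-ability test-takers; check wording"
        else:
            notes[option] = "functioning"
    return notes


for option, note in review_distractors(HIGH, LOW, KEY).items():
    print(option, "->", note)
```

With the data from the example, D is flagged as unused, E is flagged for attracting the stronger students, and A and B are left alone.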


SCORING, GRADING AND GIVING FEEDBACK

A) Scoring
As you design a classroom test, you must consider how the test will be scored and
graded. Your scoring plan reflects the relative weight that you place on each section
and on the items in each section. (Whichever skill the lesson objective emphasizes
most should carry the most points.)

For example, if the lesson objective or curriculum concentrates on listening and
speaking and places less emphasis on reading and writing, the points might be
distributed as follows:
Oral production 30%, Listening 30%, Reading 20%, and Writing 20%.
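
A weighting like this can be applied with a short Python sketch. The function name and the sample student's section percentages are invented for illustration; only the four weights come from the notes.

```python
# Section weights from the notes (they must sum to 1.0).
WEIGHTS = {
    "oral production": 0.30,
    "listening": 0.30,
    "reading": 0.20,
    "writing": 0.20,
}


def weighted_total(section_percentages, weights=WEIGHTS):
    """Combine per-section percentage scores (0-100) into one weighted total."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[s] * section_percentages[s] for s in weights)


# Hypothetical student: percentage score earned in each section.
student = {"oral production": 80, "listening": 70, "reading": 90, "writing": 60}
print(f"{weighted_total(student):.1f}")
```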

B) Grading
Grading doesn't mean just giving an A for 90-100 and a B for 80-89. It's not that
simple. How you assign letter grades to a test is a product of

the country, culture, and context of the English classroom,
institutional expectations (most of them unwritten),
explicit and implicit definitions of grades that you have set forth,
the relationship you have established with the class, and
student expectations that have been engendered (caused) in previous tests and
quizzes in the class.

C) Giving Feedback
Feedback should become beneficial washback. The following are some examples of
feedback:

1. a letter grade
2. a total score
3. four subscores (speaking, listening, reading, writing)
4. for the listening and reading sections
a. an indication of correct/incorrect responses
b. marginal comments
5. for the oral interview
a. scores for each element being rated
b. a checklist of areas needing work
c. oral feedback after the interview
d. a post-interview conference to go over the results
6. on the essay
a. scores for each element being rated
b. a checklist of areas needing work
c. marginal and end-of-essay comments, suggestions
d. a post-test conference to go over work
e. a self-assessment
7. on all or selected parts of the test, peer checking of results
8. a whole-class discussion of results of the test
9. individual conferences with each student to review the whole test

Options 1 and 2 give virtually no feedback. The feedback they present does not
become washback.
Option 3 gives a student a chance to see the relative strength of each skill area and
so becomes minimally useful.
Options 4, 5, and 6 represent the kind of response a teacher can give that
approaches maximum feedback.

- END OF CHAPTER 3 -


CHAPTER 3

EXERCISE 1:
Decide whether the following statements are TRUE or FALSE.

1. A language aptitude test measures a learner's future success in learning a foreign
language.
2. Language aptitude tests are very common today.
3. A proficiency test is limited to a particular course or curriculum.
4. The aim of a placement test is to place a student into a particular level.
5. Placement tests have many varieties.
6. Any placement test can be used in a particular teaching program.
7. Achievement tests are related to classroom lessons, units, or curriculum.
8. A five-minute quiz can be an achievement test.
9. The first task in designing a test is to determine test specification.

EXERCISE 2:
Decide whether the following statements are TRUE or FALSE.

1. It is very easy to develop multiple-choice tests.
2. Multiple-choice tests are practical but not reliable.
3. Multiple-choice tests are time-saving in terms of scoring and grading.
4. Multiple-choice items are receptive.
5. Each multiple-choice item in a test should measure a specific objective.
6. The stem of a multiple-choice item should be as long as possible in order to help
students to understand the context.
7. If the Item Facility value of a multiple-choice item is .10(% 10), it means the item is
very easy.
8. Item discrimination index differentiates between high and low-ability students.

ANSWER KEY

EXERCISE 1:
1. TRUE
2. FALSE
3. FALSE
4. TRUE
5. TRUE
6. FALSE (Not all placement tests suit every teaching program.)
7. TRUE
8. TRUE (Achievement tests range from five- or ten-minute quizzes to three-hour final examinations.)
9. FALSE (The first task is to determine appropriate objectives.)

EXERCISE 2:

1. FALSE (It seems easy, but is not very easy.)
2. FALSE (They can be both practical and reliable.)
3. TRUE
4. TRUE
5. TRUE
6. FALSE (It should be short and to the point.)
7. FALSE (An item with an IF value of .10 is a very difficult one.)
8. TRUE
