Delta M1 Testing and Evaluation
Assessment
Contents
0 Introduction
Task 1
Task 2
0 Introduction
This unit focuses on the evaluation of learners. A sound understanding of the various
principles of assessment is very important for Paper 2 Task 1 of the exam. Other
questions in which this knowledge is useful are Paper 1 Tasks 1 and 2, and possibly
Paper 2 Tasks 2 and 3.
The main aim of the unit is to provide an introduction to the key terms and concepts
of assessment. Bear in mind that it provides only an overview of a very complex
subject: while further reading is not essential for the exam, there is a great
deal more to be read. As always, there is a list at the end of this unit of all
the terminology mentioned, which you can use for revision purposes, as well as
some recommended reading on the subject.
In Paper 2 Question 1 you are asked to evaluate a test or part of a test ‘using
your knowledge of relevant testing concepts’. In order to do this, you will need
to be familiar with some of the relevant terminology. (The same terminology is
also very useful when writing the Module 3 extended assignment.)
For this unit we suggest you adopt a test-teach-test approach by first trying to match
the terms with the definitions on the following page, and then, when you have
finished reading the unit, try the task again.
N.B. It’s worth bearing in mind that several of these terms overlap.
achievement tests – the extent to which a test appears to test what it is
designed to test

analytic marking schemes – the extent to which a completed test would be given
the same score by two or more different markers

criteria-referenced testing – the influence a final test has on the teaching
that comes before it

indirect testing – the extent to which the same test given to the same learner
at a different time would produce the same score

test reliability – a means of analysing a learner’s strengths and weaknesses,
often used to provide information about what to include in a course programme
1 Evaluation
There are three types of evaluation: summative, formative and congruent. The
differences between them are defined by when they take place. Summative
evaluation takes place at the end of a period of study and aims to assess what has
been achieved in that time. Formative evaluation takes place during a period of study
and aims to provide feedback during a course so that it can be improved. Congruent
evaluation refers to the evaluation of a course before it starts to ensure that the
course design matches the course aims and objectives.
Assessment
Assessment is the measurement of the amount of learning that has taken place. It
can be carried out by the teacher (teacher assessment), by students (self-
assessment), by students and teachers together (collaborative assessment), or by
students with one another (peer assessment).
There are many ways in which information on learning can be provided through
such assessment activities.
Testing
If assessment is formal then it is known as testing. Andy Baxter (1997) describes
testing as a process in which teachers ask learners questions to which they already
know the answers (whereas when evaluating we are asking questions to which we
don’t have the answers). Testing is concerned with what has been learned whereas
evaluation is also concerned with the how and the why.
Given these definitions, which do you carry out more of: formal or informal testing?
2 Why assess learners?
Assessment can provide useful information for a number of different parties:
• the learners;
• the teacher;
• a director of studies;
• parents;
• employers;
• etc.
While larger organisations tend to use tests which measure learners against a
standard of proficiency that is not based on any syllabus they may have followed
(proficiency tests), smaller organisations will often test their learners based on the
extent to which they have mastered certain aspects of a syllabus or the overall
objectives of a syllabus (achievement tests). So the CAE or IELTS exams would be
examples of the former, while a coursebook test or a school end-of-term test could
be examples of the latter.
We can also use tests to decide what course learners should take (placement
tests). Such tests may also be achievement tests if learners are changing level,
or proficiency tests if the learners are new to the organisation. Alternatively,
they could be a mixture of the two.
Tests can also be used to identify the strengths and weaknesses of a learner or of a
teaching programme. Such tests are known as diagnostic tests.
Think of some tests you have given recently. What were the reasons for giving them?
3 Test Construction
Obviously the thoroughness of a test depends on its purpose. A good test,
though, should be:
• valid;
• reliable;
• practical;
• free of negative backwash (or washback).
Validity is usually broken down into three types:
• content validity;
• construct validity;
• face validity.
If a test of pronunciation only tested the learners’ ability to recognise
different sounds rather than their ability to produce them, then it would have
low construct validity, as it would be testing the wrong thing. Content validity
is often considered to be part of
construct validity, as clearly a test can’t accurately measure a learner’s ability or
knowledge if the test does not contain the language areas that are supposed to be
being tested.
Face validity is to do with the way in which a learner perceives a test. Does the
learner believe it is testing what it is supposed to test? In order to have face validity a
test needs to be designed in a way that allows the learner to see that it really is
testing what it is supposed to.
Reliability has two aspects:
• test reliability;
• scorer reliability.
Test reliability refers to the degree to which the same test given to the same learner
under the same conditions would produce the same results. Obviously the more
thorough the test the more data it produces, which increases reliability, but issues of
practicality mean that we often cannot be as thorough as we would like to be.
Creating a reliable test then is a question of compromise between thoroughness and
practicality. Giving learners fresh starts by providing a variety of tasks rather than,
say, one long one is one way of increasing the reliability of a test. Varying the
question types in a test yet sticking to ones that the learners are familiar with is also a
way of ensuring test reliability. Instructions should also be intelligible and the
conditions in which the test is taken should be the same each time it is sat.
If two different people mark the same test and give it the same mark then the test is
said to have scorer reliability.
To increase scorer reliability the test either has to have a set of right answers to mark
against (e.g. in a multiple choice exercise) or an answer key and marking scheme
instructing markers on how they should be marking. The scorer reliability of tests that
can be answered in a variety of ways (such as writing tasks) can be increased if
there is more than one marker.
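One rough-and-ready way of putting a figure on scorer reliability is to correlate two markers’ scores for the same set of scripts. The sketch below is purely illustrative: the marks are invented, and a high correlation alone does not prove a marking scheme is sound.

```python
# Illustrative sketch: estimating scorer reliability as the correlation
# between two markers' scores on the same set of scripts.
# The marks below are invented for the example.

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

marker_a = [14, 11, 17, 9, 15, 12]   # marks out of 20 from marker A
marker_b = [13, 12, 16, 9, 14, 13]   # marks for the same scripts from marker B

r = pearson_r(marker_a, marker_b)
print(f"scorer reliability (Pearson r): {r:.2f}")
```

A value close to 1 suggests the two markers are ranking and spacing the scripts similarly; in practice agreement on absolute marks and clear marking criteria matter as well.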
Practicality refers to how easy a test is to administer. This covers not only
the space, time and invigilating staff needed to run the test, but also the time
and expertise necessary for its design, trialling and marking, including the
production of a valid and reliable answer key or marking scheme.
Validity and reliability are also not always easy to reconcile in a test. Reliable tests
are not always valid, and vice versa. Having learners write a letter of complaint might
be a very valid way of testing their ability to write such a letter, for example, but
unless the teacher has carefully considered how much guidance the learners will get,
devised exact criteria for marking and arrived at a clear understanding amongst the
various markers of what constitutes a good letter, then the test will not be reliable.
Alternatively, while getting learners to complete a gapped letter of complaint would
make for a very reliable test, its validity would be very low as it would not show how
well they could produce such a text themselves.
Tests such as the gapped letter described above have the advantage of being
relatively objective as there is often a single correct answer. Therefore they are very
easy to mark and can sometimes even be marked by a machine such as an OMR –
an optical mark reader. However, as well as lacking in validity they are often much
more difficult to design.
Subjective tests, such as writing a letter, are ones in which the marker uses
their judgement. These are usually easier to design but have to be marked by a
teacher, and the marking can be time-consuming. Other issues with subjective
testing are that learners can either play safe by avoiding things they are not
sure about, or produce language that is beyond the scope of what is being
tested.
Another way of labelling tests (and remember, many of these terms overlap
considerably) is by using the terms discrete-point testing and integrative testing.
A discrete-point test is one which consists of several items, each of which tests a
single point of knowledge at a time (e.g. a test in which each part tests a different
grammatical structure). If we want to know if a learner can recognise or produce a
specific language item, then we use discrete-point techniques. If, on the other
hand, we want to know how well a student can use their combined knowledge of
single items, then integrative testing techniques are the best method.
Discrete-point tests are usually objective and require short answers whereas
integrative tests are usually open-ended and require the learner to respond in their
own words. Most tests nowadays use a combination of these techniques depending
on the language and skill that is being tested.
Tests can also be described as direct or indirect. In a direct test a learner’s ability to
perform a task in the language is assessed by getting the learner to perform the task.
So if we want to assess a learner’s speaking then the test requires the learner to
speak. An indirect test assesses aspects of the language which give an indication of
how well a learner performs. To assess a learner’s spoken language, for example,
they might be asked to match spoken discourse markers with their functions, or
to choose the right response to a request.
Backwash describes the effect a test has on the teaching that precedes it, i.e.
the extent to which a course is influenced by the test it is leading up to. If
the content of a course is improved by the teacher having to ‘teach to the test’
then this is known as beneficial, or positive, backwash, whereas if the learners
are deprived of work on the areas they really need because the course focuses
too much on the upcoming test, then the backwash is said to be negative. It is
important to be aware of backwash and not automatically assume that what is in
the exam is necessarily what the learners need.
In the longer term of course it is also the case that tests change to reflect changes in
the way that teachers are teaching and in the content of courses.
TASK 1
1. Take a test you are familiar with (it can be an internationally taken test such as
IELTS, one your school uses, or even one you have designed yourself).
2. Using the key words in Sections 2 and 3, analyse the pros and cons of the test.
3. Post your list on the discussion board. Then read and discuss your colleagues’
reports.
4 What do we test?
Over the years there has been a move away from testing learners’ knowledge of a
language towards testing their ability to use it. There has also been less focus on
testing accuracy and more on testing communicative competence, and helping
learners to learn more effectively is seen by many these days as more constructive
than testing their memory.
Think of as many ways as you can of testing grammar and lexis, and compare your
list with the one below.
Some common types of testing techniques used for grammar and lexis are:
• Gap-filling
• Multiple choice
• Error spotting
• Transformation exercises (e.g. when learners are given a sentence which
they have to express in another way using either a sentence head or a key
word)
• Jumbled sentences for students to order
• Matching tasks (e.g. word and definition, halves of collocation, sentence
halves, sentences, etc.)
• Cloze tests (a text in which every 7th word is removed, though in practice,
to maintain coherence, it is often every 6th to 10th word)
• Skeleton sentences, which need to be written in full
• Odd one out
• Writing questions for answers
• Adding to categories
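The cloze procedure is mechanical enough to sketch in code. The short function below (the function name and sample passage are my own, for illustration) deletes every nth word after a short intact lead-in and returns both the gapped text and an answer key:

```python
def make_cloze(text, n=7, start=3):
    """Blank out every nth word of a text, leaving `start` words of
    intact lead-in, and return the gapped text plus the answer key."""
    words = text.split()
    answers = []
    for i in range(start + n - 1, len(words), n):
        answers.append(words[i])      # record the removed word
        words[i] = "_" * 8            # replace it with a gap
    return " ".join(words), answers

passage = ("A cloze test asks learners to restore words that have been "
           "systematically removed from a text, drawing on their grammar, "
           "vocabulary and overall sense of the passage.")

gapped, key = make_cloze(passage, n=7)
print(gapped)
print("Answers:", key)
```

In practice a teacher would still read through the output and adjust n, or move individual gaps, since a purely mechanical deletion can land on words that are impossible to recover from context.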
STOP AND THINK 5
Which of the techniques for testing lexis and grammar listed above could also be
used for testing reading and listening?
Common techniques for testing reading and listening include:
• Gap-filling
• Multiple choice questions
• True/false questions
• Completing tables
• Sequencing a jumbled text (reading)
• Writing answers to questions
• Matching (e.g. titles or topics to texts or paragraphs)
• Inserting headings, sentences, paragraphs back into texts
• Labelling diagrams
• Selecting a picture or sequencing pictures
• Spotting differences in content between a written and a spoken text
• Identifying features of spoken language (listening).
Testing Writing
Some of these tests involve reading, which could be said to lower their construct
validity. However, it could also be argued that the tasks are therefore more
communicative and reflect the circumstances in which we write in the real world.
The marking of a written text is a considerably more complex process than the
marking of a discrete-point grammar test. Ideally, a learner’s text should receive the
same score regardless of who marks it when, so a purely impressionistic mark is
insufficient. Marking writing involves a lot more than simply adding up points
however; a clear list of criteria is needed.
Such marking scales are either holistic or analytic. An example of a holistic
scale is that used in the Cambridge suite of exams (FCE, CAE, etc.), which uses
descriptors to assess the writing from a global point of view. Analytic marking
scales break the writing skill down into different components and award marks
for each one. The IELTS writing tasks, for example, are awarded marks under the
following categories:
• task achievement;
• coherence and cohesion;
• lexical resource;
• grammatical range and accuracy.
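A simple way to see how an analytic scale combines its components is to compute a weighted average of the category scores. The sketch below is a hypothetical illustration, not the actual IELTS procedure: the equal weighting, the rounding rule and the sample scores are all invented.

```python
# Hypothetical sketch of analytic marking: an overall band computed as a
# weighted average of per-category scores. The weighting, rounding rule and
# scores are invented for illustration; this is not the real IELTS procedure.

def analytic_band(scores, weights=None):
    """Combine per-category band scores into one overall band.

    `scores` maps category name -> band score; `weights` defaults to equal
    weighting. The result is rounded to the nearest half band.
    """
    if weights is None:
        weights = {c: 1.0 for c in scores}
    total_w = sum(weights.values())
    raw = sum(scores[c] * weights[c] for c in scores) / total_w
    return round(raw * 2) / 2  # nearest half band

sample = {
    "task achievement": 6.0,
    "coherence and cohesion": 5.0,
    "lexical resource": 6.0,
    "grammatical range and accuracy": 5.0,
}
print(analytic_band(sample))  # equal weights -> 5.5
```

The point of the breakdown is diagnostic as much as summative: a learner scoring 6.0 for lexical resource but 5.0 for coherence gets far more useful feedback than a single global mark.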
Testing Speaking
Despite its prevalence in communicative classrooms and the fact that learners often
cite the need to speak as a priority, the testing of speaking is often neglected. Why
do you think this might be?
The main problems when testing speaking are practicality and reliability. Formally
testing learners effectively usually involves dividing a group up into individuals or
groups of 2 or 3 and testing them at intervals, which obviously takes a lot of time.
Assessing learners’ speaking can also be very difficult as speaking is ephemeral and
therefore difficult to analyse in real time. Also, the examiner is often involved in
communication with the learner(s), or so intent on understanding the message of
what is being said that assessment of their speaking is not easy.
In assessing speaking we can use scales similar to those used in writing, i.e. either
holistic scales or analytic scales based on the speaking skill.
What categories could be used in an analytical scale for the speaking skill?
• Role-play
• Problem solving tasks
• Ranking tasks
• Debates
5 Problems with testing and alternative approaches
Testing has a number of potential drawbacks:
• Many tests do not accurately measure the skill(s) or use of the language
system(s) that they are intended to evaluate.
• They only give learners one shot at ‘getting it right’; informal continuous
assessment arguably gives a far more representative picture of a learner’s
abilities.
• Some learners simply ‘aren’t very good at tests’, maybe because of previous
testing experiences, nerves, attention span or because the way in which they
are being tested doesn’t match their learning style.
• Tests can result in negative backwash and therefore make lessons less
interesting/effective/relevant to learners’ needs.
The key to successful testing, i.e. testing that gives us accurate information
about our learners’ language abilities in a number of areas, is to test little
and often rather than setting one or two big tests at the end of units or
terms. Getting learners more involved in the
assessment process is another way of increasing its effectiveness. One way of doing
this is to get them to keep portfolios in which they could keep:
• test/mini-test results
• marked homework
• project work (may have been written as part of a group)
• audio-cassettes
• video-cassettes
• interesting articles/texts/song lyrics, etc. that the student has
found/read/understood
• compositions
• pages/extracts from a learner diary
• checklists/learned lists
• previous reports/evaluations by teachers, peers, or self
• lesson-redesigns; lesson analyses
• results of previous performance reviews
This ensures that the learners are responsible for keeping a varied and personal
record of their progress over a course, and shares the responsibility of keeping tabs
on it.
TASK 2
Are teachers the best people to judge whether a learner has made progress?
Shouldn’t the learner have some say in the matter? Andy Baxter lists 8 ways in which
the learners can be involved in assessment.
- Confidence ratings
- Checklists
- Learned lists
- Learner diaries
- Redesign and analyse a class
- Self-reports
- Student tests
- Clinics
Give and explain your answers to these questions on the discussion board. Then
read and discuss your colleagues’ postings.
6 List of key terms: Testing terminology answer key
direct testing – when the learner is asked to perform the skill that is being
tested

test reliability – the extent to which the same test given to the same learner
at a different time would produce the same score
NB Progress tests can be a type of formative assessment. Achievement tests are a type of
summative evaluation. The terms formative and summative are used to talk about evaluation
that provides information about both the learning and the teaching that has taken place,
whereas progress and achievement tests are designed primarily to test the effectiveness of
the learning.
7 Exam Practice
The text for this task is reproduced below. It is being used in the following
situation:
The group consists of six Czech learners all from the same company,
which provides financial services. There is a range of abilities within the
group, which is nominally Intermediate (CEFR B1-2). The stated needs
of the group are to improve their business vocabulary and spoken
fluency. The test was set as part of an end-of-term business English
test. There were other parts which tested the learners’ reading and
writing skills.
Using your knowledge of relevant testing concepts, evaluate the effectiveness of
the tasks for these learners in this situation.
Make a total of six points. You must include positive and negative points.
Progress Test – December 2010
Prepositions

Contracts
1. The / known / and / Seller / hereinafter / Buyer / parties / be / as / will /
The / The
2. contract / of / If / either / the / breaks / void / be / the / it / parties /
will / and / null
3. in / this / many / clauses / There / contract / are
Personal Finance
Write two words that can both go before the words on the right.
1. __________ / __________   account
2. __________ / __________   a mortgage
3. __________ / __________   money
4. __________ / __________   an invoice
Accounting
Match the words on the left with their definitions on the right.
1. amortization   a. record the lower price of an asset due to depreciation
2. write down     b. the current value of an asset
3. write off      c. the loss in value of an asset over time
4. book value     d. record the value of an asset as zero due to depreciation
8 Further Reading
Alderson, J. C., Clapham, C. & Wall, D. (1995) Language Test Construction and
Evaluation. CUP.
Allan, D. (1999) ‘Distinctions and Dichotomies: Testing and Assessment’. ETP,
Issue 11, April 1999.