Assessment: an evaluation process by which you give a score; the people who do it are
called ‘assessors’ – assessors are people who assess
Testing: refers to the process; a special kind of assessment; a variety of ways to check the
ability of the learners; gathering information about what students can do
Formal assessment / Informal assessment: informally collected information about what
students can do
Test: a set of questions; a kind of assessment
1. physical measurement:
e.g. How long is a desk? You measure it with a tape and read what the tape says.
the only thing you can argue about is accuracy (40 cm or 40.1 cm)
2. psychological measurement:
e.g. How can you measure intelligence?
IQ test: a list of questions (tasks, problem solving…) – depending on how successfully
you solved the problems, they will tell you a percentage of how intelligent you are
WE CAN’T TELL HOW INTELLIGENT A PERSON IS
same with any psychological property (language as well)
you can’t directly get to people’s intelligence
there’s a lot more uncertainty in this field
Classifications of tests:
What is tested?
Achievement testing
dependent on a document (syllabus, textbook, etc.)
there’s a set goal and you measure whether people achieved that goal
achievement goals – CEFR
test is based on expectations
Progress testing
subclass of achievement testing
progress-achievement testing
based on the reality, what actually happens
progress testing during the year
Why is testing done?
Placement testing
puts test-takers into pre-defined categories
this information can be available for the students
Diagnostic testing
provide a test → treatment
to inform the learners and the learners only! (dialang.com)
formative assessment: to help shape the teaching process
summative assessment: summarizing where they are
Testing for learning
Norm referenced
e.g. a race – what determines who’s the gold medallist: who’s the fastest compared to
the others; the winner can be slower than the slowest from last year
compare the test-takers to other test-takers
e.g. an entrance test
Criterion referenced
you need to meet a lot of criteria
e.g. learning to drive – driving test
e.g. a language exam
how many points do you need to meet the criteria?
criteria → qualitative description; result: a score (a number) → quantitative
How do the two meet? → by setting pass marks
inaccurate: there is no fully objective test → tests are produced by people, and people are not objective
objectively scorable test
no human judgment needed
e.g. multiple choice
often done by computers
subjective: writing task
there’s an element of human-judgement
done mainly by humans
semi-objective
there are item types where a little human judgement is involved, but not much
e.g. unlike multiple choice, there may be several acceptable answers to judge
Reliability
O = T + E: the observed score (O) is the true score (T) plus measurement error (E)
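A minimal simulation sketch of O = T + E; all numbers are invented, and the normally distributed error term (sd = 3) is an assumption for illustration, not part of the notes:

import random

# Classical test theory: observed score = true score + error (O = T + E).
random.seed(0)
true_scores = [55, 60, 72, 80, 90]                        # hypothetical T values
observed = [t + random.gauss(0, 3) for t in true_scores]  # O = T + E

for t, o in zip(true_scores, observed):
    print(f"T = {t:5.1f}   O = {o:5.1f}   E = {o - t:+5.1f}")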
content validity:
example task: connecting gate numbers and flight numbers (airport announcements)
listening: does it test listening comprehension? – It does: you need to
understand the gate and the flight number and demonstrate it by connecting
them
at C2 (CEFR)? – No. Why not? Too simple.
It doesn’t represent what I wanted to test (at C2)
You have to understand it at C2; the trouble is that this task doesn’t ask you to do
anything else. At A1 it is okay
problem: representativeness of the test content
how representative the test is of what you want to test
judgement is involved
construct validity:
construct: from psychology; construct as the definition of whatever you want to measure
to test intelligence, we need to define what it is, this is the construct
C2 level listening: this kind of test, under these circumstances, under this much time…
construct validity: whether the construct definition is valid or not
if the construct is not valid, the test is not valid
consequential validity:
whether the test has the desired consequences
If you produce a test, do you control the consequences? Not necessarily. The people who
produce TOEFL are not responsible for it being used as a university entry criterion. (If you
reach a certain % you get in.)
face validity:
how valid the test looks on the surface
people may get annoyed, anxious, nervous, angry if the test doesn’t look valid
how can a test look not valid? – it seems like it doesn’t test the ability
C-test: the second half of every 2nd word is deleted; reaction: people get scared (there’s so
much missing, I can’t do it) and think it’s rather a riddle to be solved – it doesn’t look like a
test that people are used to
yet it is a good rough measure of overall ability
this reaction comes from people who don’t know the test-making process
face validity is a serious issue (anxiety can introduce measurement error)
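A rough sketch of how such a C-test task could be generated, assuming the deletion rule described above (real C-tests also leave the first sentence intact, which this simplified version ignores):

def make_c_test(text: str) -> str:
    # Delete the second half of every 2nd word, replacing it with underscores.
    words = text.split()
    gapped = []
    for i, word in enumerate(words):
        if i % 2 == 1 and len(word) > 1:      # every 2nd word
            keep = (len(word) + 1) // 2       # keep the first half
            gapped.append(word[:keep] + "_" * (len(word) - keep))
        else:
            gapped.append(word)
    return " ".join(gapped)

print(make_c_test("Candidates often find this format strange because so much text is missing."))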
Can a test be reliable but not valid? Yes. (always 0 scores; always the same results)
Can a test be valid but not reliable? No.
reliability is a pre-requisite to validity; you have to know your test is reliable before you can make it valid
guessing: using multiple choice to test writing would be reliable, but it wouldn’t be valid (it
doesn’t test writing itself); ‘write about a topic of your choice’ – valid
fairness:
guarantee that everyone who takes the test has equal chances
e.g. females can get as many points as males → test bias
you want to avoid bias
there can be unfair consequences: testing the language ability of immigrants, but the test
was designed to keep immigrants out – unfair
it has to do with validity: if I want to keep you away, then what is the point of testing?
2 - Stages of test construction
1. test specifications: a detailed description of the test (the more detailed, the better); the
people who write the specifications are not necessarily the people who make the tests
you have to know:
What do you want to measure?
Who do you want to measure?
What type of language proficiency: what level, what specific skill(s) should be tested?
How do you want to test?
Why are you testing?
How much time is available? – How long should the test be?
(item/task banks: ready made tasks for measurement (not usually available))
construct-irrelevant variance: you measure something you didn’t want to measure
3 - Testing reading
What is a Construct? The definition of what you want to measure. Now: Construct of reading.
Components of reading:
identified as a skill
we can identify sub-skills:
how you understand a text
gist reading: understanding the overall sense of a text – skimming
reading for specific information: attention to every tiny detail – scan reading/scanning
lower level of text processing: grammatical relations
when you run into a word you don’t know: guess from the context
what sub-skill you’re testing should be in the construct
sub-skills of listening:
listening specific: the notion that spoken language differs from written language
when you have a text you can read it, re-read it (have time to work with it); with listening the
time is limited
most of the time you hear something once and it’s over
access to the text is limited
how the conditions in which you receive the text influence its accessibility
size of the room
environmental conditions (traffic noise)
these need to be considered when making a listening test
issues of accessibility:
3 types of format
1. write down the answers while listening / 2. you have time to write them down after listening
easy to forget the answer – not a listening issue
→ a threat to validity / violating validity
listening comprehension and memory – memory is not included in the listening test’s
construct (if you write down the answers after listening)
you can’t focus on the rest of the text (if you write down the answers while listening)
you’re forcing people to multitask – listen to the text, read the questions and write
down the answers
3. school-leaving exam: the text is played, then paused; you have time to write down
the answer, the text continues, pause…
not forcing people to remember
not forcing them to multitask
but: the test-taker knows that the answer will be in that part
lack of authenticity → lack of validity (it’s not like in real life)
while you are listening, you have to read the questions
read the questions before listening, getting familiar with the questions
you are supposed to give an answer while listening, but while writing one answer down
you shouldn’t already need to hear the next one → spacing between the questions
give candidates enough time to write down an answer before they need to
listen for the next one
how many times were you allowed to listen to the text?
1 or 2
1: authenticity (in real-life you only have 1 chance to listen)
in spoken language (when you’re speaking with someone) you can ask to repeat
something, to explain/paraphrase something; in listening you can’t do that
1: reliability problem: if there’s something (a noise, a honking outside) you can miss
the item, and you have no other chance
if not once, then how many times?
the more times you play the text, the easier it will be to answer the questions
at lower levels and with younger learners you may argue for playing it more than twice
most of the time listening tasks come from radio shows: not many people listen to the radio
should we have videos instead of audios? – authenticity
they can guess what will happen
multitasking: watching and writing
text selection:
speed of delivery
depends on the proficiency level; the higher the level, the more you are expected to
handle a text that is faster than normal
slower, faster – relative expressions
‘normal’ (in syllables per second) is itself a somewhat subjective category
slower than normal text: easier to understand; could be boring; validity – how valid is it
for testing candidates on understanding real-life listening?
faster than normal: you can’t understand if it’s too fast (but it happens in real-life)
faster than normal is not necessarily a problem
clear speech
there are many authentic settings where clarity is not the main concern
at higher levels, clear speech is not a main criterion
a lack of clarity should not be an obstacle to understanding
tone of sarcasm, ridicule
after a certain level you should be able to identify them
humour – maybe not everyone will get the joke; validity
authenticity
you can’t edit it like a reading text
the quality of the sound must be appropriate
item types:
open-ended questions
questions and answers should not be too long
not a reading task
length of the answers should be given
have enough time to write down the answer – answers spaced out in time
if they hurry, they scribble, and you won’t be able to read it
grammatical / spelling correctness shouldn’t be scored (nor in reading), as long as the
answer demonstrates comprehension
the shorter the better
gap filling
can be done in different ways
while listening you fill in the gaps: you see the script – you can guess the answers from
context
if it must be the exact same word that you hear – an objective hearing test rather than
a listening test; maybe you don’t understand the text and just put down something
you hear – hurting validity
true / false
multiple choice
hard to produce
extended reading load (there are options to read, not just a question – more troublesome)
keeping the answer as short as possible
matching
with pictures (picture-picture, picture-text): visual and audio information together
are OK
dependent on the format: if answers are to be given while listening → space out the
items: a solution for multitasking
5 - Testing writing
compare writing with the receptive skills – what differences arise when testing writing rather
than reading or listening?
reading/listening: everybody is expected to do exactly the same things – the same responses
are expected
writing: there is no way you can do that; if 2 candidates wrote exactly the same text, you
wouldn’t be happy; it has to be unique
assessment/scoring: when everybody gives the same answers, scoring is easy; with writing,
where every performance is unique, each script has to be scored separately → this can lead
to different scores
test writing:
task types:
How much would you allow test-takers to choose from among different tasks?
1. impressionistic marking:
assessor’s impressions
2. holistic assessment
assessment scale is needed
the assessor reads the script, decides which descriptor the piece of writing best matches,
and gives that score/grade
problems:
it may not fit into one category
the qualitative descriptions need to be interpreted – subjective
features of the writing fit into different categories (coherence 5, grammar 3, content
coverage…) → the assessor will make a subjective decision: “This is more important than
that…”
3. analytic assessment
assessment scale is needed
not a holistic scale, but an analytic scale
Why use a holistic scale? What can be a drawback of an analytic scale? (see the sketch below)
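A minimal sketch of how analytic band scores might be combined into one final mark; the criteria, bands and equal weights are hypothetical, not taken from any particular exam:

# One band score per criterion, combined into a single weighted mark.
criteria_scores = {"task achievement": 4, "coherence": 5,
                   "vocabulary": 3, "grammar": 3}
weights = {"task achievement": 1.0, "coherence": 1.0,
           "vocabulary": 1.0, "grammar": 1.0}

final = (sum(criteria_scores[c] * weights[c] for c in criteria_scores)
         / sum(weights.values()))
print(f"analytic score: {final:.2f} / 5")   # -> 3.75 / 5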
How is the testing of speaking different from the testing of listening?
formats:
interview format: 1 test-taker/candidate, 2 test providers (1 observer, 1 interacting person)
– the one interacting with the student = interlocutor
the job of the observer: scoring = assessor
3 people present
peer-to-peer / paired exam format: 2 test-takers/candidates have a conversation
1 interlocutor, 1 assessor → 4 people present
group exam: in low-stakes contexts (e.g. a classroom context)
1 interlocutor, 1 assessor (or the 2 are the same person)
computer-based format: you only have a candidate and an invigilator
the student just talks to the computer
a recording is made and later assessed by somebody
+/-:
interview: can cause anxiety, can be scary to candidates; your language is dependent on the
interlocutor – but they are trained to do this, so you can trust them
peer-to-peer: more comfortable for people; scores are dependent on the partner → one of
them is more/less talkative, more/less open → not about language proficiency
group: the problems of the paired exam, multiplied
talking to the computer: you do it a lot in real life: you talk to somebody through a computer
computer won’t help you if you are stuck; no one on the other end; no interlocutor
not authentic (validity)
interlocutor behaviour may be fundamental to how the performance comes out →
interlocutor behaviour DOs and DON’Ts:
Task types:
role plays
acting out a situation
the literature distinguishes role-play and simulation tasks
role play: you are playing a role: you are not acting as yourself
simulation: you are acting as yourself
if the situation you are acting out is not authentic – you are not acting as yourself – it is a
role play
it should approximate some sort of authentic setting in which people interact
(life-like)
key points to include: helps the candidate and the assessment
roles: in the interview format, balance between the roles – the candidate should speak
more; in a pair, both parties need to have their parts (balanced)
debate task
the task gives points – ideas for the debate, so you don’t have to come up with your own
paired: both candidates need to have equal parts/talking
information gaps to be filled
paired exams
if the other party does not give the information then you are stuck
picture descriptions
describe what’s in the picture
lower level
something that is very clear, not ambiguous; rather a simple line drawing than a
photograph
picture-based tasks
use the picture as a starting point
a lot will depend on the picture
What is a good picture like?
something obvious: the topic is given – gives control of the speaking
something ambiguous: they could talk about all sorts of things – no control
people may memorize a text and use it with an ambiguous picture
students would then have control over the speaking – this shouldn’t be allowed
the problem of creativity (the picture shouldn’t demand creativity – creativity is not part
of the construct)
a good picture: thought-provoking, yet obvious
series of images: to discuss a story
give pictures to depict the story
there’s a topic and you just have to talk about it, either with your partner or with the interlocutor
meaningful topics
candidates have an opinion, views or information about
if the language is okay you should just accept that
as an interlocutor you should not convince them about anything
controversial topics are not the best: if their view is not mainstream they may be
scared to share it; a view they don’t believe in is forced on them → negative effect
neutral but still accessible topics
the topic selection criteria that we discussed earlier apply here too
candidates prepare a presentation beforehand
if they memorize it → not representative of their language performance
authentic: you prepare and memorize presentations in real life too
let them memorize it, but then ask some questions
they prepare by memorizing chunks of language
alternatively, you could ask people to discuss something based on the information
that you provide
interpret it: mediating information
some inputs are difficult for some people to interpret (e.g. graphs require graph-reading
skills)
you have an interlocutor present and also an assessor
assessor: assesses
2 independent assessments which are then finalized (as in writing)
in speaking: 2 assessors – not in most cases (expensive, human-resource intensive)
the interlocutor and the assessor both provide an assessment
interlocutor behaviour: the interlocutor is not supposed to take notes → that’s a
challenge; the standard procedure is that they both score
holistic and analytic scales can both be used by them
the difference in terms of assessment here: it is not finalized during the performance or
straight after → it requires some sort of consensus score: they each come up with their own
score and then they discuss it; final score: a consensus-based score (OR: give more weight
to the assessor’s score → it’s their only job); the interlocutor cannot take notes, has to
focus on the interaction
the assessor uses an analytic scale; the interlocutor uses a holistic one
ideally there’s some sort of consensus score that will be the final score (see the sketch
below)
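A minimal sketch of one possible consensus-score calculation; weighting the assessor double is just the option mentioned above, not a fixed rule:

# Combine the interlocutor's (holistic) and the assessor's (analytic) scores.
def consensus_score(interlocutor: float, assessor: float,
                    assessor_weight: float = 2.0) -> float:
    return (interlocutor + assessor_weight * assessor) / (1 + assessor_weight)

print(consensus_score(4.0, 3.0))   # -> 3.33...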
reading is a receptive skill
In some tests there are separate tasks for grammar and vocab, in other tests there are not. Why?
mediation: information is presented in one form and then you have to put it in another
form (for skills)
grammar and vocab are not skills
when you are testing vocab and grammar, other skills will be tested as well (productive
skills?)
analytic scale: separate score for task achievement, coherence, vocabulary, grammar
so why a separate test?
the same is true for receptive skills: when you communicate, there’s no point in separate
tests for grammar and vocabulary → not functional; grammar and vocab are already tested
while communicating
since grammar and vocabulary are needed for all four skills, they are fundamental, so we
want to make sure that people have grammar and vocab
ways of testing grammar and vocabulary can be very communicative
formative function
feedback to the learning process
progress test: testing what you’ve taught (identified elements of grammar and vocabulary)
achievement test: what the syllabus says the learners should know (identified elements of
grammar and vocabulary)
proficiency test: test what the level defines (CEFR) – what the language learners should be
able to do
when it comes to grammar and vocab there’s a problem
How do we know what grammar and vocabulary is needed for the different
proficiency levels?
what the CEFR definitions should mean in particular languages in terms of grammar
and vocabulary
1st: German: Profile Deutsch
then: English Profile – access to the grammar and vocabulary profiles → check what sort
of grammar and vocabulary is required for the levels
English Profile – Vocabulary:
bat (animal) – a B1-level word → not accurate: Batman – you don’t have to know
English to know ‘bat’. Why B1? Those who constructed the English Profile looked at
exam performances and checked what sort of words are used in exam performances
– ‘bat’ appeared in a B1 performance
What does it mean to use a word? – To be able to use that word while writing or
speaking
→ productive knowledge only
what do we do when we want to test receptive knowledge?
we have to rely on it carefully
High/low level of frequency: high frequency – not so difficult; low frequency – more difficult
weta: an insect that is native to New Zealand → a high-frequency, low-difficulty
word in NZ; but anywhere else it’s a low-frequency, high-difficulty word
difficulty depends on the ‘where’
the word ‘Fortnite’: a game whose name is a respelling of ‘fortnight’ (= in 2 weeks’ time) –
a word used only in Britain → what is its difficulty level?
the difficulty level depends on the context!! (context-sensitive)
there are no objective means; some judgement is involved (see the sketch below)
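An illustrative sketch of context-sensitive, frequency-based difficulty; the frequency ranks and the threshold below are invented for illustration:

# Frequency-based difficulty is context-sensitive, so the lookup
# is keyed on (word, context). All ranks are made up.
freq_rank = {
    ("weta", "NZ"): 4_000,      # common in New Zealand
    ("weta", "UK"): 250_000,    # rare elsewhere
}

def difficulty(word: str, context: str) -> str:
    rank = freq_rank.get((word, context))
    if rank is None:
        return "unknown"
    return "low difficulty" if rank < 10_000 else "high difficulty"

print(difficulty("weta", "NZ"))   # low difficulty
print(difficulty("weta", "UK"))   # high difficulty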
some people say that grammar does not exist on its own → it becomes manifest through
vocabulary (they can’t be separated)
vocabulary in isolation does not exist → it comes hand in hand with some sort of
grammatical context
vocab and grammar form a joint construct → they should be tested together (not everyone
will agree)
you can score grammar and vocab separately – Why would you do that?
vocabulary is about memorizing vocabulary items (memory)
structures, grammar: not exclusively dependent on memory; you need to know
how elements are connected, in what contexts they are appropriate…
the CEFR doesn’t separate grammar and vocab: range of grammar and vocabulary and
accuracy of grammar and vocabulary
gap-filling (in reading: content-focused, not grammar) → here the gaps depend on the
grammatical structures
text type: authentic → validity; but: there are no authentic texts with that density of the
target structure
test constructors write a text full of that grammar structure → not authentic
it’s better to use authentic texts, but shorter sections/parts of them
use of the target structure is not mandatory; you can put the answer in a different form
a relatively communicative task (it communicates a message)
error correction and error identification
2 stage process: 1: find the error, 2: correct the error
stage 2 depends on stage 1
this leaves you with a dilemma: What if people find the error, but correct it into
something equally wrong? – score it partially?
tasks can be based on authentic texts
how often do you do this sort of activity – you do this when you are a teacher
multiple choice
the least communicative/authentic: you see a sentence with no context, in isolation
no context: not clear what’s wrong
the people who produce this have a context in mind, but test-takers don’t → risky
transformation tasks
rewrite the sentences
active voice to passive voice
several possible solutions / different ways to transform / alternative correct answers
the meaning of the sentences may change
in real life there’s a reason why something is in the active or passive voice, yet this task
implies you can change it at will
no authentic reason: if the meaning does not change, then why would you change the
sentence?
matching
find the end of the sentence based on the beginning of the sentence
you don’t do this in real life
test different types of grammar
Vocabulary:
MAKE SURE THEY PRODUCE THE VOCABULARY ITEM AND DON’T GIVE THEM THE SOLUTIONS
8 - Reporting scores
Type of test:
pass mark/standard/cut-off score: the score at and above which test-takers pass and below
which they do not pass
depends on the type of test: norm-referenced vs. criterion-referenced tests → how we
make sense of the test results
norm-referenced: comparing a test-taker’s results to other test-takers’ results (entrance
procedures)
criterion-referenced: check the performance and see if it meets the criteria
depending on the type of test, you need a different procedure to set the pass mark
Diagrams:
diagnostic test: checks students’ strengths and weaknesses → diagram: scores go
high–low–high–low → a jagged high/low profile
different students in the same group: one group scoring high and another scoring
low – some students learned, others didn’t – e.g. a vocab test (see the sketch after this list)
a group with more and less able students – the test shows the difference between them
pass/didn’t pass – a language exam
a straight-line diagram – an almost even distribution of ability within the group
(unlikely)
a single point in the diagram: everyone got the same score
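A small sketch of the two-peaked (bimodal) score pattern described above, drawn with invented data; a flat histogram would correspond to the even-distribution case:

import random
from collections import Counter

# Invented scores: some students learned the vocab (around 80),
# others didn't (around 30) -> a bimodal distribution.
random.seed(1)
scores = ([round(random.gauss(30, 4)) for _ in range(50)]
          + [round(random.gauss(80, 4)) for _ in range(50)])

def histogram(values, width=10):
    bins = Counter(v // width for v in values)
    for b in sorted(bins):
        print(f"{b * width:3d}-{b * width + width - 1:3d} | {'#' * bins[b]}")

histogram(scores)   # two separate peaks -> a high group and a low group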
Standard setting (the Basket Method)
general procedure:
basic question: At what CEFR level can a test-taker already answer the following item
correctly?
i.e. What is the minimum CEFR level required to give a correct response to this item?
panelists put each item in a “basket” corresponding to one of the relevant CEFR levels
candidates at that level and above are expected to give a correct response
a correct response is not required at lower levels
e.g. a 50-item test:
2 items in Basket A1
7 items in Basket A2
12 items in Basket B1
2 + 7 + 12 = 21 items to be answered correctly for B1 or higher
cut-off score: 21 (see the sketch after this list)
not always necessary to provide baskets for all levels
for B1 test: “lower than B1”, “B1” and “higher than B1”
if all items are B1, the cut-off score would be 100% – for B1 a candidate must
answer all the B1 items correctly → not fair: that’s not the minimum but the maximum for B1
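A minimal sketch of the Basket Method cut-off calculation, reproducing the worked example above:

# The cut-off for a level is the total number of items judged
# answerable at or below that level.
baskets = {"A1": 2, "A2": 7, "B1": 12}   # items placed per basket (example data)

cumulative = 0
for level in ["A1", "A2", "B1"]:
    cumulative += baskets[level]
    print(f"cut-off score for {level}: {cumulative}")
# -> cut-off score for B1: 21  (= 2 + 7 + 12)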
Angoff Method:
basic concept: the minimally acceptable / borderline person → just barely at the target
level
a borderline person at B1:
has the ability to be labelled B1
the slightest decrease in ability → no longer B1
task for the panel: keep the borderline person in mind for all judgements
for each item: What is the probability that the borderline person gives the correct answer?
e.g. 15 raters, 50 items: average the judgements per item, then sum the item averages →
the standard (see the sketch below)
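A minimal sketch of the Angoff calculation; the panel judgements below are randomly generated stand-ins for real data:

import random

# Each rater estimates, per item, the probability that the borderline
# person answers correctly. The cut-off is the sum of the per-item mean
# probabilities (the borderline person's expected score).
random.seed(2)
n_raters, n_items = 15, 50
judgements = [[round(random.uniform(0.2, 0.9), 2) for _ in range(n_items)]
              for _ in range(n_raters)]

item_means = [sum(r[i] for r in judgements) / n_raters for i in range(n_items)]
cut_off = sum(item_means)
print(f"cut-off score: {cut_off:.1f} out of {n_items}")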
Direct tests: speaking, writing
benchmarking
test-level statistics:
item-level statistics: