
Basic Concepts

Testing vs. Assessment

 Assessment: an evaluation process by which you arrive at a score; the people who do this are
called ‘assessors’
 Testing: a special kind of assessment; refers to the process; a variety of ways to check the
ability of the learners; gathering information about what students can do
 Formal assessment / Informal assessment: informal assessment = informally collected
information about what students can do
 Test: a set of questions  a kind of assessment

How can we place testing?

measurement: the quantification of observation; 2 types of measurement

1. physical measurement:
 e.g. How long is a desk? You measure it with a tape and read what the tape says.
 the only thing you can argue about is the matter of accuracy (40 cm or 40.1 cm)
2. psychological measurement:
 e.g. How can you measure intelligence?
 IQ test: a list of questions (tasks, problem solving…) – depending on how successfully
you solved the problems, you get a score that is supposed to show how intelligent you are
 WE CAN’T TELL HOW INTELLIGENT A PERSON IS
 same with any psychological property (language as well)
 you can’t directly get to people’s intelligence
 there’s a lot more uncertainty in this field

Classifications of tests:

What is tested?

 Proficiency test – testing proficiency


 you measure the ability without knowing the language learning history
 e.g. language exams
 whether people can meet the criteria of the exam or not

 Achievement testing
 dependent on a document (syllabus, textbook, etc.)
 there’s a set goal and you measure whether people achieved that goal
 achievement goals – CEFR
 test is based on expectations

 Progress testing
 subclass of achievement testing
 progress-achievement testing
 based on the reality, what actually happens
 progress testing during the year
Why is testing done?

 Placement testing
 puts test-takers into pre-defined categories
 this information can be available for the students
 Diagnostic testing
 provide a test treatment
 to inform the learners and the learners only! (dialang.com)
 formative assessment: to help forming the teaching progress
 summative assessment: summarizing where they are
 Testing for learning

How can you test?

 Direct vs. Indirect testing


 all psychological tests are indirect
 what seems direct is indirect  e.g. reading test – complete tasks  indirect
 the test problem: the questions may not really represent what the test should measure
 How can it be direct?  e.g. writing test – make them write a letter (still don’t know
how they write in general, performance, intelligence…)
 e.g. How to test pronunciation?
 direct: ask them to speak – indirect: write down the symbols or give a multiple choice
test

 Discrete point testing vs. Integrative testing


 Discrete Point Testing: identify individually scorable items
o item: smallest scorable unit in a test
o identify what each item is testing
 Integrative testing: no items, testing many things at the same time
o e.g. write a letter  vocab, grammar, spelling, etc.
o scored against a set of criteria
 NOT Discrete Point Testing  CLOZE TESTS
o looks like a gap filling test
o the deletions of words happen periodically
o hhbhb___evfszgf____vfgevh____
o originally every 7th word was deleted
o 1st and last sentence should be untouched
o what is tested: grammar, vocab, spelling…
o gives a measurement of overall language proficiency
o integrative: measures a lot of things at the same time
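As a rough illustration of the cloze procedure described above, here is a minimal Python sketch; the function name, the sentence splitting and the fixed 6-character gap are illustrative assumptions, while the every-7th-word deletion and the untouched first and last sentences come from the notes:

```python
import re

def make_cloze(text, nth=7):
    """Delete every nth word, leaving the first and last sentence untouched."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    if len(sentences) < 3:
        return text, []                      # too short to cloze safely
    first, middle, last = sentences[0], sentences[1:-1], sentences[-1]
    words = ' '.join(middle).split()
    deleted = []
    for i in range(nth - 1, len(words), nth):
        deleted.append(words[i])
        words[i] = '______'                  # fixed-length gap: length gives no hint
    return ' '.join([first, ' '.join(words), last]), deleted
```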

How to interpret the results:

Norm referenced vs. Criterion referenced

 Norm referenced
 e.g. a race – what determines who’s the gold medallist: who’s the fastest compared to
the others; the winner can be slower than the slowest from last year
 compare the test-takers to other test-takers
 e.g. an entrance test
 Criterion referenced
 you need to meet a lot of criteria
 e.g. learning to drive – driving test
 e.g. a language exam
 how many points you need to meet the criteria  the criteria are a qualitative description
 the result is a score (a number)  quantitative
 How do the two meet? By setting pass marks

Objective vs. Subjective tests

 inaccurate: no objective test  tests are produced by people, people are not objective
 objectively scorable test
 no human judgment needed
 e.g. multiple choice
 often done by computers
 subjective: writing task
 there’s an element of human-judgement
 done mainly by humans
 semi-objective
 there are item types where there’s a little bit of human judgement involved but not
many
 multiple choice – still acceptable answers

Reliability

When is a test reliable?

 you can predict how it will work


 it will work the same way every time
 tests work when people take them
 same test, students on different level  test doesn’t work the same way for them
 if you have a test taken by the same people without an effect from the earlier test you’ll get
the same results
 should work as much the same way as possible
 there are different levels of reliability

O=T+E

 O: observed score (ling. def. performance)


 T: true score (ling. def. competence)
 E: measurement error
 statistical definition (of the true score): the average of all possible observed scores (the test taken many times)
 measurement error: e.g. you don’t pay enough attention – negative number; positive
number: if you get lucky
 a perfectly reliable test: the amount of error is 0 (zero)
 How can we make a test more reliable? Reduce the possibility of errors – e.g. don’t use
multiple choice
 there’s no perfectly reliable test, there’s always an error
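A small numerical sketch of O = T + E: each administration adds a random error, and the average of many hypothetical observed scores approximates the true score, matching the statistical definition above. The true score of 70, the error spread of 5 and the number of administrations are invented for illustration.

```python
import random

random.seed(1)
true_score = 70                               # T: the unobservable true score (invented)
observed = [true_score + random.gauss(0, 5)   # E: random measurement error
            for _ in range(10_000)]           # hypothetical repeated administrations

# the average of all possible observed scores approximates T
print(round(sum(observed) / len(observed), 1))   # close to 70
```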
Validity

 e.g. ID cards – photo no longer shows the reality


 a test is valid if you can demonstrate that it measures what it intends to measure
 e.g. it’s not a writing test to fill out an “ID” card: name, place of birth, address, etc.  it tests
the ability to read

content validity:
 comprehension: gate numbers and flight number connecting?
 listening: does it test listening comprehension? – It does: you need to
understand the gate and the flight number and demonstrate it by connecting
them
 at C2 (CEFR)? – No. Why not? Too simple.
 It doesn’t represent what I wanted to test (at C2)
 You have to understand it at C2 too; the trouble is that this task doesn’t ask you to do
anything else. At A1 it is okay
 problem: representativeness of the test content
 how representative the test is of what you want to test
 judgement is involved

criterion related validity:


 criterion: we compare tests to each other
 criterion measure: another test
 I take a test of what I want to measure, give it to test-takers, I have 2 results, and
I compare them
 if we find them similar: my test must be valid because it measures the same thing as the
other test
 What might be a problem with this approach?
 How do we know that the other test is valid?
 Just because 2 test results are similar, does that automatically mean that they
measure the same thing?
 We compare the scores of our test with a highly recognized test
 on its own it doesn’t prove anything, but it can add to validation

construct validity:
 construct: from psychology; construct as the definition of whatever you want to measure
 to test intelligence, we need to define what it is, this is the construct
 C2 level listening: this kind of test, under these circumstances, under this much time…
 construct validity: whether the construct definition is valid or not
 if the construct is not valid, the test is not valid

consequential validity:
 whether the test has the desired consequences
 If you produce a test, do you control the consequences? Not necessarily. The people who
produce TOEFL are not responsible for its being used as a criterion for university admission. (If
you reach a certain %, you get in.)
face validity:
 how valid the test looks on the surface
 people may get annoyed, anxious, nervous, angry if the test doesn’t look valid
 how can a test look not valid? – it seems like it doesn’t test the ability
 C-test: the second half of every 2nd word is deleted; reaction: people get scared (there’s so much
missing, I can’t do it), they think it’s rather a riddle that has to be solved – it doesn’t look like a test that
people are used to
 good rough measure of overall ability
 the reaction comes from people who don’t know the test-making process
 face validity is serious (measurement error)

relationship between reliability and validity:

 Can a test be reliable but not valid?  Yes. (always 0 scores; always the same results)
 Can a test be valid but not reliable?  No.
 reliability is a pre-requisite to validity; you have to know your test is reliable before you can show it is valid
 guessing: multiple choice to test writing: would be reliable; but it wouldn’t be valid (doesn’t
test writing itself); write about a topic of your choice – valid

fairness:
 guarantee that everyone who takes the test has equal chances
 e.g. females can get as many points as males  test bias
 you want to avoid bias
 there can be unfair consequences: testing the language ability of immigrants, but the test was
designed to keep immigrants out - unfair
 has to do with validity: If I want to keep you away then what is the point of testing?
2 - Stages of test construction

How should you start to put a test together?

1. test specifications: detailed description of the test (the more detailed, the better); the people who
write the specifications are not necessarily the people who make the tests
 you have to know: What do you want to measure?
 Who do you want to measure?
 What type of language proficiency: what level, what specific skill(s) should be tested?
 How do you want to test?
 Why are you testing?
 How much time is available? - How long should the test be?

(item/task banks: ready made tasks for measurement (not usually available))

2. collecting source materials


 for a reading task you need a text (same with listening)
 for a writing task: input text (e.g. a job advertisement – write a job application letter)
 speaking: e.g. pictures
 grammar: take a piece of authentic language (a real-life thing, 100% authentic,
something that is already there) – validity is guaranteed best by authenticity

3. write items / start producing the tasks


 gap filling: can take many shapes and forms
 tests language proficiency
 opportunities to put in whatever could fit in – test comprehension
 negative: to test specific words: not good if they give another one (because
the text allows that – not valid)
 1 correct answer: can be easily corrected
 banked gap filling (putting words in a box)
 15 answers – 15 gaps: at least 1 doesn’t measure anything (the last answer falls
into place by elimination)
 distractors: distract those candidates who are not proficient enough from
the correct answers
 test-takers very rarely guess at random (e.g. they work out which options are nouns)
 advantage: you don’t expect productive language use, you help them;
assessment can be 100% objective – correcting is faster
 multiple choice
 adv.: objective – can be assessed very quickly; fast to score
 adv.: you can use it for testing almost anything (even for speaking,
pronunciation)
 disadv.: test-takers can guess – they don’t blind guess, they use test-taking
strategies – difficult to counterbalance
 What can you do against that? Give more options. – but it takes more
time
 distractors: they have to be wrong, but they have to look credible
 fix for guessing: increasing the number of items (not the options) / tasks (see the
sketch after this list)
 BUT: it’s not really feasible to make a very long test – it would take forever to
make and take; people get tired – measurement error  you lose reliability –
and then the test can’t be valid
 Yes/No or True/False
 adv.: easy to score; objective; easy to produce (easier than multiple choice)
 disadv.: guessing
 tend to be unreliable  correct guessing - measurement error
 asking candidates why it is true or false  more time-consuming, much less objective
 fix: put in a 3rd option (usually not mentioned) – not in the text / we don’t
know
 problem: if it’s not in the text – against the construct you are trying to test
 at higher levels: if the text only implies something, then what is the correct answer?
 in textbooks: for classroom practice, not for test-takers!!!!!
 open-ended questions
 test-takers can’t really guess them (except yes/no questions)
 can be used for different purposes: comprehension, reading, writing,
spelling…
 easier to produce than multiple choice questions
 not objective  harder to score
 threat to reliability (there are alternative correct answers)
 ask as precise questions as you can
 detailed set of instructions for the scorers
 we don’t have an answer key because of the alternative correct answers
 difficult to judge: you may not be able to read the answers  computer-
based tests
 matching
 objective  scoring is easy and fast
 you can guess  give 1 or 2 more answers than the number of questions  not
overwhelming, doesn’t take a long time
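To see why increasing the number of items (rather than the number of options) limits the effect of guessing, here is a hedged sketch using a simple binomial model of blind guessing; the 4 options, the 60% pass mark and the item counts are invented for illustration, and real test-takers of course guess better than pure chance.

```python
from math import comb, ceil

def p_pass_by_guessing(n_items, n_options=4, pass_ratio=0.6):
    """Probability of reaching the pass mark by blind guessing alone."""
    p = 1 / n_options
    need = ceil(pass_ratio * n_items)
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(need, n_items + 1))

# more items -> passing by guessing alone becomes far less likely
for n in (5, 10, 20, 40):
    print(n, round(p_pass_by_guessing(n), 4))
```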
4. take your own test (and write an answer key in parallel)
 to check whether the tasks are good or not
 you may not recognize your own mistakes
5. ask a fellow professional to take the test
 if you want feedback, accept the feedback
6. pre-testing
 also called piloting
 you administer the test to a group of test-takers whose abilities are similar to your
target group
 easier said than done
 students talk to each other, may tell each other the questions
 finding a target group is time consuming
 a school-leaving exam is never pre-tested: everyone who could be a part of the pre-testing
group is part of the target group who will take the test; also a matter of security
 what can be done: a longer live test where only some of the questions count and the rest are
pre-test items; candidates have to take it all seriously
7. moderation
 analyse the data
 sometimes everything is good
 more often: a few problems  go back to an earlier stage of test construction and modify the
task  you don’t know how the modified task works  pre-testing again, …
Threats to validity:

 construct irrelevant variance: you measure something you didn’t want to measure
 construct underrepresentation: you don’t measure everything you wanted to measure
3 - Testing reading

What is a Construct? The definition of what you want to measure. Now: Construct of reading.

Components of reading:

 identified as a skill
 we can identify sub-skills:
 how you understand a text
 gist reading: understanding the general sense of a text – skimming
 reading for specific information: attention to every tiny detail – scan reading/scanning
 lower level of text processing: grammatical relations
 when you run into a word you don’t know: guess from the context
 what sub-skill you’re testing should be in the construct

Collecting source materials: texts - Criteria for text selection:

 good texts are authentic


 authentic: already existing text; not made for language learners or testing; written for a real-
life communication purpose
 authentic texts make the test more valid
 a text can’t be used in a fully authentic form
 there may be words that make it more difficult – modify them  semi-authentic
 shouldn’t be too short or too long
 conditions: time, proficiency of the test-takers, number of items, level appropriate
 shouldn’t be longer/shorter than it needs to be
 level is a combination of a text and a task (a difficult text with easy tasks doesn’t work)
 the level depends on the kind of text and what you have to do with that text
 you don’t need to understand everything of a text
 topic of the test matters
 when you need to read something you are not interested in, you can’t stay focused
 go for something that is interesting (or at least not boring)
 boring  more mistakes  measurement error  loses reliability  loses validity
 you can’t have a test that is equally interesting for everybody
 try to choose moderately interesting topic
 avoid sensitive topics (death, murder, abuse, diseases, politics, addiction, religion,
divorce, sex, war…) that could upset test-takers
 when we upset people, it’s causing measurement error … (also not ethical)
 any topic can be upsetting for some people
 there’s no topic that is 100% safe
 get rid of the topics that can be upsetting to a high % of people
 What kind of topics? Facts, popular science, arts, music…
 texts whose content is already known (part of background knowledge)  should be
avoided
 don’t use a text that is easily accessible
Item writing for reading tests:

 matching paragraphs: some paragraphs are missing, put them in


 you don’t need to understand the text if there’s a link between the previous paragraphs
and the following  avoid it
 if you can use a paragraph in more than 1 gap, it’s wrong
 sequencing
 putting paragraphs into the correct order
 if the correct sequence is ABCD you have 100% score; BCDA – 0 score  not valid
 more complicated algorithm that accepts the sequence of the paragraphs
 open-ended questions
 don’t ask for opinions
 implicit meanings / implied messages – at higher level more likely
 multiple choice
 takes forever to construct, to come up with credible wrong answers – distractors
 obvious answer – lifted from the text  except if you lift every answer from the text
 paraphrasing: rewrite the message with different words
 paraphrased options shouldn’t be more difficult than the original text
 decrease the reading load

Should focus on reading and not something else!


4 – Testing listening

reading and listening: receptive skills

sub-skills of listening:

 gist-listening / listening for the gist


 listening for details / listening for specific information
 differences in how accessible the text is in listening and in reading

listening specific: the notion that spoken language is different from written language

listening specific:

 understanding standard/non-standard language


 speed more proficient: handling more rapidly delivered speech
 tone of voice, and how it carries meaning
 decoding the pattern
 where information is located in the text

listening is different from speaking:

 when you have a text you can read it, re-read it (have time to work with it); with listening the
time is limited
 most of the listening you hear something and it’s over
 access to the text is limited
 the conditions in which you receive the text influence the accessibility of the text
 size of the room
 environmental conditions (traffic noise)
 these need to be considered when making a listening test

issues of accessibility:
 3 types of format
 1. write down the answers while listening /2. you have time to write it down after listening
 easy to forget the answer – not a listening issue
 threat to validity / violating validity
 listening comprehension and memory – memory is not included in the listening test’s
construct (if you write down the answers after listening)
 you can’t focus on the rest of the text (if you write down the answers while listening)
 you are forcing people to multitask – listen to the text, read the questions and write
down the answers
 3. school leaving exam: play the text and then the text is paused, have time to write down
the text, text continues, pause…
 not forcing people to remember
 not forcing them to multitask
 the test-taker knows that the answer will be in that part
 lack of authenticity  lack of validity (it’s not like in real-life)
 while you are listening, you have to read the questions
 read the questions before listening, getting familiar with the questions
 you are supposed to give an answer while listening, but while writing down one answer you
shouldn’t already need to be listening for the next one  spacing between the questions
 you should give enough time to write down an answer before candidates need to
listen for the next answer
 how many times were you allowed to listen to the text?
 1 or 2
 1: authenticity (in real-life you only have 1 chance to listen)
 in spoken language (when you’re speaking with someone) you can ask to repeat
something, to explain/paraphrase something; in listening you can’t do that
 1: reliability problem: if there’s something (a noise, a honking outside) you can miss
the item, and you have no other chance
 if not once, then how many times?
 the more times you play the text, the easier it will be to answer the questions
 lower levels, younger learners you may argue to play it more than twice
 most of the time listening tasks come from radio shows: but not many people listen to the radio
 should we have videos instead of audios? – authenticity
 they can guess what will happen
 multitasking: watching and writing

text selection:

 interesting, but not distressing


 not heard yet
 relevant
 accent: standard? 1: Which one? 2: All the time?
 1: American English is nothing like RP (Received Pronunciation)
 in Europe British English is the more accepted accent/variety – the British control the book
market
 but: learners get more American input than British – the Americans control the media market
 what have the candidates been exposed to?
 unfamiliar variety: won’t have the same chances as with a more familiar variety
 exposure can be taken into account in progress tests, …
 proficiency tests: we don’t care what they had been exposed to (we don’t know
them)
 offer more than 1 variety/standard – Which ones?
 British, American – most common varieties
 not always the case: TOEFL (Test of English as a Foreign Language – American)
 whether your English is good enough to study in America
 IELTS: the British counterpart of TOEFL (International English Language Testing System)
 accent: Do we always need standard speakers?
 real speakers don’t speak the standard exactly – they approximate the standard
 standard, non-standard  level issue: higher – deal with non-standard language use
 for the sake of validity
 background noise
 depends on the level: the higher the level, the more background noise you should be
able to deal with
 test-makers: don’t want background noise, don’t want to disturb
 100% sterile text, no noise: doesn’t happen in real-life
 non-dominant, non-distractive background-noises are okay (scribbling with pens)

 speed of delivery
 depends on the proficiency level; higher the level the more you are expected to
handle a text that is faster than normal
 slower, faster – relative expressions
 normal: syllables per second – subjective category
 a slower than normal text: easier to understand; could be boring; validity – how valid
is it for testing candidates’ understanding of real-life listening
 faster than normal: you can’t understand if it’s too fast (but it happens in real-life)
 faster than normal is not necessarily a problem
 clear speech
 there are many authentic settings where clarity is not the main concern
 at higher levels: clear speaking is not a main criterion
 a lack of clarity should not be an obstacle to understanding
 tone of sarcasm, ridicule
 after a certain level you should be able to identify them
 humour – maybe not everyone will get the joke; validity
 authenticity
 you can’t edit it like a reading text
 quality of the sound is appropriate

item types:

 open-ended questions
 questions and answers should not be too long
 not a reading task
 length of the answers should be given
 have enough time to write down the answer – answers spaced out in time
 if they hurry, they scribble, and you won’t be able to read it
 grammatical / spelling correctness shouldn’t be scored (neither in reading), as
long as the answer shows comprehension
 the shorter the better
 gap filling
 can be done in different ways
 while listening you fill out the gaps: you see the script – guess the answers from
context
 it must be the exact same word that you hear – objective  a hearing test rather than
a listening test; maybe you don’t understand the text, you just put down something that
you hear – hurting validity
 true / false
 multiple choice
 hard to produce
 extended reading load (options to read not just a question – more troublesome)
 keeping the answer as short as possible
 matching
 with pictures (picture-picture, picture-text): visual and audio information together
are OK
 dependent on the format: answers to be given while listening  spacing out the
items: solution for multitasking
5 - Testing writing

compare: writing and the receptive skills – what sort of differences might arise when testing writing
compared with testing reading or listening

 reading/listening: everybody is expected to do the exact same things – same responses are
expected
 writing: no way you can do that; if 2 candidates wrote exactly the same, you wouldn’t be
happy; has to be unique
 assessment/scoring: when everybody gives the same answers, scoring is easy; with writing,
where every performance is unique, each has to be scored separately - this can lead to different scores

test writing:

 ask them to write something


 writing abilities tested without writing anything: error correction tasks are sometimes used
 if you can correct someone’s writing it means that you can write better than that
 types of errors would need to reflect the things that could go wrong in writing
 uncommon: difficult to do – content validity, how would it be representative?

task types:

1. test taker is expected to write a story of sorts


 if test takers are not creative, they can’t write anything – writing abilities can’t be tested
 what do we want to test? (grammar, vocab, creativity…) – construct: description of
what we want to measure
 is creativity part of the construct when someone has to write a story in another language? No.
 you require test takers to be creative when it’s not part of the construct
 construct underrepresentation, construct irrelevant variance
 give an abstract of a story / sketch / outline
 they have to read it – reading task
 when there’s some input text, there needs to be some reading, and reading is not
part of the construct
 people may do worse because they misunderstood something in what they had to
read - construct irrelevant variance
 give them pictures – series of pictures to depict the story
 images don’t speak for themselves – they can be misunderstood
 different stories allowed? – depends on the criteria
 task completion – has the candidate done what they had been asked to do?
 have you done this bit of the story and this one…
 What if they learn a story by heart?
 problem of alternative interpretations
 people from the Middle East may not interpret the pictures from left to right  indicate the sequence of the
pictures
 Why would you ask people to write a story?
 How authentic is this? authenticity – validity
 most people don’t write stories (may read, but not write)
 it isn’t a real-life task
 test-takers don’t mind doing this, but only a small fraction of it is real-life like
2. writing an e-mail – e.g. a complaint e-mail
 details are needed (why, to whom)
 reading is included
 provide all this information in the test-takers native language
 in real life it just happens, you don’t need to read it
 what if the test-takers don’t all share the same native language
 forcing them to code-switch
 ask people to read something that is very simple language (simpler than what you expect
them to use)
 what if the target level is A2? you can’t guarantee that everyone understands
 lifting: when you use some of the prompt, you lift it up and put it in the text
 with the native language no lifting is possible
 if you’ve seen one, you’ve seen them all: it has a structure that is typical of the letter of
complaint
 how authentic is to write an e-mail with pen and paper
 write an e-mail to your friend: we don’t write e-mails to our friends
 instant messages: different language is used – reduced (acronyms e.g., no grammar, no
punctuation); what can we score in this?  problem of assessment; instant messaging is
more like speaking than writing; these are short messages, so little language, little to assess

 what can we assess?


 real-life writing:
 applying for a job – motivational letter: you’ve written one, you’ve written them all
(set form)
 research papers: would be too long for a test
 journals (if someone writes it) – it’s not for other people, it’s for yourself
 in real-life there’s very few contexts that you can call ‘writing’
 people don’t write letters anymore - that genre is gone
 essay: popular task, but who writes essays in real life? – students write them, but nobody
else  for students it’s a real-life task, but for non-students it’s not; it happens at home;
it wouldn’t be authentic to do it in school  problem of validity

3. write a comment on someone’s blog


 comments could be longer than a few sentences
 reader’s opinion of this issue… write your opinion
 provide bullet points to be covered

How much would you allow test-takers to choose among different tasks?

 why would you do this?


 to allow people to choose something that they feel can better relate to
 importance of topic choice just as with the reading
 don’t offer upsetting topics
 personal preferences
 be careful: the topics should be at the same level; the writings should be structurally the same
 bad idea according to some studies: lack of standardization
assessment of writing:

1. impressionistic marking:
 assessor’s impressions

2. holistic assessment
 assessment scale is needed

1 (grade/band)   description of that band
2                description of that band
3                description of that band
4                description of that band
5                description of that band

 assessor reads the script and decides which descriptor the piece of writing belongs to, put
that score/grade on it
 best matching description and gives that score
 problems:
 it may not fit into one category
 the qualitative descriptions need to be interpreted – subjective
 features of the writing fit into different categories (coherence 5, grammar 3, content
coverage…)  assessor will make a subjective decision “This is more important than
that…”

3. analytic assessment
 assessment scale is needed
 not a holistic scale, but an analytic scale

      coherence    content coverage    etc.
1
2
3
4
5

 several descriptions, more options to score (coherence, content coverage)


 much more reliable

Why do they use a holistic scale? What can be a drawback of an analytic scale?

 more work to do an analytic one


 you would need to read the description for each
 no set criteria with the analytic
 analytic takes longer, more time consuming
 when you read a text in real life you don’t analyse it, it’s not authentic
 what we assess in real life: an overall impression  holistic scale
 or: include overall impression as a separate criterion in the analytic scale
6 - Testing Speaking

How is the testing of speaking being different from the testing of listening?

 each performance is meant to be unique


 candidates are not left alone
 we have an influence on the performance during the testing
 Choice:
 will influence the performance and the assessment of the performance
 not all candidates do the same task
 1st student would tell it to the others / hear each other speaking
 they would give very similar answers
 isolation of test-takers from one another
 solution: different test-takers do different tasks
 different tasks, but the measurement is the same
 assessment
 of writing: when everybody is done, assessment will begin
 of speaking: you can do it afterwards if there’s a recording – not the usual way 
spoken language is fleeting – it’s spoken and then it’s gone
 What sort of recording would we need to record every single performance? – audio
recording is not fully sufficient  non-verbal communication: video recording
 recording is expensive and more time consuming  listening again
 not a bad idea to record: proof in case of an appeal
 assessment of speaking: on the spot; meant to happen during the exam – has to do
with examiners remembering what happened

How do oral tests need to be conducted?

formats:
 interview format: 1 test-taker/candidate, 2 test providers (1 observer, 1 interacting person)
– one of them interacting with the student = interlocutor
 job of the observer: scoring = assessor
 3 people present
 peer to peer format / paired exam format: 2 test-takers/candidates have a conversation
 1 interlocutor, 1 assessor  4 people present
 group exam: in low stakes contexts (e.g. a classroom context)
 1 interlocutor, 1 assessor (or the two are the same person)
 computer-based format: you only have a candidate and an invigilator
 the student just talks to the computer
 a recording is made and later assessed by somebody

+/-:
 interview: can cause anxiety, can be scary to candidates; your language is dependent on the
interlocutor – they are trained to do this, you can trust them
 peer to peer: more comfortable for people; scores are dependent on the partner  one of
them may be more talkative / less talkative; more open / less open  not about language proficiency
 group: problems of the paired exam multiplied
 talking to the computer: you do it a lot in real life: you talk to somebody through a computer
 computer won’t help you if you are stuck; no one on the other end; no interlocutor
 not authentic (validity)
interlocutor behaviour may be fundamental to how the performance comes out - interlocutor
behaviour, DOs and DON’Ts:

 stick to the appropriate level


 we want to hear people speak
 different ways in which we speak (do a talk; hold a presentation vs. speaking as an
interaction)
 when it’s performance: interlocutor is sitting back
 interaction: don’t place too much listening burden on the student – make sure that
you are always understood  level appropriate
 be authentic (if the task is authentic, you have a part in it to stay authentic)
 the interlocutor frame: tells the interlocutor what to say and what not to say
 some of them are more permissive
 questions are standardized – not really authentic
 many of them are somewhat flexible (with the questions)
 feedback (non-verbal): if they give specific feedback, it may influence your performance in a
way  interfering: you are making it look better than it usually is
 be kind, friendly, easing on the tension, BUT not to give specific feedback
 feedback/note-taking can also influence the performance in a negative way
 they do this  most of them are teachers and it’s a part of their job
 the interlocutor has to pay attention to the candidate(s)
 interlocutor does too much talking (DON’T) – personality of the interlocutor
 the interlocutor and the assessor should not interact at all during the exam

Task types:

 role plays
 acting out a situation
 part of the literature: role-play / simulation task
 role play: you are playing a role: you are not acting as yourself
 simulation: you are acting as yourself
 situation you are acting out is not authentic – you are not acting as yourself – role
play
 it should approximate some sort of authentic setting in which people interact (life-
like)
 key points to include: helps the candidate and the assessment
 roles: interview format  balance between the roles – candidate should speak
more; pair: both parties need to have their parts (balanced)
 debate task
 has points – ideas for the debate, so you don’t have to come up with your own
 paired: both candidates need to have equal parts/talking
 information gaps to be filled
 paired exams
 if the other party does not give the information then you are stuck
 picture descriptions
 describe what’s on the picture
 lower level
 something that is very clear, not ambiguous, rather a drawing (line drawing – simple)
and not a picture
 picture-based tasks
 use the picture as a starting point
 a lot will depend on the picture
 What is a good picture like?
 something obvious: the topic is given – gives control of the speaking
 something ambiguous: they could talk about all sorts of things – no control
 people may memorize a text and use it with an ambiguous picture
 students may have control over the speaking – shouldn’t be allowed
 problem of creativity (the picture shouldn’t require creativity to talk about – creativity is not part of the
construct)
 good picture: thought-provoking, something obvious
 series of images: to discuss a story
 give pictures to depict the story
 there’s a topic and just have to talk about it either with your partner or the interlocutor
 meaningful topics
 candidates have an opinion, views or information about
 if the language is okay you should just accept that
 as an interlocutor you should not convince them about anything
 controversial topics are not the best: if their view is not mainstream they may be
scared to share it; or a view they don’t believe in is forced on them  negative effect
 neutral but still accessible topics
 the topic selection criteria that we discussed earlier apply here as well
 candidates prepare a presentation beforehand
 memorize it  not representative of their language performance
 authentic: you prepare and memorize it also in real-life
 memorize it, but then we will have some questions
 prepare by memorizing chunks of language
 alternatively, you could ask people to discuss something based on the information
that you provide
 interpret it: mediating information
 some sources are difficult to interpret (e.g. graphs require graph-reading skills)
 you have an interlocutor present and also an assessor
 assessor: assesses
 2 independent assessments which are then finalized (as with writing)
 in speaking: 2 assessors  not in most cases (expensive, human-resource intensive)
 the interlocutor and the assessor both provide an assessment
 interlocutor behaviour: the interlocutor is not supposed to take notes  that’s a
challenge, yet the standard procedure is that they both score
 holistic and analytic scale can be used by them
 difference in term of the assessment here: not during the performance or straight
after  require some sort of consensus score: they come up with their own score
and then they discuss, final score: consensus-based score (OR: give more weight to
the assessor’s score  it’s their only job); interlocutor cannot take notes, has to
focus on the interaction
 assessor uses an analytic scale; interlocutor uses a holistic one
 ideally there’s some sort of consensus score that will be the final score

7 – Testing grammar and vocabulary

Do we need separate tests for testing grammar and vocabulary?

In some tests there are separate tasks for grammar and vocab, in other tests there are not. Why?

 mediation: information is presented in one form and then you’ll have to put it in another
form (for skills)
 grammar and vocab are not skills
 when you are testing vocab and grammar, your other skills will be tested too (productive
skills)
 analytic scale: separate scores for task achievement, coherence, vocabulary, grammar
 so why a separate test?
 the same is true for receptive skills: when you communicate, there’s no point in separate tests to
test grammar and vocabulary  not functional; grammar and vocab are already tested while
communicating

Why some tests still have a vocabulary and grammar component?

 since grammar and vocabulary are needed for all four skills  they are fundamental, so we want to
make sure that people have the grammar and vocab
 ways of testing grammar and vocabulary can be very communicative

Presume: We want to test grammar and vocabulary – How should it be done?

 formative function
 feedback to the learning process

What exactly to test?

 progress test: testing what you’ve taught (identified elements of grammar and vocabulary)
 achievement test: what the syllabus says the learners should know (identified elements of
grammar and vocabulary)
 proficiency test: test what the level defines (CEFR) – what the language learners should be
able to do
 when it comes to grammar and vocab there’s a problem
 How do we know what grammar and vocabulary is needed for the different
proficiency levels?
 what the CEFR definitions should mean in particular languages in terms of grammar
and vocabulary
 1st: German: Profile Deutsch
 then: English Profile – access to grammar and vocabulary profile  check what sort
of grammar and vocabulary is required for the levels
 English Profile – Vocabulary:
 bat (animal) – B1 level word  not accurate: Batman – you don’t have to know
English to know bat  Why B1: those who constructed the English Profile looked at
exam performances and checked what sort of words are used in exam performances
– bat was in a B1 performance
 What does it mean to use a word? – To be able to use that word while writing or
speaking
 productive knowledge only
 so what do we do when we want to test receptive knowledge?
 You have to rely on it (the Profile) carefully
 High/low level of frequency: high frequency – not so difficult; low frequency – more difficult
 weta: an insect that is native to New Zealand  it is a high frequency, low difficulty
level word in NZ; but anywhere else it’s a low frequency, high difficulty level word
 difficulty depends on the ‘where’
 the word Fortnite (the game)  spelling difficulty: fortnight = in 2 weeks  used mainly in
Britain  what is its difficulty level?
 difficulty level depends on the context!! (context sensitive)
 no objective means, but there’s some judgement involved

Whether grammar and vocabulary should be tested together?

 some people say that grammar does not exist on its own  it becomes manifest through
vocabulary (they can’t be separated)
 vocabulary in isolation does not exist; it comes hand in hand with some sort of grammatical
context
 vocab and grammar form a joint construct - should be tested together (not everyone will
agree)
 you can score grammar and vocab separately – Why would you do that?
 vocabulary is about memorizing vocabulary items (memory)
 structures, grammar is not exclusively dependent on memory, you need to know
how they are connected, in what context they are appropriate…
 the CEFR doesn’t separate grammar and vocab: range of grammar and vocabulary and
accuracy of grammar and vocabulary

How can we test grammar?

 gap-filling (in reading it is content focused, here grammar focused)  gaps dependent on the grammatical
structures
 text type: authentic – validity  but: there are no texts with that degree of density of the target structure
 test constructors write a text full of that grammar structure  not authentic
 it’s better to use authentic texts, but shorter sections/part of them
 use of that structure is not mandatory, you can put it in a different form
 relatively communicative task (communicates the message)
 error correction and error identification
 2 stage process: 1: find the error, 2: correct the error
 stage 2 depends on stage 1
 leaves you with the dilemma: What if people find the error, but they correct it into
something equally wrong? – partially score it?
 tasks can be based on authentic texts
 how often do you do this sort of activity – you do this when you are a teacher
 multiple choice
 least communicative/authentic: you see a sentence with no context, in isolation
 no context: not clear what’s wrong
 people who produce this have a context in mind, but test-takers don’t  risky
 transformation tasks
 rewrite the sentences
 active voice to passive voice
 more possible solutions / different ways to transform / alternative correct answers
 the meaning of the sentences changes
 in real life there’s a reason why something is in active or passive voice, and this task tells you
that you can change it at will
 no authentic reason: if the meaning does not change, then why would you change the
sentence?
 matching
 find the end of the sentence based on the beginning of the sentence
 you don’t do this in real life
 test different types of grammar

Vocabulary:

 receptive and productive knowledge of vocabulary  it needs to be considered which one is to be
tested

testing receptive knowledge of vocabulary:

 list of words, give the definition for each


 if we ask them to write a definition – it’s more than understanding the meaning of
that word; there are words that we understand but find difficult to define
 practical difficulties: handwriting, partly accurate
 usually done in a closed format: there are definitions and match them
 find the odd one out
 alternative correct answers  risky for this reason
 fun in a class: discuss the alternative solution
 multiple choice: synonyms, antonyms (means the opposite)
 teacher says the word in their native language, test-takers write it in the target language
 the native language equivalent is not always an exact equivalent
 what if the students don’t all share the same native language
 matching: definitions, synonyms, antonyms, images (for lower level)

testing productive vocabulary:

 there’s a definition and test-takers have to write down the word


 alternative correct answers – problem
 to avoid it: provide a lot of words to choose from  but then they can guess it; that tests receptive
knowledge and not productive
 provide the first letter / last letter / number of letters of the solution
 gap filling
 same problems, same solutions
 write down what you can see on the pictures
 same problems, same solutions
 use the words in sentences

MAKE SURE THEY PRODUCE THE VOCABULARY ITEM AND NOT GIVE THEM SOLUTIONS
8 - Reporting scores

What kind of scoring / types of scores:

 In Hungary: 1-5 scale  a grade


 percentage and/or the actual score - the actual score that a candidate gets for their test: raw
score
 extra tasks for extra scores – Good idea?
 if you are fast, you can do them, if you’re not, you can’t  if you are good, you’ll get
more points; if you are not that good, you can’t get extra points
 I get a better score than I should have
 artificially generating measurement error
 Why does it happen? – motivating for students / rationale is motivation
 for in class tasks only, not tests!!
 equally weighted tasks
 1st task: 10 items; 2nd: 15 items; 3rd: 20 items
 with raw scores: the 3rd one will be twice as important as the 1st one
 weight the item scores: 1st task: 2 points per item, 3rd task: 1 point per item
 change the weighting at the task level
 before adding up the scores, I multiply the scores from task 1 by 2 (2x10)
 convert the original scores onto a common scale
 4 tasks – 25 points each  tasks become comparable
 score conversion: convert the scores onto a different scale
 What happens to the original scores?
 2 types of score conversion:
1: convert the scores to a scale that is longer than the original scale
2: convert the scores to a scale that is shorter than the original scale
 putting on a 100 points scale  percentage
 Why % better than points?
 we all are familiar with percentages; percentages are present in every aspect of
life
 What happens if the new scale is longer? e.g. 200 points
 to show importance
 compare the 2 scenarios:
 What happens to the original information?
 putting from a shorter to a longer  information will be easier to interpret
(original information will still be there)
 longer to shorter  original information gets lost (different points, but same
grade)
 score aggregation: conversion of scores onto a common scale
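A minimal sketch of the weighting and conversion ideas above: raw task scores are put onto a common scale (25 points per task, as in the example) and then converted to a percentage. The raw scores themselves are invented; the 10/15/20 item counts come from the notes.

```python
raw_scores = {"task1": 8, "task2": 11, "task3": 14}     # invented raw scores
max_scores = {"task1": 10, "task2": 15, "task3": 20}    # 10/15/20 items as above

# convert each task onto a common 25-point scale so the tasks weigh equally
converted = {t: raw_scores[t] / max_scores[t] * 25 for t in raw_scores}

total = sum(converted.values())                 # out of 75
percentage = total / 75 * 100                   # conversion onto a 100-point (%) scale
print({t: round(s, 1) for t, s in converted.items()}, round(percentage, 1))
```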

reporting scores:

 sub-tests: sets of tasks


 language exam: 4 sub-tests for the 4 skills  How do you score?
1. combine (add up) the sub-test scores  equally weighted or not (conversion); a final pass mark
is established – passed or not (see the sketch after this list)
 disadvantage: the overall pass mark is 60%  you scored 100% on one skill, 10% on
another skill, 100% on another, 60% on the 4th skill
 altogether: 270% divided by 4 = 67.5%, more than 60%  you would pass
 for a candidate it’s okay
 from a measurement point of view: one of the skills is way below the expected level
 compensatory system: sub-test performances can compensate one another
2. use the opposite of this system – non-compensatory system
 candidates have to pass all sub-tests to pass the whole exam
 you pass the individual components
 if you fail one of them, you fail the whole thing
 language exam: candidates wouldn’t be happy with a system like this
 if all the individual pass marks are justified, a non-compensatory system would
give the most accurate results regarding language proficiency
3. partly-compensatory systems:
 you don’t have to pass all the tests to pass the whole exam
 4 components, if you pass 3 you pass the whole thing
 you don’t need to pass all the sub-tests, there are no individual pass marks, 1 pass
mark for the whole, but you have to reach a certain minimum score on each sub-
test to pass (e.g. at least 40%)
 compensation: writing may compensate for your listening
 we can report the final score in a variety of ways
 pass/fail, or details down to the last sub-test
 total results only

 detailed scoring
 for development
 for diagnostic purposes
 diagnostic test: find out about the test takers strengths and or weaknesses and
report back to the test takers about that
 formative assessment: to form, shape the learning process – channelling back the
information to the learners – more detailed information the better
 Who do test results need to be reported to?
 test-takers
 authorities, administrative body that takes care of results  they need a final end
result
 electronic registry  limited information – grade
 the scores, and how and to whom to report them, depend on the purpose of the test!
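The sketch promised above contrasts the three aggregation systems, using the 100/10/100/60 example from the notes; the 60% overall pass mark, the 60% per-sub-test pass mark and the 40% minimum are the figures mentioned above, everything else is illustrative.

```python
def passes(scores, system, overall_pass=60, subtest_pass=60, minimum=40):
    """scores: sub-test percentages for the four skills (illustrative)."""
    average = sum(scores) / len(scores)
    if system == "compensatory":           # strong skills can offset weak ones
        return average >= overall_pass
    if system == "non-compensatory":       # every sub-test must be passed
        return all(s >= subtest_pass for s in scores)
    if system == "partly-compensatory":    # one overall pass mark plus a per-skill minimum
        return average >= overall_pass and all(s >= minimum for s in scores)
    raise ValueError(system)

scores = [100, 10, 100, 60]                # the example from the notes (average 67.5%)
for system in ("compensatory", "non-compensatory", "partly-compensatory"):
    print(system, passes(scores, system))  # True, False, False
```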

How scores emerge?


 different item types – how they are scored
 How is a multiple-choice test scored?
 check the answers and compare to the answer key – objective type of task / scoring
 objectively scored tasks: use a key and score
 subjectively scored tasks: some sort of scale is used for scoring  scoring is done based on
qualitative descriptors  not scoring, rather rating (the performance)
 a difference of 1 point means something different where rating happens rather than scoring
 numerically it is 1 point
 qualitatively it carries a meaning
 in a rating process: meaningful
 in scoring: a 1-point difference is not terribly meaningful
 rating process: in high-stakes contexts: double marking  writing: never a single assessor that
assesses each paper, 2 independent assessments will be made; if the two differ too much  bring in a 3rd
rating – that’ll be the final rating (someone whose judgement is supposed to be more
credible)
 inter-marker reliability: check how similar the different ratings are
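A sketch of the double-marking idea above: two independent ratings, with a third rating deciding when the first two are too far apart. The one-band tolerance and the averaging of close ratings are assumptions for illustration, not part of the notes.

```python
def final_rating(rating_a, rating_b, third_rating=None, max_gap=1):
    """Double-marking sketch: a third, more credible rating is final when the
    two independent ratings differ by more than the (assumed) acceptable gap."""
    if abs(rating_a - rating_b) <= max_gap:
        return (rating_a + rating_b) / 2       # assumed consensus-style resolution
    if third_rating is None:
        raise ValueError("a third rating is needed")
    return third_rating

print(final_rating(4, 5))                      # 4.5
print(final_rating(2, 5, third_rating=4))      # 4: the third rating is final
```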
9. – Setting up pass marks

Type of test:

 pass mark/standard/cut-off score: score at and above which test-takers pass and below they
do not pass
 depends on the type of test: norm-referenced and criterion-referenced tests  how we
make sense of the test results
 norm-referenced: comparing the test-takers’ results to other test-takers’ results (entrance procedures)
 criterion-referenced: check the performance and see if it meets the criteria
 depending on the type of test, you would need different procedure to set up pass marks

Diagrams (score distributions):

 diagnostic test: check students’ strengths and weaknesses  the diagram of scores goes high 
low  high (peaks and valleys)
 different students in the same group: a group who score high and another one scoring
low – some students learn, others don’t – e.g. a vocab test
 a group with more and less able students – the test shows the difference between them
 pass/didn’t pass – language exam
 straight-line diagram – an almost even distribution of ability within the group
(unlikely)
 a single point in the diagram: everyone got the same score

Norm-referencing

 pass mark should be in distribution valleys


 (valleys and peaks of the score distribution)
 reality is like dots in a diagram: there are scores that nobody got – blank gaps  put the pass mark
there

standard setting in indirect tests:

 for tests with a numerical score, performance standards to be set


 receptive skills (reading, listening)
 underlying competences (grammar, vocabulary)
 performance standard
 boundary (cut-off score) between two levels on the scale – standard setting
 decision if a person has reached a given CEFR level is based on grading, not on scoring
(marking)
 score must be transferred to a grading scale
 transformation of scores to grades based on a cut-off score on a test
 cut-off score
 border between lowest acceptable score for relevant CEFR level and highest score to fail
that level

How to arrive at Standards?

 group decision (panel)


 not an objectively made decision
 a group  won’t be objective, but not as subjective as if a single person made the
decision
 group is familiar with CEFR
 test content specified in terms of the CEFR
 standard setting procedures formalized
 careful selection and training of panel members

general procedures:

 you do it after you have empirical data of the results


 length: 2-3 days (including familiarization)
 2-3 rounds with discussion in between
 discuss the decision, supporting points of the decision, convince each other, think
about it… in rounds
 information on the panel members’ opinion  how many people agree
 information on candidate behaviour: a panelist may have thought an item was hard but it turns
out to be an easy one
 effects of decisions
 documentation needed to judge validity of procedure

Standard setting - Basket method:

 basic question: At what CEFR level can a test taker already answer the following item
correctly?
 What is the minimum CEFR level required to give a correct response to this item?
 panelists to put each item in a “basket” corresponding to one of the relevant CEFR levels
 candidates at higher levels to give correct response
 correct response is not required at lower levels

Basket Method - Conversion of Judgements to Cut-off Scores:

 a 50-item test
 2 items in Basket A1
 7 items Basket A2
 12 items Basket B1
 2+7+12 = 21 items to be answered correctly for B1 or higher
 cut-off score: 21
 not always necessary to provide baskets for all levels
 for B1 test: “lower than B1”, “B1” and “higher than B1”
 if all items are B1: cut-off score would be 100% - for B1 level a candidate must
answer all the B1 items correctly  not fair, not the min. but the max. for B1
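The basket-method conversion above in a few lines of code (the B2/C1 item counts are invented to fill out the 50-item test; the A1/A2/B1 counts and the resulting cut-off of 21 come from the example):

```python
baskets = {"A1": 2, "A2": 7, "B1": 12, "B2": 18, "C1": 11}   # 50 items in total

def cut_off(target, baskets, order=("A1", "A2", "B1", "B2", "C1")):
    """Items judged at or below the target level must all be answered correctly,
    so the cut-off score is their cumulative count."""
    return sum(baskets[level] for level in order[:order.index(target) + 1])

print(cut_off("B1", baskets))   # 2 + 7 + 12 = 21
```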

Angoff Method:

 basic concept: minimally acceptable person / borderline person  just barely at the target
level
 borderline person at B1
 has the ability to be labelled B1
 slightest decrease in ability: no longer B1
 task for panel: keep borderline person in mind for all the judgements
 for each item: What is the probability that the borderline person gives the correct answer?
 15 raters, 50 items: average the estimated probabilities across raters and items  the standard
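A toy sketch of the Angoff calculation: each panellist estimates, per item, the probability that the borderline person answers correctly; the item averages are then summed (or averaged into a percentage) to give the standard. The three raters and four items below are invented; the notes mention 15 raters and 50 items.

```python
# judgements[rater][item] = estimated probability that a borderline candidate
# answers the item correctly (toy data: 3 raters, 4 items)
judgements = [
    [0.90, 0.60, 0.40, 0.70],
    [0.80, 0.50, 0.50, 0.60],
    [0.85, 0.55, 0.45, 0.65],
]

n_items = len(judgements[0])
item_means = [sum(r[i] for r in judgements) / len(judgements) for i in range(n_items)]
cut_off = sum(item_means)                       # expected score of the borderline person
print(round(cut_off, 2), "out of", n_items)     # this becomes the standard
```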
Direct tests: speaking, writing

benchmarking

 panel of judges: trained, and so on


 they need a selection of actual performances, e.g. essays
 judges to decide to select performances that they think are just barely at the target level, e.g.
B2
 agree on this
 identified the performances that are just at the target level  what scores these
performances were given  that’s the pass mark score
10 – Item and test analysis

Classical item and tests statistics

 test-level statistics: statistic that refer to the test as a whole


 item-level statistics: about the test items themselves

test-level statistics:

 measures of central tendency  centeredness of test scores


 mean (average of scores)
 the average alone can hide that there were high scores, low scores and everything in between
 a very high or very low mean: the test was very easy or very difficult
 mode (most frequent score) – top of the curve (diagram)
 where the mode is: why the test was easy or not
 alone it is not very informative
 bar chart: tallest bar – mode
 median (midpoint of scores)
 12, 15, 19, 20, 38, 38, 50 (3 lower, 3 higher)
 12, 15, 19, 20, 21, 38, 38, 50  20,5
 in itself not very formative
 measures of dispersion (how much they are spread out)
 range (difference between top score and bottom score)
 small range: the test results are very similar; in a group of people it is unlikely that the
abilities of the test-takers are all the same
 big range: the test has been able to show the differences which we know are there
 standard deviation (average deviation of scores from the mean)
 deviation: difference between the score and the mean
 deviations may be positive or negative: square them, average the squares, and take the square root
 big: scores are spread out
 small: scores are clustered together
 by these we can see how the test as a whole works
 alpha: matter of reliability – how reliable the test is
 SEM: standard error of measurement
 reliability
 2 sets of results to compare
 rank order correlation: rank the test-takers on both sets of results
 1st: best, 2nd: 2nd best…
 measure of rank order correlation (Spearman): rho = 1 – (6 × Σd²) / (N × (N² – 1))
 d: difference between a test-taker’s ranking on the two tests
 Σ (sigma): sum – add up all the squared differences and multiply that by 6
 N: number of test-takers
 +1 is the highest and -1 is the lowest possible value of rank order correlation
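The test-level statistics above computed on the toy score set from the median example, plus the rank-order (Spearman) correlation as reconstructed above; the two rankings at the end are invented.

```python
from statistics import mean, median, mode, pstdev

scores = [12, 15, 19, 20, 21, 38, 38, 50]       # the 8-score median example above

print("mean:", mean(scores))
print("mode:", mode(scores))                    # most frequent score (38)
print("median:", median(scores))                # 20.5
print("range:", max(scores) - min(scores))      # top score minus bottom score
print("standard deviation:", round(pstdev(scores), 2))

def spearman_rho(rank_a, rank_b):
    """Rank order correlation: rho = 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print("rho:", spearman_rho([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))   # 0.8 (invented rankings)
```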

item-level statistics:

 facility value / proportion correct (p value): the proportion of test-takers who got the item correct


 discrimination index: how much an item can show the difference between those who can
and those who can’t
 sample separation: you take the total population and rank them by their total
score on the test – divide them into groups: top, middle, bottom – compare the top and bottom
groups on each item – Ebel’s D
 correlation: comparing 2 sets of data and seeing how similar they are
 biserial correlation
 total scores and the scores on this item – check how similar they are
 how much the item is in harmony with the whole test
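A sketch of the item-level statistics on toy response data (1 = correct, 0 = incorrect): the facility value and a top-group/bottom-group discrimination index in the spirit of Ebel's D. The responses and the group size of 3 are invented; a (point-)biserial correlation of item scores with total scores would serve the same comparison purpose as described above.

```python
# responses[candidate][item]: 1 = correct, 0 = incorrect (toy data: 8 candidates, 4 items)
responses = [
    [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0],
    [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
n_items = len(responses[0])

# facility value (p value): proportion of candidates answering the item correctly
facility = [sum(row[i] for row in responses) / len(responses) for i in range(n_items)]

# discrimination: rank candidates by total score, then compare the proportion
# correct in the top group with the bottom group (group size of 3 is an assumption)
ranked = sorted(range(len(responses)), key=lambda c: totals[c], reverse=True)
top, bottom = ranked[:3], ranked[-3:]
discrimination = [sum(responses[c][i] for c in top) / len(top)
                  - sum(responses[c][i] for c in bottom) / len(bottom)
                  for i in range(n_items)]

print("facility:", facility)
print("discrimination:", [round(d, 2) for d in discrimination])
```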
