
Test usefulness

Model of test usefulness


(Bachman & Palmer, 1996)
• Provides a framework for evaluating the whole process of test
development and use
• Emphasizes testing as a contextualized practice.
• Includes 6 test qualities

Face validity: acceptable (test-taking time)

Content validity: weak. Section 1 focuses on only one grammatical feature; Section 2: the collected sample is sufficient to make inferences about ability to write...

Construct validity: weak
Model of test usefulness

Usefulness = Reliability + Construct Validity + Authenticity + Interactiveness + Impact + Practicality

Operationalizing principles
1. Maximizing overall usefulness, rather than individual test qualities
2. Interdependence of the qualities
3. Appropriate balance is context-dependent
Reliability
Reliability

• Reliability refers to the consistency of estimation of candidates’ abilities.
• A reliable test score will be consistent across different occasions, settings and versions of a test.
Consistencies across different sets of test task characteristics

Examples:
• Same test - 2 different occasions
• 2 interchangeable forms of a test
Reliability: refers to the consistency of test results

• Will a student get almost the same score on a test taken one day and again on another day?
• Will one teacher score the assessment the same way as another?
• If using two test forms, will students perform comparably on both tests?
Factors contributing to the unreliability of a test

• Student-related reliability
• Rater reliability
• Test administration reliability
• Test reliability
Factors contributing to the unreliability of a test

Student-related reliability
• Temporary illness
• Physical or psychological factors

Rater reliability
• Subjectivity, bias
• Inter-rater reliability (inconsistent scores across raters)
• Intra-rater reliability (uneven judgment by the same rater)

Test administration reliability
• Conditions in which the test is administered

Test reliability
• Ambiguous test items
• Unanticipated correct answers
• Test is too long
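Rater reliability can be checked empirically. As a minimal sketch (all band scores below are hypothetical), the exact-agreement rate between two raters scoring the same ten scripts on a 1–5 scale:

```python
# Inter-rater reliability sketch: two raters band-score the same ten
# scripts on a 1-5 scale. All scores are hypothetical.
rater1 = [4, 3, 5, 2, 4, 3, 5, 1, 4, 3]
rater2 = [4, 3, 4, 2, 4, 3, 5, 2, 4, 3]

# Proportion of scripts on which the two raters agree exactly.
agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"exact agreement: {agreement:.0%}")  # prints "exact agreement: 80%"
```

A low agreement rate would point to rater training or a more detailed scoring key as remedies.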
Types of reliability

• Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.

• Example: A test could be given to a group of students twice, with the second administration perhaps coming a day later. The obtained correlation coefficient would indicate the stability of the scores.
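The correlation step can be sketched in Python. This is a minimal illustration with hypothetical scores, not data from the slides:

```python
# Test-retest reliability sketch: correlate the scores from two
# administrations of the same test. Scores are hypothetical.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

time1 = [65, 72, 80, 58, 90, 75]  # first administration
time2 = [68, 70, 82, 60, 88, 77]  # same students, one day later

r = pearson_r(time1, time2)
print(f"test-retest reliability: r = {r:.2f}")
```

A coefficient close to 1 indicates stable scores across the two occasions.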
Types of Reliability

• Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.

• Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
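The random split in the example can be sketched as follows. The item labels are hypothetical; in practice the two forms would then be administered to the same group and the scores correlated:

```python
import random

# Parallel-forms sketch: randomly split a 40-item bank into two
# 20-item forms probing the same construct. Item labels are hypothetical.
random.seed(0)  # fixed seed so the split is reproducible

item_bank = [f"item_{i:02d}" for i in range(1, 41)]
shuffled = random.sample(item_bank, k=len(item_bank))
form_a, form_b = shuffled[:20], shuffled[20:]

assert len(form_a) == len(form_b) == 20
assert not set(form_a) & set(form_b)  # no item appears on both forms
```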
Types of reliability

• Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.

• Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.

• Cronbach’s alpha is another widely used index of internal consistency.

• Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
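Cronbach’s alpha can be computed directly from per-item scores. A minimal sketch with hypothetical data (4 items, 5 students, 1–5 scale):

```python
from statistics import pvariance

# Internal-consistency sketch: Cronbach's alpha.
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)

def cronbach_alpha(item_scores):
    """item_scores: one list per item, one entry per student."""
    k = len(item_scores)
    totals = [sum(student) for student in zip(*item_scores)]  # per-student totals
    sum_item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Hypothetical scores: 4 items probing the same construct, 5 students.
items = [
    [3, 4, 5, 2, 4],
    [3, 5, 5, 2, 3],
    [4, 4, 5, 1, 4],
    [3, 5, 4, 2, 4],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Values near 1 indicate that the items behave consistently, i.e. they appear to probe the same construct.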
Reliability

• Quality of test scores: unreliable scores do not constitute sound evidence for making inferences about test takers’ language ability => we cannot draw any conclusions.

• Reliability is a necessary condition for construct validity, and thus for test usefulness.

• Inconsistencies cannot be eliminated entirely, but measurement errors should be minimized.
Scenario – Elementary education

Mawuse teaches English at an elementary and middle school. Parents are very enthusiastic about the school’s high standards and international focus, and students are generally eager to excel in their classes. Mawuse administers achievement tests to his students every quarter and he often finds himself explaining the results to parents and students.

How could Mawuse handle each of the situations below? Get out a sheet of paper. Using what you’ve learned about reliability, brainstorm possible answers to these questions.
• What we have to do is construct, administer and score tests in such a way that the scores
actually obtained on a test on a particular occasion are likely to be very similar to those
which would have been obtained if it had been administered to the same students with
the same ability, but at a different time.
• The more similar the scores would have been, the more reliable the test is said to be.
How to make the tests more reliable
1. Take enough samples of behaviour.
2. Exclude items which do not discriminate well between weaker and stronger students.
3. Do not allow candidates too much freedom.
4. Write unambiguous items.
5. Provide clear and explicit instructions.
6. Ensure that tests are well laid out and perfectly legible.
7. Make candidates familiar with format and testing techniques.
8. Provide uniform and non-distracting conditions of administration.
9. Use items that permit scoring which is as objective as possible.
10. Provide a detailed scoring key.
11. Train scorers.
12. Identify candidates by number, not name.
13. Employ multiple, independent scoring.
#1 Take enough samples of behavior

• Other things being equal, the more items that you have on a test,
the more reliable that test will be
• One thing to bear in mind, however, is that the additional items
should be independent of each other and of existing items.
• Each additional item should as far as possible represent a fresh
start for the candidate. By doing this we are able to gain
additional information on all of the candidates — information
that will make test results more reliable.
#2 Exclude items which do not discriminate well between
weaker and stronger students

• Items on which strong students and weak students perform with similar degrees of success (items that are too easy or too difficult) contribute little to the reliability of a test.
• A small number of easy, non-discriminating items may be kept at the beginning of a test to give candidates confidence and reduce the stress they feel.
#3 Do not allow candidates too much freedom

• In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones that they have chosen.
• The more freedom that is given, the greater is likely to be the difference between the performance actually elicited and the performance that would have been elicited had the test been taken, say, a day later.
• In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted.
#3 Do not allow candidates too much freedom

Compare the following writing tasks:

1. Write a composition on tourism.
2. Write a composition on tourism in this country.
3. Write a composition on how we might develop the tourist industry in this country.
4. Discuss the following measures intended to increase the number of foreign tourists coming to this country:
   i) More/better advertising and/or information (Where? What form should it take?)
   ii) Improved facilities (hotels, transportation, communication, etc.)
   iii) Training of personnel (guides, hotel managers, etc.)

Which task is likely to yield the most reliable scores?


#4 Write unambiguous items

• Candidates should not be presented with items whose meaning is not clear, or to which there is an acceptable answer which the test writer has not anticipated.
• If an individual candidate might interpret a question in different ways on different occasions, the item is not contributing fully to the reliability of the test.
• If this task is approached in the right way, most of the problems can be identified before the test is administered. Pre-testing of the items on a group of people comparable to those for whom the test is intended should reveal the remainder.
#5 Provide clear and explicit instructions

• This applies both to written and oral instructions.
• A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions.
• The use of colleagues to criticise drafts of instructions (including those which will be spoken) is the best means of avoiding problems. Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.
#6 Ensure that tests are well laid out and perfectly legible

• Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on the unwanted tasks will lower the reliability of a test.
#7 Make candidates familiar with format and testing
techniques

• If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would do otherwise (on subsequently taking a parallel version, for example).
• For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them.
• This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.
#8 Provide uniform and non-distracting conditions of
administration

• Great care should be taken to ensure uniformity.
• For example, timing should be specified and strictly adhered to; the acoustic conditions should be similar for all administrations of a listening test.
• Every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.
#9 Make comparisons between candidates as direct as
possible

• Candidates should not be given a choice of items, and they should be limited in the way that they are allowed to respond.
• Scoring compositions that are all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests.
• The scoring should be all the more reliable if the compositions are guided, as in the example in the earlier section, ‘Do not allow candidates too much freedom’.
#10 Provide a detailed scoring key

• This should specify acceptable answers and assign points for acceptable partially correct responses.
• For high scorer reliability the key should be as detailed as possible in its assignment of points.
• It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism.
#11 Train scorers

• This is especially important where scoring is most subjective.
• The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions from past administrations accurately.
• After each administration, patterns of scoring should be analysed.
• Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.
#12 Identify candidates by number, not name

• Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score.
• Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given.
• For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given.
• The identification of candidates only by number will reduce such effects.
VALIDITY
Validity

• “A test is said to be valid if it measures accurately what it is intended to measure.” (Hughes, 2003, p. 26)

• “the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226) (cited in Brown, 2004)
Reflection
The teacher asks students to write as many words as they can in 15 minutes, then counts the words for the final score.

Practical? Reliable? Valid?
Reflection

Practical? Yes! Easy to administer.

Reliable? Yes! Quite dependable scoring.

Valid? No! It takes no account of comprehensibility, rhetorical discourse elements, organization of ideas, etc.
Evidence of validity

Construct validity

Content validity

Criterion-related validity

Face validity

Consequential validity
Construct validity

• Construct validity pertains to the meaningfulness and appropriateness of the interpretation that is made on the basis of test scores. (Bachman & Palmer, 1996)

• Construct: any underlying ability or trait that is hypothesized in a theory of language ability.
• Justify the meaning of scores by means of evidence and logic.
Construct validity

Language knowledge is defined as constructs
• E.g. the ability to read involves the ability to guess the meaning of unknown words from the context.

Definition of a construct is based on theories of language knowledge and competence
• E.g. a theory of writing tells us that underlying writing ability involves control of punctuation and sensitivity to demands on style;
• Or oral proficiency: pronunciation, fluency, grammatical accuracy, vocabulary use, sociolinguistic appropriateness.

Domain-specific constructs
• E.g. the ability to communicate in specific workplaces

A. Construct validity

Compatibility of the defined construct and target language use (TLU) tasks
• Correspondence between test tasks and TLU tasks: the construct measured by test tasks should reflect the construct defined by, or underlying, TLU tasks.
• Appropriate representation of the construct in test tasks; avoidance of construct under-representation or construct-irrelevance.
• Performance measures (i.e. test scores) are indicators of corresponding degrees of language competence or task performance.
• Generalizing test performance to the TLU domain.
Reflection on
Reliability & Construct validity

Suppose, for example, that we needed a test for placing individuals into different levels in an academic writing course. How about a multiple-choice test of grammatical knowledge?
Reflection

Reliability: Yes (consistent scores)

Construct validity: No (not sufficient to justify using this test as a placement test for a writing course)
Reflection

• Grammatical knowledge is only one aspect of the ability to use language to perform academic writing tasks.
=> inappropriately narrow
• The construct involved in the TLU domain—ability to perform academic writing tasks—involves other areas of language knowledge.
B. Content validity

“A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned.” (Hughes, 2003)
Content invalidity - Examples

• If trying to assess students’ ability to speak English in a conversational setting
→ asking students to answer paper-and-pencil multiple-choice questions requiring grammatical judgments
→ does not achieve content validity

• Or, if a course has 10 objectives
→ but only two are covered in a test
→ it does not achieve content validity
Discussion

• A quiz on English articles for high-level beginners in a conversation (listening & speaking) class.
• Students had completed a unit on zoo animals and had engaged in some open discussions and group work in which they had practiced articles, all in the listening and speaking modes of performance.
English Articles Quiz
Comments

• The quiz uses a familiar setting and focuses on previously practiced language forms
=> some content validity
• However, it was administered in written form, and required students to read the passage and write their responses
=> quite low content validity for a listening & speaking class
C. Criterion-related validity

“the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability” (Hughes, 2003, p. 27)
=> This independent assessment is the criterion measure against which the test is validated.
Criterion-related validity
Examples

Test to be validated → Criterion measure
• The results of one teacher’s unit test → a commercially produced test in a textbook
• Scores of a test on grammar in communicative use → (a) subsequent behaviour, or (b) other communicative measures of the grammar
Criterion-related validity
2 kinds

Concurrent
• Results supported by other concurrent performance
• E.g. a high score on the final exam of a foreign language course is substantiated by actual proficiency in the language
Concurrent criterion-related validity
Example scores:

10-min test | 1-hr test
7.5 | 8
6.6 | 7
6 | 6
8 | 8
6.5 | 6
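Concurrent validity is typically reported as the correlation between the test and the criterion measure. A minimal sketch using the five score pairs from the slide:

```python
# Concurrent validity sketch: correlate the 10-minute test scores with
# the 1-hour criterion measure from the slide.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

short_test = [7.5, 6.6, 6, 8, 6.5]  # 10-minute test
criterion = [8, 7, 6, 8, 6]         # 1-hour criterion measure

r = pearson_r(short_test, criterion)
print(f"concurrent validity coefficient: r = {r:.2f}")  # ≈ 0.93
```

A high coefficient suggests the short test agrees closely with the dependable criterion measure.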
Criterion-related validity
2 kinds (continued)

Predictive
• Assesses or predicts a test taker’s likely future performance
• E.g. placement tests, language aptitude tests
Face validity (FV)

• Face validity (FV) refers to the degree to which a test looks right, and appears to measure what it is supposed to measure.

• FV is hardly a scientific concept and is rarely tested empirically, yet it is important because a test without FV may not be accepted by candidates, teachers, education authorities or employers.
• For example, a test that pretended to measure pronunciation ability but which did not require the test taker to speak might be thought to lack FV.
Face validity

FV will likely be high if learners encounter:
• A well-constructed, expected format with familiar tasks,
• A test that is clearly doable within the allotted time limit,
• Items that are clear and uncomplicated,
• Directions that are crystal clear,
• Tasks that relate to their course work (content validity),
• A difficulty level that presents a reasonable challenge.
Face validity
Example
A test that claims to measure pronunciation ability but does not require students to speak
=> does not have face validity
(The same shortcoming also undermines the test’s construct and criterion-related validity.)
Consequential validity

Considerations:
• A test’s accuracy in measuring intended criteria
• Its impact on the preparation of test-takers
• Its effect on learners
• The intended and unintended social consequences of a test’s interpretation and use
Consequential validity

• Washback: the effect of assessments on students’ motivation, subsequent performance in a course, independent learning, study habits, and attitude toward school work.
AUTHENTICITY
Authenticity

• “The degree of correspondence of the characteristics of a given language test task to the features of a TLU task” (Bachman & Palmer, 1996, p. 23)
⇒ how likely the task is to be enacted in the ‘real world’

• Authenticity was relatively underemphasized in language testing, but attention to it has recently increased noticeably.
• Should authenticity be judged in relation to the TLU domain or the language instruction domain (i.e. the syllabus)?
Authenticity

Authenticity may be present in the following ways:
• The language in the test is as natural as possible.
• Items are contextualized rather than isolated.
• Topics are meaningful (relevant, interesting).
• Some thematic organization to items is provided, such as through a story line or episode.
• Tasks closely approximate real-world tasks.
INTERACTIVENESS
Interactiveness

• Whether and to what extent test tasks call for test takers’ language knowledge and ability for successful task completion.
• Interactiveness arises from the interaction between the characteristics of the language test task and the test taker’s language ability (language knowledge & metacognitive strategies), topical knowledge, and affective schemata.
IMPACT
Impact

• Test impact refers to the social dimension of language testing and is also known as consequential validity.

• Test administration and use involve values and goals, the consequences of which affect individuals and education systems as well as the society at large.

• Test impact operates at both the micro level and the macro level.
Impact

Micro level
• Impact of particular test use on individuals
• E.g. test takers, test users, decision makers, parents, educators, employers

Macro level
• Impact on society and educational systems
Impact on test takers
Test preparation and test taking experience

E.g. A writing test sets only two kinds of tasks: compare/contrast; describe/interpret charts
→ much preparation for the test is limited to those tasks
→ not beneficial backwash
Impact

Impact on test takers


• Test preparation and test taking experience
• Feedback from tests
• Test results: inclusion, exclusion
Impact on teachers
• Career, instructional preferences, etc.
Backwash is the effect that tests have on learning and teaching.



Backwash is now seen as a part of the impact a test may have on learners and teachers, on educational systems in general, and on society at large.



Test abilities whose development you want to encourage
• Test what is important to test.
• Don’t only test what is easiest to test.
• Each ability should be given sufficient weight in relation to other abilities.

E.g. If you want to encourage oral ability, test oral ability.

Sample widely and unpredictably
• Testing a restricted area will have a backwash effect only in that area.
• A wider range of tasks should be used in testing.
• Test across the full range of the specifications.

E.g. A writing test that sets only two kinds of tasks:
1. compare/contrast
2. describe/interpret a chart or graph

Use direct testing
• Test skills by getting students to perform in those skills.
• Texts and tasks should be as authentic as possible.
• Indirect testing removes an incentive (motivation) for students to practice the target skill.

Make testing criterion-referenced

• Students have a clear picture of what they have to achieve.


• Students might be successful regardless of how other students perform

Base achievement tests on objectives
• Do not base achievement tests on detailed teaching and textbook content.
• A truer picture of what has been achieved is provided.
• Constant pressure to achieve objectives will contribute to beneficial backwash.

• E.g. ED writing syllabus

Ensure test is known and understood by students and teachers

• The rationale (principles) for the test, its specifications, and sample items should be
made available to candidates.

Where necessary, provide assistance to teachers
• The test will not achieve its intended effect if teachers who need guidance and training do not receive them.
• Where new tests are meant to help change teaching, support has to be given to help effect the change.

• E.g. a national language project: how to teach communicative skills instead of grammatical structure and vocabulary?

Counting the cost
• The practicality of a test is as important as its validity and reliability.
• A good test should be easy and cheap to construct, administer, score and interpret.
• The cost of the test, and the effort and time required of teachers and students, should always be taken into consideration while compiling and administering tests.

• What would be the cost of NOT achieving beneficial backwash?

Essay writing
Task types:
• Compare – Contrast
• Graph/Chart
• Opinion
• Argumentative (For/Against), organized in 2 ways:
  1. For (1)
  2. For (2)
  3. Counter-argument - Refutation (balanced)
• Problem-Solution
• Hypothesis



PRACTICALITY
Practicality

• Practicality refers to the practical issues of test development and implementation.
• Consideration of the available resources determines the nature, type and length of tests.
• It concerns the relationship between the resources required in the design, development and use of the test and the resources that will be available for this purpose.
• Three types of resources are required:
• Human resources
• Material resources
• Time
Practicality
Example
• How long it will take to sit and to mark the test:
  • stay within appropriate time constraints;
  • the scoring/evaluation procedure is specific and time-efficient.
• Physical constraints of the test situation
  • E.g. a speaking test
Practicality

• Formula (Bachman & Palmer, 1996):

  Practicality = Ar / Nr

  where Ar = available resources and Nr = needed (required) resources.
  If the quotient is ≥ 1, the test is practical; if it is < 1, the test is not practical.

• A test should be easy and cheap to construct, administer, score and interpret.
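The quotient can be illustrated with a short calculation. The resource figures below are hypothetical (e.g. person-hours across design, administration and scoring):

```python
# Practicality quotient sketch (Bachman & Palmer, 1996):
# practicality = available resources / needed resources.
# Resource figures are hypothetical person-hours.

def practicality(available, needed):
    return available / needed

q = practicality(available=120, needed=100)
print(f"practicality quotient = {q:.1f}")
if q >= 1:
    print("resources are adequate: the test is practical")
```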
Summary

• Test usefulness incorporates all essential test qualities to ensure fairness and quality control.

• The social dimension of language testing is emphasized in the model by including impact as an important test quality.

• Test usefulness ultimately depends on the purpose and the context of the test, and aims at a balance among the test qualities.
