
Test usefulness

Model of test usefulness


(Bachman & Palmer, 1996)
• Provides a framework for evaluating the whole process of test
development and use
• Emphasizes testing as a contextualized practice.
• Includes 6 test qualities

Face validity: acceptable (test-taking time)

Content validity: weak. Section 1 focuses on only one grammatical feature; Section 2: the collected sample is sufficient to make inferences about ability to write...

Construct validity: weak
Model of test usefulness

Usefulness = Reliability + Construct Validity + Authenticity + Interactiveness + Impact + Practicality

Operationalizing principles
1. Maximizing overall usefulness, rather than individual test qualities
2. Interdependence of the qualities
3. Appropriate balance is context-dependent
Reliability
Reliability

• Reliability refers to the consistency of estimation of candidates’ abilities.
• A reliable test score will be consistent across different occasions, settings and versions of a test.
Consistencies across different sets of test task characteristics

Examples:
• Same test - 2 different occasions
• 2 interchangeable forms of a test
Reliability: refers to the consistency of test results

• Will a student get almost the same score on a test taken one day and again on another day?
• Will one teacher score the assessment the same way as another?
• If using two test forms, will students perform comparably on both tests?
Factors contributing to the unreliability of a test

• Student-related reliability
• Rater reliability
• Test administration reliability
• Test reliability
Factors contributing to the unreliability of a test

Student-related reliability
• Temporary illness
• Physical or psychological factors

Rater reliability
• Subjectivity, bias
• Inter-rater reliability (inconsistent scores across raters)
• Intra-rater reliability (uneven judgment by the same rater)

Test administration reliability
• Conditions in which the test is administered

Test reliability
• Ambiguous test items
• Unanticipated correct answers
• Test is too long
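Rater reliability can be checked empirically. As a minimal sketch (all band scores below are hypothetical), the exact-agreement rate between two raters scoring the same ten scripts on a 1–5 scale:

```python
# Inter-rater reliability sketch: two raters band-score the same ten
# scripts on a 1-5 scale. All scores are hypothetical.
rater1 = [4, 3, 5, 2, 4, 3, 5, 1, 4, 3]
rater2 = [4, 3, 4, 2, 4, 3, 5, 2, 4, 3]

# Proportion of scripts on which the two raters agree exactly.
agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"exact agreement: {agreement:.0%}")  # prints "exact agreement: 80%"
```

A low agreement rate would point to rater training or a more detailed scoring key as remedies.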
Types of reliability

• Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.

• Example: A test could be given to a group of students twice, with the second administration perhaps coming a day later. The obtained correlation coefficient would indicate the stability of the scores.
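The correlation step can be sketched in Python. This is a minimal illustration with hypothetical scores, not data from the slides:

```python
# Test-retest reliability sketch: correlate the scores from two
# administrations of the same test. Scores are hypothetical.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

time1 = [65, 72, 80, 58, 90, 75]  # first administration
time2 = [68, 70, 82, 60, 88, 77]  # same students, one day later

r = pearson_r(time1, time2)
print(f"test-retest reliability: r = {r:.2f}")
```

A coefficient close to 1 indicates stable scores across the two occasions.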
Types of Reliability

• Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions.

• Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
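The random split in the example can be sketched as follows. The item labels are hypothetical; in practice the two forms would then be administered to the same group and the scores correlated:

```python
import random

# Parallel-forms sketch: randomly split a 40-item bank into two
# 20-item forms probing the same construct. Item labels are hypothetical.
random.seed(0)  # fixed seed so the split is reproducible

item_bank = [f"item_{i:02d}" for i in range(1, 41)]
shuffled = random.sample(item_bank, k=len(item_bank))
form_a, form_b = shuffled[:20], shuffled[20:]

assert len(form_a) == len(form_b) == 20
assert not set(form_a) & set(form_b)  # no item appears on both forms
```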
Types of reliability

• Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.

• Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.

• Cronbach’s alpha is another widely used index of internal consistency.

• Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to probe the same area of knowledge in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores.
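Cronbach’s alpha can be computed directly from per-item scores. A minimal sketch with hypothetical data (4 items, 5 students, 1–5 scale):

```python
from statistics import pvariance

# Internal-consistency sketch: Cronbach's alpha.
# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)

def cronbach_alpha(item_scores):
    """item_scores: one list per item, one entry per student."""
    k = len(item_scores)
    totals = [sum(student) for student in zip(*item_scores)]  # per-student totals
    sum_item_var = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Hypothetical scores: 4 items probing the same construct, 5 students.
items = [
    [3, 4, 5, 2, 4],
    [3, 5, 5, 2, 3],
    [4, 4, 5, 1, 4],
    [3, 5, 4, 2, 4],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Values near 1 indicate that the items behave consistently, i.e. they appear to probe the same construct.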
Reliability

• Quality of test scores: unreliable scores do not constitute sound evidence for making inferences about test takers’ language ability => we cannot draw any conclusions.

• Reliability is a necessary condition for construct validity, and thus for test usefulness.

• Inconsistencies cannot be eliminated entirely, but measurement errors should be minimized.
Scenario – Elementary education

Mawuse teaches English at an elementary and middle school. Parents are very enthusiastic about the school’s high standards and international focus, and students are generally eager to excel in their classes. Mawuse administers achievement tests to his students every quarter and he often finds himself explaining the results to parents and students.

How could Mawuse handle each of the situations below? Get out a sheet of paper. Using what you’ve learned about reliability, brainstorm possible answers to these questions.
• What we have to do is construct, administer and score tests in such a way that the scores
actually obtained on a test on a particular occasion are likely to be very similar to those
which would have been obtained if it had been administered to the same students with
the same ability, but at a different time.
• The more similar the scores would have been, the more reliable the test is said to be.
How to make the tests more reliable
1. Take enough samples of behaviour.
2. Exclude items which do not discriminate well between weaker and stronger students.
3. Do not allow candidates too much freedom.
4. Write unambiguous items.
5. Provide clear and explicit instructions.
6. Ensure that tests are well laid out and perfectly legible.
7. Make candidates familiar with format and testing techniques.
8. Provide uniform and non-distracting conditions of administration.
9. Use items that permit scoring which is as objective as possible.
10. Provide a detailed scoring key.
11. Train scorers.
12. Identify candidates by number, not name.
13. Employ multiple, independent scoring.
#1 Take enough samples of behavior

• Other things being equal, the more items that you have on a test,
the more reliable that test will be
• One thing to bear in mind, however, is that the additional items
should be independent of each other and of existing items.
• Each additional item should as far as possible represent a fresh
start for the candidate. By doing this we are able to gain
additional information on all of the candidates — information
that will make test results more reliable.
#2 Exclude items which do not discriminate well between
weaker and stronger students

• Items on which strong students and weak students perform with similar degrees of success (items that are too easy or too difficult) contribute little to the reliability of a test.
• A small number of easy, non-discriminating items may be kept at the beginning of a test to give candidates confidence and reduce the stress they feel.
#3 Do not allow candidates too much freedom

• In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones that they have chosen.
• The more freedom that is given, the greater is likely to be the difference between the performance actually elicited and the performance that would have been elicited had the test been taken, say, a day later.
• In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted.
#3 Do not allow candidates too much freedom

Compare the following writing tasks:

1. Write a composition on tourism.
2. Write a composition on tourism in this country.
3. Write a composition on how we might develop the tourist industry in this country.
4. Discuss the following measures intended to increase the number of foreign tourists coming to this country:
   i) More/better advertising and/or information (Where? What form should it take?)
   ii) Improved facilities (hotels, transportation, communication, etc.)
   iii) Training of personnel (guides, hotel managers, etc.)

Which task is likely to yield the most reliable scores?


#4 Write unambiguous items

• Candidates should not be presented with items whose meaning is not clear, or to which there is an acceptable answer which the test writer has not anticipated.
• If an individual candidate might interpret a question in different ways on different occasions, the item is not contributing fully to the reliability of the test.
• If this task is approached in the right way, most of the problems can be identified before the test is administered. Pre-testing of the items on a group of people comparable to those for whom the test is intended should reveal the remainder.
#5 Provide clear and explicit instructions

• This applies both to written and oral instructions.
• A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions.
• The use of colleagues to criticise drafts of instructions (including those which will be spoken) is the best means of avoiding problems. Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.
#6 Ensure that tests are well laid out and perfectly legible

• Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on the unwanted tasks will lower the reliability of a test.
#7 Make candidates familiar with format and testing
techniques

• If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would do otherwise (on subsequently taking a parallel version, for example).
• For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them.
• This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.
#8 Provide uniform and non-distracting conditions of
administration

• Great care should be taken to ensure uniformity.
• For example, timing should be specified and strictly adhered to; the acoustic conditions should be similar for all administrations of a listening test.
• Every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.
#9 Make comparisons between candidates as direct as
possible

• Candidates should not be given a choice of items, and they should be limited in the way that they are allowed to respond.
• Scoring compositions that are all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests.
• The scoring should be all the more reliable if the compositions are guided, as in the example in the earlier section, ‘Do not allow candidates too much freedom’.
#10 Provide a detailed scoring key

• This should specify acceptable answers and assign points for acceptable partially correct responses.
• For high scorer reliability the key should be as detailed as possible in its assignment of points.
• It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism.
#11 Train scorers

• This is especially important where scoring is most subjective.
• The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions from past administrations accurately.
• After each administration, patterns of scoring should be analysed.
• Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.
#12 Identify candidates by number, not name

• Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score.
• Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given.
• For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given.
• The identification of candidates only by number will reduce such effects.
VALIDITY
Validity

• “A test is said to be valid if it measures accurately what it is intended to measure.” (Hughes, 2003, p. 26)

• “the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226) (cited in Brown, 2004)
Reflection
The teacher asks students to write as many words as they can in 15 minutes, then counts the words for the final score.

Practical? Reliable? Valid?
Reflection

Practical? Yes! Easy to administer.

Reliable? Yes! Quite dependable scoring.

Valid? No! It takes no account of comprehensibility, rhetorical discourse elements, organization of ideas, etc.
Evidence of validity

Construct validity

Content validity

Criterion-related validity

Face validity

Consequential validity
Construct validity

• Construct validity pertains to the meaningfulness and appropriateness of the interpretation that is made on the basis of test scores. (Bachman & Palmer, 1996)

• Construct: any underlying ability or trait that is hypothesized in a theory of language ability.
• Justify the meaning of scores by means of evidence and logic.
Construct validity

Language knowledge is defined as constructs
• E.g. the ability to read involves the ability to guess the meaning of unknown words from the context.

Definition of a construct is based on theories of language knowledge and competence
• E.g. a theory of writing tells us that underlying writing ability involves control of punctuation and sensitivity to demands on style;
• Or oral proficiency: pronunciation, fluency, grammatical accuracy, vocabulary use, sociolinguistic appropriateness.

Domain-specific constructs
• E.g. the ability to communicate in specific workplaces

A. Construct validity

Compatibility of the defined construct and target language use (TLU) tasks
• Correspondence between test tasks and TLU tasks: the construct measured by test tasks should reflect the construct defined by, or underlying, TLU tasks.
• Appropriate representation of the construct in test tasks; avoidance of construct under-representation or construct-irrelevance.
• Performance measures (i.e. test scores) are indicators of corresponding degrees of language competence or task performance.
• Generalizing test performance to the TLU domain.
Reflection on
Reliability & Construct validity

Suppose, for example, that we needed a test for placing individuals into different levels in an academic writing course. How about a multiple-choice test of grammatical knowledge?
Reflection

Reliability: Yes (consistent scores)

Construct validity: No (not sufficient to justify using this test as a placement test for a writing course)
Reflection

• Grammatical knowledge is only one aspect of the ability to use language to perform academic writing tasks.
=> inappropriately narrow
• The construct involved in the TLU domain—ability to perform academic writing tasks—involves other areas of language knowledge.
B. Content validity

“A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned.” (Hughes, 2003)
Content invalidity - Examples

• If trying to assess students’ ability to speak English in a conversational setting
→ asking students to answer paper-and-pencil multiple-choice questions requiring grammatical judgments
→ does not achieve content validity

• Or, if a course has 10 objectives
→ but only two are covered in a test
→ it does not achieve content validity
Discussion

• A quiz on English articles for high-level beginners in a conversation (listening & speaking) class.
• Students had completed a unit on zoo animals and had engaged in some open discussions and group work in which they had practiced articles, all in the listening and speaking modes of performance.
English Articles Quiz
Comments

• The quiz uses a familiar setting and focuses on previously practiced language forms
=> some content validity
• However, it was administered in written form, and required students to read the passage and write their responses
=> quite low content validity for a listening & speaking class
C. Criterion-related validity

“the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability” (Hughes, 2003, p. 27)
=> This independent assessment is the criterion measure against which the test is validated.
Criterion-related validity
Examples

Test to be validated → Criterion measure
• The results of one teacher’s unit test → a commercially produced test in a textbook
• Scores of a test on grammar in communicative use → (a) subsequent behaviour, or (b) other communicative measures of the grammar
Criterion-related validity
2 kinds

Concurrent
• Results supported by other concurrent performance
• E.g. a high score on the final exam of a foreign language course is substantiated by actual proficiency in the language
Concurrent criterion-related validity
Example scores:

10-min test | 1-hr test
7.5 | 8
6.6 | 7
6 | 6
8 | 8
6.5 | 6
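Concurrent validity is typically reported as the correlation between the test and the criterion measure. A minimal sketch using the five score pairs from the slide:

```python
# Concurrent validity sketch: correlate the 10-minute test scores with
# the 1-hour criterion measure from the slide.

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

short_test = [7.5, 6.6, 6, 8, 6.5]  # 10-minute test
criterion = [8, 7, 6, 8, 6]         # 1-hour criterion measure

r = pearson_r(short_test, criterion)
print(f"concurrent validity coefficient: r = {r:.2f}")  # ≈ 0.93
```

A high coefficient suggests the short test agrees closely with the dependable criterion measure.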
Criterion-related validity
2 kinds (continued)

Predictive
• Assesses or predicts a test taker’s likely future performance
• E.g. placement tests, language aptitude tests
Face validity (FV)

• Face validity (FV) refers to the degree to which a test looks right, and appears to measure what it is supposed to measure.

• FV is hardly a scientific concept and is rarely tested empirically, yet it is important because a test without FV may not be accepted by candidates, teachers, education authorities or employers.
• For example, a test that pretended to measure pronunciation ability but which did not require the test taker to speak might be thought to lack FV.
Face validity

FV will likely be high if learners encounter:
• A well-constructed, expected format with familiar tasks,
• A test that is clearly doable within the allotted time limit,
• Items that are clear and uncomplicated,
• Directions that are crystal clear,
• Tasks that relate to their course work (content validity),
• A difficulty level that presents a reasonable challenge.
Face validity
Example
A test that claims to measure pronunciation ability but does not require students to speak
=> does not have face validity
(The same shortcoming also undermines the test’s construct and criterion-related validity.)
Consequential validity

Considerations:
• A test’s accuracy in measuring intended criteria
• Its impact on the preparation of test-takers
• Its effect on learners
• The intended and unintended social consequences of a test’s interpretation and use
Consequential validity

• Washback: the effect of assessments on students’ motivation, subsequent performance in a course, independent learning, study habits, and attitude toward school work.
AUTHENTICITY
Authenticity

• “The degree of correspondence of the characteristics of a given language test task to the features of a TLU task” (Bachman & Palmer, 1996, p. 23)
⇒ how likely the task is to be enacted in the ‘real world’

• Authenticity was relatively underemphasized in language testing, but attention to it has recently increased noticeably.
• Should authenticity be judged in relation to the TLU domain or the language instruction domain (i.e. the syllabus)?
Authenticity

Authenticity may be present in the following ways:
• The language in the test is as natural as possible.
• Items are contextualized rather than isolated.
• Topics are meaningful (relevant, interesting).
• Some thematic organization to items is provided, such as through a story line or episode.
• Tasks closely approximate real-world tasks.
INTERACTIVENESS
Interactiveness

• Whether and to what extent test tasks call for test takers’ language knowledge and ability for successful task completion.
• Interactiveness arises from the interaction between the characteristics of the language test task and the test taker’s language ability (language knowledge & metacognitive strategies), topical knowledge, and affective schemata.
IMPACT
Impact

• Test impact refers to the social dimension of language testing and is also known as consequential validity.

• Test administration and use involve values and goals, the consequences of which affect individuals and education systems as well as the society at large.

• Test impact operates at both the micro level and the macro level.
Impact

Micro level
• Impact of particular test use on individuals
• E.g. test takers, test users, decision makers, parents, educators, employers

Macro level
• Impact on society and educational systems
Impact on test takers
Test preparation and test taking experience

E.g. A writing test sets only two kinds of tasks: compare/contrast; describe/interpret charts
→ much preparation for the test is limited to those tasks
→ not beneficial backwash
Impact

Impact on test takers


• Test preparation and test taking experience
• Feedback from tests
• Test results: inclusion, exclusion
Impact on teachers
• Career, instructional preferences, etc.
Backwash is the effect that tests have on learning and teaching.



Backwash is now seen as a part of the impact a test may have on learners and teachers, on educational systems in general, and on society at large.



Test abilities whose development you want to encourage
• Test what is important to test.
• Don’t only test what is easiest to test.
• Each ability should be given sufficient weight in relation to other abilities.

E.g. If you want to encourage oral ability, test oral ability.

Sample widely and unpredictably
• Testing a restricted area will have a backwash effect only in that area.
• A wider range of tasks should be used in testing.
• Test across the full range of the specifications.

E.g. A writing test that sets only two kinds of tasks:
1. compare/contrast
2. describe/interpret a chart or graph

Use direct testing
• Test skills by getting students to perform in those skills.
• Texts and tasks should be as authentic as possible.
• Indirect testing removes an incentive (motivation) for students to practice the target skill.

Make testing criterion-referenced

• Students have a clear picture of what they have to achieve.


• Students might be successful regardless of how other students perform

Base achievement tests on objectives
• Do not base achievement tests on detailed teaching and textbook content.
• A truer picture of what has been achieved is provided.
• Constant pressure to achieve objectives will contribute to beneficial backwash.

• E.g. ED writing syllabus

Ensure test is known and understood by students and teachers

• The rationale (principles) for the test, its specifications, and sample items should be
made available to candidates.

Where necessary, provide assistance to teachers
• The test will not achieve its intended effect if teachers who need guidance and training do not receive them.
• Where new tests are meant to help change teaching, support has to be given to help effect the change.

• E.g. a national language project: how to teach communicative skills instead of grammatical structure and vocabulary?

Counting the cost
• The practicality of a test is as important as its validity and reliability.
• A good test should be easy and cheap to construct, administer, score and interpret.
• The cost of the test, and the effort and time required of teachers and students, should always be taken into consideration while compiling and administering tests.

• What would be the cost of NOT achieving beneficial backwash?

Essay writing
Task types:
• Compare – Contrast
• Graph/Chart
• Opinion
• Argumentative (For/Against), organized in 2 ways:
  1. For (1)
  2. For (2)
  3. Counter-argument - Refutation (balanced)
• Problem-Solution
• Hypothesis



PRACTICALITY
Practicality

• Practicality refers to the practical issues of test development and implementation.
• Consideration of the available resources determines the nature, type and length of tests.
• It concerns the relationship between the resources required in the design, development and use of the test and the resources that will be available for this purpose.
• Three types of resources are required:
• Human resources
• Material resources
• Time
Practicality
Example
• How long it will take to sit and to mark the test:
  • stay within appropriate time constraints;
  • the scoring/evaluation procedure is specific and time-efficient.
• Physical constraints of the test situation
  • E.g. a speaking test
Practicality

• Formula (Bachman & Palmer, 1996):

  Practicality = Ar / Nr

  where Ar = available resources and Nr = needed (required) resources.
  If the quotient is ≥ 1, the test is practical; if it is < 1, the test is not practical.

• A test should be easy and cheap to construct, administer, score and interpret.
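The quotient can be illustrated with a short calculation. The resource figures below are hypothetical (e.g. person-hours across design, administration and scoring):

```python
# Practicality quotient sketch (Bachman & Palmer, 1996):
# practicality = available resources / needed resources.
# Resource figures are hypothetical person-hours.

def practicality(available, needed):
    return available / needed

q = practicality(available=120, needed=100)
print(f"practicality quotient = {q:.1f}")
if q >= 1:
    print("resources are adequate: the test is practical")
```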
Summary

• Test usefulness incorporates all essential test qualities to ensure fairness and quality control.

• The social dimension of language testing is emphasized in the model by including impact as an important test quality.

• Test usefulness ultimately depends on the purpose and the context of the test, and aims at a balance among the test qualities.
