
Standards-based assessment
Tim McNamara
The University of Melbourne
Standards-based assessment and criterion referencing
Standards-based assessment is a form of criterion-referenced assessment (cf. norm-referenced assessment).
Information derived from a Criterion-Referenced Test
The degree to which the student has attained criterion performance, for example whether he can satisfactorily prepare an experimental report.
Glaser 1994 [1963], p. 6
Information derived from a Norm-Referenced Test
The relative ordering of individuals with respect to their test performance, for example, whether Student A can solve his problems more quickly than Student B.
Glaser 1994 [1963], p. 6
Definition of a criterion-referenced test
A criterion-referenced test is one that is deliberately constructed to yield measurements that are directly interpretable in terms of specified performance standards. Performance standards are generally specified by defining a class or domain of tasks that should be performed by the individual.
Glaser and Nitko, 1971, p. 653
Definition of a criterion-referenced test (2)
A student's score on a criterion-referenced measure provides explicit information as to what the student can and can't do. Criterion-referenced measures indicate the content of the behavioural repertory, and the correspondence between what an individual does and the underlying continuum of achievement. Measures which assess student achievement in terms of a certain criterion standard thus provide information as to the degree of competence attained by a particular student which is independent of reference to the performance of others.
Glaser, 1963, p. 519
Norm-referenced test
Any test that is primarily designed to disperse the performances of students in a normal distribution based on their general abilities, or proficiencies, for purposes of categorizing the students into levels or comparing students' performances to the performances of others who formed the normative group.
Brown and Hudson (2002, p. 2)
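The contrast can be made concrete with a small sketch (the scores and cut-score below are hypothetical, invented purely for illustration): the same raw score supports a norm-referenced reading (position relative to the cohort) and a criterion-referenced reading (mastery against a fixed standard).

```python
# Hypothetical cohort scores, student score and cut-score, for illustration only.
cohort = [52, 58, 61, 64, 67, 70, 73, 77, 81, 88]
student_score = 73
cut_score = 70  # the criterion: minimum score counted as mastery of the domain

# Norm-referenced reading: where the student sits relative to others.
percentile = 100 * sum(s < student_score for s in cohort) / len(cohort)
print(f"Norm-referenced: scored above {percentile:.0f}% of the normative group")

# Criterion-referenced reading: whether the student meets the standard,
# independent of how anyone else performed.
verdict = "meets" if student_score >= cut_score else "does not meet"
print(f"Criterion-referenced: {verdict} the performance standard")
```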
Is CRT behaviourist?
Criterion-referenced testing has its origins in behaviourism, but need not be atomistic, purely dichotomous, or reductive.
Criterion-referencing and levels on a continuum
Underlying the concept of achievement measurement is the notion of a continuum of knowledge acquisition ranging from no proficiency at all to perfect performance. An individual's achievement level falls at some point on this continuum as indicated by the behaviors he displays during testing. The degree to which his achievement resembles desired performance at any level is assessed by criterion-referenced measures of achievement or proficiency.
[Figure: ELICOS placement test - Rasch item estimates (thresholds), all on all (N = 86, L = 57); persons (X) and numbered items arrayed on a shared logit scale from -3.0 to +3.0]
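The item map above comes from a Rasch analysis, which locates persons (the X's) and items (the numbered entries) on one logit continuum. A minimal sketch of the model behind such maps, using illustrative values rather than the ELICOS estimates:

```python
import math

def p_success(theta: float, b: float) -> float:
    """Rasch model: probability that a person of ability theta succeeds
    on an item of difficulty b, both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person at 1.0 logits facing an easier, a matched, and a harder item.
for b in (-1.0, 1.0, 2.0):
    print(f"item at {b:+.1f} logits: P(success) = {p_success(1.0, b):.2f}")
    # -> roughly 0.88, 0.50 and 0.27
```

A person located at the same point as an item has an even chance of success on it, which is what allows person abilities and item difficulties to be read off a single achievement continuum.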
Scales and CRT
The standard against which a student's performance is compared when measured in this manner is the behavior which defines each point along the achievement continuum. The term criterion, when used in this way, does not necessarily refer to final end-of-course behavior. Criterion levels can be established at any point in instruction where it is necessary to obtain information as to the adequacy of an individual's performance.
Glaser, 1963, pp. 519-520
Interface with policy - scales and frameworks
Dominant movement in language education internationally
Driven by need for accountability and emphasis on demonstrable outcomes
Has adopted a functionalist view of language education (i.e. not the cultural, intellectual, values dimension)
Response to demands of globalization, efficiency
Curriculum and assessment addressed in a single framework
Emphasis on reporting
Format of standards
Standards are typically formulated as an ordered series of statements about levels of achievement or stages of development. (There may be multiple sets of ordered statements for different aspects of language development.)
CEFR Levels A2, B1 (speaking)
A2: Can understand sentences and frequently used expressions related to areas of most immediate relevance (e.g. very basic personal and family information, shopping, local geography, employment). Can communicate in simple and routine tasks requiring a simple and direct exchange of information on familiar and routine matters. Can describe in simple terms aspects of his/her background, immediate environment and matters in areas of immediate need.

B1: Can understand the main points of clear standard input on familiar matters regularly encountered in work, school, leisure, etc. Can deal with most situations likely to arise whilst travelling in an area where the language is spoken. Can produce simple connected text on topics which are familiar or of personal interest. Can describe experiences and events, dreams, hopes and ambitions and briefly give reasons and explanations for opinions and plans.
Mislevy: claims and evidence
An assessment is a machine for reasoning about what students know, can do or have accomplished, based on a handful of things they say, do, or make in particular settings.
[Diagram: ASSESSMENT ARGUMENT linking CLAIMS to OBSERVATIONS/EVIDENCE]
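A toy rendering of this argument structure (a sketch for illustration, not Mislevy's own evidence-centered design notation) pairs each claim with the observations that warrant it:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """A statement about what a student knows, can do or has accomplished."""
    statement: str
    evidence: list = field(default_factory=list)  # observations backing the claim

claim = Claim("Can produce simple connected text on familiar topics")  # cf. CEFR B1
claim.evidence.append("Wrote a coherent 150-word account of a recent journey")
print(f"Claim: {claim.statement}")
print(f"Warranted by: {claim.evidence}")
```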
What is the CEFR?
It represents a construct definition; it is an exercise in domain modelling
It provides a set of claims
It provides a general characterization of evidence and tasks
It is not a test - it allows different kinds of tests to be realizations of this construct
Possible functions of standards
Planning: to act as a series of objectives or goals for teaching and learning; involve clear and specific statements of teaching aims
Professional understanding: to inform teachers about the typical progress of learning; more complex statements which include contextual and interpretative information in order to help the teacher understand more fully the nature of the emergent ability in the learner
Accountability: to act as statements of learning outcomes for administrative purposes - tends to be the dominant function
Formative vs summative assessment
Can standards-based assessment help with formative assessment?
Gathering evidence to form the basis of reporting
Gathering of evidence is a mixture of teacher-led assessment and external examination
External evidence may be seen as intrusive, insensitive to learning
Places a record-keeping burden on the teacher
Requires intensive professional development of teachers
The best schemes provide good advice to teachers about integrating assessment into instruction - the Assessment for Learning movement
The assessment pyramid
[Diagram: pyramid, from apex to base]
Levels (numbered)
Level summaries
Strand descriptions
Within each mode, examples provided
[Advice to teachers] Detailed examples
Teacher chooses activity & criteria
Competing demands in standards-based assessment
Validity demands: intellectual defensibility of the construct; evidence of reliability; other validity evidence; concern for consequences
Managerialist demands: reporting; accountability
Teacher/learner demands: meaningfulness in the instructional process; facilitation of learning; enhanced quality of teaching; minimization of the administrative burden on teachers
Dylan Wiliam: Beyond norm- and criterion-referenced tests
Norm-referenced - hard to interpret in terms of what a student can do; limited to placing the student in a cohort group
Criterion-referenced - leads to narrowing of teaching; also implies a cohort group
Wiliam on the role of teachers
An assessment is valid to the extent that you are happy for teachers to teach towards the test
Therefore:
Involve teachers in summative assessment - increases reliability and validity
Externalize standards - locates the teacher as coach, not judge
Requires teachers to form a community of practice
Wiliam on construct-referenced assessment
Criteria do not define but exemplify grades
Standards are shared by the community of practice
Standards are implicit and evolve
Example: Standards and the PhD
Implies a yes/no decision about individuals
Impossible to specify criteria
But the examination process proceeds successfully
Granting a PhD is a performative utterance, an illocutionary act (not a description) - the person is launched on their career
Wiliam on summative and formative assessment
Effective summative assessment requires teachers to share a construct of quality
Effective formative assessment requires students to share the same construct of quality, and requires teachers to possess an anatomy of quality
Wiliam on quality rather than criteria
Maxims cannot be understood, still less applied, by anyone not already possessing a good practical knowledge of the art. They derive their interest from our appreciation of the art and cannot themselves either replace or establish that appreciation.
(Polanyi, 1958, p. 50)

Quality doesn't have to be defined. You understand it without definition. Quality is a direct experience independent of and prior to intellectual abstractions.
(Pirsig, 1991, p. 64)
Our questions
1 assessment vs testing vs evaluation vs
validation vs measurement
2 affective factors in assessment
3 influence of L1 on assessment
4 raters/judges
5 effect of tasks (esp. CELU)
6 criteria in writing and oral interaction
7 history of assessment
8 why assessment? Can we do without it?
9 performance assessment
Our questions
10 qualitative vs quantitative aspects
11 correction in an oral exam
12 assessment as a process - and the final exam?
13 scales/descriptors for oral language
14 should listening be part of the oral exam?
15 Are we assessing what we want to assess?
16 Defining standards - intermediate/advanced, etc.
17 Inter-rater reliability?
Our questions
18 Inferring actual performance from exam performance?
19 Exam strategies
20 Criteria in assessing a performance - e.g. grammar?
21 Cultural aspects - interference in performance, rating, etc.?