
The contradictory culture of teacher-based assessment: ESL teacher assessment practices in Australian and Hong Kong secondary schools

Chris Davison, The University of Hong Kong

A growing concern in teacher-based assessment, particularly in assessing
English language development in high-stakes contexts, is our inadequate
understanding of the means by which teachers make assessment decisions.
This article adopts a sociocultural approach to report on the background
and findings of a comparative study of ESL teachers’ assessment of written
argument in the final years of secondary school in Australia and Hong
Kong. Using verbal protocols, individual and group interviews and self-
reports, the study explored the different assessment beliefs, attitudes and
practices of teachers working with senior secondary Cantonese-speaking
students acquiring English as a second language. The study found that the
Australian teachers varied considerably in their approach to assessing
student work with two somewhat conflicting assessment orientations
revealed: the legalistic assessors who ‘ticked the boxes’ according to the pub-
lished assessment guidelines and those assessors who relied much more on
professional judgment. In Hong Kong, there was much more variability in
the underlying assessment criteria with consensus reached through reference
to community norms rather than explicit statements of performance. The
article concludes that traditional notions of validity may need to be recon-
ceptualized in high-stakes teacher-based assessment, with professional judg-
ment, interaction and trust given much higher priority in the assessment
process.

Address for correspondence: Chris Davison, Associate Professor in English Language Education, Faculty of Education, University of Hong Kong, R211, Runme Shaw Building, Pokfulam, Hong Kong, SAR, China; email: cdavison@hkucc.hku.hk

Language Testing 2004 21 (3) 305–334 DOI: 10.1191/0265532204lt286oa © 2004 Arnold

I Introduction

In some countries (e.g., England, and increasingly over recent years,
the USA) the distrust of teachers by politicians is so great that
involving teachers in the formal assessment of their own students is
unthinkable. And yet, in many other countries (e.g., Norway and
Sweden) teachers are responsible not just for determination of their
students' results in school leaving examinations, but also for university
entrance . . . As one German commentator once remarked: 'Why
rely on an out-of-focus snapshot taken by a total stranger?' . . .
High-quality educational provision requires that teachers are
involved in the summative assessment of their students (cited in
Wiliam, 2001a: 166; emphasis in original).
The role of the teacher-assessor in classroom-based assessment
has received relatively little attention in the research literature in
English language education, despite the increasing moves around
the world to adopt school-based assessment for high-stakes summa-
tive purposes in countries as diverse as Australia, the UK and Hong
Kong. Even less attention has been paid to the way in which differ-
ent educational and cultural contexts, and teacher assumptions
about those contexts, shape teachers’ assessment beliefs, attitudes
and practices. This article arises from a comparative study of the as-
sessment beliefs, attitudes and practices of English language teach-
ers in two very different educational and cultural contexts: Hong
Kong (SAR, China), and Melbourne (Australia). The two contexts
differ in many ways, including in their use of English, in their edu-
cational and assessment systems, in their teacher training and in
their assessment culture (see also Cheng, this issue).
In Hong Kong – a former British colony and now a special
administrative region of the People's Republic of China – English
functions for all practical purposes as a foreign language. Although
it is taught as a discrete subject from the first year of school and is
highly valued as a language of international communication, it is
used for very limited purposes within the wider community. Partly
because of perceived concerns about declining English language
standards as well as economic competitiveness, the Hong Kong
government has initiated major reforms of its curriculum and assess-
ment systems (Curriculum Development Council, 2000; 2002).
‘Assessment for learning’ (Curriculum Development Institute, 2003)
is being actively promoted to counter the entrenched dominance of
an elite traditional external examination system, and from 2007–08
onwards there will be a significant school-based assessment compo-
nent to the main external examinations in all subjects, including
English (see Harlen and Winter, this issue). At the same time stand-
ards-referenced curriculum and assessment schemes are being
introduced rapidly, complementing the recent adoption of a more
task-based English syllabus. However, inadequate teacher supply
and training is seen as hindering the educational reforms (SCOLAR,
2003). Although English teachers have always undertaken
school-based assessment of student work, their assessments have
focused primarily on improving examination performance, and
there is concern that they may not be well equipped to handle the
demands of a more formalized system of high-stakes school-based
assessment (Hong Kong Examination and Assessment Authority,
2003). Only about 14% of English language teachers in Hong Kong
secondary schools have both a relevant degree and teacher training.
However, there has been virtually no research into the criteria
and constructs such teachers are currently using to make their inter-
nal assessments, nor any research into how they go about making
judgments.
In contrast, although Melbourne has a high immigrant popu-
lation, English is the main language for all public communication,
including schooling. Criterion-referenced assessment systems, linked
to system-wide standards, are well established and teacher-based
assessment processes have been in operation for some time (see
Arkoudis and O’Loughlin, this issue). To complete secondary
school, all students must take a compulsory English subject with
teacher-based assessment contributing 50% of the grade (Victorian
Curriculum and Assessment Authority, 2003). Teachers of English as
a second language generally have considerably more teaching
experience than their counterparts in Hong Kong, and are much
more used to dealing with the demands of high-stakes school-based
assessment. However, ESL in schools is a comparatively recent field
of curriculum activity, and has been continually evolving to meet
new conceptualizations of its role (Davison, 2001). Despite this, little
research has been undertaken on documenting how ESL teachers
make assessment decisions in such a strongly criterion-based culture.
Internationally, too, there has been comparatively little research
into how English language teachers in schools come to their assess-
ment decisions (Clarke and Gipps, 2000), although there is much
evidence of variability in judgments, particularly in writing assess-
ment.1 A number of Australian researchers (Moore, 1996; Breen
et al., 1997; Davison and McKay, 2002; Nicholas, 2002; Davison
and Williams, 2003) argue that one factor contributing to this varia-
bility is the actual assessment frameworks and criteria used (see
Scott and Erduran’s review, this issue), as well as the reporting
requirements of schools and systems. This reinforces earlier argu-
ments (Lantolf and Frawley, 1985; 1992) that the nature of achieve-
ment described is mediated, even constituted, by the instruments
used to describe it. This also highlights the importance of examining
context and teachers’ context-specific beliefs (Freeman, 2002), that
is, teachers’ assumptions about the factors that will shape and
constrain their practices in the assessment process.

1 There is not necessarily more variability in writing assessment than other modes. It is simply that writing assessment is the most common mode of assessment in English language education, and the easiest to research.

One of the most
important contextual factors is teachers’ individual beliefs about
how others will evaluate their behaviour, so-called ‘normative
beliefs’ (Ajzen, 1988). If teachers believe that those important to
them in their professional community will approve a certain prac-
tice, then they will be predisposed to carry out that practice, even if
it is not acceptable in the wider community, and vice versa. Other
contextual factors include teachers’ beliefs about the social, insti-
tutional and cultural context of their assessment practice, including
the purpose of assessment, its relation to learning and teaching, the
role of the teacher in relation to assessment, and the teachers’ pre-
existing beliefs about the students and texts they are assessing. This
article explores some of the underlying constructs and criteria teach-
ers use to make assessment decisions in two very different contexts,
and the extent to which those constructs and criteria are shaped
and constrained by the different assessment cultures in which the
teachers work.

II Criterion-referenced vs. construct-referenced assessment approaches and their limitations

The move from a ‘culture of testing’ to a ‘culture of assessment’ in
most school systems has invariably been accompanied by the devel-
opment of often quite complex criterion-referenced assessment
frameworks to ensure reliability and consistency in teacher-based
assessment. However, growing evidence from research into portfolio
assessment and from studies of raters’ assessment processes in for-
malized, large-scale testing would suggest that there are at least
three problems with such criterion-referenced systems from the per-
spective of teacher-based assessment.
First, the common assumption underlying such approaches is that
criteria can be teacher free and context free, but many studies show
that assessment criteria are interpreted differently by teacher-asses-
sors according to their personal background, previous experience,
unconscious expectations, internalized and personalized preferences
regarding the relative importance of different criteria and ideological
orientation. Research into assessor behaviour shows that despite
similar training and language background, teachers seem to differ
from each other in a variety of ways in their interpretation of
assessment criteria (Brindley, 1991; Pula and Huot, 1993; Brown,
1995; Zhang, 1998), in their ‘reading’ practices (Hamp-Lyons, 1989;
Cumming, 1990; Vaughan, 1991; Lumley, 2001) and in their
response to different features of writing (Huot, 1990; O’Loughlin,
1994; Weigle, 1994; Zhang, 1998). In a study of portfolio assess-
ment, DeRemer (1998) examined the process of assessment used by
teachers employing a standardized assessment scheme and found
not only low agreement amongst the teachers, but different reading
and rating processes. As Gipps (1994: 167) suggests, the central
problem is that ‘we are social beings who construe the world
according to our values and perceptions.’ The generalized and relative
nature of many criterion-referenced assessment schemes makes
conflicting interpretations even more likely.
The second problem with much criterion-based assessment is that
it assumes teacher-based assessment is essentially a technical acti-
vity, requiring little professional judgment or interpretation. How-
ever, research suggests assessment criteria can never be made
explicit (Brindley, 2001) and statements are always ambiguous, and
require the application of implicit knowledge (Claxton, 1995;
Wiliam, 2001a). In other words, criteria can provide a basis for
negotiating assessment, but ultimately teachers will only understand
and interpret criteria by looking at actual examples of students and
texts. This raises a deeper problem with assumptions. Criterion-
based assessment systems are founded on the premise that agreement
(with criteria as well as other assessors) rather than disagreement
is essential for ‘validity’ and ‘reliability’. However, anecdotal
evidence would suggest that it is only through the teacher interpret-
ation and negotiation of judgments that judgments can be made
valid and reliable.
The third problem with much criterion-referenced assessment –
and this is perhaps its fatal flaw – is that it assumes teachers will
accept externally imposed criteria as a basis for judgment. However,
there is growing evidence that when a conflict arises between stand-
ardized criteria and teachers’ own personalized judgments, teachers
manipulate and/or reject the criteria (Davison, 1999; Arkoudis and
O’Loughlin, this issue).
Even if it is possible to define assessment criteria unambiguously,
it is not necessarily desirable (Smith, 1991; Mabry, 1999). Greater
and greater specification of assessment criteria – ‘criterion-
referenced hyperspecification’ (Popham, 2000) – can lead to students
and teachers predicting too accurately what is to be assessed, thus
over-emphasizing those aspects of the curriculum, or fragmenting
and compartmentalizing learning and teaching; these are both com-
mon problems already in the senior secondary area. As Wiliam
(2001a: 171), a vocal critic of criterion-referenced assessment sys-
tems argues: ‘Put crudely, the more precisely we specify what we
want, the more likely we are to get it, but the less likely it is to
mean anything.’
On the other hand, construct-referenced assessment2 (Wiliam,
1994) – currently being strongly promoted in educational assess-
ment circles as an alternative to criterion-based systems (Brindley,
1998; Wiliam, 2001a; 2001b) – also has clear limitations as a frame-
work for teacher-based assessment.
First, construct-referenced assessment is founded on the assump-
tion that amongst teachers there develop common ‘communities of
practice’ (Black and Wiliam, 1998) that provide implicit frameworks
for judgment and evaluation. It is claimed that these communities
of practice are stable enough to allow teachers to reach consensus
without the provision of objective criteria. According to Wiliam
(2001a: 172–73), the assessment system ‘relies upon the existence of
a construct of what it means to be competent in a particular domain
being shared amongst a community of practitioners’. No attempt is
made to prescribe learning outcomes; they are defined simply as the
consensus of the teachers making the assessments. Although it may
be possible to establish a common community of practice in local
education systems, in established areas of curriculum activity in
which teachers share a common history and training, it seems a
much more ambitious assumption in large state-based education
systems such as Hong Kong and Melbourne. In such systems there
are very diverse pathways into the English language teaching pro-
fession, and the field itself is continually evolving so that at any one
time teachers may hold very different interpretations of their role
and purpose.
Secondly, construct-referenced systems cannot work unless
teacher-assessors are ‘authorized (i.e., granted the power and legiti-
mated as users of that power) to make judgments’ (Wiliam, 2001a).
The extent to which those judgments are seen as valid ultimately
resides in the degree of trust placed in teachers by those who use the
results of the assessments (Wiliam, 1996). In order to gain and/or
maintain this trust, Wiliam (2001a: 173–74) argues that communities
have to show that their procedures for making judgments are fair,
appropriate and defensible, even if they cannot be made totally
transparent. However, trust in ESL teachers is the one thing that
seems conspicuously lacking in the regular media outcries about
declining standards of English in Melbourne and Hong Kong.
Although Wiliam (2001a: 172) argues ‘it is not necessary for
the raters (or anybody else) to know what they are doing, only that
they do it right’, English language education is something in which
the wider community has a vested interest. Thus, debates over the
rights and wrongs of English language teaching (and teachers)
become very heated when teacher-based assessment is discussed.

2 The difference between criterion- and construct-referenced assessment systems is in the relationship between written descriptions (if they exist at all) and the domains (Wiliam, 2001a). In criterion-referenced systems written statements collectively define the level of performance required (or, more precisely, the justifiable inferences), whereas in construct-referenced systems written statements merely exemplify the kinds of inferences warranted.
Finally, construct-referenced assessment, like criterion-based sys-
tems, assumes that agreement within the community, rather than
disagreement, is primary. This can lead to moderation meetings, an
essential requirement of construct-referenced systems, valuing com-
monality over diversity of opinion, with a consequent undermining
of validity as teachers bury their differences in the drive for consen-
sus. As Clarke and Gipps (2000: 40) argue, group moderation must
be used to ‘ensure that teachers have common understandings of the
criterion performance’. This implies development, not just vali-
dation of understandings, and inevitably requires more time than
most systems are prepared to fund.3
3 Paradoxically, moderation meetings are almost always the first casualty of the implementation of school-based assessment. For example, when the new Certificate program was introduced in Melbourne schools in 1992, all teachers’ grades were subject to systematic verification and assessor moderation, but this checking process was abandoned as too costly and replaced by a system of top-down state-wide reviews and the introduction of an external comparison/control, the ‘context-free’ General Achievement Test (GAT). If there was a discrepancy between the school-based results and the GAT, reviews of all school-based assessment in specific schools were automatically triggered. In 2000 this was replaced by standardization of results with those of the external exam, hence undermining the integrity of the school-based assessment. In Hong Kong there is already talk of the need to supplement teacher moderation in school-based assessment with standardization against the external examinations (South China Morning Post, 31 May 2003, p. E3).

In summary, it can be argued that both criterion- and construct-
referenced systems have their limitations as approaches for teacher-
based assessment. This study not only reveals the extent to which
these limitations are realized in teachers’ actual assessment practices
in different contexts, but also suggests some ways to overcome these
limitations by exploring what teachers actually do as they assess,
not just what systems expect them to do.

III The research design

This study explored two aspects of teachers’ beliefs, attitudes and
practices which emerged as most salient from the above discussion,
that is:
. the extent to which senior secondary English language teachers
in Melbourne and Hong Kong schools shared common
(and stable) beliefs and values about the construct being
assessed, with specific reference to written argument; and
. the extent to which teachers felt their judgments were legiti-
mated and ‘trusted’ in their community.4
Given the different contexts and assessment systems, major differ-
ences in teachers’ practices were expected. However, in order to rec-
oncile and, eventually, perhaps resolve the apparent limitations of
the two existing approaches to teacher-based assessment, common-
alities in beliefs and practices were also sought.
Using questionnaires, verbal protocols, individual and group
interviews and self-reports, the study explored the different con-
structions of written argument as well as the different interpretations
of assessment criteria of 12 teachers of English as a second language
in each context, all with varying degrees of teaching experience, all
teaching in the final two years of secondary school. The method-
ology chosen was primarily qualitative and interpretive, as what was
of interest was understanding and theorizing what teachers actually
do and why. The theoretical framework for analysis was influenced
by the work of Lave and Wenger (1991) on learning communities
and Harré (1993; 1999) on social positioning. Harré suggests
that discourse ‘acts’ are a form of discourse that realize the norms
of actions or practices of an activity, whereas discourse ‘accounts’
of theory and discourse ‘commentaries’ in action realize norms of
social acts. The meaning of social acts is made more explicit
through different types of discourses.
The teachers in this study were asked to assess the texts according
to their usual processes. Their think-alouds were audio-taped while
they assessed the students’ work and made comments on scripts
(acts/commentary). They were then brought together in groups of
three and asked to share their results, and to reflect upon their
assessment processes (commentary/accounts). These group dis-
cussions were also audio-taped and transcribed. After a week’s
break, the teachers were then brought back into a whole group led
by the researcher, and any contradictions in their practices and atti-
tudes (accounts) were explored. This allowed for a certain degree of
triangulation and sufficient opportunities for differences in teacher
beliefs to emerge. Data were analysed by looking for different con-
structions of ‘effective’ written argument as well as different con-
structions and interpretations of assessment criteria in order to
make more explicit the cultural meanings of the teachers’ actions,
including contradictions and uncertainties.5

4 This research was carried out with the aid of grants from the Australian Research Council and the University Grants Commission, Hong Kong.
The six texts being assessed were written arguments produced to
meet the local assessment requirements. In Melbourne this was the
section of the subject of English (ESL) dealing with issues and argu-
ments which asked students to produce a sustained point of view on
a particular issue that had appeared in the Australian media in the
previous year. In Hong Kong it was a written argument on a spe-
cific issue in preparation for the Hong Kong Advanced Level
Examination (HKALE). Despite the different contexts in
which written argument was taught, the topics were remarkably
similar: e.g., gambling, euthanasia, environmental pollution, and
debates about whether to host a Grand Prix. The Melbourne texts
were elicited from Cantonese-speaking Hong Kong-born students
with varying periods of schooling in Victoria as part of an earlier
study (Davison, 1998). The Hong Kong texts included two of the
Melbourne texts to enhance comparability as well as four locally
produced ones on the same topic, namely, euthanasia.

IV Preoccupations of teachers involved in the assessment process


Even allowing for the inherent difficulties of collecting data through
verbal protocols (Cooper and Holzman, 1983; Smagorinsky, 1989), it
was assumed that the Australian teachers would be far more objec-
tive and consistent in their judgments of students’ performance,
given the much longer tradition of teacher-based assessment in
Melbourne schools compared with Hong Kong. However, this was
apparently not the case. The verbal protocols and small group dis-
cussions of the Melbourne teachers revealed a great deal of conflict
over how to interpret the assessment criteria, with some students
being given marks from A to D and others being given the same
mark but for very different reasons.
Not surprisingly, all the Melbourne teachers chose to use the
established and familiar criteria (see Figure 1) for assessing the writ-
ten texts that had been developed for summative school-based
assessment, as well as the standardized reporting pro forma that
gave equal weight to all criteria (see Appendix 1; see also Board of
Studies, 1999). Although all the Melbourne teachers appeared to be
very systematic in following the published criteria, they struggled at
times to reconcile the ‘legalistic’ framework with what they called
their ‘gut reactions’ or professional judgment. This is demonstrated
in Extract 1 below, from a verbal protocol dealing with a text that
received the widest range of grades, A–D (see Appendix 2).6

5 Space precludes a more detailed analysis of the teachers’ assessment conversations and commentaries as situated and distinct forms of discourse.

Fig. 1 Assessment criteria for CAT 1 Presentation of an issue (ESL). Source: Board of Studies, 1999: 16–20.

Extract 1 (from verbal protocol): Criteria vs. instinct

1 R: (Giving overall impressionistic comments as reads text) There is no
obvious evidence of an introduction in terms of paragraphing. However,
Vince7 would appear to have some appreciation of the generalities of the
issue that he’s considering. And eventually come to line 6 sentence, he
5 does actually have rather general contention. In keeping with the lack of
paragraphing, there’s no obvious structure to what follows. Although he
starts with points, he has some points, what he calls the economic point
of view. And, some rather vague and undocumented material which he
tries to substantiate his point of view. And he’s obviously been taught all
10 the ‘therefore’, ‘besides’, ‘however’, ‘hence’ connectors which he sprinkles
quite liberally in an attempt to connect his list of points. Yes, he does
have a conclusion. And in fact that constitutes the second paragraph.

(Looks at criteria as rereads bits of text, at the same time jotting down
15 marks and comments on the assessment sheet in blue pen) The whole
thing does show a limited understanding of the task (explicit reference to
Criterion 1, Figure 1). I think I’d rate it as a medium. And the reasons
why I’d rate it as medium rather than high or even low, is that he is
aware that he needs to express his opinion and that it should be an
20 informed opinion and that he does make, I suppose, he does sort of list a
whole host of ideas, which one assumes he has collected from his read-
ings. And, I think then that too, shows at least a medium knowledge of
the chosen content. Perhaps medium to high. High because of the actual
number of opinions. But much of the material which he calls to support
25 his ideas is rather superficial and tends to be opinionative rather than
substantive. So yes, there is some knowledge, I’ll give him a high for that
(explicit reference to Criterion 2, Figure 1) – perhaps a little generously.
Effectiveness and appropriateness of the exploration of ideas (explicit
reference to Criterion 3, Figure 1) – because of the structure really,
30 because of the sort of list-like nature, because of its sort of rambling
rather than coherence, although perhaps we’re pushing into structure
here, or we’re heading that way – I wouldn’t really say this is parti-
cularly effective. So I’m just going to give him a medium there. And
again, coherence and development (explicit reference to Criterion 4,
35 Figure 1). Well, coherent not in sort of big picture way – there’s no
obvious evidence of that – the lack of paragraphing. Perhaps we’ll
penalise him heavily for such things, but I’ll say that’s a medium. And
the structure (Criterion 5, Figure 1) really lacks punch, it lacks effective-
ness as a consequence. Fluent (Criterion 6, Figure 1). Yes, it’s fluent and
40 reasonably expressive. So, given his second language background I think
we could give him a high for that – mildly generous. The language chosen is
quite sophisticated in places. One assumes, because it’s not in inverted
commas, that it’s his own language. But he does use such words as
indemnity, compensation, the crux of the problem, figures cited by the
45 government, by the same token, inadvertently encourage. There’s some,
impersonate their idols, there’s some quite sophisticated language there. I
suppose that’s balanced in places by the fact that very fundamental
language occasionally has some obvious errors. Effective and appropriate
(Criterion 7, Figure 1). Well, taken for granted that, it’s certainly appro-
50 priate language. And, I suppose the control of the mechanics of English
language to support meaning (Criterion 8, Figure 1). The specific mech-
anics are quite good, I would say. Mr Generous today. We’ll be generous
and give him a high. So, we’ll total that and probably be horrified at the
high score he’s got. Four highs which gives him – that’s a total of 16.
55 Gives him a B plus. My instinct now comes in and I think really he’s not
worth a B plus. This is now my subjective – I would probably be more
inclined to give him a B. But I’m going to be honest and I’m going to give
Vince a borderline B plus. Which I think is generous and I think perhaps a
B would have been a much more appropriate mark.

6 In all of the extracts, initial letters identify the different teachers participating in the assessment discussion. Transcription conventions include bold for my emphasis; = indicates overlapping turns.
7 Pseudonyms are used for all references to students.

What is clear in this extract is the way in which the assessment


criteria and procedures mediate, even construct, the nature of
achievement described. At the end of the process the teacher feels
the weighting of the criteria he has applied results in too high a
grade: his ‘instincts’ tell him the student is not worth a B plus (lines
55–59). Yet he feels compelled to be ‘honest’ and objective, that is,
to follow the assessment guidelines. He then rationalizes this by
identifying himself as ‘Mr Generous today’ (line 52). This conflict or
‘discrepancy’ in assessment beliefs and practices was fairly
representative of the Melbourne teachers. As another teacher, S, in
the same group commented:

You’re thinking globally first off on first reading, and then you start to
apply the criterion and your frame of mind changes according to ‘Do I
downgrade?’ or ‘Do I upgrade?’ . . . lots of us have discrepancy, we’d like
to reward a little bit more for that particular expression or style of writ-
ing, but we’re not going to reward it for its ideas, for example, depend-
ing on the first reading of the piece. (Follow-up interview, p. 3)

However, S, in contrast to R, seems more comfortable in manipu-
lating the weighting of the criteria to suit his initial global impres-
sions of the student’s achievement.
When the grades given for this text were discussed in the small
group follow-up session, it provoked a lot of debate (see Extract 2
below). Teachers resisted giving equal weighting to the different
criteria. However, at times they appeared to feel they had to make a
choice between following the published assessment guidelines or
their own professional judgment. This was exemplified by teachers
talking in terms of ‘being good’, or ‘honest’ and doing the right
thing, vs. following their ‘instincts’ (lines 000, 000, 000). R, who had
ignored his ‘gut feelings’ and ticked the boxes in the way he felt was
expected, commented that he felt ‘ashamed’ (line 000) of ignoring
his instincts:
I just followed the criteria. I normally don’t do that, I normally go back
and work very hard to make sure that the mark is realistic, that it’s what
the student should get. (Follow-up interview, p. 2)

Extract 2: Honesty vs. shame

R: I was appallingly sort of honest with what I did here. I ticked the boxes
and was horrified when I finished up with a B plus, because my instinct
told me it was a C plus, but I stuck with it.
J: Why did your instinct tell you that?
5 R: Well, I when I mean instinct, when I read it through the first time I
thought well no paragraphs, no this, no that, there were some glaring,
well, let me stick to what I did rather than that. I noticed with several
of these, I was under the impression that they were meant to state con-
tention at the head of the piece.
10 M: That’s what I meant by formulaic  the formula is, it’s in the title.
R: They haven’t done that, and I thought that was a fundamental cri-
terion. I mean not a criterion in the sense that you penalised or
rewarded them. And I thought that, well, his appreciation was at
best, moderate. Now, I suppose by that I mean medium, but I think I
15 should have gone for low.
M: The reason why mine was not above a C plus in the end, even though
as you can see I . . . this is a B, because I did change my marks after a
while, was because I felt the main task here is to understand. If the
first couple, I consider the first three more important than say the last
20 three. In fact, I’ve often argued that they should not be weighted
equally. If the language is such, this is definitely such, that you could
clearly understand what he is saying, then the language does not de-
tract from meaning. Then it boils down to the first three criteria.
There is where I felt he really had quite a few problems.
25 R: So, what did you give him for those first three?
M: He did not maintain a personal viewpoint and argue coherently, supporting and substantiating his arguments, so I couldn’t go to the B/B plus, because for me that has to happen. Even at the lowest level, it
has to happen. And yet, you can’t go down any lower than that
30 because his =
B: = so what are you saying that there, you’re talking about criterion 4
(see Figure 1) for example.
R: No, you’re initially restricting yourself to – you’re commenting on 1,
2 and 3. I would argue that what you are saying is more pertinent to
35 4 and 5.
B: So when you’re saying criterion 3 is no good, it’s not effective and
appropriate exploration of ideas.
M: I’m not saying it’s not good. What I’m saying is instead of the struc-
ture, instead of arguing the case and substantiating the argument,
40 and then saying the opposition may have claimed this and that and
then rebut it, he seems to have stacked arguments one after the other.
B: OK, so you’re talking about criterion 5, 4 and 5 (see Figure 1). So, if
we look at the criteria, would you say that criterion 5 medium, the
work demonstrates some ability. I mean it’s a modifier, isn’t it? It’s
45 not total, or universal ability, some ability to . . . He’s demonstrated
some ability, it’s certainly not high. It’s certainly not some structure.
R: I think you and I are actually, what we’re doing is being quite legalistic
about this, aren’t we?
B: I always try to do that myself.
50 R: I have a dreadful conflict within myself as to what I call my intuitive
judgment, which is what you are going on, I think.
J: It is, it is.
R: And thinking about the future and where is this child going, can this
child cope? But that’s not what we’re being asked to do. We’re being
55 asked to tick the box. You say something is low. I don’t think you
can say that’s low, sadly.
B: But here, I think all we have to do is – exactly – I just read the
words. And it’s always worried me, this good or sound, because to me
there’s quite a range between good and sound. Good, this is good.
60 Sound is much better, much more concrete ability.
J: When I was speaking on the tape, I was saying I don’t know whether
I’m going to go medium or low here. I just felt, when I read it through
and look at it holistically, as you say, as a whole thing, I just think, it
65 hasn’t had a lot of depth, it doesn’t have a lot there and therefore
shouldn’t score well.
R: You’re saying this child is not tertiary material, that’s what you’re
saying.
B: That’s not a judgment =
70 J: = For me he ended up with a score of 25 which was a B, but I said
into the tape I put C plus, because I think it was a C plus and I’d like
to go back and read it again and see where I would change, but if you
are being critical and going by the boxes, I got a B. Yes, I said into
the tape, I said it’s not right. So C plus to B. C plus was my gut feel-
75 ing, B was what my mark was.
R: It’s quite amazing how diverse we are.
J: But I find with this marking, this scoring system going by these criteria,
that you always end up with, I always end up with a higher mark than I
would give it if I just made it off. And therefore if I go back and read
80 it again, I think, right well I’m not going to give them quite a high
box there, I’m not going to give them quite =
R: = That’s what happens to me. That I would normally go through
these and finish up as I have done, with a mark, and then I’d go back
and I’d switch the machine. I’d work very hard then to pull them down
85 to what I think is a realistic mark, but I haven’t done that. I’ve ticked
the boxes.
J: Yes, you’ve been very good and followed the =
R: = No, I haven’t been good, I feel ashamed of myself.

This tension in balancing the requirements of the assessment procedures with the teachers’ own professional judgment was
re-conceptualized by R in the whole group reporting back session as
a ‘philosophical divide’ (line 7, Extract 3) within his group. How-
ever, it sparked off a discussion about the ways in which all teachers
felt that their assessment decisions were influenced by factors other
than the criteria, such as their knowledge of the student or the con-
text. This led to comments about the role of the criteria (line 37):

Extract 3: Being legalistic vs. being human

R: Interestingly, there was a division in our group, please correct me if I’m wrong, between the legalist who ticked the boxes according to the criteria, and then perhaps in my own case, were horrified at how high
the marks were, and should have immediately crossed them out and
5 dragged them down, but didn’t. . .And those who didn’t quite do it
the same way. I’m dobbing you in here ladies. But this, I think, is a
reflection of a sort of philosophical divide. By coincidence it ran along
gender lines. And strangely enough, the men were the generous, car-
ing sorts who tried to help these students. I don’t know if it makes
10 sense, but I think that’s true of our thing. . .
M: = Well, in our group we did say if they were our own students, this is
a very hypothetical, theoretical exercise, if they were our own stu-
dents, I think we all agreed that we are human, and if we saw someone
was having hardship, particular cases of where a student beats the
15 odds, then again, where there was borderline decision to be made,
we’d err on the side of the higher mark . . . when we met for consen-
sus marking, in the past we used to have a regional meeting and it
was very interesting, because when we marked and, say I had G’s stu-
dent marked a couple of grades lower than G. had given it. G. would
20 come out with the background of this ‘but she’s tried so hard, you
know’, and that was a justification. And I remember those very
clearly.
G: Oh really, OK.
B: Yes, we did it that way. It’s pretty hard to avoid it when you get into
25 that sort of classroom relationship with some kids.
G: I suppose, I mean, if it operates over too wide a spread then it’s a real
problem if it isn’t there. If it’s a question of like, whether it’s a C or a
C plus, then, OK, I guess you can live with that. But if it’s a question
of whether it’s a C or an A, then that really calls it into question,
30 doesn’t it? If you can give it that much of the benefit.
M: It’s really those border areas and I think, if truth be told, subcon-
sciously, we are all affected by the students in front of us when we
know them.
R: By definition, I mean, we’re human beings, aren’t we?
35 M: That’s right.
G: Oh yes.
R: On the other hand, I think that’s the idea of the criteria, to try and
minimise that.

In this exchange the Melbourne teachers emphasize their ‘humanity’ (e.g., lines 13, 34), and assert their need to adapt the assessment criteria (and teaching goals) to fit their own teaching context
and professional reality (lines 11–22, 31–33). Criteria are valued by
the teachers as a way of ‘keeping them honest’ (line 37), but not a
substitute for professional judgment. Consequently, the teachers
resist what R sees as a legalistic process of ticking boxes, while at
the same time recognizing the ethical dilemmas created by their own
‘humanity’.
In fact, the individual verbal protocols showed that the majority of the teachers used a holistic and interactive approach to reach assessment decisions, with the published criteria and
teacher professional judgment – including knowledge of typical student behaviour – both playing an important role in the assessment
process. I would argue that this did not undermine the validity of
the teacher judgments. Rather, it contributed to and even con-
structed that validity, resulting as it did in an in-depth portrait,
rather than an ‘out of focus snapshot’ (Wiliam, 2001a: 166).
In contrast, in the assessment behaviours and conversations of
the Hong Kong teachers, there was much less commonality in terms
of assessment processes and much more debate over which under-
lying criteria shaped their assessment judgments. This is not sur-
prising given the present reliance on norm-referencing and
impressionistic marking in Hong Kong, and the lack of any system-wide common assessment criteria for evaluating written work at the
school level in senior secondary English. Some schools traditionally
focus more on accuracy and mechanics while others – because of their underlying philosophy and student profile – give more emphasis to creativity and content. This debate is exemplified in Extract 4,
in which Hong Kong teachers were comparing their differing views
of the construct being assessed, with some teachers focusing more
on grammar and sentence level structure (line 3) and others more on
ideas and overall organization (lines 7, 18):

Extract 4: Conflicting constructs

C: . . . Why are you focusing on grammar?
T: I still think that content is more or less the same for my students. So,
I better pay more attention on the grammar part.
C: Did you agree with him?
5 L: Not really. For my students, I guess they can master the basics, at
least the fifth level. I think they lack practice on logical thinking and
organisation of ideas, that sort of thing. So, I have a very heavy em-
phasis on organisation and logical flow of ideas.
T: Well, more or less, we are on the same ground. But, of course, it
10 depends on the students I see each year. For the Form 6 students I
am teaching this year, only a few students are really weak in gram-
mar at the exam level that the expression and ideas are greatly dis-
torted by the grammar. So, I think as long as their writing can be
understood by me, I won’t be bothered too much by their grammar
15 because when I am marking, sometimes, I focus on introduction or
the topic sentences, elaboration of ideas and in one composition, I
may focus on particular use of sentence patterns. So, what I focus on
more is again the structure or whether they are able to develop ideas
in a logical way.

The wide discrepancy in the constructs being used to evaluate the texts in Hong Kong compared with Melbourne is probably a result
of not only the lack of agreed published criteria but also the lack of
widespread teacher training. Teachers also have little or no opport-
unity to participate in moderation meetings or collaborative school-
based assessment. Like the respondents in an earlier study of teacher
attitudes towards school-based assessment (Davison and Tang,
2001), the majority of these Hong Kong teachers favoured the intro-
duction of summative school-based assessment, but wanted detailed
criteria to guide their assessments and to justify their results.
The Hong Kong teachers also demonstrated considerable conflict
over the weighting, implicit or explicit, of the criteria being used.
This is evidenced in Extract 5.

Extract 5: Respectful safe marks vs. failure

S: This is a very good grade. That should be 71, a safe C. Now, I don’t
know. The language because though she used a lot of idioms, she does
have a mass group of the language there. If it is overused, I think it is
a very general thing with Hong Kong students. They swallow diction-
5 aries and then they try and pump out as many of these sort of like
‘Every cloud has a silver lining’ and all of these sorts of stuff. I think
it’s better, much, much better than any of the others that we have
looked at so far. And, I know there are two ways you can look at this
and there are two ways that I can look at this. Either I can give this a
10 respectful mark or I can fail it. There is nowhere in between. Either it’s
going to get it a C or it’s going to get F. I went for the C in the end
because I guess I understood what she was trying to say. If you can
get your mind around her sort of very flowery language, then there is
an argument there.
15 V: = I think she’s trying to show [unintelligible] poetic language.

As in Davison and Tang’s (2001) study, the teachers here argued over the relative importance of creativity and mechanics, with
opinion evenly split over whether ‘content’ should be given higher
marks than ‘language’. This suggests that there is much work
needed at a system level to forge agreement about what counts
in written language assessment and to develop common under-
standings of the terminology of the criteria themselves. These
conclusions are reinforced by the findings of a more recent survey
(Davison and Tang, 2003) of 30 Hong Kong English teachers, all
professionally trained university graduates working at the Form Six
level (16–17 years old). These teachers were very critical of the
widespread practice of teachers having ‘every single mistake cor-
rected’, and they wanted ‘a balanced marking scheme’ that would
set out explicit and agreed common criteria for marking.
In Hong Kong, as in Melbourne, consideration of the psychologi-
cal and social impact of the assessment on learners – the individual
washback (see Alderson and Wall, 1993; Andrews, 1994; Watanabe,
1996; Cheng, 1998; Watanabe, 2000) – was always significant in
teacher decision-making, although interpretations of this varied con-
siderably according to context. In Melbourne it is the fate of the
individual and their life chances that are paramount in teachers’
thoughts, demonstrated in Extract 2 in the discussion of whether a
student is ‘tertiary material’ (see lines 67–68). In contrast, in Hong
Kong respect and face are more important considerations, exempli-
fied in the reference in Extract 5 to ‘a respectful mark’ (line 10). The
notion of respect also resurfaces explicitly in discussions over the
uses and consequences of assessment (see line 4 in Extract 6), but in
this case it is the teachers who are striving for respect and recognition. Without the ‘authority’ of the external examination, many Hong Kong teachers feel that their judgments will not be authorized
or respected, even by the students.

Extract 6: Respect

R: . . . Those kids they’ve got into Form 6 by virtue of passing exams, they are therefore the cream of the exam-oriented people. That’s what
the clients want. If you tell them that’s what I think you’ll get on the
UE [Use of English] paper, there’s more respect towards the grade that
5 means something.
This construction of respect is in striking contrast to the Melbourne
teachers who seemed to be more concerned about censure from
those within their professional community than from those outside.
In Melbourne, in a criterion-dominated assessment system, the
application of ‘gut feeling’ or ‘instinct’ seemed to engender greater
respect amongst teachers than simply ‘ticking the boxes’. In con-
trast, in Hong Kong teachers seemed to feel that there would be no
respect for any school-based judgments unless they were explicitly
linked to the external examination and, by implication, the auth-
ority of the wider Hong Kong community. This need for respect
and authorization by the outside rather than the ‘inside’ community
of practice is not an unexpected finding in Hong Kong. Davison
and Tang (2003) also reported widespread teacher concern about
lack of respect and trust for teachers as assessors, with most schools
closely monitoring internal teacher-based assessment through a var-
iety of surveillance mechanisms, including collecting books and
evaluating, in the teachers’ words:

• ‘the effort that we put into marking students assignments’;
• ‘whether a teacher has completed a fixed number of assignments and whether he or she has marked each one of them intensively’;
• ‘whether the teacher has done his [or her] job properly in terms of quantity or quality’; and
• ‘the type of tasks given and the way they are marked and assessed.’

Even more striking is the way in which the Hong Kong teachers in
this study – unlike the Melbourne teachers – expressed their concern about their lack of authority and influence as teachers, as can
be seen in Extract 7. Here, R makes the extraordinary statement
that his marking – a dominant and time-consuming feature of his
daily routine – is ‘pretty negligible’ (line 8), because the grades and
comments have no effect on learning or teaching in his school.

Extract 7: Teaching vs Assessing

R: ‘I think you need to know whether we believe this is a paper exer-


cise and we are doing it simply to justify their writing in the first
place or whether the feedback we give them has a lot of bearing on
how they write the next time round. The more I do this, the more
5 I believe this is a justification process, that we mark to show them
we have been there and we encourage them [unintelligible] and to
give them a grade. We don’t mark. . . Any marking we do is pretty
negligible. . . It’s quite sad that should be the case. I believe we
waste a lot of our time marking when we should be giving back
10 their process writing or getting them to write journals or all sorts
of other things. So, my faith in the system is pretty low. And there-
fore, my faith in how I mark and the devotion I put into marking
is very low, and therefore I am very frequently pretty superficial in
the way I mark because I don’t believe it’s going to make the slight-
15 est bit of difference. I am sorry that it should be the case. I really
wish it was going to improve the students’ writing, but it isn’t the
case.’

Teachers reported a near universal desire for more time simply to ‘enjoy’ the work being produced and to respond to it as professionals. The notion of teacher desire as a necessary ingredient in
school-wide reform has been highlighted by Hargreaves (1994: 11)
in relation to North American teachers:

Political and administrative devices for bringing about educational change usually ignore, misunderstand or override teachers’ own desires
change usually ignore, misunderstand or override teachers’ own desires
for change. Such devices commonly rely on principles of compulsion,
constraint and contrivance to get teachers to change. They presume that
educational standards are low and young people are failing or dropping
out because the practice of many teachers is deficient or misdirected. The
reason why many teachers are like this, it is argued, is that they are
either unskilled, unknowledgeable, unprincipled or a combination of all
three.

Lists of assessment criteria and one-size-fits-all assessment frameworks can perhaps be seen as a justifiable bureaucratic response to
this apparent lack of teacher knowledge and skill, especially in sit-
uations like Hong Kong where the educational community is simply
too fragmented and diverse and the stakes too high to rely on im-
plicit assessment constructs. However, if we assume that even
unskilled teachers actually desire more knowledge and skill – and are principled in their approach to assessment – then there is an
alternative to criterion-referenced and construct-referenced teacher-
based assessment. This approach sees teacher disagreement and
conflict not as a threat to validity but as the actual source of val-
idity, as is explained below.

IV An emerging framework for describing teacher assessment beliefs, attitudes and practices
The results of this comparative study of very different assessment
contexts were the catalyst for the development of an embryonic
framework or cline for mapping teacher assessment beliefs, attitudes
and practices and the extent to which they are criterion-referenced
or construct-referenced. This framework (see Table 1) first emerged
from the data collected in this research and is being refined and de-
veloped through ongoing analysis of the different orientations and
discourses of Hong Kong (HK) and Melbourne (A) teachers (see
also Davison and Tang, 2003; Davison and Williams, 2003).
Teachers’ assessment orientations – classified along a cline from assessor as technician, to interpreter of the law, to principled yet pragmatic professional, to arbiter of ‘community’ values, to assessor as God – can be mapped according to their discourse positioning
towards different aspects of assessment: task, process, product,
‘validity/reliability’ and assessor needs, including for support and/or training. The framework also provides a mechanism to describe more systematically the effects on teachers of different sorts of
assessment approaches, including norm-, construct- and criterion-referenced, and the interaction of these frameworks with their
professional knowledge.
At one extreme you have the assessor-technician who ticks
the boxes and whose assessment discourse is bound by the criteria.
The underlying view of the assessment process is very mechanistic,
procedural, seemingly universalized, and exemplified in comments
such as ‘I just follow the criteria’ (A9) and ‘I tick the boxes’ (A5).
The discourse appears oblivious to the human dimension of assess-
ment, entirely text-focused (‘I just read the words’; A6) and untrou-
bled by inconsistencies, unless they lie in the criteria themselves.
Less dominated by criteria is the ‘interpreter of the law’, legalistic
but willing to make localized accommodations, albeit in a rather
de-personalized, codified and culturally detached way. There is strong
focus on the text but some awareness of the student: ‘I would like to
give a higher grade but I can’t because of the criteria’ (A10). Incon-
sistencies in judgment may be seen as a problem or a threat to reli-
ability, but the solution is to require better assessor training to
interpret criteria more accurately. Such assessment discourses are, not
surprisingly, more characteristic of Australia than Hong Kong.
At the other extreme is the ‘assessor as God’, an assessment dis-
course that is community-bound, reliant on implicit, inarticulate
norms of reference, exemplified in comments such as ‘it’s hard to
specify. . .’ (HK3), and a highly personalized, intuitive approach to
Table 1 Teacher-based assessment: a cline of assessor beliefs and practices

Assessor as technician
• View of the assessment task: Criterion-bound, e.g., ‘It’s a checklist . . .’ (A6)
• View of the assessment process: Mechanistic, procedural, automatic, technical, seemingly universalized, e.g., ‘I just follow the criteria’ (A9); ‘I ticked the boxes’ (A5); ‘The criteria are just there, so it’s really easy’ (A6)
• View of the assessment product: Text-focused
• View of inconsistencies: Seemingly unaffected by inconsistencies
• View of assessor needs (e.g., for support/training): Need better assessment criteria

Assessor as the interpreter of the law
• View of the assessment task: Criteria-based, e.g., ‘It’s just like a driving test’ (A10)
• View of the assessment process: De-personalized, explicit, codified, legalistic, culturally detached, e.g., ‘I have to be legalistic . . .’ (A5); ‘I would like to give a higher grade but I can’t because of the criteria’ (A10)
• View of the assessment product: Text-focused, but awareness of student
• View of inconsistencies: Inconsistencies a problem, threat to reliability, e.g., ‘I worry when I make judgment . . . am I interpreting the criteria correctly?’ (A2)
• View of assessor needs: Need better assessor training (to interpret criteria)

Assessor as the principled yet pragmatic professional
• View of the assessment task: Criteria-referenced, but localized accommodations
• View of the assessment process: Principled, explicit but interpretative, attuned to local cultures/norms/expectations, e.g., ‘It’s very complex and ultimately you have to give more weight to one thing than another, it comes down to professional judgement’ (A4)
• View of the assessment product: Text and student focused
• View of inconsistencies: Inconsistencies inevitable, cannot necessarily be resolved satisfactorily, teachers need to rely on professional judgement, e.g., ‘I have to juggle things, weight them up in my own mind and think what the alternatives are’ (A1)
• View of assessor needs: Need more time for moderation and professional dialogue (to make basis of judgments more explicit)

Assessor as the arbiter of ‘community’ values
• View of the assessment task: Community-referenced, e.g., ‘It has to be like on the exam’ (HK1)
• View of the assessment process: Personalized, implicit, highly impressionistic, culturally bound, e.g., ‘Can this child cope?’ (A5); ‘You’re saying this child is not tertiary material’ (A5)
• View of the assessment product: Student-focused
• View of inconsistencies: Inconsistencies a problem, threat to validity, assessor training needs to be improved, e.g., ‘I think would my colleagues accept this as an A?’ (HK2)
• View of assessor needs: Need ‘better’ assessors (to uphold standards)

Assessor as God
• View of the assessment task: Community-bound, e.g., ‘It’s hard to specify . . .’ (HK3)
• View of the assessment process: Personalized, intuitive, beyond analysis, e.g., ‘You just know . . .’ (HK5); ‘She’s just got it . . .’ (A3)
• View of the assessment product: Student-focused
• View of inconsistencies: Seemingly unaffected by inconsistencies
• View of assessor needs: System not open to scrutiny, not accountable, operated by the ‘chosen’ few

Key: A = Melbourne teachers; HK = Hong Kong teachers.



assessment that is beyond analysis, evident in asides such as ‘You just know. . .’ (HK5) and ‘She’s just got it. . .’ (A3). Seemingly unaffected by inconsistencies, this view of assessment is closed to widespread scrutiny, and gates are guarded by the carefully anointed few.
Less construct-bound is the ‘arbiter of community values’, whose
judgments are both community and student referenced, although
still personalized and implicit, and highly contextualized. This view
of assessment is evident in appeals to community norms, such as
‘Can this child cope?’ and ‘You’re saying this child is not tertiary material’ (A5), and in judgments that foreground the human dimensions
of assessment; for example, ‘I give him or her the mark they
deserve’ (HK2). Inconsistencies may be seen as a problem and a
threat to validity, but the assumption is that ‘better’ assessors are
needed in order to maintain standards. Such assessment discourses
are more characteristic of Hong Kong than Australia, but not
exclusively so, as can be seen by A5, who is R in Extracts 1–3 above.
The findings of this study also suggest a middle ground, an alternative approach to teacher-based assessment by teachers that takes
account not only of common assessment criteria and community
constructs, but also the learner and the context. It assumes that the
assessor-teacher is attuned to local cultures and expectations, yet is
keen to articulate and interpret community norms, to make explicit
their own and others’ underlying criteria and to hold them up for
critique. Inconsistencies in assessment judgments are seen as inevi-
table and cannot necessarily be resolved satisfactorily. Teachers seek
dialogue with others as a way of reaching a principled and pro-
fessional decision: ‘I have to juggle things, weigh them up in my
own mind and think what the alternatives are’ (A1).
This approach – which I term classroom-referenced assessment – assumes assessment is embedded in classroom practice, but that
classrooms are not isolated and decontextualized units. It assumes
that teachers are professionals – not autonomous individuals, but social actors – whose assessment decisions are grounded in complexity and conflict. In this sense, then, conflicts are seen as an
inevitable part of the educational endeavour, that is, the muddy
‘swamp’ (Schön, 1987) of professional life and not necessarily solv-
able. Teachers comfortable with this approach see their assessment
role as that of an active negotiator, ‘a dilemma-manager. . . a broker
of sorts’ (Lampert, 1985: 192, 190), balancing a variety of interests
that need to be satisfied in the assessment process and accepting
conflict as ‘endemic and even useful . . . rather than seeing it as a
burden that needs to be eliminated’ (Lampert, 1985: 192). Teachers
may at times exhibit characteristics or behaviours across the cline,
revealing multiple and often conflicting attitudes and beliefs, even in
the same assessment situation (e.g., A5, R; Table 1) as they draw on
different assessment ‘discourses’ to make their judgments. However,
this instability needs to be recognized as a continuing condition that
will lead to greater professional understanding, and as the only real
source of validity and reliability in school-based assessment. Criteria
and community are idealized by their different proponents; they give the illusion of certainty and stability. However, as this study
reveals, teachers are first and foremost human beings. Teacher-
based assessment needs to be seen as an intrinsically subjective,
ideological, multi-dimensional and context-dependent process
(Birenbaum and Dochy, 1996; Broadfoot et al., 2001).
This analysis of different assessment orientations has significant
implications for teacher support and development. If a system puts
all its faith in fixed, published and imposed assessment processes
and criteria – and assumes that no other teacher support is needed – then it will inevitably construe the teacher as a technician
who is supposed to ‘tick the boxes’ and produce a seemingly objec-
tive result. But at what cost is this achieved? Adapting the words of
Jones and Moore (1993: 388), this view of teachers and teaching can
be seen as a move from a cultural to a technical mode of control
over teacher expertise, and from a professional to a technician
model for the role and status of the practitioner:

By decontextualizing [assessment] ‘skills’ and abstracting them from their constitutive cultural practices, these reductive procedures construct partial, disembedded representations of the complex social interactions
[involved in assessment] . . . a set of operations to be implemented and
followed as instructed. Its practitioners have no automatic access to its
theoretical base or means of critical examination.

In other words, the educational system either assumes a prior, shared understanding of curricular and assessment goals by its
teachers or it does not consider such shared understanding to be sig-
nificant in making assessment decisions, and hence does not provide
opportunities for extensive moderation and professional dialogue.
The lack of attention to the underlying social context – and the very different and even conflicting cultural and belief systems of teachers and their assessment processes – may lead to teachers
recontextualizing, distorting or transforming published assessment
criteria, or relying on their own unexamined, unarticulated assumptions about shared constructs.
However, if the system leaves teachers to resolve conflict on their
own, it will lead to even greater stress and anxiety (see Arkoudis
and O’Loughlin, this issue), and inevitably less trust within the
wider community. In order to do the work of assessment, teachers
need to have the resources to make decisions when confronted with
equally weighted alternatives. They need structured opportunities to
exchange advice with others but also to know what to do when that
advice is contradictory, or when it contradicts knowledge that only
can be gained in a particular context. As Lampert (1985: 193)
argues, ‘one needs to be comfortable with a self that is complicated
and sometimes inconsistent.’ Even when it is possible to establish
common understandings of the task, high-quality publicly-agreed
and explicit assessment criteria, and strong moderation processes,
teacher interpretation will always be needed. This should be seen as
a strength and not a weakness of teacher-based assessment.

V Conclusions
Wiliam (2001a) argues that high-quality educational provision
demands that teachers are involved in the summative assessment of
their students. This study reveals that teachers in Hong Kong and
Australia have very different approaches to assessment and very dif-
ferent ‘assessment’ problems. However, there is an urgent need in
both ‘old’ and new teacher assessment contexts to provide more
opportunities for teacher interaction around assessment issues.
Teachers need explicit, high-quality assessment criteria as a framework for dialogue. They also need time and space to develop a
sense of ownership and common understanding of the assessment
process and to articulate and critique their often implicit constructs
and interpretations. Such teacher interactions are also necessary to
help all stakeholders develop a more informed perspective on
teacher assessment practices and to establish the key ingredients for
validity and reliability in teacher-based assessment: dialogue and
trust.

VI References
Ajzen, I. 1988: Attitudes, personality and behavior. Milton Keynes: Open
University Press.
Alderson, J.C. and Wall, D. 1993: Does washback exist? Applied Linguistics 14, 115–29.
Andrews, S. 1994: The washback effect of examinations: its impact upon curriculum innovation in English language teaching. Curriculum Forum 4, 44–58.
Birenbaum, M. and Dochy, F.J., editors, 1996: Alternatives in assessment of achievement, learning processes and prior knowledge. Boston, MA: Kluwer.
Black, P. and Wiliam, D. 1998: Assessment and classroom learning. Assessment in Education 5, 7–74.
Board of Studies 1999: Assessment advice for school-assessed Common Assessment Tasks for 1999. Melbourne: Board of Studies.
Breen, M., Barratt-Pugh, C., Derewianka, B., House, H., Hudson, C., Lumley, T. and Rohl, M., editors, 1997: Profiling ESL children: how teachers interpret and use national and state assessment frameworks. Canberra: Department of Employment, Education, Training and Youth Affairs.
Brindley, G. 1991: Defining language ability: the criteria for criteria. In Anivan, S., editor, Current developments in language testing. Singapore: South East Asian Ministries of Education Organization.
—— 1998: Outcomes-based assessment and reporting in language learning programs: a review of the issues. Language Testing 15, 45–85.
—— 2001: Outcomes-based assessment in practice: some examples and emerging insights. Language Testing 18, 393–407.
Broadfoot, P., Osborn, M., Sharpe, K. and Planel, C. 2001: Pupil assessment and classroom culture: a comparative assessment study of the language of assessment in England and France. In Scott, D., editor, Curriculum and assessment. Westport, CT: Ablex, 41–61.
Brown, A. 1995: The effect of rater variables in the development of an occupation-specific language performance test. Language Testing 12, 1–15.
Cheng, L. 1998: Does washback influence teaching? Implications for Hong Kong. Language and Education 11, 38–54.
Clarke, S. and Gipps, C. 2000: The role of teachers in teacher assessment in England 1996–1998. Evaluation and Research in Education 4, 38–52.
Claxton, G. 1995: What kind of learning does self-assessment drive? Developing a ‘nose’ for quality: comments on Klenowski. Assessment in Education 2, 339–43.
Cooper, M. and Holzman, M. 1983: Talking about protocols. College Composition and Communication 34, 284–95.
Cumming, A. 1990: Expertise in evaluating second language compositions. Language Testing 7, 31–51.
Curriculum Development Council 2000: Learning to learn: the way forward
in curriculum development. Consultation document. Hong Kong:
Government Printer.
—— 2002: English language education: key learning area curriculum guide: Primary 1 – Secondary 3. Hong Kong: Government Printer.
Curriculum Development Institute 2003: School policy on assessment; available online at: http://cd.emb.gov.hk/basic guide/BEGuideeng0821/chapter05.html (March 2004).
Davison, C. 1998: ‘It’s your opinion that counts’: written argument and ESL students in secondary English. Unpublished PhD dissertation, La Trobe University.
—— 1999: Missing the mark: the problem with benchmarking ESL in Australian schools. Prospect 14, 66–76.
—— 2001: Current policies, programs and practices in school ESL. In Mohan, B., Leung, C. and Davison, C., editors, English as a second language in the mainstream: teaching, learning and identity. Harlow: Longman Pearson, 30–50.
Davison, C. and McKay, P. 2002: Counting and dis-counting learner group variation: English language and literacy standards in Australia. Journal of Asian-Pacific Communication 12, 77–94.
Davison, C. and Tang, R. 2001: The contradictory culture of school-based
assessment: teacher assessment practices in Hong Kong and
Australia. Paper presented at AAAL Conference, St Louis, MO.
—— 2003: Assessing in the swamp: formative assessment in Hong Kong
secondary school English. Paper presented at AAAL Conference,
Arlington, VA.
Davison, C. and Williams, A. 2003: Learning from each other: critical
connections. Studies of child English language and literacy development
K-12. Volume 1. Melbourne: Language Australia.
DeRemer, M.L. 1998: Writing assessment: raters’ elaboration of the rating task. Assessing Writing 5, 7–29.
Freeman, D. 2002: The hidden side of the work: teacher knowledge and learning to teach. Language Teaching 35, 1–13.
Gipps, C. 1994: Beyond testing: towards a theory of educational measure-
ment. London: Falmer Press.
Hamp-Lyons, L. 1989: Raters respond to rhetoric in writing. In Dechert, H.
and Raupach, G., editors, Interlingual Processes. Tübingen: Gunter
Narr, 229–44.
Hargreaves, A. 1994: Changing teachers, changing times. London: Cassell.
Harré, R. 1993: Social being. Oxford: Basil Blackwell.
Harré, R. and Van Langenhove, L. 1999: The dynamics of social episodes.
In Harré, R. and Van Langenhove, L., editors, Positioning theory:
moral contexts of intentional action. London: Blackwell, 1–13.
Hong Kong Examinations and Assessment Authority 2003: Strategic review of the Hong Kong Examinations and Assessment Authority. Final consultancy report, May 2003; available online at: http://www.hkeaa.edu.hk/doc/isd/Strategic Review.pdf (March 2004).
Huot, B. 1990: Reliability, validity and holistic scoring: what we know and what we need to know. College Composition and Communication 41, 201–13.
Jones, L. and Moore, R. 1993: Education, competence and the control of expertise. British Journal of the Sociology of Education 14, 385–97.
Lampert, M. 1985: How do teachers manage to teach? Perspectives on problems in practice. Harvard Educational Review 55, 178–94.
Lantolf, J. and Frawley, W. 1985: Oral proficiency testing: a critical analysis. Modern Language Journal 69, 337–45.
—— 1992: Proficiency: understanding the construct. Studies in Second Language Acquisition 10, 181–96.
Lave, J. and Wenger, E. 1991: Situated learning, legitimate peripheral
participation. Cambridge: Cambridge University Press.
Lumley, T. 2001: The process of the assessment of writing performance: the rater’s perspective. Unpublished PhD dissertation, University of Melbourne.
Mabry, L. 1999: Writing to the rubric: lingering effects of standardized testing on direct writing assessment. Phi Delta Kappan 80, 673–79.
Moore, H. 1996: Telling what is real: competing views in assessing English as a second language development. Linguistics and Education 8, 189–228.
Nicholas, H. 2002: Is there progress in defining standards? Australian Review of Applied Linguistics 23, 79–88.
O’Loughlin, K. 1994: The assessment of writing by English and ESL teachers. Australian Review of Applied Linguistics 17, 23–44.
Popham, W. 2000: Modern educational measurement: practical guidelines
for educational leaders, 3rd edition. Boston: Allyn and Bacon.
Pula, J.J. and Huot, B. 1993: A model of background influences on
holistic raters. In Williamson, M.M. and Huot, B., editors, Vali-
dating holistic scoring for writing assessment. Cresskill, NJ: Hampton
Press.
SCOLAR (Standing Committee on Language Education and Research) 2003: Action plan to raise language standards in Hong Kong; available online at: http://cd.emb.gov.hk/scolar/html/new_index_en.htm (April 2004).
Schön, D.A. 1987: Educating the reflective practitioner. London: Jossey-Bass.
Smagorinsky, P. 1989: The reliability and validity of protocol analysis. Written Communication 6, 463–79.
Smith, M. 1991: Meanings of test preparation. American Educational Research Journal 28, 521–42.
Vaughan, C. 1991: Holistic assessment: what goes on in the raters’ mind?
In Hamp-Lyons, L., editor, Assessing second language writing in
academic contexts. Norwood, NJ: Ablex.
Victorian Curriculum and Assessment Authority 2003: Approaches to the administration of school-assessed coursework; available online at: http://www.vcaa.vic.edu.au/vce/exams/index.html (April 2004).
Watanabe, Y. 1996: Investigating problems of washback in Japanese EFL classrooms: problems of methodology. In Wigglesworth, G. and Elder, C., editors, The language testing cycle: from inception to washback. Australian Review of Applied Linguistics Series S, No. 13. Canberra: ANU Printing Services, 208–39.
—— 2000: Washback effects of the English section of Japanese university entrance examinations on instruction in pre-college EFL. Language Testing Update 27, 4–27.
Weigle, S.C. 1994: Effects of training on raters of ESL compositions. Language Testing 11, 197–223.
Wiliam, D. 1994: Assessing authentic tasks: alternatives to mark schemes. Nordic Studies in Mathematics Education 2, 129–41.
—— 1996: Standards in examinations: a matter of trust? Curriculum Journal 7, 293–306.
—— 2001a: An overview of the relationship between assessment and the curriculum. In Scott, D., editor, Curriculum and assessment. Westport, CT: Ablex, 165–81.
—— 2001b: Reliability, validity, and all that jazz. Education 3–13, 29, 17–21.
Appendix 1

Appendix 2 Sample text: Vince

Should the Grand Prix be held in Albert Park?


1 The Government has planned to build a circuit for the For-
mula One Grand Prix at Albert Park.
2 However, it is a hotly debatable issue.
3 Some consider that as it will be the boon to tourism and such
bring about great economic benefits, it is worthwhile in the
extreme.
4 On the contrary, others comment that the very environmental
impact arising from the race on the park may disturb the
nature as well as the residents living nearby.
5 So the plan had better being put aside.
6 Hence, it is difficult to say that which notion is more reason-
able before discussing.
7 To a certain extent the plan seems to be very beneficial for
from the economic point of view it will bring at least ten
thousand temporary jobs and eight thousand and five hundred
jobs.
8 Nowadays, the rate of unemployment in Australia is low.
9 Also, the economy is shattered as well.
10 Therefore, the Grand Prix Promotions put forward their cause
by stating that this plan will surely boost the economy of Mel-
bourne because the money may flow within the country, in-
stead of the outflow to foreign states, is a good finesse.
11 Besides, the supporters of the Grand Prix also claim that the
plan will promote Melbourne to the Melbourne to the world.
12 It is because the television rights to other countries will cer-
tainly let foreigners know Melbourne, thus promoting tourism
and heightening the international reputation of Melbourne as
well as Australia itself.
13 Also it is believed that the building of the Grand Prix atant (?)
will make the lies of Mebournians more colourful because of
the diversity of recreation.
14 Hence, from the above points, the plan seems to be worth-
while.
15 However, to a large extent, it is not the case.
16 Besides the economic benefits one should bear in mind that the
race will surely produce much pollutions, say, air pollution,
sound pollution and land pollutions.
17 The noise arising from the race will make the citizens very
annoyed in a short term.
18 To make matters worse, in a long run, the noise will even make the people deaf.
19 In addition, dust particles coming form the race also endanger
the health of the citizens.
20 What is more, Land pollution, which is a direct result of the
audiences unconsciouness, can make the environment impact,
thus contaminating the surroundings and affect the people’s
health as well.
21 As well it has been criticized that the government has not con-
ducted a proper inquiry into environment impact on the race.
22 The government has been granted an exemption from such an
inquiry by the Governor in Council i.e. the exemption from
the Freedom of Information Act (707).
23 Also the indemnity from compensation for people affected by
injury or other claims cannot appeal to the
24 Worse still, in September, the Vic parliament passed a bill
called the Australian Grand Prix Bill (which) allows the Grand
Prix free from usual planning and environmental rules give the
Melbournians much more perplexity to the problem.
25 Hence, it is just to say that the lack of Government negoti-
ation with the citizens is the crux of the problem.
26 In addition, the huge amount of the initial cost of the building
the Grand Prix has been the focus of criticism.
27 According to the figures cited by the Government, at least $45
million dollars is used to build it.
28 In fact, it is better to use such money to help revive the
Australian recession.
29 By the same token, it is hard to predict whether the Grand
Prix will attract much audience.
30 Only using the money in a normal way can the result be clear
and good for the economy.
31 Furthermore, the plan will inadvertently encourage the illegal
car racing, too.
32 Many drivers may fault as if they were the Formula One driv-
ers, especially after watching the program concerning the race.
33 it is of harmful effect to the youngsters who are prone to
impersonate their idols ie. the heroes of the race.
34 Therefore, the plan may give rise to many unnecessary deaths.
35 In conclusion, after discussing the above, it is hard to say that
the plan is absolutely good or not.
36 One fact we know is that the plan is by far not worthwhile for
the shortcomings arising from the plan is much more and det-
rimental than the pros generated.
37 Thereby, personally I am against this inconsiderate plan.
