Evaluating teaching practice
Shosh Leshem and Rivka Bar-Hama

The evaluation of observed lessons has been the subject of much debate in the field of teacher training. Teacher trainers have tried to define quality in relation to teaching and to find ways to measure it reliably. Can we evaluate the quality of teaching by observable behaviour and measurable components, in which case can the lesson be assessed analytically by the use of discrete criteria? Or does a lesson constitute an entity which cannot be broken into discrete components, so that it has to be assessed impressionistically? We believe that in order to construct a more comprehensive view of the issue, it is pertinent to collaborate with our trainees and provide some space for their voices. Evidence from a small-scale practitioner-based research project reveals that trainees need explicit criteria for effective teaching in order to identify their strengths and weaknesses and use them as guidelines for improvement.

Introduction
This paper presents a three-year practitioner-based research project that emerged from our reflection in action and on action (Schön 1983) as teacher trainers and lecturers in EFL pre-service training programmes in a teacher education college. In the framework of the training programme, one of the core requirements is the practicum, which is the application of the practical pedagogical knowledge acquired during the didactic lessons and workshops. In the literature, the practicum has been viewed as critical to the development of trainees. It is their first hands-on experience with their chosen career. It creates opportunities for trainees to develop their pedagogical skills and it is the best way to acquire professional knowledge and competences as a teacher (Hascher, Cocard, and Moser 2004: 626). During the practicum trainees can put into practice their beliefs based on language learning theories they acquired in the course of their studies. It also serves as a protected field for experimentation and socialization within the profession (Hascher et al.
ibid.) and it allows for evaluation of teachers. Thus it sets the stage for success or failure in student teaching, and a trainee's future in education may be determined by what happens during their training period. These ideas have been mainly expressed by those who design the programmes and are in charge of pre-service teacher training. Trainees, as well, consider the practicum experience as the most significant element in their teacher training (Zeichner 1990). Quite often trainees claim that they benefit more from spending time in the field watching others teach than from attending sessions at the university or colleges. This assertion is supported by Tsui (2003) in her discussion on teachers' personal values and beliefs. She claims that teachers consider classroom experience the most important source of knowledge about teaching.

ELT Journal Volume 62/3 July 2008; doi:10.1093/elt/ccm020. © The Author 2007. Published by Oxford University Press; all rights reserved. Advance Access publication April 13, 2007.

We found that there is a plethora of literature dealing with multiple aspects of the practicum but there is a dearth in the field of practicum assessment. This could be described as surprising given that assessing trainees' practicum is a complex activity, which entails multiple sources of assessment. Each one of these sources provides information about a different aspect of teaching. Furthermore, assessment of the trainees' performances in their practicum has far-reaching implications for their entry into our profession. In order to achieve a comprehensive profile of a trainee we, in our programme, use different sources of assessment such as: reflective journals, portfolios, observation lessons, tests, self-assessment, peer assessment, cooperating teacher assessment, and pedagogical counsellor assessment. However, the final grade for the practicum is based primarily on the grades that trainees receive for their observation lessons.
For the purpose of this study, lesson observation is viewed as a lesson taught by a trainee and observed by a pedagogical counsellor. The observation lesson is a critical component of the practicum. How it is assessed reflects an equally critical issue for both evaluators and evaluees. This issue is the focus of our paper.

The venue
There are two teacher-training EFL programmes at our college: one is a four-year programme, which awards the students both a BEd and a teaching certificate, and the other is a two-year certificate programme for people holding a BA in English. A significant part of both programmes is the practicum. The practicum entails weekly observations of trainees in schools by teacher trainers. At the beginning of the academic year, a trainee is placed in a host school with an experienced English teacher, who is appointed as a cooperating teacher. The main requirement of the trainees in the practicum is to observe their cooperating teachers teach in their classrooms and gradually to start teaching on their own. This usually commences after a short period of getting acquainted with the school. The trainees are assessed informally by their cooperating teachers, who serve more as mentors than as assessors. The formal assessment is carried out at least twice a semester by pedagogical counsellors, who are usually their methodology teachers.

In our programme, observation has two main purposes: trainees' development and accountability. Here, development means improvement of trainees' performance in class by identifying their strengths and weaknesses and by raising their awareness through providing feedback and recommendations. This process can be regarded as formative assessment, since the focus is more on development and progress than on the final product itself. The second purpose, which pertains to accountability, is to determine the trainees' suitability for entry to the educational system.
This in itself creates conflicting perspectives concerning observation and role identity. The message that is conveyed to trainees during the practicum is that it represents a trial and error phase which is integral to their learning and professional development. This is intended to foster an element of trust and openness in the trainee-observer relationship. However, this trust can be impeded by the observer having to act as an inspector and final assessor. Trainees may put on an act in order to satisfy the observer's expectations and gain a higher grade for their conduct. If this happens, then they may sacrifice their own development and rapport with their observer. These contradicting roles of the observer constitute potential problems not only for the trainee but for the observer as well. The latter may feel forced into a situation of assessor due to institutional policy or, at times, national demands, when their preferred tendency is to function as a coach rather than as an assessor.

Pedagogical counsellors use different observational tools to record data of the lessons that they observe. The most common tools are:
1 observation forms;
2 detailed written notes on the lesson;
3 audio-recordings for reinforcement of written notes; and
4 video-recordings for use collaboratively by the trainer and the trainee during the feedback session. They are sometimes used by the trainee at a later stage for further reflection.

Our main tool of assessment is the observation form that consists of several components.
Examples for each component are provided to show a model of what the forms entail:
- instructional components: clarity of instructions, sequence of activities, and classroom management;
- affective components: giving feedback and reinforcement, awareness of students' needs;
- language components: use of L1, oral and written proficiency;
- cognitive components: lesson planning, stating clear objectives, and designing activities to achieve lesson objectives; and
- metacognitive components: ability to analyse the lesson and to reflect upon their professional development.

We are both veteran teacher trainers and department coordinators, and have been counsellors in a wide range of contexts. From our professional experience we realized that the observation forms that were used for assessment were changed from year to year both by us and by our colleagues. Analysing minutes from three years of departmental meetings, we noticed that the issue of the assessment forms appeared regularly on the agenda as a theme requiring modification. Some items were changed due to different approaches, beliefs, worldviews, or experiences of the teachers teaching a particular group that year. However, the changes were not significant and the essence of the evaluation forms has remained the same. We then analysed our personal diaries, where we had recorded comments from trainees and our own queries and impressions. Common comments from trainees expressed operational constraints due to a particular school culture, methodological obligations to the cooperating teacher's style of teaching, and dissatisfaction with grades. This evidence made us ponder upon the issue with our colleagues. We discovered that they shared our discontent about the way that trainees' performance was assessed during the observation lesson.
The feeling that prevailed among us was that, as experienced observers and assessors, we were able to provide an impressionistic value judgement of the trainees' performance. However, when we assessed the lesson according to the benchmarks on the assessment form, we realized that quite often there was a gap between the two results. Three of our colleagues who shared the same professional experience expressed the gap as follows:

Observer 1
While observing I already formulate a grade in my mind. I know that this lesson does not deserve more than 80 per cent, for example. At the end of the lesson I go over the assessment form and grade each item according to the weight allocated. If there are incongruities with my grading, I try to narrow the gap.

Observer 2
I have enough experience to know immediately after the lesson what the grade is going to be. I personally don't really need the criteria and would have preferred to ignore them. However, as I am required to provide a detailed assessment record, I use it and I often get annoyed with the fact that I can't find the criteria that I would like to grade the student on, or I find some of the criteria irrelevant to the context and to my frame of reference.

Observer 3
I have to admit that initially I determine the grade during observation or immediately after that. When I use the assessment sheet, I find that the grade is usually higher. I feel that I cannot take off all the points for a certain criterion and this leads to an accumulated higher grade.

These views reinforced our problem in accepting the reliability of assessment in the observation lesson. Taking into consideration the critical role of the observation lesson in the practicum and in students' professional careers, we felt that it was our responsibility to try and assess our trainees in a way that reflected their performance accurately, reliably, and transparently.
In addition, we realized that the voices of the trainees concerning this issue were not considered and decisions on assessment were top-down. We believed that in order to construct a more comprehensive view of the issue, it was pertinent to collaborate with our trainees and provide some space for their voices (Nunan and Bailey 1996). Moreover, new trends in current assessment demand active student participation in their assessment. This is reinforced by Shohamy (1996) discussing ethical testing and assessment, who sees a need for students to participate actively in the construction and use of tests and assessment systems.

Another problem is that despite each assessor having similar criteria against which to assess the lesson, their interpretation of those criteria is not always identical. Each lesson is assessed by three people: the cooperating teacher, the pedagogical counsellor, and the trainees themselves. However, the weight and the importance allotted by the college to the various assessors are not evenly distributed. Each of the three assessors makes significant contributions to the developmental process of the individual teacher. In terms of the teacher's assessment for the purpose of accountability, the pedagogical counsellor undertakes most of the responsibility and has the final say in grading the trainee, while the others can only slightly affect the grade.

The observation lesson is considered a high-stakes test by the trainees and at times puts them under the tremendous pressure of a major test. It also entails conflicting decisions concerning whose theories to implement: their pedagogical counsellor's, their cooperating teacher's, or their own.

This led us to investigate the following issues:
1 To what extent are we actually assessing quality of teaching through observation?
2 What are the perceptions of our trainees regarding the way of assessment?
Exploring the literature
While surveying the literature we found unsettled perspectives on issues that underpin our questions. There is a general consensus about the importance of observation in the development and assessment of a teacher. This notion is also supported by O'Leary (2004: 14), who claims that 'Traditionally, classroom observation has occupied a prominent role in terms of its use as a tool by which to judge and subsequently promote good practice'. He also advocates a holistic way of assessing. He contends that although it would be naïve to discount classroom observation per se as a useful learning tool for teacher development . . . the existing assessment approach contains a number of inadequacies that directly conflict with the fundamental aims of genuine teacher development. One of his objections is to the assessment of a teacher's ability by using a checklist of subjective criteria. He supports his contention by claiming that:
1 a lesson is a complete entity and cannot be dissected into separate parts;
2 criteria for effective teaching differ for every instructional situation;
3 checklists measure low inference skills and these are limited because they tell us very little about teacher behaviour and the learning process itself;
4 effective teaching manifests itself in high inference skills, which are fundamentally qualitative;
5 adopting a quantitative approach is discouraging and undermining to teachers.

Voices contradicting this approach maintain that observations tend to be subjective, based on the observer's own teaching approach. To attain objectivity, it is argued, we have to develop systematic observation tools. Acheson and Gall (1997) reflect students' feeling of being threatened when they are unaware of the criteria by which they will be judged; thus defined criteria should be provided to lower the level of anxiety among students.
In the same vein, Brooker, Muller, Mylonas, and Hansford (1998) claim that an increased demand for quality and accountability in teacher education programmes requires a criterion-based standard reference framework for assessment. Leung and Lewkowicz (2006: 27) highlight the point of subjective interpretation and contend that, because teachers can interpret assessment criteria differently, the idea that teachers should observe what learners say and do, interpret their work, and then provide guidance for improvement is an uncertain business. Moreover, they claim that teachers' judgements are influenced by wider social and community practices and values and therefore might lead to different perspectives.

As we consider the observation lesson to be a performance test, we found McNamara's (1995) point relevant to our argument even though he does not refer to observation lessons. His assertion is that performance tests that strive to be highly authentic are often extremely complex due to the extraneous social influences on the grade awarded.

We also realized that there is much concern about the reliability of examination scores as determinants of teaching qualifications. Alderson (1991: 12) refers to the fact that we know little about how to measure the progress of learners . . . and that we lack sensitive measures. Broadfoot (2005: 127) is even more extreme in her assertion and claims that we use what are a very blunt set of instruments to intervene in the highly sensitive and complex business of learning. As a result of these diverse views, going to the literature was a journey of mixed blessings. It supported our sense of discomfort and it became apparent to us that our problem warranted attention.

The study
Data were retrieved from questionnaires, interviews, personal diaries, and documents that included minutes from meetings and assessment forms.
A questionnaire was designed to explore the preferences that students had towards how they might be assessed. We drew upon our involvement with the assessment process to draft a simple survey with two closed questions and one open-ended question. To aid completion, the choices that were provided reflected the issues that trainees had mentioned to us regularly. The three questions were:
1 How would you like your pedagogical counsellor to assess your observation lesson? By giving you a fail/pass or a numerical grade?
2 If you chose a numerical grade, how would you like to be assessed: analytically or holistically?
3 Which items on the observation form would you omit and which would you like to add?

We explained to each group that the term 'holistically' implied assessing impressionistically by looking at the lesson as a whole, and that 'analytically' implied using set criteria to assess numerically each aspect of the lesson.

The questionnaire was distributed to trainees of two TEFL courses at a teacher training college. The timing of this corresponded with the end of the academic year, when trainees had already finished their practice teaching duties. The interviews with twenty trainees were conducted after the questionnaires were read and analysed. We concluded from this analysis that it was important to gain a wider set of trainee perspectives and achieve a richer picture of the trainees' reasoning. Thus, we discussed the general responses that had been provided to the questionnaire with twenty randomly chosen trainees.

Population
The study was undertaken with 58 trainees studying on two different programmes:
1 A four-year Bachelor of Education programme in an English department of an academic teacher training college in Israel. The subjects were trainees from the second and third years.
Trainees of this group pursue a study programme which certifies them to teach both general subjects in the trainees' mother tongue (Hebrew or Arabic) and English as a second/foreign language.
2 Second year trainees on a two-year retraining programme. Trainees in this group hold a BA degree and study for a teaching certificate in English. These trainees are usually older than those on the BEd programme.

The subjects constituted three groups:
1 second year trainees from group A,
2 third year trainees from group A, and
3 second year trainees from group B.

Findings
In these findings, none of the groups wanted a verbal grade of pass or fail. All three groups preferred a numerical grade. Two groups (1 and 3) favoured holistic assessment for different reasons. Group 3 preferred this form of assessment as they felt they did not need the criteria to analyse the lesson. They claimed to be competent enough to analyse their lesson and reflect upon it independently without specific criteria. By that time in their training they were much more confident in their teaching and assessment. Group 1 chose the holistic approach for the opposite reasons. They justified their choice by lack of confidence and fear. They felt intimidated by the use of clear-cut criteria to analyse their lesson. They actually preferred the unspecified nature of the holistic approach to a lesson being dissected by specific teacher behaviours. Group 2 chose the analytical approach. They explained that they saw the function of the criteria as guidelines to help them focus and construct better lessons. They claimed that the criteria helped them identify weaknesses and strengths and thus contributed to their pedagogical knowledge and their professional development. In terms of assessment, they felt that this approach was more reliable since assessing according to set criteria is more objective.
Evidence from the interviews showed how trainees' voices reflected their choices:

In favour of specified criteria on the observation forms
The items on the form helped me remember what was discussed when I had to write a reflective journal on my lesson. I find them very useful. They were like post signs for me. The whole form is like an outline for a lesson plan. It gives me a clear picture of what was good and what needs to be worked on. It really gives you a picture where you are and what to focus on next time. The criteria help you see the process. I can compare the form of my first observation and the second one and know exactly where I improved. It gives you a fairer picture of the evaluation. I do not like vagueness. I have to see how many points have been taken off or given for each item.

Not in favour of specified criteria on the observation forms
There are too many details to process. I can't focus on all the items. It confuses me. I would rather focus on one or two features of the lesson. The criteria should be more general and not so detailed. It is too technical and robot like. I feel as if my lesson has been put under a microscopic lens and it does not really depict the dynamics of the lesson.

The following were some of the suggestions from the open-ended question:
1 Acknowledgement within the items of originality and risk taking.
2 Credit for preparing extra time activities in their lesson plan.
3 Evidence of improvement from previous observations.
4 Awareness of the teacher's action zone.

Insights and conclusions
Teaching is a web of interrelated dimensions. Some are clearly observable and others are not. As a consequence, the assessment of teaching quality through observation entails an internal paradox. This paradox encapsulates our initial urge to re-examine our own practice.
Our research questions related to the extent to which quality of teaching is assessed through criteria-based observation and we found that our students felt that it was a valid method of assessment. Although all trainees voted for numerical assessment, there were differences between trainees in the choice between holistic or analytical approaches, with the majority choosing the holistic approach. The fact that none of our subjects chose the fail or pass as evaluation criteria accords with Kennedy's assertion that trainees prefer to receive a numerical grade for the observed lesson (Kennedy 1993). This may be a result of conditioning, of trainees' upbringing, and the constraints of social demands and norms. However, a numerical grade on its own did not seem to be satisfactory, as it did not provide explicit feedback on their performance. The trainees who were in favour of the holistic approach needed the stated criteria on the assessment form to aid discussion during feedback sessions and to provide signposts for further reflection. Yet, they did not want to be assessed analytically where each criterion was allotted numerical points, in spite of this approach enhancing reliability and transparency.

Our small-scale investigation demonstrated that trainees at their initial stages of teaching perceive the lesson as separate parts and not as a whole entity. The sum of the parts represents quality of teaching. Trainees need explicit criteria for effective teaching in order to identify the quality of their teaching. Their preferences for assessment show that they regard the observation lesson as both a test and a means for reflection and professional development.

These conclusions are situated in the limited context of just one practicum experience; thus they cannot have wide implications. However, as teachers researching our own field of practice, we gained deeper understanding and insights into a troublesome issue.
Our findings represent insights of an exploratory nature and they support the claim that quality and accountability should be achieved through explicit and objective criteria.

Final version received October 2006

References
Acheson, K. A. and M. D. Gall. 1997. 'Techniques in clinical supervision of teacher' in E. Pajak. Approaches to Clinical Supervision: Alternatives for Improving Instruction. Norwood: Christopher-Gordon Publishers, Inc.
Alderson, C. 1991. 'Language testing in the 1990s: how far have we come? How much further have we to go?' in S. Anivan (ed.). Current Developments in Language Testing. Singapore: SEAMEO Regional Language Center.
Broadfoot, P. 2005. 'Dark alleys and blind bends: testing the language of learning'. Language Testing 22: 123–41.
Brooker, R., R. Muller, A. Mylonas, and B. Hansford. 1998. 'Improving the assessment of practice teaching: a criteria and standards framework'. Assessment and Evaluation in Higher Education 23/1: 5–25.
Hascher, T., Y. Cocard, and P. Moser. 2004. 'Forget about theory – practice is all? Student teachers' learning in practicum'. Teachers and Teaching: Theory and Practice 10/6: 623–37.
Kennedy, J. 1993. 'Meeting the needs of the teacher trainees on teaching practice'. ELT Journal 47/2: 157–65.
Leung, C. and J. Lewkowicz. 2006. 'Expanding horizons and unresolved conundrums: language testing and assessment'. TESOL Quarterly 40/1: 211–34.
McNamara, T. 1995. 'Modelling performance: opening Pandora's box'. Applied Linguistics 16/2: 150–79.
Nunan, D. and K. M. Bailey (eds.). 1996. Voices from the Language Classroom. Cambridge: Cambridge University Press.
O'Leary, M. 2004. 'Inspecting the observation process: classroom observations under the spotlight'. IATEFL Teacher Development SIG 1/4: 14–16.
Schön, D. 1983. The Reflective Practitioner. San Francisco: Jossey-Bass.
Shohamy, E. 1996. 'Language testing: matching assessment procedures with language knowledge' in M. Birenbaum and F. J. R. C. Dochy (eds.). Alternatives in Assessment of Achievements, Learning Processes, and Prior Knowledge. Dordrecht, Netherlands: Kluwer Academic.
Tsui, A. B. M. 2003. Understanding Expertise in Teaching: Case Studies of Second Language Teachers. Cambridge: Cambridge University Press.
Zeichner, K. M. 1990. 'Changing directions in the practicum: looking ahead to the 1990s'. Journal of Education for Teaching 16/2: 105–32.

The authors
Shosh Leshem is involved in teaching and teacher education in Israel. Her publications are in the area of teacher training and language teaching methodology. She is currently teaching at Haifa University and Oranim, Academic School of Education. She is also a visiting lecturer at Anglia Ruskin University in the UK, focusing on doctoral processes from an ethnographic perspective.
Email: shosh-l@zahav.net.il

Rivka Bar-Hama is involved in teaching and teacher education in Israel. Her publications are in the area of teaching English as a foreign language and teacher training and focus on testing and assessment. She taught at Haifa University and is currently head of the English Department at Gordon Academic College of Education.
Email: rivkab@macam.ac.il