
Studies in Educational Evaluation 49 (2016) 15–29


Studies in Educational Evaluation


journal homepage: www.elsevier.com/stueduc

Classroom observation for evaluating and improving teaching: An international perspective

Felipe Martinez^a, Sandy Taut^b,*, Kevin Schaaf^a

^a University of California, Los Angeles, United States
^b Pontificia Universidad Católica de Chile, Chile

ARTICLE INFO

Article history:
Received 21 July 2015
Received in revised form 14 March 2016
Accepted 18 March 2016
Available online 15 April 2016

Keywords:
Classroom observation
Teacher appraisal
Teacher development
Instructional quality
Professional development
Evaluation systems

ABSTRACT

Teacher evaluation and development policies around the world are undergoing significant reform. Classroom observation often carries a considerable weight in teacher appraisal and improvement systems, and provides the critical formative anchor informing professional development. This study examined a purposively selected sample of sixteen classroom observation systems in six countries, including high performing Singapore and Japan, regional exemplar Chile, the three largest school districts in the United States, and other interesting examples in Australia, Germany and the United States to add diversity to the sample. The study offers an analytic framework for understanding classroom observation systems across contexts, distinguishing conceptual, methodological and policy aspects that shape these systems. The sixteen systems were remarkably consistent in their stated overall purposes, but there was variation in terms of how they operationalized good teaching, the degree of standardization of the observation process, emphasis on validation, and information uses. The paper describes and discusses these characteristics in order to help researchers and policymakers reflect on the available options and take more informed decisions in designing classroom observation for evaluating and improving teaching.
© 2016 Elsevier Ltd. All rights reserved.

* Corresponding author at: Pontificia Universidad Católica de Chile, Escuela de Psicología, Centro de Medición MIDE UC, Avda Vicuña Mackenna 4860, Macul, Santiago, Chile.
E-mail address: staut@ucla.edu (S. Taut).

http://dx.doi.org/10.1016/j.stueduc.2016.03.002
0191-491X/© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Teacher evaluation and development policies are in the midst of substantial reform in education systems around the world. Such calls for reform tend to reflect continuing concerns about shortcomings in educational quality and equity, reflected in state, national, or international student achievement testing results, either in absolute terms or in relation to peer districts, states, or countries (Feuer, 2012). International reviews also reflect growing interest in revamping teacher evaluation and development at local, state, and national levels, with a focus on supporting valid inferences about teachers and providing useful feedback to help them improve their practices in the classroom (OECD, 2013a, 2013b; Santiago & Benavides, 2009). In the United States, in particular, growing interest in teacher evaluation reflects policy shifts away from highly qualified teachers to a notion of highly effective teachers (Goe, Bell, & Little, 2008). In the context of federal guidelines that require that districts and states develop comprehensive approaches for evaluating teacher effectiveness, a great deal of attention and discussion among educators, statisticians, and policymakers has focused on the use of aggregates of student achievement to evaluate teachers, typically through so-called Value Added Models, or VAMs (for reviews see e.g. Baker et al., 2010; Braun, 2005).

While the attention paid to VAMs in the research literature and the press could suggest otherwise, student achievement is rarely the driving indicator for teacher evaluation. In most cases, measures of teacher practice based on direct observation of teachers in classrooms receive an equal or larger weight. An assumption common to most teacher evaluation systems around the world is that classroom practice is the key mediator between education policies and student outcomes (Schleicher, 2011), and classroom observation remains the method of choice (a de facto gold standard) for gaining systematic insight into these practices in their natural setting (Kennedy, 2010; Stallings, 1977). In the context of teacher evaluation policies, observation-based assessments of teachers in the classroom are seen as key both for understanding the mechanisms linking classroom processes and desired improvements in student outcomes, and for informing formative and developmental feedback to guide teacher improvement efforts

(Kane & Staiger, 2012; OECD, 2013a, 2013b; Reform Support Network, 2012).

Yet, even as teacher development and evaluation policies receive unprecedented attention around the world, systematic study of classroom observation in this context has been scarce. Direct observation of the work of teachers in the classroom has been a staple of research on teaching for nearly a century (see Medley & Mitzel, 1963). However, educators and policymakers need to understand how evaluative contexts influence the information collected through classroom observation, and how it is used to inform inferences about, and feedback to teachers. With this paper we start to fill this gap by investigating how a sample of notable and varied regional and national education systems around the world apply classroom observation to inform teacher evaluation and professional development. Specifically, by adopting a comparative lens for data collection and analysis, we seek to a) outline an analytic framework for classroom observation systems based on the conceptual, methodological, and contextual commonalities and differences to help bring cohesion to the research literature, and b) to support policymakers and system designers in systematically considering key questions involved in the design of classroom observation systems.

2. Theoretical background

Large-scale policy initiatives on teacher evaluation and professional development have traditionally been informed by two different models. The standards-based model emphasizes explicit frameworks to model quality instruction and classroom practice (Peterson, 2000), while the outcome-based model privileges productivity in terms of student achievement and other relevant outcomes (Kennedy, 2010). These two models have started to converge in recent years, as a new crop of systems of teacher evaluation and development combines an emphasis on student achievement with explicit and detailed models of instructional practice (Kane & Staiger, 2012). Importantly, despite the growing international consensus around the importance of evaluating and improving teaching, the specific characteristics of a system depend on a series of factors that include assumptions about the goals and purposes, stakes and incentives attached to the evaluation results, frameworks or models that define the domains and criteria to be assessed, and methods used to collect and analyze information about these domains (OECD, 2013a, 2013b).

A common feature in recent efforts to revamp teacher evaluation across countries and contexts is a focus on formative uses of information. For example, the federal Race to the Top legislation in the U.S. calls for data systems “that inform teachers and principals about how they can improve instruction” (U.S. Department of Education, 2009), while in Australia the national agreement on teacher performance seeks to promote professional conversations that improve teaching (Australian Institute for Teaching and School Leadership (AITSL), 2016). Similarly, Chilean legislation sets up a formative teacher evaluation system focused on “improving the pedagogical work of educators and promoting their continuous professional development” (Law, 2004, Law No. 19.961). This formative focus has renewed interest in classroom observation as a method for collecting information to support improvement efforts regarding teaching quality.

Classroom observation is a powerful tool that offers an unobstructed view of classroom practice (Millman & Darling-Hammond, 1990) and allows us to understand how teachers teach within a realistic context (Putnam & Borko, 2000). Although observation protocols can be found in the research literature since at least the 1920s (Stallings, 1977), classroom observation enjoyed widespread popularity as a method of large scale data collection in the 1950s and 1970s, when studies investigating the validity of observation protocols became a staple of education research (see e.g. Brophy & Good, 1986; Medley & Mitzel, 1963; Shavelson, Webb, & Burstein, 1986; Stallings, 1977). Validation research is again being conducted, as a new generation of researchers works to develop observation protocols and instruments to help us understand the relationship between classroom practices, teacher effectiveness, and student achievement (Bell et al., 2012; Hill, Charalambous, & Kraft, 2012; Mihaly, McCaffrey, Staiger, & Lockwood, 2013).

Growing interest in classroom observation also crucially reflects the widening reach of teacher evaluation policies around the world (Bruns & Luque, 2014). Despite the ubiquity of efforts to use student achievement to evaluate teachers, in practice most teachers in the United States are evaluated through procedures that rely heavily on classroom observation (Kane & Staiger, 2012; Loup, Garland, Ellett, & Rugutt, 1996). Indeed, all states granted funding under the new Race To The Top legislation in the United States included a new or redesigned classroom observation component for teacher evaluation (Reform Support Network, 2012), and a similar trend is evident in international reviews (see e.g., OECD, 2013a, 2013b; Santiago & Benavides, 2009). In the 2007 Teaching and Learning International Survey (TALIS), for example, over 70% of teachers in participant countries reported that classroom observation was a component in performance evaluation and feedback (OECD, 2009). This is not unexpected because classroom observation enjoys considerable face validity among educators and policymakers, and is seen as the key source of information supporting teacher formative evaluation and feedback (Danielson, 1996; Goe et al., 2008; Pianta, La Paro, & Hamre, 2007; see also Chait, 2010; Weisberg, Sexton, Mulhern, & Keeling, 2009), teacher self-reflection (Richards, 1991), and the diagnosis of areas of strength and those in need of improvement (Protheroe, 2002).

As with teacher evaluation more generally, however, what it means to do classroom observation for teacher evaluation can differ across systems and contexts. Below we outline an analytic framework for systems of classroom observation that involves three main dimensions, related to their basic conceptualization of instructional practice and teacher effectiveness (i.e., conceptual issues), the sources of evidence and methods used to gather information about these constructs (i.e., methodological issues), and policy context, processes, and decisions that shape the evaluation (i.e., policy issues).

2.1. Conceptual issues

The first step in designing an observation system is to decide upon the theoretical or conceptual underpinnings that will provide the basis for understanding, describing, and assessing teacher practice. Standards of effective teaching have become widespread in the past decade (e.g., Campbell, Kyriakides, Muijs, & Robinson, 2004; Danielson, 1996) and many education systems are standardizing their approaches to classroom observation to align them to such frameworks. The expectation is that clear and explicit teaching standards, and matching observation rubrics, will provide useful guidance to teachers and administrators for understanding and promoting high-quality instruction (Darling-Hammond, 2013; Ingvarson & Rowe, 2008). One example is the Framework for Teaching (Danielson, 1996, 2011) adapted in a number of school districts in the United States (e.g., New York, Los Angeles, Cincinnati) and abroad, e.g., as “Marco para la Buena Enseñanza” in Chile (Ministry of Education, 2004). Within each of these sets of standards, protocols and rubrics are typically developed targeting a narrower set of dimensions, deemed most amenable to direct observation in classrooms. These protocols may characterize teacher performance in general pedagogical aspects of instruction

(e.g. time and classroom management; differentiation; teacher-student interactions) and content- or subject-specific aspects (e.g. cognitive challenge of subjects; alignment of instruction and learning goals).

Systems can also target different groups of teachers for observation: some systems seek to monitor and mentor novice teachers, while others are meant as a general approach for assessing teacher practice, irrespective of status or seniority. They can also give different weights to the information collected through observation in informing broader judgments about teacher practice and effectiveness, effectively changing the definition of the teacher construct(s) targeted. Finally, systems can differ in terms of the extent and nature of feedback offered to teachers; for example, feedback can be more or less explicit (Hattie & Timperley, 2007), linked to concrete avenues for positive change (Kimball, 2002), or based on single or multiple sources (Seifert, Yukl, & McDonald, 2003). Notably, various actors and stakeholders may be expected to use the information collected through classroom observations. For example, administrators or observers (or both) may have a role in selecting and designing professional development activities based on diagnosed shortcomings, and inform personnel decisions related to promotion, allocation, hiring and firing (Halverson, Kelley, & Kimball, 2004; Jacob & Lefgren, 2005).

2.2. Methodological issues

A range of methodological issues and choices are involved in designing and implementing observation systems for teacher evaluation and development. These include, prominently, the identity and qualifications of the observers; the evidence available suggests that trained administrators, peers, and district personnel are capable of similarly reliable ratings (Bell et al., 2014; Ho & Kane, 2013). Second, the modes of observation and, within these, the type and extent of training provided to observers for certification have been found to be a critical component of accurate scoring (Casabianca et al., 2013; Leahy, 2012; McKay & Silva, 2015). Third, the number of observations per teacher, the approach for sampling lessons for observation, and the appropriate combination of announced and unannounced, formal and informal classroom visits also need to be considered. The preponderance of evidence in this regard suggests that a reliable assessment of teaching requires multiple observations, but beyond that general level the results are far from conclusive. Recent evidence from the influential Measures of Effective Teaching (MET) project in the U.S. suggests that at least two different observers, each observing four lessons, may be needed (Ho & Kane, 2013), while earlier work by Erlich and Shavelson (1978) suggests that as many as seven or more lessons may be required for generalizations involving a single observer. Ultimately, no broad guidance is available about the number of observers or observations; any such judgment would have to be local and take into consideration the purpose and context of the observational system, and the variety of factors that might affect the reliability of the results. However, the literature does seem to suggest that the reliability of indicators of teaching practice based on classroom observation is generally lower than that traditionally expected of large-scale standardized paper and pencil tests. Even reliability coefficients that would be considered mediocre by traditional standards can be difficult and expensive to attain, requiring extensive training, and multiple visits to the classroom by multiple observers (Martinez, 2013; Ho & Kane, 2013). The precision of measures has not been examined extensively, but what evidence exists suggests the measures can differentiate between the upper and lower ends of a performance distribution, while finer distinctions are more difficult to establish (Ho & Kane, 2013; Taylor & Tyler, 2012).

Finally, it is well established in the measurement literature that in addition to adequate reliability, developers need to ensure that sufficient evidence is available to support the validity of intended inferences about teacher practice based on information collected through observation (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014; Bell et al., 2012). As with reliability, variation can also be expected in the extent to which teacher evaluation and improvement systems will study and provide support for the validity of the inferences derived from classroom observation indicators (Taut, Santelices, & Stecher, 2012).

2.3. Policy issues

Policy-related factors are as important as methodological and conceptual considerations for influencing the design, credibility, and sustainability of a teacher evaluation and development system (Herman & Baker, 2009). Critical policy factors include the actors involved in creating and designing the system, buy-in from key stakeholders, and public narratives around the system’s motivation, credibility, and predicted consequences. The literature consistently suggests that contextual factors are particularly relevant where the political context is charged, and objective technical guidance for practitioners and policymakers is scarce (Elmore, 2008; Engel, Williams, & Feuer, 2012; Fullan, 1992; Joint Committee on Standards for Educational Evaluation, 2009; Peterson, 2000). Here we consider how three key policy factors may inform and explain the conceptual and technical features of observation systems we have examined.

The first factor relates to the stakes attached to teacher evaluation in general, and to classroom observation in particular. Systems that attach concrete consequences to the evaluation intend to exert maximum influence over teacher practice through detailed indicators and rubrics. Developers get maximum control over what gets measured and how, but need to choose and operationalize the indicators of practice of interest in advance, and to carefully consider issues of standardization, reliability, and fairness in measurement. These systems can offer incentives to teachers for artificially improving performance, raising questions about authenticity of the measures, and compromising their formative value. At the other end, low stakes systems, while less effective at driving behavior, can do better at serving formative purposes and ease concerns about resistance to the system and challenges to its legitimacy as a policy (City, Elmore, Fiarman, & Teitel, 2009). Notably, however, even within low stakes systems valid and effective feedback requires attention to the quality and fairness of the measures.

The second factor is locus of control, that is, who implements the observation on the ground. Systems can be locally defined and implemented at the school level with school leaders serving as observers and evaluators, or centrally organized and implemented at the district, regional, or national level; issues of standardization and measurement are more salient in the latter case, as observation tools gain in complexity, and the stakes tend to become higher (OECD, 2013b). Local control often goes hand in hand with a desire to strengthen school-level autonomy, but this requires a minimum of capacity in all schools. As a result, lower-performing systems with lower professional capacity tend towards centralization and standardization in evaluation (Mourshed, Chijioke, & Barber, 2010).

The third factor is the model of accountability that is dominant in each context (Abelmann & Elmore, 1999). As proposed by Levitt, Janta, and Wegrich (2008) we differentiate between self-regulating, professional, and organizational models of educational accountability. Self-regulated accountability

normally precludes teacher evaluation as a formal policy tool, since teachers are trusted to hold themselves accountable based on strong feelings of moral and ethical responsibility for the common good. Professional accountability systems see teachers as trusted professionals with peers and school level personnel playing the key role in teacher development and evaluation; while explicit standards and regulations may provide a basis for accountability, monitoring compliance with these standards is more implicit and locally shaped. Finally, organizational accountability models tend to rely on explicit, bureaucratic, centrally implemented mechanisms to monitor teacher practice and effectiveness, with a focus on ensuring compliance with standards and regulations, and the expectation that stronger control and more pressure will result in improving desired outcomes (such as student achievement). Where teacher status is high, systems may lean towards self-regulated or professional in-service evaluation, while lower professional status for teachers may be coupled with organizational mechanisms of in-service evaluation (Tatto, Schwille, Ingvarson, Rowley, & Peck, 2012).

3. Methods

Our paper examines classroom observation systems in a diverse cross-section of local, regional and national educational jurisdictions in six countries. The overarching research question is focused on understanding how different systems around the world use classroom observation to inform teacher evaluation and professional development. More specifically, we view similarities and differences across observation systems in relation to the analytic framework outlined above, which includes the conceptualizations of classroom practice underlying the observation, methodological approaches employed to collect high quality data, and contextual and policy factors that interact with the two.

3.1. Sample

We purposively selected observation systems designed in the context of teacher professional development and appraisal, or school evaluation, for which we were able to obtain sufficiently detailed information from available documents (either publicly accessible or made available to us by program staff). We did not seek to statistically represent a hypothetical target population of observation systems, but instead to demonstrate and describe the variety of uses of and approaches to classroom observation in a variety of national, regional, and local contexts around the world. We thus sought to construct a sample to maximize variation (Patton, 2002) in terms of the three key dimensions in our analytic framework, namely conceptualization (model of instructional practice and professional development; target constructs; weight and uses of observation data), methodology (identity of observers and observed, mechanisms of data collection, standardization and data quality), and policy context (system size; locus of control; stakes for participants). The resulting sample comprises 16 classroom observation systems in six countries, including two of the highest performing education systems in the world (Singapore and Fukuoka Province, Japan), the highest performing and one of the fastest improving systems in Latin America (Chile), and regional systems in Australia (Barwon South Western Region, Victoria) and Germany (Hamburg and Saxony) that add to the diversity of the sample. The study also included a variety of state and local systems in the United States, including the three largest school systems in the country (New York City, Los Angeles, Chicago), one state (Tennessee), and notable examples in smaller districts (Toledo, Cincinnati, Pittsburgh, Santa Monica), along with two high profile independent systems that reach a large number of teachers across the country (the observation component used for training and certification in Teach for America, and the National Board for Professional Teaching Standards) (see Table 1).

3.2. Data collection

Data was collected from multiple sources and through a variety of methods. First, we collected general information about the systems in our sample in publicly available reports, online documentation, and relevant publications. Second, lead personnel at each participating system provided a range of additional

Table 1
Classroom observation systems included in the sample.

System | Country (region) | Number of students* | Year of inception
Instructional Rounds | Australia (Barwon South Western Region, Victoria) | 541,990 (2015) | 2008
Sistema de Evaluación del Desempeño Profesional Docente, Docentemás | Chile | 1,120,000 (2014) | 2003
Schulinspektion Hamburg | Germany (Hamburg) | 188,000 (2015) | 2007
Externe Schulevaluation Sachsen | Germany (Saxony) | 350,000 (2014) | 2007
Performance Evaluation System | Japan (Fukuoka) | 481,092 (2013) | 2006
Enhanced Performance Management System | Singapore | 510,700 (2012) | 2001
National Board of Professional Teaching Standards | USA | 110,000+ teachers (2014) | 1987
Teaching As Leadership Performance System (Teach For America, TFA) | USA | 50,000+ teachers (2015) | 2001
Excellence in Teaching System | USA (Chicago) | 399,680 (2015) | 2013
Teacher Evaluation System | USA (Cincinnati) | 34,000 (2015) | 1999
Educator Growth and Development Cycle | USA (Los Angeles) | 655,490 (2014) | 2012
Annual Professional Performance Review | USA (NYC) | 1,100,000 (2012) | 2013
Research-based Inclusive System of Evaluation | USA (Pittsburgh) | 25,000 (n.d.) | 2010
Standards-based Teacher Evaluation System | USA (Santa Monica) | 11,400 (2011) | 2005
Tennessee Educator Acceleration Model | USA (Tennessee) | 995,890 (2015) | 2011
The Toledo Intern Plan | USA (Toledo) | 22,200 (2014) | 1981

* Note: the year the figure represents is given in parentheses.

materials and documents, including unpublished technical reports, internal research documents, policy guidelines and directives, scoring rubrics, and observation manuals, among others. Finally, we conducted structured interviews (by telephone or in person) with key personnel, or experts with direct knowledge of each system; these interviews allowed us to fill remaining gaps in the information available in the materials, and obtain additional detail about the historical and policy context for each participating system.¹

3.3. Analytic approach

International comparative educational research emphasizes the value of analyzing similar experiences across different contexts in order to derive insights that can inform educational practice and policy-making. Our study sought to understand, compare, and classify a sample of existing systems of classroom observation in educational jurisdictions around the world, along dimensions of conceptual, methodological, and policy variation. Importantly, as discussed above this is not an international comparative study in the sense of contrasting representative sets of units (in this case, observation systems) across countries. Instead, observation systems were selected based on the analytic framework outlined above, and examined with attention to how they exist within their national, regional and local contexts. Our analyses seek not to explain why certain countries (or districts) adopted the systems they have, but to outline an analytic framework of classroom observation systems based on conceptual, methodological, and contextual commonalities and differences. By doing so we intend to help bring cohesion to the research literature and help support policymakers in systematically considering key questions involved in the design of classroom observation systems.

The analytic approach involved qualitative examination and synthesis of information from multiple sources described in the previous section. In gathering data about these systems, we sought to triangulate or “cross-examine” the initial descriptions gained from publicly available documents by interviewing key district/system personnel, in some cases multiple people, and by examining internal documents made available to us. This triangulation of sources provided a more in-depth understanding of observation systems and a clearer sense of how they are implemented on the ground (as opposed to how they are described on paper). We organized the analysis around the analytic framework outlined earlier, which comprises conceptual aspects (e.g. models and dimensions of instruction), methodological issues (e.g. structure of observations, sources of evidence), and policy context (e.g. type of accountability model, stakes). These dimensions were the basis for classifying systems and highlighting areas of conceptual and methodological commonality and difference, within specific policy contexts. To examine systems along these dimensions we used within- and cross-case analyses (Miles & Huberman, 1994) to generate case summaries for each system, and devise classification schema that were then revised and refined over multiple rounds of discussion. Based on these discussions our three-person research team developed an in-depth understanding of the characteristics of each system, and ultimately reached a consensus classification decision. Also, this classification necessarily involved a qualitative data reduction exercise where complex features of systems were summarized and matched to dimensions in our framework. The analyses are thus subject to human judgment error and inconsistency, and involve a certain degree of arbitrariness in the decision to emphasize certain commonalities or differences across systems. Such contextualized qualitative summary was nevertheless necessary, given the scope and variety of the observation systems we review and synthesize in this paper.

4. Contextual information on the sample of classroom observation systems

Before presenting the results we offer a brief description of the broader teacher assessment and development systems in order to help situate and contextualize our subsequent analysis of similarities and differences among classroom observation systems.

The three largest school districts in the United States (New York, Los Angeles and Chicago) launched comprehensive efforts in recent years to redesign their teacher evaluation systems to align with guidelines of the federal Race to the Top legislation, which requires a comprehensive approach that integrates evidence of student academic growth, and other indicators of teacher practice and performance (U.S. Department of Education, 2009). Indeed, in addition to adopting some form of value added model for assessing teacher contributions to student achievement, the three districts replaced informal and largely perfunctory classroom observation, with a standardized large-scale observation system tied to models of instruction and teacher practice adapted from Danielson (Danielson, 2011).

The state of Tennessee also recently redesigned its approach to teacher evaluation in order to secure funding under the new federal guidelines. Unlike other examples, however, Tennessee’s redesign did not entail adding indicators of student achievement to a stagnant observation system. Instead, it strengthened its long-standing value-added accountability system (the first implemented in the country in the 1990s) by adding a robust standardized classroom observation component. Districts can choose from four state-approved frameworks to guide classroom observation, of which the most widely adopted is the TAP™ model (originally known as the Teacher Advancement Program) from the National Institute for Excellence in Teaching (Barnett, Wills, Hudgens, & Alexander, 2015).

Cincinnati and Toledo exemplify a long-standing model of formative teacher evaluation that relies on classroom observation, and focused feedback to teachers from peers, experts and administrators, while still including high-stakes summative evaluation for tenure and dismissal (Danielson, 2011; Koppich, 2009). Notably, while these systems do not include student achievement as a factor in high stakes decisions, research undertaken in Cincinnati schools offers empirical evidence that observation scores are associated with value added indicators of teacher effectiveness (Milanowski, 2004) and that the observation process leads to improvements in teachers’ value-added scores (Taylor & Tyler, 2012). Similarly, Pittsburgh recently redesigned its evaluation system to incorporate formative evaluation based on classroom observations in all personnel steps from teacher selection and training to advancement and promotion (without stakes for teachers). A collaborative effort brought together district leadership, administrators, and the teacher union, an exception to the contentious process common in other districts.

The system in place in the Santa Monica Unified School District exemplifies a common model in small and mid-size districts in the U.S., where detailed rubrics are available to guide standardized observation, but school principals are given ample discretion to implement them in the field, and use them to provide feedback to teachers (Ellett & Garland, 1987; Loup et al., 1996). This is also the more common approach internationally, if compared to the large-scale standardized observation systems emerging in recent years in the United States.

¹ A complete list of documents and resources consulted for each participating system is available in Appendix A.

Table 2
Summary of key conceptual aspects for selected teacher evaluation systems.

| System | Framework or Standards of Practice | Emphasis of Observation | Target of Observation | Weight in evaluation | Role in Professional Development |
|---|---|---|---|---|---|
| Chicago | Illinois Professional Teaching Standard + CPS Framework for Teaching (Synthetic Danielson: 4 Domains, 19 Components) | Classroom Environment and Instruction | Tenured and Non-tenured Teachers | 65-70% of total summative score | Mandatory feedback after observations (and post-obs. conferences). Observer designs PD plan for areas in need of improvement. Online and live mentoring aligned to CPS is available |
| Tennessee | “TAP” Model (National Institute for Excellence in Teaching). 4 Domains, 19 Indicators | Planning, Instruction, Environment | Apprentice & Professional teachers | 50% of total summative score | Observer offers feedback in writing & in-person within a week focusing on areas identified for development. Final yearly summary rating |
| Teach For America | Teaching as Leadership (TAL) Rubric. Internally developed: observations + interviews of TFA teachers with largest student achievement gains | Goals, Planning, Execution, Working relentlessly | Teachers in first two years of career | 0%: Summative; 100%: Formative evaluation/PD | Information from observations, and pre- and post-conferences helps determine the kind and level of support needed |
| Toledo | Toledo Plan, internally developed: joint union/management committee | Procedures, CR management, subject knowledge, professionalism | Teachers in 1st-2nd year or identified for intervention | 100%: Evaluation is based entirely on observation | Each CT is responsible for mentoring and evaluating a small number of teachers. Mentoring relies on observation, but can vary at discretion of CT |
| Chile | Adapted for local use from the Danielson FFT (1996): 4 Domains, 20 criteria | Learning environment, Lesson structure, Pedagogical interaction | Public school teachers starting in 3rd year of service | 43% of portfolio score (26% of overall score) | Automated report describes performance level for each dimension and indicator. Municipalities determine PD activities based on evaluation results |
| Singapore | Singapore Teaching Competency Model. Developed internally w/support from US-based HR firm (13 Competencies, in 5 Clusters) | Nurturing the whole child | All teachers | Not specified: One piece in a holistic evaluation | Prior results guide observation. Verbal feedback after each observation, mid- and end-of year review. Observer & teachers discuss PD/mentoring needs, and activities |
| Fukuoka | Developed by Board of Education. 3 Domains, 18 criteria | Counseling skills; Instructional skills; Others | All teachers | Not Specified: One piece in a holistic evaluation | Holistic, at discretion of teacher and principal |
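
The weights reported in Table 2 can be illustrated with a short calculation. The following is a minimal sketch, assuming a simple weighted average; the systems' actual scoring rules are not reproduced here. The function names, the 1 to 4 rating scale, and the example ratings are hypothetical, and only the percentages are taken from Table 2. The second function shows the arithmetic behind the Chilean entry: if the same component accounts for 43% of the portfolio score and 26% of the overall score, the portfolio as a whole would carry roughly 60% of the overall score.

```python
def composite_score(scores, weights):
    """Weighted average of component scores (hypothetical scoring rule).

    `scores` and `weights` are dicts keyed by component name;
    the weights are expected to sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in weights)


def implied_instrument_weight(share_of_overall, share_of_instrument):
    """Weight of a whole instrument implied by one of its components.

    If a component is `share_of_instrument` of an instrument and
    `share_of_overall` of the total score, the instrument's weight
    in the total is the ratio of the two shares.
    """
    return share_of_overall / share_of_instrument


# Chile (Table 2): 43% of the portfolio score, 26% of the overall score.
chile_portfolio_weight = implied_instrument_weight(0.26, 0.43)

# Hypothetical teacher under a Chicago-style weighting, with observation
# at 70% of the summative score and all other components at 30%.
# The ratings (1 to 4 scale) are invented for illustration.
example_total = composite_score(
    {"observation": 3.0, "other_components": 2.0},
    {"observation": 0.70, "other_components": 0.30},
)

print(f"Implied portfolio weight in Chile: {chile_portfolio_weight:.2f}")
print(f"Example composite score: {example_total:.2f}")
```

Holistic systems such as Singapore or Fukuoka, where Table 2 reports no explicit weight, cannot be represented this way; the sketch applies only where components are combined by explicit rules.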

The sample also includes two non-traditional systems with far reach across districts and states in the United States. The National Board for Professional Teaching Standards (NBPTS) offers subject- and grade-specific certificates for teachers through a portfolio-based evaluation model (Hakel, Koenig, & Elliott, 2008), intended to identify and recognize the best teachers in the country, and to inform the broader discussion around the characteristics of effective teachers. On the other hand, Teach for America is an independent organization seeking to improve school districts’ access to qualified teachers, by training and deploying a corps of high-achieving young professionals to teach in schools that serve high proportions of disadvantaged students.

The Chilean national teacher evaluation system has received attention as an exemplar at regional level (Santiago, Benavides, Danielson, Goe & Nusche, 2013; Bruns & Luque, 2014). The system collects teacher portfolios that include written evidence of work and a video of one lesson, along with supervisor questionnaires, peer interviews, and a self-evaluation (Taut & Sun, 2014; Manzi, Gonzalez, & Sun, 2011). Teachers undergo evaluation every four years and may be eligible for significant monetary incentives, or subject to mandated professional development, and even removal if underperformance persists. Formative feedback to teachers relies to an important extent on a professionally videotaped lesson that is contrasted to a model of good instruction (Marco para la Buena Enseñanza) adapted from the Danielson Framework (Danielson, 1996). Australia is in the process of developing national teaching standards (Australian Institute for Teaching and School Leadership (AITSL), 2012) and a comprehensive teacher development and appraisal system. Here, we studied teacher evaluation in one district in the territory of Victoria, where instructional rounds (City et al., 2009) are used to guide both school improvement and teacher development. Wide leeway is given to each local district and school to adapt the approach to their local needs. Novice teachers often are assigned a mentor, while a team of school leaders and peers conducts periodic classroom observations to support teacher improvement efforts in a direct and contextualized way.

Singapore’s education system is notable for the high level of achievement of its students in international assessments, and its teacher evaluation approach for its low stakes and dual focus on excellence and professional growth (Sclafani & Lim, 2008). Teacher evaluation is largely based on multiple classroom observations throughout the year by peers, experts, and administrators, using the local Enhanced Performance Management System (EPMS) as a model of reference for classroom practice and teacher competence. Similarly, teacher evaluation in Fukuoka, Japan emphasizes helping beginning teachers develop the skills and practices of senior faculty.2 Principals are trained in classroom observation by the Board of Education and given wide leeway to determine the frequency, focus, and use of observation to inform instructional assessment and improvement. Principals typically involve senior teachers in the school in the process of observing and providing feedback to novice teachers.

2 Our discussion refers specifically to the Fukuoka prefecture but this system is largely representative of how teacher evaluation is carried out in other locales in Japan (Ministry of Education et al., 2016).

In Germany, teacher candidates are observed frequently and offered detailed formative feedback by their mentors during initial

Table 3
Frameworks for teacher practice in selected international teacher evaluation systems.

Fukuoka (Japan)
  Student development & Counseling
    Consider pupils’ languages, values, and human rights
    Build trust and develop pupils; Address student incidents
    Consider health and safety
    Facilitate learning environment
    Provide appropriate counseling
    Support co-curricular activities
  Instructional skills
    Develop annual plan based on standardized curriculum
    Facilitate an organized class
    Set appropriate objectives
    Facilitate learning creatively
    Understand learning conditions
    Provide supports to students
    Conduct evaluations to encourage learning
    Have and improve content knowledge and skills
  Others
    Cooperate and collaborate to achieve school missions
    Build relationships with community and parents
    Practice risk management
    Discharge duties as public official

Singapore
  Nurturing the Whole Child
    Share values with student
    Take action to develop student
    Act in student’s interest
  Winning Hearts and Minds
    Understand the environment
    Develop others
  Knowing Self and Others
    Emotional intelligence
  Working with Others
    Partner with parents
    Work in teams
  Cultivating Knowledge
    Subject mastery
    Analytical thinking
    Initiative
    Teach creatively

Teach For America
  Set Big Goals
    Ambitious, feasible, aligned goals
  Invest Students and Influencers to Work for the Big Goal
    Instill “I can” and “I will” message
    Reinforce effort and mastery
    Model persistence and success
    Create a welcoming environment
  Plan Purposefully
    Assess progress toward big goal
    Design long-term and unit plans
    Develop aligned lesson plans
    Differentiate
    Develop rules and procedures
  Execute Effectively
    Clearly present academic content
    Monitor student understanding
    Rules, consequences
    Time-saving procedures
  Continuously Increase Effectiveness
    Identify progress for subgroups
    Identify key actions (stdnt/tchr)
    Adjust to solve problems
    Build on strengths
  Work Relentlessly
    Persist in the face of challenges
    Solve time/resource constraints

Danielson (USA)
  Planning and Preparation
    Knowledge (content + pedagogy)
    Knowledge of students
    Select instructional goals
    Knowledge of resources
    Design coherent instruction
    Assess student learning
  Classroom Environment
    Create environment of respect
    Establish a culture for learning
    Manage classroom procedures
    Manage student behavior
    Organize physical space
  Instruction
    Communicate clearly
    Use questioning strategies
    Engage students in learning
    Provide feedback to students
    Be flexible and responsive
  Professional Responsibilities
    Reflect on teaching
    Maintain accurate records
    Communicate with families
    Contribute to school/district
    Grow/develop professionally
    Show professionalism

training (Referendariat). Moreover, certification and tenure require that teachers be observed and assessed teaching a lesson by a board of examiners. Otherwise observation is given a marginal role in career promotion, but mostly as a pro-forma exercise with little feedback for teachers or tie to models of instruction or professional practice. On the other hand, classroom observation is used in the context of school evaluation or inspection (following the traditions in the Netherlands and England). This approach relies on highly knowledgeable inspectors or trained teacher evaluators, and state frameworks for classroom and school quality. It considers classroom observation results in the aggregate by school, with no feedback provided to individual teachers (Müller, Pietsch, & Bos, 2011). We discuss examples in two German states (Hamburg and Saxony).

5. Results

We analyzed the characteristics of the 16 systems of classroom observation in relation to the three components of our theoretical framework: conceptual aspects, methodological issues, and policy context. For clarity of exposition, the tables in the Results section only contain a subset of seven systems (identified in bold in Table 1), chosen to eliminate redundancy and illustrate key types of systems and dimensions of variation. The subset includes national systems in Chile and Singapore, state systems in Tennessee (United States) and Fukuoka (Japan), and local systems in Chicago Public Schools and Toledo (both in the United States), along with an independent U.S. based system (Teach for America). Information for the full sample of 16 systems is available in Appendix B.

5.1. Conceptual aspects

In order to investigate how different systems conceptualize classroom observation, we prioritized aspects of our framework that are a particular focus in current policy reform efforts. These include the model or framework of instructional practice; the teachers targeted for observation; the weight assigned to observation in the evaluation; and the role of observation in informing feedback and professional development. Table 2 summarizes this information for the illustrative subset of seven systems.

5.1.1. Models and dimensions

What observers look for in the classroom varies across systems in our sample, from a few broad dimensions of instruction, to detailed outlines including dozens of specific features of teacher practice. For example, a number of districts in the U.S. (including the three largest in New York, Los Angeles and Chicago) and internationally (e.g. Chile) have adopted the Danielson Framework for Teaching (Danielson, 1996), which codifies the work of teachers in great detail in terms of dozens of specific elements of classroom practice. While use of the Danielson framework is becoming widespread, a great variety of frameworks are in use in other U.S. districts and internationally. Table 3 shows the dimensions of four models of teaching practice used around the world. Not only do the numbers of domains and dimensions vary, but definitions of good teaching practice are also to some extent local and culture-specific. As we might expect, there are areas of considerable overlap across systems and countries; for example all systems consider teachers’ content knowledge, and their ability to plan and set teaching and learning goals. Similarly, every model includes in some form teachers’ consideration of student characteristics, and the use of appropriate assessment practices in the classroom as a key dimension of high quality instructional practice (Brookhart, 2011).

At the same time, there are differences in the focus and breadth of areas targeted for observation across systems. Classroom observation in Singapore and Japan leverages information obtained through frequent observation of classroom practice throughout the year, to infer other underlying competencies and attitudes (e.g. nurturing the whole child, winning hearts and minds;

pupils’ values and human rights, pupil trust) which tend to receive less attention in American frameworks, where technical and procedural aspects of instruction play a larger role (e.g. questioning techniques, classroom management, identify progress for subgroups). In Singapore, supervisors monitor each teacher’s progress on these competency goals, informally observing and conferring with teachers frequently throughout the year (and providing feedback and guidance when needed). The emphasis and attention paid to areas of student growth and interests also differs: Singapore broadly directs teachers to share values with students and act in their best interest (e.g. recognizing individual potential and developing self-confidence); Danielson references knowledge of students as part of planning for quality instruction and later to use questioning strategies effectively; a third of the Fukuoka framework centers on student development and counseling, in relation to values, trust, language, health, and safety; finally, the Teach for America model is based on the Teaching as Leadership framework, which perhaps more than the others emphasizes concrete behavioral aspects of classroom practice (reinforce rules and consequences, identify progress for subgroups), as well as attitudinal and motivational targets like personal effort and perseverance.

One important note of caution is in order concerning the grain size of the analysis of dimensions. Because we did not have access to detailed observation protocols for all systems, we analyzed the general definitions and descriptions of the dimensions included in the frameworks. However, we did not examine in detail the specific classroom behaviors and processes that may be comprised in the operationalization of these dimensions. Thus, even dimensions that seem to be similar could differ in terms of the specific classroom processes and interactions included or emphasized in their operationalization, that is, the specific features of instruction that an observer would attend to.

5.1.2. Target of observation

The systems we examined vary considerably in terms of the specific teachers they observe in the classroom. Systems like Toledo, or those of German states, focus primarily on novice or in-training teachers. After this initial intensive period, classroom observation for tenured and experienced teachers in these systems becomes infrequent and somewhat routinized. Singapore emphasizes summative aspects of evaluation for teachers in training and hiring decisions; once hired, teachers continue to be observed and evaluated, but punitive consequences are rare since the focus shifts toward professional growth and shaping teachers’ career paths. In these systems in general, veteran teachers are observed less frequently, feedback is less systematic, and the consequences of the evaluation are less critical. Finally, the Teach for America system in the United States focuses resources entirely on teachers in their first two years, after which they are considered alumni of the program and no longer receive formalized support or mentoring from the organization.

Santa Monica exemplifies a decentralized, semi-standardized system common to smaller districts in the United States, where principals exert a great deal of discretion in selecting the teachers that will be the target of observation, and the aspects of practice that will be the focus of observation. Thus, one principal may decide to observe all teachers focusing on one dimension of practice across the school, while another may decide to observe only novice or struggling teachers, and tailor observations to focus on specific dimensions of interest for individual teachers. Their choices exemplify an effort to balance depth, detail, and comparability that is present in one form or another across systems.

The new generation of systems in development in large school districts in the United States expands standardized observation to all teachers throughout the career cycle (e.g. New York City and Los Angeles explicitly aim to include all teachers in classroom observation). While some allowances are made for less frequent observation of teachers exhibiting adequate performance, even veteran teachers are observed in regular cycles, and at key points of career advancement. The Tennessee or Chicago systems, for example, include all teachers, but prescribe different schedules of observations at different career stages (apprentice vs. professional; tenured vs. non-tenured).

5.1.3. Weights and roles of observation

In most of the evaluation systems we reviewed, classroom observation is the most important component, both in terms of summative weight and value in informing professional development. Where information from various evaluation components is combined into an overall indicator of effectiveness, observation typically receives the largest weight, ranging from 40% to as much as 100% of the overall score. In Chicago, for example, 65% to 70% of the total summative score assigned to teachers is based on the observation ratings. In Los Angeles, New York, and Tennessee the weight ranges from 50% to 60%. In other cases, the information collected through observations is also weighted highly but combined in a qualitative or holistic process, without explicit mathematical rules or algorithms. This is the case in Santa Monica where the observation is the main source of evidence in a holistic evaluation. Similarly, in Singapore evidence of performance is collected through intensive observation and is considered holistically for evaluating (mostly early career) teachers. Chile presents an intermediate case where the information from observation is incorporated through explicit quantitative guidelines and makes up almost half of the portfolio, but its weight is minor when combining scores on four different instruments to determine teachers’ final performance category.

Lastly, while all systems share a discourse of formative evaluation focused on improving teaching, important differences emerge in how observations are structured and used specifically to inform teacher professional development. Teachers in Chile receive a written, automated report that describes their performance on the seven portfolio dimensions and respective indicators (based on the written evidence as well as the recorded video). Chilean school districts use these results to assign federal funds to professional development plans for low-performing teachers. In Japan principals are free to use the information collected through standardized observation rubrics holistically to determine professional development needs and plans for each individual teacher. Finally, in Germany the link operates at the level of the school, not the individual teacher; inspectors may give feedback to the school community emphasizing issues observed during classroom observation, which the principal can then further discuss with teachers in order to delineate school professional development plans.

Most of the new large-scale observation systems in the U.S. require that trained observers (and/or school administrators) meet with teachers, both before the observation to discuss focus areas, and after observation. Post-observation meetings are used to debrief teachers on the results of the observation, discuss them in more detail, and delineate plans for moving the teacher forward, including appropriate professional development. This is the case for example in Chicago, Tennessee, and Los Angeles. On the other hand, traditional peer assistance models like Toledo rely on a trained Consulting Teacher (or teachers) for mentoring and evaluating small cohorts of mainly novice teachers using information they collect through classroom observations.

5.2. Methodological aspects

As with conceptual features, classroom observation systems can also differ in important ways in terms of how the observation is

carried out in practice. For our analysis we focus on three key aspects of methodological variation: the frequency, number, and format of observations; the background and qualifications of those observing; and the procedures followed to examine the reliability of the data collected and the validity of the inferences drawn. Table 4 presents this information for the same sub-set of seven systems presented previously.

5.2.1. Frequency and number of observations

As discussed above, most of the systems in our sample establish different cycles of observation for teachers in different career stages. Generally speaking novice teachers are observed more frequently than tenured and experienced teachers; this includes both the cycle of observation and how many times they are observed in an observation cycle. Some systems aim to observe all teachers every year; in our sample these include New York and Tennessee in the U.S., and Fukuoka and Singapore internationally, although in these latter cases principals have discretion for deciding who is observed and how often. Among the remaining systems, yearly observation of beginning teachers is the norm (Los Angeles, Chicago, Santa Monica, Toledo). More experienced teachers are observed in cycles of two years (Chicago), three years (Los Angeles), or four years (Chile). The Chilean system is interesting in that the cycle of evaluation (and thus, observation) is not tied to career stage but to past performance: teachers rated as unsatisfactory are evaluated the following year, and those rated as basic after two years; everyone else is evaluated in four-year cycles. Examples where teachers are not observed in regular intervals beyond the early career stages seem to be less frequent; in our sample these were limited to Toledo in the U.S., and Germany, where observation of individual teachers is used mostly to assess and improve instruction among teachers in training, and to inform tenure and promotion decisions.

The number of observations carried out within these observation cycles varies across systems. We found the lowest number (one observation per cycle) in Chile and Santa Monica; other systems require two observations per cycle (Los Angeles), four (Pittsburgh, Tennessee), or more (New York City). Toledo on the other hand focuses on the volume of evidence from observation, not frequency; the guidelines call for teachers to be observed for a total of 20 h of instruction throughout the year. As before, the number of observations can also vary within systems depending on the stage of career advancement or prior performance: Chicago for example requires two observations for tenured teachers, and three for those without tenure or identified for improvement, and New York requires two and six observations, respectively, for the same groups; similar differential criteria are set in Los Angeles, Tennessee, and Pittsburgh.

Another dimension of variation is whether classroom visits are arranged with or otherwise known in advance to teachers (Los Angeles, Chile), or involve a combination of announced and unannounced visits (Tennessee, New York, Singapore). None of the systems in our sample uses only unannounced visits for formal assessment, but a number use informal unannounced observations to complement the formal planned observations and enrich formative feedback and professional development (Chicago, Cincinnati). A distinction is needed here between visits that are announced and those that are negotiated (i.e. the teacher and the principal or another observer jointly determine which lessons will be observed) or self-selected (i.e. the teacher determines which lessons will be observed). A similar distinction is needed between unannounced observation (i.e. the teacher does not know which lessons will be observed) and informal observation. In the latter

Table 4
Summary of key methodological aspects for select teacher evaluation systems.

| System | Who is observed? How often? | Who observes? Internal v external | How? Announced (A)/Unann. (U) | What training/certification do observers receive? | Research on quality of the observation data |
|---|---|---|---|---|---|
| Chicago | Tenured: 1–2 every other year; Non-tenured: 4/yr | 1 or 2: Principal + Calibrators in a small % of CR | A (1 informal observation may be U). | 50 hours. 1) Proficiency test on framework + Agreement w/master video rating 2) Ability to communicate and develop PD plan | Internal: agreement w/master raters; dimensionality; VAM; qualitative; fairness reviews when observation and VAM differ |
| Tennessee | Apprentices: 6/yr; Professionals: 4/yr | Always 2, Principal + AP; external teams in 20% of districts | 2A + 2U | 32+ hours. Certification: agreement with master video rating | Developer studies: agreement w/master raters; correlation w/VAM |
| Teach For America | 1st and 2nd year teachers, as often as determined by observer | External Manager of Teacher Leadership Development | A | Varies by region. Rubric norming on teacher video & docs; book clubs, webinars, mini-retreats | Internal: Revision of TAL Framework in process |
| Toledo | 1st and 2nd year teachers and teachers in intervention | Y1/Intervention: External Teacher consultants; Y2: Principal | discretion of observer | Ad-hoc training by veterans; CTs share office; new and veteran CT observe together and compare notes | N/A; Qualitative |
| Chile | All public school teachers, freq. depends on prior evaluation | Externally selected & trained teacher raters | A (Video) | 24 hours, 3-day trial scoring period (scoring supervisors are trained 64 hours) | Generalizability studies, factor analyses, in-depth group comparison studies, correlation with VAM, consequential validity |
| Singapore | All teachers | 2+: Internal. AP, Dept Head, Senior Teachers, Principal | 1 A & 2+ U | 100+ hrs/yr (PD to improve teaching, incl. doing observation) | Internal |
| Fukuoka | All teachers | Internal: Principal and possibly vice principal(s) | discretion of principal | Board of Education makes manual and demonstrates to principals | Internal |
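
Table 4 lists agreement with master raters among the checks used to certify observers and to study data quality. As a minimal sketch of what such a check can look like, the Python below computes exact and adjacent (within one scale point) agreement between an observer and a master rater on the same set of lessons. The function name, the 1 to 4 rating scale, the one-point tolerance, and the ratings themselves are all hypothetical; none of the systems' actual certification criteria are reproduced here.

```python
def agreement_rates(observer, master, tolerance=1):
    """Return (exact, adjacent) agreement between two lists of ratings.

    exact:    share of lessons where both raters gave the same score.
    adjacent: share of lessons where the scores differ by at most
              `tolerance` scale points (exact matches included).
    """
    if len(observer) != len(master) or not observer:
        raise ValueError("need two non-empty rating lists of equal length")
    pairs = list(zip(observer, master))
    exact = sum(o == m for o, m in pairs) / len(pairs)
    adjacent = sum(abs(o - m) <= tolerance for o, m in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical ratings of six video-recorded lessons on a 1-4 scale.
observer_ratings = [3, 2, 4, 3, 1, 2]
master_ratings = [3, 3, 4, 2, 3, 2]

exact, adjacent = agreement_rates(observer_ratings, master_ratings)
print(f"Exact agreement: {exact:.2f}")
print(f"Adjacent agreement: {adjacent:.2f}")
```

Simple agreement rates do not correct for chance agreement or separate rater variance from other sources of error, which is one reason techniques such as Generalizability Theory are used in the larger systems.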

case, the teacher may or may not know which lessons will be observed, but the information collected is either a) not part of the formal evaluation and used only formatively for feedback, or b) used by the principal or observer to inform their judgment outside of the observation scoring rubric.

Little variation is apparent in terms of the format of observation, since in most systems we find direct observation in classrooms, as is currently the case in most districts in the United States. Only two systems in our sample (the Chilean national teacher evaluation system, and NBPTS for teacher certification) use video-recordings of classroom teaching. Interestingly, videotaping is garnering increasing attention as an alternative to live classroom observations (Gates, 2013), as districts start to bear the full cost of large-scale classroom observation. Video can be important for supporting high stakes inferences and decisions, and could also offer cost savings by allowing remote asynchronous review, especially if the cost of videographers can be reduced.

5.2.2. Observer background and qualifications

Most systems in our sample involve school administrators as observers and sources of feedback for teachers, either on their own (Santa Monica, Pittsburgh), or jointly with other administrators (Tennessee, New York, Japan, Germany). A handful of systems, like Chile, Chicago, and Toledo, rely entirely on external raters, typically trained experienced teachers. In Australia there are two observation tracks: principals are tasked with observation for performance appraisal and professional development, whereas external observers are used for certification of accomplished teaching.

Rater training is critical in large-scale observation for teacher evaluation and development. In all systems in our sample observers undergo training before they can conduct official observation ratings. Training ranges from 20 to 100+ hours, and may take the form of formal calibration or norming sessions (Los Angeles, TFA, Chile), train the trainer approaches (Pittsburgh), or shadowing experienced raters during observation (Toledo). Most systems establish explicit criteria for certifying external observers; on the other hand, systems that rely on principals or inspectors as observers often lack any explicit certification requirements (e.g.

5.2.3. Reliability and validity

As could be expected from their diverse goals, designs, and contexts, the systems in our sample also vary in the extent and kinds of empirical evidence regarding the quality of information they make available. Of 16 systems in our sample, we found that only about half had published or otherwise made available reports explicitly examining the reliability of the indicators generated through classroom observation, and only four presented empirical evidence supporting the validity of the resulting inferences about teachers and teaching practice.1 The extent and nature of reliability and validity research varies depending on system size and resources, as well as the stakes for teachers: a small number of established systems in our sample have implemented standardized classroom observation systems on a large scale, with serious stakes for participant teachers, presenting the strongest incentives and needs for providing empirical evidence supporting the validity of their inferences (Taut et al., 2010). Not surprisingly, these systems (e.g., Chile, Chicago, NBPTS) offer the most extensive empirical evidence of reliability and validity for their indicators (Taut et al., 2012). These systems also tend to have longer histories and access to technical expertise to use modern psychometric techniques like Generalizability Theory or Item Response Theory to investigate the properties of indicators and resulting inferences, and a comprehensive understanding of validation (Ho & Kane, 2013; Kane, 2006).

In the smaller school districts included in our sample, on the other hand, the stakes of teacher evaluation tended to be lower and the need for standardization, as well as reliability and validity evidence, was perceived as less acute. Thus, systems like Santa Monica, Toledo, or Fukuoka offer little or no empirical evidence of the reliability of observation indicators. Evidence supporting the validity of the intended inferences is even more limited. These smaller systems provide more space for the professional judgment and expertise of school leaders to play an evaluative role, so that empirical research on the reliability and validity of quantitative standardized measures seems less applicable. Finally, a number of systems that were recently created or re-designed (e.g. Los Angeles, New York, Pittsburgh, Tennessee) are in the process of conducting studies to collect evidence of reliability and validity,
which will be critical for supporting inferences as these systems
Santa Monica, Saxony, Japan). Certification typically requires a
become operational.
minimum level of agreement with master ratings during training
(Los Angeles, Chile), and may also involve tests of content or
5.3. Policy aspects
pedagogical knowledge (Chicago), or open ended assessments
measuring the ability to design appropriate professional develop-
The final dimensions of variation we examined for our sample
ment plans (New York, Singapore).
of systems centers on their goals and policy context. As outlined in

Table 5
Summary of key policy aspects for select teacher evaluation systems.

System | Accountability model | Locus of control | Stakes for participant teachers
Chicago | Organizational | Central District | 1) Tenure decisions. 2) Formative Evaluation. Mandatory Mentoring PD focused on identified framework components if unsatisfactory summative rating. Hearing and potential termination for those who fail to improve.
Tennessee | Organizational | State and District level | Used to inform human capital decisions, including mandatory individual and group professional development, hiring and promotion decisions, tenure and dismissal, recognition and compensation.
Teach For America | Organizational | Regional level; Discretion to Evaluators | Observations are used to support teachers and identify the practices of top performers but performance is defined exclusively by student gains.
Toledo | Professional | Discretion to Evaluators | Unsatisfactory teachers can be recommended for non-renewal after 2 semesters without improvement.
Chile | Organizational | National level | Basic/unsatisfactory performance: mandatory professional development. Teachers may be removed with repeat Basic/Unsatisfactory performance, or receive salary bonus if competent/outstanding AND successful on subject/pedagogical knowledge test.
Singapore | Professional | School level | Identify candidates for leadership and research roles. Determine annual bonuses. Dismissal in rare instances.
Fukuoka | Professional | School level | Principal tells teachers the results of the observations, advises each teacher how to improve his/her teaching based on the observation. Generally speaking, observation results not used to fire teachers or give bonuses.

our framework, we focus on three relevant aspects of the policy context, namely the stakes associated with the evaluation for teachers; the locus of control for the observation or the degree of local discretion; and the type of accountability model underlying the approach to observation and teacher evaluation. Table 5 condenses this information for the same illustrative subset of systems presented throughout the paper.

5.3.1. Stakes of evaluation for teachers

The stakes attached to classroom observation for participant teachers differ across systems in our sample. Importantly, the frequency and stakes of the observation can differ for novice and tenured teachers within the same system. Novice teachers are often observed more frequently (e.g. Chicago, Tennessee) and in most systems face higher stakes in the form of a tenure decision in the first few years of their teaching careers (Toledo, New York, Santa Monica). In addition, the stakes of the overall evaluation are not necessarily always the same as the stakes of the classroom observation. In the Teach for America system, for example, observation carries no summative weight in evaluation and is used only for formative purposes (i.e. to improve teaching in ways conducive to improving the key outcome of interest: student achievement).

For experienced teachers there are non-punitive consequences in Singapore, Germany, and Santa Monica, where the information collected through observation is mostly used for formative (or pro forma) purposes. This contrasts with high-stakes systems like Tennessee or Chicago, where the results of observation carry 50% to 70% of the weight in an overall summative evaluation score used to determine personnel decisions, in addition to being used for formative purposes. In fact, most systems combine formative and summative purposes but they differ in the extent to which the stakes attached are punitive and/or rewarding (e.g., dismissal, salary bonuses). The Singapore system offers an interesting case: while punitive stakes are low, the results of the evaluation can have important positive consequences for teachers. Specifically, high performing teachers in Singapore can be placed in a track to become school leaders, and even education researchers (with assignment to the National Institute of Education).

5.3.2. Locus of control for the observation (degree of local discretion)

Related to this point, systems differ in how much control over the observation process rests at school level and how much control stays with a central authority that determines the methodological details. For example, the fact that principals act as evaluators means very different things in Japan and New York City: in Japan, the principals decide much about how to carry out the observations, while in New York City they have to complete an online training and pass a certification test to assure calibration. In Chile, the Ministry of Education sub-contracts implementation to a specialized university center, but districts are involved at various steps during the evaluation process. A local evaluation commission has the legal right to ratify or modify the final evaluation results of their teachers, taking into account contextual information that could otherwise not be considered in this national standardized assessment system. In Victoria, Australia, each region or even each school within each region defines how it will use the so-called “instructional rounds” approach to classroom observation for their benefit and improvement, so here the control over the purposes, methods and uses of observation rests entirely at the local or school level.

5.3.3. Accountability model

The observation systems sampled differ in terms of the accountability models in which they are immersed. Singapore and Toledo, for example, exist within professional models of accountability that are less standardized, with generally lower (punitive) stakes attached, and give flexibility to the local professional community. On the other hand, observation systems being developed in large school districts around the United States in recent years seem to reflect an organizational model of accountability, which raises the stakes for teachers and aims to standardize the process as much as possible in order to reduce discretion (subjectivity) from the evaluators. Not surprisingly, these systems show most emphasis on establishing the reliability of the observation processes and resulting measures. Finally, none of the systems we studied relied on a strictly self-regulated model of accountability (systems with this model of accountability would likely not have any systematic classroom observation system); while the systems often emphasize the value of teacher self-reflection as a mechanism for improving teaching, in all cases professional or organizational accountability, or a combination, was in place or being developed.

6. Designing classroom observation systems for teacher development and evaluation

Our project aimed to characterize the diversity of approaches to classroom observation implemented in education systems around the world for the purpose of teacher professional development and evaluation. We assembled a purposive sample of 16 observation systems that comprises two of the highest performing education systems in the world (Singapore and Japan), the highest performing and fastest improving system in Latin America (Chile), and regional systems in Australia and Germany that add to the diversity of the sample. The study also included a variety of state and local systems in the United States, including the three largest school systems in the country (New York City, Los Angeles, Chicago), and notable examples in smaller districts (Toledo, Cincinnati, Pittsburgh), along with two high profile independent systems that observe large numbers of teachers across the country (TFA, NBPTS).

While the sample is diverse, as the basis for our study it also poses a number of challenges and limitations. First, there is over-reliance on U.S. systems, partly because a wider range of distinctive approaches has been used in that country, but also because of reasons of access, opportunity, and availability of information. Overall, the sample includes systems in six countries, but undoubtedly leaves out others that would yield valuable insights if examined. For instance, for maximum variation it might have been advisable to include an example like Norway, where there are neither national nor regional formalized systems of classroom observation for teacher appraisal, development or school improvement, while such efforts do exist to varying extent at school (and, rarely, municipal) levels (Nusche et al., 2011). We recognized early on that inclusiveness and representativeness would be elusive goals and instead focused on procuring a sample of systems for which sufficient information could be collected, and that would represent a broad enough slice of variation along our key dimensions of analysis. We believe the range and nature of the sample of systems we examined can spearhead a broader conversation around the conceptualization, design, and use of classroom observation systems for improving teaching practice around the world, but it undoubtedly can be further extended. In this context it is also important to note that some of the countries seen as exemplars in terms of the quality of teachers and teaching have not traditionally used systematic classroom observation as a tool for professional development.

Along with maximum variation, accessibility of information was another key concern in the study. This included having to overcome language barriers (the research team covers English, Spanish, German, and French), as well as establishing rapport with

key informants. These considerations determined in part which systems were included in the study, and thus constitute another limitation of the study. Still, available information varies across systems, with national, long-standing and higher stakes systems offering more complete and publicly available information, while for smaller, local and newly developed systems we had to rely more on interview data and documentation that was not always publicly available.

A third limitation has to do with the framework we used for the analysis. The emerging framework we have offered in the previous sections is inherently theory- and sample-dependent and does not purport to be exhaustive or definitive. In fact, the framework we outline involves an important component of informed judgment on the part of the researchers, based on dimensions that are conceptually derived, empirically refined, and far from orthogonal (a number of areas of overlap or common influences were discussed in the paper). It might seem difficult to extract specific conclusions from comparisons across abstract dimensions spanning vastly different contexts. However, in practice international comparative study of a diversity of approaches to classroom observation is important for informing our views about the advantages and limitations of this long-standing research technique for capturing instructional practice, and can be useful for considering the possibilities observation affords for evaluative purposes in specific policy contexts.

In this section we discuss the implications of our emerging findings for informing the design of classroom observation systems for teacher evaluation and development. In particular, we focus on four issues, related to the choice of framework or model of instruction; the degree of standardization and locus of control; the reliability and validity of observation measures; and the appropriate uses of the information collected through observation.

6.1. Model of instruction

A wide variety of frameworks or standards are used in districts and states in the United States and around the world to characterize high quality instruction. One conclusion that can be drawn from our analysis is that the definition of quality teaching is dependent on local context and culture, and so is the process by which frameworks are developed and examined. However, there are some core aspects of quality teaching that can be found in all systems, for example some form of student orientation; instructional or pedagogical skills; and professional duties. Despite these commonalities, the complexity of the underlying construct (the quality of teaching) leaves room for different emphases and operationalizations of these shared dimensions. The U.S. systems we examined tend to have a behavioral focus and emphasize narrowly defined technical aspects of general or content-specific instructional practice, supported (to greatly varying extents) by empirical research linking individual practices to student outcomes. By contrast, the framework in Singapore comprises what teachers are as much as what they do in classrooms, and places evaluative focus on more broadly defined dimensions (e.g. creative teaching, or teaching that facilitates creative learning); these dimensions were developed inductively by examining characteristics and practices of teachers identified as exceptional by peers and supervisors. The TFA framework balances emphasis on classroom behaviors with consideration of psychological traits; interestingly, it was developed inductively but through in-depth study of teachers who made the largest gains in student achievement (Farr, 2010).

A key question facing developers and policymakers is whether to develop a framework specific to the needs of the district or state, or adopt (or adapt) an existing framework. Most frameworks capture generalizable elements of quality instruction applicable to any subject, but there are others, thus far mostly used for research purposes, focusing on specific features of instruction in particular subjects (e.g. the Mathematical Quality of Instruction, MQI, observation instrument; Hill et al., 2008). In the United States, commercially available models of instructional practice are receiving widespread attention and are increasingly becoming synonymous with the broader notion of a teaching framework (e.g. Danielson). While these models can offer solid design, ease of use, and comparability, locally developed or adapted frameworks may offer an alternative that considers the local context and facilitates involvement and buy-in from key local stakeholders.

If the decision is made to develop a framework locally, the next set of questions involves the ideal combination of teacher competencies, psychological traits, and observable classroom behaviors to include. Developing the framework itself can then involve a combination of inductive processes, whereby high performing teachers are first identified and used as the basis for defining the dimensions of the framework, and theory- or empirically-based approaches that rely on examination of the literature or research studies relating teacher practices to desired outcomes. However, developers should not underestimate the number of empirical steps necessary for moving from a nascent conceptual framework to a set of protocols and rubrics that can be used for standardized classroom observation on a large scale. As demonstrated by recent high-profile efforts, the development and validation of such frameworks is a process that takes multiple years and substantial human and financial resources (see Hill & Grossman, 2013; Kane & Staiger, 2012).

6.2. Standardization and locus of control

Districts, states, and countries need to decide whether teacher evaluation and development systems will be designed in a standardized way based on analytical assessment, or as holistic examination of instructional practice. Similarly, developers can design a system to be either centrally or locally controlled. These decisions typically involve not only technical considerations, but are immersed in broader policy contexts and influenced by resource availability, funding requirements, legal regulations, and various other historical and social conditions and factors surrounding the teaching force. Decisions regarding standardization and locus of control are determined in no small part by the level of professionalism and human resources available at local and central levels. While these decisions may not be under direct control of developers or even individual policymakers, it is nevertheless important to maintain awareness of the variety of designs as well as the trade-offs between them. Highly standardized systems may gain centralized ability to set equitable standards of teaching quality, but can also stifle creativity and capacity building for locally driven professional development. Systems that emphasize local control may win local buy-in and ownership but risk losing control of comparable standards and process quality.

6.3. Reliability and validity

Attaining reliable and valid indicators of teacher practice based on classroom observation can be challenging and costly, requiring extensive training, and multiple visits to the classroom by multiple observers. Observation systems face critical questions about how many observers will be needed (for each teacher and for the system overall), how much training will be necessary, and what costs are to be incurred. The answers to these questions will naturally present trade-offs in considering the resources available alongside goals for accuracy and representativeness of the observation results. Additional observations have direct cost implications, in terms of training and substituting the time observers would have

spent on other activities. As Shavelson et al. (1986) pointed out, developers need to assess the key relevant sources of error and cost in classroom observation (e.g. observers, occasions, subject matter) to understand their influence on the measures of classroom processes derived. In this context, Generalizability theory (Shavelson & Webb, 1991) is reclaiming prominence as a flexible and powerful tool suited to investigate the properties of measures of teaching practice and the designs that maximize their reliability and accuracy within existing organizational constraints (Hill et al., 2012; Ho & Kane, 2013; Marcoulides, 1993; Newton, 2010; Praetorius, Pauli, Reusser, Rakoczy, & Klieme, 2014).

It is also important to draw a distinction between evidence of reliability and validity that generally supports the quality of the information collected in a psychometric sense, and ongoing mechanisms in place to ensure the quality of the observation data collection process; these could include quality control checks, checks of consistency or drift, and administrative indicators of implementation, among others. In our analyses we found a few examples of the former, particularly in larger, more established systems. On the other hand, we found very little evidence of the latter, but this type of quality control mechanism should be seen as a pre-requisite for any subsequent formal examination of psychometric reliability and validity.

Finally, evidence on whether inferences about teaching quality are in fact valid is yet more complex and costly to produce. Such evidence relies on a comprehensive validation agenda that includes different sources of evidence regarding observation protocol content, factorial structure and relation to other relevant variables (like student learning progress), to name only the most crucial. Political commitment in terms of sufficient time and financial resources to ensure valid inferences is necessary but difficult to come by, particularly in the pilot stages of system development. Undoubtedly, the higher the stakes of the decisions to be taken, the more important validation evidence becomes, but even formative uses should be based on sound diagnostic information on teaching quality (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014).

6.4. Appropriate uses of the information collected

Information collected through observation can serve summative and formative evaluation purposes and present policy scenarios with different stakes for teachers and other actors. In principle, formatively focused systems address how best to provide feedback to teachers from the evidence collected through observation in order to help them improve their practice. In a purely summative system, the focus would be on the reliability and precision of the scores and the most appropriate weight to give observation scores in the overall assessment of teachers. We did not identify any systems falling in this extreme group; all systems incorporate some kind of formative component and aim to use information to improve teaching, although they can vary considerably in the extent of their formative emphasis, and the specific ways in which formative mechanisms are implemented.

Using teacher appraisal data for both formative and summative purposes requires a difficult balance, and a range of methodological and policy considerations; summative decisions may lead to defensiveness of those receiving lower results and thus limit teachers’ openness to consider evaluation feedback for their learning and development. While combining formative and summative uses within the same appraisal system may be the logical default from a policy perspective, whether and how these parallel goals can be successfully achieved in practice remains an open question (Herman & Baker, 2009). Observations targeting teachers in training are typically formative in nature, but considerations about objectivity and methodological rigor may need to be incorporated for uses involving certification or tenure. With in-service teachers, observation is increasingly tied to external high stakes decisions related to promotion, referral to development, or monetary incentives; but in-service teacher observation can also be used within collegial systems for low-stakes self-initiated professional development purposes.

For formative purposes the focus and nature of the feedback provided depends on the one hand on the level of detail in the teaching framework, and on the other hand on the degree of standardization in the observation system itself. Systems also use different formats for providing feedback; written reports allow less detail and interaction than post-observation conferences or other in-person mechanisms. Ultimately, the utility and consequences of the feedback given and received will depend on the access to, and quality of, support mechanisms and opportunities for professional development in order to address diagnosed weaknesses.

Finally, the appropriate role to give the information collected through observation within a broader performance appraisal system also presents complexities that should be considered carefully. All systems in our sample reflect, formally or informally, a growing consensus in the literature that calls for combining multiple measures in order to present a more complete picture of teacher performance. However, this is often interpreted narrowly to suggest that these measures should be combined to form a more robust weighted indicator of effectiveness (Mihaly et al., 2013). In reality, however, there are a number of alternative models for combining measures. The question is thus not what weight to give observation scores to create a combined measure to evaluate teacher effectiveness, but how to use multiple measures in combination to more effectively appraise teaching and teachers. Developers and policymakers need to consider the advantages and disadvantages of compensatory models that rely on the creation of summary weighted indices and those that consider the information separately through decision rules (e.g. conjunctive, disjunctive models; see Mehrens, 1990).

7. Conclusions

The growing prominence of classroom observation across a wide variety of teacher evaluation and development systems calls for much more systematic discussion of critical issues and decisions related to what is observed, by whom, when, and how, and, equally important, how the information is used for formative and summative purposes. Our study is ultimately intended to start a conversation among education researchers and policymakers around the theoretical and practical commonalities and distinctions among observation systems, and the assumptions, goals, and factors that influence how these systems are designed and implemented on the ground. The trend towards more comprehensive appraisal of teacher and teaching performance continues to gain strength in regional and national systems worldwide, and with it the prominence of classroom observation. It is worth noting that a large-scale and highly standardized approach to classroom observation is far from the norm among traditionally high performing countries. There is little empirical evidence to suggest that large-scale standardized observation, and high-stakes teacher evaluation in general, constitutes a good vehicle for countries to bring about improvements and achieve educational success (Feuer, 2012), while at the same time there is little dispute that high quality teaching is the main factor modifiable by educational policy that predicts students’ educational outcomes (Nye, Konstantopoulos, & Hedges, 2004). New policy efforts involving classroom observation are under way among others in Norway in the form of a national effort aimed at training school principals as classroom observers; Australia, where national and

regional systems of observation and evaluation are being developed; and Mexico, where new legislation established a national teacher evaluation and incentive system. Adopting a comparative perspective based on concrete analytic dimensions like the ones discussed in this paper will continue to be valuable for informing research and policy debates about the appropriate uses of classroom observation for assessing and improving teaching in classrooms around the world.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.stueduc.2016.03.002.

References

Abelmann, C., & Elmore, R. F. (1999). When accountability knocks, will anyone answer? CPRE research report No. RR-042. Philadelphia, PA: University of Pennsylvania, Consortium for Policy Research in Education.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Australian Institute for Teaching, & School Leadership (AITSL) (2016). National partnership agreement on improving teacher quality. Australian Institute for Teaching and School Leadership (AITSL). http://www.aitsl.edu.au/docs/default-source/default-document-library/national_partnership_on_improving_teacher_quality.
Australian Institute for Teaching and School Leadership (AITSL) (2012). Australian teacher performance and development framework. Education Services Australia, Standing Council on School Education and Early Childhood (SCSEEC). Australia: Carlton South.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., (2010). Problems with the use of student test scores to evaluate teachers Vol. 278, Washington, DC: Economic Policy Institute. http://www.epi.org/publication/bp278/.
Barnett, J. H., Wills, K. C., Hudgens, T. M., & Alexander, J. L. (2015). TAP research summary spring 2015: examining the evidence and impact of TAP: the system for teacher and student advancement. National Institute for Excellence in Teaching. http://www.niet.org/assets/Publications/tapresearch-summary-0615.pdf.
Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument approach to observation protocol validity. Educational Assessment, 17(2-3), 62–87.
Bell, C., Qi, Y., Croft, A., Leusner, D., McCaffrey, D., Gitomer, D., et al. (2014). Improving observational score quality: challenges in observer thinking. In T. J. Kane, K. A.
Engel, L. C., Williams, J., & Feuer, M. J. (2012). The global context of practice and preaching: do high-scoring countries practice what US discourse preaches? Working paper 2–3. Washington, DC: George Washington University.
Erlich, O., & Shavelson, R. J. (1978). The search for correlations between measures of teacher behavior and student achievement: measurement problem, conceptualization problem, or both? Journal of Educational Measurement, 15(2), 77–89.
Farr, S. (2010). Teaching as leadership: the highly effective teacher’s guide to closing the achievement gap. San Francisco, CA: Jossey-Bass.
Feuer, M. (2012). Validity issues in international large-scale assessments: ‘Truth’ and ‘Consequences’. Paper presented at the Teachers College/Educational Testing Service conference on educational assessment, accountability and equity.
Fullan, M. G. (1992). Successful school improvement: the implementation perspective and beyond. Bristol, PA: Open University Press.
Gates, B. (2013). Teachers need real feedback. TED Conferences, LLC. http://www.ted.com/talks/bill_gates_teachers_need_real_feedback.html.
Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness: a research synthesis. Washington, DC: National Comprehensive Center for Teacher Quality.
Hakel, M. D., Koenig, J. A., & Elliott, S. W. (2008). Assessing accomplished teaching: advanced-level certification programs. Washington, DC: National Academy Press.
Halverson, R., Kelley, C., & Kimball, S. (2004). Implementing teacher evaluation systems: how principals make sense of complex artifacts to shape local instructional practice. In W. K. Hoy, & C. G. Miskel (Eds.), Research and theory in educational administration: (Vol. 3. pp. Greenwich, CT: Information Age Press.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.
Herman, J., & Baker, E. (2009). Assessment policy: making sense of the babel. In G. Sykes, B. Schneider, & D. Plank (Eds.), Handbook of education policy research (pp. 176–190). New York: Routledge.
Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., et al. (2008). Mathematical knowledge for teaching and the mathematical quality of instruction: an exploratory study. Cognition and Instruction, 26(4), 430–511.
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64.
Hill, H. C., & Grossman, P. (2013). Learning from teacher observations: challenges and opportunities posed by new teacher evaluation systems. Harvard Educational Review, 83(2), 371–384.
Ho, A. D., & Kane, T. J. (2013). The reliability of classroom observations by school personnel. Seattle, WA: The Bill & Melinda Gates Foundation.
Ingvarson, L., & Rowe, K. (2008). Conceptualising and evaluating teacher quality: substantive and methodological issues. Australian Journal of Education, 52(1), 5–35.
Jacob, B. A., & Lefgren, L. (2005). Principals as agents: subjective performance measurement in education. Working Paper #11463. Cambridge, MA: National Bureau of Economic Research.
Joint Committee on Standards for Educational Evaluation (2009). The personnel evaluation standards, 2nd ed. Thousand Oaks: Corwin Press.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (pp. 17–64). 4th ed. Washington, DC: The National Council on Measurement in
Kerr, & R. C. Pianta (Eds.), Designing teacher evaluation systems: new guidance Education & The American Council on Education.
from the measures of effective teaching project (pp. 50–97).San Francisco, CA: Kane, T., & Staiger, D. (2012). Gathering feedback for teaching. Seattle, WA: The Bill
Jossey-Bass. and Melinda Gates Foundation.
Braun, H. I. (2005). Using student progress to evaluate teachers: a primer on value- Kennedy, M. (2010). Teacher assessment and the quest for teacher quality. San
added models. Princeton, NJ: Educational Testing Service. Francisco, CA: Jossey-Bass.
Brookhart, S. M. (2011). Educational assessment knowledge and skills for teachers. Kimball, S. M. (2002). Analysis of feedback, enabling conditions and fairness
educational measurement. Issues and Practice, 30(1), 3–12. perceptions of teachers in three school districts with new standards-based
Brophy, J., & Good, T. L. (1986). Teacher behavior and student achievement. In M. evaluation systems. Journal of Personnel Evaluation in Education, 16(4), 241–268.
Wittrock (Ed.), Handbook of research on teaching (pp. 328–375).New York: Koppich, J. (2009). Toledo: Peer Assistance and Review. Retrieved from: http://
Macmillan. www.smhc-cpre.org/resources/.
Bruns, B., & Luque, J. (2014). Great teachers: how to raise student achievement in Latin Law 19.961 [Ministry of Education]. Sobre Evaluación Docente [About Teacher
America and the Caribbean. Washington, D.C: The World Bank. http://dx.doi.org/ Evaluation]. Diario Oficial de la República de Chile. Santiago, Chile.A ugust 14th,
10.1596/978-1-4648-0151-8. 2004.
Campbell, R. J., Kyriakides, L., Muijs, R. D., & Robinson, W. (2004). Assessing teacher Leahy, C. (2012). Teacher evaluator training: ensuring quality classroom observers.
effectiveness: a differentiated model. London: Routledge Falmer. Education Commission of the States. http://www.ecs.org/clearinghouse/01/01/
Casabianca, J. M., McCaffrey, D. M., Gitomer, D. H., Bell, C. A., Hamre, B., & Pianta, R. 14/10114.pdf.
(2013). Effect of observation mode on measures of secondary mathematics Levitt, R., Janta, B., & Wegrich, K. (2008). Accountability of teachers: literature review.
teaching. Educational and Psychological Measurement, 73(5), 757–783. Prepared for the general teaching council england. Santa Monica, CA: RAND
Chait, R. (2010). Removing chronically ineffective teachers: barriers and opportunities. Corporation.
Washington, DC: Center for American Progress. Loup, K. S., Garland, J. S., Ellet, C. D., & Rugutt, J. K. (1996). Ten years later: findings
City, E. A., Elmore, R. F., Fiarman, S., & Teitel, L. (2009). Instructional rounds in from a replication of a study of teacher evaluation practices in our 100 largest
education: a network approach to improving teaching and learning. Cambridge, school districts. Journal of Personnel Evaluation in Education, 10, 203–226.
MA: Harvard Education Press. La evaluación docente en Chile [teacher evaluation in Chile]. In J. Manzi, R. González,
Danielson, C. (1996). Enhancing professional practice: a framework for teaching. & Y. Sun (Eds.). Santiago, Chile: Facultad de Ciencias Sociales, Escuela de
Association for Supervision and Curriculum Development (ASCD). Psicología, Pontificia Universidad Católica de Chile.
Danielson, C. (2011). Enhancing professional practice: a framework for teaching, 3rd Marcoulides, G. A. (1993). Maximizing power in generalizability studies under
ed. Association for Supervision and Curriculum Development (ASCD). budget constraints. Journal of Educational Statistics, 18(2), 197–206.
Darling-Hammond, L. (2013). Getting teacher evaluation right: what really matters for Martinez, J. F. (2013). Combining multiple measures of teacher practice and
effectiveness and improvement. New York: Teachers College Press. performance: technical and conceptual considerations for teacher evaluation.
Ellett, C. D., & Garland, J. (1987). Teacher evaluation practices in our largest school Pensamiento Educativo Latinoamericano, 50(1), 4–20.
districts: are they measuring up to ‘state-of-the-art' systems? Journal of McKay, S., & Silva, E. (2015). Improving observer training: the trends and challenges.
Personnel Evaluation in Education, 1(1), 69–92. issue Brief. Stanford, CA: Carnegie Foundation for the Advancement of Teaching.
Elmore, R. (2008). Leadership as the practice of improvement. Improving School http://www.carnegiefoundation.org/resources/publications/improving-
Leadership, 2, 37–67. observer-training-the-trends-and-challenges/.
F. Martinez et al. / Studies in Educational Evaluation 49 (2016) 15–29 29

Mehrens, W. A. (1990). Combining evaluation data from multiple sources. In J. Putnam, R., & Borko, H. (2000). What do new views of knowledge and thinking have
Millman, & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: to say about research on teacher learning? Educational Researcher, 29(1), 4–15.
assessing elementary and secondary school teachers (pp. 322–334).Newbury Park, Reform Support Network (2012). Evaluations of teacher effectiveness: state
CA: Sage. requirements for classroom observations. U.S. Department of Education. https://
Medley, D., & Mitzel, H. (1963). Measuring classroom behavior by systematic rtt.grads360.org/services/PDCService.svc/GetPDCDocumentFile?fileId=2421.
observation. In N. Gage (Ed.), Handbook of research on teaching (pp. 247–328). Richards, J. C. (1991). Towards reflective teaching. The Teacher Trainer., 5(3), 4–8.
Chicago: Rand McNally. Santiago, P., & Benavides, F. (2009). Teacher evaluation: a conceptual framework and
Mihaly, K., McCaffrey, D. F., Staiger, D. O., & Lockwood, J. R. (2013). A composite examples of country practices. OECD-Mexico workshop ‘Towards a teacher
estimator of effective teaching. Seattle, WA: The Bill & Melinda Gates Foundation. evaluation framework in Mexico: international practices, criteria and mechanisms’.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: an expanded Santiago, P., Benavides, F., Danielson, C., Goe, L., & Nusche, D. (2013). Teacher
sourcebook. Beverly Hills, CA: Sage Publications. evaluation in Chile 2013. OECD reviews of evaluation and assessment in education.
Milanowski, A. (2004). The relationship between teacher performance evaluation Paris, France: OECD Publishing. http://dx.doi.org/10.1787/9789264172616-en.
scores and student achievement: evidence from Cincinnati. Peabody Journal of Schleicher, A. (2011). Building a high-Quality teaching profession: lessons from
Education, 79(4), 33–53. around the world. Educational Studies 1, 74–92. http://dx.doi.org/10.1787/
Millman, J., & Darling-Hammond, L. (1990). The new handbook of teacher evaluation: 9789264113046-en.
assessing elementary and secondary school teachers. Thousand Oaks, CA: Corwin Sclafani, S., & Lim, E. (2008). Rethinking human capital in education: Singapore as a
press, Inc.. model for teacher development. aspen institute education and society program.
Ministry of Education (2004). El Marco para la Buena Enseñanza. Santiago, Chile: Washington, D.C: Aspen Institute. http://files.eric.ed.gov/fulltext/ED512422.
Ministry of Education. pdf.
Ministry of Education, Culture, Sports, Science, & Technology (2016). Japan country Seifert, C. F., Yukl, G., & McDonald, R. A. (2003). Effects of multisource feedback and a
background report. Ministry of Education, Culture, Sports, Science and feedback facilitator on the influence behavior of managers toward subordinates.
Technology. http://www.oecd.org/edu/school/29163349.pdf. Journal of Applied Psychology, 88(3), 561–569.
Mourshed, M., Chijioke, C., & Barber, M. (2010). How the world’s most improved school Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: a primer. Thousand
systems keep getting better. London: McKinsey. Oaks, CA: Sage.
Müller, S., Pietsch, M., & Bos, W. (2011). Schulinspektion in Deutschland: Eine Shavelson, R. J., Webb, N. M., & Burstein, L. (1986). Measurement of teaching, In M.
Zwischenbilanz aus empirischer Sicht [School inspection in Germany: A summary Wittrock (Ed.), Handbook of research on teaching (pp. 50–91).3rd ed. New York,
based on empirical evidence.]. Münster: Waxmann. NY: Macmillan.
Newton, X. A. (2010). Developing indicators of classroom practice to evaluate the Stallings, J. (1977). Learning to look: a handbook on classroom observation and teaching
impact of district mathematics reform initiative: a generalizability analysis. models. Belmont, CA: Wadsworth Publishing Company.
Studies in Educational Evaluation, 36(1), 1–13. Tatto, M. T., Schwille, J., Ingvarson, L., Rowley, G., & Peck, R. (2012). Policy, practice,
Nusche, D., et al. (2011). OECD reviews of evaluation and assessment in education: and readiness to teach primary and secondary mathematics in 17 countries:
Norway 2011. OECD reviews of evaluation and assessment in education. Paris, findings from the IEA teacher education and development study in mathematics
France: OECD Publishing. http://dx.doi.org/10.1787/9789264117006-en. (TEDS-M). Amsterdam: IEA.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Taut, S., & Sun, Y. (2014). The development and implementation of a national,
Educational Evaluation and Policy Analysis, 26, 237–257. standards-based, multi-method teacher performance assessment system in
OECD (2009). Creating effective teaching and learning environments: first results from Chile. Education Policy Analysis Archives, 22(71), 1–30. http://dx.doi.org/
TALIS. Paris, France: OECD Publishing. 10.14507/epaa.v22n71.2014.
OECD (2013a). Teachers for the 21st century: using evaluation to improve teaching. Taut, S., Santelices, V., Araya, C., & Manzi, J. (2010). Theory underlying a national
Paris, France: OECD Publishing. teacher evaluation program. Evaluation and Program Planning, 33, 477–489.
OECD (2013b). Synergies for better learning: an international perspective on evaluation http://dx.doi.org/10.1016/j.evalprogplan.2010.01.002.
and assessment. OECD reviews of evaluation and assessment in education. Paris, Taut, S., Santelices, V., & Stecher, B. (2012). Validation of a national teacher
France: OECD Publishing. assessment and improvement system. Educational Assessment Journal, 17(4),
Patton, M. (2002). Qualitative research and evaluation methods. Thousand Oaks, CA: 163–199. http://dx.doi.org/10.1080/10627197.2012.735913.
Sage. Taylor, E., & Tyler, J. (2012). The effect of evaluation on teacher performance.
Peterson, K. (2000). Teacher evaluation: a comprehensive guide to new directions and American Economic Review, 102(7), 3628–3651.
practices, 2nd ed. Thousand Oaks, CA: Corwin press. U.S. Department of Education (2009). Race to the top program. Executive summary.
Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2007). Classroom assessment scoring Department of Education. http://www2.ed.gov/programs/racetothetop/
system—CLASS. Baltimore. MD: Brookes. executive-summary.pdf.
Praetorius, A., Pauli, C., Reusser, K., Rakozcy, K., & Klieme, E. (2014). One lesson is all Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: our
you need? Stability of Instructional Quality across Lessons. Learning and national failure to acknowledge and act on differences in teacher effectiveness.
Instruction 31, 2–12. http://dx.doi.org/10.1016/j.learninstruc.2013.12.002. Brooklyn, NY: The New Teacher Project.
Protheroe, N. (2002). Improving instruction through teacher observation. Principal,
82(1) 48–51.