

Feature Articles

Using Student Test Scores to Measure Teacher
Performance: Some Problems in the Design and
Implementation of Evaluation Systems
Dale Ballou¹ and Matthew G. Springer¹

Our aim in this article is to draw attention to some underappreciated problems in the design and implementation of
evaluation systems that incorporate value-added measures. We focus on four: (1) taking into account measurement error in
teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and
(4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.

Keywords: accountability; educational policy; policy analysis; regression analyses; statistics; teacher research

Introduction

Race to the Top (RTTT) is a competitive grant program created under the American Recovery and Reinvestment Act of 2009. RTTT provides incentives for states to reform K–12 education in such areas as turning around low-performing schools and improving teacher and principal effectiveness. To date, the U.S. Department of Education (USDOE) has awarded 19 states more than $4.35 billion to implement RTTT reforms. These states serve approximately 22 million students and employ 1.9 million teachers in 42,000 schools, representing roughly 45 percent of all K–12 students and 42 percent of all low-income students (USDOE, 2014).

As part of RTTT, USDOE called for states and their participating school districts to improve teacher and principal effectiveness by developing comprehensive educator evaluation systems. State and district educator evaluation system plans were reviewed by USDOE to ensure districts (1) measure student growth for each individual student; (2) design and implement evaluation systems that include multiple rating categories that take into account data on student growth as a significant factor; (3) evaluate teachers and principals annually and provide feedback, including student growth data; and (4) use these evaluations to inform decisions regarding professional development, compensation, promotion, retention, tenure, and certification (USDOE).

The development of educator evaluation systems in which one component is student performance on standardized tests is unpopular with some teachers and controversial among statisticians. American Federation of Teachers President Randi Weingarten has called for an end to using value-added measures as a component of teacher evaluation systems (Sawchuk, 2014). Much has been said and written about the difficulty of drawing valid statistical inferences about teacher quality from student test scores (American Statistical Association, 2014; Harris, 2011; McCaffrey, Lockwood, Koretz, & Hamilton, 2003). It is not our intent to repeat those arguments here. We suspect that such reforms are here to stay and that test-based measures of teacher performance will be incorporated into teacher evaluation systems with increasing frequency.

On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments.

Our aim in this article is to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. We focus on four: (1) taking into account measurement error in teacher assessments, (2) revising teachers' scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.

¹Vanderbilt University, Nashville, TN

Educational Researcher, Vol. 44 No. 2, pp. 77–86
DOI: 10.3102/0013189X15574904
© 2015 AERA.
March 2015    77

It is important to note that it is not our purpose in this article to offer a general critique of the treatment of estimation error in value-added assessment. Rather, we focus on two problems that we find in teacher evaluation systems being implemented in RTTT states: (1) systems that ignore estimation error altogether and (2) systems that rely on t-statistics as a summary measure of teacher performance. In conclusion, we offer some practical guidance and recommendations.

Taking Account of Estimation Error

Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates. Otherwise decisions based on them will be unfair to teachers. Equally troubling, resources will be wasted if teachers are targeted for interventions without taking into account the probability that the ratings they receive are based on error.

Ignoring Estimation Error

Evaluation systems based on the Colorado Growth Model (Betebenner, 2008, 2011) or a close variant are particularly likely to ignore estimation error. An example is the Georgia teacher evaluation system (Georgia Department of Education, 2013). Student performance on standardized tests is compared to that of students who scored at the same percentile in the prior year. Where students fall in this year's distribution (of identically scoring students the previous year) is used to gauge teacher effectiveness. If the average percentile across a teacher's students falls below 30, the teacher is deemed ineffective with respect to this component of the evaluation system.

The probability that this happens to a particular teacher is a function of three things: the teacher's true effectiveness, the variability of student performance for reasons other than teacher quality (including test measurement error), and the number of students a teacher has. We have presented some illustrative calculations in the appendix to this article, comparing a teacher of 25 students to a teacher of 100 students. (The first might be an elementary teacher with a self-contained classroom, whereas the second teaches in a middle school using departmentalized instruction.) In our illustrative examples the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher's students who are tested, a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has. Clearly it is unfair to treat two teachers with the same true effectiveness differently.

Incorporating Measurement Error in a t-Statistic

One of the most widely used methods for teacher value-added assessment is the SAS Institute's Educational Value-Added Assessment System (EVAAS). SAS provides statewide EVAAS reporting to every district, public school, and charter school in North Carolina, Ohio, Pennsylvania, South Carolina, and Tennessee. SAS EVAAS for K–12 reporting is also used by large, medium, and small districts and schools in many other states, including Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Georgia, Illinois, Louisiana, Missouri, New Jersey, New York, Texas, Virginia, and Wyoming (SAS Institute, 2014).

EVAAS incorporates measurement error into a single summary measure of performance in the form of a t-statistic, the ratio of the teacher's value-added score to its standard error. This statistic or a transformation of it is then incorporated into the teacher's evaluation. An example is the North Carolina teacher evaluation system, which classifies teachers into five categories based on the ratio of the teacher's value-added score to its standard error (the teacher's t-statistic).¹ The North Carolina system classifies teachers into five categories based on this ratio: greater than 2, between 1 and 2, between 1 and −1, between −1 and −2, and less than −2. Teachers consistently in the bottom category may face sanctions.

This use of a t-statistic is misguided. The numerator of this statistic is an estimate of how much the teacher in question differs from average. The denominator measures how precisely we have estimated the numerator. A large denominator relative to the numerator means we cannot be very confident that the teacher in question truly differs from the average: the value-added estimate is too imprecise. Clearly both are important. This statistic is conventionally used to test the hypothesis that a teacher differs significantly from average.² An affirmative answer does not mean that the teacher in question is so bad that some kind of corrective action should be taken. It simply means that decision makers have enough information to rule out, with a high degree of confidence, the hypothesis that the teacher is average. A teacher who is just a little bit worse than average, but whose value added is estimated with a high degree of precision, can have a t-statistic less than −2. The teachers at greatest risk of this are those for whom we have a lot of data. It is not the high value of the numerator that puts the teacher in the lowest category but the small value of the denominator. Yet the goal in evaluation ought not to be to identify teachers who are worse than average, even if only by a small amount, but to identify teachers who are quite a bit worse than average, provided decision makers have sufficient confidence in that estimate.

The t-statistic is the wrong tool for this job. A more reasonable approach would be to apply a two-step test to each teacher. The first step would focus on the numerator of the t-statistic, identifying those teachers whose estimated value added falls below (or above) some agreed-on threshold. If so, then one would proceed to the second step, asking whether the estimate is sufficiently precise for this teacher to be identified as a case requiring further attention or action.³ It is far from obvious that the answer to this second question should rely on a conventional standard for statistical significance (e.g., that the standard error be no greater than half the value-added estimate). The decision ought to be based on the relative costs of false positives and false negatives, not on the answer to the purely conventional question, Are they 95% confident the teacher differs from average? But even if this standard were used in the second step, such a two-step procedure would be an improvement over present practice.
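The contrast between the five-category t-statistic rule and a two-step test can be made concrete with a small sketch. This is an illustration under stated assumptions, not the procedure of any state system or of EVAAS: the three teachers and their numbers are invented, the handling of exact ties at the category cut points is a guess, and the parameters `threshold` and `max_std_error` are hypothetical policy choices that, as the text argues, ought to reflect the relative costs of false positives and false negatives.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """A teacher's value-added estimate (all numbers here are made up)."""
    teacher: str
    value_added: float  # estimated difference from the average teacher
    std_error: float    # standard error of that estimate


def t_category(est: Estimate) -> str:
    """Five-way classification by t = value_added / std_error.

    Cut points follow the categories described in the text; how exact
    ties at +/-1 and +/-2 are handled is an assumption of this sketch.
    """
    t = est.value_added / est.std_error
    if t > 2:
        return "greater than 2"
    if t > 1:
        return "between 1 and 2"
    if t >= -1:
        return "between 1 and -1"
    if t >= -2:
        return "between -1 and -2"
    return "less than -2"


def two_step(est: Estimate, threshold: float, max_std_error: float) -> str:
    """Step 1 looks only at the size of the estimate; step 2 at its precision.

    threshold and max_std_error are hypothetical policy parameters.
    """
    if est.value_added >= threshold:
        return "no action"
    if est.std_error > max_std_error:
        return "below threshold, but too imprecise to act on"
    return "flag for further attention"


teachers = [
    Estimate("A", value_added=-0.04, std_error=0.015),  # slightly below average, lots of data
    Estimate("B", value_added=-0.30, std_error=0.20),   # far below average, little data
    Estimate("C", value_added=-0.30, std_error=0.04),   # far below average, lots of data
]

for est in teachers:
    print(est.teacher, "|", t_category(est), "|",
          two_step(est, threshold=-0.10, max_std_error=0.05))
```

With these invented numbers, teacher A falls in the lowest t-statistic category even though the estimate is only slightly below average, while the two-step rule takes no action. Teachers B and C share the same estimate and differ only in precision, so the two-step rule separates them at the second step rather than folding size and precision into a single ratio.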

revising value-added estimates poses problems when the passed out of the testing regime.8 Indeed. behavior in response to high-stakes accountability. one might hope that state administrative data systems would be ger furnish revised estimates. 1999. what is the Roster Verification response to teachers who claim that their revised scores would show they ought not to have lost their jobs? What is to prevent a Value-added assessment requires data linking students to the discharged teacher from going to court and demanding they be teachers who provided instruction in tested subjects. This affects teachers quite differently.7 data. They are increasingly less 2014 school year. Although altering test scores. Whether or not there are sound statistical reasons for 2008–2009. even when there are ing subsequent revisions. This circumvents the problem only in sound statistical reasons for carrying out such revisions.9 estimates (Ballou. scores poses political and legal problems for states that want to stakes decision will be called into question on the basis of later make high-stakes personnel decisions in a timely manner. these calculations in systems are not up to this challenge. data window that comprises the 2003–2004 through 2007– Although this may be true. But if it is statistically sound practice to make these revi- sions. nobody knows who taught a student more his or her value-added score in 2008 is first calculated. Indeed. the mented procedures wherein teachers are called on to verify and most recent 5 school years furnish the data used to produce the correct their class rosters. and test results from the most might fail to claim students they fear will lower their value-added recent year are added. one firm that specializes in developing software depending on the grades they teach. As time passes and the window slides for. Such behav- ward. scores. no high. depending on a teacher’s grade level and the revised estimate for the same year. 
as each revision is based on less information only the first set of estimates produced for a given cohort. make linking student test data to individual teachers difficult” repeated claim that EVAAS uses a student’s entire history of test (Webber et al. Because many state administrative data scores when estimating teacher value added.. When his or her value-added score for 2007– revised score for that same cohort a year later in 2008–2009. as calculated in the summer of 2014. Notwithstanding the oft. Consider a fourth-grade to assist with the linkage of students to teachers proclaims on its teacher in a state where standardized testing begins in Grade 3 website. In some cases (e. it clearly makes no sense to revise personnel decisions may attempt to evade this problem by using these estimates. as the data receive yet another revision in the 2009–2010 school year for the window slides forward. case. With each passing Roster verification raises the obvious concern that teachers year.5 past. who wonder why their 2004. EVAAS drops older years: first 2003– same cohort. whether their progress is sustained). but it is not clear that even this sufficiently accurate for this purpose. released in the summer of amount of information that subsequently becomes available 2015. & Wright. was so reliable. Moreover. a fourth-grade teacher who receives a value-added This is not the case for an eighth-grade teacher: The situation score for the cohort he or she had in 2007–2008 will receive a is asymmetric. when students were in 4th grade. when students scores obtains indirect support from other studies of strategic were third graders.g. Koretz & Barron. He or she will enrolled in the system that long). low that the teacher loses his or her job or license but whose To summarize. they have below). Sanders.4 This has confused teachers. At the other end it adds past. Although produced? The state may request that the SAS Institute no lon. about student performance. not more. 
What will be grade teacher’s EVAAS scores are based on fewer and fewer data done about the teacher whose performance during the 2013– about their students. the only other year for which they have test manipulate their rosters in order to improve their value-added scores for this cohort of students is 2006–2007. the 5-year data window includes information taking into account subsequent performance of those students about those 8th-grade students in 4 prior years (if they have been (basically. “The major challenge reported The issue becomes still messier when we consider whether the by the largest number of SEAs was that current data systems revised scores are actually better. then 2004–2005. it can make sense to which these sanctions would apply? revise her estimated value added. 2009–2010—but there are no more data on these undertaking these revisions (a question to which we return students in those years. Schools have been found to engage in strategic March 2015    79 .. as the SAS Institute has always claimed. the notion that teachers might 2008 school years. 2005. ignor. the oldest data are dropped. for such a teacher could demand they be pro. many states have imple- fact have been based on a 5-year window of data—that is. the eighth- States contemplating the use of EVAAS scores for high-stakes grade teacher described above). (Goodnough. places the teacher’s performance above the threshold at about students he or she had in the past.6 Thus. 2014). As 9th and 10th graders. or assisting students with test questions there are some concerns about using post-fourth-grade perfor. as noted in a recent review of RTTT implementa- duced. tion conducted for the USDOE. when value-added score keeps changing for students they had in the they were in 5th grade. 2008 cohort (as they advance through higher grades). Thus. Jacob & Levitt. mance to evaluate how much these students learned in fourth 1999). and so forth. 2014). 
revising the narrow sense: If you never look at the revised score. When their student data. it will include increasing amounts of data about the 2007– ior can take the form of focusing excessively on a single test. frequently this is not the would suffice. as they always were in the past. on balance it makes sense to revise the teacher’s value- added estimate for the 2007–2008 school year: the individual SAS EVAAS reassesses the performance of a teacher as more data performing the calculation has more information about those become available about students that teacher has taught in the students than he or she had the first time it was calculated. the revisions of an 8th- evaluation system is used for high-stakes decisions. using a than that student’s teacher” (RANDA Solutions. In subsequent years.Revising Value-Added Estimates grade. “We’ve empowered the teachers to have control over and ends in Grade 8 (not an uncommon configuration). 2008 is calculated. 2004).

unclaimed. relying on administrative records to ascertain which students dents will be absent on test day (Figlio. We identi- ing strategically: The students they do not claim have on average fied these as exempt based primarily on attendance and special test scores far below those of the students who are claimed. Jacob. over- well busy administrators working with faulty data systems per. Our data for this analysis number of students in the first category is quite small: fewer than include student test scores for the 2011–2012 and 2012–2013 1. and exempt but analyzed data for students in Grades 4 through 8 in one of the claimed. and the portion of the school year in category. a student who is not claimed is very likely to be one we identify students who should not be claimed on rosters who would lower teachers’ value added. a majority of the students we deem exempt are which they were enrolled in a particular classroom and school. particularly given that determining What about the possibility that unclaimed students were whether a student should be claimed requires accurate atten. 2006. part—misidentifying which students are truly exempt? Suppose cess. scores are not to be used in calculating teacher value added. the often fail to drop students whom we have placed in the exempt identity of their instructors. During roster verification. 2003). Indeed. honest errors arising for haphazard istrative data (Peabody & Markley. it is possible (indeed. High levels of absenteeism would also explain administrative records are the reason that teachers are asked to their poor test performance. that there are any errors in the the set who should go unclaimed—for example. and sample size for three categories of students: To illustrate the potential for abuse of such processes. we have also retention policies (Haney. we have unclaimed. claimed by their teachers. egories of special education students should be claimed. 
assigned to the wrong teacher in the administrative records and dance data for the entire year and. math instructor in the 2012–2013 course file and for whom teachers are instructed to drop from their rosters students whose prior year scores are available.g. In Table 1 we present results from a regression a test-based evaluation system and that educator behavior will analysis predicting 2012–2013 student math scores as a function need to be monitored. The first and third categories represent discrepancies RTTT states where test scores in mathematics and reading/ between the information we have gleaned from administrative English language arts are used to produce value-added assess. The latter fall into two main groups: students who spent the basis of grade level and past scores. it seems highly unlikely to claim students they should. there are conjunction with the first. 2002. There are many discrepancies between our classification of Could these findings be due to classification errors on our students and the rosters that emerge from the verification pro. We report regression coefficients. related to student test performance. a second fact. exempt.13 More curious is the fact that teachers course enrollment files reporting the classes students take. and oth- Figlio & Getzler. based on administrative records and are asked to delete students Students who are dropped from teachers’ rosters tend to be those who should not be used to calculate their value-added assess. of whether administrative records indicate a student is exempt or attendance records. education records. records and rosters as verified by teachers. whose test performance is far below what would be expected on ments.12 All scores are departments of education typically require principals or other expressed in scale score units. This in itself is not compelling evidence that teachers are that the students unclaimed during roster verification are precisely behaving strategically or. 
These studies suggest that at least some teachers and The two groups should look similar. Given that school funding is tied to the enroll- Discrepancies can arise in two ways: when teachers claim stu. including who should be claimed.11 Using course enrollment files. The sample is restricted to students form this function is open to question. some cat- proficiency (Cullen & Reback. non- fewer than the requisite number of days in a teacher’s class and exempt and the unclaimed. teachers receive prepopulated rosters In both cases. ment of special education students and to attendance and that dents that they are instructed not to claim and when teachers fail these files are therefore carefully checked. We note that the ments of teachers of those subjects. The process is made more for whom we are able to identify at least one math course and complicated and error prone by the fact that in many states. However. the imperfections of excessive absences. In short. Deere & Strayer. Jacob. that so many of our exempt students are actually nonexempt. and special education enrollment records. 2003). The discrepant cases do (“exempt”). There is evidence that they ers should not). with virtually identical means and administrators to corroborate the rosters teachers submit. There is little evidence here of widespread school years. Clearly it is possible to make hon. schools will take advantage of virtually any opportunity to game They do not. 2005). made mistakes. manipulate grade were taught by which teachers. misreport admin. 2000.. We are skeptical that this hypothesis verify rosters in the first place. and plan nutritionally reasons should not introduce systematic differences between the enriched lunch menus prior to test day (Figlio & Winicki.000 statewide. likely) that in employ discipline procedures to ensure that low-performing stu. linkages established between students and teachers cheating in the sense that teachers are failing to claim students via roster verification. 
taken in accounts for our findings for the following reasons. indeed. nonexempt. of a fourth-order polynomial in a student’s prior year math scores The roster verification procedures implemented by state and an indicator for grade level in 2012–2013. strongly suggests teachers are behav. in many cases.10 standard errors. 66. Both the unclaimed. and for how long. test performance of students claimed and students unclaimed: 2005). exempt groups score two or more students with certain categories of disabilities qualifying them standard deviations below the level expected. the regression coefficients tell a similar story.004 students in the exempt but claimed category. and administrative records. 2001. all standard deviation = 95). nonexempt. regardless for special education services. 2005). est errors of both types. Others (“nonexempt”) are expected to be claimed not arise as the result of random errors but are systematically during roster verification. students with rosters that teachers submit: after all.classification of students as special education and limited English about the rest of the student’s academic program (e. Similarly. First. How standard deviations at each grade level (overall mean = 749. information then slipped through the cracks during roster verification? There 80    EDUCATIONAL RESEARCHER . However.

25)   n = 643 n = 548 n = 474 Exempt but claimed –2. they have only to read gory. and about whom there should have the Exam been no doubt about who was the responsible instructor.36) (0. it is clear that processes for roster verification are would systematically underperform claimed students.27   (0. Again. given the imperfections of administrative data. roster verification process is far from perfect. suggesting that teachers drop stu. The will grow stronger. raises the prospect of more serious manipulation of Such cases may also arise more often among poor-performing roster verification should value added come to be used for high- students who are moved in and out of various instructional stakes personnel decisions.20) (0. a far more widespread abuse: coaching students during test- ments in the state we have studied. The rosters that moreover. their performance as a group is essentially average. The most egregious forms of cheating We close this section by stressing what this analysis shows as involve changing student answer sheets and revealing answers well as what it does not show. Even if all of perhaps even teachers. the impact on teacher value added would be stepped a line. but there is no resulting dent knows the right answer and has missed the question due bias in value-added scores on the whole. supervisors. There may be Rather. exempt –371. This evidence of self-interested behavior on an extra help session). There larly by such wide margins as are evident in our data.333 n = 9.554 Unclaimed.39)   n = 13. according to our administrative data. a practice particularly likely to be effective if the stu- drop such students from their rosters. If the likelihood of error on our part. nonexempt and unclaimed. answer sheets and point to questions that students have Individual teachers may be helped or hurt by their failure to missed. Additional regressors include indicators of student grade level. 
they do not appear to be acting strategically in the teachers submit contain many errors that slip past the supervi- sense of selectively failing to claim a subset of their students. Teachers circulating throughout the room can negligible.04   (1. are not aware that they have over- them were claimed. in Column 3 we have should not be.94 –2.19) (1. The something special about these teachers we were unable to detect. teachers to cheat. March 2015    81 . were not linked to any students in the linkage file. Some highly publicized incidents have shown that the use of exempt are strongly negative. the coefficients on unclaimed.731 n = 38. would be no presumption in such cases that unclaimed students Moreover.48 –245.412 Sample size 320. in the course file—that is. to a careless error. A very small number of ing.23)   n = 66.36) (0.427 n = 12.004 n = 62. Table 1 Achievement Test Scores as a Function of the Teacher’s Claiming the Student Explanatory Variables (1) (2) (3) Unclaimed. As for the teachers in the exempt but claimed cate. Sample in Column 4 excludes students who had more than one math teacher during the school year. In such cases teachers and principals may the part of teachers.127 Note. we drop students who appear to have These discrepancies are not neutral. we are benefits to allowing teachers to check their rosters. nonexempt –218.80   (0. to students. In Column 2 we have dropped system relies on data known by teachers to be incorrect. Coaching can take such subtle forms that students. Teachers tend to avoid switched classes and teachers at some point during the year or claiming students whose test performance would lower their who had multiple math classes (perhaps a regular math class and value-added scores. Teachers have conducted two additional analyses to further reduce the must buy into any evaluation system for it to be successful. a quadratic function of prior year mathematics score. and students fall in the unclaimed. 
Sample in Column 3 excludes students whose teachers claimed no students in the roster verification process. particu. our analysis points to the potential for abuse. our find. sors responsible for checking them: Students are claimed who This does not alter our findings. Finally.39 –382.21 –377. To begin with the latter. nonexempt category. when incentives to game the system arrangements in an effort to find something that works.79 –2. combined with inadequate oversight by have been confused about who ought to have claimed the student. Sample comprises students in Grades 4 through 8 who took the 2012–2013 spring achievement test in mathematics. had only one math Teachers Monitoring Their Own Students During teacher for the entire year. and the number of days the student received instruction from his or her principal math instructor. Finally. coach students without saying a word.28 –248. Less attention has been paid to what we suspect is ings do not challenge the overall accuracy of value-added assess.278 235.21) (0.12) (1. value-added assessments in high-stakes decisions may lead dents they perceive will hurt their value-added scores. it is not from the estimation sample students taught by teachers who likely to be recognized as legitimate. required.325 256. others are unclaimed when it appears from limited the sample to students who have only one math teacher administrative data that they ought to be on someone’s roster.14 remaining sample comprises the least ambiguous cases—students who.

Noguchi. Model 2 adds an indicator for whether the student was supervised during testing by a teacher he or she did not have in any of his or her classes. monitored by their own teacher are being compared to other matics.04   (.12) (. native English speakers. the improve- rooms. At every grade level. The inclusion of teacher hypothesis using data on students in Grades 4 through 8 in a fixed effects means this is a within-teacher comparison: Students single urban system in the South. DC (Baker. For this year we also have administra.80) (. In addition. gender.15 Although we are unable to provide a ing the exam. Column 2 contains the percentage of his or her own students that the median math teacher supervises during testing.07 0. to find the total number of additional questions Grade 6. Popular school math teacher. California.15 1. the coefficient represents how monitor their own students during the exam than when they are many more questions students get right.17 administration of the math test.23) 7 21 1. this per.98   (. where have been dropped from the analysis.97 1. Except in sixth grade.46 1.44) (. This could produce an upward bias in obtained using the student-teacher linkages established by the our estimates for Grades 4 and 5. However.29) (. tored by their own teacher. environment) who might for other reasons be expected to per- The teacher who provided mathematics instruction has been form poorly on the exam. full-year enrollees in the school.02 –0. Because this is the median percentage is 90. it is 20% to 21%. the number of centage tends to be high in the elementary grades.49 –0. where most questions answered correctly is higher when students are moni- instruction as well as testing takes place in self-contained class. students who require a special testing tive records that identify the teacher who monitored the exam. By average effect. special education students and students with multiple math instructors have been dropped from the sample in Column 6. 
and an indicator for and Washington. Table 2 Students Monitored by Their Own Classroom Teachers: Effect on Math Scores (1) (2) (3) (4) (5) (6)   Model 1 Model 2 Model 3 Median Effect of Being Effect of Being Effect of Effect of Being Percentage Supervised by Supervised by Being Supervised by of Own Students Student’s Own Student’s Own Supervised by Student’s Own Grade Supervised Math Teacher Math Teacher Stranger Math Teacher 4 90 1. Our data include student scores on tests given in the great majority of fourth and fifth graders take exams moni- the spring of the 2009–2010 school year as well as prior year tored by their own teachers.86   (. departmentalized instruction makes it infeasible for most math In the second column of Table 2 we report the percentage of teachers to administer exams to more than a small fraction of a teacher’s own students that the teacher monitors during the their students. Model 3 contains the following additional covariates not in Model 2: student race/ethnicity. Dockterman.55 1.88 2.g. As one might expect. a teacher fixed effect. Model 1 also controls for math score in the prior year and for math teacher fixed effects.. are exceptional cases (e. Column 3 contains coefficients from a regression of spring math scores on an indicator for whether the student was supervised during testing by his or her own math teacher.22) Note. though informal contacts with students under his or her supervision falls sharply.30) (. For the median middle and teachers provide anecdotal evidence that they occur. Thus. we regress 2009–2010 math scores on students during testing in New York.70 0. though we have conducted similar analyses for reading/ students of the same teacher monitored by someone else.21) (. Pennsylvania. 2013.16 1. We are unaware of research documenting the extent of this percentage of a math teacher’s own students who take the test and similar practices.16 This falls to 74% in Grade 5.21) (. 
whether the student is monitored by his or her math teacher dur- 2014.21   (. We report coefficients on the last variable in the comprehensive answer here.46 0. size of the test-taking group.13 0.23) (.23) (.29) (. 2013.34) 6 20 0. Among fourth-grade math teachers in this district.21) (.29 1.32 1.46 –1.48) (. We report results for mathe.14) (. The dependent variable is the raw are more likely to engage in this kind of behavior when they number right on the exam. Brown. it is possible that those who do not scores for these students. on average. 2013). bias from the same roster verification process. media have also reported on the problem of teachers coaching To test our hypothesis. We test this own math teacher monitors the exam. when their assigned to monitor students of other teachers. and the answered correctly when students are monitored by their own 82    EDUCATIONAL RESEARCHER . The results are striking. Because language arts. we test the hypothesis that teachers second column of Table 2.34) 5 74 1. the ment is at least one question per student. Students not claimed in this process source is not a plausible factor in Grades 6 through 8. departmentalized instruction is the norm.52   (.21) (. student eligibility for free or reduced-price lunch.15) (.22) 8 21 0. and age. prior year scores.

However. (In reading/language arts. underappreciated problems if they are going to become a perma- ing advantage of opportunities presented by the system to nent part of the K–12 landscape.645*. the aggregate impact on class perfor. Evaluation systems need to find better ways interventions that can be used in math (e. Although the results are not as strong The USDOE’s RTTT reform initiative required states and school as for mathematics. and . plans are in place to use fractional not fall into a clear pattern. The effect To conclude.947*.854* these evaluation systems as a positive development. them during the test.18 nervous. value-added scores and in the other by providing assistance to mance appears to exceed what would occur if teachers only students taking tests under their supervision. . the complexity of these procedures mitigates against school students who are accustomed to having multiple instructors complete accuracy. worse as the demands on these evaluation systems increase. inclusion of this variable makes almost no dif. in science they are . stakes decisions ought to be based on initial. step procedure wherein the system first identifies teachers whose An alternative interpretation of these findings is that students estimated value added falls above (or below) some threshold and naturally do better when their own teacher supervises the exam as then.g. March 2015    83 .363. The more complicated the criteria determining which stu- whether the teacher monitoring the test is one the student knows dents do and do not count. We have examined only the differential effect of taking a for this problem. From a practical standpoint. the estimated coefficients inform personnel decisions about professional development. that. New York State has shown the way with a two- mistake) are harder to apply in other subjects. the rate of linkage failures is likely to increase. It may be that teachers generally provide assistance to stu. 
a state that made a practice of issuing revised “improved” student is a native English speaker. to control for the possibility that the students moni. the most difficult problems to solve Finally. in part. in a second step.711*. Rather. formance. It is too late to monitored by someone else. promotion. and whether the Likewise. . in the one case by testing. While we regard −. rosters students who are not to be used in computing value-added We have tested this hypothesis by adding an indicator for scores. them. we must multiply by the number of students present for improve their own measured performance. of handling measurement error when conducting summary assess- less error and leaving it to the student to identify and fix the ment of educators. amount of instructional time provided by each teacher.19 math instructor (Column 4).078. Likewise. occasionally offered a limited amount of assistance to particu- larly importunate students. and certification. exhibiting a mix of positive and linkages in the calculation of individual teacher value added. it would appear to apply princi. such as race. Given that such systems are only as good as the data ing.266. though in fact the grounds for regarding the revised estimates as an dents during testing.505*. Conclusion We have conducted similar analyses for teachers of reading/ language arts and science. If this is a valid concern. if (as it appears) administrators defend the practice of having teachers monitor their teachers monitoring their own students are more likely to coach own students during testing on just these grounds: that perfor. The introduction of these estimates would appear to be in a poor position to argue that high- controls makes little difference to our findings. tenure. in about half the grades we find a statistically districts to develop educator evaluation systems that. test under the supervision of one’s own math teacher. 
particularly when teachers are asked not sim- and who are apt to be given the test by a homeroom teacher or ply to verify which students they taught but to drop from their someone else they know from other courses and activities. states are installing educator evaluation systems is positive and both statistically and substantively significant. com- for Grades 4 through 8 are −. We have heard teachers and given the precision of the estimate.025. there would seem to be an obvious remedy: mance will suffer because an unfamiliar teacher will make students Someone else should monitor them (or at least be in the room). decides whether further action is warranted opposed to a teacher they do not know. in part. In room. whether they have these students in class or improvement are sometimes highly dubious. and pensation. dropping from their rosters students who will not harm their dents in a testing session. the results suggest that teachers are tak. Other explanations are possible. . not believe these results indicate that math teachers are more Some of the problems we have described can be more easily eager to assist their students on tests. As frac- dents. for example. unrevised estimates. we suspect that corrected than others. It is likely that roster verification procedures can be improved. although the strongly negative coefficient in is. though those we have been available to them and the policies and procedures that govern able to test have not accounted for our findings. rely on student test scores to measure teacher per- This is not definitive proof that the effect is the result of coach. . take significant advantage in favor of students monitored by their into account student performance on standardized tests and instructor. . The problem may grow appear in any of the student’s course records. (“Strangers” are defined as teachers who do not other supervisors to check compliance. a number of (estimates with asterisks are significant at the 5% level). 
not to middle However. districts will provide the state with fine-grained instructional Grade 4 is consistent with the hypothesis that a strange teacher linkage time so that value-added estimates can be weighted by the has a negative impact on the performance of the youngest stu. may be those posed by the practice of revising estimates of teacher tored by their own math teacher differ systematically from those effectiveness as more recent data become available. There is no obvious fix not. pointing out a care.085. for student characteristics. tional linkages become a part of teacher-level value-added esti- ference to the estimated effect of being supervised by one’s own mates. −. Given that teachers typically monitor upwards of 20 stu. Like our analy. it is clear that states need to better address the outlined sis of roster verification. that negative signs. which we expect will be fought out in the courts.) Coefficients on this indicator (reported in Column 5) do New York State.. We do problems in their design and implementation need to be addressed. including home. pally to younger students in elementary school. income. we introduce a variety of controls pretend such estimates don’t exist: The genie is out of the bottle. the harder it is for principals and or is a stranger.teacher.
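As an illustration of the within-teacher comparison behind Table 2, the sketch below regresses a test score on the prior-year score and an indicator for being monitored by one's own math teacher, with teacher fixed effects absorbed by demeaning within teacher. It is a minimal sketch on synthetic data, not the estimation code used for the article. The function names, the three hypothetical teachers, and the built-in coefficients (0.8 on the prior score, 1.2 questions for own-teacher monitoring) are assumptions chosen only so that the recovered estimates can be checked.

```python
from collections import defaultdict


def within_teacher_ols(rows):
    """Estimate (b_prior, b_own) with teacher fixed effects.

    rows: iterable of (teacher_id, score, prior_score, own_teacher_flag).
    """
    groups = defaultdict(list)
    for teacher, score, prior, own in rows:
        groups[teacher].append((score, prior, own))

    # Within transformation: subtracting each teacher's means removes the
    # teacher fixed effect, so students are compared only with other
    # students of the same teacher.
    y, x1, x2 = [], [], []
    for members in groups.values():
        n = len(members)
        mean_score = sum(m[0] for m in members) / n
        mean_prior = sum(m[1] for m in members) / n
        mean_own = sum(m[2] for m in members) / n
        for score, prior, own in members:
            y.append(score - mean_score)
            x1.append(prior - mean_prior)
            x2.append(own - mean_own)

    # Normal equations for two regressors, solved by Cramer's rule.
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b_prior = (s1y * s22 - s12 * s2y) / det
    b_own = (s11 * s2y - s12 * s1y) / det
    return b_prior, b_own


def make_rows():
    """Synthetic, noise-free data (hypothetical values for illustration)."""
    teacher_effect = {"T1": 2.0, "T2": -1.0, "T3": 0.5}
    rows = []
    for k, (teacher, effect) in enumerate(teacher_effect.items()):
        for i in range(10):
            prior = 20 + 3 * i + k
            own = 1 if i % 2 == 0 else 0  # half monitored by own teacher
            score = 0.8 * prior + 1.2 * own + effect
            rows.append((teacher, score, prior, own))
    return rows


if __name__ == "__main__":
    b_prior, b_own = within_teacher_ols(make_rows())
    print(f"prior-score coefficient: {b_prior:.3f}")
    print(f"own-teacher coefficient: {b_own:.3f}")
```

Because the synthetic scores contain no noise, the estimates should match the built-in coefficients. With real data the own-teacher coefficient would carry a standard error, and a further indicator for supervision by a stranger could be added as a third regressor in the same way.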

and then simply follow this rule. New York. taking into account the costs of delaying deci. Center for Educational Leadership and up for having fewer students by teaching more subjects to them. Maryland. Rhode Island Department of Education (2013).).004 students in growth only during the year the teacher has a student. and Walsh (2013). and Tennessee. Kluender et al. See. If teachers more than 62% of the mathematics instructors in the sample. Isenberg. 84    EDUCATIONAL RESEARCHER . we did not find a 8 On this issue see Han. Although the former figure is a fraction of a teacher’s most recent three cohorts. Carolyn Herrington. while the education agencies (Illinois. We operationalized this rule by requiring stu- + Idiosyncratic Error(t). Delaware. Colorado. North Carolina. Louisiana. Rhode offered expert insight on teacher evaluation systems in Race to the Top Island. Teh. and even science scores across her 25 students. More than 10% of fourth-grade math teachers do sions. adding or subtracting 10 days from these figures). In our review of RTTT state testing protocols. (2011) is a qualitative study describing some of the challenges to estab- 4 The reason for the revisions is that Educational Value-Added lishing reliable student-teacher linkages. revising value-added estimates not monitor any of their students during the exam.. dent rather than on whether any teacher claims a student. PVAAS Statewide 1 One might wonder if the teacher with the smaller class makes Core Team for PDE (2013). erees. however. This problem is not limited to the United States.d. which was estimated once already in year t – 1. public schools. and no such stakes were in ner each year (e. and yet again in year t + 1 We have conducted a parallel analysis of scores in reading/ when they reach sixth grade. But these do not appear in a regular man. treating teachers differently (even if “opti. Indeed. 
Any errors explicit reference to roster verification procedures involving teachers for remain the sole responsibility of the authors. The 66. At Special education students were exempt from the claiming pro- a minimum.g. Ohio. We doubt a remaining special education students from the sample. Louisiana. Some states leave these procedures up to local guage arts.” in which a The exact rule is that students count only if they are expected to student’s score at any point in time represents the cumulative impact of be in the class at least 150 days of the school year (75 days for semester- past teachers. Doug Harris. Arizona). as small num- – 1. New York. there is considerable variation. two. Battelle for Kids (2012). and to our knowledge EVAAS has never attempted to tion system to determine compensation. the optimal waiting period will vary by grade taught. and the results are barely affected by their mally”). further revisions could be obtained English language arts. relative. the 407 teachers represent 5 The assumption underlying EVAAS is that a teacher affects 3% of the mathematics instructors in the sample. It is not have an effect on how much students learn in the future (as we believe the case that these discrepancies arise in only a small subset of the sample. Tennessee. Florida. for example. As an additional check. the following equation: SCORE(t) = Grade 5 Value-Added(t) + Grade Since testing occurs in April. this article do not necessarily reflect those of sponsoring agencies or North Carolina. Even in fourth grade. is reestimated in bers of students are near enough these cutpoints to be affected. as well as the District of Columbia. and Gottfried single state that prohibited classroom teachers from administering state (2012). 7 It might be argued that the solution is to determine how long to For example. whether veteran teachers’ licenses would be renewed. the costlier it will be to wait. We found states and Jason Spector for excellent research assistance. 
Pennsylvania. the top and bottom categories of the North Carolina of departmentalized instruction in these grades not reflected in enroll- evaluation system correspond to teachers deemed more effective than ment records. Rhode Island. future scores are the second row of Column 1 are taught by 8. Similar reports ducted on a regular. Braun. more complicated. Georgia.. 16 estimating value added of elementary and middle school teachers. claimed by a teacher. of such students is small. we have excluded these the worse the initial score. The state plans to use the evalua- that course). and her performance measure might average math. Days absent are subtracted in arriving at these totals. Thorn. Results are very similar. using only 407 different teachers.g. school end-of-course exams. District of Columbia. 6 Additional test scores may be available as students complete high among other things. and assessments to their own students provided none of the students was a Watson (2011). but EVAAS reports results as 3-year moving averages. They attribute the discrepancies primarily to the use 2 Indeed. Pennsylvania. Their focus is therefore on which teacher claims a stu- average and less effective than average with 95% confidence. Technology (n. step test of this kind. (2013) investigate discrepancies between unconfirmed and con- Including more scores for the same set of students does not reduce noise firmed rosters in upper elementary grades in the District of Columbia in the same way as averaging across a larger number of students. his or her students. reading/English lan. Georgia. and two anonymous ref. Isenberg a given student are highly correlated across subjects. only New York has adopted a staggered. the teacher at the 25th percentile monitors only 34% of wait for additional data. this is a false hope: et al. The number system of such complexity. thus. and the individuals acknowledged. We would also like to acknowledge the many individuals who New Jersey. 
The views expressed in Colorado. Our results are not sensitive to small changes the fourth grade teacher’s value added for her cohort of students in t (e. Springer. We believe this would be utterly infeasible. The state subsequently backed off this plan. a fifth grader’s test score in year t is represented by length classes). Ohio. 14 is at least sometimes the case.120 different instructors— used only to get a better fix on how much that growth was. EVAAS yields estimates of the first four com. plans were announced to use value-added different. Thus. this obviously requires some forecasting 4 Value-Added(t – 1) + Grade 3 Value-added(t – 2) + District Effect(t) on the part of teachers. Because the scores of We are aware of two other studies of roster verification. 15 combine such information with the results of standardized testing con. 10 second teacher has scores only in a single subject. Tyler (2011). Florida.g. annual basis in core subjects in lower grades when have surfaced about Australia’s Naplan tests (McDougall. and they do 3 To our knowledge. 1% of the student sample used in the analysis. Massachusetts. McCaffrey. with transformational teachers). but this change lies in the future. an Algebra I test taken when the student completes place during the 2012–2013 school year. 9 Notes We have visited state Department of Education websites of all We appreciate helpful comments and suggestions from Henry Race to the Top (RTTT) states: Arizona. 12 year t when that cohort reaches fifth grade. 18 legal challenge. Illinois. 2014). dents to have been in class at least 120 days by the test date (60 days for ponents of this expression when the model is estimated in year t. a In the state we studied. Kentucky. not investigate the performance of unclaimed students. model is required. e. assessments as part of a personnel evaluation system that would determine. 
13 from years t + 2 and t + 3 as these students reach seventh and eighth The 643 students in the first row of Column 1 are assigned to grades. It cess and have already been dropped from the analysis unless mistakenly should also vary based on the initial value-added estimate: for example. Hawaii. Thus semester-length classes).. Hawaii. would be regarded as politically viable or as likely to survive a exclusion. and Kluender. Kentucky. 17 in accordance with it. 11 Assessment System (EVAAS) is based on a “layered model.
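The enrollment rule described in the notes (at least 120 days in class by the test date, or 60 days for semester-length classes, with days absent subtracted) can be written as a small check. This is a sketch of the rule as we read it from the notes. The function name, argument names, and the `slack_days` parameter are hypothetical; the last is included only to mimic the sensitivity check of adding or subtracting 10 days.

```python
def meets_attendance_rule(days_enrolled_by_test, days_absent,
                          semester_length=False, slack_days=0):
    """Return True if a student counts under the operationalized rule.

    Thresholds follow the notes: 120 days by the test date for
    year-long classes, 60 days for semester-length classes.
    slack_days shifts the threshold for sensitivity checks.
    """
    threshold = (60 if semester_length else 120) + slack_days
    days_present = days_enrolled_by_test - days_absent
    return days_present >= threshold


if __name__ == "__main__":
    # A student enrolled 130 days with 5 absences counts (125 >= 120).
    print(meets_attendance_rule(130, 5))
    # The same student fails a threshold raised by 10 days (125 < 130).
    print(meets_attendance_rule(130, 5, slack_days=10))
    # A semester-length class uses the 60-day threshold (65 >= 60).
    print(meets_attendance_rule(70, 5, semester_length=True))
```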

Jacob. B. D. Why are student-teacher RosterVerificationConceptPaper.C. 1–22. Unpublished manuscript. A technical overview of the student growth Instructional Results Information System (KIRIS) (M-1014-EDU). ability and disabil. Cambridge. Principals and teachers banned from coaching http://www.. Author. Teacher effect estimates and decision rules for establishing student-teacher linkages: What are the implications for high-stakes References personnel policies in an urban school district? Statistics. (2000). (2006). Dover. Santa Monica. 118(3). Answers allegedly supplied in 4 to 8 have fractional linkages that will need to be tracked and incorpo. C. W. University of Florida. Controlling for stu. & Levitt.d. R.. 3(2). San Jose Mercury News. (2013). E. Brown.. New USDOE report reveals many Figlio. I. Politics.eride.. & Watson.. Retrieved from http://www. (2003. Journal of Economics.amstat. of high-stakes testing in Chicago public schools. report says. Retrieved from http://www. The Washington Post. D. Philadelphia educators indicted for helping PVAAS teacher specific reporting guide to implementation 2013. ASA statement on using value. (2013).. 19 In a technical report prepared for New York State’s educator evalu.engageny. 89. 2012). Cambridge. This means that more than 328. R.. L.ri. SAS EVAAS for K–12. B.pdf .com/content/dam/SAS/en_us/doc/productbrief/sas- Office of School Improvement. Teacher keys effectiveness system. accountability. (2013). D. L. Springer. (2012). Value-added measures in education/long-island-educators-under-inquiry-for-test-help elementary_school_data_issues. The New York Policy Research Working Paper). (n. Lockwood. Allegations of test help by teachers. W. effort to raise test scores.html naplan-tests/story-fn3o6nna-1226865821078 Center for Educational Leadership and Technology. crime and Education Analysis Policy Archives. MA: National Bureau for Economic Research.pdf Research. (2012). MA: National Bureau for prweb11562558. (2011). 
S511OVjAYVF4Lgqwdy3IsZlA6LOIxrMj0LZ-r66tk Harris. Washington. PVAAS Statewide Core Team for PDE.. Sanders. (2014).. M.html?_r=2& Jacob. students cheat. R. (2013). incentives. (2013).gov/RosterVerification/UserGuide_Teacher. rating. (2011). report: Teachers in 18 classrooms cheated on stu. Madison. Improvement of Educational Assessment. M. Retrieved from http://www.pdf . Quarterly Journal of to Colorado Department of Education).pdf. (1999. Retrieved from https://www. B. B. CA: RAND Corporation. & Wright. ddb2bb6-a35e-11e2-9c03-6952ff305f35_story. The Advertiser.. 37–65.prweb. & Strayer. A user’s guide for teachers. American Institutes for Research. Kentucky Noguchi. 843–877. states struggle to link individual student test data to the proper ity: Gaming the system? (National Bureau for Economic Research teacher. Retrieved from http://www. The validity of gains on the Kentucky Betebenner. (1998).. 8(41). 17% of students with valid test data were not linked Teacher-and-Leader-Effectiveness/Documents/TKES%20Hand to teachers for the required duration of time (American Institutes for book%20FINAL%207-18-2013. New York Times. A1.). NH: National Center for the McCaffrey. MA: Harvard Education Press. & Gottfried.. for educator evaluation technical report: Final. www. American Statistical Association. Retrieved from http://www. RANDA Solutions. (2013). & Getzler. Putting schools to the test: School Chronicle. (2011). 89(5/6). Cullen. Retrieved from http://www school accountability plans on school nutrition... (2014). (2003). Retrieved from McDougall. J. Kluender. Guide%20%20October%202013. (2002). D. 2011-2012 growth model and Policy. Teh. Accountability. Rhode Island Department of Education. F. The myth of the Texas miracle in education. December 8). (2001). E.tsdl. Tinkering toward accolades: School gam. Almost 3. W.gadoe. Educational and Behavioral Statistics. D. rated into the state’s educator evaluation growth model in the future. P.cliu. growth projections/trajectories. 
Dover.mercurynews. Retrieved from http:// download/growth-model-11-12-air-technical-report. DC: Haney. D. wide implementation: Pennsylvania value-added assessment system Dockterman. (2004). Retrieved from http:// Economics. Retrieved from http://www general/White%20Papers/TSDL_KentuckyFinalReport.000 dropouts miscounted. (2008). www. tions in the context of classroom-level performance measures. accountability. B. & (2005). Retrieved from http:// Georgia Department of Education. & Winicki. S. A primer on student growth percentiles.washingtonpost. 29(1). Koretz. San Jose teacher helped second-graders cheat on final report. Food for thought? The effects of 2012–13. Journal of Public .com/2013/04/12/ mathematica-mpr. (2014). (2014). and incentives: The impact Ballou. Testing. J. CA: Rand. S. Houston Deere.asu. ation growth model. Times.pdf March 2015    85 . & Hamilton. Retrieved from http://time. D. & Walsh. Z. Journal of Public dent background in value-added assessment of teachers. (2005).com/93196/phil. Roster verification Figlio.htm Economic Research. Rotten apples: An investigation of the Battelle for Kids. McCaffrey.nytimes. & Reback. added models for educational assessment.adelaid- revive-allegations-of-cheating-in-dc-public-schools/2013/04/12/9 enow. D. J.pdf script.. F. Retrieved from www. A. N. D. D.. Retrieved from http:// Isenberg. Han. Naplan tests. W.tsdl. W.. SY2013-14 state- Texas A&M University. helped-second-graders-cheat-star ing under a performance accountability system (NBER Working Paper Peabody. June 14).pdf?token=Jn epaa. State may lower HISD #12286). SAS Institute. 381–394. & Barron. Cambridge. D. Elementary school data issues: www. Unpublished manu. Thorn.. (2005). D. E. adelophia-educators-charged-with-helping-students-cheat/ Centricity/Domain/21/SY1314%20Implementation%20 Figlio. M. dents’ high-stakes tests in 2012. linkages important? An introduction to data quality concerns and solu- Betebenner. E. 
Roster verification using BFKLink (presented prevalence and predictors of teacher cheating. (2012). Evaluating value-added models for teacher Assessment.pdf Implications for research using value-added models (Mathematica Working Paper 9307). S. S. Retrieved from STAR test. percentile methodology: Student growth percentiles and percentile Santa Monica. Teacher and Leader Effectiveness Division.. (2013). D. (2003). 761–796.000 students in Grades Goodnough. (2014). NH: National Center for the Improvement of Educational WI: Center for Educator Compensation Reform.pdf Economics. J. and behavior.

We assume that in the population students from populations with different values of σ. (2011). MATTHEW G. and teacher performance. Milanowski.. MSGPi = TSGPi + (1 / Ni ) Σ j u ij . Education Week. A. is an assistant professor of public tion/evaas.pdf vanderbilt. DC: Author. State implementation of reforms promoted under Revisions received December 10. E. Washington. and February 3.205. teacher B a class of 100. Nashville. 141 Wyatt Center. we find that the probability for A is . Louisiana Department of Education.. education at Vanderbilt University. gram. Authors added measures for evaluation. PhD. His research focuses on incentives.05 for B and solve for σ. Washington. Retrieved from 86    EDUCATIONAL RESEARCHER . the Recovery Act. Curriculum verification and results (CVR) reporting focuses on incentives. 2014 Goertz. TN 37203-5701. United States Department of Education. Department of (2009). (2014). the uij are normally distributed with mean our view the inequities based on class size are disturbing enough. S. 43. is an associate professor of public policy and http://blogs. Accepted February His research Tyler. A.. portal implementation guide. 2014. and respectively. or 4 times greater.Sawchuk. Reisner. AFT’s Weingarten backtracks on using 2015 Appendix: Illustrative Calculations of the of 0 and variance σ2. Then the probability that a teacher is rated Student Growth Component of Georgia’s ineffective is Teacher Effectiveness Measure Teacher A has a class of S. Peabody No. The true ( Prob ( MSGPi < 30 ) = Prob (1 / σ ) √ Ni ( ) ( MSGPi − 40 ) < mean student growth percentile (TSGP) for both teachers is 40. DC: U. director of the National Center on Performance Incentives. Twitter:@eduspringer. 2015 Institute of Education Sciences. matthew. B.html M. tens_retrenchment_on_va. 12 in teaching assignments and test measurement error and Ni is times greater. & Manuscript received July 7. though in these teachers serve. Peabody No. 14. and compensation. SPRINGER. Webber. 
accountability.. TN 37203-5701.g.springer@ ed.edweek. P. If we set this prob- where the uij are student-level errors reflecting good or bad luck ability to . Washington. Retrieved from http://www.. Retrieved from DALE BALLOU. Gutmann.white- house. Troppe. Retrieved from https://www2. in the sense that this is the mean percentile we would observe if (1 / σ ) ( √ Ni ) ( 30 − 40 ) ) the two teachers were assigned students of average ability whose = Prob ( z < (1 / σ ) ( −50 ) ) for teacher A performance was measured without error. The relationship of MSGPi to TSGPi is = Prob ( z < (1 / σ ) ( −100 ) ) for teacher B.01 for B. The measured mean growth percentiles for these teachers are MSGPA and MSGPB. O. we find that the probability for A is .html policy and education at Peabody College of Vanderbilt University and the United States Department of Education. DC: Author. 230 Appleton Place. (2014). dale. accountability. If we set this probability to . Clearly further inequities arise if two teachers draw number of tested students. Race to the top pro.122. (2014). Race to the top.
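The appendix calculation can be checked numerically. The sketch below assumes, as in the appendix, a true mean growth percentile of 40, an "ineffective" cutoff of 30, and student-level errors that are normal with mean 0 and variance σ², so that the measured mean for a class of N students is normal with variance σ²/N. The class size of 25 for teacher A is inferred from the −50 term in the appendix expression (√25 × (30 − 40) = −50); teacher B's class of 100 is stated in the text. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def prob_rated_ineffective(n_students, sigma, true_mean=40.0, cutoff=30.0):
    """P(measured mean growth percentile < cutoff) for a class of n_students."""
    z = (n_students ** 0.5) * (cutoff - true_mean) / sigma
    return STANDARD_NORMAL.cdf(z)


def sigma_for_target(n_students, target_prob, true_mean=40.0, cutoff=30.0):
    """Student-level sigma at which a class of n_students hits target_prob."""
    z_target = STANDARD_NORMAL.inv_cdf(target_prob)
    return (n_students ** 0.5) * (cutoff - true_mean) / z_target


if __name__ == "__main__":
    for target in (0.05, 0.01):
        sigma = sigma_for_target(100, target)      # calibrate on teacher B
        p_a = prob_rated_ineffective(25, sigma)    # apply to teacher A
        print(f"P(B) = {target:.2f}: sigma = {sigma:.1f}, "
              f"P(A) = {p_a:.3f}, ratio = {p_a / target:.1f}")
```

Under these assumptions the computed probabilities for teacher A agree with the .205 and .122 reported in the appendix, roughly 4 and 12 times the corresponding probabilities for teacher B.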