
Building Predictive Models for NYC Public High Schools

Alec Hubel | Introduction to Data Science - Fall

Abstract

The New York City public school system, responsible for the education of over 1 million students, is the largest in the country. Unfortunately, its size only makes it more susceptible to impeding issues. The fact that school budgets are consistently tightening is only worsened by the fact that American students are falling behind their international competition. As a way to monitor the success of a school, the Department of Education tracks two key statistics: high school graduation rates and aspirational performance measures. This study looks to uncover the key drivers of those measures in an attempt to isolate the factors that are most responsible for a successful education in New York City public schools.

Introduction

New York City public schools employ 75,000 teachers across over 1,700 schools. These teachers are responsible for the education of 1.1 million students and represent an overwhelming portion of the $2… million annual budget. A system of this magnitude requires consistent monitoring in order to determine its efficacy. Unfortunately, an evaluation of each and every school, teacher, and student would be a huge draw on already limited resources. Because of this, the Department of Education must rely on certain performance metrics to decide if a school's performance is up to snuff. For high schools, the primary metrics used for this purpose are a school's graduation rate (what percentage of a senior class will successfully graduate in a given year) and its aspirational performance measure (APM). The New York State Department of Education uses the below definition for aspirational performance measures:

• The percent of students in the cohort who earned a Regents diploma with Advanced Designation (i.e., earned 22 units of course credit; passed 7-9 Regents exams at a score of 65 or above; and took advanced course sequences in Career and Technical Education, the arts, or a language other than English); and
• The percent of students in the cohort who graduated with a local, Regents, or Regents with Advanced Designation diploma and earned a score of 75 or greater on their English Regents examination and an 80 or better on a mathematics Regents exam (note: this aspirational measure is referred to as the "ELA/Math APM")

This data point is meant to measure what percent of a graduating class is prepared for college or a post-high school career. For this analysis, I attempted to build 2 predictive models: one for a school's graduation rate, and one for a school's aspirational performance measure.

Data

The New York City Department of Education makes an enormous amount of data available for public use and review. Thanks to this fact, collecting all the data required for my analysis was substantially easier than anticipated. To start, I decided to focus on the 2011-2012 school year. I wanted to keep the data as recent as possible, so that my results would reflect the current status of the school system as closely as possible.

The data that I required was held primarily across 3 separate data-sets. The first data-set contained demographic data, presenting values for the racial composition of schools, what percentage of the student body qualified for free or subsidized school lunches (a common proxy for the income levels of a student population), student-teacher ratios, and the graduation rates and APMs of individual schools.

The second data-set contained budgetary information for each of the schools. From this, I was able to derive the dollars allocated per student. This is a more useful measure of a school's funding than the absolute budget, because a larger school would naturally have a larger budget but may not necessarily have enough resources for all of the students that it is responsible for. I was also able to collect the average salary for teachers in a given school. My intention was to use this measure as a proxy for the quality of teachers in a given school. A teacher's salary in New York is determined by how much training they have received and how many years of experience they have. I decided to operate under the assumption that a school with a higher average teacher salary has higher-quality teachers.
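The per-student funding measure described in this section is a simple ratio. A minimal sketch in pandas follows; the column names, school identifiers, and dollar figures are all hypothetical and are not taken from the actual budget data-set.

```python
import pandas as pd

# Hypothetical budget table: identifiers, column names, and values are
# illustrative only and do not come from the real NYC data.
budget = pd.DataFrame({
    "dbn": ["01M000", "02M000"],
    "total_budget": [5_000_000.0, 12_000_000.0],
    "enrollment": [400, 1500],
})

# Dollars allocated per student: total budget divided by enrollment.
budget["dollars_per_student"] = budget["total_budget"] / budget["enrollment"]

print(budget[["dbn", "total_budget", "dollars_per_student"]])
```

In this made-up example the second school has the larger absolute budget but the smaller allocation per student, which is the situation the per-student measure is meant to expose.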

Drawing 1: NYC School Districts

Drawing 2: Heatmap of Demographic Data by District

Lastly, I collected a data-set from the yearly school survey that is administered to parents, teachers, and students. From this survey, the NYC Department of Education is able to extract scores for safety and respect, communication, engagement, and academic expectations. Additionally, it contained information on the extracurricular offerings of a school.

Each school in the above data-sets was given a unique identifier called a "DBN". This was extremely useful for two reasons. Firstly, it allowed me to use the Pandas "join" function to easily combine all of my data into a single data frame, without too much extraneous data cleaning. Secondly, the DBN allowed me to extract the district and borough for a given school. For example, Bronx Leadership Academy High School's DBN is 09X525. The first two digits, 09, signify that this school is located in district 9 (there are 32 school districts within New York City). The third character, X, corresponds to the Bronx (the other letter/borough pairs are M/Manhattan, Q/Queens, R/Staten Island, and K/Brooklyn).
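The two uses of the DBN described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the helper `parse_dbn`, the column names, the feature values, and the second DBN ("02M000") are hypothetical. Only the DBN layout and the letter/borough pairs come from the text.

```python
import pandas as pd

# Borough letters as listed in the text.
BOROUGHS = {
    "M": "Manhattan",
    "X": "Bronx",
    "Q": "Queens",
    "R": "Staten Island",
    "K": "Brooklyn",
}

def parse_dbn(dbn):
    """Split a DBN such as '09X525' into (district number, borough name)."""
    district = int(dbn[:2])          # first two digits: school district
    borough = BOROUGHS[dbn[2].upper()]  # third character: borough letter
    return district, borough

# Three toy tables indexed by DBN (all values are made up).
demographics = pd.DataFrame(
    {"pct_free_lunch": [0.85, 0.40]}, index=["09X525", "02M000"]
)
budget = pd.DataFrame(
    {"dollars_per_student": [9000.0, 8000.0]}, index=["09X525", "02M000"]
)
survey = pd.DataFrame(
    {"safety_score": [7.1, 8.3]}, index=["09X525", "02M000"]
)

# DataFrame.join aligns rows on the shared DBN index.
combined = demographics.join(budget, how="inner").join(survey, how="inner")

# Derive district and borough columns from the DBN itself.
parsed = [parse_dbn(dbn) for dbn in combined.index]
combined["district"] = [district for district, _ in parsed]
combined["borough"] = [borough for _, borough in parsed]

print(combined)
```

An inner join keeps only schools present in all three tables, which is one reasonable way to avoid rows with missing features; the text does not say which join type was actually used.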

Methodology

First, I had to narrow down my data-set from the 1700+ schools to the 402 high schools in the NYC school system. Fifty of those schools did not report graduation rates and APM measures. This is a result of a regulatory requirement that prevents a school from releasing this information when there are 20 or fewer graduates (generally the smaller schools in the system). After removing those schools with missing data, I employed a randomized 70/30 split to create a training set and a testing set.

For both of my models, I was attempting to predict a continuous value: graduation rate and APM. As such, I decided to use scikit-learn's ridge regression algorithm. I began with a "kitchen sink" approach and threw all of my variables into the model. I then removed variables one by one until I could isolate the factors that most influenced graduation rate and APM. Variables were selected for removal when their p-values indicated a lack of statistical significance and their absence did not substantially detract from the model's accuracy. The accuracy of each model was determined using both the R-squared and the mean squared error (MSE).

Results - Graduation Rate
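As a minimal sketch of the procedure described under Methodology, which produces the R-squared and MSE figures reported in the results sections, the split, fit, and scoring steps might look like the following in scikit-learn. The data here is synthetic, the three features and their coefficients are invented, and the regularization strength `alpha=1.0` is an assumption, since the text does not state what was used. The one-by-one variable removal based on p-values is not shown.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the school-level data (not the real data-set).
rng = np.random.default_rng(0)
n_schools = 350
X = rng.normal(size=(n_schools, 3))  # three made-up standardized features
y = (
    0.65
    - 0.08 * X[:, 0]
    + 0.05 * X[:, 1]
    + rng.normal(scale=0.05, size=n_schools)
)  # made-up "graduation rate" with noise

# Randomized 70/30 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Ridge regression; alpha is an assumed value, not one from the study.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# Score the held-out schools with R-squared and mean squared error.
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
mse = mean_squared_error(y_test, predictions)

print(f"R-squared: {r2:.3f}, MSE: {mse:.5f}")
print("coefficients:", model.coef_)
```

Scoring on the held-out 30% rather than on the training set gives an estimate of how the model would perform on schools it has not seen.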

My final model for graduation rate came in with an R-squared of 0.61 and an MSE of 0.0132. Most of the results were relatively unsurprising. Having a high portion of students receiving subsidized or free school lunches (i.e., a poorer student body) resulted in a lower graduation rate, while having a better-funded school on a per-student basis had the opposite effect. Having a larger portion of the student body represent non-white ethnicities, come from households where English is not the native language, or qualify for special education also put downward pressure on a school's graduation rate.

The results of the survey data were a bit more interesting. Higher academic expectations and safety/respect scores resulted in strong and significant improvements in a school's graduation rate, while the other 2 scores derived from the survey data (engagement and communication) did not have significant impacts. However, a stronger student response rate, regardless of what kind of answers the students gave, seemed to correlate strongly with a school's graduation rate. This may suggest that students who are motivated to perform well in school will most likely complete a school-sponsored survey, in addition to putting forth effort in their classes.

The prevalence of extracurricular activities also impacted graduation rate positively, though not all extracurriculars were created equal. Only sports, academic/tutoring clubs, and theater clubs seemed to have a significant and positive impact on graduation rate.

One unexpected result did emerge from the analysis. The student-teacher ratio of a school had a statistically significant impact with a positive coefficient (i.e., more students per teacher resulted in a higher graduation rate). While it may be a bit of a stretch to suggest a policy where we have fewer teachers in our schools, it may be worthwhile to reconsider the conventional wisdom that hiring more teachers can be a cure-all for struggling schools. Perhaps better teachers would have a greater impact than more teachers.

Results - APM

The final iteration of my model for APM yielded an R-squared of 0.68 and an MSE of 0.0119, incrementally more accurate than my model for graduation rates. Generally speaking, the same drivers that impacted graduation rates impacted a school's aspirational performance measure. Wealthy and safe schools with a high proportion of white students tended to outperform. Extracurricular activities had a strongly positive impact, with music and technology clubs joining the sports and academic clubs as the extracurriculars that had an outsized positive impact.

There was one takeaway from this model that stood out from the crowd: the most predictive variable in this model was the percentage of the student body that was of Asian descent. In fact, when I built this model with that data point as the sole variable, it generated an R-squared of 0.48. While this lends credence to the notion that Asian students tend to outperform on standardized tests, it does not answer the question of why Asian students tend to outperform. A common hypothesis is that Asian students tend to have a stronger work ethic and spend more time studying, naturally yielding better test scores, but I do not have access to data that could confirm or refute that idea.

Caveats and Future Research

There are a few caveats to consider in order to put the results of this analysis in the proper context. Firstly, the magnet school system in New York City throws a bit of a wrench into the data. At the end of middle school, every New York City public school student takes a standardized test. If they perform well enough on that test, they will be admitted into one of the higher-performing or specialty schools in the city (e.g., Stuyvesant, Bronx Science). This system creates two issues. Firstly, there will be self-selection bias. Students with higher natural ability will go to better schools, reinforcing those schools' high standing (particularly in terms of graduation rate and aspirational performance measures). Secondly, many students do not attend schools in their home boroughs or districts. This may be one of the reasons that a school's location was not particularly significant in my models.
More data would have been useful as well. Direct metrics for the natural ability of a student body, non-school activities that take up a student's time outside of the classroom, and the quality of teaching staff were unavailable. Some of this may be alleviated soon: in the 2013-2014 school year, teachers will be evaluated on a continuous scale. These data points may prove useful in future research.

Lastly, the results of this study would have been much more intriguing if the data could be collected on a student level, as opposed to a school level. After all, policy recommendations directed at schools are meant to improve the quality of education for individual students. If the school level could be bypassed, it may be easier to identify ways to help students more directly.