
# Building Predictive Models for NYC Public High Schools

not necessarily have enough resources for all of the students that it is responsible for. I was also able to collect the average salary for teachers in a given school. My intention was to use this measure as a proxy for the quality of teachers in a given school. A teacher's salary in New York is determined by how much training they have received and how many years of experience they have. I decided to operate under the assumption that a school with a higher average teacher salary has higher-quality teachers.

Drawing 1: NYC School Districts

Drawing 2: Heatmap of Demographic Data by District

Lastly, I collected a data set from the yearly school survey that is administered to parents, teachers, and students. From this survey, the NYC Department of Education is able to extract scores for safety and respect, communication, engagement, and academic expectations. Additionally, it contained information on the extracurricular offerings of a school.

Each school in the above data sets was given a unique identifier called a "DBN". This was extremely useful for two reasons. Firstly, it allowed me to use the Pandas "join" function to easily combine all of my data into a single data frame, without too much extraneous data cleaning. Secondly, the DBN allowed me to extract the district and borough for a given school. For example, Bronx Leadership Academy High School's DBN is 09X525. The first two digits (09) signify that this school is located in district 9 (there are 32 school districts within New York City). The third character (X) corresponds to the Bronx (the other letter/borough pairs are M/Manhattan, Q/Queens, R/Staten Island, and K/Brooklyn).
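The DBN decoding described above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the project: the function name `parse_dbn` and the dictionary `BOROUGH_BY_LETTER` are hypothetical, and the sketch assumes every DBN starts with two district digits followed by one borough letter, as in the example.

```python
# Borough letters as listed in the text; names here are illustrative only.
BOROUGH_BY_LETTER = {
    "X": "Bronx",
    "M": "Manhattan",
    "Q": "Queens",
    "R": "Staten Island",
    "K": "Brooklyn",
}


def parse_dbn(dbn):
    """Split a DBN such as '09X525' into (district, borough).

    Assumes two district digits followed by one borough letter;
    the remaining characters are ignored.
    """
    dbn = dbn.strip().upper()
    if len(dbn) < 3 or not dbn[:2].isdigit():
        raise ValueError(f"unexpected DBN format: {dbn!r}")
    district = int(dbn[:2])
    letter = dbn[2]
    if letter not in BOROUGH_BY_LETTER:
        raise ValueError(f"unknown borough letter {letter!r} in {dbn!r}")
    return district, BOROUGH_BY_LETTER[letter]


print(parse_dbn("09X525"))  # (9, 'Bronx')
```

With a Pandas data frame, a function like this could be applied to the identifier column, for example with `Series.map`, to derive district and borough columns after the data sets have been joined on the DBN.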

## Methodology

First, I had to narrow down my data set from the 1,700+ schools to the 402 high schools in the NYC school system. Fifty of those schools did not report graduation rates and APM measures. This is a result of a regulatory requirement that prevents a school from releasing this information when there are 20 or fewer graduates (generally the smaller schools in the system). After removing those schools with missing data, I employed a randomized 70/30 split to create a training set and a testing set.

For both of my models, I was attempting to predict a continuous value (graduation rate and APM, respectively). As such, I decided to use scikit-learn's ridge regression algorithm. I began with a "kitchen sink" approach and threw all of my variables into the model. I then removed variables one by one until I could isolate the factors that most influenced graduation rate and APM. Variables were selected for removal when their p-values indicated a lack of statistical significance and their absence from my model did not substantially detract from my model's accuracy. The accuracy of my model was determined using both the R-squared and mean squared error (MSE).

## Results: Graduation Rate
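As a rough illustration of the evaluation procedure from the Methodology section, the sketch below performs a randomized 70/30 split, fits a ridge regression, and reports R-squared and MSE on the held-out set. It is not the project's code. It uses only the standard library and a single predictor, so the hand-written `fit_ridge_1d` stands in for scikit-learn's ridge regression (for one feature with an unpenalized intercept it should correspond to the same objective). All function names, the synthetic data, and the `alpha` value are assumptions made for the example.

```python
import random


def random_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_ridge_1d(xs, ys, alpha=1.0):
    """Fit y ~ w*x + b, penalizing alpha * w**2 (intercept not penalized)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    b = y_mean - w * x_mean
    return w, b


def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic, made-up data: one predictor and a noisy linear response.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    y = 50.0 + 4.0 * x + rng.gauss(0.0, 2.0)
    rows.append((x, y))

train, test = random_split(rows, test_fraction=0.3, seed=1)
w, b = fit_ridge_1d([x for x, _ in train], [y for _, y in train], alpha=1.0)

# Score the model on the held-out 30% only.
y_test = [y for _, y in test]
y_hat = [w * x + b for x, _ in test]
test_mse = mse(y_test, y_hat)
test_r2 = r_squared(y_test, y_hat)
print(f"slope={w:.2f} intercept={b:.2f} R^2={test_r2:.3f} MSE={test_mse:.2f}")
```

In practice, scikit-learn provides the same pieces as `train_test_split`, `Ridge`, `r2_score`, and `mean_squared_error`, which handle many predictors at once. Scoring on the held-out set rather than the training set is what makes the two metrics a check on how well the model generalizes.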