
IRE2004: Data Analytics and Metrics for IR/HR

Centre for Human Resources and Industrial Relations


University of Toronto

Instructor: Professor Greg Distelhorst


Date: April 8th, 2019

Predictors of the Length of TTC Subway Delay


A Multiple Regression Model
Yichen Liu
Karen Wang
Yinting Wang

Table of Contents
INTRODUCTION
  INTEREST
  RESEARCH QUESTION
METHODOLOGY
  DATA SETS
  DEPENDENT VARIABLE: LENGTH OF SUBWAY DELAY IN MINUTES
  INDEPENDENT VARIABLES
    Passenger Related Issues and Technical Related Issues
    Weekday
    Peak Hours
    Holiday
    Snow on Ground
    Temperature
    Rainfall
REGRESSION MODEL
FINDINGS
LIMITATIONS
  SNOW ON GROUND DATA
  LIMITED INFORMATION FROM THE DATA COLLECTION PROCESS
  OMITTED VARIABLES
FUTURE STUDIES
APPENDIX
  APPENDIX A: PROPOSED RANDOMIZATION MODEL FOR FUTURE STUDY OF ATC SYSTEM
REFERENCES
INTRODUCTION

Interest
Our interest in this subject comes mainly from two sources. First, we commute by subway daily and have repeatedly experienced service delays on the TTC. Unexpected major delays on Toronto’s subway can frustrate passengers who need to reach a destination by a set time. Fellow students who travel to campus by subway might be late to class, hand in assignments late, or even miss an exam because of a delay. According to our data set, there were almost 805 hours of delay in 2018 (“TTC Subway Delay Data”, 2018). On average, there are 1,477 minutes of delay each year at St. George station alone, which concerns us as U of T students. Analyzing the major variables that are hypothesized to relate to TTC subway delays will therefore help passengers plan and estimate their commuting time. The variables selected for this project derive from our personal experience of TTC delays. For example, we have generally experienced more frequent subway delays on snow days and during peak hours of subway operation.

Another motivation for this project is that analyzing the variables correlated with subway delays can help the TTC better anticipate and manage possible delays in subway stations. The TTC can take more precautions for passenger safety ahead of an anticipated delay, preventing injuries and accidents caused by a large volume of commuters. Moreover, stations can take preventive measures to reduce the likelihood of subway delays or to shorten them.

Research Question
The main purpose of our project is to analyze the variables that are correlated with TTC subway delays from 2014 to 2018. The research question of this project is: What factors predict TTC subway delays?

METHODOLOGY

Data Sets

Information on TTC subway delays from 2014 to 2018 was retrieved from the monthly data sets provided by the City of Toronto Open Data Catalogue website (“TTC Subway Delay Data”, 2018). The data frames from 2014 to 2018 were combined into a single data frame covering the five years of subway delays, named “gta”. This data set contains a total of 102,682 observations of TTC subway service. Each observation includes the following information:

● The date, time and day of the week of the observation (code “Date”, “Time” and
“Day.x”)
● The delayed station, the delayed subway line, and the direction of the subway
service (code “Station”, “Line” and “Bound”)
● The service delay in minutes and the time gap between trains in minutes (code “Min Delay” and
“Min Gap”)
● The reasons for delay (code “Code”)

The subway system in Toronto is subject to delays and service stoppages caused by the cold winter climate and severe weather conditions, such as heavy snowfall, wind chill and freezing rain. Therefore, this project analyzes the impact of weather conditions on TTC service through the temperature, the amount of rainfall and the depth of snow on the ground in centimetres in the GTA. The weather data frame was obtained from the Government of Canada website, read into R and named “w”. It contains a total of 1,826 observations that record the following information:

● Date of observation (code “Date”)
● Daily temperature (code “max temp” and “min temp”)
● Total snowfall (code “Total Snow (cm)”),
● Total rainfall (code “Total Rain (mm)”), and
● The depth of snow on ground in centimetres (code “Snow on Grnd”)

The “Date” variables in the weather data frame “w” and the subway delay data frame “gta” are compatible. Therefore, the two data frames were merged on “Date” to generate a new data frame named “ttc”.
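The merge was carried out in R. As a minimal illustration of the same idea, the Python sketch below joins each delay record to the weather record with the same date. The rows are invented for illustration and are not taken from the actual data sets; the helper name merge_on_date is hypothetical.

```python
# Hypothetical miniature versions of the "gta" (delays) and "w" (weather) data frames.
gta = [
    {"Date": "2018-01-02", "Time": "08:15", "Station": "ST GEORGE", "Min Delay": 5},
    {"Date": "2018-01-02", "Time": "17:40", "Station": "KENNEDY", "Min Delay": 0},
    {"Date": "2018-01-03", "Time": "09:05", "Station": "WARDEN", "Min Delay": 12},
]
w = [
    {"Date": "2018-01-02", "Total Rain (mm)": 0.0, "Snow on Grnd": 4},
    {"Date": "2018-01-03", "Total Rain (mm)": 2.4, "Snow on Grnd": 3},
]


def merge_on_date(delays, weather):
    """Attach the weather record sharing the same Date to every delay record."""
    weather_by_date = {row["Date"]: row for row in weather}
    merged = []
    for delay in delays:
        daily = weather_by_date.get(delay["Date"], {})  # empty if no weather row exists
        merged.append({**daily, **delay})
    return merged


ttc = merge_on_date(gta, w)
```

Because the weather data are daily, every delay on the same date receives identical weather values.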

Dependent Variable: Length of Subway Delay in Minutes

Since our research question asks how different variables affect the length of subway delays, the dependent variable in the regression model is the length of TTC subway delay. The ‘MinDelay’ variable, obtained from the TTC subway data set, is a continuous variable measured in minutes. By inspecting the distribution of MinDelay observations, we found that there are no delays between 1 and 2 minutes, but plenty of 0-minute delays. We assumed that the data collector does not record delays under 3 minutes as delays. Therefore, we recoded all 0-minute-delay observations as 1.5-minute delays, the midpoint of the 0-to-3-minute range, to better reflect the actual delays.
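The recoding step can be sketched in a few lines of Python (the study itself used R). The function name and the sample values are hypothetical; only the rule of replacing 0 with 1.5 comes from the text.

```python
def recode_zero_delays(delays, replacement=1.5):
    """Replace recorded 0-minute delays with the assumed value of 1.5 minutes."""
    return [replacement if d == 0 else d for d in delays]


min_delay = [0, 3, 7, 0, 12]          # invented example values, in minutes
recoded = recode_zero_delays(min_delay)
```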

Estimating the impact of the independent variables on subway delays requires a complete data set with both delay and non-delay observations. However, the original data set only contains observations in which a delay occurred. As a result, we grouped the MinDelay observations by date and hour and filled in the missing date-hour combinations with 0 to generate observations where no delay occurred. Consequently, we analyze only the delay of the whole subway system, because the original station-level information is lost when grouping by date and hour.
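A rough Python sketch of this completion step follows. It makes two assumptions that the text does not confirm: delays within the same date and hour are summed, and every date has 24 hourly slots. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def hourly_totals(records, dates, hours=range(24)):
    """Sum delay minutes per (date, hour) and fill empty slots with 0.

    records: iterable of (date, hour, delay_minutes) tuples.
    Summing within a slot is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for date, hour, minutes in records:
        totals[(date, hour)] += minutes
    # Build the full date-by-hour grid so that hours without delays appear as 0.
    return {(d, h): totals.get((d, h), 0.0) for d in dates for h in hours}


records = [("2018-01-02", 8, 5.0), ("2018-01-02", 8, 1.5), ("2018-01-02", 17, 12.0)]
grid = hourly_totals(records, dates=["2018-01-02"])
```

The station field no longer appears in the result, which is why the analysis can only describe the system as a whole.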

Independent Variables

Passenger Related Issues and Technical Related Issues

The original TTC delay data set contains 200 reasons for subway delays. After inspecting the reasons, we found that the majority of delay observations can be attributed to either passenger-related or technical reasons. We therefore created two dummy variables, “ispissue” and “istissue”, where “1” indicates that the delay falls into the respective category. There are 19 reasons defined as passenger-related, such as injured passengers. Another 91 reasons are considered technical, such as a train control shutdown. The remaining uncategorized reasons are treated as trivial to the regression analysis because they are recorded only a small number of times.
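The dummy coding can be illustrated with the Python sketch below. The code labels are placeholders invented for this example; the actual lists of 19 passenger-related and 91 technical codes are not reproduced in this report.

```python
# Placeholder code labels (hypothetical, not the real TTC delay codes).
PASSENGER_CODES = {"PAX_INJURY", "PAX_ALARM"}
TECHNICAL_CODES = {"TRAIN_CONTROL", "DOOR_FAULT"}


def reason_dummies(code):
    """Return the two dummy variables for a single delay reason code."""
    return {
        "ispissue": int(code in PASSENGER_CODES),
        "istissue": int(code in TECHNICAL_CODES),
    }


passenger_case = reason_dummies("PAX_INJURY")
technical_case = reason_dummies("DOOR_FAULT")
other_case = reason_dummies("UNCATEGORIZED")
```

Codes in neither set receive 0 on both dummies, so they form the reference category.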

Weekday

We expect more commuters on weekdays than on weekends, which suggests that TTC subways will likely experience more delays on weekdays. Therefore, “isweekday” is created as a dummy variable using the mutate function. Weekday observations are coded as “1” while weekend observations are coded as “0”.

Peak Hours

Whether an observation falls within peak hours is included as an independent variable that we expect to be correlated with subway service delays. As indicated by TTC, morning peak hours are from 8am to 9am and evening peak hours are from 3pm to 5pm. Observations that occurred during these two time periods are coded as “1” for the binary variable “ispeakhrs”. All other observations are coded as “0” to indicate non-peak hours.
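The weekday and peak-hour dummies can be sketched in Python as follows. The sketch assumes that each peak window includes its start time and excludes its end time, which the text does not specify; the function names are hypothetical.

```python
from datetime import datetime


def is_peak_hour(ts):
    """1 if the timestamp falls in 8am-9am or 3pm-5pm (end time excluded), else 0."""
    return int(8 <= ts.hour < 9 or 15 <= ts.hour < 17)


def is_weekday(ts):
    """1 for Monday to Friday, 0 for Saturday and Sunday."""
    return int(ts.weekday() < 5)  # datetime.weekday(): Monday is 0, Sunday is 6


tuesday_morning = datetime(2018, 1, 2, 8, 30)
saturday_noon = datetime(2018, 1, 6, 12, 0)
flags = [(is_weekday(t), is_peak_hour(t)) for t in (tuesday_morning, saturday_noon)]
```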

Holiday

According to TTC, there are normally extended service hours on holidays and free rides on New Year’s Eve and New Year’s Day (Toronto Transit Commission, 2018). Therefore, “holiday” is expected to be positively correlated with subway service delays. The holiday variable is based on the dates of the 10 statutory and designated holidays each year for Ontario (“Statutory Holidays”, 2018). “isholiday” is created as a dummy variable, with observations coded as “1” for holidays and “0” for non-holidays.

Snow on Ground

Given that TTC operations are subject to disruptive winter weather in Toronto, the amount of snow on ground is included as one of the independent variables. “Snowongrd” is a daily continuous variable taken directly from the weather data set. Each daily observation is applied to all hours of the day.

Temperature

“Temperature” is included in the multiple regression in order to separate the impact of snow on subway delays from that of cold temperatures. “mean.temp” is a continuous variable measured in degrees Celsius. The original daily temperature records are obtained directly from the weather data set and divided by 24 in order to obtain an hourly-based record.

Rainfall

The amount of rain is included to determine the impact of rainfall and flooding on subway delays. The daily rainfall observations are obtained from the weather data and mutated into hourly observations by dividing the daily amount by 24. “mean.rain” is a continuous variable measured in millimetres.

REGRESSION MODEL

The accuracy of the regression model depends on whether the data satisfy the underlying assumptions. This section discusses four assumptions of the regression model. First, the model satisfies the assumption that the error terms have a conditional mean of zero (Distelhorst, 2019), since we have obtained the entire population of TTC delay data from the Open Data Catalogue. However, the regression model does not fully satisfy the second assumption of IID sampling, because observations in the weather data that are close to each other in time are not entirely independent of each other.

Graph 1: Scatterplot - Original Data
Graph 2: Scatterplot - New Data

The third assumption states that there should be no large outliers (Distelhorst, 2019). The scatterplot of the dependent variable ‘MinDelay’ shows several extremely large values that deviate from the majority of the data points (Graph 1). On inspection, large outliers in subway delays are mostly due to random incidents that are not expected to be relevant to, or to influence, the normal occurrence of delays. For example, a 515-minute delay was caused by a homicide inside Kennedy Station (Herhalt, 2018). As a result, the outliers were filtered out, and a new data set was obtained that contains only observations with less than 150 minutes of delay (Graph 2). The removal of large outliers improved the estimates without significantly reducing the sample size. The data set still contains 102,682 observations without the presence of outliers, which is large enough for an accurate analysis.
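The filtering rule can be sketched in Python as below. The cutoff of 150 minutes comes from the text; the function name and the sample delays are invented for illustration.

```python
def drop_large_outliers(delays, cutoff=150):
    """Keep only observations with less than `cutoff` minutes of delay."""
    return [d for d in delays if d < cutoff]


delays = [1.5, 4, 12, 149, 150, 515]   # invented values, in minutes
kept = drop_large_outliers(delays)
removed = len(delays) - len(kept)       # how many observations the filter discards
```

Counting the removed observations, as in the last line, is a simple way to check how much the filter reduces the sample.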

Lastly, multiple regression assumes that no regressor X is a linear function of another (Distelhorst, 2019). Using a heatmap (Graph 3) and a correlation matrix (Graph 4), we found that most of the independent variables in the model are not linearly correlated with each other. However, two of the independent variables, snow on ground in centimetres and daily mean temperature, have a stronger negative correlation (r = -0.498) than the other pairs. This may introduce a weak collinearity issue in our model. Because the relationship is not perfectly linear, however, the estimates will not be greatly biased.
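Each cell of a correlation matrix is a Pearson correlation coefficient between two variables. The Python sketch below computes one such coefficient from scratch; the temperature and snow values are invented and will not reproduce the reported r = -0.498.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


mean_temp = [-10.0, -5.0, 0.0, 5.0, 10.0]       # invented daily mean temperatures
snow_on_ground = [12.0, 9.0, 3.0, 1.0, 0.0]     # invented snow depths in cm
r = pearson_r(mean_temp, snow_on_ground)
```

A value of exactly -1 or 1 would indicate a perfect linear relationship between the two regressors.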

Graph 3: Heatmap
Graph 4: Correlation Matrix

FINDINGS

We ran a multiple regression with all of our variables using the lm_robust function in RStudio. Our model is as follows:

Delays = β0 + β1*mean.snow + β2*mean.rain + β3*mean.temp + β4*ispeakhrs +
β5*isweekday + β6*isholiday + β7*ispissue + β8*istissue
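The estimation itself was run with lm_robust in R. As a rough illustration of what fitting such a model computes, the Python sketch below obtains ordinary least squares point estimates by solving the normal equations. It uses only two of the eight regressors and invented data, and it does not compute the robust standard errors that lm_robust reports; the function name ols is hypothetical.

```python
def ols(X, y):
    """Return OLS coefficients [b0, b1, ...] by solving (X'X) b = X'y."""
    rows = [[1.0] + [float(v) for v in x] for x in X]   # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


# Invented data generated exactly as Delays = 2 + 3*mean.snow + 1*ispeakhrs.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (2, 1)]   # (mean.snow, ispeakhrs)
y = [2, 5, 3, 6, 8, 9]
b0, b_snow, b_peak = ols(X, y)
```

Because the example data contain no noise, the fitted coefficients recover the intercept and slopes used to generate them.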

An error-bar graph (Graph 5) was created as a visual aid to help us observe how each independent variable correlates with the dependent variable. According to the error-bar graph, the majority of the independent variables are positively correlated with the dependent variable, and passenger-related issues have the largest effect on the length of TTC delays among all independent variables. This is potentially because, compared with technical issues that are relatively easy to solve, issues caused by passengers tend to involve greater uncertainty and require longer investigations by TTC employees. Surprisingly, the holiday dummy variable is negatively related to the length of delay, which is counter-intuitive. The negative coefficient means that from 2014 to 2018, subway delays on statutory holidays were shorter than those on regular days. Our interpretation is that Toronto residents tend to take trains or cars during holidays to travel to remote destinations where subways are not available. There may therefore be fewer subway passengers on statutory holidays, and the reduced flow of passengers decreases the likelihood of passenger-caused delays.

Another interesting observation is that the effect of daily average temperature on the length of delay disappeared once it was included in the multiple regression model, even though a bivariate regression showed a negative correlation between mean temperature and delay (confidence interval = (-0.02407408, -0.006601763)). Because the average snowfall per hour and the mean temperature have a relatively strong negative correlation, the average snowfall may attenuate the effect of the mean temperature in the multiple regression model. The change in the estimated effect of temperature therefore implies that snowfall and temperature jointly explain the delay.

Graph 5: Error Bar Graph

A balance table (Graph 6) reports the numerical estimate and confidence interval for each independent variable, which enables us to analyze the results quantitatively and precisely. The intercept is the estimated average delay of the entire TTC subway system attributable to issues other than passenger-related and technical issues (i.e., the reference category). The average delay is around 2 minutes when all independent variables are equal to zero. Beyond the intercept, the estimates for each independent variable provide predictive information about the effects of average snowfall per hour, daily average temperatures, total rainfall, peak hours, weekday, holiday, passenger-related issues and technical issues on the length of delay per hour systemwide. For example, the average snowfall coefficient means that, on average, each additional centimetre of snow on ground adds 3 minutes of delay per hour to the entire TTC subway system. The positive effect of snowfall on delays is likely because there are 12 above-ground subway stations (e.g. Warden, Victoria Park), which are exposed to snow on the surface. Moreover, many subway stations connect with TTC surface transit that is affected by snowfall. A delay in subway service might require passengers to take connecting streetcars instead, which have lower capacity than the subway. This can lead to more passengers crowding the subway stations and filling them to capacity. Additionally, the coefficients of the peak hours and weekday dummies indicate that, holding other variables constant, there are more delays during peak hours and on weekdays. This is commonly observed on Toronto subways because more passengers generally commute to school and work during those times. Trains are therefore more likely to stay longer at each station, which increases the possibility of passenger-related issues. The majority of the estimates are statistically significant with a p-value less than 0.001, except for the weekday and mean temperature variables.

Graph 6: Balance Table

Furthermore, the R-squared value of 0.16 shows that the variables in our model explain about 16% of the variation in subway delays. This may indicate that the causes of TTC delays are complex and that other, unobserved factors will need to be considered in future studies. Nevertheless, these findings allow us to conclude that all independent variables in this model are able to predict service delays in the TTC subway system.

LIMITATIONS

Snow on Ground Data


First, the weather data from the Government of Canada website were collected by meteorological stations located at Toronto Pearson Int'l Airport, which is geographically distant from where the subway lines operate. The amount of snowfall differs across regions of the GTA because of its large geographic area. The analysis of the relationship between snow on ground and the length of TTC delay is therefore imprecise, because the weather data are not specific to the locations where the TTC subway operates. Ideally, the correlation would be more precise if we were able to find weather data collected by meteorological stations located along the subway lines.

Moreover, the weather data were collected as daily observations. Each snow on ground observation is therefore an aggregate for the whole day. To fit our analysis, the snow on ground observations were converted to hourly observations by dividing each daily amount by 24. Averaging the daily amount in this way assumes that snow falls in a consistent volume during each hour. This can lead to a less precise analysis because, in reality, the snowfall may have occurred at any time of day in varying volumes. The amount of snow on ground can also depend on the time of day. The analysis would be more accurate if the snow on ground observations had been collected on an hourly basis.

Limited Information from The Data Collection Process

The TTC subway delay data were provided by the City of Toronto Open Data Catalogue. The website offers limited information on the data collection process and on the detailed interpretation of the codes in the data set. Therefore, we relied heavily on our own interpretation while analyzing and wrangling the data. There was no information on how delays were calculated and defined. To illustrate, there were no delay observations between 1 and 3 minutes, from which we assumed that delays are defined as those over 3 minutes. We also classified the reasons for delay into passenger-related and technical issues based on our own interpretation of the code descriptions, which may not be entirely accurate. Although we contacted Open Data and TTC, we did not receive any further information from the steward of the data collection.

Omitted Variables

The findings may be biased if we leave out any variable that satisfies both of the following conditions: it is correlated with an independent variable, and it has an effect on the dependent variable (Distelhorst, 2019). In our analysis of subway service delays, an example of a possible omitted variable is subway construction. Over the years, the city has been upgrading its subway system to provide more convenient public transit. However, such infrastructure upgrades and other construction near the subway lines have led to delays in subway operations. In addition, harsh weather conditions have been shown to slow the progress of construction (Beattie, 2017). Construction in subway stations may lead to service stoppages, which increase the volume of riders in nearby stations and cause delays. Because of insufficient data on subway-area construction from 2014 to 2018, we did not include ‘construction’ as an independent variable in the study and were therefore unable to analyze its correlation with service delay.

FUTURE STUDIES

Two possible studies, on the causal effects of extreme weather and of the automatic train control system on TTC subway delays, could be conducted in the future as extensions of our current study. The regression model developed in this research lays a foundation for the first study, on whether snowfall causes TTC subway delays, since a correlation between these two variables has been found. The causal effect could be estimated with a difference-in-means method. However, selection bias is hard to avoid because snow days are clustered in the winter season and treatment groups are not randomized.

Second, TTC is currently undertaking a system-wide subway upgrade project called “Automatic Train Control” (ATC). According to TTC, the project will allow train speed and the gaps between trains to be controlled automatically instead of by human operators. There will also be real-time centralized train control with precise train location data. The automatic subway trains will reduce travel time, and the gap between trains will remain consistent for every run (TTC, n.d.). Future studies can analyze the impact of the ATC upgrade on subway delays using a difference-in-differences or even a regression discontinuity design. A randomized design (Appendix A) is possible if the TTC decides to upgrade trains through random selection. It would also be valuable to control for the independent variables examined in this project in order to isolate the impact of the ATC upgrade on the length of delay.

APPENDIX

Appendix A: Proposed Randomization Model for Future Study of ATC System

Randomization with Parallel Trends Assumption (No Selection Bias)


The dummy variable: D=1 if the train has ATC system; otherwise D=0
Time: t=1 if after the implementation of the ATC system; t=0 if before the
implementation of ATC system

1. Difference-in-difference estimate of the effect of ATC on TTC subway delay =

   (difference in delay, after minus before, for trains WITH ATC system) -
   (difference in delay, after minus before, for trains WITHOUT ATC system)

2. β3 = α_ATET = (E[Y|D=1,t=1] - E[Y|D=1,t=0]) - (E[Y|D=0,t=1] - E[Y|D=0,t=0])

3. Interaction model: Y_it = β0 + β1*D_i + β2*T_t + β3*(D_i*T_t) + u_it
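As a worked illustration of the estimator, the Python sketch below computes the difference-in-difference from the four group means, where the second bracket is the before/after change for trains without ATC (D=0). The observations are invented; no ATC data were analyzed in this study, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def diff_in_diff(observations):
    """Difference-in-difference from (D, t, Y) tuples.

    D = 1 for trains with ATC, 0 otherwise; t = 1 after implementation, 0 before.
    """
    cell = {
        (d, t): mean([y for dd, tt, y in observations if dd == d and tt == t])
        for d in (0, 1)
        for t in (0, 1)
    }
    treated_change = cell[(1, 1)] - cell[(1, 0)]   # E[Y|D=1,t=1] - E[Y|D=1,t=0]
    control_change = cell[(0, 1)] - cell[(0, 0)]   # E[Y|D=0,t=1] - E[Y|D=0,t=0]
    return treated_change - control_change


# Invented delay minutes: (D, t, Y)
observations = [
    (1, 0, 10.0), (1, 0, 12.0), (1, 1, 6.0), (1, 1, 8.0),
    (0, 0, 9.0), (0, 0, 11.0), (0, 1, 8.0), (0, 1, 10.0),
]
effect = diff_in_diff(observations)
```

In this invented example, delays fall by 4 minutes for trains with ATC and by 1 minute for trains without it, so the estimate attributes a 3-minute reduction to ATC under the parallel trends assumption.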

REFERENCES
Beattie, S. (December 16th, 2017). “After delays, cost overruns, and tragedy, a subway to Vaughan is complete.” The Star. Retrieved from:
https://www.thestar.com/news/gta/transportation/2017/12/16/after-delays-cost-overruns-and-tragedy-a-subway-to-vaughan-is-complete.html
City of Toronto Open Data Catalogue (2018). TTC Subway Delay Data 2018 [Data File]. Retrieved from:
https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#917dd033-1fe5-4ba8-04ca-f683eec89761
Distelhorst, G. (2019). Week 4: Omitted Variables [pdf]. Retrieved from:
https://q.utoronto.ca/courses/72778/modules
Government of Canada (2018). Statutory Holidays 2018. Retrieved from:
https://www.tpsgc-pwgsc.gc.ca/remuneration-compensation/services-paye-pay-services/paye-centre-pay/feries-holidays-eng.html
Government of Canada (2018). Daily Data Report for 2018 [Data File]. Retrieved from:
http://climate.weather.gc.ca/climate_data/daily_data_e.html?StationID=51459&timeframe=2&StartYear=1840&EndYear=2019&Day=11&Year=2018&Month=1#
Herhalt, C. (October 20th, 2018). “Man, 25, Stabbed to Death Inside Kennedy Subway Station.” CP24. Retrieved from:
https://www.cp24.com/news/man-25-stabbed-to-death-inside-kennedy-subway-station-1.4142871
Toronto Transit Commission (n.d.). TTC Automatic Train Control. Retrieved from:
https://www.ttc.ca/About_the_TTC/Projects/Automatic_Train_Control/index.jsp
Toronto Transit Commission (2018). TTC Service During the Christmas Holiday Period. Retrieved from:
https://www.ttc.ca/News/2018/December/19_12_18NR_holiday_service.jsp
