Table of Contents
INTRODUCTION
INTEREST
RESEARCH QUESTION
METHODOLOGY
DATA SETS
DEPENDENT VARIABLE: LENGTH OF SUBWAY DELAY IN MINUTES
INDEPENDENT VARIABLES
Passenger Related Issues and Technical Related Issues
Weekday
Peak Hours
Holiday
Snow on Ground
Temperature
Rainfall
REGRESSION MODEL
FINDINGS
LIMITATIONS
SNOW ON GROUND DATA
LIMITED INFORMATION FROM THE DATA COLLECTION PROCESS
OMITTED VARIABLES
FUTURE STUDIES
APPENDIX
APPENDIX A: PROPOSED RANDOMIZATION MODEL FOR FUTURE STUDY OF ATC SYSTEM
REFERENCES
INTRODUCTION
Interest
Our interest in this subject comes mainly from two sources. First, we have repeatedly
experienced service delays on the TTC subway because we commute by subway every day.
Unexpected, lengthy delays at Toronto’s subway stations frustrate passengers who need to reach
a destination by a set time. Fellow students who travel to campus by subway might be late to
class, hand in assignments late, or even miss an exam as a result of a delay. According to our
data set, there were almost 805 hours of delay in 2018 (“TTC Subway Delay Data”, 2018). On
average, there are 1,477 minutes of delay each year at St. George station alone, which concerns
us as U of T students. Therefore, analyzing the major variables hypothesized to relate to TTC
subway delays will help passengers better plan and estimate their commuting time when taking
the subway. The variables selected for analysis in this project derive from our personal
experiences of TTC delays. For example, we have generally experienced more frequent subway
Another motivation for this project is that analyzing the variables correlated with subway
delays can help the TTC better anticipate and manage possible delays in subway stations. The
TTC can take more precautions for passengers’ safety before an anticipated subway delay to
prevent injuries and accidents caused by a large volume of commuters. Moreover, stations can
take preventive measures to reduce the likelihood of subway delays or to shorten their
duration.
Research Question
The main purpose of our project is to analyze the variables that are correlated with TTC
subway delays from 2014 to 2018. The research question of this project is: What factors
METHODOLOGY
Data Sets
Information on TTC subway delays from 2014 to 2018 was retrieved from the monthly data sets
provided on the City of Toronto Open Data Catalogue website (“TTC Subway Delay Data”, 2018).
The data frames from 2014 to 2018 were combined into a single data frame covering the five
years of subway delays, named “gta”. This data set contains a total of 102,682 observations of
TTC subway service. For each observation, the following information is recorded:
● The date, time and day of the week of the observation (code “Date”, “Time” and
“Day.x”)
● The delayed station, the delayed subway line, and the direction of the subway
service (code “Station”, “Line” and “Bound”)
● The service delay in minutes and the length between trains (code “Min Delay” and
“Min Gap”)
● The reasons for delay (code “Code”)
The subway system in Toronto is subject to delays and service stoppages due to the cold winter
climate and severe weather conditions, such as heavy snowfall, wind chill and freezing rain.
Therefore, the impact of weather conditions on TTC service is analyzed in this project using the
temperature, the amount of rainfall and the depth of snow on the ground in centimetres in the
GTA. The weather data frame was obtained from the Government of Canada website. It was read
into R and named “w”. It contains a total of 1,826 observations that record the following:
● Date of observation (code “Date”)
● Daily temperature (code “max temp” and “min temp”)
● Total snowfall (code “Total Snow (cm)”),
● Total rainfall (code “Total Rain (mm)”), and
● The depth of snow on ground in centimetres (code “Snow on Grnd”)
The “Date” variables in the weather data frame “w” and the subway delay data frame “gta”
are compatible. Therefore, the two data frames were merged on “Date” to generate a new
combined data frame.
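The merge on “Date” described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the project; the rows are made up, and treating the merge as a left join that keeps every delay record is an assumption, since the text does not state the join type.

```python
# Toy rows standing in for the "gta" (delay) and "w" (weather) data frames.
# All values are made up for illustration.
delays = [
    {"Date": "2018-01-11", "Time": "08:15", "Min Delay": 5},
    {"Date": "2018-01-11", "Time": "17:40", "Min Delay": 3},
    {"Date": "2018-01-12", "Time": "09:05", "Min Delay": 7},
]
weather = [
    {"Date": "2018-01-11", "Total Rain (mm)": 2.4, "Snow on Grnd": 0},
    {"Date": "2018-01-12", "Total Rain (mm)": 0.0, "Snow on Grnd": 4},
]


def merge_on_date(left, right):
    """Left-join two lists of dicts on their shared "Date" key (assumed join type)."""
    by_date = {row["Date"]: row for row in right}
    merged = []
    for row in left:
        match = by_date.get(row["Date"], {})
        # Keys from the left (delay) row take precedence over the weather row.
        merged.append({**match, **row})
    return merged


merged = merge_on_date(delays, weather)
print(merged[0])
```

Each delay record ends up carrying the weather values for its day, so several delays on the same date share one weather row.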
Dependent Variable: Length of Subway Delay in Minutes
Since our research question asks how different variables affect the length of subway delays,
the dependent variable in the regression model is the length of the TTC subway delay. The
‘MinDelay’ variable is obtained from the TTC subway data set, which is a
we found that there are no delays between 1 and 2 minutes. However, there are plenty of
0-minute delays. We assumed that delays under 3 minutes are not counted as delays by the data
collector. Therefore, we transformed all 0-minute-delay observations into 1.5-minute-delay
observations, 1.5 minutes being the midpoint of the 0-to-3-minute range, to better reflect the
actual delays.
A complete data set with both delay and non-delay observations is needed in order to estimate
the impact of the independent variables on subway delays. However, the original data set only
contains observations in which a delay occurred. As a result, we grouped the MinDelay
observations by date and hour and filled the missing date-hour combinations with 0 to generate
observations where no delay occurred. In this way, we will only be analyzing the delay of the
whole subway system because the original ‘station’ observations are violated by the grouping of
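The two preparation steps described in this section (recoding 0-minute delays to 1.5 minutes, then building a complete date-hour grid with zeros where no delay was recorded) can be sketched as follows. This is an illustrative Python sketch, not the project's R code. Summing the delays that fall in the same hour is an assumption, because the text does not say how delays within one hour were combined, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recode_zero_delays(minutes):
    """Replace a recorded 0-minute delay with 1.5 minutes, as described in the text."""
    return 1.5 if minutes == 0 else minutes


def hourly_delay_grid(records, start, end):
    """Total delay minutes per date-hour, with 0 for hours that have no record.

    `records` is a list of (datetime, minutes) pairs. `start` and `end` are
    datetimes on the hour, with `end` exclusive. Summing within an hour is an
    assumption made for this sketch.
    """
    totals = defaultdict(float)
    for when, minutes in records:
        hour_start = when.replace(minute=0, second=0, microsecond=0)
        totals[hour_start] += recode_zero_delays(minutes)

    grid = {}
    current = start
    while current < end:
        grid[current] = totals.get(current, 0.0)  # no record means no delay
        current += timedelta(hours=1)
    return grid


# Made-up delay records for one morning.
records = [
    (datetime(2018, 1, 11, 8, 15), 5),
    (datetime(2018, 1, 11, 8, 50), 0),  # recorded as 0, recoded to 1.5
    (datetime(2018, 1, 11, 10, 5), 4),
]
grid = hourly_delay_grid(records, datetime(2018, 1, 11, 8), datetime(2018, 1, 11, 11))
for hour, total in grid.items():
    print(hour, total)
```

Because the grid is keyed only by date and hour, station-level detail is no longer available after this step, which matches the system-wide scope stated above.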
Independent Variables
Passenger Related Issues and Technical Related Issues
The original TTC delay data set contains 200 reasons for subway delays. After inspecting the
reasons, we found that the majority of the delay observations can be attributed to either
passenger-related or technical reasons. Thus, two dummy variables were created, “ispissue” and
“istissue”, in which “1” indicates that the delay falls into that category. There are 19 reasons
defined as passenger related, such as injured passengers. Another 91 reasons are considered
technical, such as a train control shutdown. The remaining uncategorized reasons are treated as
trivial to the regression analysis because they occur in only a small number of records.
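The construction of the two dummy variables can be sketched as a lookup against two sets of delay codes. This is a Python illustration only; the codes below are hypothetical placeholders, since the 19 passenger-related and 91 technical codes used in the project are not listed in the text.

```python
# Hypothetical placeholder codes. The real data set uses TTC delay codes,
# and the authors' actual code lists are not given in the text.
PASSENGER_CODES = {"P01", "P02"}
TECHNICAL_CODES = {"T01", "T02", "T03"}


def issue_dummies(code):
    """Return (ispissue, istissue) for one delay code.

    Codes in neither set are the uncategorized reasons and give (0, 0).
    """
    return int(code in PASSENGER_CODES), int(code in TECHNICAL_CODES)


for code in ["P01", "T03", "X99"]:
    print(code, issue_dummies(code))
```

With this coding, the uncategorized reasons act as the reference group for both dummies.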
Weekday
We expect more commuters on weekdays than on weekends, which suggests that TTC subways will
likely experience more delays on weekdays. Therefore, “isweekday” was created as a dummy
variable using the mutate function. Weekday observations are coded as “1” while weekend
observations are coded as “0”.
Peak Hours
Whether an observation falls within peak hours is included as an independent variable that
may be correlated with subway service delays. As indicated by the TTC, morning peak hours are
from 8am to 9am and evening peak hours are from 3pm to 5pm. Observations that occurred during
these two time periods are coded as “1” for the binary variable “ispeakhrs”. All other
observations are coded as “0”.
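The weekday and peak-hour dummies can be derived from each observation's timestamp. The sketch below is in Python for illustration (the project used mutate in R); treating each peak window as start-inclusive and end-exclusive is an assumption, as the text gives only the clock times.

```python
from datetime import datetime


def is_weekday(when):
    """1 for Monday to Friday, 0 for Saturday and Sunday."""
    return int(when.weekday() < 5)


def is_peak_hours(when):
    """1 if the time falls in the 8am-9am or 3pm-5pm windows stated in the text.

    Each window is treated as start-inclusive and end-exclusive (an assumption).
    """
    return int(8 <= when.hour < 9 or 15 <= when.hour < 17)


thursday_morning = datetime(2018, 1, 11, 8, 30)
saturday_afternoon = datetime(2018, 1, 13, 16, 0)
print(is_weekday(thursday_morning), is_peak_hours(thursday_morning))
print(is_weekday(saturday_afternoon), is_peak_hours(saturday_afternoon))
```

Note that the two dummies are independent of each other in this sketch, so a Saturday afternoon observation is still flagged as peak hours.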
Holiday
According to the TTC, there are normally extended service hours on holidays and free rides
on New Year’s Eve and New Year’s Day (Toronto Transit Commission, 2018). Therefore,
“holiday” is expected to be positively correlated with subway service delays. The holiday
variable is based on the dates of the 10 statutory and designated holidays each year in Ontario
(“Statutory Holidays”, 2018). “isholiday” was created as a dummy variable, in which
observations are coded as “1” if they fall on a holiday and “0” otherwise.
Snow on Ground
Given that TTC operations are subject to Toronto’s disruptive winter weather, the amount
continuous variable that is generated directly from the weather data set. Each daily observation is
Temperature
subway delays from snow and cold temperatures. “mean.temp” is a continuous variable measured
in degrees Celsius. The original daily temperature records are obtained directly from the weather
Rainfall
The amount of rain is included to determine the impact of rainfall and flooding on subway
delays. The daily rainfall observations are obtained from the weather data and converted to
hourly observations by dividing the daily amount by 24. “mean.rain” is a
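The daily-to-hourly conversion described above amounts to spreading each daily total evenly across 24 hours. A minimal Python sketch follows; the function name and the sample value are illustrative, not taken from the project.

```python
from math import isclose


def daily_to_hourly(daily_total, hours_per_day=24):
    """Spread one daily total evenly over the hours of that day."""
    return [daily_total / hours_per_day] * hours_per_day


hourly_rain = daily_to_hourly(12.0)  # made-up daily total of 12.0 mm
print(hourly_rain[0], len(hourly_rain))
```

Every hour of a given day receives the same value, so this variable only varies between days, never within a day.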
REGRESSION MODEL
The accuracy of the regression model depends on whether the data sets satisfy its underlying
assumptions. This section discusses four assumptions of the regression model. First, the model
satisfies the assumption that the error terms have a conditional mean equal to zero
(Distelhorst, 2019), since we obtained the entire population of the TTC delay data from the
Open Data Catalogue. However, the regression model does not fully satisfy the second assumption
of IID sampling, because observations in the weather data are close to each other in time and
therefore not entirely independent of one another.
The third assumption states that there should be no large outliers (Distelhorst, 2019). The
scatterplot of the dependent variable ‘MinDelay’ indicates that there are several outliers
with extremely large values that deviate from the majority of the data points (Graph 1). After
inspection, large outliers in subway delays are mostly due to random incidents that are not expected
caused by a homicide inside the Kennedy Station (Herhalt, 2018). As a result, the outliers were
filtered out, and a new data set was obtained containing only observations with less than 150
minutes of delay (Graph 2). The removal of large outliers improved the estimates without
significantly reducing the sample size. The dataset still contains 102,682 observations without the presence of
Lastly, multiple regression assumes that no regressor X is a perfect linear function of the
other regressors (Distelhorst, 2019). Using a heatmap (Graph 3) and a correlation matrix
(Graph 4), we found that most of the independent variables in the model show little linear
correlation with each other. However, two of the independent variables, snow on ground in
centimetres and daily mean temperature, have a stronger negative correlation (r = -0.498) than
the other pairs. This may introduce a weak collinearity issue in our model. However, it is not
a perfect linear relationship,
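The pairwise correlations reported in a correlation matrix are Pearson coefficients, which can be computed as sketched below. The sketch is in Python for illustration and uses made-up temperature and snow values, not the project's data, so it does not reproduce the reported r = -0.498.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: colder days tend to have more snow on the ground.
mean_temp = [-10.0, -5.0, 0.0, 5.0, 10.0]
snow_on_ground = [12.0, 9.0, 3.0, 1.0, 0.0]
r = pearson_r(mean_temp, snow_on_ground)
print(round(r, 3))
```

A value of exactly -1 or 1 would indicate perfect collinearity between two regressors; values strictly between those limits, like the one reported above, do not violate the assumption.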
FINDINGS
We ran a multiple regression using the lm_robust function in RStudio with all of
An error-bar graph (Graph 5) was created as a visual aid to help us observe the correlations
between the independent variables and the dependent variable. According to the error-bar graph,
the majority of the independent variables are positively correlated with the dependent variable,
and passenger-related issues have the largest effect on the length of TTC delays among all
independent variables. This is potentially because, compared with technical issues that are
relatively easy to solve, issues caused by passengers tend to involve greater uncertainty
and require longer investigations by TTC employees. Surprisingly, the holiday dummy variable is
negatively related to the length of delay, which is counter-intuitive. The negative coefficient
means that from 2014 to 2018, subway delays on statutory holidays were shorter than those on
regular days. Our interpretation is that Toronto residents tend to take trains or cars during
holidays to travel to remote destinations where subways are not available. As a result, there
may be fewer passengers taking the subway on statutory holidays, and the reduced flow of
passengers will decrease the likelihood of delay caused by
Another interesting observation is that the effect of daily average temperature on the length
of delay disappeared once it was included in the multiple regression model, although a negative
correlation between mean temperature and delay was observed in a bivariate regression
(i.e., confidence interval = (-0.02407408, -0.006601763)). Because the average snowfall per
hour and the mean temperature have a relatively strong negative correlation, the average
snowfall may attenuate the effect of the mean temperature in the multiple regression model. As
a consequence, the change in the effect of the average temperature implies that
Graph 5: Error Bar Graph
A balance table (Graph 6) was also produced to report the numerical estimate and the
confidence interval for each independent variable, which enables us to analyze the results
quantitatively and precisely. The intercept refers to the average estimated delay of the entire
TTC subway system predicted by issues other than passenger-related and technical issues (i.e.,
reference variables). The average delay is around 2 minutes when all independent variables are
equal to zero. In addition to the intercept, the estimate for each independent variable provides
predictive information about the effects of average snowfall per hour, daily average
temperature, total rainfall, peak hours, weekday, holiday, passenger-related issues and
technical issues on the length of delay per hour systemwide. For example, the value of the
average snowfall coefficient means that, on average, each additional centimetre of snow on
ground is associated with 3 more minutes of delay per hour across the entire TTC subway system.
The positive effect of snowfall on delays is likely because there are 12 above-ground subway
stations (e.g. Warden, Victoria Park), which are exposed to snow at the surface. Moreover, many
subway stations are connected to TTC surface transit that is affected by snowfall. A delay in
subway service might require passengers to take connecting streetcars as an alternative, which
have lower capacities than the subway. This can lead to more passengers crowding into the
subway stations and filling them to capacity. Additionally, the coefficients of the peak-hours
and weekday dummies indicate that, holding other variables constant, there are more delays
during peak hours and on weekdays. This is commonly observed on the Toronto subway because
there are generally more passengers commuting to school and work during those times. Thus,
trains are more likely to stay longer at each station, which increases the possibility of
encountering passenger-related issues. The majority of the estimates are statistically
significant with a p-value less than 0.001, except for the weekday and mean temperature
variables.
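The coefficient interpretations above correspond to a linear model of the following general form. This is a sketch assembled from the variable names given in the text, not the authors' own notation: the subscript t (one date-hour of system-wide service), the label snow for the snow-on-ground variable, and the ordering of the terms are assumptions. The `aligned` environment requires the amsmath package.

```latex
\begin{equation}
\begin{aligned}
\mathrm{Delay}_{t} = {} & \beta_0 + \beta_1\,\mathrm{ispissue}_{t} + \beta_2\,\mathrm{istissue}_{t}
  + \beta_3\,\mathrm{isweekday}_{t} + \beta_4\,\mathrm{ispeakhrs}_{t} \\
 & + \beta_5\,\mathrm{isholiday}_{t} + \beta_6\,\mathrm{snow}_{t}
  + \beta_7\,\mathrm{mean.temp}_{t} + \beta_8\,\mathrm{mean.rain}_{t} + u_{t}
\end{aligned}
\end{equation}
```

In this notation, the reported intercept of about 2 minutes is the estimate of \beta_0, and the reported snow effect of about 3 minutes per additional centimetre is the estimate of \beta_6, each holding the other regressors fixed.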
Furthermore, the R-squared value of 0.16 indicates that the variables in our model explain
about 16% of the variation in subway delays. This may indicate that the reasons leading to TTC
delays can be very complicated and that other unobserved factors will need to be considered in
future studies. Nevertheless, these findings still allow us to conclude that the independent
variables in this model, taken together, help predict service delays in the TTC subway system.
LIMITATIONS
Snow on Ground Data
meteorological stations located at the Toronto Pearson Int'l Airport, which are geographically
distant from where the subway lines operate. However, the amount of snowfall differs across
regions of the GTA because of its large geographic area. Therefore, a limitation exists in that
the analysis of the relationship between the amount of snow on ground and the length of TTC
delay is not precise, because the weather data were not specific to the locations where the
TTC subway operates. Ideally,
the correlation would be more precise if we were able to find weather data that are collected from
Moreover, the weather data were collected as daily observations. Therefore, each snow on
ground observation is an aggregated amount for the whole day. To fit our analysis, the snow on
ground observations were converted to hourly observations by dividing each daily amount by 24.
Averaging the daily amount in this way assumes that snow falls at a constant rate during each
hour. As a result, the analysis may be less precise because, in reality, the snowfall may have
occurred at any time of day and at a varying rate. The amount of snow on ground can also depend
on the time of the day. It will be a more accurate analysis if the snow on ground observations are
Limited Information from the Data Collection Process
The TTC subway delay data were obtained from the City of Toronto Open Data Catalogue. There is
limited information on the website regarding the data collection process and the detailed
interpretation of the codes in the data set. Therefore, we relied heavily on our own
interpretation while analyzing and wrangling the data. There was no information as to how the
delays were calculated and defined. To illustrate, there were no delay observations between 1
and 3 minutes, from which we assumed that delays are defined as those that are over 3 minutes.
We also classified the reasons for delay into passenger-related and technical issues based on
our own interpretation of the code descriptions, which may not be entirely accurate. Although
we contacted Open Data and the TTC, we did not receive any further information from the steward
of the data collection.
Omitted Variables
The findings may be biased if we leave out any variable that satisfies both of the following
conditions: the variable is correlated with an independent variable, and it has an effect on
the dependent variable (Distelhorst, 2019). In our analysis of subway service delays, an example
of a possible omitted variable is subway construction. Over the years, the city has been
upgrading its subway system to provide more convenient public transit. However, such
infrastructure upgrades and other construction near the subway lines have led to delays in
subway operations. In addition, harsh weather conditions have been shown to slow the progress
of construction (Beattie, 2017). Construction in subway stations may lead to service stoppages,
which increase the volume of riders at nearby stations and cause delays. Because of
insufficient data on subway-area construction from 2014 to 2018, we did not include
‘construction’ as one of the independent variables in the study and were thus unable to analyze
the correlation between it and service delay.
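The two conditions for omitted variable bias can be made concrete with the standard textbook result for a single included regressor. In the sketch below, X stands for an included variable (for example, snow on ground), Z for the omitted variable (for example, construction), and Y for the delay; this notation is introduced here for illustration and is not the authors'. The `align` environment and `\operatorname` require the amsmath package.

```latex
\begin{align}
Y_i &= \beta_0 + \beta_1 X_i + \beta_2 Z_i + u_i \\
\hat{\beta}_1 &\xrightarrow{\;p\;} \beta_1 + \beta_2\,
  \frac{\operatorname{Cov}(X_i, Z_i)}{\operatorname{Var}(X_i)}
\end{align}
```

The second line describes the estimate obtained when Y is regressed on X alone. The bias term vanishes only if Z has no effect on Y (\beta_2 = 0) or Z is uncorrelated with X, which are exactly the two conditions stated above.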
FUTURE STUDIES
Two possible studies, on the causal effects of extreme weather and of the automatic train
control system on TTC subway delays, could be conducted in the future as extensions of our
current study. The regression model developed in this research lays a foundation for the first
study, on whether snowfall causes TTC subway delays, since a correlation between these two
variables has been found. The causal effect can be estimated using a difference-in-means
method. However, selection bias is hard to avoid because snow days are normally clustered in
the winter season and treatment groups are not randomized.
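A difference-in-means estimate compares the average delay on snow days with the average delay on days without snow. The Python sketch below uses made-up delay values purely to show the calculation; it says nothing about the size of the effect in the actual data.

```python
from statistics import mean


def difference_in_means(treated, control):
    """Average outcome in the treated group minus the average in the control group."""
    return mean(treated) - mean(control)


# Made-up hourly delay minutes for illustration only.
snow_day_delays = [6.0, 8.0, 7.0]
no_snow_day_delays = [4.0, 5.0, 3.0]
estimate = difference_in_means(snow_day_delays, no_snow_day_delays)
print(estimate)
```

The estimate has a causal interpretation only if the two groups are otherwise comparable, which is why the seasonal clustering of snow days noted above is a concern.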
“Automatic Train Control” (ATC). According to TTC, the project will allow train speed and gaps
There will also be a real-time centralized train control that has precise train location data. The
automatic subway trains will reduce travel time, and the gap between trains will remain consistent
for every run (TTC, n.d.). Future studies can analyze the impact of the ATC upgrade on subway
possible to achieve a randomization design (Appendix A) if the TTC decides to upgrade trains
through random selection. It would also be valuable to control for the independent variables
examined in this project in order to isolate the impact of the ATC upgrade on the length of
delay.
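Random selection of trains for the upgrade could be sketched as follows. This Python illustration is not the design in Appendix A, whose details are not reproduced in the text; the train identifiers, the number of treated trains and the fixed seed are all hypothetical.

```python
import random


def assign_atc_upgrade(train_ids, n_treated, seed=0):
    """Randomly pick `n_treated` trains for the upgrade; return {train_id: 0 or 1}."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    treated = set(rng.sample(train_ids, n_treated))
    return {train: int(train in treated) for train in train_ids}


# Hypothetical train identifiers.
trains = ["train_%02d" % i for i in range(1, 11)]
assignment = assign_atc_upgrade(trains, n_treated=5)
print(sum(assignment.values()), "of", len(assignment), "trains assigned to the upgrade")
```

Because assignment does not depend on any train characteristic, upgraded and non-upgraded trains should be comparable on average, which is what allows a simple comparison of their delays.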
APPENDIX
REFERENCES
Beattie, S. (December 16th, 2017). “After delays, cost overruns, and tragedy, a subway to
Vaughan is complete.” The Star. Retrieved from:
https://www.thestar.com/news/gta/transportation/2017/12/16/after-delays-cost-overruns-and-tragedy-a-subway-to-vaughan-is-complete.html
City of Toronto Open Data Catalogue (2018). TTC Subway Delay Data 2018 [Data File].
Retrieved from:
https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#917dd033-1fe5-4ba8-04ca-f683eec89761
Distelhorst, G. (2019). Week 4: Omitted Variables [pdf]. Retrieved from:
https://q.utoronto.ca/courses/72778/modules
Government of Canada (2018). Statutory Holidays 2018. Retrieved from:
https://www.tpsgc-pwgsc.gc.ca/remuneration-compensation/services-paye-pay-services/paye-centre-pay/feries-holidays-eng.html
Government of Canada (2018). Daily Data Report for 2018 [Data File]. Retrieved from:
http://climate.weather.gc.ca/climate_data/daily_data_e.html?StationID=51459&timeframe=2&StartYear=1840&EndYear=2019&Day=11&Year=2018&Month=1#
Herhalt, C. (October 20th, 2018). “Man, 25, Stabbed to Death Inside Kennedy Subway Station.”
CP24. Retrieved from:
https://www.cp24.com/news/man-25-stabbed-to-death-inside-kennedy-subway-station-1.4142871
Toronto Transit Commission (n.d.). TTC Automatic Train Control. Retrieved from:
https://www.ttc.ca/About_the_TTC/Projects/Automatic_Train_Control/index.jsp
Toronto Transit Commission (2018). TTC Service During the Christmas Holiday Period.
Retrieved from:
https://www.ttc.ca/News/2018/December/19_12_18NR_holiday_service.jsp