
The relationship between occupancy rate of shelters in Toronto and the number of users, program types and sectors


Assignment 2

Shuo Wang - 1005732049

10-24-2021

Introduction
Homelessness has been a considerable global issue for a long time. It has multi-dimensional
effects on both individuals and the whole community. Homeless people are more likely to encounter health
problems because they have fewer opportunities for personal care and are more likely to be exposed to unsanitary
conditions. (Effects of Homelessness, 2007) The problem of homelessness also brings instability to society
as a whole: the availability of healthcare resources is strained and the crime rate may increase, which poses a
threat to public safety. (Berk & MacDonald, 2010) More importantly, it is a long-term problem affecting
the present as well as the future. (CaringWorks, 2021) Therefore, homeless housing deserves
more attention, and finding proper measures to stabilize the public shelter system is of great significance.
From my perspective, the occupancy rate of shelters is one of the most important indexes of the current
performance of the city homeless housing system. It might easily reveal some essential information about the
shelter housing system in different aspects, such as the rationality of shelters’ location, the redundancy and
deficiency of capacity, etc.
For instance, based on the previous data collection and field investigation, the shelters with an occupancy
rate above 90% might suffer from an overcrowding crisis. (PressProgress, 2019)
Hence, I am interested in finding out the current occupancy rate of shelters in Toronto and investigating
some factors that might be closely linked to it. If some variables significantly influence the occupancy
rate, then, on the one hand, they indicate the current situation of shelters; on the other hand, concrete
measures might be formulated in terms of those variables to adjust and control the occupancy rate of the
shelters and thus improve the shelter service system in the future.
The data set on daily and overnight shelter occupancy in Toronto (Toronto Open Data, 2021) is
employed in this research. We primarily focus on how variables such as the current number
of shelter users, the type of housing program, and the sectors that homeless shelters are categorised
into affect the occupancy rate of the active shelters and overnight service programs in Toronto.
My assumption throughout the following statistical analysis is that the occupancy rate
of shelters has a significant relationship with all of the relevant variables. Specifically, concrete tendencies
should be observable for our predictor variables: an increase in the number of shelter users should
raise the occupancy rate correspondingly. Also, shelters of certain types and sectors might be more likely
to be crowded, while shelters in the other categories are associated with lower occupancy rates.

Data
Data Collection Process
This data set is from the Shelter Support and Housing Administration division’s Shelter Management
Information System (SMIS) database (City of Toronto, 2021), which provides a daily list of
active overnight shelters. Since this data shows the real-time situation of the local public shelter system to
the relevant authorities, it is unaudited and compiled directly from an administrative database.
However, this collection process is also the main limitation of the data. Rather than
accurately reflecting the actual condition in each program, the dataset simply preserves each program’s
records in the database. Therefore, the integrity and promptness of each public shelter program, which
is responsible for updating its own daily data, are essential for the credibility of this data set. Besides, the
data could be biased, since shelters with poor performance might be unwilling to report their
data to the administrative database, which would make the research outcome look better than the actual
situation. If this behaviour is common in the shelter system, our results could be skewed.

Important Variables
A thorough understanding of the variables of interest is a critical and foundational part of data manipulation
and further analysis. It helps identify the abnormal data that are inevitably produced
during the collection procedure; removing such data at the very beginning of our research facilitates the
subsequent data cleaning process. In addition, it determines which statistical and mathematical
tools are suitable for analyzing those variables. For example, a proper regression model is chosen directly based on
the type and properties of the variables included.

Table 1: Introduction to Important Variables


Variable             Type  Feature
OCCUPANCY_RATE_BEDS  num   The proportion of actual bed capacity that is occupied for the reporting date.
SERVICE_USER_COUNT   int   Count of the number of service users staying in an overnight program as of the occupancy time and date. Programs with no service user occupancy will not be included in reporting for that day.
PROGRAM_MODEL        chr   A classification of shelter programs as either Emergency or Transitional.
SECTOR               chr   A means of categorizing homeless shelters based on the gender, age and household size of the service user group(s) served at the shelter location. There are currently five shelter sectors in Toronto: adult men, adult women, mixed adult (co-ed or all gender), youth and family.

• The outcome variable, which represents the occupancy rate by bed of the shelters, is a numeric variable
storing non-negative values, ranging from 0 to 100 in most cases. The value is read as a percentage.
For example, an occupancy rate of 100 means that 100%
of the beds in that shelter are occupied. Note that the occupancy rate is not strictly a
proportion bounded by 100: it could exceed 100 to indicate the overloading of a shelter, which
would need the most urgent remedy.
• SERVICE_USER_COUNT is the only numerical independent variable. It takes discrete non-negative integer
values, which is reasonable since this variable records the number of people who received service
in a particular shelter.
• The remaining two variables, PROGRAM_MODEL and SECTOR, are both categorical, indicating the program type
and the sector of the shelters respectively. The detailed description of the important variables (Toronto
Open Data, 2021) in this research is given above in Table 1.

Data Cleaning
After downloading the data set from the Toronto Open Data Portal and obtaining an understanding of our
target data, several cleaning steps are applied to make the dataset easier to work with:
• Examining the raw data, we noticed that the occupancy rate is calculated by either beds or
rooms. Bed based capacity is for programs with common sleeping areas, while room based capacity
is typically applicable for family programs and hotel programs where sleeping rooms are not shared
by people from different households. (Toronto Open Data, 2021) We found that more than half of
the shelter programs report the occupancy rate by bed. Indeed, recording the occupancy rate by bed
seems to be more flexible, and it covers a variety of shelters that serve different groups of
people.
In addition, the occupancy rate by bed tends to have a direct connection with the number of users
that live in a certain shelter. By contrast, the occupancy rate by room might be affected by other
factors such as the size of the room, so it is less likely to be clearly correlated with the number
of users and the other variables of interest. Hence, choosing the occupancy rate by
bed as our outcome variable fits the purpose of our research.
We also filter out the shelters categorized in the “Families” sector, since the programs whose
occupancy rate is calculated by bed do not include this sector.
• Next, all the programs that do not report the occupancy rate by bed are removed, and we
check that all numerical values are reasonable and that the categorical variables have no missing values,
since non-informative responses have no research value in this case.
• The occupancy rates reported by the remaining shelters all lie within the range [0, 100], and all values
of the other numerical variable, the number of users in each shelter, are non-negative
integers. Thus no further operation is needed on these variables.
• Finally, variables irrelevant to our research are dropped from the sample data. We selected
the occupancy rate calculated by bed as our dependent variable and kept three predictors of interest,
namely the number of users, the type of program, and the sector, to obtain a clean data set
for the following analysis.
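The cleaning steps listed above can be sketched in a few lines. The report's own analysis was done in R; the snippet below is only an illustrative Python sketch using the standard library. The four column names come from Table 1, while the sample rows, the `EXTRA_COLUMN` field and the `clean` helper are hypothetical and stand in for the real export.

```python
# Hypothetical rows mimicking the raw export; all values are made up for illustration.
RAW_ROWS = [
    {"SECTOR": "Men", "PROGRAM_MODEL": "Emergency", "SERVICE_USER_COUNT": "40",
     "OCCUPANCY_RATE_BEDS": "100", "EXTRA_COLUMN": "a"},
    {"SECTOR": "Families", "PROGRAM_MODEL": "Emergency", "SERVICE_USER_COUNT": "120",
     "OCCUPANCY_RATE_BEDS": "", "EXTRA_COLUMN": "b"},
    {"SECTOR": "Women", "PROGRAM_MODEL": "Transitional", "SERVICE_USER_COUNT": "14",
     "OCCUPANCY_RATE_BEDS": "87.5", "EXTRA_COLUMN": "c"},
    {"SECTOR": "Youth", "PROGRAM_MODEL": "", "SERVICE_USER_COUNT": "22",
     "OCCUPANCY_RATE_BEDS": "91.67", "EXTRA_COLUMN": "d"},
]

# The outcome and the three predictors kept for the analysis (names from Table 1).
KEPT_COLUMNS = ("OCCUPANCY_RATE_BEDS", "SERVICE_USER_COUNT", "PROGRAM_MODEL", "SECTOR")


def clean(rows):
    """Apply the cleaning steps described above to a list of dict rows."""
    cleaned = []
    for row in rows:
        # Drop the "Families" sector, which is not covered by bed-based reporting.
        if row.get("SECTOR") == "Families":
            continue
        # Drop rows where the outcome or any predictor is missing.
        if any(not row.get(col) for col in KEPT_COLUMNS):
            continue
        rate = float(row["OCCUPANCY_RATE_BEDS"])
        users = int(row["SERVICE_USER_COUNT"])
        # Sanity check on the numeric values (an assumption of this sketch).
        if rate < 0 or users < 0:
            continue
        # Keep only the variables of interest, converted to numeric types.
        cleaned.append({
            "OCCUPANCY_RATE_BEDS": rate,
            "SERVICE_USER_COUNT": users,
            "PROGRAM_MODEL": row["PROGRAM_MODEL"],
            "SECTOR": row["SECTOR"],
        })
    return cleaned


CLEANED = clean(RAW_ROWS)
print(len(CLEANED), "of", len(RAW_ROWS), "rows kept")
```

In this toy input the "Families" row and the row with a missing program model are dropped, and the two remaining rows keep only the four variables of interest.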

Data Summary
To give a better overview of the shelter data set and to motivate the modeling process that follows,
this section presents numerical summaries and figures with some explanation.

1. Numerical Summaries

Table 2: Numerical Summaries

Variable             N      Mean    Std. Dev.  Min   Pctl. 25  Pctl. 75  Max
OCCUPANCY_RATE_BEDS  26109  92.474  12.396     2.27  89.47     100       100
SERVICE_USER_COUNT   26109  29.221  24.848     1     14        40        234

The summary statistics table above for the two numerical variables of interest summarizes the
overall information on the current occupancy rate by bed and the number of users that receive housing
services in the city.
For the variable representing the occupancy rate, the sample mean is 92.47, which indicates that the average
occupancy rate among all of the shelter programs is above 90%. This suggests that some or all types
of shelters have recently been experiencing crowding (as mentioned in the Introduction section).
We found that the occupancy rate spreads widely, with minimum 2.27 and maximum 100, but the middle 50% of
records have an occupancy rate between 89.47% and 100%. It is worth noting that this interquartile range is
narrow and sits close to the mean, which reflects that shelters with a relatively high rate of occupancy are the
majority group in the current shelter system. Since the 75th percentile equals the maximum, at least a quarter
of the records are at full occupancy.
Additionally, looking at the other numerical variable, which records the number of users in a certain shelter, we
noticed that it varies from 1 to 234 with sample mean 29.22, which indicates that the shelters’ service capacity
varies greatly. This numerical summary raises the question of how the service users are allotted to
different sectors and different types of shelter program. If the allocation is uneven to some extent, does it
contribute to differences in the occupancy rate of the shelters? Let us explore this further
in the following graphical summaries.
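For readers who want to reproduce the columns of Table 2, the sketch below computes the same kind of summary (N, mean, standard deviation, minimum, quartiles, maximum) with Python's standard library. The five occupancy rates are hypothetical, and the `summarize` helper and the choice of the "inclusive" quantile method are assumptions of this sketch; the report's table was produced in R with the vtable package, whose quantile rule may differ.

```python
import statistics

# Hypothetical occupancy rates in percent; not the real data.
RATES = [50.0, 80.0, 90.0, 100.0, 100.0]


def summarize(values):
    """Return the summary statistics reported in Table 2 for one variable."""
    # quantiles(n=4) returns the three cut points: 25th, 50th and 75th percentiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "p25": q1,
        "p75": q3,
        "max": max(values),
    }


SUMMARY = summarize(RATES)
print(SUMMARY)
```

As in the real data, the 75th percentile of this toy sample coincides with the maximum because the top of the distribution is piled up at 100.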

2. Graphical Summaries
Figure.1: The Distribution of Users by Type of Program
[Figure: two panels, Emergency and Transitional; x-axis: Number of users; y-axis: count]

The mean number of users is 31.57 for emergency programs and 22.97 for transitional programs, and
the sample variances for these two types of programs are 713.54 and 307.29 respectively. The variance of
the number of users in emergency shelters is much larger than that in transitional programs, which implies
that user counts in emergency shelters lie farther from their mean and from each other. In
contrast, the variability of the number of users in transitional programs is mild relative to emergency shelters.
From the output above (Fig.1), notice that most programs of both types, emergency and transitional,
serve between 0 and 50 users, though there is a massive discrepancy in the total number of users between these two
program types. The transitional programs contain a smaller number of users, and the range of users they receive is
narrower than that of emergency programs. We could hardly find any transitional programs that contain
users more than. However, some emergency programs serve a larger number of users, up to 200 and
even higher. Transitional programs thus appear less flexible than emergency programs in the number of users
they can accept. Possible reasons are that transitional shelters are commonly small
or that some transitional programs have vacancies.
The allocation of users across the two program types is uneven, which raises the question of whether the
different distributions of the number of users between the two program types result in different occupancy
rates.

Figure.2: Number of Users by Sector
[Figure: pie chart of the number of users (N) by SECTOR: Men, Mixed Adult, Women, Youth]
Next, looking at the other categorical variable, the sector of the shelters (Fig.2), the pie chart shows the user
distribution across the four shelter sectors. The two major sectors are “Men” and “Mixed Adult”, while
the “Women” and “Youth” sectors take up a smaller proportion.
Another motivation for the following analysis is how the different sectors perform
in terms of occupancy rate. Intuitively, if a sector that serves a small number of users still tends to have higher
occupancy rates than the others, this sector might need further adjustment, such as expanding the capacity
of each shelter program or setting up more shelters in this sector.
Figure.3: Occupancy Rate VS Number of Service Users
[Figure: two scatter-plot panels, Emergency and Transitional; x-axis: Number of users; y-axis: Occupancy rate by bed]

As shown in Figure.3 above, there is an evident positive relationship between the number of service
users of shelters and the occupancy rate in both program categories. However, we cannot yet conclude that the
relationship is linear in either program type.
Indeed, the relationship appears closer to logarithmic, which suggests that a
log-transformation might be applied when building the linear regression model in order to better satisfy the
linearity assumption (discussed in the Methods section).
Even though the tendency shown in the scatter plot is similar for both program categories when they serve
fewer than 50 users, the occupancy rate of emergency shelters with more than
100 users is consistently high, which might mean that large emergency programs have a more serious
problem of overcrowding in the current public service system.
To sum up, we have the following motivating questions after examining the variables of interest in our sample data:
• Does any obvious pattern exist between the occupancy rate and the number of users served by a single
shelter?
• How does the occupancy rate behave in “Emergency” and “Transitional” programs, which differ greatly
in the number of users?
• How do occupancy rates vary across shelters in the four different sectors, which have uneven proportions
in the shelter system?

Methods
A linear regression model not only makes our sample data easier to interpret but also generates predictions
based on the predictors. It is a scientific and reliable method for processing large amounts of raw data and
transforming it into practical information that might not otherwise have been uncovered. (IMB, 2021)
In this section, a multiple linear regression model is created to explore the relationship between the dependent
variable of interest and the several independent variables described in the Data section, in order to analyze the
current performance of shelters in terms of the occupancy rate and to predict its likely tendency in
the near future.

Model Assumptions
Several key assumptions should be made before building the multiple linear regression model:
• No multicollinearity: the independent variables should not be strongly correlated with one another.
If there is a strong correlation among predictors, interpreting the
parameter estimates from our model might lead to misleading inference. (Sheather, 2010) To be more specific,
violating this assumption inflates the variance of the parameter estimates, which can mask the actual
relationship between a predictor and the outcome variable and therefore reduce the significance of our
parameter estimates. It directly harms our modeling process and the conclusions drawn from it.
Multicollinearity is often checked with the variance inflation factor (VIF). It measures the severity of
multicollinearity as the ratio of the variance of a coefficient estimate in the full model
to its variance in a model with only that single independent variable included. (Snee, 1981) A common
cutoff for a large VIF is 5. (Sheather, 2010) If the VIF value for a predictor is greater than 5, it could
negatively impact the regression model.
• Linearity: the relationship between the dependent variable and the independent variables should be linear.
This assumption is checked by examining the scatter plots between the dependent variable and each
independent variable. For example, considering the relationship between our outcome variable, the occupancy
rate, and the numerical predictor, the number of users, the scatter plot shows a logarithmic relationship
between them before any transformation is applied. Therefore, a log-transformation of
this predictor is necessary to satisfy the linearity assumption of the regression model.
• Homoscedasticity: the error terms have the same variance across the values of the independent variables.
Because unequal error variances lead to less accurate
significance tests and confidence intervals, the homoscedasticity assumption requires a constant error variance
across all values of the independent variables, which makes the inference on the parameters in our linear
regression model more reliable.
• Normality of the errors: the distribution of residuals of the regression should be normal.
Note that inference is fairly robust to moderate violations of this assumption when the sample size is large
(>30) (Evans & Rosenthal, 2010), because by the Central Limit Theorem the sampling distribution of the
coefficient estimates is approximately normal even when the residuals are not.
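To illustrate the VIF check in the first assumption, the sketch below uses the standard identity VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. With only two predictors, R_j² reduces to their squared Pearson correlation, which keeps the example within the standard library. The data and the helper names `pearson_r` and `vif_two_predictors` are hypothetical; a categorical predictor with several levels, such as SECTOR, would need a generalized VIF that this sketch does not cover.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF shared by both predictors in a model with exactly two predictors."""
    r_squared = pearson_r(xs, ys) ** 2
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictor values: a numeric predictor and a 0/1 dummy.
LOG_USERS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
TRANSITIONAL = [0, 1, 0, 1, 0, 1]

VIF = vif_two_predictors(LOG_USERS, TRANSITIONAL)
print(round(VIF, 3))  # values near 1 indicate weak correlation; above 5 is the cutoff used here
```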

Mathematical Model
The multiple linear regression model is

$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \epsilon$$

$\beta_0$ represents the intercept of the regression line. $\beta_1$ is the coefficient of the log-transformed
numeric independent variable $x_1$ (the number of service users): it is the expected change in the occupancy
rate by bed, i.e. $y$, for each one-unit increase in $\log(x_1)$. For the categorical variable describing the
program model, $x_2$ is the indicator of shelters whose program type is “Transitional”, and $\beta_2$ is the
expected difference in $y$ between a transitional program and an emergency program. In other words, if
$\beta_2 < 0$, then transitional programs are expected to have a lower occupancy rate than emergency programs.
The interpretation for the last categorical variable, the sector, which is represented by $x_3$, $x_4$, $x_5$, is
similar: the parameters in front of them give the expected difference relative to the group selected as the
reference under dummy variable coding.
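The dummy variable coding described above can be made concrete with a small sketch that turns one shelter record into the vector (1, log x1, x2, x3, x4, x5). The reference groups “Emergency” and “Men” are the ones named in the Results section; the assignment of x3, x4, x5 to “Mixed Adult”, “Women” and “Youth” follows the order of Table 3 and is an assumption, as are the natural logarithm and the helper name `design_row`.

```python
import math

PROGRAMS = ("Emergency", "Transitional")          # "Emergency" is the reference group
SECTORS = ("Men", "Mixed Adult", "Women", "Youth")  # "Men" is the reference group


def design_row(users, program, sector):
    """Return [1, log(x1), x2, x3, x4, x5] for one shelter record."""
    if program not in PROGRAMS or sector not in SECTORS:
        raise ValueError("unknown program type or sector")
    x2 = 1 if program == "Transitional" else 0
    # One indicator per non-reference sector; the reference sector gets all zeros.
    x3 = 1 if sector == "Mixed Adult" else 0
    x4 = 1 if sector == "Women" else 0
    x5 = 1 if sector == "Youth" else 0
    return [1.0, math.log(users), x2, x3, x4, x5]


REFERENCE_ROW = design_row(1, "Emergency", "Men")
WOMEN_TRANSITIONAL_ROW = design_row(1, "Transitional", "Women")
print(REFERENCE_ROW)
print(WOMEN_TRANSITIONAL_ROW)
```

A record in both reference groups has all indicators equal to zero, so its expected occupancy rate depends only on the intercept and the log term; every other group is described by its difference from that baseline.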
Before building the multiple linear regression model, the hypotheses are stated as follows. The null hypothesis
is that there is no significant relationship between our dependent variable $y$ and each of the independent
variables; that is, the coefficients $\beta_i$, where $i = 1, 2, 3, 4, 5$, are all equal to zero. The alternative
hypothesis is that there exists a significant relationship between the dependent variable and the predictors
of interest, i.e., $\beta_i \neq 0$, for $i = 1, 2, 3, 4, 5$.

Model Selection
Furthermore, model selection criteria such as p-values and the coefficient of determination are applied in
the data analysis process.
The p-value is the most common measure used to check the significance of the parameters. A small p-value
indicates that an outcome as extreme as the one observed would be very unlikely under the null hypothesis.
Therefore, when the p-value of a parameter is small enough ($p < 0.05$), we reject the null hypothesis and
conclude that there exists a significant relationship between the independent variable and the dependent
variable, i.e., that specific parameter is not equal to zero, $\beta_j \neq 0$. Hence, examining each parameter
$\beta_i$, where $i = 1, 2, 3, 4, 5$, if the p-value is less than 0.05, then this predictor is significant and
should not be eliminated.
Then, the coefficient of determination (Lewis-Beck, 2015), which is a measure of goodness of fit, indicates
how much of the variation in the outcome variable is explained by the independent variables in the regression
model. In other words, $R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}$. The higher the $R^2$
value is, the better our regression model fits the sample data.
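As a minimal sketch of this ratio, the function below computes the coefficient of determination from observed and fitted values as 1 − (residual variation / total variation), which equals explained over total variation for a least-squares fit with an intercept. The numbers and the helper name `r_squared` are hypothetical; the fitted values are simply the two group means, which is what a regression on a single dummy variable would produce.

```python
def r_squared(observed, fitted):
    """Coefficient of determination from observed and fitted values."""
    mean_y = sum(observed) / len(observed)
    # Total variation of the outcome around its mean.
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    # Variation left unexplained by the model.
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1.0 - ss_residual / ss_total


# Hypothetical outcome values and fitted values (group means of two groups).
OBSERVED = [2.0, 4.0, 6.0, 8.0]
FITTED = [3.0, 3.0, 7.0, 7.0]

R2 = r_squared(OBSERVED, FITTED)
print(R2)
```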
However, a low $R^2$ value does not mean the regression model is invalid. A model with a low $R^2$ value but
significant parameters could still support meaningful conclusions. (Sheather, 2010) In our
scenario, three predictors are selected to model the occupancy rate of the shelters. However, there might
be many other predictors that are relevant to our outcome of interest. That is to say, when
it comes to outcomes that are difficult to predict, the $R^2$ might be inherently low due to the nature of the
sample data collected.
In addition, for practical reasons, we tend to include all three predictors chosen from
the raw sample dataset on the condition that they are all statistically significant predictors based on the
p-value cutoff described above. The reason is that, based on the analysis of
the variable summaries (in the Data section), the three main questions we are interested in are closely related to
each other.

Results

Table 3: Table of Model summary


Term                        Estimate  P-value
(Intercept)                 81.47     0.000000e+00
log(SERVICE_USER_COUNT)     4.76      0.000000e+00
PROGRAM_MODELTransitional   -4.75     0.000000e+00
SECTORMixed Adult           -3.94     0.000000e+00
SECTORWomen                 -1.14     3.394844e-09
SECTORYouth                 -4.54     0.000000e+00

The table of linear regression results is shown above. The first column lists the model terms, i.e. the
intercept and $x_i$, where $i = 1, 2, 3, 4, 5$, and the values in the second column are the estimated regression
coefficients $\beta_j$, where $j = 0, 1, 2, 3, 4, 5$. Looking at the p-values in the third column, we found that
they are all less than the significance cutoff 0.05. Therefore, we conclude that each $\beta_j$ is not equal to
zero, and all of our predictors have a significant linear relationship with our outcome variable.

$R^2$ = 0.2083
$R^2_{adjust}$ = 0.2081

The $R^2$ and $R^2_{adjust}$ values of our model are shown above. These two values are
very close, so it makes little difference which one we use to interpret our model. As a matter of fact, the $R^2$ is
more appropriate for showing the share of the total sample variability in the response that is explained by the
regression model. If the number of predictors were greater than seven, we might switch to the $R^2_{adjust}$
value to avoid being misled by overfitting. (Sheather, 2010)
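The closeness of the two values can be checked with the usual adjustment formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The sketch below assumes that all n = 26109 records from Table 2 entered the fit and that the model has p = 5 slope coefficients; the helper name `adjusted_r_squared` is hypothetical.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for a model with p slope coefficients fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# R^2 from the model summary; n from Table 2 (assumed to be the fitting sample size).
ADJUSTED = adjusted_r_squared(0.2083, n=26109, p=5)
print(round(ADJUSTED, 4))
```

With a sample this large relative to the number of predictors, the penalty factor (n − 1)/(n − p − 1) is almost exactly 1, which is why the adjustment changes only the fourth decimal place.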
Note that the coefficient of determination of our model is not close to 1, which indicates a poor
goodness of fit. In fact, the $R^2$ value might not be a good measure for our model, since
our dependent variable of interest is the occupancy rate, which is associated with a variety of factors not in
the data, such as social insurance policy. We could nevertheless still get valuable insights from the limited
information provided in the sample data, because the parameter estimates in our model are highly significant.

Final Model
$$y = 81.47 + 4.76 \log(x_1) - 4.75 x_2 - 3.94 x_3 - 1.14 x_4 - 4.54 x_5 + \epsilon$$

Note that the coefficient $\beta_1$ applies to the log-transformed $x_1$, so the transformation has to be taken
into account when interpreting it. Because $\log(2 x_1) = \log(x_1) + \log 2$, doubling $x_1$ results in a
$\beta_1 \log 2$ unit increase in our dependent variable $y$; the effect of one additional user is therefore not
constant but shrinks as the shelter gets larger. Thus, based on our model, doubling the number of users in a
particular shelter tends to raise the occupancy rate in that shelter by about 1.43 when $\log 2$ is evaluated in
base 10. With the natural logarithm, which is what R's log() computes by default, the same doubling corresponds
to $4.76 \ln 2 \approx 3.30$.
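To make the $\beta_1 \log 2$ term concrete, the sketch below evaluates the change in $y$ implied by the log term alone for a few changes in the number of users. It uses the estimate 4.76 from Table 3; the helper name `change_in_y` and the example user counts are hypothetical. Both logarithm bases are shown because the value 1.43 matches the base-10 logarithm, whereas R's log() is the natural logarithm by default.

```python
import math

B1 = 4.76  # estimate for log(SERVICE_USER_COUNT) from Table 3


def change_in_y(users_from, users_to, log=math.log):
    """Change in predicted occupancy rate when the user count moves between two values."""
    return B1 * (log(users_to) - log(users_from))


# Doubling the number of users, under each logarithm base.
DOUBLING_NATURAL = change_in_y(10, 20)
DOUBLING_BASE10 = change_in_y(10, 20, log=math.log10)

# One additional user has a smaller effect in a larger shelter.
ONE_MORE_AT_10 = change_in_y(10, 11)
ONE_MORE_AT_100 = change_in_y(100, 101)

print(round(DOUBLING_NATURAL, 2), round(DOUBLING_BASE10, 2))
print(round(ONE_MORE_AT_10, 3), round(ONE_MORE_AT_100, 3))
```

The doubling effect is the same whatever the starting size, while the effect of a single extra user depends on how many users the shelter already has.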
Next, we interpret the categorical variables using the properties of dummy variables (Wu & Thompson, 2020)
in the linear regression model. The reference groups for our two categorical variables, program type and
sector, are “Emergency” and “Men”.
If a shelter program is of the “Transitional” type, it tends to have an occupancy rate 4.75 percentage points
lower than an “Emergency” shelter, holding the other predictors fixed. As mentioned in the previous section,
the total number of users of emergency shelters is far greater than that of transitional programs. Thus, we
might infer that the emergency shelters, which are in huge demand, have a more serious shortage of sleeping
spaces than the transitional programs.
The same tendency holds for the other categories whose parameter estimates are negative. For example, if
a shelter is in the “Mixed Adult” sector, a drop in occupancy rate of 3.94 percentage points is expected
(relative to the “Men” sector). We found that shelters in all other sectors are associated with a lower
occupancy rate compared to the “Men” sector, so, equivalently, shelters in the “Men” sector are expected to
have the highest occupancy rate. Recall from the previous part that the “Men” sector is one of the major
groups in the shelter service system, yet it is also the sector with the highest expected occupancy rate.
Hence, the capacity of “Men” shelters is the most vulnerable to heavy demand. Also, the “Women” sector shows
the smallest drop in the outcome variable (1.14), so its expected occupancy rate is the closest to that of the
“Men” sector, whereas the “Youth” sector shows the largest drop (4.54) and is therefore the least likely to
encounter problems like overcrowding. In other words, among the sectors compared, shelters for youth appear
to have the most spare capacity.
Combining the overview obtained from the numerical and graphical summaries with the results obtained from
our multiple linear regression model, we gain some insights on the three motivating questions (posed in the
Data section) that serve the main goal of our research:
• A significant pattern is found between the occupancy rate by bed and the logarithm of the number of
users of a particular shelter. Its main practical use is that if the occupancy
rate in a shelter should be kept under a certain threshold, we can obtain a rough estimate of the
maximum number of users that could be assigned to that shelter, without needing any characteristic
of the shelter beyond its program type and sector.
• Regarding the occupancy rate of the two types of shelter programs, we discussed
in the previous part that a “Transitional” shelter is expected to have an occupancy rate about
4.75 percentage points lower than an “Emergency” program. The variable summaries support the
tendency shown in our model: Figure.3 shows that the number of emergency shelters whose occupancy
rate approaches 100% is clearly greater than the number of transitional shelters. Therefore, transitional
shelters are associated with a lower occupancy rate than shelters in the emergency
program.
In practice, estimating the link between the program type of shelters and the occupancy rate provides
a reference for adjusting the proportion of the two types of shelters in the public service system.
Based on the previous analysis, transitional shelters tend to have lower occupancy rates, which
means they tend to have more vacancies than emergency shelters. It also indicates
that emergency shelters, the major program type in the system based on our sample data, are
very likely suffering from overcrowding, which deserves critical attention.
• Additionally, among the four sectors of shelters, all other sectors are linked to a lower occupancy rate
relative to the “Men” sector. It is worth mentioning that although the men’s shelters already make up one of
the largest components among all shelter sectors, they have the highest expected occupancy rate and
might currently be under heavy stress.
More attention should therefore be drawn to the men’s shelters; however, this does not
imply that shelters in other sectors, which have lower expected occupancy rates than men’s
shelters, do not need any modification. Recall that the average occupancy rate is above 90%, so
overcrowding may be a universal issue throughout the shelter system.
As a specific example, suppose an emergency shelter is planned to be opened to alleviate the pressure on
shelters in the “Men” sector, and we have an estimate of the probable number of users of this new shelter.
An approximate value of the occupancy rate can then be produced from those pieces of information.
If the occupancy rate is higher than expected, we might consider reducing the size of this shelter and building
another shelter of the same type to accommodate the extra people in need.
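The planning example above can be sketched as follows, using the estimates from Table 3. The function names `predict_rate` and `max_users`, the user counts and the 95% threshold are hypothetical, and the natural logarithm is assumed. Given the low $R^2$ of the model, the outputs should be read as rough guides rather than precise forecasts; being linear in log(users), the model can also return values slightly above 100.

```python
import math

INTERCEPT = 81.47
B1 = 4.76  # coefficient of log(SERVICE_USER_COUNT)
PROGRAM_EFFECT = {"Emergency": 0.0, "Transitional": -4.75}
SECTOR_EFFECT = {"Men": 0.0, "Mixed Adult": -3.94, "Women": -1.14, "Youth": -4.54}


def predict_rate(users, program, sector):
    """Predicted occupancy rate by bed (in percent) for one shelter."""
    return INTERCEPT + B1 * math.log(users) + PROGRAM_EFFECT[program] + SECTOR_EFFECT[sector]


def max_users(threshold, program, sector):
    """Largest user count whose predicted occupancy rate stays at or below the threshold."""
    base = INTERCEPT + PROGRAM_EFFECT[program] + SECTOR_EFFECT[sector]
    # Invert threshold = base + B1 * log(users) and round down to a whole user.
    return math.floor(math.exp((threshold - base) / B1))


# A planned emergency shelter in the "Men" sector expecting about 50 users.
PLANNED_RATE = predict_rate(50, "Emergency", "Men")
# How many users keep the same kind of shelter at or below 95% occupancy?
LIMIT_95 = max_users(95.0, "Emergency", "Men")

print(round(PLANNED_RATE, 2), LIMIT_95)
```

In this illustration the predicted rate for 50 users is at the top of the scale, which under the reasoning above would argue for splitting the demand across more than one shelter.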

Model Diagnostics

Table 5: VIF Values

Predictor                 VIF Value  Degree of Freedom
log(SERVICE_USER_COUNT)   1.10       1
PROGRAM_MODEL             1.03       1
SECTOR                    1.08       3

Table.5 above shows the variance inflation factor values of our independent variables. The no-multicollinearity
assumption is fulfilled since all VIF values are low (well below the cutoff of 5), which means that the
predictors in our multiple linear regression are not strongly correlated with one another (as discussed in
Model Assumptions).

[Figure: Normal Q-Q plot (Sample Quantiles against Theoretical Quantiles) and residual plot (Standardized Residual against Fitted Values)]


However, the “Normality” and “Homoscedasticity” assumptions are violated, as shown in the Normal Q-Q
plot and the plot of residuals against fitted values. The Q-Q plot deviates greatly in the left tail, and the
residual plot seems to show some patterns. Recall that we observed a logarithmic relationship in the scatter
plot between the dependent variable and the numeric predictor and log-transformed the numeric independent
variable to improve the fit; nevertheless, the normality assumption is still not met.
A similar situation occurs for the homoscedasticity assumption. As a consequence, the reported significance of
the regression model might be affected.
Although these two assumptions are not perfectly satisfied, we do not consider this a
factor that considerably challenges the credibility of the model. The reason is that the p-value of our numeric
predictor, the number of users of the shelters, is extremely small, far below the significance
cutoff of 0.05. Thus, given that other indispensable assumptions, such as the absence of strong correlation
among the predictors, have been met, these violations are unlikely to have a conspicuous effect on our model.
All analysis for this report was programmed using R version 4.1.0 (2021-05-18). Tables for analysis in this
research are made using the knitr 1.33 package (Xie et al., 2021), kableExtra 1.3.4 package (Zhu, 2021)
and vtable 1.3.3 package (Huntington-Klein, 2021).

Conclusions
This research investigates the influence of several relevant variables on the occupancy rate by bed.
Three predictors, namely the current number of shelter users, the type of housing program, and the
sector of the shelter, are selected as the variables of interest in order to draw inferences about the current
operation of the shelter system in Toronto and to predict the occupancy rate based on the predictors.

We employ a data set that records active overnight shelter and allied services and is compiled directly from
Shelter Management Information System (SMIS) database. (Toronto Open Data, 2021) After cleaning the
raw data, filtering out the abnormal values, and selecting the variables of interest, numerical and graphical
summaries obtained from the cleaned dataset bring the motivations to the analysis steps.
With the aim of building a model to specify the relationship between the occupancy rate and several relevant
predictors, a multiple linear regression model is chosen. We apply a logarithmic transformation to the
numeric predictor, i.e. the number of users, in an attempt to better satisfy the “Normality” assumption.
After checking the key assumptions of the multiple linear model, we obtain the initial model,
which includes all three predictors as independent variables. The model selection process relied mainly
on P-values and the coefficient of determination, while also taking practical rationale into consideration.
Furthermore, the variance inflation factor (VIF) is used to identify whether any predictor should be
removed because of multicollinearity. Combining the methods above, we kept our initial multiple
linear regression model as the final choice.
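The VIF check mentioned above can be sketched for the simplest case of two predictors, where the VIF of each predictor reduces to 1 / (1 - r^2), with r the Pearson correlation between them. This is a minimal sketch under stated assumptions, not the computation used in the report: the function names and the data are hypothetical, and a model with more predictors would need the R-squared from regressing each predictor on all the others.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with exactly two predictors, R^2 equals r^2."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values for illustration only (not SMIS data).
log_users = [1.4, 1.8, 2.1, 2.3, 2.5, 2.7, 3.0, 3.3]
is_emergency = [0, 1, 0, 1, 1, 0, 1, 0]  # hypothetical dummy-coded program type

vif = vif_two_predictors(log_users, is_emergency)
print(f"VIF: {vif:.3f}")
```

A VIF close to 1 indicates little linear overlap between the predictors. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic multicollinearity.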
The statistical significance of the parameter estimates from the final model allows reasonable conclusions to
be drawn about our outcome variable, the occupancy rate of the shelters. In general, the
occupancy rate has a strong relationship with all three predictor variables: the current number of
shelter users, the type of housing program, and the sector into which a shelter is categorized.
Specifically, holding the other predictors fixed, each one-unit increase in the logarithm of the number of users
in a particular shelter is associated with a 1.43-unit increase in its occupancy rate. The linear relationship
that our model uncovers between the outcome and the categorical variables is also a practical tool for
quantifying how the classification of a shelter influences the occupancy rate, which is our main concern.
For example, shelters in the “Women” sector tend to experience the least overcrowding, while excessive
overload on capacity might occur in shelters in the “Men” sector.
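To make the coefficient on the log-transformed predictor more concrete, the following worked example translates it into changes in the predicted occupancy rate. It assumes that the natural logarithm was used (the report does not state the base) and that the other predictors are held fixed. The symbols $x_1$ and $x_2$ are introduced here only to denote two user counts.

```latex
\begin{aligned}
\hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 \ln(x) + \cdots, \qquad \hat{\beta}_1 = 1.43 \\
\Delta\hat{y} &= \hat{\beta}_1 \left[\ln(x_2) - \ln(x_1)\right] = 1.43 \,\ln\!\left(\frac{x_2}{x_1}\right) \\
x_2 = 2\,x_1 &: \quad \Delta\hat{y} = 1.43 \ln 2 \approx 0.99 \\
x_2 = 1.1\,x_1 &: \quad \Delta\hat{y} = 1.43 \ln 1.1 \approx 0.14
\end{aligned}
```

Under these assumptions, doubling the number of users corresponds to an increase of about 0.99 units in the predicted occupancy rate, and a 10% increase in users corresponds to about 0.14 units.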
To sum up, the multiple linear model sheds light on the existing problems of the shelter service system in
Toronto. It also offers a statistically significant prediction of the occupancy rate based on
accessible information such as the program type and sector of a shelter.

Weaknesses
Ideally, the model significance should be high (P-values < 0.05) and most of the variability should be explained
by our model, which means the coefficient of determination is close to 1. All variables that we are interested in
ought to be included in the model to facilitate drawing conclusions. In the meantime, every assumption
should be satisfied, or brought closer to being satisfied by a mathematical transformation of the variables of interest.
Nonetheless, the occupancy rate is a numeric value recording human activity, which makes it
vulnerable to external and inherent conditions. For example, the implementation of a new social insurance
policy or an unpredictable snowstorm will cause the demand for shelters to fluctuate, which in turn influences
the occupancy rate of the shelters. Even though our regression model has a certain statistical significance in
theory, no model can perfectly simulate the actual pattern and provide fully accurate predictions of this
variable of interest. The deeper reasons behind the limitations of our research are attributable to the inherent
characteristics of the outcome variable.
In order to create a cleaned sample dataset, we removed the shelters whose occupancy rates are calculated
on a room basis, since they are mainly designed to meet the shelter needs of families. However, separate
research is needed to analyze the performance of those family shelters in order to reach a more comprehensive
conclusion on the current situation of shelters.
In addition, the occupancy rate is a fluctuating, real-time quantity, but it takes time for data to be
uploaded into the database, and our model can only capture the relationship between the occupancy rate
and several relevant predictors at a single point in time. It would be more rigorous to monitor the occupancy
rate over a period of time so that patterns can emerge.
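Monitoring the occupancy rate over time could look like the sketch below, which computes a daily rate, smooths it with a rolling mean, and flags days above the 90% level cited in the Introduction. The daily counts, the capacity, and the three-day window are hypothetical choices made for illustration, not values from the SMIS data.

```python
def occupancy_rate(occupied, capacity):
    """Occupancy rate in percent for one day."""
    return 100.0 * occupied / capacity

def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily bed counts for one shelter (not SMIS data).
daily_occupied = [45, 47, 48, 50, 49, 50, 46]
capacity = 50

rates = [occupancy_rate(o, capacity) for o in daily_occupied]
smoothed = rolling_mean(rates, window=3)

# Days (numbered from 1) on which the rate exceeds the 90% level.
overcrowded_days = [day for day, rate in enumerate(rates, start=1) if rate > 90.0]

print("daily rates:", rates)
print("3-day rolling mean:", [round(s, 1) for s in smoothed])
print("days above 90%:", overcrowded_days)
```

A rolling mean reduces the effect of single-day fluctuations, so a sustained rise is easier to tell apart from a one-off spike.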

Next Steps
With more sophisticated analysis methods, a regression model with a higher level of accuracy could be
developed to investigate and predict the outcome variable. Such a model might not be linear, but it could
fit the sample data better. Besides, although the occupancy rate reveals a key aspect
of the operating status of shelters, namely whether they face a housing shortage or overcrowding,
it cannot serve as an overall indicator that tells the whole story of the current performance
of the shelters. More attention should be paid to the quality of the shelters rather than focusing merely on the
quantitative point of view. For example, the regular repair of facilities and the nutritional balance of the
meals provided in the shelters could also reflect the present condition of the shelters.
Another improvement would be to visualize the distribution of the shelters on a map. The locations of
shelters might affect the demand for them. For example, if the overcrowding of shelters is concentrated
within a certain district of the city, then solutions could target that area, which would save time and
money as well.
Furthermore, field visits to the shelters investigated are necessary when the regression model is used to
improve the shelter service. The first reason is that a potential unwillingness to report the poor performance of
some shelters (discussed within the Data section) may harm the practical usefulness of the model, even if a
nearly perfect model is found to simulate the outcome variable. A field survey of those shelters
whose information is missing or whose records of the current situation are incomplete is therefore important.
Secondly, the regression model interprets the actual situation from a theoretical angle. A field trip to any
target shelter that shows unusual performance is needed before steps are taken. Admittedly, the workload of
field visits is large and costly, so a trade-off must be made when it comes to the
practical application of the statistical model.

Discussion
Regular tracking of the operating status of shelters in a city is essential for city management. With a
well-rounded shelter housing system, the well-being of citizens will improve and a more stable society
can be expected in the near future.

Bibliography
• Grolemund, G. (2014, July 16). Introduction to R Markdown. RStudio.
https://rmarkdown.rstudio.com/articles_intro.html. (Last Accessed: October 12, 2021)
• Dekking, F. M., et al. (2005) A Modern Introduction to Probability and Statistics: Understanding why
and how. Springer Science & Business Media.
• Allaire, J.J., et al. References: Introduction to R Markdown. RStudio.
https://rmarkdown.rstudio.com/docs/. (Last Accessed: October 12, 2021)
• City of Toronto. (2021, September 9). Shelter Management Information System (SMIS). Retrieved
October 15, 2021, from
https://www.toronto.ca/community-people/community-partners/emergency-shelter-operators/shelter-management-information-system/.
• Daily Shelter & Overnight Service Occupancy & Capacity. City of Toronto Open Data Portal. (n.d.).
Retrieved October 15, 2021, from
https://open.toronto.ca/dataset/daily-shelter-overnight-service-occupancy-capacity/.
• Dalgaard, P. (2008). Introductory Statistics with R (2nd ed.).
• Effects of Homelessness. Homelessness. (n.d.). Retrieved October 15, 2021, from
https://depts.washington.edu/triolive/quest/2007/TTQ07033/effects.html.
• Community impact: The community impact on homelessness. CaringWorks, Inc. (n.d.). Retrieved
October 15, 2021, from https://www.caringworksinc.org/our-impact/community-impact/.
• Sheather, S. J. (2010). A modern approach to regression with R. Springer.
• PressProgress. (2019, January 28). Toronto’s shelters are now consistently above 90% capacity – and
that is extremely dangerous. PressProgress. Retrieved October 19, 2021, from
https://pressprogress.ca/torontos-shelters-are-now-consistently-above-90-capacity-and-that-is-extremely-dangerous/.
• Linear regression. IBM. (2021). Retrieved October 20, 2021, from
https://www.ibm.com/topics/linear-regression.
• Evans, M., & Rosenthal, J. S. (2010). Probability and statistics: The Science of Uncertainty. W.H.
Freeman and Co.
• Lewis-Beck, C., & Lewis-Beck, M. (2015). Applied regression: An introduction (Vol. 22). Sage
publications.
• Wu, C., & Thompson, M. E. (2020). Sampling theory and practice. Springer International Publishing.
• Snee, Ron (1981). Origins of the Variance Inflation Factor as Recalled by Cuthbert Daniel (Technical
report). Snee Associates.
• Zhu, H. (2021, February 19). Create awesome HTML table with knitr::kable and kableextra. Retrieved
October 24, 2021, from https://haozhu233.github.io/kableExtra/awesome_table_in_html.html.
• Xie Y (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package
version 1.36, https://yihui.org/knitr/.
• Huntington-Klein, N. (2021, August 5). Sumtable: Summary statistics. Retrieved October 24, 2021,
from https://cran.r-project.org/web/packages/vtable/vignettes/sumtable.html.
• Berk, R., & MacDonald, J. (2010). Policing the homeless: An evaluation of efforts to reduce
homeless-related crime. Criminology & Public Policy, 9(4), 813-840.
