Forecasting Ambulance Demand using Machine Learning: A Case Study from Oslo, Norway

Anna Haugsbø Hermansen, Department of Computer Science, NTNU, Trondheim, Norway. Email: annahhermansen@gmail.com
Ole Jakob Mengshoel, Department of Computer Science, NTNU, Trondheim, Norway. Email: ole.j.mengshoel@ntnu.no
2021 IEEE Symposium Series on Computational Intelligence (SSCI) | 978-1-7281-9048-8/21/$31.00 ©2021 IEEE | DOI: 10.1109/SSCI50451.2021.9659837

Abstract: In Emergency Medical Services (EMS), time is of the essence. It is crucial for one or more ambulances to reach the scene of a medical incident quickly to ensure timely, sometimes life-saving, assistance to people in need. To do so, one needs good estimates of when and where incidents are likely to occur. This work investigates how to best forecast EMS demand in and around Oslo, Norway’s capital. Demand forecasts are computed using machine learning models created from historical EMS and weather data. We use a fine spatio-temporal resolution of 1x1km spatial regions and 1-hr time intervals. Two different approaches are implemented and compared against each other: a split approach that forecasts the volume and distribution of the EMS demand separately, and a complete approach that directly models the EMS demand over our spatial domain. We use multi-layer perceptron (MLP) and long short-term memory (LSTM) models to forecast the EMS demand, and compare them to simple aggregation methods and baselines. We also study different feature sets and online learning to improve the models’ performance. Experiments suggest that a split online model consisting of a simple aggregation distribution model and an MLP volume model with simple temporal features produces the best forecasts. In particular, the split model produces competitive volume, distribution, and complete forecasts relative to a state-of-the-practice method, MEDIC, and a well-established complete MLP model proposed by Setzler et al.

I. INTRODUCTION
Emergency medical services (EMS) are a crucial but complex part of modern health care systems. For EMS there is substantial uncertainty in the volume, severity, and location of medical incidents, which impacts ambulance demand and ambulance travel times. Further, the trade-off between cost, effect, and equity is a substantial issue. Allocating many resources, including ambulances, is necessary to handle sudden high workloads. But resources that are never or seldom used pose a high and unnecessary cost. It is more cost-effective to focus resources in areas with high demand, but that means people living in low-demand areas will have less access to those critical resources. On the other hand, providing equal access to resources when the cost of doing so is higher in some areas may imply that people are valued higher in those areas [4].
In light of these complexities, it is not surprising that studies of EMS demand have received, and continue to receive, much attention from the general public, policymakers, medical professionals, and researchers. To structure the discussion in this paper, we group previous studies of EMS demand according to their spatial and temporal granularity. Daily forecasts are considered a low temporal resolution, while a time interval of one or a few hours is considered high resolution. For the spatial resolution, studies using regions of up to a few square kilometers are considered high resolution, while coarser resolutions (including those studying entire cities) are low resolution.
Previous works using low spatial and temporal resolution [15], [16], [13], [7], [14] have considered Hong Kong (Hong Kong), New Taipei City (Taiwan), Birmingham (UK), and Ningbo (China). These studies have established (i) the feasibility of ambulance demand forecasting at this spatial and temporal scale and (ii) the impact of extreme weather on EMS demand [15], [16].
Research using a low spatial but high temporal resolution typically forecasts the hourly demand for large cities such as Toronto (Canada) and Calgary (Canada) [9], [1]. This research demonstrates, for EMS forecasting, the feasibility and relative performance of different forecasting methods. Further, it shows how to improve hourly forecasts via online learning, in which the hourly forecast for the later hours of a day is updated in light of correct data from the early parts of the day.
Using a high spatial but low temporal resolution, researchers have successfully studied forecasting of daily demand in small spatial regions of a few square kilometers or on a continuous scale in Singapore [8] and Athens (Greece) [5]. An interesting finding from the Athens study is the reduction of distance between ambulances and incidents by over 1km when distributing the units according to a novel model using their forecast demand as input, compared to the current location of ambulances in Athens [5].
Using both high spatial and temporal resolution typically means performing hourly forecasts in either small spatial regions of a few square kilometers, often inside large cities, or on a continuous spatial domain [11], [2], [3], [17], [18], [20], [19]. Such studies have been performed with success in cities like Charlotte-Mecklenburg (USA), New Taipei City (Taiwan), Toronto (Canada), and Melbourne (Australia). Using such fine-grained spatio-temporal scales is one of the main considerations of this work, thus we provide a more detailed analysis in Section III. However, we wish to emphasize Zhou et al.’s study on decomposing or splitting ambulance demand into components [17], [18], [20], [19] as well as Setzler et al.’s [10] use of neural networks.
A. Contributions and Perspectives
In this work, we consider the above complexities, uncertainties, and trade-offs in the context of a real-world study in and around Oslo, Norway. Building on previous research, we develop machine learning models for forecasting EMS demand in Oslo and Akershus, Norway. Doing so involves several challenges. In addition to the challenges mentioned above, demand forecasts are only useful for staffing and positioning resources if the forecasts have high resolutions on a spatio-temporal scale. However, high resolutions result in very sparse data which is difficult to model accurately.
The main contributions of our work¹ are as follows:
• Using machine learning methods, we study ambulance demand forecasting using datasets from the Oslo region in Norway. To our knowledge, such a study has not been performed before.
• We emphasize the use of a split approach, inspired by Zhou et al. [17], which splits or decomposes the forecasting problem into a volume forecasting problem and a spatial distribution forecasting problem.
• We use online learning to enhance our proposed models.
• Based on experiments with the Oslo datasets, where we compare several split approaches to baselines and benchmarks from the literature, we recommend a split online model consisting of a simple aggregation distribution model and an MLP volume model using simple temporal features.
Inspired by the work of Zhou [17], we study how her split approach performs compared to the more established complete approaches. We also build on Setzler et al.’s [10] use of neural networks to model EMS demand and search for the best possible network architecture. In addition, we go beyond Setzler et al. by testing LSTM models, which to our knowledge have not been tested in this domain before. We use MEDIC and existing neural network models [10] as benchmarks. Channouf et al. [1] show that the use of special day effect flags for New Year’s Eve and a local festival improves demand volume forecasts. We try to capture such special days implicitly by adding day of the month and month of the year to our feature set. We also use weather features as suggested by Setzler [10] and Zhou [17]. Our study extends the work by Wong and Lai [15], as we use a higher spatio-temporal resolution. Unlike much previous work on EMS forecasting, we use online learning (as suggested by Chen et al. [2], [3]) and compare it to traditional batch learning.

II. BACKGROUND
In a medical emergency, a minimal ambulance response time will benefit the medical outcome for the patient or patients. At the same time, we want to optimize the utilization of available resources so that we can reduce the response time without incurring large expenses.
Ultimately, our goal is to minimize ambulance travel time by positioning ambulances strategically and according to demand. More specifically, this is the overall goal of the health-care effort of which this work is a part:
Goal: To minimize the EMS response time in Oslo² through ambulance demand forecasting and strategic placement of ambulances.
As a step towards achieving this goal, one needs to forecast when and where incidents are likely to occur. Consequently, we study data-driven and machine learning methods for forecasting of ambulance demand. The datasets used, research questions, and metrics are discussed below in this section.

A. Dataset
The EMS incident dataset used in this work was provided by the EMCC department of the Oslo University Hospital (OUH) and the Norwegian National Advisory Unit for Prehospital Emergency Medicine (NAKOS). It includes the location and timestamps of missions completed by the ambulance department between January 1st, 2015, and February 11th, 2019. Due to privacy concerns, we use anonymized data in which the exact incident locations have been mapped to a standard 1x1km grid, see Figure 1.³

Figure 1: Illustration of the total number of incidents per grid cell in the filtered incident dataset from Oslo. The demand exhibits extreme variation and locality, with many cells having zero incidents and some cells having between 10,000 and 100,000 (10k and 100k) incidents.

¹ This paper summarizes results of Hermansen’s master thesis [6].
² For simplicity, we often say “Oslo” instead of “Oslo and Akershus.”
³ More information about the grid is available [12]. The grid can be downloaded from Statistics Norway at www.ssb.no/natur-og-miljo/geodata.
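The mapping of incident locations to grid cells described in the Dataset subsection can be illustrated with a minimal sketch. It assumes that incident positions are available as projected coordinates in meters (easting, northing) and that a cell is identified by the integer kilometer index of its lower-left corner. The function names, the tuple layout, and the sample coordinates are hypothetical, and the actual cell identifiers of the Statistics Norway grid are not reproduced here.

```python
from collections import Counter
from datetime import datetime

CELL_SIZE_M = 1000  # assumed cell edge length: 1x1 km


def grid_cell(easting_m, northing_m, cell_size_m=CELL_SIZE_M):
    """Map a projected coordinate (meters) to a (column, row) cell index."""
    return int(easting_m // cell_size_m), int(northing_m // cell_size_m)


def hourly_cell_counts(incidents):
    """Count incidents per (hour, cell) pair.

    `incidents` is an iterable of (timestamp, easting_m, northing_m) tuples
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for timestamp, easting_m, northing_m in incidents:
        # Truncate the timestamp to the start of its 1-hr interval.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(hour, grid_cell(easting_m, northing_m))] += 1
    return counts


# Hypothetical incidents: two in the same cell and hour, one elsewhere.
incidents = [
    (datetime(2015, 1, 1, 13, 5), 262_450.0, 6_649_980.0),
    (datetime(2015, 1, 1, 13, 40), 262_900.0, 6_649_100.0),
    (datetime(2015, 1, 1, 14, 10), 263_010.0, 6_650_500.0),
]
counts = hourly_cell_counts(incidents)
```

Only (hour, cell) pairs with at least one incident appear in the counter; at a 1x1km and 1-hr resolution most pairs are absent, which is the sparsity that makes forecasting at this resolution difficult.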
Topography, climate, population, and traffic are factors that may impact ambulance demand, and these vary from city to city. Oslo, being “wrapped around” a fjord, with a humid continental climate and relatively low population density, clearly differs from many previously studied cities. Zhou [17], for example, studies ambulance demands in Toronto (Canada) and Melbourne (Australia). Toronto, Melbourne, and Oslo have population densities of approximately 5 000 persons/km², 500 persons/km², and 200 persons/km², respectively. Even within the same city, for example New Taipei City, it has been observed that different ML models perform best within different parts of the city [2], perhaps due to differences in population density.

B. Research Questions (RQs)
We have defined specific research questions (RQs) that we study in this work, as discussed below.
1) Research Question 1 (RQ 1): We want to forecast the demand $\lambda(t) = u^t \in \mathbb{R}^M_+$ such that $u^t_k$ is the forecast number of incidents at time step $t$ in region $k$ for $k \in \{1, 2, ..., M\}$, where $M$ is the number of spatial regions. The higher the spatio-temporal resolution of the EMS demand forecasts, i.e., the larger $M$ is for a given geographical area and the shorter each time step is, the more information we have for positioning ambulances strategically. The disadvantage of using a high spatio-temporal resolution is that the data becomes more sparse and stochastic, making forecasting more difficult. This leads us to our first research question:
RQ 1: How can EMS demand in Oslo be forecast accurately at a high spatio-temporal resolution?
In this work, we use a high spatial resolution of 1x1km regions, which is the highest possible granularity in our case due to the privacy restrictions of our dataset. Our temporal resolution of 1 hour is the highest temporal granularity used in the literature, making state-of-the-art comparisons easier. We test various methods, models, and features to investigate how to model the EMS demand as accurately as possible at these spatio-temporal resolutions.
2) Research Question 2 (RQ 2): Existing studies on EMS demand forecasting have, with a few exceptions, tried to forecast the complete demand $\lambda(t)$ directly. However, it is also possible to use a split approach that forecasts the volume and the distribution of the demand separately. Let $\delta(t) \in \mathbb{R}_+$ be the aggregated volume of incidents at time step $t$ such that $\delta(t) = \sum_{k=1}^{M} u^t_k$. Let $f(t) = v^t \in \mathbb{R}^M_+$ be the spatial distribution of the incidents at time step $t$ such that $v^t_k$ is the fraction of the events in time step $t$ that occurs in region $k$ and $\sum_{k=1}^{M} v^t_k = 1$. A complete forecast at time step $t$ can be obtained by combining volume and distribution forecasts: $\lambda(t) = \delta(t) f(t)$.
Almost all previous research has been on complete models, with the notable exception of Zhou and co-authors [18]; see Section III. We are interested in how different methods influence the forecasting of EMS demand:
RQ 2: Is a split model or a complete model better at forecasting EMS demand in Oslo?
Here, the definition of “better” is not straightforward because we consider several dimensions. Which is better, a model with good complete forecasts or a model with good volume or distribution forecasts? We believe that a model that makes good distribution and volume forecasts is better because such separate forecasts have much practical value. First, a good volume forecast is useful for staffing decisions: it can be used to determine how many employees should staff a shift. Second, a good distribution forecast helps determine where resources should be positioned to ensure low response times. These two points speak to the practical value of a split approach in which the focus is on both volume and distribution forecasts. Having a good complete forecast, on the other hand, is not as easily interpreted and therefore may have less practical value.

C. Metrics
Error metrics are used to quantitatively evaluate the performance of forecasting models.
1) Mean Absolute Error: The mean absolute error (MAE) avoids the cancellation of negative and positive errors by taking the absolute value of each error. For forecasts $\hat{y} \in \mathbb{R}^m$ and targets $y \in \mathbb{R}^m$, the mean absolute error is defined as:
$$\mathrm{MAE}(\hat{y}, y) = \frac{1}{m} \sum_{i=1}^{m} |\hat{y}_i - y_i|. \tag{1}$$
2) Mean Squared Error: The mean squared error (MSE) is a popular metric for regression problems. Similar to MAE, it avoids cancellation of negative and positive errors, but MSE emphasizes large errors. For model predictions $\hat{y} \in \mathbb{R}^m$ and targets $y \in \mathbb{R}^m$, the MSE is defined as:
$$\mathrm{MSE}(\hat{y}, y) = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2. \tag{2}$$
MSE is often used as a loss function for neural networks because of the efficient calculation of its gradient.
3) Categorical Cross-Entropy: Categorical cross-entropy (CCE) is a measure of the distance between two probability distributions. It is based on cross entropy from information theory. For a model prediction $\hat{y} \in \mathbb{R}^m$ with $\sum_{i=1}^{m} \hat{y}_i = 1$ and a target distribution $y \in \mathbb{R}^m$ with $\sum_{i=1}^{m} y_i = 1$, the CCE is:
$$\mathrm{CCE}(\hat{y}, y) = -\sum_{i=1}^{m} y_i \log \hat{y}_i. \tag{3}$$
For many predictions and targets, we take the average of the cross-entropies between each prediction-target pair.

III. RELATED RESEARCH
Existing research has used datasets of highly varying spatial and temporal resolutions. Given our goal and research questions, as well as limited space, we focus mainly on previous research using hourly forecasts in either small spatial regions of a few square kilometers or on a continuous spatial domain. Using such a high spatio-temporal resolution is one of the main challenges of our work.
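Before turning to individual studies, the notation of Section II can be made concrete with a small NumPy sketch that computes the split decomition of a single time step and the three error metrics. The function names, the example counts, and the clipping constant `eps` (used only to avoid taking the logarithm of zero) are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np


def mae(y_hat, y):
    """Mean absolute error, as in Eq. (1)."""
    return float(np.mean(np.abs(y_hat - y)))


def mse(y_hat, y):
    """Mean squared error, as in Eq. (2)."""
    return float(np.mean((y_hat - y) ** 2))


def cce(y_hat, y, eps=1e-12):
    """Categorical cross-entropy, as in Eq. (3); eps is an assumed guard against log(0)."""
    return float(-np.sum(y * np.log(np.clip(y_hat, eps, 1.0))))


def split(u):
    """Decompose non-negative counts u into volume and spatial distribution.

    Assumes at least one incident in the time step (volume > 0).
    """
    delta = float(u.sum())
    return delta, u / delta


# Hypothetical counts for M = 4 regions in one time step.
u = np.array([3.0, 1.0, 0.0, 0.0])
delta, f = split(u)

# Hypothetical forecasts: a complete forecast and a uniform distribution forecast.
u_hat = np.array([2.0, 1.0, 1.0, 0.0])
f_hat = np.full(4, 0.25)

complete_mae = mae(u_hat, u)
complete_mse = mse(u_hat, u)
distribution_cce = cce(f_hat, f)
```

Multiplying `delta` by `f` recovers `u`, which is the identity $\lambda(t) = \delta(t) f(t)$ for a single time step.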
Steins et al. [11] use Zero-Inflated Poisson (ZIP) regression to model the hourly EMS call volume in 6x6km, 4x4km, and 3x3km spatial regions of three different Swedish counties. The ZIP regression combines two models: one that generates zeroes and a Poisson process that generates counts. This makes it suitable for modeling point processes with excess zero-count data, characteristic of EMS demand. Steins et al. [11] create one model for each county, using socioeconomic, geographic, and temporal features. The ZIP model performs slightly better than the existing model, which calculates the total average demand in the county per hour of the week using data from the previous year and then assigns a fraction of this demand to each grid cell based on its day and night population.
Setzler et al. [10] propose a multi-layer perceptron (MLP) to forecast EMS call volume in Charlotte-Mecklenburg, North Carolina (USA) for a variety of grid cell sizes and time intervals. Their 2002–2004 EMS data were mapped to different combinations of grid cell sizes (2x2mile and 4x4mile) and time intervals (1hr and 3hrs); we focus on the 1hr case below. The MLPs use four temporal features (hour, day of week, month, and season), a hidden layer with four nodes, and one output value for each grid cell. Setzler et al. [10] compare their MLP model to the MEDIC forecasting method used by the EMS agency in Charlotte-Mecklenburg.
MEDIC makes, in the 1hr case, forecasts by averaging the call volumes at the same hour for the previous four weeks over the past five years. Let $A_{h,d,w,y}$ be the actual call number for a given hour $h$, day $d$, week $w$, and year $y$. Let $Y = 5$ be the number of years and $W = 4$ be the number of weeks to average over. Then the MEDIC forecast $F_{h,d,w,y}$ is:
$$F_{h,d,w,y} = \frac{\sum_{i=0}^{Y-1} \sum_{j=1}^{W} A_{h,d,w-j,y-i}}{YW}. \tag{4}$$
In practice, one MEDIC model is created for each spatial region. Setzler et al. find that their new MLP model outperforms, according to MSE, MEDIC for 4x4mile grids (both 1hr and 3hr time intervals). For the 2x2mile grid with 3hr intervals, there is no statistical difference between the two methods. Interestingly, MEDIC was better for the 2x2mile grid with 1hr time intervals. Upon further inspection, the authors found that forecasting only zeros outperformed the two other approaches for this configuration.
Chen et al. [2] study daily EMS demand in 3x3km spatial regions in New Taipei City, Taiwan, using 2010–2012 data. Their forecasts use moving average, artificial neural network (ANN), linear regression, and support vector machine (SVM) models [2]. They train one model per spatial region for each model type. The SVM and ANN models are trained with month, day of the week, season, hour, and year features. Models are retrained with more data after each forecast is made on the test set. Chen et al. find that the ANN model is best overall but that the SVM is better for regions with high EMS demand. Thus, they propose a model selection phase in which the best of the four proposed models is selected for each region.
In a follow-up study, Chen et al. [3] adopt a two-phase model-selection approach to make 3hr forecasts in 3x3km spatial regions of three districts in New Taipei City. First, they use cross-validation to select the best features and potential hyper-parameters for the four different models. Second, the best of the four proposed models is selected for each region. The authors test seven different feature sets for the SVM and ANN models. Several different models with different feature sets are chosen for the regions. For regions with the highest EMS demand, the ANN model is chosen in 4 out of 7 regions. In low-demand regions, linear regression is chosen in 16 out of 27 regions. The different feature sets are chosen one to three times each. Overall, Chen et al. [3] conclude in 2016 that this two-fold model selection approach produces better results than their 2014 single-model approach [2].
Zhou, in her PhD thesis [17] and related publications [20], [18], [19], proposes several novel approaches for forecasting ambulance demand on a discrete time and continuous space domain. She and her co-authors assume that the ambulance demand in each 1- or 2-hr time period independently follows a non-homogeneous Poisson process whose expected value can be represented as a positive intensity function $\lambda_t(s)$, where $s \in S \subseteq \mathbb{R}^2$ is a spatial location in the relevant geographic area $S$. The intensity function is decomposed into an aggregate demand intensity $\delta_t = \int_S \lambda_t(s)\, ds$ and spatial density $f_t(s)$ such that $\int_S f_t(s)\, ds = 1$. Then, $\lambda_t(s) = \delta_t f_t(s)$. The aggregate demand intensity $\delta_t$ is simply the total demand of the entire geographic area of interest in time interval $t$, often studied in the literature. In contrast, Zhou [17] focuses on forecasting the spatial density $f_t(s)$. We call this approach of forecasting the aggregated demand and its spatial density separately a “split” approach.
Zhou et al. [17] discuss case studies from Toronto, Canada (see [18] and [20]), and Melbourne, Australia (see [19]), with data from 2007-2008 and 2011-2012, respectively. Three main models for spatial density $f_t(s)$ are studied:
1) A Gaussian Mixture Model (GMM) with weekly seasonality constraints and conditional autoregressive (CAR) priors to capture daily seasonality.
2) A spatio-temporal kernel density estimation (stKDE) approach.
3) An extended stKDE with kernel warping.
These methods are compared against each other and against MEDIC [10] using the Average Logarithmic Score (ALS). The stKDE models produce forecasts as accurate as those of the GMM models while having considerably lower time complexity. The warped stKDE model outperforms the other models for Melbourne, as the warping adapts the stKDE to Melbourne’s complex geographical boundaries.

IV. DEMAND FORECASTING METHODS AND MODELS
A. Categories of Forecasting Models
Our demand forecasting models for Oslo are split into three main categories: distribution, volume, and complete models. The volume models forecast aggregate hourly demand, while
the distribution models forecast how this hourly demand is distributed spatially across regions. Volume and distribution forecasts can be combined into a complete model that forecasts the number of incidents per region. The complete models make such complete forecasts directly. We evaluate the volume and complete models using MSE and MAE, while the distribution models are evaluated using CCE.
The complete models have $M = 5569$ non-negative output values, one per grid cell. The target values for the complete models are simply the number of incidents in each of the $M$ regions at each time step $t$; $y_t \in \mathbb{R}^M$.
The volume models have a single non-negative output. The target values for the volume models are the sum of the number of events across all grid cells in a time step $t$; $y_t^{vol} = \delta_t = \sum_{i=1}^{M} (y_t)_i$, with $y_t^{vol} \in \mathbb{R}$.
The distribution models have $M = 5569$ output values, one per region. The output values must be positive and sum to one. The distribution target values are calculated by normalizing the number of incidents across a time step; $y_t^{dist} = \frac{1}{\delta_t} y_t$. If there are no incidents ($\delta_t = 0$) in a time step, an even distribution is used as the target; $y_t^{dist} = [\frac{1}{M}, \frac{1}{M}, ..., \frac{1}{M}]$.

B. A Small Numerical Example
As a simple example, consider $M = 5$ regions with the following number of incidents at time step $t$: $y_t = [1, 2, 0, 0, 1]$. Then, the target distribution will be $y_t^{dist} = [0.25, 0.5, 0, 0, 0.25]$, the target volume will be $y_t^{vol} = 4$, and the complete target will be $y_t^{comp} = [1, 2, 0, 0, 1]$.
Next, suppose that for a different time step $t + k$ there are no incidents: $y_{t+k} = [0, 0, 0, 0, 0]$. Then the target distribution is $y_{t+k}^{dist} = [0.2, 0.2, 0.2, 0.2, 0.2]$, the target volume is $y_{t+k}^{vol} = 0$, and the complete target will be $y_{t+k}^{comp} = [0, 0, 0, 0, 0]$. Note that $\forall t\, (y_t^{vol} \cdot y_t^{dist} = y_t^{comp} = y_t)$.

C. Offline Forecasting Methods
We study several offline models for forecasting the hourly EMS demand in Oslo and Akershus. These include several neural network models and some simple baseline models based on averages of the historical EMS demand; see Section IV-C1 and Section IV-C2. We also study benchmark models from the literature [10]; see Section V-D.
1) Neural Networks: We consider neural networks because of their ability to model complex and non-linear patterns in data. We propose both MLP and LSTM neural networks to forecast EMS demand. The LSTM model can leverage the sequential nature of the problem and may be able to pick up patterns over time.
The neural networks are trained and tested with four different sets of features (inputs), see Table I. The basic feature set includes the hour of the day, day of the week, and month. These features are all one-hot encoded. The other feature sets are extensions of this basic set. The weather feature set includes rainfall and temperature measurements. For each hourly forecast, the entire day’s worth of weather data is included. The weather data is collected at 3-hour intervals, meaning each day has eight measurements of each weather type. Section V-E2 details how the weather data is collected. The date feature set includes the day of the month. The day of the month is one-hot encoded with 31 classes. We hope that the inclusion of the day of the month will allow the neural network to leverage special day effects. Finally, we consider a combined weather and date feature set.

Feature Set | Id | Features | Nodes
Basic | b | Hour, day of week, month | 43
Weather | w | Hour, day of week, month, precipitation, temperature | 59
Date | d | Hour, day of week, month, day | 74
Weather and date | w&d | Hour, day of week, month, day, precipitation, temperature | 90
Table I: Overview of the four different feature sets. The identifier of the feature set is used in the result tables. The nodes column states the number of feature nodes each feature set has.

For each of our three categories, we propose two different model types (MLP and LSTM) and four different feature sets. For each of these 3 · 2 · 4 = 24 models we cross-validate 2 · 7 = 14 configurations in the architecture validation phase: 2 configurations for the number of hidden layers (two or three layers) and 7 configurations for the number of nodes in each hidden layer (2, 4, 8, 16, 32, 64, or 128 nodes). This makes a total of 24 · 14 = 336 different architectures.
The distribution models are optimized using CCE loss and use the softmax activation function in the output layer to ensure that the sum of the outputs is one. Meanwhile, the volume and complete models are optimized using the MSE loss and have no activation function in the output layer.
2) Baselines: Our baselines are simple but specialized volume and distribution models that capture basic patterns in the data. They compute average values for disjoint groups induced by time or space. The baselines make forecasts by computing the average value of the relevant temporal (for volume forecasts) or spatial (for distribution forecasts) group, as detailed in Section V-C.

D. Online Forecasting Methods
We create online versions of our proposed offline forecasting methods to make the models dynamic and get the most out of the available data. We use a hybrid approach for the online models by first training them offline on the training set and then continuing with online learning on the validation/test⁴ set. Our online models extend their offline counterparts: we perform online learning after initial offline learning.
⁴ We use the terminology “validation/test” since the validation dataset is used for online learning in the validation phase while the test dataset is used in the test phase, as detailed in Section V-B.
1) Neural Networks: The pseudo-code in Figure 2 details how we perform online learning for our neural network models. We start off with a trained offline version of a model. Then, we lower the learning rate of the optimizer (from 0.001 to 0.0001) to mitigate the catastrophic forgetting problem by avoiding putting too much weight on new samples. We make
1: procedure F ORECAST O NLINE(model, inputs, targets)
2: all forecasts ← () . Empty sequence, initially
3: ` ← length(inputs)
4: for i = 0 to ` step 24 do . Use 24 inputs at a time Figure 3: Illustration of how the full dataset is split into
5: x ← inputs[i : i + 24] training, validation and test datasets.
6: y ← targets[i : i + 24]
7: z ← model.predict(x)
8: all forecasts.append all(z) First, in the architecture selection phase, we use cross-
9: model.train(x, y) . Train on new samples validation on the training set to select the best architecture for
10: end for each of our proposed neural networks. We use 5-fold cross-
11: return all forecasts validation for the MLP models and hold-out cross-validation
12: end procedure for the LSTM models, as they are sensitive to the sequential
Figure 2: Online forecasting algorithm. order of the time series. We test 14 different architectures
for each of 24 models (see Section IV-C) and pick the best
architecture for each model.
forecasts for each hour of the first day of the validation or test Next, in the validation phase, we train all of our proposed
set and store those forecasts in a vector as seen in lines 7 and models (8 neural networks in each of 3 categories + 3
8 in Figure 2. Next, in line 9, we train the model one epoch distribution baseline + 1 volume baseline, making a total of 28
on those 24 samples. We continue forecasting and training models) on the training set and evaluate them on the validation
one day at a time until we have made forecasts for the entire set, with and without online training. Because neural networks
validation set. These forecasts are then returned and are used have random initialization, we create 5 instances of each neural
to compute the error of the model. network model and train them using hold-out cross-validation.
Because the LSTM volume and distribution models are The instance with the lowest cross-validation error is selected,
trained on normalized outputs, we have to normalize the and only this instance is evaluated on the validation set. We
validation/test targets before feeding them into the online forecasting function. We also have to invert the normalization on the returned forecasts to transform the raw forecasts back to the original scale.

V. EXPERIMENTAL RESULTS

A. Computing Platforms

The models were implemented in Python 3.6.7 with the Keras 2.4.3, Pandas 1.1.2, GeoPandas 0.8.1, and Numpy 1.19.2 libraries. Default values have been used unless otherwise specified. We use Keras to implement ANN architectures to forecast the EMS demand. We use the RMSProp optimizer for all of the ANNs and the ReLU activation function in the hidden layers of the MLPs. For the LSTM volume and complete models, we normalize the output values of the training set using scikit-learn's MinMaxScaler to ensure efficient training.

B. Experimental Protocol: Data, Models, and Phases

The forecasting models are trained, validated, and tested in different phases with different datasets. Figure 3 shows how the EMS incident time series is split into disjoint training, validation, and test datasets. The training dataset is the largest, with just over 2.5 years of data (January 1st, 2015 – July 1st, 2017). The validation dataset is the smallest, with just over 0.5 years of data (July 2nd, 2017 – February 10th, 2018). The test dataset consists of 1 year's worth of data (February 11th, 2018 – February 11th, 2019).

Common to all three phases is the use of early stopping with patience 5 in the training of the neural network models. The maximum number of iterations is set to 100, but this limit is never reached.

[...] select the two best models within each of the 3 categories to proceed to the testing phase based on their performance on the validation set, giving 6 models to test (the bottom 6 rows in Table II).

Finally, in the testing phase, the 6 models selected in the validation phase and the 3 benchmark models (MEDIC, MLP, and All 0s from [10]) are trained anew on the combined training and validation set and tested on the test set, producing our final results. Here too, we create 5 instances of each neural network model and select the best according to hold-out cross-validation, which is then evaluated on the test set.

C. Baselines

1) Volume Baseline: The volume baseline, $\alpha_{1hr}$, makes forecasts based on the average number of incidents in each hour of the week. This captures the weekly seasonality of the EMS demand. By grouping the incidents in the training data by the day of week and hour of the day, $\alpha_{1hr}$ creates a total of 7 · 24 = 168 groups and computes the average for each, which is later used to make forecasts. For example, $\alpha_{1hr}$ forecasts 8.52 for Mondays (1am, 2am] because there was an average of 8.52 incidents occurring for that Monday time interval in the training data.

2) Distribution Baseline 1: The first distribution baseline, $\beta_{total}$, makes forecasts based on the spatial distribution of the incidents in the training set across all time steps. It calculates this distribution by summing the number of events per grid cell and then dividing by the total number of events:

$$\beta_{total}(t_c, t_h) = \frac{1}{\sum_{t=1}^{t_c} \sum_{i=1}^{M} (y_t)_i} \sum_{t=1}^{t_c} y_t. \qquad (5)$$
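
As a rough illustration of the two aggregation baselines introduced so far, the following dependency-free Python sketch averages hourly volumes per hour of the week (the volume baseline) and normalizes per-cell incident totals as in (5) (the first distribution baseline). It is a sketch under stated assumptions, not the implementation used in the experiments: the function names `fit_alpha_1hr` and `fit_beta_total`, the list-based data layout, and the toy data are all hypothetical, and the sketch assumes the training data contains at least one incident.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def fit_alpha_1hr(timestamps, volumes):
    """Average incident volume for each of the 7 * 24 = 168 hours of the week."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, volume in zip(timestamps, volumes):
        key = (ts.weekday(), ts.hour)  # (day of week, hour of day)
        sums[key] += volume
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def fit_beta_total(incident_counts):
    """Share of all training incidents that fell in each grid cell, as in (5)."""
    num_regions = len(incident_counts[0])
    per_cell = [0.0] * num_regions
    for y_t in incident_counts:
        for i, count in enumerate(y_t):
            per_cell[i] += count
    total = sum(per_cell)  # assumed to be non-zero in this sketch
    return [cell_sum / total for cell_sum in per_cell]


# Hypothetical toy data: two weeks of hourly counts over M = 3 grid cells.
start = datetime(2015, 1, 5)  # a Monday
timestamps = [start + timedelta(hours=h) for h in range(14 * 24)]
y = [[2, 1, 0] if ts.hour >= 8 else [1, 0, 0] for ts in timestamps]
volumes = [sum(y_t) for y_t in y]

alpha = fit_alpha_1hr(timestamps, volumes)  # one average per hour of the week
beta = fit_beta_total(y)                    # one share per grid cell
```

A forecast for a given hour would then look up `alpha[(weekday, hour)]` for the volume, while `beta` is returned unchanged for every time step.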
Here, $t_c$ is the current time step, $t_h > 0$ is the forecast time horizon, and $y_t \in \mathbb{R}^M$ is a vector of the number of incidents occurring in the $M$ regions at time $t$. Note that the forecast $\beta_{total}(t_c, t_h)$ does not depend on the time horizon $t_h$; $\beta_{total}$ forecasts the same distribution vector for all time steps.

3) Distribution Baseline 2: The second distribution baseline, $\beta_{8hr}$, creates 21 different distribution predictions: one for each 8hr bucket of the week. These are created much in the same way as $\beta_{total}$, but we group the training set by day of the week and 8hr bucket before averaging within each group. Similar to $\alpha_{1hr}$, it makes forecasts by outputting the average distribution of the group of the time step to forecast.

4) Distribution Baseline 3: The third distribution baseline, $\beta_{1hr}$, creates a distribution forecast for each hour of the week, similarly to $\beta_{8hr}$ but with 168 different values.

D. Benchmarks

We use the MEDIC, All 0s, and MLP models [10] as forecasting benchmarks. The models are in the complete category, but we can derive their implicit volume and distribution forecasts as shown in the example in Section IV-B.

1) MEDIC: We implement the MEDIC method; see (4). We have around four years' worth of data, while the original MEDIC method uses five years of data. Thus, we adapt the method by reducing the number of points included in the model to what is available. For the test set, forecasts before 07.01.2019 are made with 16 data points, while the forecasts for the next three weeks are made with 17, 18, and 19 points, respectively. The remaining forecasts use the original MEDIC method with 20 data points. For simplicity, we subtract 52 weeks to go one year back.

2) All Zeroes: The All 0s model forecasts zeroes for all grid cells at all times. While being an extremely simple method, it is often accurate for grid cells with low population density.

3) Setzler: We implement a complete MLP model [10], hereafter denoted as the Setzler model. It has a single hidden layer with four neurons that use the sigmoid activation function. The features consist of the one-hot encoded hour (0–23), day of the week (1–7), month (1–12), and season (0–3), making a total of 47 binary feature nodes. The output layer has one node for each spatial region and no activation function.

E. Data Collection and Preprocessing

1) Preprocessing: Before model training, we preprocess the raw EMS data. This includes removing incorrect, redundant, and irrelevant data and structuring it as a time series, which we denote as the filtered incident dataset. The filtered incident dataset used for learning is sorted by grid cell id and hourly intervals, starting at 00:00 on January 1st, 2015 up to and including 23:00 on February 11th, 2019. The result is a multivariate time series of hourly resolution with $M = 5569$ variables, one for each spatial region. We denote this time series as $y$, such that $y_t \in \mathbb{R}^M$ is a vector with the incidents in the $M = 5569$ regions at time step $t$, and $(y_t)_k \in \mathbb{R}$ is the number of incidents in region $k$ at time step $t$.

2) Weather Data Collection: We collect weather data to be used as features in some of our models, to investigate the impact of weather [15], [16], [10], [17]. Weather data has been collected through the Grid Time Series Data API provided by The Norwegian Water Resources and Energy Directorate (NVE).⁵ The API provides access to time series of weather data for grid cells. Several time series are available, such as daily temperature, precipitation, snow depth, and wind speed. We collected temperature and precipitation data at 3-hour intervals from 01.01.2015 00:00 up to and including 11.02.2019 21:00. For simplicity, and since weather does not vary much across Oslo, we collect weather data from a single grid cell. This is the cell in the city centre of Oslo with the most incidents (SSB id 22620006649000). We normalize the temperature and precipitation data using scikit-learn's MinMaxScaler fitted on the data belonging to the training set.

⁵ The API is available at http://api.nve.no/doc/gridtimeseries-data-gts. There is also an interactive application built on top of the API, which can be used to explore the data more intuitively at http://www.xgeo.no/.

F. Experimental Results

We emphasize two key results from the validation phase.⁶ First, every single model among the 28 models validated improved with online learning. Second, varying the feature sets did not influence the performance of the distribution and complete models much.

⁶ Hermansen's thesis [6] has further details and results from the validation and architecture selection phases.

We select the two best distribution, volume, and complete models from the validation phase to retrain and run on the test set, using the protocol presented in Section V-B. Next, the two distribution models are combined with each of the two volume models, making a total of four split models being tested. As only online models were selected, we omit making this explicit in the rest of this section for brevity.

Method                              MSE         MAE
All 0s                              0.0043334   0.0027283
MEDIC                               0.0040064   0.0046825
Setzler                             0.0038105   0.0049510
dist $\beta_{total}$ + vol MLPb     0.0038172   0.0048571
dist $\beta_{total}$ + vol LSTMw    0.0038179   0.0048555
dist LSTMw + vol LSTMw              0.0037996   0.0048874
dist LSTMw + vol MLPb               0.0037995   0.0048889
complete MLPb                       0.0038011   0.0044522
complete MLPw                       0.0037983   0.0044777

Table II: Test errors of the complete forecasts of the proposed and benchmark methods. Note that all models except All 0s, MEDIC and Setzler use online learning.

1) Complete Model Results: The final complete test results can be found in Table II. We see that the complete MLPw model achieves the lowest MSE, closely followed by the two split models with the internal LSTMw distribution model. The All 0s model has the lowest MAE. The Setzler method achieves a better MSE than the MEDIC method in our case, in contrast to the original study [10]. In the original study, the MEDIC method outperformed their proposed ANN method at
their smallest spatial and temporal levels of granularity (grid size of 2x2 miles ≈ 3.2x3.2 km and 1-hr time buckets).

2) Volume Model Results: The errors of the volume forecasts⁷ can be found in Table III. The dedicated volume models produce the best volume forecasts, followed by MEDIC, the complete neural networks, and finally, All 0s. Volume forecasts made on the first week of the test set are illustrated in Figure 4. We see that the dedicated volume models and the MEDIC models produce better volume forecasts than the complete neural network models, which tend to be too conservative. The complete neural networks do not capture the daily and weekly seasonality well; their volume predictions look very similar every day and do not capture the increase in incidents during working hours or Friday and Saturday nights.

⁷ The forecasting approach used here is as follows. Each 24-hour period, at midnight, a new forecast is created. The forecast is for each hour of the next 24-hour period. Further, we train the model on new data after every day. In other words, at the end of the day we train on today's 24 data points and predict the demand for the next 24 hours. For a time period that spans multiple days where forecasts are needed, as is the case here, the above 24-hour forecasting method is repeated for each day.

Method           MSE        MAE
All 0s           280.6049   15.1939
MEDIC            25.2394    3.8183
Setzler          33.8975    4.3933
complete MLPb    34.5490    4.4373
complete MLPw    35.2900    4.5135
volume LSTMw     22.9890    3.7027
volume MLPb      22.9108    3.6988

Table III: Test errors of the volume forecasts of the proposed and benchmark methods. Note that all models except All 0s, MEDIC and Setzler use online learning.

3) Distribution Model Results: The errors of the distribution forecasts are shown in Table IV. We see that the dedicated distribution models produce the best forecasts, followed by the complete neural network models, the All 0s method, and finally, the MEDIC method.

Method                  CCE
All 0s (even dist)      8.6249930
MEDIC                   10.487533
Setzler                 6.2471037
complete MLPb           6.0354133
complete MLPw           6.0745206
dist $\beta_{total}$    5.8713098
dist LSTMw              5.8967190

Table IV: Categorical cross-entropy errors of the distribution forecasts on the test set. A lower error is favorable. All models except All 0s, MEDIC and Setzler use online learning.

G. RQ1: How can EMS Demand in Oslo be Forecast Accurately at a High Spatio-Temporal Resolution?

We have tested a variety of feature sets, model types, and training regimes for predicting the EMS demand in Oslo at a fine spatial and temporal scale of 1x1km spatial regions and 1-hr time intervals. In this section, we discuss what we have found to be the best approaches for making accurate forecasts of the EMS demand.

1) Model Type: The $\beta_{total}$ model made the best distribution forecasts, as evident in Table IV. The table also suggests that the neural network models make fairly good spatial forecasts. MEDIC, on the other hand, makes poor distribution forecasts. This is because MEDIC makes forecasts as if an incident occurring increases the likelihood that an incident will occur in the same area again at the same time point in later weeks. However, the EMS incidents are mostly independent and stochastic, which means MEDIC often forecasts incidents that are unlikely to occur.

Surprisingly, $\beta_{total}$, being the simplest distribution model tested, makes the most accurate distribution forecasts. Intuitively, we expected the distribution of demand to show some pattern in time. For example, there might have been a more centralized distribution during working hours, while people might be drawn toward the fjord on a warm and sunny Sunday. Seeing how our best distribution model has captured no such variations, either there are too few variations in our data or our models are unable to learn them.

Zhou [17] finds, in contrast to us, significant variations in the density of ambulance demand with patterns in time and space for both Toronto and Melbourne, despite having just two years of data. This might be due to Melbourne and Toronto having higher population densities than Oslo and Akershus (see Section II-A). Indeed, Zhou [17] notes that patterns are most prominent in the most densely populated areas of the cities. Another possible explanation is that Zhou [17] uses a continuous spatial domain, while we use grid-mapped incident locations. This might make it harder for us to detect subtle differences in the underlying demand density function. Either way, it might be interesting to focus on the more densely populated areas of Oslo to see if we can detect some patterns. It might be worthwhile to use different models for different grid cells as Chen et al. [3] do. An LSTM model should be tested in such a case, as it seemed like the LSTM models were slightly better than the MLP models at forecasting EMS distributions.

The online volume MLPb model makes the best volume forecasts; see Table III. The volume MLP and LSTM models have similar architecture complexity and validation errors. However, since the LSTM model is significantly more complex than the MLP models without producing better results, we prefer the MLP model. All of the volume neural network models outperformed the simple $\alpha_{1hr}$ model, which indicates that there are patterns in time other than the weekly seasonality that the neural networks manage to leverage.

2) Input Features: Figure 4 suggests that the volume MLPb model is able to capture the weekly seasonality. This shows that the basic features are enough to capture the large-scale patterns of the EMS demand. The neural network models also seem to have leveraged some of the annual seasonality of the EMS demand since they outperformed $\alpha_{1hr}$, which naively models the weekly seasonality.

The models with the weather feature set perform reasonably well as shown in Tables II, III, and IV. The complete MLPw model is the best of the complete models while LSTMw is the
second-best volume and distribution model. This indicates that weather features have the potential of improving forecasts.

The Setzler model adds to our basic feature set the season of the year as a feature. We did not include season as a feature in any of our feature sets because it is implicitly covered by the month feature. The Setzler model makes slightly better volume forecasts but worse distribution and complete forecasts compared to our proposed complete MLP models. The fact that the Setzler model makes better volume forecasts than our complete models indicates that the inclusion of the season variable might be helpful.

Figure 4: Volume forecasts on the first week of the test set, starting on Sunday the 11th of February 2018. The complete neural network methods tend to underestimate demand volume, especially during working hours and Friday and Saturday nights.

3) Metrics: Clearly, the choice of performance metric influences the ranking of different methods in Table II, where the All 0s method achieves the lowest MAE although it has limited practical value. This happens because our data is very skewed towards zero, so most of the zero "forecasts" by All 0s are correct. When All 0s makes mistakes, they can be quite large; however, these are not punished harshly by MAE. The MSE punishes deviations more than MAE, which makes MSE a better metric in our case as we are most interested in the deviations, with zero incidents being dominant in most regions. However, the complete and (to a lesser degree) volume models tend to underestimate the EMS demand volume in Figure 4 even though they are optimized for MSE. A different approach may work better for our purposes. We could, for example, consider metrics that punish demand underestimation errors more than demand overestimation errors, or probabilistic forecasts. Such approaches could be argued to be appropriate for EMS since it is essential to have enough resources to respond to the demand, if at all possible.

H. RQ2: Is a Split Model or a Complete Model Better at Forecasting EMS Demand in Oslo?

The complete models achieve lower MSE on their complete forecasts than the split models, as seen in Table II. However, the differences are extremely small. On the other hand, the specialized volume and distribution models produce better volume and distribution forecasts, as seen in Tables III and IV. Overall, the results are perhaps not very surprising, as the models have been optimized for making either volume, distribution, or complete forecasts. However, it raises the question of which approach has the most practical value, as discussed in Section II-B2.

The complete models are essentially trying to solve a more difficult problem than the split models by forecasting the number and location of incidents simultaneously. Intuitively, it is easier to forecast the occurrence of ten incidents than to forecast the occurrence of ten incidents at certain exact locations. Because of the stochastic nature of the EMS demand, it is very hard to make accurate forecasts in both volume and location, causing the complete models to only forecast incidents that are very likely to happen. This often makes them underestimate the number of incidents, as suggested in Figure 4. This could perhaps partly be remedied by using a different approach as discussed in Section V-G3, but the nature of the complete problem may still cause the complete models to produce worse volume forecasts than split methods.

The distribution forecasts of the specialized distribution models are also significantly better than those of the other models, as demonstrated in Table IV.

The MEDIC model, although a complete model, makes fairly good volume forecasts. It does not suffer from the underestimation problem of the neural networks because it simply averages previous incidents instead of trying to minimize its MSE. While this method works well for the demand volume, it makes poor distribution forecasts.

We conclude that a split model may in our case produce more useful information for EMS providers than complete models, because split models make better volume and distribution forecasts.
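
To make the relationship between the two approaches concrete, the Python sketch below shows how a volume forecast and a distribution forecast can be integrated into a complete forecast, and how the implicit volume and distribution can be recovered from a complete forecast. It is a minimal illustration rather than the code used in the experiments. The function names are hypothetical, and the fallback to an even distribution when the forecast volume is zero is an assumption, suggested only by the "All 0s (even dist)" label in Table IV.

```python
def combine_split_forecast(volume, distribution):
    """Spread a forecast volume over the regions according to the distribution."""
    return [volume * share for share in distribution]


def decompose_complete_forecast(y_hat):
    """Recover the implicit volume and distribution of a complete forecast."""
    volume = sum(y_hat)
    if volume == 0:
        # Assumption: with no forecast incidents, fall back to an even distribution.
        return 0.0, [1.0 / len(y_hat)] * len(y_hat)
    return volume, [value / volume for value in y_hat]


# Hypothetical example with M = 3 regions.
complete = combine_split_forecast(8.0, [0.5, 0.25, 0.25])
volume, distribution = decompose_complete_forecast(complete)
```

Decomposing the combined forecast returns the original volume and distribution, which is what allows complete models such as MEDIC and Setzler to be scored on the volume and distribution metrics as well.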
VI. CONCLUSION AND FUTURE WORK

A. Conclusion

In this work, we propose and test a variety of models for forecasting the hourly EMS demand in 1x1km spatial regions in Oslo and Akershus, Norway. This problem is challenging because of the sparsity of EMS data at such a fine spatio-temporal resolution. We study two different approaches for EMS demand forecasting: (i) a complete approach that directly forecasts the incidents in each of the $M$ spatial regions, $y \in \mathbb{R}^M$, and (ii) a split approach that first forecasts the volume $\delta = \sum_{i=1}^{M} y_i$ and distribution $f = \frac{1}{\delta} y$ of the demand separately. Then, the two forecasts are integrated to get a complete forecast. In this setting, we consider two different types of neural networks (MLP and LSTM) with four different feature sets for forecasting the complete demand, the volume, and the distribution of EMS demand. In addition, we study several simple aggregation methods for forecasting EMS demand volume and distribution.

We conclude that split models, despite being discussed little in the literature, are better suited for modeling EMS demand than complete models. Among the different split models tested, we find that a simple aggregation model is the best at modeling the spatial distribution of the EMS demand in Oslo and Akershus. The demand volume is best captured by an MLP with two hidden layers of 32 nodes each, using hour, day of the week, and month as features.

We find that all models studied improved with online learning. We also find indications that weather may influence the spatial distribution of EMS demand somewhat. In contrast to previous research [10], [17], [15], we find that the impact seems relatively small.

B. Future Work

We now outline three areas of future research. First, in the EMS field, the goal is to always have enough resources to respond to the demands of the public. While this is hard to achieve in practice, it is clear that the volume models often underestimate demand too much, as evident in Figure 4. One should perhaps consider metrics that penalize underestimates of demand more than overestimates; see Section V-G3. Alternatively, probabilistic forecasts or forecasts with prediction intervals instead of point forecasts could be more useful to EMS providers. Second, the spatial forecasts might be improved by looking at the most populous regions separately. There may also be patterns in demand distribution linked to population movement, for which data could be obtained from cell phone providers or road sensors. However, the practical value of leveraging such patterns is uncertain as most of the regions with high population density are close to each other. Third, we believe that the forecasting of the EMS demand volume might be improved by including flags for special days such as New Year's Eve. There might be similar spikes in incidents related to concerts or other special events, which could impact the spatial distribution of incidents.

REFERENCES

[1] N. Channouf, P. L'Ecuyer, A. Ingolfsson, and A. Avramidis. The application of forecasting techniques to modeling emergency medical system calls in Calgary, Alberta. Health Care Management Science, 10:25–45, February 2007.
[2] A. Chen and T.-Y. Lu. A GIS-based demand forecast using machine learning for emergency medical services. In Proc. International Conference on Computing in Civil and Building Engineering, pages 1634–1641, June 2014.
[3] A. Y. Chen, T. Lu, M. H. Ma, and W. Sun. Demand forecast using data analytics for the preallocation of ambulances. IEEE Journal of Biomedical and Health Informatics, 20(4):1178–1187, 2016.
[4] E. Erkut, A. Ingolfsson, and G. Erdoğan. Ambulance location for maximum survival. Naval Research Logistics (NRL), 55(1):42–58, 2008.
[5] G. Grekousis and Y. Liu. Where will the next emergency event occur? Predicting ambulance demand in emergency medical services using artificial intelligence. Computers, Environment and Urban Systems, 76:110–122, 2019.
[6] A. Haugsbø Hermansen. Machine learning for spatio-temporal forecasting of ambulance demand. Master's thesis, Norwegian University of Science and Technology, 2021.
[7] H. Huang, M. Jiang, Z. Ding, and M. Zhou. Forecasting emergency calls with a Poisson neural network-based assemble model. IEEE Access, 7:18061–18069, 2019.
[8] A. X. Lin, A. F. W. Ho, K. H. Cheong, Z. Li, W. Cai, M. L. Chee, Y. Y. Ng, X. Xiao, and M. E. H. Ong. Leveraging machine learning techniques and engineering of multi-nature features for national daily regional ambulance demand prediction. International Journal of Environmental Research and Public Health, 17(11), 2020.
[9] D. S. Matteson, M. W. McLean, D. B. Woodard, and S. G. Henderson. Forecasting emergency medical service call arrival rates. The Annals of Applied Statistics, 5, 2011.
[10] H. Setzler, C. Saydam, and S. Park. EMS call volume predictions: A comparative study. Computers & Operations Research, 36(6):1843–1851, 2009.
[11] K. Steins, N. Matinrad, and T. Granberg. Forecasting the demand for emergency medical services. In Proc. of the 52nd Hawaii International Conference on System Science, pages 1855–1864, January 2019.
[12] G.-H. Strand and V. V. Holst Bloch. Statistical grids of Norway. Technical report, Statistics Norway, Department of Economic Statistics, 2009.
[13] J. E. Thornes, P. A. Fisher, T. Rayment-Bishop, and C. Smith. Ambulance call-outs and response times in Birmingham and the impact of extreme weather and climate change. Emergency Medicine Journal, 31(3):220–228, 2014.
[14] H. Wong and J.-J. Lin. The effects of weather on daily emergency ambulance service demand in Taipei: a comparison with Hong Kong. Theoretical and Applied Climatology, 141:321–330, July 2020.
[15] H.-T. Wong and P.-C. Lai. Weather inference and daily demand for emergency ambulance services. Emergency Medicine Journal (EMJ), 29:60–64, October 2010.
[16] H.-T. Wong and P.-C. Lai. Weather factors in the short-term forecasting of daily ambulance calls. International Journal of Biometeorology, 58, March 2013.
[17] Z. Zhou. Predicting Ambulance Demand. PhD thesis, Cornell University, 2015.
[18] Z. Zhou and D. S. Matteson. Predicting ambulance demand: A spatio-temporal kernel approach. In Proc. of the 21th International Conference on Knowledge Discovery and Data Mining, pages 2297–2303, August 2015.
[19] Z. Zhou and D. S. Matteson. Predicting Melbourne ambulance demand using kernel warping. Annals of Applied Statistics, 10(4):1977–1996, December 2016.
[20] Z. Zhou, D. S. Matteson, D. B. Woodard, S. G. Henderson, and A. C. Micheas. A spatio-temporal point process model for ambulance demand. Journal of the American Statistical Association, 110(509):6–15, 2015.
