Nhom4 Report
1 Introduction
Transportation is one of the essential needs of people all over the world. According
to data from the Union Internationale des Transports Publics, an average of 168
million passengers use metros every day to travel in 178 cities in 56 countries.
Thanks to transportation, the supply chain of goods and people's travel needs
are continuously met everywhere and at all times. However, the development of
any field brings both advantages and disadvantages. The explosion in the number
of personal vehicles, the increase in greenhouse gas emissions, traffic jams, and
the number of deaths due to traffic accidents are among the issues that have
deserved great attention in recent years. Public transport has been proposed over
the years as a solution to these problems. Nevertheless, public transport addresses
only a small part of them, for instance by reducing environmental pollution,
improving community health, relieving traffic congestion, and curbing the growth
in the number of private vehicles. In large cities, high population
densities and large populations lead to overloaded transport systems that cannot
keep up with rapid population growth. Policymakers and experts from different
fields have come together to discuss and propose solutions, among them the
implementation of intelligent transportation systems [1] with artificial intelligence
as the core technology.
With the rapid development of science and technology, many fields, including
transportation, benefit from the potential and readily available power of information
technology. Artificial intelligence, a branch of computer science, has achieved
many breakthroughs in recent years, notably the introduction of the deep learning
model AlexNet [2]. Artificial intelligence has now crept into every corner of life
and become indispensable in modern society. In addition, the explosion of large
amounts of data, from social media to specialized industries such as device data
and traffic data, poses a big challenge for the research community dealing with
big data. Toward the goal of developing smart cities [3–6], many countries have
digitized large amounts of data from many services and industries to save time,
reduce costs, and bring the best experience to their people. Traffic in smart cities,
moreover, is one of the most challenging problems to solve. In a smart city,
many sensors are placed at various locations to detect vehicles and serve data
analysis tasks such as predicting the average speed of vehicles, analyzing vehicle
traffic at multiple points in time, and forecasting demand in regions from past
data [7, 8].
Traffic prediction is a complex problem and a research topic that receives
a lot of attention from researchers. Currently, traffic prediction tasks such as
flow, speed, demand, travel time, and occupancy [7, 8] are the main problems
being discussed and addressed. In this paper, we present a state-of-the-art algorithm
used to predict the average speed of vehicles at different times based on historical
data. Evaluated with the standard measures for time-series regression, the model
achieved quite high results on data exhibiting the characteristic properties of
time series such as trend, seasonality, and stationarity. In addition, the data is
distributed and organized before being fed into a real-time processing model built
on the Apache Spark technology platform. In the remainder of this paper,
Section 2 reviews work on deep learning methods for traffic prediction and on
Apache Spark. The technical background is presented in Section 3. The experiments
and results are shown in Section 4. Section 5 concludes the paper and outlines
future work.
2 Related work
2.1 Deep Learning Approaches
A recurrent neural network with fully connected LSTM hidden units (FC-LSTM)
was designed by Sutskever et al. in 2014 [9]. In 2017, the diffusion convolutional
recurrent neural network, a model for spatiotemporal forecasting tasks, achieved
high results on the traffic prediction benchmark dataset [10]. In the same year, a
decentralized deep learning-based method based on the congestion state of the
neighboring stations was presented by Fouladgar et al. [11]. In 2018, Chen et al
3 Methodology

3.1 Big Data and Apache Spark

In today's Industry 4.0 era, IoT platforms, automation technology, and artificial
intelligence all need big data to create quality products with high performance.
Big data can be collected from many sources through the Internet, such as search
engines, social networking sites, e-commerce sites, IoT devices, and sensors. The
data is updated continuously, growing so huge that traditional tools struggle to
process it, since they cannot cope with the five properties of big data: volume,
variety, value, velocity, and veracity. To utilize such data, more and more tools
are being developed specifically for handling big data sources; among them,
Apache Spark is the framework this article employs to predict traffic.
3.2 Prophet
To build a strong framework for traffic congestion forecasting and analysis,
there must be a core method, a basic building block, dedicated to time series
modeling; thus we have to find a suitable method that can generate high-quality
time series forecasts. Countless tools and methods could be applied to our
framework; however, our main purpose in building this framework is to forecast
at scale, which is why Prophet is chosen among all existing methods, not to
mention its reliability and its capability to generate decent-quality forecasts.
Prophet is open source software released by Facebook's Core Data Science
team. As stated in Taylor and Letham's research [32], Prophet is used in many
applications across Facebook for producing reliable forecasts and performs
better than any other approach in the majority of cases. Prophet is a full-fledged
procedure providing an easy-to-tune, fast-fitting model. It is based on an
additive model in which non-linear trends are fit with yearly, weekly, and daily
seasonality, plus holiday effects. It works best with time series that have strong
seasonal effects and several seasons of historical data. Prophet is robust to missing
data and shifts in the trend, and typically handles outliers well.
Although Prophet has been described as a good model for forecasting sales,
or a forecasting model designed to handle common features of business time series,
we implement it as a general forecasting model that can perform well on all
aspects of time series forecasting, especially in our case: velocity forecasting.
In this paper, Prophet is used to model velocities captured by sensors in the Bay
Area over the first three months of data, then to produce forecasts for the next
week (as an ideal forecasting framework would) or for the whole last three
months (for analyzing model errors).
4 Experiments
4.1 Dataset
Regarding Prophet training parameters, since our training data spans 3 months
and does not cover a whole year, we set yearly_seasonality = False,
weekly_seasonality = True, and daily_seasonality = True.
4.3 Metrics
Suppose $y = y_1, y_2, y_3, \dots$ represents the ground truth, $\hat{y} = \hat{y}_1, \hat{y}_2, \hat{y}_3, \dots$ represents
the predicted values, and $n$ denotes the number of observed samples. The metrics
are defined as follows.
Mean Absolute Error (MAE):
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
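Both metrics used in the results section can be computed in a few lines of plain Python; the RMSE definition below is the standard one (the excerpt does not restate it, but RMSE is reported alongside MAE in the results).

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |yhat_i - y_i|."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the average squared error."""
    return math.sqrt(
        sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )
```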
4.4 Result
Weekly and daily seasonality are first extracted from the model to analyze
congestion time within a week and within a day, as shown in Figure 4 and Figure
5. Throughout a week, heavy traffic occurs on Tuesdays, Wednesdays, and
Thursdays, and on a daily basis from 7 AM to 10 AM.
Next, we evaluate MAE and RMSE when testing on the first week after the
training data: the training data is taken from January 1st to March 31st, so the
test week is the first week of April. We obtained an MAE of 3.64 (see Table 1),
which is better than the 5.02 obtained when testing on the whole 3 months.
The same performance difference applies to RMSE. The main reason for this
MAE increase is the unreliability of relying on an outdated model.
Looking closely at Figure 6, since the model uses data from the first 3 months
and is not updated or retrained on new data, it becomes inaccurate when
predicting far into the future.
Fig. 6: Absolute error tends to increase over time if the model is kept outdated
Following this observation, we retrained the model every week on the previous
3 months of data (for example, if the current time is in week n, its model is
trained on data from week n-13 to week n-1). We then calculated weekly MAE
for both methods, using the old model versus retraining every week, and compare
them in Figure 7. It turns out that MAE stays below 4.5 when the model is
updated every week, unlike the old model, whose MAE can keep growing
indefinitely as it becomes less and less accurate.
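The rolling window just described can be sketched as a small helper: for a week starting on some Monday, the training window covers the 13 preceding weeks (roughly 3 months). The function name is hypothetical, not part of the paper's code.

```python
from datetime import date, timedelta

def training_window(current_monday: date):
    """Return (start, end) of the 13-week training window preceding
    the week that starts on `current_monday`: week n trains on data
    from week n-13 through week n-1."""
    end = current_monday - timedelta(days=1)      # Sunday of week n-1
    start = current_monday - timedelta(weeks=13)  # Monday of week n-13
    return start, end
```

For example, a model for the week of April 3, 2017 would train on data from January 2 through April 2.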
Fig. 7: Weekly MAE comparison between a model trained once on the old dataset
and a model updated to include new data
4.5 Discussion
As seen in the result section, and in Figure 8, high congestion happens from 7 AM
to 10 AM. Surprisingly, high errors occur at the same time as congestion,
as shown in Figure 9. This leads to two explanations:
– During congestion, velocity values may vary from 0 (a stopped vehicle
waiting for the vehicle in front of it to move) to high velocity (when vehicles
have some space to move forward), so a sensor may well capture a velocity
far from the average velocity around that specific time. Therefore, during
congestion, after every 5-minute time window, the currently captured velocity
can fluctuate far above or below the previous value, making the whole
ground-truth velocity flow unexpectedly inconsistent. The predicted velocity
flow, by contrast, is much smoother, making the errors at congestion time
higher than normal.
The statement that high errors happen at the same time as congestion again
proves true from a week-long perspective. Figure 10 clearly shows how congestion
affects errors, raising them in mid-week and lowering them on the weekend.
Interestingly, the left plot of Figure 10 is a condensed view of Figure 9.
5 Conclusion
5.1 How our framework works in action
This is where Spark Streaming comes into play. Since our work focuses solely
on analyzing data, we decided to stick with the PEMS-BAY 2017 dataset to
demonstrate how our framework helps users in practice, which also means we are
using simulated streaming rather than actual streaming. Our streaming framework,
however, needs only a little tweaking to work with real-time data received from
the California Department of Transportation. Every 5 minutes, the framework
receives new velocities from all 325 sensors in the Bay Area; it writes the
velocity data and produces congestion predictions for the whole map for the next
5, 30, and 90 minutes (see Figure 11).
– The written data is stored to be ready for retraining the model every Monday.
– Congestion is presented in 5 different levels from lowest to highest: the
smallest red points mark the lowest congestion and the largest red points the
highest. Note that these 5 levels are divided by the 4 quantiles 0.2, 0.4, 0.6,
and 0.8, not by a specific "congestion threshold" that determines which places
are congested. This way, users can choose the best routes to avoid congestion,
or simply the routes with the "least congestion" even outside congestion hours.
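The quantile-based scale above can be sketched as follows, assuming that a lower predicted velocity means higher congestion (so the slowest 20% of readings fall in level 5). The function and its quantile approximation are illustrative, not the paper's exact implementation.

```python
def congestion_levels(velocities):
    """Map each predicted velocity to a congestion level from
    1 (lowest congestion) to 5 (highest), splitting the batch at its
    0.2 / 0.4 / 0.6 / 0.8 quantiles. Lower velocity -> higher level."""
    s = sorted(velocities)
    n = len(s)
    # Nearest-rank quantile cut points (a simple approximation).
    cuts = [s[int(q * (n - 1))] for q in (0.2, 0.4, 0.6, 0.8)]
    # A velocity at or below all four cuts is among the slowest 20%.
    return [1 + sum(v <= c for c in cuts) for v in velocities]
```

For instance, with velocities [10, 20, 30, 40, 50], the slowest sensor gets level 5 and the fastest gets level 1.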
Fig. 11: Predicted future congestion across all sensors in Bay Area at a specific
time
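The 5-minute loop that drives the framework can be summarized in a small, Spark-agnostic sketch: each batch of sensor readings is persisted (for the Monday retraining) and forecasts are produced for the three horizons. All names here are hypothetical placeholders; the actual implementation runs on Spark Streaming.

```python
def process_batch(velocities, store, predict):
    """Handle one 5-minute batch of readings from the 325 sensors:
    persist the raw data, then forecast 5, 30, and 90 minutes ahead.

    velocities: mapping of sensor id -> current velocity
    store:      accumulated history, kept for weekly retraining
    predict:    callable (history, horizon_minutes) -> forecast
    """
    store.append(velocities)  # written data, ready for Monday retraining
    horizons_min = (5, 30, 90)
    return {h: predict(store, h) for h in horizons_min}
```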
References
1. H. Makino, K. Tamada, K. Sakai, and S. Kamijo, “Solutions for urban
traffic issues by ITS technologies”, IATSS Research 42, 49–60 (2018).
2. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks”, Advances in Neural Information
Processing Systems 25, 1097–1105 (2012).