Nhom4 Report
1 Introduction
Transportation is one of the essential needs of people all over the world. According
to data from the Union Internationale des Transports Publics, an average of 168
million passengers use metros every day to travel in 178 cities in 56 countries.
Thanks to transportation, the supply chain of goods and people's travel needs
are continuously met everywhere and at all times. However, the development of
any field brings both advantages and disadvantages. The explosion in the number
of personal vehicles, the increase in greenhouse gas emissions, traffic jams, and
the number of deaths due to traffic accidents are among the issues that have
deserved great attention in recent years. Public transport has been proposed over
the years as a solution to these problems. Nevertheless, public transport addresses
only a small part of them, for instance by reducing environmental pollution,
improving community health, relieving traffic congestion, and curbing the growth
in the number of private vehicles. In large cities, high population
densities and large populations lead to overloaded transport systems that cannot
keep up with rapid population growth. Policymakers and experts from different
fields have come together to discuss and propose solutions, among them the
implementation of intelligent transportation systems [1] with artificial intelligence
as the core technology.
With the rapid development of science and technology, many fields, including
transportation, benefit from the potential and readily available power of information
technology. Artificial intelligence, a branch of computer science, has achieved
many breakthroughs in recent years, notably the introduction of the deep learning
model AlexNet [2]. Artificial intelligence has now crept into every corner of life
and become indispensable in modern society. In addition, the explosion of large
amounts of data, from social media to specialized industries such as device data
and traffic data, poses a big challenge for the research community dealing with
big data. Toward the goal of developing smart cities [3–6], many countries have
digitized large amounts of data from many services and industries to save time,
reduce costs, and bring the best experience to their people. Traffic in smart cities,
moreover, is one of the most challenging problems to solve. In a smart city,
many sensors are placed at various locations to detect vehicles and serve data
analysis tasks such as predicting the average speed of vehicles, analyzing vehicle
traffic at multiple points in time, and forecasting demand in regions from past
data [7, 8].
Traffic prediction is a complex problem and a research topic that receives
a lot of attention from researchers. Currently, traffic prediction tasks such as
flow, speed, demand, travel time, and occupancy [7, 8] are the main problems
being discussed and addressed. In this paper, we present a state-of-the-art algorithm
used to predict the average speed of vehicles at different times based on historical
data. Evaluated with the standard measures for time-series regression, the model
achieved quite high results on data exhibiting the characteristic properties of
time series such as trend, seasonality, and stationarity. In addition, the data is
distributed and organized before being fed into a real-time processing model built
on the Apache Spark technology platform. In the remainder of this paper,
Section 2 reviews work on deep learning methods for traffic prediction and on
Apache Spark. The technical background is presented in Section 3. The experiments
and results are shown in Section 4. Section 5 concludes the paper and outlines
future work.
2 Related work
2.1 Deep Learning Approaches
A recurrent neural network with fully connected LSTM hidden units (FC-LSTM)
was designed by Sutskever et al. in 2014 [9]. In 2017, the diffusion convolutional
recurrent neural network, a model for spatiotemporal forecasting tasks, achieved
high results on the traffic prediction benchmark dataset [10]. In the same year, a
decentralized deep learning-based method based on the congestion state of the
neighboring stations was presented by Fouladgar et al. [11]. In 2018, Chen et al
3 Methodology

3.1 Big Data and Apache Spark

In today's Industry 4.0 era, IoT platforms, automation technology, and artificial
intelligence all need big data to create quality products with high performance.
Big data can be collected from many sources through the Internet, such as search
engines, social networking sites, e-commerce sites, IoT devices, and sensors. The
data is updated continuously, growing so huge that traditional tools struggle to
process it, since they cannot cope with the five properties of big data: volume,
variety, value, velocity, and veracity. To utilize such data, more and more tools
are being developed specifically for handling big data sources; among them,
Apache Spark is the framework this article employs to predict traffic.
3.2 Prophet
To build a strong framework for traffic congestion forecasting and analysis,
there must be a core method, a basic building block, dedicated to time series
modeling; thus we have to find a suitable method that can generate high-quality
time series forecasts. Countless tools and methods could be applied to our
framework; however, our main purpose in building this framework is to forecast
at scale, which is why Prophet is chosen among all existing methods, not to
mention its reliability and its capability to generate decent-quality forecasts.
Prophet is open source software released by Facebook's Core Data Science
team. As stated in Taylor and Letham's research [32], Prophet is used in many
applications across Facebook for producing reliable forecasts and performs
better than any other approach in the majority of cases. Prophet is a full-fledged
procedure providing an easy-to-tune, fast-fitting model. It is based on an
additive model in which non-linear trends are fit with yearly, weekly, and daily
seasonality, plus holiday effects. It works best with time series that have strong
seasonal effects and several seasons of historical data. Prophet is robust to missing
data and shifts in the trend, and typically handles outliers well.
Although Prophet has been described as a good model for forecasting sales,
or a forecasting model designed to handle common features of business time series,
we implement it as a general forecasting model that can perform well on all
aspects of time series forecasting, especially in our case: velocity forecasting.
In this paper, Prophet is used to model velocities captured by sensors in the Bay
Area over the first three months of data, then to produce forecasts for the next
week (as an ideal forecasting framework would) or for the whole last three
months (for analyzing model errors).
4 Experiments
4.1 Dataset
Regarding Prophet training parameters, since our training data spans 3 months
and does not cover a whole year, we set yearly_seasonality = False,
weekly_seasonality = True, and daily_seasonality = True.
4.3 Metrics
Suppose $y = y_1, y_2, y_3, \dots$ represents the ground truth, $\hat{y} = \hat{y}_1, \hat{y}_2, \hat{y}_3, \dots$ represents
the predicted values, and $n$ denotes the number of observed samples. The metrics
are defined as follows.
Mean Absolute Error (MAE):
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
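Both metrics used in the results section can be computed in a few lines of plain Python; the RMSE definition below is the standard one (the excerpt does not restate it, but RMSE is reported alongside MAE in the results).

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |yhat_i - y_i|."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the average squared error."""
    return math.sqrt(
        sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )
```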
4.4 Result
Weekly and daily seasonality are first extracted from the model to analyze
congestion time within a week and within a day, as shown in Figure 4 and Figure
5. Throughout a week, heavy traffic occurs on Tuesdays, Wednesdays, and
Thursdays, and on a daily basis from 7 AM to 10 AM.
Next, we evaluate MAE and RMSE when testing on the first week after the
training data: the training data is taken from January 1st to March 31st, so the
test week is the first week of April. We obtained an MAE of 3.64 (see Table 1),
which is better than the 5.02 obtained when testing on the whole 3 months.
The same performance difference applies to RMSE. The main reason for this
MAE increase is the unreliability of relying on an outdated model.
Looking closely at Figure 6, since the model uses data from the first 3 months
and is not updated or retrained on new data, it becomes inaccurate when
predicting far into the future.
Fig. 6: Absolute error tends to increase over time if the model is kept outdated
Following this observation, we retrained the model every week on the previous
3 months of data (for example, if the current time is in week n, its model is
trained on data from week n-13 to week n-1). We then calculated weekly MAE
for both methods, using the old model versus retraining every week, and compare
them in Figure 7. It turns out that MAE stays below 4.5 when the model is
updated every week, unlike the old model, whose MAE can keep growing
indefinitely as it becomes less and less accurate.
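The rolling window just described can be sketched as a small helper: for a week starting on some Monday, the training window covers the 13 preceding weeks (roughly 3 months). The function name is hypothetical, not part of the paper's code.

```python
from datetime import date, timedelta

def training_window(current_monday: date):
    """Return (start, end) of the 13-week training window preceding
    the week that starts on `current_monday`: week n trains on data
    from week n-13 through week n-1."""
    end = current_monday - timedelta(days=1)      # Sunday of week n-1
    start = current_monday - timedelta(weeks=13)  # Monday of week n-13
    return start, end
```

For example, a model for the week of April 3, 2017 would train on data from January 2 through April 2.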
Fig. 7: Weekly MAE comparison between a model trained once on the old dataset
and a model updated to include new data
4.5 Discussion
As seen in the result section, and in Figure 8, high congestion happens from 7 AM
to 10 AM. Surprisingly, high errors occur at the same time as congestion,
as shown in Figure 9. This leads to two explanations:
– During congestion, velocity values may vary from 0 (a stopped vehicle
waiting for the vehicle in front of it to move) to high velocity (when vehicles
have some space to move forward), so a sensor may well capture a velocity
far from the average velocity around that specific time. Therefore, during
congestion, after every 5-minute time window, the currently captured velocity
can fluctuate far above or below the previous value, making the whole
ground-truth velocity flow unexpectedly inconsistent. The predicted velocity
flow, by contrast, is much smoother, making the errors at congestion time
higher than normal.
The statement that high errors happen at the same time as congestion again
proves true from a week-long perspective. Figure 10 clearly shows how congestion
affects errors, raising them in mid-week and lowering them on the weekend.
Interestingly, the left plot of Figure 10 is a condensed view of Figure 9.
5 Conclusion
5.1 How our framework works in action
This is where Spark Streaming comes into play. Since our work focuses solely
on analyzing data, we decided to stick with the PEMS-BAY 2017 dataset to
demonstrate how our framework helps users in practice, which also means we are
using simulated streaming rather than actual streaming. Our streaming framework,
however, needs only a little tweaking to work with real-time data received from
the California Department of Transportation. Every 5 minutes, the framework
receives new velocities from all 325 sensors in the Bay Area; it writes the
velocity data and produces congestion predictions for the whole map for the next
5, 30, and 90 minutes (see Figure 11).
– The written data is stored to be ready for retraining the model every Monday.
– Congestion is presented in 5 different levels from lowest to highest: the
smallest red points mark the lowest congestion and the largest red points the
highest. Note that these 5 levels are divided by the 4 quantiles 0.2, 0.4, 0.6,
and 0.8, not by a specific "congestion threshold" that determines which places
are congested. This way, users can choose the best routes to avoid congestion,
or simply the routes with the "least congestion" even outside congestion hours.
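The quantile-based scale above can be sketched as follows, assuming that a lower predicted velocity means higher congestion (so the slowest 20% of readings fall in level 5). The function and its quantile approximation are illustrative, not the paper's exact implementation.

```python
def congestion_levels(velocities):
    """Map each predicted velocity to a congestion level from
    1 (lowest congestion) to 5 (highest), splitting the batch at its
    0.2 / 0.4 / 0.6 / 0.8 quantiles. Lower velocity -> higher level."""
    s = sorted(velocities)
    n = len(s)
    # Nearest-rank quantile cut points (a simple approximation).
    cuts = [s[int(q * (n - 1))] for q in (0.2, 0.4, 0.6, 0.8)]
    # A velocity at or below all four cuts is among the slowest 20%.
    return [1 + sum(v <= c for c in cuts) for v in velocities]
```

For instance, with velocities [10, 20, 30, 40, 50], the slowest sensor gets level 5 and the fastest gets level 1.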
Fig. 11: Predicted future congestion across all sensors in Bay Area at a specific
time
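The 5-minute loop that drives the framework can be summarized in a small, Spark-agnostic sketch: each batch of sensor readings is persisted (for the Monday retraining) and forecasts are produced for the three horizons. All names here are hypothetical placeholders; the actual implementation runs on Spark Streaming.

```python
def process_batch(velocities, store, predict):
    """Handle one 5-minute batch of readings from the 325 sensors:
    persist the raw data, then forecast 5, 30, and 90 minutes ahead.

    velocities: mapping of sensor id -> current velocity
    store:      accumulated history, kept for weekly retraining
    predict:    callable (history, horizon_minutes) -> forecast
    """
    store.append(velocities)  # written data, ready for Monday retraining
    horizons_min = (5, 30, 90)
    return {h: predict(store, h) for h in horizons_min}
```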
References
1. H. Makino, K. Tamada, K. Sakai, and S. Kamijo, “Solutions for urban
traffic issues by ITS technologies”, IATSS Research 42, 49–60 (2018).
2. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks”, Advances in Neural Information
Processing Systems 25, 1097–1105 (2012).