
Computer network traffic prediction: a comparison between traditional and deep learning neural networks

Tiago Prado Oliveira*, Jamil Salem Barbar and Alexsandro Santos Soares
Federal University of Uberlândia (UFU),
Faculty of Computer Science (FACOM),
Uberlândia, MG, Brazil
Email: tiago prado@comp.ufu.br
Email: jamil@facom.ufu.br
Email: alex@facom.ufu.br
*Corresponding author

Abstract: This paper compares four different artificial neural network approaches for computer network traffic forecasting: 1) the multilayer perceptron (MLP) using backpropagation as the training algorithm; 2) the MLP with resilient backpropagation (Rprop); 3) the recurrent neural network (RNN); 4) the deep learning stacked autoencoder (SAE). The computer network traffic is sampled from the traffic of the network devices that are connected to the internet. It is shown herein how simpler neural network models, such as the RNN and MLP, can work even better than a more complex model, such as the SAE. Internet traffic prediction is an important task for many applications, such as adaptive applications, congestion control, admission control, anomaly detection and bandwidth allocation. In addition, efficient methods of resource management, such as of the bandwidth, can be used to gain performance and reduce costs, improving the quality of service (QoS). The popularity of the newest deep learning methods has been increasing in several areas, but there is a lack of studies concerning time series prediction, such as internet traffic.

Keywords: deep learning; internet traffic; neural network; prediction; stacked autoencoder; SAE;
time series.

Reference to this paper should be made as follows: Oliveira, T.P., Barbar, J.S. and Soares, A.S.
(2016) ‘Computer network traffic prediction: a comparison between traditional and deep learning
neural networks’, Int. J. Big Data Intelligence, Vol. 3, No. 1, pp.28–37.

Biographical notes: Tiago Prado Oliveira graduated in Computer Science from Federal
University of Uberlândia (UFU), Uberlândia City, Brazil in 2012. He received his Masters in
Computer Science from UFU in 2014. He has experience in computer science in the following
areas: computer systems, computer network management, artificial intelligence, artificial neural
networks, deep learning and data prediction.

Biographical notes: Jamil Salem Barbar graduated in Electrical Engineering from the Federal University of Uberlândia in 1985. He received his MSc in Electronics and Computer Engineering from Instituto Tecnológico de Aeronáutica (ITA) in 1990 and his PhD in Electrical and Computer Engineering from ITA (1998). Currently, he is a Professor at UFU. He has experience in computer science, working on the following subjects: computer networks, QoS, wavelets, multimedia systems, computer forensics and peer-to-peer. He is a member of the informatics subject area group of the Tuning Latin America project, which comprises 12 Latin American countries.

Alexsandro Santos Soares graduated in Electrical Engineering from the Federal University of Uberlândia in 1996. He received his Masters in Computer Science and Computational Mathematics from the University of São Paulo (USP) in 2001 and his Doctorate in Electrical Engineering from UFU (2007). Currently, he is a Professor at UFU. He has experience in artificial intelligence, mainly in the following areas: natural computing, bioinformatics, artificial neural networks and evolutionary computation.

Copyright © 2016 Inderscience Enterprises Ltd.


This paper is a revised and expanded version of a paper entitled ‘Multilayer perceptron and stacked autoencoder for internet traffic prediction’ presented at the 11th IFIP International Conference on Network and Parallel Computing (NPC 2014), Ilan, Taiwan, September 18–20, 2014.

1 Introduction

Using past observations to predict future network traffic is an important step to understand and control a computer network. Computer network traffic prediction can be crucial to network providers and computer network management in general. It is of significant interest in several domains, such as adaptive applications, congestion control, admission control and bandwidth allocation.

There are many studies that focus on adaptive and dynamic applications. They usually present algorithms that use the traffic load to dynamically adapt the bandwidth of a certain network component (Han, 2014; Zhao et al., 2012; Liang and Han, 2007) and improve the quality of service (QoS) (Nguyen et al., 2009). Several works have been developed using artificial neural networks (ANNs) and they have shown that ANNs are a competitive model, overcoming classical regression methods such as the autoregressive integrated moving average (ARIMA) (Cortez et al., 2012; Hallas and Dorffner, 1998; Ding et al., 1995; Feng and Shu, 2005). Thus, there are works that combine these two factors, therefore producing a predictive neural network to dynamically allocate the bandwidth in real-time video streams (Liang and Han, 2007).

Network traffic is a time series, which is a sequence of data regularly measured at uniform time intervals. For network traffic, these sequential data are the bits transmitted by some network device at a certain period in time. A time series can be a stochastic process or a deterministic one. To predict a time series, it is necessary to use mathematical models that truly represent the statistical characteristics of the sampled traffic. For adaptive applications that require real-time processing, the choice of the prediction method must take into account the prediction horizon, computational cost, prediction error and response time.

This paper analyses four prediction methods that are based on ANNs. Evaluations were made comparing the multilayer perceptron (MLP), recurrent neural network (RNN) and stacked autoencoder (SAE). The MLP is a feed-forward neural network with multiple layers that uses supervised training. The SAE is a deep learning neural network that uses a greedy algorithm for unsupervised training. For the MLP, two different training algorithms were compared: the standard backpropagation and the resilient backpropagation (Rprop).

These models were selected with the objective of confirming how competitive the simpler approaches (RNN and MLP) are compared to the more complex ones (SAE and deeper MLP). The analysis focuses on short-term and real-time prediction, and the tests were made using samples of internet traffic time series, which were obtained from the DataMarket database (http://datamarket.com).

2 Artificial neural networks

ANNs are simple processing structures, which are separated into strongly connected units called artificial neurons (nodes). Neurons are organised into layers; one layer has multiple neurons and any one neural network can have one or more layers, which are defined by the network topology and vary among different network models (Haykin, 1998).

Neurons are capable of working in parallel to process data, store experimental knowledge and use this knowledge to infer new data. Each neuron has a synaptic weight, which is responsible for storing the acquired knowledge. Network knowledge is acquired through learning processes (learning algorithm or network training) (Haykin, 1998). In the learning process, the neural network is trained to recognise and differentiate the data from a finite set. After learning, the ANN is ready to recognise the patterns in a time series, for example. During the learning process the synaptic weights are modified in an ordered manner until they reach the desired learning. A neural network offers the same functionality as neurons in a human brain for resolving complex problems, such as nonlinearity, high parallelism, robustness, fault and noise tolerance, adaptability, learning and generalisation (Cortez et al., 2012; Haykin, 1998).
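As an illustration of the neuron model just described, the following sketch (written in Java, the platform later used for the Encog experiments, but independent of any library; all names and values are ours) computes the output of a single neuron as the weighted sum of its inputs passed through a sigmoid activation function.

    // Minimal sketch of one artificial neuron: weighted sum of the inputs
    // followed by a sigmoid activation (illustrative only).
    final class SigmoidNeuron {
        private final double[] weights;  // one synaptic weight per input
        private final double bias;

        SigmoidNeuron(double[] weights, double bias) {
            this.weights = weights;
            this.bias = bias;
        }

        double activate(double[] inputs) {
            double sum = bias;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * inputs[i];  // accumulate the weighted inputs
            }
            return 1.0 / (1.0 + Math.exp(-sum));  // sigmoid activation
        }
    }

The synaptic weights in this sketch are the values that the learning process described above adjusts.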
Historically, the use of neural networks was limited in relation to the number of hidden layers. Neural networks made up of various layers were not used due to the difficulty in training them (Bengio, 2009). However, in 2006, Hinton presented the deep belief networks (DBN), with an efficient training method based on a greedy learning algorithm, which trains one layer at a time (Hinton et al., 2006). Since then, studies have encountered several sets of good results regarding the use of deep learning neural networks. In light of these findings, this study has as its objective to use the deep learning concept in traffic prediction.

Deep learning refers to a machine learning method that is based on a neural network model with multiple levels of data representation. Hierarchical levels of representation are organised by abstractions, features or concepts. The higher levels are defined by the lower levels, where the representation of the low levels may define several different features of the high levels; this makes the data representation more abstract and nonlinear for the higher levels (Bengio, 2009; Hinton et al., 2006). These hierarchical levels are represented by the layers of the ANN.

Deep learning ANNs allow the addition of significant complexity to the prediction model. This complexity is proportional to the number of layers that the neural network has; the abstraction of the features is
more complex for deeper layers. In that way, the depth of a neural network concerns the number of levels of composition of nonlinear operations learned from the trained data, i.e., the more layers, the more nonlinear and deeper the ANN.

The main difficulty in using deep neural networks relates to the training phase. Conventional algorithms, like backpropagation, do not perform well when the neural network has more than three hidden layers (Erhan et al., 2009). Furthermore, these conventional algorithms do not optimise the use of more layers and they do not distinguish the data characteristics hierarchically, i.e., a neural network with many layers does not have a better result than that of a neural network with few layers, e.g., a shallow neural network with two or three layers (de Villiers and Barnard, 1993; Hornik et al., 1989).

3 Review of literature

Several types of ANN have been studied for network traffic prediction. There are several studies on feed-forward neural networks, such as the MLP (Oliveira et al., 2014; Cortez et al., 2012; Ding et al., 1995), but many studies aim at RNNs (Hallas and Dorffner, 1998) because of the internal memory cycles that they have, facilitating the learning of temporal and sequential dynamical behaviour and making them a good model for time series learning.

An advantage of ANNs is the response time, i.e., how fast the prediction of future values is made. After the learning process, which is the slowest step in the use of an ANN, the neural network is ready for use, obtaining results very quickly compared to other more complex prediction models such as ARIMA (Cortez et al., 2012; Feng and Shu, 2005). Therefore, ANNs are very good for on-line prediction, obtaining a satisfactory result regarding prediction accuracy and response time (Oliveira et al., 2014; Cortez et al., 2012).

3.1 MLP and backpropagation

One of the commonest architectures for neural networks is the MLP. This kind of ANN has one input layer, one or more hidden layers, and an output layer. Best practice suggests one or two hidden layers (de Villiers and Barnard, 1993). This is due to the fact that the same result can be obtained by raising the number of neurons in the hidden layer, rather than increasing the number of hidden layers (Hornik et al., 1989).

MLPs are feed-forward networks, where all neurons in one layer are connected to all neurons of the next layer, yet the neurons in the same layer are not connected to each other. It is called feed-forward because the flow of information goes from the input layer to the output layer. The training algorithm used for the MLP is backpropagation, which is a supervised learning algorithm, where the MLP learns a desired output from various entry data.
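The layer-by-layer flow of information described above can be sketched as follows (a library-independent Java fragment; the weight layout and names are illustrative assumptions of ours, not the Encog API).

    // Illustrative feed-forward pass through fully connected layers.
    // weights[l][j][i] is the weight from neuron i of layer l to neuron j of layer l + 1.
    final class FeedForward {
        static double[] propagate(double[] input, double[][][] weights, double[][] biases) {
            double[] activation = input;
            for (int l = 0; l < weights.length; l++) {
                double[] next = new double[weights[l].length];
                for (int j = 0; j < next.length; j++) {
                    double sum = biases[l][j];
                    for (int i = 0; i < activation.length; i++) {
                        sum += weights[l][j][i] * activation[i];
                    }
                    next[j] = 1.0 / (1.0 + Math.exp(-sum));  // sigmoid activation
                }
                activation = next;  // the output of this layer feeds the next one
            }
            return activation;  // activations of the output layer
        }
    }

Backpropagation then adjusts each weights[l][j][i] in the opposite direction, propagating the output error from the output layer back towards the input layer.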
The backpropagation algorithm usually suffers from a problem with the magnitude of the partial derivatives, which can become too large or too small. This is a problem because the learning process can go through many fluctuations, slowing the convergence time or making the network become stuck in a local minimum. To help avoid this problem, Rprop was created; it has a dynamic learning rate, i.e., it updates the learning rate for every neuron connection, reducing the error for each neuron separately.
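The per-connection update rule can be sketched as below (a simplified version of the basic Rprop idea, not the exact Encog implementation; the increase/decrease factors 1.2 and 0.5 and the step limits are the commonly used values).

    // Sketch of Rprop for a single connection: only the sign of the gradient
    // is used, and each connection keeps its own step size.
    final class RpropStep {
        private double step = 0.1;             // individual update value of this connection
        private double previousGradient = 0.0;

        double weightChange(double gradient) {
            if (gradient * previousGradient > 0) {
                step = Math.min(step * 1.2, 50.0);   // same sign as before: accelerate
            } else if (gradient * previousGradient < 0) {
                step = Math.max(step * 0.5, 1e-6);   // sign flipped: slow down
            }
            previousGradient = gradient;
            return -Math.signum(gradient) * step;    // move against the gradient by the step size
        }
    }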
3.2 Recurrent neural network

RNNs are neural networks that have one or more connections between neurons that form cycles. These cycles are responsible for storing and passing the feedback of one neuron to another one, creating an internal memory that facilitates the learning of sequential data (Hallas and Dorffner, 1998; Haykin, 1998). The cycles can be used anywhere in the neural network and in any direction, e.g., there can be a delayed feedback from the output to the input layer, a feedback loop from one hidden layer to another layer or to the same layer, and any combination of these (Haykin, 1998).

One of the objectives of this work is to compare the simple neural network models with the more complex ones. For that purpose, the Jordan neural network (JNN) was chosen, since it is a simple recurrent network (SRN) and is usually used for prediction. The JNN has a context layer that holds the previous output from the output layer; this context layer is responsible for receiving the feedback from the previous iteration and transmitting it to the hidden layer, allowing a simple short-term memory (Hallas and Dorffner, 1998).
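A single prediction step of such a Jordan network can be sketched as follows (an illustrative, single-output Java fragment under our own naming; it is not the topology object used in Encog).

    // Sketch of one Jordan network step: the previous output, stored in the
    // context unit, is fed to the hidden layer together with the current inputs.
    final class JordanStep {
        private double context = 0.0;  // holds the output of the previous iteration

        double predict(double[] inputs, double[][] hiddenWeights,
                       double[] contextWeights, double[] outputWeights) {
            double[] hidden = new double[hiddenWeights.length];
            for (int j = 0; j < hidden.length; j++) {
                double sum = contextWeights[j] * context;  // feedback from the context layer
                for (int i = 0; i < inputs.length; i++) {
                    sum += hiddenWeights[j][i] * inputs[i];
                }
                hidden[j] = 1.0 / (1.0 + Math.exp(-sum));
            }
            double output = 0.0;
            for (int j = 0; j < hidden.length; j++) {
                output += outputWeights[j] * hidden[j];
            }
            output = 1.0 / (1.0 + Math.exp(-output));
            context = output;  // the output becomes the context of the next step
            return output;
        }
    }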
3.3 SAE for deep learning

The SAE is a deep learning neural network built with multiple layers of sparse autoencoders, in which the output of each layer is connected to the input of the next layer. SAE learning is based on a greedy layer-wise unsupervised training, which trains each layer independently (Vincent et al., 2008; Ufldl, 2013; Bengio et al., 2007).

The SAE uses all the benefits of a deep neural network and has a high classification power. Therefore, an SAE can learn useful data concerning hierarchical grouping and part-whole decomposition of the input (Ufldl, 2013). The key idea behind deep neural networks is that multiple layers represent various abstraction levels of the data. Consequently, deeper networks have a superior learning capacity compared to a shallow neural network (Bengio, 2009).

The strength of deep learning is based on the representations learned by the greedy layer-wise unsupervised training algorithm. Furthermore, after a good data representation in each layer is found, the acquired neural network can be used to initialise a new ANN with new synaptic weights. This new initialised neural network can be, e.g., an MLP, to start a supervised training if necessary (Bengio, 2009).
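The structure of this greedy layer-wise procedure can be sketched as below (the Autoencoder type is a placeholder of our own, not a class from DeepLearn Toolbox or Encog).

    // Structural sketch of greedy layer-wise pre-training: each autoencoder is
    // trained alone on the representation produced by the layers already stacked.
    final class StackedAutoencoderSketch {
        interface Autoencoder {
            void trainUnsupervised(double[][] data);   // learn to reconstruct its own input
            double[][] encode(double[][] data);        // hidden representation of the input
        }

        static void pretrain(double[][] input, java.util.List<Autoencoder> layers) {
            double[][] representation = input;
            for (Autoencoder layer : layers) {
                layer.trainUnsupervised(representation);       // train this layer independently
                representation = layer.encode(representation); // its code feeds the next layer
            }
            // the stacked encoders then initialise a network for supervised fine-tuning
        }
    }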
A lot of papers emphasise the benefits of the greedy layer-wise unsupervised training for deep network initialisation (Bengio, 2009; Hinton et al., 2006; Bengio
et al., 2007; Larochelle et al., 2009; Ranzato et al., 2007). Therefore, one of the goals of this paper is to verify whether the unsupervised training of deep learning does actually bring advantages over the simpler ANN models.

4 Experiments and results

The utilised time series were created by R.J. Hyndman and are available under the ‘Time Series Data Library’ at the DataMarket Web portal (http://data.is/TSDLdemo). The experiments were performed on data collected daily, hourly and at five minute intervals. Altogether, six time series were used: ‘A-1d’, ‘A-1h’, ‘A-5m’, ‘B-1d’, ‘B-1h’ and ‘B-5m’.

The first time series are composed of internet traffic (in bits) from a private internet service provider (ISP) with centres in 11 European cities. The data correspond to a transatlantic link and were collected from 06:57 hours on 7 June 2005 to 11:17 hours on 31 July 2005. This series was collected at different intervals, resulting in three different time series: ‘A-1d’ is a time series with daily data; ‘A-1h’ is hourly data; and ‘A-5m’ contains data collected every five minutes.

The remaining time series are composed of internet traffic from an ISP, collected in an academic network backbone in the UK. They were collected from 19 November 2004, at 09:30 hours, to 27 January 2005, at 11:11 hours. In the same way, this series was divided into three different time series: ‘B-1d’ is daily data; ‘B-1h’ is hourly data; and ‘B-5m’ contains data collected at five minute intervals.

The conducted experiments used the DeepLearn Toolbox (Palm, 2014) and the Encog machine learning framework (Encog, 2014). Both are open-source libraries that cover several machine learning and artificial intelligence techniques. They were chosen because they are widespread and used in research, supply our needs (MLP, RNN and SAE), are open source and are easy to access and use.

DeepLearn Toolbox is a MATLAB library set that covers a variety of deep learning techniques such as ANN, DBN, convolutional neural networks (CNN), convolutional autoencoders (CAE) and SAE. Encog is a machine learning framework that supports several algorithms, like genetic algorithms, Bayesian networks, hidden Markov models and neural networks. For Encog, the Java 3.2.0 release was used, but it is also available for .Net, C and C++.

For prediction and training of the multilayer perceptron with backpropagation (MLP-BP), the MLP with resilient backpropagation (MLP-RP) and the RNN, the Encog framework was used. The SAE was developed using MATLAB’s DeepLearn Toolbox. The experiments were carried out on a Dell Vostro 3550: Intel Core i5-2430M processor, clock rate of 2.40 GHz and 3 MB cache; 6 GB RAM DDR3; Windows 7 Home Basic 64-bit operating system. The installed Java platform is the Enterprise Edition with JDK 7 and the MATLAB version is R2013b.

4.1 Data normalisation

Before training the neural network, it is important to normalise the data (Feng and Shu, 2005), in this case the time series. Hence, to decrease the time series scale, min-max normalisation was used to limit the data to the interval [0.1, 1]. This interval was set so that the prediction output values stay inside the range of the sigmoid activation function used by the neural network. Besides, 0.1 was selected instead of 0 for the minimum value, to avoid a division by zero when calculating the normalised errors.
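The mapping onto [0.1, 1] can be implemented as in the sketch below (the exact linear mapping is our assumption; only the target interval is stated above).

    // Min-max normalisation of a time series onto the interval [0.1, 1].
    final class MinMaxNormaliser {
        static double[] normalise(double[] series) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double value : series) {
                min = Math.min(min, value);
                max = Math.max(max, value);
            }
            double[] scaled = new double[series.length];
            for (int i = 0; i < series.length; i++) {
                // linear mapping of [min, max] onto [0.1, 1]
                scaled[i] = 0.1 + 0.9 * (series[i] - min) / (max - min);
            }
            return scaled;
        }
    }

With this mapping, the smallest observation of the series becomes 0.1 and the largest becomes 1.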
The original time series is normalised, generating a new normalised time series which is used for the training. The length of the six time series used varies from 51 values (for the smallest time series, with daily data) to 19,888 values (for the largest time series, with data collected at five minute intervals). During the experiments the size of the training set therefore varied greatly, from 25 values (for the smallest time series) to 9,944 values (for the largest time series). The training set was chosen as the first half of the time series; the other half of the time series is the test set used for evaluating the prediction accuracy. The size of each dataset can be seen in Table 1.

Table 1 The time interval and size of each time series

Dataset   Time interval   Time series total size   Training set size
A-1d      1 day           51                       25
A-1h      1 h             1,231                    615
A-5m      5 min           14,772                   7,386
B-1d      1 day           69                       34
B-1h      1 h             1,657                    828
B-5m      5 min           19,888                   9,944

4.2 Neural network architecture and topology

For the MLP-BP, a sigmoid activation function, a learning rate of 0.01 and backpropagation as the training algorithm were used. A higher learning rate accelerates the training, but may generate many oscillations in it, making it harder to reach a low error. On the other hand, a lower learning rate leads to steadier training; however, it is much slower. Different values for the learning rate were also tested, such as 0.5, 0.25 and 0.1, yet, as expected, the lowest errors were obtained using 0.01 for the learning rate.

For the deep learning neural network, the SAE was used, also with a sigmoid activation function and a learning rate of 0.01. For the MLP-RP and the RNN, a sigmoid activation function was also used. The training algorithm of the RNN was Rprop and the topology was that of a JNN, with one context layer responsible for storing previous iterations, allowing a short-term memory.

The training algorithm of the SAE is a greedy algorithm that gives similar weights to similar inputs. Each autoencoder is trained separately, in a greedy fashion, then it is stacked onto those already trained, thereby producing an SAE with multiple layers. The SAE has an unsupervised training
algorithm, i.e., it does not train the data considering an expected output. Thus, the SAE is used as a pre-training stage to initialise the neural network weights. After that, the fine-tuning step is initiated, in which the backpropagation algorithm was used for supervised training (Bengio, 2009; Erhan et al., 2009; Ufldl, 2013; Bengio et al., 2007).

Several tests were carried out varying the ANN topology, both in the number of neurons per layer and in the number of layers. The tests consist in creating several neural networks with different topologies and training each one of them. The number of neurons was varied in steps of five for the input layer and in steps of five for the hidden layer, with one output neuron for the prediction data. Additionally, the number of hidden layers ranged between [2, 5] for the MLP-BP and MLP-RP, [3, 7] for the SAE, and just one hidden layer for the RNN plus one context layer. The tests were stopped when the validation error began to decrease.

All result comparisons were made according to the validation errors for each analysed ANN type (MLP-BP, MLP-RP, RNN and SAE). For the MLP-BP and MLP-RP, the best performances were obtained with four layers, around 15 input neurons, 1 output neuron, and 45 and 35 neurons in the hidden layers, respectively, as shown in Figure 1. It was found that increasing the number of neurons or the number of layers did not result in better performance; to some extent the average of the normalised root mean square error (NRMSE) was found to be similar for the same time series.

Figure 1 MLP architecture showing the layers, number of neurons of each layer and the information feed-forward flow

The MLP-BP, MLP-RP and RNN had a similar average NRMSE, but the RNN had a lower average, even with fewer neurons. The best performances of the RNN were obtained with three layers (excluding the JNN’s context layer), around ten input neurons, one output neuron and 45 neurons in the hidden layer, as shown in Figure 2. In the same way, increasing the number of neurons did not result in better performance; in fact, overly increasing the number of neurons was detrimental to performance. The results began to worsen from 40 neurons in the input layer and 120 neurons in the entire neural network.

Figure 2 JNN architecture showing the layers, number of neurons of each layer, the information flow and the context layer with the feedback from the output layer to the hidden layer

For the SAE, the best results were found with six layers, with 20 input neurons, one output neuron, and 80, 60, 60 and 40 neurons in each of the hidden layers, respectively. Increasing the number of neurons in the layers of the SAE did not produce better results; on average the NRMSE was very similar. Similar results were also found with four layers, like the MLP, whereas a deeper SAE achieved slightly better results. A comparison of the NRMSE of each prediction model is shown in Table 3.

The influence of the number of neurons on the error is better seen in Figure 3. It is noticed that the mean squared error (MSE) increased significantly after 120 neurons in the MLP-RP and RNN. For a smaller number of neurons, fewer than 120, the error did not change very much. Besides, the more neurons in the network, the harder and longer the training will be. For the SAE, the MSE was steadier even for deeper architectures.

4.3 Neural network training

The neural network training was carried out in a single batch. This way, all input data of the training set are trained in a single training epoch, adjusting the weights of the neural network for the entire batch. Tests with more batches (less input data for each training epoch) were also performed and similar error rates were found. Nevertheless, for smaller batches the training took more time to converge, because a smaller amount of data is trained at each epoch.

The MLP-BP, MLP-RP and RNN training lasted 1,000 epochs. The SAE training is separated into two steps. The first one is the unsupervised pre-training, which lasted 900 epochs. The second step is the fine-tuning, which uses a supervised training and lasted 100 epochs.
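This training schedule can be summarised by the sketch below (TrainableNetwork is a placeholder of our own for any of the four models; one call to iteration() corresponds to one full-batch epoch).

    // Sketch of full-batch training for a fixed number of epochs.
    final class TrainingSchedule {
        interface TrainableNetwork {
            void iteration(double[][] inputs, double[][] targets);  // one pass over the whole batch
        }

        static void train(TrainableNetwork network, double[][] inputs,
                          double[][] targets, int epochs) {
            for (int epoch = 0; epoch < epochs; epoch++) {
                network.iteration(inputs, targets);  // weights adjusted once per epoch
            }
        }
    }
    // MLP-BP, MLP-RP and RNN: train(network, inputs, targets, 1000);
    // SAE: 900 unsupervised pre-training epochs followed by 100 fine-tuning epochs.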
Figure 3 Proportion between the complexity of the neural network, measured in total number of neurons, and the respective MSE for the prediction of the B-5m time series

The training time is mainly affected by the size of the training set and by the number of neurons of the neural network. The larger the training set and the higher the number of neurons, the more time is necessary to train the neural network. Table 2 shows the average training time obtained for each time series and prediction model.

Table 2 Comparison of the average training time

          Training time (milliseconds)
Dataset   MLP-BP    MLP-RP   RNN      SAE
A-1d      268       79       67       8,874
A-1h      2,373     1,482    1,473    585,761
A-5m      86,296    56,078   33,981   6,724,641
B-1d      558       326      108      17,280
B-1h      7,322     7,181    2,838    837,567
B-5m      117,652   78,078   45,968   8,691,876

The only difference between the MLP-BP and the MLP-RP is the training algorithm. The dynamic learning rate of Rprop makes it one of the fastest training algorithms for ANNs, and that is clear in our results: the MLP-RP was faster than the MLP-BP. The RNN with Rprop used fewer neurons and fewer layers, three layers in total, instead of four layers like the other methods. Thereby, the fastest training of all neural network models was that of the RNN.

The initial 50 training epochs and errors are shown in Figure 4, which compares the fine-tuning training of the SAE with the MLP-BP training. It is possible to observe that, because of the SAE pre-training, the SAE training converges faster than the MLP-BP training. However, with more training epochs they obtain very similar error rates.

Figure 5 shows the first 50 training epochs and errors for the MLP-RP and RNN. As both use Rprop as the training algorithm, the errors change drastically at the beginning of training, since there is no fixed learning rate. Due to this, the error keeps varying until it finds a better update value for the training.

Figures 6 and 7 show the time series prediction results for the MLP-RP and SAE, respectively. The fitted values for the MLP-BP and RNN were very similar to those of the MLP-RP; due to the low-scale image it would be difficult to see a difference between them, so only the MLP-RP is shown. It is noted that the MLP-RP, MLP-BP and RNN (see Figure 6) best fit the actual data; nevertheless, the SAE fared well in data generalisation.

All the prediction models, the MLP-BP, MLP-RP, RNN and SAE, learned the time series features and used these features to predict data that are not known a priori. An important detail is that the RNN used fewer neurons and just one hidden layer. Therefore, the training phase of the RNN is faster than that of the MLP-BP, MLP-RP and SAE.

4.4 Main results

The key idea of deep learning is that the depth of the neural network allows learning complex and nonlinear data (Bengio, 2009). However, the use of the SAE for time series prediction was not beneficial, i.e., the pre-training did not bring significant benefits to prediction and the SAE was outperformed by the other methods. The results with the respective NRMSE are shown in Table 3. The NRMSE is the normalised version of the root of the MSE, that is:

e_t = y_t − ŷ_t (1a)

MSE = (1/n) Σ_{i=1}^{n} e_i² (1b)

NRMSE = √MSE / (y_max − y_min) (1c)

where e_t is the prediction error at time t; y_t is the actual value observed at time t; ŷ_t is the predicted value at time t; n is the number of predictions; y_max is the maximum observed value and y_min is the minimum observed value.
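Equations (1a) to (1c) translate directly into the following sketch (an illustrative Java fragment of ours; the metric itself is the one defined above).

    // Computes the NRMSE of a set of predictions against the actual values.
    final class PredictionError {
        static double nrmse(double[] actual, double[] predicted) {
            double sumSquaredError = 0.0;
            double max = -Double.MAX_VALUE, min = Double.MAX_VALUE;
            for (int t = 0; t < actual.length; t++) {
                double error = actual[t] - predicted[t];   // equation (1a)
                sumSquaredError += error * error;
                max = Math.max(max, actual[t]);
                min = Math.min(min, actual[t]);
            }
            double mse = sumSquaredError / actual.length;  // equation (1b)
            return Math.sqrt(mse) / (max - min);           // equation (1c)
        }
    }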
Figure 4 A MSE comparison of SAE and MLP-BP at each training epoch, for B-5m time series

Figure 5 A MSE comparison of the MLP-RP and the RNN at each training epoch, for B-5m time series

Figure 6 A comparison of the actual (the original) time series (represented in grey) and the predicted network traffic (represented in black) using the MLP-RP with two hidden layers, for the B-5m time series

Notes: Since the actual and the predicted plot lines are very similar, it is difficult to see the difference in a low-scale image. Yet, it is possible to see that the predicted values fit very well to the actual values. Another observation is that the actual and predicted plots of network traffic for the MLP-BP and RNN were almost identical to that of the MLP-RP.
Figure 7 A prediction comparison of the SAE with four hidden layers, for the B-5m time series

Notes: It shows the actual (the original) time series (represented in grey) and the predicted network traffic (represented in black) using the SAE. It is noted that the predicted values did not fit well in the 1 × 10^4 to 1.4 × 10^4 period in time, but for the rest of the series the predicted values fit well to the actual values.

Table 3 Comparison of NRMSE results

          Normalised root mean squared error (NRMSE)
Dataset   MLP-BP    MLP-RP    RNN       SAE
A-1d      0.19985   0.20227   0.19724   0.36600
A-1h      0.05524   0.04145   0.04197   0.09399
A-5m      0.01939   0.01657   0.01649   0.02226
B-1d      0.12668   0.14606   0.11604   0.21552
B-1h      0.04793   0.02927   0.02704   0.06967
B-5m      0.01306   0.01008   0.00994   0.01949

In time series prediction, the SAE method has more complexity than the other ones, since it has the extra unsupervised training phase, which initialises the neural network weights for the fine-tuning stage. Even with the additional complexity, the SAE was inferior. Due to this fact, this approach is not recommended for time series prediction.

There were no major differences concerning the MLP-BP and MLP-RP, but in general the MLP-RP had better results, both in accuracy and training time. Ultimately, comparing all the results in Table 3, the RNN approach produced the best results with the smallest NRMSE. Moreover, the RNN used fewer neurons and performed better than the other methods, making the training much faster, as shown in Table 2. In addition, Table 4 shows that the RNN can be up to 66.87% faster, and 8.4% more accurate, compared to the second best method.

Table 4 shows the percentage of how fast the RNN is, as well as the percentage of how accurate the RNN is, when comparing it to the second best prediction method in each category. For the training time, the RNN is compared with the MLP-RP, which obtained the second fastest training time (see Table 2). For the NRMSE, the MLP-RP method obtained the best (smallest) value for the ‘A-1h’ series and the RNN was the second best, which is why that percentage is negative. Still in the NRMSE comparison, the RNN was the best for the remaining series; the MLP-BP was the second best for the ‘A-1d’ and ‘B-1d’ series and the MLP-RP was the second best for the ‘A-5m’, ‘B-1h’ and ‘B-5m’ series (see Table 3).

Table 4 Percentage of how fast and accurate the RNN is, compared to the second best prediction method

Dataset   Training time   NRMSE
A-1d      15.19%          1.3%
A-1h      0.4%            –1.25%
A-5m      39.4%           0.48%
B-1d      66.87%          8.4%
B-1h      60.47%          7.62%
B-5m      41.12%          1.39%
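To make explicit how these percentages follow from Tables 2 and 3 (our reading of the comparison): for the ‘B-1d’ series, the RNN trained in 108 ms against 326 ms for the MLP-RP, and (326 − 108) / 326 ≈ 0.6687, i.e., the RNN is 66.87% faster; likewise, its NRMSE of 0.11604 against the MLP-BP’s 0.12668 gives (0.12668 − 0.11604) / 0.12668 ≈ 0.084, i.e., 8.4% more accurate.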
There are works in the pattern recognition field where the use of autoencoders is advantageous (Ranzato et al., 2007), as they are based on unlabelled data. On the other hand, there are works in energy load forecasting showing that the autoencoder approach is worse than classical neural networks (Busseti et al., 2012), like the MLP and recurrent ANNs. Each of these problems has a method that best solves it, so it is important to analyse the type of input data before choosing the most appropriate method to be used.
5 Conclusions

All types of ANN studied have proven that they are capable of adjusting to and predicting network traffic accurately. However, the initialisation of the neural network weights through the unsupervised pre-training did not bring an improvement for time series prediction. The results show that the MLP and RNN are better than the SAE for internet traffic prediction. In addition, the SAE deep neural network approach results in more computational complexity during the training, so the choice of the MLP or RNN is more advantageous.

In theory, of all the ANNs studied in this work, the best prediction method would be the RNN, since it is the one that uses previous observations as feedback in the learning of newer observations, facilitating the learning of temporal and sequential data. Accordingly, the experiments carried out showed that the best results in both accuracy and computation time were obtained with the JNN, an SRN. Therefore, of all the methods used, the best prediction method for short-term or real-time prediction is the RNN with Rprop as the training algorithm, as it obtained the smallest errors in significantly less time, as shown in Table 4.

The use and importance of deep neural networks is increasing and very good results are achieved in image, audio and video pattern recognition (Larochelle et al., 2009; Ranzato et al., 2007; Arel et al., 2010; Chao et al., 2011). However, the main learning algorithms for this kind of neural network are unsupervised training algorithms, which use unlabelled data for their training. In contrast, network traffic and time series in general are labelled data, requiring an unsupervised pre-training before the actual supervised training as a fine-tuning. Yet, as shown in Chao et al. (2011), the DBN and restricted Boltzmann machine (RBM), which are deep learning methods, can be modified to work better with labelled data, i.e., time series datasets.

Future works will focus on other deep learning techniques, like the DBN and continuous restricted Boltzmann machine (CRBM). Other future works will use the network traffic prediction to create an adaptive bandwidth management tool. This adaptive management tool will first focus on congestion control through dynamic bandwidth allocation, based on the predicted traffic. The goal is to ensure a better QoS and a fair share of bandwidth allocation for the network devices in a dynamic and adaptive management application.

References

Arel, I., Rose, D. and Karnowski, T. (2010) ‘Deep machine learning – a new frontier in artificial intelligence research [research frontier]’, Computational Intelligence Magazine, IEEE, Vol. 5, No. 4, pp.13–18.

Bengio, Y., Lamblin, P., Popovici, D. and Larochelle, H. (2007) ‘Greedy layerwise training of deep networks’, in Advances in Neural Information Processing Systems 19 (NIPS ‘06), pp.153–160.

Bengio, Y. (2009) ‘Learning deep architectures for AI’, Found. Trends Mach. Learn., Vol. 2, No. 1, pp.1–127.

Busseti, E., Osband, I. and Wong, S. (2012) Deep Learning for Time Series Modeling, CS 229: Machine Learning, Stanford.

Chao, J., Shen, F. and Zhao, J. (2011) ‘Forecasting exchange rate with deep belief networks’, in Neural Networks (IJCNN), The International Joint Conference on, pp.1259–1266.

Cortez, P., Rio, M., Rocha, M. and Sousa, P. (2012) ‘Multi-scale internet traffic forecasting using neural networks and time series methods’, Expert Systems, Vol. 29, No. 2, pp.143–155.

de Villiers, J. and Barnard, E. (1993) ‘Backpropagation neural nets with one and two hidden layers’, Neural Networks, IEEE Transactions on, Vol. 4, No. 1, pp.136–141.

Ding, X., Canu, S., Denoeux, T., Rue, T. and Pernant, F. (1995) ‘Neural network based models for forecasting’, in Proceedings of ADT ‘95, pp.243–252, Wiley and Sons.

Encog (2014) ‘Encog artificial intelligence framework for Java and DotNet’ [online] http://www.heatonresearch.com/encog (accessed 3 November 2014).

Erhan, D., Manzagol, P-A., Bengio, Y., Bengio, S. and Vincent, P. (2009) ‘The difficulty of training deep architectures and the effect of unsupervised pre-training’, in 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.153–160.

Feng, H. and Shu, Y. (2005) ‘Study on network traffic prediction techniques’, in Wireless Communications, Networking and Mobile Computing, Proceedings, International Conference on, Vol. 2, pp.1041–1044.

Hallas, M. and Dorffner, G. (1998) ‘A comparative study on feedforward and recurrent neural networks in time series prediction using gradient descent learning’, in Proc. 14th European Meet. Cybernetics Systems Research, Vol. 2, pp.644–647.

Han, M-S. (2014) ‘Dynamic bandwidth allocation with high utilization for xg-pon’, in Advanced Communication Technology (ICACT), 16th International Conference on, pp.994–997.

Haykin, S. (1998) Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall PTR, Upper Saddle River, NJ, USA.

Hinton, G.E., Osindero, S. and Teh, Y-W. (2006) ‘A fast learning algorithm for deep belief nets’, Neural Comput., Vol. 18, No. 7, pp.1527–1554.

Hornik, K., Stinchcombe, M. and White, H. (1989) ‘Multilayer feedforward networks are universal approximators’, Neural Netw., Vol. 2, No. 5, pp.359–366.

Larochelle, H., Erhan, D. and Vincent, P. (2009) ‘Deep learning using robust interdependent codes’, in D.V. Dyk and M. Welling (Eds.): Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 5, pp.312–319, Journal of Machine Learning Research – Proceedings Track.

Liang, Y. and Han, M. (2007) ‘Dynamic bandwidth allocation based on online traffic prediction for real-time mpeg-4 video streams’, EURASIP J. Appl. Signal Process., No. 1, pp.51–51.

Nguyen, T., Eido, T. and Atmaca, T. (2009) ‘An enhanced QoS-enabled dynamic bandwidth allocation mechanism for ethernet pon’, in Emerging Network Intelligence, First International Conference on, pp.135–140.
Oliveira, T.P., Barbar, J.S. and Soares, A.S. (2014) ‘Multilayer perceptron and stacked autoencoder for internet traffic prediction’, in C-H. Hsu, X. Shi and V. Salapura (Eds.): Network and Parallel Computing – 11th IFIP WG 10.3 International Conference, NPC, Lecture Notes in Computer Science, Ilan, Taiwan, September 18–20, Vol. 8707, pp.61–71, Springer.

Palm, R.B. (2014) ‘Deeplearntoolbox, a MATLAB toolbox for deep learning’ [online] https://github.com/rasmusbergpalm/DeepLearnToolbox (accessed 3 November 2014).

Ranzato, M., Boureau, Y. and Cun, Y.L. (2007) ‘Sparse feature learning for deep belief networks’, in J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.): Advances in Neural Information Processing Systems, Vol. 20, pp.1185–1192, MIT Press, Cambridge, MA.

Ufldl (2013) Unsupervised Feature Learning and Deep Learning, Stanford’s Online Wiki, Stacked Autoencoders.

Vincent, P., Larochelle, H., Bengio, Y. and Manzagol, P-A. (2008) ‘Extracting and composing robust features with denoising autoencoders’, in Proceedings of the 25th International Conference on Machine Learning, ICML ‘08, pp.1096–1103, ACM, New York, NY, USA.

Zhao, H., Niu, W., Qin, Y., Ci, S., Tang, H. and Lin, T. (2012) ‘Traffic load-based dynamic bandwidth allocation for balancing the packet loss in diffserv network’, in Computer and Information Science (ICIS), IEEE/ACIS 11th International Conference on, pp.99–104.
