
International Journal of Computer Trends and Technology (IJCTT) – volume 4 Issue 8– August 2013

ISSN: 2231-2803 Page 2452
A Short-Term Traffic Prediction On A Distributed Network Using Multiple Regression Equation
Ms. Sharmi S., Research Scholar, MS University, Thirunelvelli
Director, SREC, Coimbatore

Abstract: A new approach is proposed to predict the fractal behavior of distributed network traffic, in which a
random scaling fractal model is used to simulate the self-affine characteristics of network traffic. The network
traffic is studied by sniffing a portion of it using Wireshark. The sniffed traffic is inspected and dissected for
each protocol using the filter option. The fractal behavior of the sniffed traffic is examined with an
open-source network analyzer. Later, the sniffed packet records are exported to NeuroSolutions
builder and SPSS and then examined. Further, the exported and dissected traffic data is fed as input to train the neural
network so that it predicts the resultant fractal behavior of the distributed network traffic, and an equation is proposed
in SPSS to derive a close prediction of the network traffic.
Keywords: fractal behavior, sniffing, predict, SPSS, NeuroSolution builder, NeuroXL predictor.
For the examination of local problems in a
small network, monitoring at a single
observation point is sufficient to train the
network builder. For such cases, a network
analyzer may be used: a machine running
Wireshark that is directly connected to a
network segment or to the monitoring port
of a switch or a router. In
larger networks, it is often necessary to
perform simultaneous monitoring at multiple
observation points to train the constructed
neural network in a more efficient manner.
In this research a Neural Network (Multi-layer
Perceptron) is proposed to predict the
dependent variable values over different
independent variable value distributions
using two specific modeling tools, viz.,
SPSS and NeuroSolutions. One objective is
to find the effect of the dependent variable
value distributions in the dataset on the
Neural Network prediction performance
under the different modeling tools. A second
objective is to compare the performance of
the two modeling tools in the prediction of
the dependent variable.
Analyzing packet records with Wireshark
Wireshark [1], formerly known as Ethereal,
is probably the most popular open-source
network analyzer tool. For the experiments,
we configured Wireshark on our machine to
capture network packets. The data collected
is exported in Comma Separated Value
(.csv) format.
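As a rough illustration of how such an export could be prepared for later analysis, the following Python sketch counts packets per protocol in each time slice. It assumes the CSV keeps Wireshark's default columns (No., Time, Source, Destination, Protocol, Length, Info); the sample rows, the function name packets_per_slice and the one-second slice width are hypothetical and not taken from the experiments.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows laid out like a default Wireshark CSV export.
SAMPLE_CSV = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","10.0.0.1","10.0.0.2","TCP","66","Example"
"2","0.400000","10.0.0.2","10.0.0.1","HTTP","512","Example"
"3","1.200000","10.0.0.1","10.0.0.3","DNS","74","Example"
"4","1.700000","10.0.0.3","10.0.0.1","TCP","60","Example"
'''


def packets_per_slice(csv_text, slice_seconds=1.0):
    """Count packets for every (time slice index, protocol) pair."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The Time column holds seconds since the start of the capture.
        slice_index = int(float(row["Time"]) // slice_seconds)
        counts[(slice_index, row["Protocol"])] += 1
    return counts


counts = packets_per_slice(SAMPLE_CSV)
for (slice_index, protocol), n in sorted(counts.items()):
    print(slice_index, protocol, n)
```

Per-protocol counts of this kind are one plausible form for the dissected traffic series that are later used as independent variables.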
Wireshark can be divided into four main
modules: Capture Core, WireTap, Protocol
Interpreter and Dissector. Capture Core uses
the common library WinPcap to capture data
from different networks (Ethernet, Ring,
etc.). Once the data is obtained, WireTap is
used to save it as a binary file. Because the
data is binary, the user cannot understand it
without the Protocol Interpreter and
Dissector. A Dissector can be built in or
supplied as a plug-in. The proposed approach
profits from Wireshark's extensive
packet inspection facility and protocol
dissection capabilities for distributed
network analysis.
NeuroSolutions
The NeuralBuilder helps to construct the
neural network by selecting parameters. The
four currently available problem types in the
NeuralExpert are Classification, Prediction,
Function Approximation, and Clustering.
Later, a parameter list is selected as the
input and the desired traffic is set as the
output to train the network.
Figure 1. Flow diagram to deploy traffic prediction using ANN.

An ANN is a computational method
motivated by biological models. ANNs
attempt to mimic the fundamental operation
of the human brain and can be used to solve
a broad variety of problems [10]. One of the
most important features of ANNs is that they
can discover hidden patterns in data sets
[11] and solve complex problems, especially
when a mathematical model does not exist
(or when the model is not suitable for the
case at hand). Furthermore, ANNs are
commonly immune to noise and
irregularities present in the data [12, 13].
ANN learning is typically based on two data
sets: the training set and the validation set.
The training set, as its name indicates, is
used to train a new artificial neural
network. The validation set is used after the
neural network has been trained, to assess
its performance. In most cases the validation
set is similar to the training set but not
identical [14, 15].
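The two-set arrangement can be sketched in Python as follows. This is a minimal illustration only: the 80/20 ratio, the fixed seed and the function name split_dataset are assumptions, since the paper does not state how its records were divided.

```python
import random


def split_dataset(records, validation_fraction=0.2, seed=0):
    """Shuffle the records and split them into disjoint training and validation sets."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


records = list(range(100))  # stand-in for 100 traffic records
training_set, validation_set = split_dataset(records)
print(len(training_set), len(validation_set))
```

Both sets are drawn from the same records, so they are similar in distribution, but no record appears in both.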
Data mapping
In artificial intelligence, a desired output is
commonly known as the target. For the
specific case of ANNs, the target is used for
network training [9]. ANNs can map a given
input to a desired output; when an ANN is
used for this purpose, the ANN is typically
called a mapping ANN. The network is
trained by applying the desired input to the
ANN, and then monitoring the actual ANN
output. The difference between the actual
ANN output and the desired output is
normally used to manage the learning
process. During the process of training, the
learning algorithm attempts to reduce the
error measured between the actual network
output and the target in the training set [9,
11]. The training process may be time
consuming, but when the process has been
successfully completed, an ANN can quickly
calculate its output once the input data has
been applied to the network input.

Figure 2: Flow diagram of ANN using NeuroSolutions (Step 1 to Step 5; recoverable box labels: dissect the network traffic dataset, translate the network traffic data parameters, train the NN's architecture for N epochs, evaluate, extract a new dissected traffic dataset, perform prediction).
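The training loop described under Data mapping, in which the error between the actual output and the target drives the learning, can be illustrated with a deliberately small Python sketch. It trains a single linear neuron by gradient descent on the mean squared error. This is a stand-in chosen for brevity, not the multi-layer perceptron built in NeuroSolutions; the learning rate, epoch count and toy data are assumptions.

```python
def train_linear_neuron(inputs, targets, learning_rate=0.05, epochs=1000):
    """Fit output = weight * x + bias by gradient descent on the mean squared error."""
    weight, bias = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, target in zip(inputs, targets):
            # Difference between the actual output and the desired output.
            error = (weight * x + bias) - target
            grad_w += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        # Move the parameters against the gradient to reduce the error.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]  # toy targets generated by 2 * x + 1
weight, bias = train_linear_neuron(inputs, targets)
print(round(weight, 3), round(bias, 3))
```

Training takes many passes over the data, but once the weights are fixed a prediction is a single multiply and add, which mirrors the remark that a trained network computes its output quickly.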
Data classification
Data classification or just classification is
the process of identifying an object from a
set of possible outcomes [9, 12]. An ANN
can be trained to identify and classify any
kind of object. These objects can be
numbers, images, sounds, signals, etc. An
ANN used for this purpose is also known as
a classifier.

Figure 3. Training fractal-dataset graph
The network is trained initially with a
network traffic dataset that was downloaded
from the Wireshark sample captures as a pcap
file, and the data is exported to the
network builder for prediction. The
predicted fractal behavior on the traffic data
set is shown in table 1.

The effect of the dependent variable values
and their distribution on the prediction
accuracy rate is investigated. The results of
the analyses reveal the effect of the
dependent variable value distribution on
prediction accuracy, which leads to an
equation that predicts the expected traffic
from the independent variable value
distribution using the modeling tool SPSS.
Correlation Coefficient, R, is a measure of
the strength of the association between the
independent (explanatory) variables and the
dependent (prediction) variable. R is never a
negative value. This can be seen from the
formula below, since R is taken as the
positive square root [2, 3].
Formula for R,Formula for two independent
variables, X1 and
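The formula itself did not survive in this copy. For reference, the standard textbook expression for the multiple correlation coefficient with two independent variables is given below; it is assumed, not confirmed, to be the one intended. The name X_2 for the second variable and the notation r_{AB} for the pairwise Pearson correlation between A and B are likewise assumptions.

```latex
R = \sqrt{\frac{r_{YX_1}^{2} + r_{YX_2}^{2} - 2\, r_{YX_1}\, r_{YX_2}\, r_{X_1X_2}}{1 - r_{X_1X_2}^{2}}}
```

Because R is defined as the positive square root, it cannot be negative, in line with the statement above.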

The coefficient of multiple correlation
estimates the combined influence of two or
more variables on the observed (dependent)
variable. To analyse the traffic data using
multiple regression, part of the process
involves checking the following assumptions:
• The dependent variable is measured on a continuous scale.
• There are two or more independent variables, which are continuous or categorical.
• Observations should be recorded.
• A linear relationship exists between the dependent variable and each of the independent variables.
• The traffic data shows homoscedasticity, which is where the variances along the line of best fit remain similar as one moves along the line.
• The data does not show multicollinearity, which occurs when two or more independent variables are highly correlated.
• There are no significant outliers, high leverage points or highly influential points.
• Residuals (errors) are approximately normally distributed.
The assumptions listed above are not
violated, and hence the Multiple
Correlation Coefficient, R, is computed to
measure the strength of the association
between the independent (explanatory)
variables and a single dependent (prediction)
variable.
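One of the listed assumptions, the absence of multicollinearity, lends itself to a quick check. The sketch below computes the Pearson correlation between every pair of independent variables and flags pairs above a cut-off. The 0.8 cut-off is an assumed rule of thumb, the sample series are invented, and the function names are hypothetical; SPSS offers its own collinearity diagnostics, which this does not reproduce.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(predictors, threshold=0.8):
    """Return the pairs of predictor names whose |r| exceeds the threshold."""
    names = sorted(predictors)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson_r(predictors[first], predictors[second])) > threshold:
                flagged.append((first, second))
    return flagged


# Invented series: n2 is deliberately close to 2 * n1, n3 is unrelated.
predictors = {
    "n1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "n2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "n3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = collinear_pairs(predictors)
print(flagged)
```

A flagged pair suggests that one of the two variables adds little independent information to the regression.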
Multiple Regression-booster prediction
In MR-Booster, each feature of the
association between the actual traffic and
the dissected traffic is used explicitly to
generate the prediction equation. Probing
the standard error factor further refines the
regression equation that predicts the
network traffic. The correlation structure of
the traffic is finally generated in a much
easier way.
Phase 1:
a. The sniffed traffic data is plotted as a
scatter plot graph to visualize if there is a
possible linear relationship.
b. Calculate and interpret the linear
correlation coefficient, using the data sets.
Phase 2:
c. Determine all possible regression equations
for the data, refining them further by
adjusting for the constant standard error.
d. Select and apply the best generated
regression equation and forecast.
Phase 3:
e. Identify outliers and note the
f. Process and interpret the performance of
R-booster prediction.
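The regression and outlier steps of the phases above can be sketched for the simplest case of one independent variable. The code fits a least-squares line and flags points whose residual is more than two residual standard deviations from the line. It is a simplification of the multiple regression run in SPSS; the two-standard-deviation limit, the function names and the toy data are assumptions.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def find_outliers(xs, ys, slope, intercept, z_limit=2.0):
    """Indices whose residual exceeds z_limit residual standard deviations."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [i for i, r in enumerate(residuals)
            if spread > 0 and abs(r) > z_limit * spread]


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[4] = 30.0  # inject one artificial spike into otherwise linear data

slope, intercept = fit_line(xs, ys)
outliers = find_outliers(xs, ys, slope, intercept)
print(round(slope, 4), round(intercept, 4), outliers)
```

The injected spike pulls the fitted slope away from 2, which illustrates why outliers are identified before the final equation is interpreted.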

Table 1. Descriptive Statistics (SPSS)
                 Mean   Std. Deviation   N
Actual-Traffic   .84    1.756            2581
Traffic-n1       .77    1.656            2581
Traffic-n2       .01    .139             2581
Traffic-n3       .60    1.308            2581

Table 2. Correlation Coefficients (dependent variable: actual traffic-graph)
Model            R       Std. Error   Beta   T         Sig.
1 (Constant)     .013    .005                2.711     .007
  Network1(n1)   .880    .007         .830   133.561   .000
  Network2(n2)   1.047   .032         .083   32.668    .000
  Network3(n3)   .229    .008         .170   27.395    .000

The following equation is generated to
predict the actual traffic from the
dissected protocol traffic.
Predicted traffic (w.r.t. time slice) =
n1 * (R(n1) - Std. Error(n1))
+ n2 * (R(n2) - Std. Error(n2))
+ n3 * (R(n3) - Std. Error(n3))
+ (R(constant) - Std. Error(constant))
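The equation can be evaluated directly with the R and Std. Error values reported in Table 2. The Python sketch below does so; the function and variable names are illustrative only, and feeding in the Table 1 means is merely a convenient sanity check, not an experiment from the paper.

```python
# (R, Std. Error) pairs copied from Table 2.
COEFFICIENTS = {
    "constant": (0.013, 0.005),
    "n1": (0.880, 0.007),
    "n2": (1.047, 0.032),
    "n3": (0.229, 0.008),
}


def adjusted(name):
    """R minus its standard error, as used in the prediction equation."""
    r_value, std_error = COEFFICIENTS[name]
    return r_value - std_error


def predict_traffic(n1, n2, n3):
    """Predicted traffic for one time slice from the three dissected traffic values."""
    return (n1 * adjusted("n1")
            + n2 * adjusted("n2")
            + n3 * adjusted("n3")
            + adjusted("constant"))


# Sanity check with the mean traffic values listed in Table 1.
predicted_mean = predict_traffic(0.77, 0.01, 0.60)
print(round(predicted_mean, 3))
```

With the Table 1 means as input the equation gives roughly 0.823, against the mean actual traffic of .84 in the same table.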

The R values of the traffic from n1 and n2
indicate a strong association with the actual
traffic, whereas the traffic from n3 has a
weak association, as interpreted using table 3.

Table 3. R value strength.
R value   Interpretation
0.9       strong association
0.5       moderate association
0.25      weak association


Figure 4. Actual-traffic vs Predicted traffic(NeuroSolutions) and Computed-traffic(SPSS)
Figure 4 shows that the traffic
computed using the generated equation is
very close to the actual target traffic.

The overall performance of the
analyzed prediction methods is stated here
to estimate the prediction accuracy. The
Coefficient of Efficiency (E) is one such
estimation method that measures the
performance and reveals the efficiency rate.
The efficiency coefficient can take values in
the domain (−∞, 1]. If E = 1, we have a
perfect fit between the observed and the
forecasted data. A value of E = 0 occurs
when the prediction corresponds to
estimating the mean of the actual values. An
efficiency less than zero, i.e. −∞ < E < 0,
indicates that the average of the actual
values is a better predictor than the analyzed
forecasting method. The closer E is to 1, the
more accurate the prediction is. The
coefficient of efficiency stays at 0.9 for the
forecasted traffic.
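The properties described for E (a range of (−∞, 1], E = 1 for a perfect fit, E = 0 for a mean-only prediction) match the Nash-Sutcliffe form E = 1 - SS_res / SS_tot. The sketch below assumes that form, since the paper does not print the formula; the toy series are invented to show the three regimes.

```python
def efficiency_coefficient(actual, predicted):
    """E = 1 - sum((a - p)^2) / sum((a - mean(a))^2), assumed Nash-Sutcliffe form."""
    mean_actual = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - residual_ss / total_ss


actual = [1.0, 2.0, 3.0, 4.0]

perfect = efficiency_coefficient(actual, actual)                  # exact forecast
mean_only = efficiency_coefficient(actual, [2.5] * 4)             # always predicts the mean
worse = efficiency_coefficient(actual, [4.0, 3.0, 2.0, 1.0])      # reversed forecast

print(perfect, mean_only, worse)
```

The reversed forecast yields a negative E, the case in which the mean of the actual values would have been the better predictor.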

The experimental results demonstrate that 1)
the regression model is more effective for
traffic prediction; and 2) both the proposed
prediction equation and the standard error
based R (correlation coefficient) update
scheme predict the traffic effectively and in
an easier way. The goal of the experiments
was to evaluate and compare the performance
of the ANN prediction approaches presented
earlier in this paper. The linear regression
model is a powerful tool for analyzing the
association between one or more independent
variables and a single dependent variable.
Some novice researchers wish to move quickly
beyond this model and learn to use more
sophisticated models because they get
discouraged about its limitations and believe
that other regression models are more
appropriate for their analysis needs.
References
[1] Wireshark Homepage, http:// www. ,
[2] ClearSight Networks, Inc. Homepage, 2008.
[3] /multiple-regression-using-spss-statistics.php
[4] _ correlation
[6] WildPackets, Inc. Homepage, http: //www., 2008.
[7] S. Waldbusser, "Remote Network Monitoring Management Information Base," RFC 2819 (Standard), May 2000.
[8] T. Masters, Practical Neural Network Recipes in C++. Preparing Input Data (C-16), Academic Press, Inc., pp. 253-270, (1993).
[9] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice-Hall of India, Second Edition. Statistical Learning Methods (C-20), pp. 712-762.
[10] T. Masters, Neural, Novel & Hybrid Algorithms for Time Series Prediction. Neural Network Tools (C-10), John Wiley & Sons Inc., pp. 367-374, (1995).
[11] T. Masters, Signal and Image Processing With Neural Networks. Data Preparation for Neural Networks (C-3), John Wiley & Sons Inc., pp. 61-80, (1994).
[12] T. Masters, Advanced Algorithms for Neural Networks. Assessing Generalization Ability (C-9), John Wiley & Sons Inc., pp. 335-380, (1995).
[13] R. D. Reed and R. J. Marks II, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks. Factors Influencing Generalization (C-14), The MIT Press, pp. 239-256, (1999).
[14] _Lab