
Network Traffic Classification

SRI JAYACHAMARAJENDRA COLLEGE OF ENGINEERING
Mysore - 570 006
DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING

Network Traffic Classification and Analysis

by,

AKSHATA GADAG 4JC13IS004
MANASA B 4JC13IS037
N DEEPTHI 4JC13IS038

GUIDED BY
Mr. Manju N
Lecturer in
Information Science and Engineering Department,
SJCE,
Mysore.


Table of Contents
1. Abstract
2. Introduction
3. Literature Survey
4. Problem Definition
5. Methodology
6. References


1. ABSTRACT

Network traffic classification has become an important part of cloud computing. With the
growth of cloud and mobile computing, users have gradually adopted a wide variety of mobile
and cloud applications, and a huge amount of data now travels through the network. This has
increased the demand for classifying the traffic on the network. Network traffic
classification is the basis for network Quality of Service management, security and intrusion
detection. It can be used to identify the different applications and protocols present in a
network, and classified traffic supports actions such as monitoring, discovery, control and
optimization. The overall goal of network traffic classification is to improve network
performance. Recently, many new machine learning algorithms have been proposed for analyzing
network traffic and building traffic classifiers. This project deals with one approach to
network traffic analysis in cloud computing. In the cloud, the network traffic data are
collected and stored in the cloud database, and a machine learning system is built on them.
The traffic data are sent to a classification machine or a clustering machine, depending on
whether they are labeled or unlabeled, and the machine assigns them to different applications.
The approach we have considered is a semi-supervised classification method, which allows
classifiers to be designed from training data that consists of only a few labeled and many
unlabeled traffic flows. We use the K-means clustering algorithm for this purpose.


2. INTRODUCTION

Accurate identification and categorization of network traffic according to application type is
an important element of many network management tasks such as flow prioritization, traffic
shaping/policing, and diagnostic monitoring. Network traffic classification is the process of
analyzing network traffic flows and classifying them, mainly on the basis of protocols such as
TCP, UDP, IMAP or POP3, or of applications such as games, messengers or news services. Network
traffic classification plays an important role in cloud computing. It helps to improve Quality
of Service and security. It also helps to segregate different types of traffic so that they
can be stored and managed efficiently. Different applications are stored at different
locations in the cloud so that the data can be accessed easily. In the cloud, all the network
traffic data are collected and fed to a classifier or clustering model. These models classify
the traffic into different applications, which makes it easier to store the traffic in the
cloud database and to manage the cloud. Hence, the demand for network traffic classification
grows as the demand for cloud computing grows.

Earlier, the port-based classification method was used. This was an effective method at the
time, as there were very few applications and each application used a single well-known port
number. As the number of applications increased, it became difficult to assign a port number
to each application. Hence dynamic port numbers came into use, which led to the decline of
this method.
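
To illustrate why the port-based method broke down, here is a minimal Python sketch of a port lookup. The port table is a small illustrative subset of well-known assignments, and the function name classify_by_port is hypothetical; it is not part of the project's tooling.

```python
# Illustrative subset of well-known port assignments; not an exhaustive table.
WELL_KNOWN_PORTS = {
    25: "SMTP (mail)",
    53: "DNS",
    80: "HTTP (web)",
    110: "POP3 (mail)",
    143: "IMAP (mail)",
    443: "HTTPS (web)",
}


def classify_by_port(src_port, dst_port):
    """Return an application label if either port is well known, else 'unknown'."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"


print(classify_by_port(51234, 25))     # destination is a well-known port
print(classify_by_port(51234, 49152))  # both ports are in the dynamic range
```

A flow whose ports both fall in the dynamic range comes back as unknown, which is the failure case that ended the usefulness of this method.
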
Next came payload-based classification, in which the whole payload was analyzed to determine
whether it contained characteristic signatures of known applications, and the traffic was
classified on that basis. This method had several disadvantages. Analyzing the complete
payload causes a lot of overhead, so these techniques typically require increased processing
and storage capacity. Looking into the payload also breaches the privacy of the sender and
hence raises ethical concerns.
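
As a rough illustration of signature matching, the sketch below checks the first bytes of a payload against a few known prefixes. The signature list is a tiny hand-picked sample (the BitTorrent handshake prefix, the SSH version banner and two HTTP request methods), and the function name is hypothetical; real payload inspection systems use far larger signature sets.

```python
# (application, payload prefix) pairs; a deliberately small sample.
SIGNATURES = [
    ("BitTorrent", b"\x13BitTorrent protocol"),
    ("SSH", b"SSH-"),
    ("HTTP", b"GET "),
    ("HTTP", b"POST "),
]


def classify_by_payload(payload: bytes) -> str:
    """Return the first application whose signature prefixes the payload."""
    for application, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return application
    return "unknown"


print(classify_by_payload(b"GET /index.html HTTP/1.1\r\n"))
print(classify_by_payload(b"\x17\x03\x03\x00\x20\x8a\x41"))  # opaque bytes
```

Every payload has to be read and compared against every signature, which is where the processing overhead comes from, and a payload with no readable prefix cannot be matched at all.
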
The limitations of port-based and payload-based analysis have motivated the use of
transport-layer statistics for traffic classification. These techniques rely on the fact that
different applications typically have distinct behavior patterns when communicating on a
network, and they use a machine learning approach. The machine learning approach can be
supervised, unsupervised or semi-supervised. In our project we use a semi-supervised machine
learning approach.
In this project, we propose a method in which the protocols in use serve to classify different
applications. As we are using a semi-supervised machine learning approach, the dataset
consists of some labeled and some unlabeled samples. The model has to identify the classes of
the unlabeled samples based on what it learns from the labeled samples. For this we use the
Simple K-means clustering algorithm. This algorithm forms clusters of similar samples, each of
which may contain both labeled and unlabeled samples. It then predicts the application types
of the unlabeled samples based on the cluster they belong to. The model also evaluates the
accuracy of the algorithm in predicting the type of the application. The methodology used is
given in Section 5 of this report.


3. LITERATURE SURVEY

The following are the papers that we studied to decide on the topic and to get ideas about how
the project can be done. They helped us to understand the problem definition well and to come
up with an appropriate approach. After studying these papers, we chose a machine learning
approach, namely semi-supervised learning using the K-means algorithm. We considered the
advantages and disadvantages of the various methods before arriving at this approach.

Internet Traffic classification Methods By Indra Bhan Arya and Rachna Mishra

This paper focuses on different Internet traffic classification methods. The port-based
classification method links a well-known port number with a specific application. This method
is ineffective because many recently developed applications do not communicate on
standardized ports. The other method is Deep Packet Inspection. In this approach, the packet
payloads are analyzed to see whether or not they contain characteristic signatures of known
applications. This technique can be extremely accurate when the payload is not encrypted, but
it has certain limitations. First, it only identifies traffic for which signatures are
available. Second, packet inspection fails if the application uses encryption.

Machine learning is one of the promising approaches for traffic classification. Several
approaches have evolved to date: supervised, unsupervised and semi-supervised. The method in
which the training data is labeled beforehand is called supervised learning. Labeled data
means input for which the class it belongs to is known. The method in which the training data
is unlabeled is called unsupervised learning. An unlabeled dataset is one for which the class
of each instance is unknown and has to be determined. Traffic classification uses features, a
set of attributes of each instance, to determine the class. The class is a special attribute
of each instance that records the outcome for that instance. Feature selection algorithms play
an important role for ML algorithms: they not only reduce the feature set but also improve
computational performance and classification accuracy.

Internet Traffic Classification Using Clustering on Semi supervised Data By Sheetal S.
Shinde and Sandeep P. Abhang

This paper presents network traffic classification based on a semi-supervised approach. It
has two components, a learner and a classifier. The learner discovers a mapping between flows
and traffic classes from a training data set, and the classifier is then obtained from this
learned mapping. The learner is built using both labelled and unlabelled flows, to show that
unlabelled flows can help make the traffic classification problem easier to handle. The
semi-supervised approach is advantageous in some situations: it is used to build a fast and
accurate classifier. It classifies the given dataset into appropriate classes using the
K-means clustering algorithm. Some of the flow properties used are the maximum or minimum
packet length in each direction, the minimum or maximum packet arrival time, and the minimum
or maximum number of bytes transferred in the forward and backward directions.
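
The flow properties listed above can be computed from the packets of a single flow. The following Python sketch is one possible way to do so; the packet tuple layout, the feature names and the sample values are all hypothetical, and reading "packet arrival time" as the gap between consecutive packets is our assumption rather than something the paper states.

```python
def flow_features(packets):
    """Compute simple per-direction statistics for one flow.

    `packets` is a list of (timestamp_seconds, direction, length_bytes)
    tuples, where direction is "fwd" or "bwd" (an assumed layout).
    """
    features = {}
    for direction in ("fwd", "bwd"):
        lengths = [length for _, d, length in packets if d == direction]
        features[f"{direction}_min_len"] = min(lengths, default=0)
        features[f"{direction}_max_len"] = max(lengths, default=0)
        features[f"{direction}_bytes"] = sum(lengths)
    # Gaps between consecutive packets, regardless of direction.
    times = sorted(t for t, _, _ in packets)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    features["min_gap"] = min(gaps, default=0.0)
    features["max_gap"] = max(gaps, default=0.0)
    return features


# Made-up packets for a single flow.
example_flow = [
    (0.00, "fwd", 60),
    (0.05, "bwd", 1500),
    (0.30, "fwd", 52),
    (0.31, "bwd", 900),
]
print(flow_features(example_flow))
```

Each flow is reduced to one fixed-length row of numbers, which is the form a clustering algorithm such as K-means expects.
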


Traffic Classification using Clustering algorithms By Jeffrey Erman, Martin Arlitt and
Anirban Mahanti

The authors concentrate on semi-supervised learning in their paper and on why it is better
than other approaches. The paper compares different clustering algorithms used in traffic
classification, namely K-Means, DBSCAN and AutoClass. The K-Means algorithm partitions the
objects in a data set into a fixed number K of disjoint subsets. For each cluster, the
partitioning algorithm maximizes the homogeneity within the cluster by minimizing the
squared error. The DBSCAN algorithm is based on the concepts of density reachability and
density connectivity. Density-based algorithms regard clusters as dense areas of objects that
are separated by less dense areas. AutoClass is a probabilistic model-based clustering
algorithm. It allows for the automatic selection of the number of clusters and for soft
clustering of the data. Soft clusters allow the data objects to be fractionally assigned to
more than one cluster.
The results showed that the AutoClass algorithm produces the best overall accuracy.
However, the DBSCAN algorithm has great potential because it places the majority of the
connections in a small subset of the clusters. The overall accuracy of the K-Means algorithm
is only marginally lower than that of the AutoClass algorithm, and K-Means is more suitable
for problems that require a faster model building time. This helped us to choose the K-means
algorithm.
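
To make the K-Means description concrete, the sketch below implements the standard assign-and-update loop in plain Python and reports the squared error that the algorithm reduces. It is an illustrative sketch and not the implementation used in the paper or in Weka; the function names, the explicit initial centroids and the four sample points are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Partition points into len(initial_centroids) disjoint clusters."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the partition is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


def squared_error(points, centroids, assignments):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))


# Made-up (packet length, inter-arrival gap) points forming two obvious groups.
points = [(60, 0.05), (64, 0.04), (1400, 0.30), (1500, 0.28)]
centroids, assignments = k_means(points, [points[0], points[2]])
print(assignments, squared_error(points, centroids, assignments))
```

Features with very different scales, such as bytes and seconds here, would normally be normalized first so that one feature does not dominate the distance.
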

A Survey on recent Traffic classification techniques using machine learning methods by
M. Tamilkili

This paper discusses how to evaluate the performance of machine learning methods. All traffic
classification techniques use some metrics to evaluate their results, and the techniques can
be compared using a criterion known as predictive accuracy. The common counts used are False
Negatives (FN), False Positives (FP), True Negatives (TN) and True Positives (TP). A good
classifier minimizes FN and FP. Other evaluation metrics used are Accuracy, Recall and
Precision. The evaluation metric we are using is accuracy.
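
The metrics named above follow directly from the four counts. The short Python sketch below uses the standard definitions; the function name and the sample counts are made up for illustration and are not results from the paper or from this project.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from the four confusion counts."""
    total = tp + fp + tn + fn
    return {
        # Share of all samples that were classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that really are positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of actual positives that were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for one application class.
metrics = evaluation_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```
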

An Implementation of Network Traffic Classification Technique Based on K-Medoids
by Dheeraj Basant Shukla and Gajendra Singh Chandel

The authors of this paper explain why the machine learning approach is better than the
port-based and payload-based methods. Both the port-based and payload-based methods have their
own disadvantages, which are overcome by the machine learning approach. The machine learning
approach they use is semi-supervised, which helped us to understand the approach better and to
choose it. Their technique is based on the K-medoids algorithm. They explain in their paper
how the semi-supervised approach can be used to solve the problem. As a semi-supervised
dataset contains both labeled and unlabeled samples, the application types of some traffic
have to be identified manually. The classification can then be done based on these labels.

Approaching Real-time Network Traffic Classification By Wei Li, Kaysar Abdin,
Robert Dann and Andrew Moore.

Recent research has explored the feasibility of using machine learning methods to provide
accurate network traffic classification. Accurate real-time traffic classification is of
fundamental importance to network operations and management. It serves as the input for
Intrusion Detection Systems, provides Class-of-Service mapping for Quality of Service
control, and also provides statistics for network monitoring. A real-time network traffic
classification framework based on flow behaviour would theoretically comprise the following
procedures:
Packet Capture: to capture packets from a network interface.
Flow Demultiplexing: to collect and aggregate the packets of each flow into a single flow
object.
Feature Collection: to collect the flow features required for classification from each flow
object.
Classification: to check these flow features against a pre-trained flow model, in order to
predict which application class the flow belongs to.
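
The flow demultiplexing stage can be sketched as grouping packets by their connection endpoints. The Python example below is one possible way to do this and is not taken from the paper; the packet dictionary fields, the function names and the sample addresses are hypothetical.

```python
from collections import defaultdict


def flow_key(packet):
    """Direction-independent key so both directions fall into one flow."""
    a = (packet["src_ip"], packet["src_port"])
    b = (packet["dst_ip"], packet["dst_port"])
    return (packet["protocol"],) + tuple(sorted((a, b)))


def demultiplex(packets):
    """Group captured packets into flows keyed by protocol and endpoints."""
    flows = defaultdict(list)
    for packet in packets:
        flows[flow_key(packet)].append(packet)
    return dict(flows)


# Made-up capture: two packets of one TCP connection and one DNS query.
captured = [
    {"src_ip": "192.0.2.10", "src_port": 51234, "dst_ip": "198.51.100.5",
     "dst_port": 25, "protocol": "TCP", "length": 120},
    {"src_ip": "198.51.100.5", "src_port": 25, "dst_ip": "192.0.2.10",
     "dst_port": 51234, "protocol": "TCP", "length": 98},
    {"src_ip": "192.0.2.10", "src_port": 40000, "dst_ip": "203.0.113.7",
     "dst_port": 53, "protocol": "UDP", "length": 74},
]
flows = demultiplex(captured)
print(len(flows))
```

Sorting the two endpoints before building the key is a design choice that merges the forward and backward directions of a connection into the same flow object, so that per-direction features can be collected from it afterwards.
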

Efficient Flow based Network Traffic Classification using Machine Learning By
Jamuna A and Vinodh Ewards

This paper carries out flow-based traffic classification and compares various Machine
Learning (ML) techniques for IP traffic classification, such as C4.5, Naive Bayes, Nearest
Neighbor and RBF. Of these, the C4.5 Decision Tree gives an accuracy of 93.33% when compared
with the other algorithms. Classifying traffic flows by the applications that generate them
plays an essential role in network security and management, for example in lawful
interception, intrusion detection and Quality of Service control. Conventional traffic
classification methods include port-based prediction methods and payload-based deep
inspection methods. In the current network environment, the conventional methods suffer from
a number of practical problems such as dynamic ports and encrypted applications. The authors
observe that real-time traffic classifiers work under constraints that limit the number and
type of features that can be calculated. On this basis they define 43 flow features that are
simple to compute and are well understood within the networking community. They estimate the
classification accuracy and computational performance of the C4.5, Nearest Neighbor, Bayes
Network and Naive Bayes algorithms using the full set of 43 features and using reduced
feature sets obtained through feature selection.

An Overview of Network Traffic Classification Methods By Zeba Atique Shaikh and
D.G. Harkut

This paper gives an overview of the available network traffic classification methods and
techniques. The goal of network traffic classification is to improve network performance.
Once packets are classified as belonging to a particular application, they are marked. These
markings or flags help the router determine the appropriate service policies to apply to
those flows. Classification is achieved by various means. The first approach uses port
numbers. This method is fast and consumes few resources, and it is supported by many network
devices. It does not inspect the application-layer payload, so it does not compromise the
users' privacy. However, it is useful only for applications and services that use fixed port
numbers, so it is easy to evade by changing the port number in the system. The second
approach is Deep Packet Inspection, which inspects the actual payload of the packet. It
detects applications and services regardless of the port number on which they operate. The
third approach is statistical classification, which relies on statistical analysis of
attributes such as byte frequencies, packet sizes and packet inter-arrival times. It often
uses machine learning algorithms such as K-Means, Naive Bayes Filter, C4.5, C5.0, J48 or
Random Forest.

Implementation of Network Traffic Classification by Using MLA By Miss. Ghodake
Shubhangi, Miss. Raut Sarika, Miss. Ghuge Shital.

Network traffic classification is a challenging task in high-speed networks. Network
monitoring is required for quality of service and analysis, and it therefore generates
network traffic. The existing systems have some drawbacks; to overcome them, the authors
developed a system that classifies network traffic using a machine learning algorithm. Based
on the traffic information generated by clients, they constructed a boosted classifier with
high accuracy. The system is used to classify applications such as FTP, Skype and TCP. To
construct the C5.0 classifier, a unique dataset and training set have to be provided to the
algorithm. We have used semi-supervised K-means classification, where a trained data set is
not required; this is the advantage of using K-means classification.

Above the Clouds: A Berkeley View of Cloud Computing

This paper helped us to understand the importance of network traffic classification in cloud
computing. Cloud Computing has the potential to transform a large part of the IT industry,
making software even more attractive as a service and shaping the way IT hardware is
designed and purchased. Cloud Computing refers to both the applications delivered as
services over the Internet and the hardware and systems software in the datacenters that
provide those services. The services themselves have long been referred to as Software as a
Service (SaaS). The datacenter hardware and software is what we will call a Cloud. When a
Cloud is made available in a pay-as-you-go manner to the general public, we call it a Public
Cloud; the service being sold is Utility Computing. We use the term Private Cloud to refer to
internal datacenters of a business or other organization, not made available to the general
public. Thus, Cloud Computing is the sum of SaaS and Utility Computing, but does not
include Private Clouds. People can be users or providers of SaaS, or users or providers of
Utility Computing. Service providers enjoy greatly simplified software installation and
maintenance and centralized control over versioning; end users can access the service
anytime, anywhere, share data and collaborate more easily, and keep their data stored
safely in the infrastructure. The paper also lists obstacles and opportunities for Cloud
Computing, such as Availability of Service, Data Confidentiality and Auditability,
Performance Unpredictability, Scalable Storage and Bugs in Large-Scale Distributed
Systems.

4. PROBLEM DEFINITION

The problem that we are considering is network traffic classification for better management
of the cloud. Classifying network traffic helps to improve Quality of Service, provides more
security and helps in maintaining the database in the cloud. Classifying network traffic
means identifying the different applications whose traffic travels through the network. For
this we are considering a machine learning approach using semi-supervised learning, for which
K-means suits the problem statement well. The goal of the model is to learn the applications
whose traffic travels through the network and to predict the applications of unknown traffic.


5. METHODOLOGY

Our proposed system is based on a clustering and assignment technique that uses
semi-supervised machine learning to analyze and classify both labeled and unlabeled network
traffic. The following are the steps that we follow to build the model:

Step 1: Installing the Software

Install the Wireshark and Weka software. Wireshark is used to capture the network packets,
and Weka contains many machine learning algorithms, which are used for building the model.

Step 2: Capturing network packets using Wireshark

Using Wireshark, start capturing the network packets. This can easily be done by opening, in
our browser, the applications that we require for building the model. Wireshark gives the
time of collection, source IP, destination IP, protocol used and length of the payload in
bytes, along with the payload. We can also preprocess the dataset here to remove unwanted
packets. The captured file can be converted to a CSV file.
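
Once the capture is available as CSV, it can be read into records for further processing. The Python sketch below assumes the column names of a default Wireshark CSV export (No., Time, Source, Destination, Protocol, Length, Info); a capture exported with a different column profile would need different keys. The sample rows and the function name load_packets are made up for the example.

```python
import csv
import io

# Made-up rows in the shape of a default Wireshark CSV export.
SAMPLE_EXPORT = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","192.0.2.10","198.51.100.5","SMTP","120","C: EHLO client.example"
"2","0.051200","198.51.100.5","192.0.2.10","SMTP","98","S: 250 OK"
"3","0.300100","192.0.2.10","203.0.113.7","DNS","74","Standard query A example.org"
'''


def load_packets(csv_file):
    """Read exported packets into dictionaries with typed numeric fields."""
    packets = []
    for row in csv.DictReader(csv_file):
        packets.append({
            "time": float(row["Time"]),
            "source": row["Source"],
            "destination": row["Destination"],
            "protocol": row["Protocol"],
            "length": int(row["Length"]),
        })
    return packets


packets = load_packets(io.StringIO(SAMPLE_EXPORT))
print(len(packets), packets[0]["protocol"])
```

For a real capture, the io.StringIO object would be replaced by a file opened with open(path, newline="").
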

Step 3: Labeling some samples

We use the Weka software for the further steps. As we are using a semi-supervised machine
learning approach, some of the samples in the dataset have to be labeled. This is done
manually, based on the protocols used. For example, if the protocol used is SMTP, then the
application is a mail application. Similarly, if the BitTorrent protocol is used, then it is
a peer-to-peer application. A few of the samples are left unlabeled; these are to be
predicted by the model.
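
The labeling rule can be written down as a lookup from protocol to application type. The Python sketch below encodes only the two examples given above and leaves every other sample unlabeled; in the project the labeling is done manually, so this automated form, the record layout and the function name are assumptions made for illustration.

```python
# Only the two mappings mentioned in the text; extend as needed.
PROTOCOL_TO_APPLICATION = {
    "SMTP": "mail",
    "BitTorrent": "peer-to-peer",
}


def label_samples(samples):
    """Return copies of the samples with a 'label' key; None means unlabeled."""
    return [
        {**sample, "label": PROTOCOL_TO_APPLICATION.get(sample["protocol"])}
        for sample in samples
    ]


# Made-up samples; the TCP one has no rule and stays unlabeled.
samples = [
    {"protocol": "SMTP", "length": 120},
    {"protocol": "BitTorrent", "length": 1400},
    {"protocol": "TCP", "length": 60},
]
labeled = label_samples(samples)
print([s["label"] for s in labeled])
```
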

Step 4: Building the model

The machine learning algorithm we are using is the Simple K-means algorithm, which forms
clusters based on the similarities between samples. The elements of each cluster are more
similar to one another than to the elements of other clusters. A few elements in each cluster
may be unlabelled. On running the Simple K-means algorithm on the dataset, we get the
application names of the unlabeled samples.
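
One way to turn clusters into application names is to give each cluster the label that is most common among its labeled members and then pass that label on to its unlabeled members. The Python sketch below assumes this majority-vote rule and assumes that cluster assignments are already available from a K-means run; the report does not specify the mapping rule, and the function names and sample data are hypothetical.

```python
from collections import Counter


def label_clusters(assignments, labels):
    """Map each cluster id to the most common known label among its members."""
    votes = {}
    for cluster, label in zip(assignments, labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counter.most_common(1)[0][0] for cluster, counter in votes.items()}


def predict_unlabeled(assignments, labels):
    """Fill in missing labels from the cluster each sample belongs to."""
    cluster_label = label_clusters(assignments, labels)
    return [
        label if label is not None else cluster_label.get(cluster, "unknown")
        for cluster, label in zip(assignments, labels)
    ]


# Made-up cluster ids and partial labels (None marks an unlabeled sample).
assignments = [0, 0, 0, 1, 1, 2]
labels = ["mail", None, "mail", "peer-to-peer", None, None]
print(predict_unlabeled(assignments, labels))
```

A cluster that contains no labeled sample at all, such as cluster 2 here, cannot be named and is reported as unknown.
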

Step 5: Evaluation of the model

The built model has to be checked to see whether it gives correct results for all the inputs.
The accuracy of the algorithm has to be calculated. This will help us to understand the model
better and improve it if required.
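
A simple way to calculate the accuracy is to hide the labels of some labeled samples, let the model predict them, and count how many predictions match. The Python sketch below assumes this hold-out procedure, which the report does not prescribe; the function name and the sample labels are made up.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the held-out true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical held-out labels and the model's predictions for them.
held_out_truth = ["mail", "peer-to-peer", "mail", "web"]
model_output = ["mail", "peer-to-peer", "web", "web"]
print(accuracy(model_output, held_out_truth))
```
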
These steps can be used to classify the traffic that enters the cloud, so that managing and
maintaining the cloud becomes easy.


6. REFERENCES

1. Approaching Real-time Network Traffic Classification - Wei Li, Kaysar Abdin,
Robert Dann and Andrew Moore.
2. Efficient Flow based Network Traffic Classification using Machine Learning -
Jamuna A. and Vinodh Ewards S.E.
3. An Overview of Network Traffic Classification Methods - Zeba Atique Shaikh and
D.G. Harkut.
4. Implementation of Network Traffic Classification by Using MLA - Ghodake
Shubhangi, Raut Sarika and Ghuge Shital.
5. Traffic Classification using Clustering Algorithms - Jeffrey Erman, Martin Arlitt and
Anirban Mahanti.
6. Internet Traffic Classification: An Enhancement in performance using Classifiers
combination - Indra Bhan Arya and Rachna Mishra.
7. Evaluation of Traffic Classification Techniques using Machine Learning methods -
M. Tamilkili.
8. An Implementation of Network Traffic Classification Technique Based on
K-Medoids - Dheeraj Basant Shukla and Gajendra Singh Chandel.
9. Above the Clouds: A Berkeley View of Cloud Computing - Michael Armbrust
and others.
10. https://en.wikipedia.org/wiki/Wireshark
11. https://en.wikipedia.org/wiki/Weka_(machine_learning)
