
2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies

A Traffic Classification Method with Spectral Clustering in SDN
Peng Xiao1*, Na Liu1, Yuanyuan Li2, Ying Lu1, Xiao-jun Tang1, Hai-wen Wang1, Ming-xia Li1
1 School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, China
2 College of Software Technology, Dalian Jiaotong University, Dalian 116052, China
* Email: forkp@126.com

Abstract: Traffic classification is becoming one of the major applications in data center networks that host many cloud services. Recent work on software defined networking (SDN) has found new ways to manage data center networks. However, as the imbalance between elephant and mice flows sharpens, the accuracy and efficiency of traffic classification have become increasingly important in SDN management. To address this issue, we propose a traffic classification method for SDN that is based on spectral clustering. We propose a real-time flow extraction and representation method that scans the flow tables in the SDN controller, and we then cluster the flow data with spectral analysis. Extensive experiments under different settings show that our method classifies traffic with high detection rates and low overhead.

Keywords: SDN; Traffic Classification; Spectral Clustering; Flow Tables

I. INTRODUCTION

In recent years, traffic classification [1] has become an active research field that provides important technical support for traffic engineering, security detection and network management. With the development of cloud computing [2] and Software-Defined Networking [3], this problem is becoming more and more serious in the cloud data center [4,5]. The marriage of cloud computing and the data center brings new challenges; in particular, the traffic engineering of cloud data centers will be deeply affected by cloud services. As an important part of traffic engineering, traffic classification plays an important role in data center network management. Thus, the most basic but essential task in traffic engineering research is the classification problem: traffic classification is the first stage of traffic engineering.

Many concepts and systems have been developed for classifying traffic [6-11]. However, these methods all focus on machine learning and detect flows based on flow statistics or training data. When a flood of IP packets arrives at the border router of a data center, the statistics-based methods in [6-11] cannot classify traffic in real-time. Moreover, their classification accuracy relies too heavily on the training data set. A data center runs many applications whose sample data are hard to collect and train on, which directly affects the classification accuracy. To address these bottlenecks, some machine learning methods that need no training data have been proposed for traffic classification [12, 13].

Enriquez, et al. [12] are the first to use the spectral method to partition a specified amount of National Airspace (NAS) traffic data into flows, which is a new application of the spectral idea in flow detection. However, the authors limit themselves to deriving nominal paths for NAS traffic. Zhou W, et al. [13] present a new classification mechanism that uses spectral clustering. The spectral clustering method is not limited by dynamic ports, encrypted transmission, and so on. Using graph theory, traffic classification is converted into a problem of multi-way graph partitioning. However, data center traffic is heavy and hard to aggregate into flows in real-time, so new methods are needed to solve this problem. Flow-based methods need a new, powerful framework to solve the real-time statistics and flow representation problems.

All of the research in [6-13] proposes methods to classify traffic. However, two bottlenecks make these methods difficult to apply to data center traffic. First, the conflict between large-scale traffic and real-time classification is irreconcilable for these methods. In data centers, traffic arrives very fast, and the number of packets that must be aggregated into flows is huge. How can flow statistics be collected efficiently and in real-time? How can a flow be represented simply and quickly? These problems need to be addressed, especially for the cloud data center network. Second, the majority of existing detection methods depend too much on training data, which harms the classification accuracy because sample training data from a data center network cannot be used under all conditions. As a result, we should seek a new method based on the traffic data itself rather than on training data.

The major contributions of this paper are:

(1) We propose a real-time traffic classification system that provides higher efficiency for online classification. By analyzing the relationship between SDN and flow classification in the data center, we propose a classification framework in SDN. The system can classify flows quickly and efficiently.

(2) We define a flow representation that covers not only port information but also statistical information. Our aim in this work is to study and implement the spectral clustering algorithm for the data center. By building on the SDN platform, the efficiency of traffic classification is much higher.

(3) We evaluate the performance of this classification system with Floodlight [14] and Mininet [15]. The performance evaluations show that our method can classify flows effectively.

978-1-5090-5081-9/16 $31.00 © 2016 IEEE
DOI 10.1109/PDCAT.2016.88
The rest of the paper is organized as follows. In Section II, we review related work and introduce the existing methods. In Section III, we give an overview of the traffic classification system and propose a classification framework based on SDN. We discuss the spectral clustering method in Section IV. Our method is implemented and evaluated in Section V. Finally, we conclude our work in Section VI.

II. RELATED WORK

Currently, there is much research on traffic classification. Auld T, et al. [6] introduce the Bayesian Neural Network into internet traffic classification. They describe the machine learning methods and represent flows by statistical information. Este A, et al. [7] propose SVM-based classifiers to classify flows. When Support Vector Machines are applied to the problem of traffic classification in IP networks, the classifier can perform correctly with as little training as a few hundred samples. To meet the requirements of traffic classification in large networks, Jin Y, et al. [8] present a modular machine learning system, which is also trained on a dataset of flow-level data. Jaiswal, et al. [9] summarize most machine learning methods and compare their performance. These methods are based on flow-level data and a training data set, so they are hard to apply in online classification, and the training data are difficult to obtain. To classify traffic in real-time, Finsterbusch M, et al. [10] present a survey that thoroughly analyzes the most important open-source DPI modules. DPI can classify traffic in real-time, but it cannot achieve high accuracy. To address these problems, Wang Y, et al. [11] propose a constrained clustering scheme that makes decisions with consideration of some background information in addition to the observed traffic statistics, which improves the accuracy of traffic clustering. However, these proposed methods are mostly based on training data, and most of them can hardly be applied in online classification. Moreover, their classification accuracy is far from satisfactory.

As more and more studies apply classic clustering algorithms such as K-means to traffic classification, Enriquez, et al. [12] and Zhou W, et al. [13] present classification mechanisms that use spectral clustering, by which traffic classification is converted into a problem of multi-way graph partitioning. Because of the limits of the processing platform, however, real-time traffic classification is hard to achieve. Moreover, it is important to notice that successful classification depends not only on the analysis method but also on its framework. To achieve accurate and timely traffic classification, many research works use the SDN platform.

Recently, software defined networking (SDN) [16] has been increasingly employed in data centers, where it provides globally optimal management of network resources and flow-level control of network traffic. Farhady H, et al. [17] propose a data plane mechanism for classification in SDN. They modify and improve current SDN APIs to get user-defined actions. However, such methods achieve their goal through a set of modifications to the SDN platform, which is difficult for most users. Ng B, et al. [18] analyze the SDN technique from the perspective of traffic classification. They outline the practical experiences and lessons learned while developing an SDN based traffic classification platform for an enterprise network. Silva, et al. [19] discuss the identification and selection of flow features for accurate traffic classification in SDN, and introduce a framework to select flow features in OpenFlow-based networks. As these recent studies show, SDN brings new opportunities to classify traffic. Nevertheless, there is a contradictory relationship between SDN and traffic classification. The software-based traffic analysis in SDN makes it easier to select flow features. Thus, SDN can be utilized to overcome the difficulty and effectively collect flow features.

III. SYSTEM DESIGN

In this section we briefly introduce the framework of traffic classification for SDN, and propose a classification system to effectively solve the traffic classification problem in the data center.

Fig. 1 shows the proposed real-time classification system. Our classification system consists of two modules, the Flow Collector and the Spectral Classifier, which may or may not run on the same controller host.

Fig. 1 System Model

In the Flow Collector module, the application scans flow tables from the SDN controllers and collects traffic flows by IP header inspection in real-time. A flow consists of IP packets having the same five-tuple {src_ip, src_port, dst_ip, dst_port, protocol}. In the OpenFlow protocol, a flow table is represented by a set of structures, as shown in Fig. 2. As a key SDN implementation today, OpenFlow offers not only native flow features, such as IP and port information in match fields, but also a set of statistical features, such as byteCount, packetCount and durationSeconds. We can extract features from flow tables by Json [20] and represent each flow by a set of features that describe intrinsic traffic profiles.

Fig. 2 The Architecture of Flow Table

The Spectral Classifier module gets the data in flow representation and clusters the data by the spectral method. After
receiving the flows from the Flow Collector module, the Spectral Classifier module calculates the similarity between each pair of flows. Then the similarity matrix and the Laplacian matrix, which are essential to spectral clustering, are constructed. The classification is then made by spectral analysis. The classifier clusters the flows using only the information about the flows themselves and determines the groups without training data. Finally, the Spectral Classifier module returns the classification results to the controller in real-time.

IV. SPECTRAL CLUSTERING

In recent years, spectral clustering has become one of the most popular clustering methods, and it outperforms traditional clustering methods such as K-means. Recent advances in spectral clustering applications focus on network community detection [21], SDN controller placement [22], and wireless sensor networks [23]. Spectral clustering is simple to implement and has been suggested for traffic classification [12, 13]. In the cloud data center, huge traffic flows need to be collected and classified in real-time, so the classification algorithm must be highly efficient. Machine learning methods that rely on training data are not suitable for classifying flows at high speed. Spectral clustering converts traffic classification into a problem of multi-way graph partitioning, which takes advantage of the similarity within the data instead of training data.

We use spectral analysis to cluster the flows for data center traffic classification. Our method is based on scanning SDN flow tables, which simplifies the flow extraction process and reduces the system overhead. A flow set S = {f_1, f_2, ..., f_n} of n elements representing the network traffic is extracted from the flow tables.

For each flow f ∈ S, the features of f can be defined as ipv4_src, ipv4_dst, byteCount, durationSeconds, etc., which can be obtained by Json from the SDN flow tables. To calculate the similarity between flows in S, we use the Euclidean distance to weight the similarity, as shown in Eq. (1).

$d(f_i, f_j) = \lVert f_i - f_j \rVert$    (1)

The whole algorithm is outlined in Algorithm 1 as follows:

Algorithm 1. Spectral Classification Algorithm()
INPUT: Number k of clusters.
• Scan the flow tables from the SDN controller.
• Extract flow features and represent the flows.
• Construct a similarity graph A with Eq. (2).
• Compute the degree matrix D by $D(i, i) = \sum_j A_{ij}$.
• Compute the Laplacian matrix L by $L = D - A$.
• Compute the first k eigenvectors $v_1, \dots, v_k$ of L.
• Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $v_1, \dots, v_k$ as columns.
• for (i = 1; i <= n; i++) do let $f_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of V.
• Cluster the flows $(f_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$.
• return $C_1, \dots, C_k$.
OUTPUT: Clusters $C_1, \dots, C_k$.

V. EXPERIMENTS

In this section, we present our experimental setup and describe the experimental results of our method compared with the others.

A. Testbed and data sets

Our methods are implemented in Java and run on our data center, which provides Floodlight [14] and Mininet [15]. The data center consists of 36 machines running 64-bit Ubuntu Server. Each node has two AMD Opteron 2212 2.00 GHz CPUs, an 80 GB SCSI HDD, 8 GB of RAM, and an Intel 100 Mbps Ethernet controller. Our programs are based on Json [20] and the Floodlight API. To verify the effectiveness and availability of our methods, the experiments are conducted on the following data sets: (1) The wide data set, which is obtained from the WIDE trace [24]. The data set comes from a daily trace at a trans-Pacific line (150 Mbps link) and contains many application flows. (2) The data center data set, which is a full-payload traffic data set we collected at a 100 Mbps edge link of the data center mentioned above.

B. Performance evaluation

To evaluate the influence of the flow class, we test spectral clustering and K-means classification on the wide data set. We select k = 6 as the number of clusters because the wide data set has six flow classes. We compare the class accuracy (precision) and class recall of the two methods. For example, for a search for one type of flow over a set of flows, class precision is the number of correct results divided by the number of all returned results, and class recall is the number of correct results divided by the number of results that should have been returned.

As shown in Fig. 3 and Fig. 4, the classification results are greatly affected by the number of flows. The http, smtp and dns protocols make up the majority of the wide data set, so their accuracy is higher than that of the others. Furthermore, the class recall of some protocols drops sharply when the class has few flows, which is most obvious for ssh and ssl2. As expected, our spectral clustering method significantly outperforms the K-means method and improves the classification accuracy. From Fig. 4, we can see that the p2p protocol severely impacts the recall. Different p2p applications differ greatly from one another, so it is difficult to classify the behavior characteristics of p2p, which leads to recognition errors in the K-means cluster analysis. As a result, it is necessary to apply spectral analysis to traffic classification, especially for small protocols.

To compare the classification performance of our method with the other approaches, we built experiments on the data center data set with Floodlight and Mininet. We simulate the traffic with iperf tools and scan the flow tables from Floodlight.

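The flow representation step (scanning flow tables and extracting features by Json) can be illustrated with a minimal Python sketch. It assumes the flow entries have already been fetched from the controller as a JSON array; the entry layout and the helper names `flow_to_vector` and `extract_features` are hypothetical, and only the field names (ipv4_src, ipv4_dst, byteCount, packetCount, durationSeconds) come from the text. The sketch does not reproduce the Floodlight API. Because the text does not state how address fields are encoded numerically, they are kept here as a key that identifies the flow, and the statistical fields form the feature vector.

```python
import json

# Hypothetical flow-table dump; only the field names are taken from the paper.
SAMPLE_JSON = """
[
  {"ipv4_src": "10.0.0.1", "ipv4_dst": "10.0.0.2",
   "byteCount": 1500, "packetCount": 10, "durationSeconds": 5},
  {"ipv4_src": "10.0.0.3", "ipv4_dst": "10.0.0.2",
   "byteCount": 9000000, "packetCount": 6000, "durationSeconds": 60}
]
"""


def flow_to_vector(entry):
    """Map one decoded flow entry to (flow key, numeric feature vector)."""
    key = (entry["ipv4_src"], entry["ipv4_dst"])
    byte_count = float(entry["byteCount"])
    packet_count = float(entry["packetCount"])
    # Clamp the duration so that very short flows do not divide by zero.
    duration = max(float(entry["durationSeconds"]), 1.0)
    # The last component is a derived feature (mean bytes per second);
    # it is an illustrative assumption, not a feature named in the paper.
    return key, [byte_count, packet_count, duration, byte_count / duration]


def extract_features(json_text):
    """Decode a JSON array of flow entries into a list of (key, vector)."""
    return [flow_to_vector(entry) for entry in json.loads(json_text)]


features = extract_features(SAMPLE_JSON)
for key, vector in features:
    print(key, vector)
```

In this toy input the second entry has far larger byte and packet counts than the first, which is the kind of elephant versus mice contrast the statistical features are meant to expose.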

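The steps of Algorithm 1 in Section IV can be sketched in Python with NumPy (the paper's own implementation is in Java and is not shown). Several choices below are assumptions, since the text does not specify them: the Euclidean distance of Eq. (1) is turned into a similarity with a Gaussian kernel of width `sigma`, the "first k eigenvectors" are taken to be those of the k smallest eigenvalues of the unnormalized Laplacian L = D - A, and the k-means step uses a simple deterministic farthest-point initialization. The function names are hypothetical.

```python
import numpy as np


def similarity_matrix(X, sigma=1.0):
    """Similarity graph A built from pairwise Euclidean distances (Eq. (1))."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Assumption: a Gaussian kernel maps distance to similarity.
    A = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_classify(X, k, sigma=1.0):
    """Return one cluster label per flow, following the steps of Algorithm 1."""
    A = similarity_matrix(X, sigma)
    D = np.diag(A.sum(axis=1))        # D(i, i) = sum_j A_ij
    L = D - A                         # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    V = eigvecs[:, :k]                # first k eigenvectors as columns
    return kmeans(V, k)               # cluster the rows of V


# Toy input: two well-separated groups of "flows" in a 2-D feature space.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = spectral_classify(X, k=2)
clusters = [np.flatnonzero(labels == c) for c in range(2)]
print(labels, clusters)
```

Real flow features such as byteCount and durationSeconds live on very different scales, so some form of feature scaling before computing distances would probably be needed; that step is left out here because the paper does not describe it.
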
Fig. 3 The Class Accuracy on the Wide Data Set
Fig. 4 The Class Recall on the Wide Data Set
Fig. 5 The Comparison of Classification Time

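The class precision and class recall reported in Fig. 3 and Fig. 4 follow the definitions in Section V-B and can be computed as in the short Python sketch below. It assumes that each cluster has already been mapped to a protocol name (the mapping step is not described in the paper), and the labels in the example are made up for illustration; they are not taken from the wide data set.

```python
from collections import Counter


def class_precision_recall(true_labels, predicted_labels):
    """Per-class (precision, recall) from two equally long label sequences."""
    # Correct results per class: flows whose predicted class matches the truth.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    returned = Counter(predicted_labels)   # all returned results per class
    expected = Counter(true_labels)        # results that should be returned
    scores = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        precision = correct[cls] / returned[cls] if returned[cls] else 0.0
        recall = correct[cls] / expected[cls] if expected[cls] else 0.0
        scores[cls] = (precision, recall)
    return scores


# Made-up example labels, for illustration only.
truth = ["http", "http", "http", "dns", "dns", "ssh"]
pred = ["http", "http", "dns", "dns", "dns", "http"]
scores = class_precision_recall(truth, pred)
for cls, (precision, recall) in scores.items():
    print(f"{cls}: precision={precision:.2f} recall={recall:.2f}")
```

The single ssh flow in this example is misclassified, so its recall is zero, which mirrors the observation that classes with few flows are the most sensitive to errors.
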
Fig. 5 shows the classification time of the two competing methods. The K-means method takes more time than the spectral clustering method because it needs to spend more time calculating cluster centers. As the number of flows increases, the gap between the K-means method and the spectral clustering method gradually widens. The classification time of the K-means method rises sharply when the number of flows increases from 20,000 to 30,000, because the time for calculating cluster centers in K-means is very high, and here the spectral clustering method has a clear advantage over K-means. As shown in Fig. 5, our spectral method spends less time in classifying 30,000 flows, which is fast enough to classify traffic in real-time for the data center.

VI. CONCLUSION

In this paper, we present a traffic classification method based on the spectral idea for the cloud data center. Our approach builds on SDN and uses spectral clustering. To maximize the classification accuracy and minimize the cost of flow statistics, we presented a real-time traffic classification system with SDN. Finally, the experimental results show that our method is good at solving the traffic classification problem. In future work, we expect to extend our analysis to Internet traffic.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 61402069) and the General Project of the Liaoning Provincial Department of Education (No. L2015047, No. L2015092).

References

[1] Zhang J, Chen X, Xiang Y, et al. Robust network traffic classification. IEEE/ACM Transactions on Networking, 2015, 23(4):1257-1270.
[2] Li Y, et al. Efficient subspace skyline query based on user preference using MapReduce. Ad Hoc Networks, 2015, 35:105-115.
[3] Nunes B, et al. A survey of software-defined networking: Past, present, and future of programmable networks. IEEE Communications Surveys & Tutorials, 2014, 16(3):1617-1634.
[4] Wang H, Tseng K K, Pan J S. A novel statistical automaton for network cloud traffic classification. International Conference on Information Security and Intelligence Control, 2012:49-52.
[5] Dainotti A, Pescapè A, Claffy K C. Issues and future directions in traffic classification. IEEE Network, 2012, 26(1):35-40.
[6] Auld T, Moore A W, Gull S F. Bayesian neural networks for internet traffic classification. IEEE Transactions on Neural Networks, 2007, 18(1):223-239.
[7] Este A, Gringoli F, Salgarelli L. Support Vector Machines for TCP traffic classification. Computer Networks, 2009, 53(14):2476-2490.
[8] Jin Y, Duffield N, Erman J, et al. A Modular Machine Learning System for Flow-Level Traffic Classification in Large Networks. ACM Transactions on Knowledge Discovery from Data, 2012, 6(1):1-34.
[9] Jaiswal R C, Lokhande S D. Machine learning based internet traffic recognition with statistical approach. IEEE India Conference, 2013:1-6.
[10] Finsterbusch M, Richter C, Rocha E, et al. A Survey of Payload-Based Traffic Classification Approaches. IEEE Communications Surveys & Tutorials, 2014, 16(2):1135-1156.
[11] Wang Y, et al. Internet Traffic Classification Using Constrained Clustering. IEEE Transactions on Parallel & Distributed Systems, 2014, 25(11):2932-2943.
[12] Enriquez M, Kurcz C. A Simple and Robust Flow Detection Algorithm Based on Spectral Clustering. International Conference on Research in Air Transportation, 2012.
[13] Zhou W, Chen L, Dong S. Network traffic classification algorithm based on spectral clustering. Journal of Electronic Measurement & Instrument, 2013, 27(12):1114-1119.
[14] Floodlight. http://www.projectfloodlight.org/floodlight/
[15] Mininet. http://mininet.org/
[16] Yu H, et al. Zebra: An East-West Control Framework for SDN Controllers. International Conference on Parallel Processing, IEEE, 2015:610-618.
[17] Farhady H, Nakao A. TagFlow: Efficient Flow Classification in SDN. IEICE Transactions on Communications, 2014, E97.B(11):2302-2310.
[18] Ng B, Hayes M, Seah W K G. Developing a traffic classification platform for enterprise networks with SDN: Experiences & lessons learned. IFIP Networking Conference, IEEE, 2015.
[19] Silva A S D, et al. Identification and Selection of Flow Features for Accurate Traffic Classification in SDN. IEEE International Symposium on Network Computing and Applications, 2015:134-141.
[20] Json. http://www.json.org/
[21] Huang L, Li R, Chen H, et al. Detecting network communities using regularized spectral clustering algorithm. Artificial Intelligence Review, 2014, 41(4):579-594.
[22] Xiao P, Li Z Y, Guo S, et al. A K self-adaptive SDN controller placement for wide area networks. Frontiers of Information Technology & Electronic Engineering, 2016, 17(7):620-633.
[23] Hu H, Wang X, Yang Z, et al. A Spectral Clustering Approach to Identifying Cuts in Wireless Sensor Networks. IEEE Sensors Journal, 2015, 15(3):1838-1848.
[24] MAWI working group traffic archive. http://mawi.wide.ad.jp/mawi
