E-mail: parfenovdi@mail.ru
Abstract. This paper examines the impact of data balancing algorithms on network traffic
classification for various types of distributed denial of service attacks in the
CICDDoS2019 dataset, which contains information about reflection-based and
exploitation-based attacks. Computational experiments have shown the effectiveness of data
balancing algorithms such as naive random oversampling, synthetic minority oversampling, and
adaptive synthetic sampling in identifying network attacks. A comparative analysis of the data
sampling approaches has shown that the adaptive synthetic sampling method combined with the
random forest algorithm achieves the highest classification accuracy.
1. Introduction
At present, growing threats to the confidentiality and integrity of data demand careful attention to
information security. Existing cryptographic protection, access control, and intrusion detection
systems cannot identify all possible types of attacks, because there is still no universal method for
classifying attacks. Most methods rely on forming an attack profile, which makes it difficult to
identify new types of threats. In addition, most attack detection algorithms depend on the type of
network and on its topology. For these reasons, the problem of identifying network attacks is studied
by researchers all over the world.
One of the most common approaches to analyzing network traffic is the use of machine learning
methods, which identify attack-related dependencies from the recorded characteristics of data packets
and the frequency characteristics of the network [1, 2]. This approach has already proven effective,
and almost every traffic analysis system includes a classification component.
The main problem in building such classifiers, apart from selecting and configuring the classification
algorithms, is the quality of the training data and of the test set used to evaluate the accuracy of
the results. Data quality covers not only missing values, duplicates, anomalies, and outliers, but also
differences in the number of records belonging to different classes. The latter phenomenon is called
the "balancing problem", and a whole class of algorithms exists to solve it and to implement the
principle of data diversity [3].
This paper examines the impact of data balancing algorithms on network traffic classification for
various types of distributed denial of service (DDoS) attacks in the CICDDoS2019 dataset, which
contains information about reflection-based and exploitation-based attacks [4].
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
APITECH II IOP Publishing
Journal of Physics: Conference Series 1679 (2020) 042020 doi:10.1088/1742-6596/1679/4/042020
2. Related works
Several studies describe the identification of various types of DDoS attacks, including on the
CICDDoS2019 dataset. In [5], new attacks using TCP/UDP protocols are analyzed with artificial neural
networks (ANN), the support vector machine (SVM), Gaussian naive Bayes, the K-nearest neighbor method,
and other algorithms, with a reported accuracy of 99.8%.
The authors of [6] proposed DDoSNet, a new intrusion detection system for software-defined
networks (SDN) based on recurrent neural networks with an autoencoder. Compared with machine learning
algorithms such as random forest, naive Bayes, and SVM, it showed an increased accuracy of 99% in
determining various types of DDoS attacks.
In [7], using the CICDDoS2019 dataset from the Canadian Institute for Cybersecurity, the
authors identified the best features for predicting DDoS attacks, taking into account the discretization
of the number of class instances. Training several classifiers with cross-validation gave an accuracy
of 96.9%. The ensemble approach to training also allowed the authors to find a balance between
performance and time spent.
The most effective features for detecting DDoS attacks were proposed in [8] by I. Sharafaldin,
A. H. Lashkari, and colleagues from the Canadian Institute for Cybersecurity. The study also
introduces a new DDoS taxonomy for the application layer and assesses the main advantages and
disadvantages of existing datasets describing DDoS attacks.
The study [9] proposes a new system for detecting DoS attacks based on machine learning methods
that classify attacks using signatures previously extracted from network traffic samples.
Experiments on the reference datasets showed an attack detection rate above 96% with a low false
alarm rate, using a sampling rate of 20% of network traffic.
The authors of the study [10] used three different deep learning models to detect DDoS attacks and
performed a comparative analysis of the performance of each model for different types of attacks. The
proposed models are superior to traditional machine learning classification models and show up to
93.3% accuracy in identifying network attacks.
Thus, prior research has shown that traditional machine learning methods can identify network
attacks such as DDoS fairly accurately. However, neither the quality of the dataset nor the
influence of class balance on the accuracy of the constructed classifiers was considered. In this
article, we consider various data balancing algorithms and analyze their impact on the accuracy of
the resulting attack classifiers.
We will investigate the influence of unbalanced data sampling on the most common classifiers:
support vector machine (SVM), random forest (RF), and gradient boosting (GBM). The research proceeds
in several stages:
In this paper, we consider the following data sampling algorithms: naive random oversampling (ROS),
synthetic minority oversampling (SMOTE), and adaptive synthetic sampling (ADASYN).
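Of the three algorithms, naive random oversampling is the simplest: it duplicates randomly chosen minority-class records until the classes are the same size. A minimal pure-Python sketch follows; the function name `random_oversample`, the toy data, and the choice to raise every class to the size of the largest one are illustrative assumptions rather than details taken from the experiments.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement until this class reaches the target size.
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: label 1 is the minority class, label 0 the majority class.
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.9, 0.9]]
y = [1, 1, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because the added rows are exact copies, this method adds no new information to the minority class, which is the motivation for the synthetic methods.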
• For each instance $z_i$ of the minority class, find its $k$ nearest neighbors from the
minority class (the Euclidean metric is used to estimate the distance).
• Randomly select one of the $k$ nearest neighbors of each minority instance $z_i$:
$z_K = (z_K^1, z_K^2, \dots, z_K^n)$, $1 \le K \le k$.
• Synthesize a new instance of the minority class according to the rule
$t_i = z_i + (z_K - z_i) \cdot \mathrm{rand}$, where $\mathrm{rand}$ is a random number from the segment [0, 1].
• Repeat steps 1-3 until $S_{min}$ synthesized instances of the minority class have been created.
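The steps above can be sketched in pure Python. This is an illustration under assumptions, not the implementation used in the experiments: the function name `smote`, the parameter `n_new` (standing in for the required number of synthesized instances), and the toy data are hypothetical, and the new point is interpolated from $z_i$ toward the chosen neighbor $z_K$, which is the usual SMOTE convention.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic vectors from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for j in range(n_new):
        # Cycle through the minority instances z_i in turn.
        z_i = minority[j % len(minority)]
        # Step 1: the k nearest minority neighbours of z_i (Euclidean metric).
        others = [z for z in minority if z is not z_i]
        neighbours = sorted(others, key=lambda z: math.dist(z, z_i))[:k]
        # Step 2: choose one of these neighbours, z_K, at random.
        z_K = rng.choice(neighbours)
        # Step 3: place the new point on the segment between z_i and z_K.
        r = rng.random()
        synthetic.append([a + (b - a) * r for a, b in zip(z_i, z_K)])
    return synthetic


# Hypothetical toy minority class with two features.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
print(len(new_points), "synthetic instances created")
```

Every synthetic point lies on a segment between two existing minority instances, so the new data stay inside the region already occupied by the minority class.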
Consider a training dataset $D_r$ with $m$ samples $\{x_i, y_i\}$, $i = 1, \dots, m$, where $x_i$ is an
n-dimensional vector in the feature space and $y_i$ is the corresponding class label.
Let $m_r$ and $m_x$ be the numbers of samples in the minority and majority classes, respectively,
such that $m_r \le m_x$ and $m_r + m_x = m$.
ADASYN algorithm:
• Calculate the class ratio $d = m_r / m_x$.
• If $d < d_x$ (where $d_x$ is the specified threshold for the maximum allowable class imbalance),
then:
    • Calculate the number of synthetic samples $g_i = r_x \cdot G$ to be generated for each
    instance of the minority class, where $G$ is the total number of synthetic examples.
    • For each minority-class sample $x_i$, create $g_i$ synthetic examples by repeating the
    following steps in a loop from 1 to $g_i$:
        • Randomly select one minority example $x_u$ from the $K$ nearest neighbors of
        $x_i$;
        • Create a synthetic example $s_i = x_i + (x_u - x_i) \cdot \lambda$, where $(x_u - x_i)$ is
        a difference vector in the n-dimensional Euclidean space and $\lambda$ is a random
        number, $\lambda \in [0, 1]$.
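A pure-Python sketch of these steps is given below. Several details are assumptions: the text above does not define the weight used to compute $g_i$, so the sketch follows the commonly published ADASYN formulation, in which the weight of a minority sample is the share of majority samples among its K nearest neighbors, normalized to sum to one, and $G = (m_x - m_r) \cdot \beta$. The names `adasyn`, `beta`, the default threshold, the rounding of $g_i$, and the toy data are all hypothetical, and the sketch expects at least two minority samples.

```python
import math
import random


def adasyn(minority, majority, d_x=0.75, beta=1.0, K=3, seed=0):
    """Return synthetic minority vectors, more of them near the majority class."""
    rng = random.Random(seed)
    m_r, m_x = len(minority), len(majority)
    d = m_r / m_x                      # class ratio d = m_r / m_x
    if d >= d_x:                       # imbalance is within the allowed threshold
        return []
    G = int((m_x - m_r) * beta)        # total number of synthetic examples (assumed rule)
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    # Weight of each x_i: fraction of majority points among its K nearest neighbours.
    weights = []
    for x_i in minority:
        near = sorted((q for q in labelled if q[0] is not x_i),
                      key=lambda q: math.dist(q[0], x_i))[:K]
        weights.append(sum(1 for _, is_minority in near if not is_minority) / K)
    total = sum(weights)
    if total == 0:                     # no minority point borders the majority class
        return []
    synthetic = []
    for x_i, w in zip(minority, weights):
        g_i = round(w / total * G)     # number of examples to create for x_i
        neighbours = sorted((p for p in minority if p is not x_i),
                            key=lambda p: math.dist(p, x_i))[:K]
        for _ in range(g_i):           # loop from 1 to g_i
            x_u = rng.choice(neighbours)
            lam = rng.random()         # random number in [0, 1)
            synthetic.append([a + (b - a) * lam for a, b in zip(x_i, x_u)])
    return synthetic


# Hypothetical toy data: one minority point sits inside the majority cluster.
minority = [[1.0, 1.0], [1.5, 1.2], [3.0, 3.0]]
majority = [[3.1, 3.0], [3.0, 3.2], [2.9, 2.8], [3.3, 3.1],
            [5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.3]]
new_points = adasyn(minority, majority)
print(len(new_points), "synthetic instances created")
```

In this toy data the minority point surrounded by majority samples receives the largest weight, so most synthetic examples start from it. This is the adaptive behavior that distinguishes ADASYN from SMOTE.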
5. Experiment
We will investigate the influence of unbalanced data sampling on the most common classifiers: support
vector machine (SVM), random forest (RF), and gradient boosting (GBM). All calculations were
implemented programmatically in Python.
To compare the classification results, the following metrics were calculated: accuracy, balanced
accuracy, recall, and F1-measure. The results for the support vector machine are presented in
table 1. According to the obtained values of balanced accuracy, this classification algorithm
identifies the PortMap attack most effectively (94.81%). At the sampling stage, SVM combined with
the SMOTE algorithm showed more accurate results, exceeding 79.9%.
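The four metrics can be computed from the confusion-matrix counts of a binary problem. The sketch below is a minimal illustration under assumptions: the function name `binary_metrics` and the toy labels are hypothetical, a single class is treated as positive, and balanced accuracy is taken as the mean of recall and specificity.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, balanced accuracy, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical unbalanced toy labels: 2 positive records, 8 negative records.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

On this toy data the accuracy is 0.8 while the balanced accuracy is 0.6875, which illustrates why plain accuracy can look high on unbalanced data even when half of the minority class is missed.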
Table 1. Results of classification by the support vector method.
Sampling method   Type of attack   Accuracy   Balanced accuracy   F1-measure   Recall
SVM               Normal           0.9656     0.6752              0.7759       0.7634
                  PortMap          0.9146     0.7219              0.8017       0.8625
Tables 2 and 3 are constructed similarly and contain the results of classification using random forest
and gradient boosting, respectively. In all the considered cases, the use of sampling methods gave a
higher classification accuracy than on unbalanced data (figure 1). Within the scheme described in this
paper, the best classification accuracy (more than 98%) was achieved by applying the ADASYN class
balancing algorithm followed by the random forest algorithm.
6. Conclusion
In this paper, we investigated how to improve the accuracy of network attack classification on the
unbalanced CICDDoS2019 data using class sampling algorithms such as ROS, SMOTE, and ADASYN.
Computational experiments have shown the effectiveness of data balancing algorithms in
identifying network attacks. In particular, the ADASYN adaptive synthetic sampling method raised the
accuracy of attack classification to as much as 98%, more than the other algorithms. The problem
considered in this study can also be addressed with other classification algorithms, such as
recurrent neural networks and deep learning methods, and existing data sampling algorithms can
be improved.
Acknowledgments
The study was carried out with the financial support of the RFBR within scientific project
No. 20-07-01065, as well as a grant from the President of the Russian Federation for state support of
leading scientific schools of the Russian Federation (NSh-2502.2020.9) and a grant from the President
of the Russian Federation for state support of young Russian scientists (MK-860.2019.9).
References
[1] Elovici Y, Shabtai A, Moskovitch R, Tahan G and Glezer C 2007 Applying Machine Learning
Techniques for Detection of Malicious Code in Network Traffic KI 4667 ed Hertzberg J, Beetz
M et al (Berlin: Springer) p 44-50
[2] Xin Ya, Kong L, Liu Zh, Chen Yu, Li Ya, Zhu H, Gao M, Hou H and Wang C 2018 Machine
Learning and Deep Learning Methods for Cybersecurity IEEE Access 6 35365-81
[3] Sun Ya, Wong A K and Kamel M S 2009 Classification of imbalanced data: a review
International journal of pattern recognition and artificial intelligence 23(4) 687-719
[4] Canadian Institute for Cybersecurity DDoS evaluation dataset (CICDDoS2019) Retrieved from:
https://www.unb.ca/cic/datasets/ddos-2019.html
[5] Zekri M, El Kafhali S, Aboutabit N and Saadi Yo 2017 DDoS attack detection using machine
learning techniques in cloud computing environments 3rd International Conf. of Cloud
Computing Technologies and Application (CloudTech) Rabat 1-7
[6] Elsayed M S, Le-Khac N, Dev S and Jurcut A 2020 DDoSNet: A Deep-Learning Model for
Detecting Network Attacks 2020 Proc. IEEE World of Wireless Mobile and Multimedia
networks (WoWMoM) 1-7
[7] Hussain Y S 2020 Network Intrusion Detection for Distributed Denial-of-Service (DDoS) Attacks
using Machine Learning Classification Techniques (Toronto: University of Toronto) p 49
[8] Sharafaldin I, Lashkari A H and Ghorbani A A 2019 Developing Realistic Distributed Denial of
Service (DDoS) Attack Dataset and Taxonomy International Carnahan Conf. on Security
Technology (Chennai) 1-8
[9] Lima F, Silveira A F, Medeiros A and Vargas-Solar G 2019 Smart Detection: An Online
Approach for DoS/DDoS Attack Detection Using Machine Learning Security and
Communication Networks ed Maglaras L 1-15
[10] Li J 2020 Detection of ddos attacks based on dense neural networks, autoencoders and pearson
correlation coefficient (Halifax: Dalhousie University) p 89