Conference Paper Title

I. ABSTRACT
This project focuses on the use of machine learning techniques to detect network-based botnets on a Unix-like operating system, specifically CentOS 7. CICFlowMeter-V4.0 (previously known as ISCXFlowMeter) is a network traffic bi-flow generator and anomaly detection analyzer. Using CICFlowMeter, the *.PCAP captures are converted into *.CSV files with more than 80 columns. Based on the processed dataset, the Jupyter Notebooks in the given folder use machine learning approaches such as K-Nearest Neighbours, Naive Bayes, and Decision Trees to detect botnets. The models are evaluated using key metrics such as Accuracy, Precision, Recall, and F1-Score. A confusion matrix is also used to calculate the False Positive/Negative and True Positive/Negative rates. The experiment findings are presented in the findings.zip file. This research advances cybersecurity by utilising machine learning methods to improve the identification of network-based botnets. The detailed investigation of model performance using numerous metrics sheds light on the efficacy of various machine learning algorithms for botnet identification in network traffic data.

II. INTRODUCTION
A. Background
A botnet is a network of internet-connected devices, such as PCs, servers, mobile devices, and Internet of Things (IoT) devices, that are infected and controlled by a common form of malware. These machines, dubbed "bots" or "zombies," are commanded by a malicious actor known as the "botmaster." They can be used for various malicious activities, including distributed denial-of-service (DDoS) attacks, sending spam emails, spreading malware, and engaging in click fraud campaigns. The term "botnet" is a portmanteau of "robot" and "network," and each infected device is referred to as a bot.
Botnet detection is an important part of cybersecurity, as it is required to prevent data theft, network disruption, and other unwanted activities. Botnet detection approaches and tools include network traffic analysis, signature-based detection, behavior-based detection, and machine learning algorithms. Specialised botnet detection tools identify botnets by detecting anomalous patterns and behaviour in network traffic. Because of the dynamic and sophisticated nature of botnets, discovering and neutralising them is difficult but not impossible, requiring constant vigilance and the deployment of modern detection techniques.
B. Objectives

III. LITERATURE REVIEW
Botnets, networks of compromised computers controlled by malevolent actors, pose serious challenges to cybersecurity. Detecting and managing these dangers has become a significant emphasis in the field of information security. Traditional rule-based approaches fall short in tackling the dynamic and sophisticated nature of modern botnets, driving the exploration of machine learning (ML) techniques for more flexible and efficient detection procedures.
The landscape of botnet threats has developed rapidly, with attackers deploying advanced strategies to elude detection. Early botnets were often characterized by a simple command and control (C&C) architecture, which made rule-based systems somewhat successful. However, the shift towards decentralized and resilient structures, as seen in peer-to-peer (P2P) botnets, requires more complex detection approaches. This trend emphasises the significance of machine learning.
A. Dataset issues
Many existing datasets used for botnet identification have generality issues, as they frequently focus on a few distinct botnets. ML models trained on such datasets may lack the adaptability required to recognise fresh threats. Addressing this issue entails creating datasets that cover a greater spectrum of botnet behaviours.
To effectively evaluate botnet detection systems, realistic traffic traces are required. However, it is difficult to generate true botnet traffic in a controlled setting without being detected. Combining simulated botnet traces with real-world traffic, as seen in overlay techniques, offers a workable solution to this problem.
Privacy constraints and the difficulties of acquiring background data in real-world production contexts impede botnet dataset representativeness. Researchers frequently use controlled conditions to simulate or collect data. Meticulous dataset building, as demonstrated by the combination of ISOT, ISCX 2012 IDS, and Malware Capture Facility Project data, is critical to guaranteeing the representativeness of the training and testing datasets.
B. Machine Learning Approaches in Botnet Detection
• K-Nearest Neighbours (KNN): KNN is used because of its ease of use and versatility. KNN-based models can detect aberrant patterns suggestive of botnet activity by examining the similarity of instances in feature space. The k value, however, has a considerable impact on the model's performance.
• Naive Bayes: Naive Bayes classifiers use probabilistic principles to determine whether an occurrence is benign or malicious. Because of their simplicity and efficiency, they are appealing for botnet detection, especially in cases where computational resources are restricted.
• Decision Trees: Decision trees are effective at capturing complex decision boundaries and provide interpretability. Using decision trees for botnet detection entails building trees that learn to differentiate between normal and harmful patterns based on a variety of criteria.
C. Evaluation Metrics
Key measures such as Accuracy, Precision, Recall, and F1-Score are used to evaluate the effectiveness of ML-based botnet identification. Confusion matrices provide a detailed breakdown of True Positives, True Negatives, False Positives, and False Negatives, revealing the model's strengths and limitations.

IV. METHODOLOGY
A. Dataset description
The research begins with the creation of a large and diverse dataset that addresses the issues of generality, realism, and representativeness. The collection combines non-overlapping sections of previous datasets such as ISOT, ISCX 2012 IDS, and Malware Capture Facility Project traces. An overlay mechanism is used to improve representativeness by mapping botnet IPs to hosts outside the current network. The generated dataset contains both hostile (botnet) and benign traffic traces.
B. Data Preprocessing
After the dataset is loaded into a Pandas DataFrame, preprocessing steps are performed. Using Label Encoding, categorical information such as 'Src IP', 'Dst IP', 'Src Port', 'Dst Port', 'Protocol', and 'Flow Duration' is encoded so that machine learning methods can be applied. The dataset is then divided into two parts: training and testing.
C. Machine Learning Models
For botnet identification, three distinct machine learning models are used: Decision Tree Regressor, Naive Bayes, and K-Nearest Neighbours (KNN).
• A Decision Tree Regressor is trained on the preprocessed training set with a minimum sample split of 500. The model is then assessed on the test set, and performance metrics such as mean absolute error, mean squared error, and root mean squared error are computed.
• The Naive Bayes classifier is applied directly to the preprocessed dataset and does not require substantial parameter adjustment. The model is trained on the training data set and tested on the test data set.
• K-Nearest Neighbours (KNN): KNN is implemented for a variety of k values ranging from 1 to 20. The algorithm is trained and tested on the preprocessed dataset, and performance metrics are collected for each k value.
D. Performance Evaluation
The performance of each model is thoroughly assessed using a variety of measures. Metrics such as accuracy, precision, recall, F1-score, and a confusion matrix are computed for the Decision Tree Regressor. The same metrics are produced for Naive Bayes and, across multiple values of k, for KNN. The findings are organised methodically, and comparisons are made to assess the performance of each model in detecting botnet activity.
E. Result Analysis and Interpretation
The results of each model are thoroughly examined in order to form conclusions regarding their effectiveness in botnet detection. Individual models' strengths and limitations, as well as their sensitivity to varied parameter values, are compared.
F. Iterative Experimentation
The entire process is carried out repeatedly, with parameters being refined, algorithms being adjusted, and new k values for KNN being explored. This iterative technique enables each model's performance in botnet identification to be optimised.

V. DISCUSSION
A. Decision Tree Classifier
Fig. 1. Decision Tree Classifier Model Evaluation
– Precision, Recall, and F1-Score (Class 0): For class 0, the Decision Tree Classifier achieved perfect precision, recall, and F1-score, indicating accurate detection of normal traffic.
– Precision, Recall, and F1-Score (Class 1): Similarly, for class 1 (botnet traffic), the Decision Tree Classifier obtained perfect precision, recall, and F1-score, demonstrating accurate detection of botnet traffic.
– Accuracy: The total accuracy reached 100%, indicating outstanding performance of the classifier on the given dataset.
– Support: The support numbers, representing the instances for each class, reflect the classifier's effectiveness across a large number of instances.
B. Decision Tree Regressor
Fig. 2. Decision Tree Regressor Model Evaluation
– Precision, Recall, and F1-Score (Class 0 and 1): For both classes, the Decision Tree Regressor achieved high precision, recall, and F1-score. Although the recall for class 1 was slightly less than 1.00, the model performed remarkably well at identifying botnet traffic.
– Accuracy: The total accuracy is 100%, demonstrating the regressor's efficacy in properly identifying cases.
– Support: The support numbers indicate that the model performed well across a large number of instances.
C. K-Nearest Neighbors (KNN)
– Accuracy, Recall, and F1-Score (Class 0 and 1): The KNN model performed well in both classes, exhibiting excellent accuracy, recall, and F1-score. The results show a balanced trade-off between precision and recall, considering the nature of the problem.
– Overall Accuracy: The overall accuracy is 98%, indicating dependable performance in categorizing cases as normal or botnet traffic.
– Support: The number of instances for each class is represented by the support values, indicating a large dataset for evaluation.
Fig. 3. K-Nearest Neighbors (KNN) Model: Suitable Value of k
Fig. 4. K-Nearest Neighbors (KNN) Model Evaluation
D. Naive Bayes
– Precision, Recall, and F1-Score (Class 0 and 1): For both classes, the Naive Bayes model displayed acceptable precision, recall, and F1-score. The results indicate a balanced performance in recognizing normal and botnet traffic.
– Accuracy: The total accuracy is 78%, demonstrating a satisfactory level of correctness in classifying occurrences.
– Support: Support values reflect the number of instances for each class, providing context for the model's performance.
Fig. 5. Naive Bayes Model Evaluation

VI. COMPARATIVE ANALYSIS
All models perform well in distinguishing between normal and botnet traffic, each with its own set of advantages and disadvantages. Decision Tree models achieve perfect scores, emphasising their robustness on the given dataset while cautioning against overfitting. KNN strikes a balance between precision and recall, providing a reliable alternative with a slightly lower but still impressive accuracy. Naive Bayes, while not as accurate, presents a decent trade-off, making it suitable for scenarios where computational efficiency is crucial.

Model                        Precision   Recall   F1-Score   Accuracy (%)
Decision Tree Classifier     1.00        1.00     1.00       100
Decision Tree Regressor      1.00        0.99     1.00       100
K-Nearest Neighbors (KNN)    0.98        0.97     0.98       98
Naive Bayes                  -           -        -          78
TABLE I
MODEL PERFORMANCE COMPARISON

Decision Tree Classifier
Strengths:
– Achieved perfect precision, recall, and F1-score for both classes.
– 100% accuracy, indicating flawless classification on the dataset.
Limitations:
– While performance on this dataset is exceptional, the model might be prone to overfitting, and its generalization to new datasets needs to be assessed.
Decision Tree Regressor
Strengths:
– High precision, recall, and F1-score for both classes.
– 100% accuracy, demonstrating effective identification of cases.
Limitations:
– Slightly lower recall for class 1 compared to the Decision Tree Classifier.
– Similar concerns about potential overfitting.
K-Nearest Neighbors (KNN)
Strengths:
– Strong performance with high accuracy, recall, and F1-score.
– Balanced trade-off between precision and recall.
Limitations:
– Lower accuracy compared to decision tree models, but still highly reliable.
– Performance might vary with different choices of k.
Naive Bayes
Strengths:
– Acceptable precision, recall, and F1-score, indicating balanced performance.
– Decent accuracy considering the nature of the problem.
Limitations:
– Lower accuracy compared to other models, reflecting a trade-off in performance.
– May not capture intricate dependencies between features.
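
The precision, recall, F1-score, and accuracy values compared in Table I can all be derived from the four confusion-matrix counts. The following is a minimal sketch of that calculation in plain Python, not the notebooks' actual code. The names `ConfusionCounts`, `confusion_counts`, and `summarize` are hypothetical, class 1 (botnet) is assumed to be the positive class, and all labels and counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Binary confusion-matrix counts; class 1 (botnet) is the positive class."""
    tp: int  # botnet flows predicted as botnet
    fp: int  # normal flows predicted as botnet
    fn: int  # botnet flows predicted as normal
    tn: int  # normal flows predicted as normal


def confusion_counts(y_true, y_pred):
    """Tally the four confusion-matrix cells from two 0/1 label sequences."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fp, fn, tn)


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(c):
    """Compute precision, recall, F1-score, and accuracy from the counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical toy labels, not taken from the paper's dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = summarize(counts)

# Hypothetical large test set with a few missed botnet flows.
large = summarize(ConfusionCounts(tp=9940, fp=0, fn=60, tn=190000))
rounded_row = (round(large["precision"], 2), round(large["recall"], 2),
               round(large["f1"], 2), round(large["accuracy"] * 100))
```

With the second set of made-up counts, `rounded_row` comes out as precision 1.00, recall 0.99, F1-score 1.00, and accuracy 100. This suggests that a row like the Decision Tree Regressor's in Table I may reflect rounding rather than literally flawless classification. Reporting the raw counts or more decimal places would make such differences visible.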
