Ransomware Attack Detection Using Supervised Machine Learning Classifiers

Computer Network and Distributed Systems

Engineering College – Taiz University
Done by: Ahmed Ali Ahmed, Boshra Hassan, Hadeel Abdullah, Mujeeb Abduh Said, Mohammed Abdo Ali
Supervised by: Dr. Amer Sallam
2021-2022
Part One 1
Abstract
► Smartphone use and popularity have grown over the past few years, making smartphones the primary communication tool.
Attackers target them to steal sensitive data and information using a variety of malware attacks aimed at systems,
networks, devices, and applications.
► Ransomware is one of the newest and greatest threats to cybersecurity.
► Six ML methods are applied to the CICAndMal2017 dataset, which consists of benign samples and various types of
Android malware samples.
Introduction
► The Android operating system currently holds 85% of the smart device market worldwide and continues to grow.
► The recent spread of smartphones under COVID-19 conditions has made the Android operating system the main
target of attackers.
► Android remains an open platform and does not restrict users from downloading and installing applications. It
leaves the safety of the device in the hands of the user, who decides whether to install an application or not, so
smartphones become more vulnerable to cyber-attacks. Attackers exploit this openness to inject malicious code
using ransomware, a type of malware that infects the victim's device and blocks access to it: either the screen is
locked (Locker ransomware) or the data is encrypted (Crypto ransomware).
► The victim is then prevented from using the device, and the attacker demands a ransom to unlock it. Personal
data, including files and photos, is transferred to a command and control server, from which the attacker
executes commands to control the device remotely and then displays a threatening message on the screen
demanding payment of the ransom in Bitcoin.
► Machine learning is one of the methods used to analyze network traffic. It is considered an effective solution
and plays a vital role in detecting malicious ransomware patterns.
Statement of the Problem
► Analyzing and detecting the constantly emerging Android ransomware is a major problem: machine learning
techniques are of little use in detecting samples for which no training data is available.
► Important features must be selected and new information processed in advance. Moreover, machine learning
techniques rely on antivirus software vendors to explicitly label samples.
Research Aim
► The aim of this study is to evaluate the detection of ransomware attacks on the Android system, based on their
behavior history, using ML-based classification approaches.
Research Objective
► To improve the detection of ransomware attacks on Android.
► To detect Android ransomware attacks by monitoring network traffic.
► To evaluate the performance of the ML algorithms and determine the best metrics.
Scope and Limitations of the study
► This research is limited to discussing ransomware attacks on the Android system and developing a prototype to
detect them.
Part Two 2
Malware
► Malware is short for malicious software. As its name implies, it is designed to damage computers, and it
specifically targets internet-based programs.
► Malware takes many different forms and can be broadly classified into many classes. This work focuses on
ransomware attacks.
Ransomware
► Ransomware is a type of malware that infects a computer and then either prevents the user from accessing the
operating system or encrypts all data stored on the computer. It then demands a "ransom" from the user, usually
the payment of a specified amount of money.
► Ransomware enablers: several factors have contributed to the recent surge in ransomware attacks, including
financial revenue, the availability of cryptographic techniques, untraceable payment methods, and open
development kits.
Type of Ransomware
Ransomware is classified based on several aspects, including its severity, method of extortion, people targeted,
and systems affected:
● Scareware: a bogus notification that threatens the victim by making false claims.
● Detrimental Ransomware: in contrast to scareware, detrimental ransomware is a serious danger. It includes
Locker-Ransomware and Crypto-Ransomware.
● Locker-Ransomware: takes control of one or more services on the victim's system.
● Crypto-Ransomware: encrypts the victim's files using cryptography.
Crypto-ransomware is further divided as follows:
● Symmetric Crypto-Ransomware (SCR): as the name suggests, a form of crypto-ransomware that uses a single
secret key for both encryption and decryption.
● Asymmetric Crypto-Ransomware (ACR): uses a pair of keys, the public key for encryption and the private key
for decryption.
● Hybrid Key Crypto-Ransomware (HCR): crypto-ransomware developers combine symmetric and asymmetric
encryption to create a hybrid kind known as the hybrid key.
Ransomware lifecycle
Static and Dynamic Analysis
► Static Analysis: a passive approach that examines the payload of a sample without running its code, in order
to extract structural elements from the source code.
► Dynamic Analysis: the process of analyzing malicious code while it is being executed.
Related Work
► An early work proposed a data-mining-based system for the automatic detection and analysis of ransomware
based on dynamic API calls.
► Another work used static feature analysis to classify ransomware. The authors first converted opcode sequences
from ransomware samples into N-gram sequences.
► A 2019 study provided a thorough analysis of crypto-ransomware network traffic and proposed an advanced
ransomware detection method.
► In 2020, Sangal et al. used ML methods for detecting new Android malware, applying several techniques (RF,
KNN, SVM, and NB).
► In 2021, a detection system called Peeler was proposed. It uses system-behavior-based detection (e.g. a
malicious command detector and an I/O pattern matcher).
Part Three 3
Chapter 3
Dataset based on network flow traffic
► Network traffic refers to the data transmitted through the network at any given time.
► Monitoring malicious network traffic is one of the best methods to detect malware, as it offers a uniquely
clear view of the behavior of malware applications.
► Monitoring the traffic that enters and leaves the network, intra-network traffic, and device activity provides
significant and useful information for detecting malicious behavior. The dataset used in this research was
obtained from the Canadian Institute for Cybersecurity.
► This research concentrates on network traffic features for detecting ransomware applications. 650000
ransomware and benign data samples were extracted with network flow features consisting of six columns for
each flow (Flow ID, Source IP, Destination IP, Source Port, Destination Port, and Protocol) and 79 network
traffic features.
► The network traffic was captured in pcap files during three states:
1. Installation: the first state of data capturing, which occurs immediately after installing the malware (1-3 min).
2. Before restart: the second state of data capturing, which occurs 15 min before rebooting the phone.
3. After restart: the last state of data capturing, which occurs 15 min after rebooting the phone.
Ransomware Dataset
► In this research, 259110 ransomware samples were used with 85 features which were
collected from 10 popular ransomware families. Table I lists the behavior and
characteristics of ransomware and the number of samples utilized for each one of the
families.
Benign Dataset
► The benign applications used in this research were published on the Google Play market in 2015, 2016, and
2017.
► More than six thousand applications were collected, based on the popularity of the applications in each
category available in the market.
► 400000 benign samples with 85 network traffic features were extracted and used in this research. These
features can be grouped into categories such as Flow-ID, Packet-based, Byte-based, Flow-based, and Time-based.
Data Analysis and Cleaning
► This network-traffic-flow dataset is rich because it has 85 features and covers 10 kinds of ransomware. It also
contains static features, such as permissions and intents, and API calls as dynamic features.
► The selected dataset (CICAndMal2017) must be cleaned of any errors or faults it may contain in order to
obtain more accurate results.
Implementation processes for detecting Ransomware
Implementation Phase: Data Gathering & Analysis → Data Pre-Processing → Feature Selection → Classification →
Evaluation → Results
Figure 3: The implementation phase.

Data splitting
► The dataset used in this study was divided into 80% for training the algorithms and the remaining 20% for
testing, using train_test_split from the sklearn library. Shuffling and 10-fold cross-validation were then used to
divide the dataset.
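The split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' script: the array shapes, the `random_state` values, and the use of `stratify` are assumptions, and only the 80/20 ratio, the shuffling, and the 10 folds come from the text.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in for the flow-feature matrix (assumption: 1000 flows, 5 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)  # 0 = benign, 1 = ransomware

# 80% training / 20% testing; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)

# 10-fold cross-validation with shuffling, applied to the training part.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kfold.split(X_train)]
```

Each entry of `fold_sizes` holds the number of training and validation rows of one fold, which is a quick way to confirm that the folds are balanced.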
Data preprocessing
Missing value (null) Handling
The dataset is free of null values. After the useless features were removed and the dataset was split, every column
was checked for null or infinite values, and no null values were found for any attribute. The dataset was
nevertheless rechecked after each data transformation or processing step to ensure that no such values had been
introduced.
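A check of this kind could look like the sketch below. The helper names `count_bad_values` and `drop_bad_rows` and the two column names are hypothetical and only serve the illustration. The text reports that no null values were found, so the cleaning step is shown only for completeness.

```python
import numpy as np
import pandas as pd

def count_bad_values(df):
    """Return (null_count, infinite_count) over the numeric columns of df."""
    numeric = df.select_dtypes(include=[np.number])
    null_count = int(numeric.isna().sum().sum())
    inf_count = int(np.isinf(numeric.to_numpy(dtype=float)).sum())
    return null_count, inf_count

def drop_bad_rows(df):
    """Replace +/-inf by NaN, then drop every row that contains a NaN."""
    return df.replace([np.inf, -np.inf], np.nan).dropna()

# Tiny illustrative frame (assumption: column names chosen for the example only).
flows = pd.DataFrame({
    "Flow Duration": [10.0, 20.0, np.nan],
    "Flow Bytes/s": [1.5, np.inf, 3.0],
})
before = count_bad_values(flows)
cleaned = drop_bad_rows(flows)
after = count_bad_values(cleaned)
```

Calling `count_bad_values` again after every transformation mirrors the recheck described in the text.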
Removing columns with low variance

► When a dataset contains features whose values have a very small variance, or the same value in every row of the
column, these features add no informative power to the model.
► Removing columns with low variance is used to improve model effectiveness. This research used the
VarianceThreshold technique provided by sklearn, a simple baseline feature selector that deletes low-variance
columns. It looks only at the input columns (X), not the target column (y), and can therefore also be used for
unsupervised learning.
Table 2: The features with low variance (all values are zero in these columns).
No.  Feature’s Name       No.  Feature’s Name
1    Bwd PSH Flags        7    Fwd Avg Bytes/Bulk
2    Fwd URG Flags        8    Fwd Avg Packets/Bulk
3    Bwd URG Flags        9    Fwd Avg Bulk Rate
4    RST Flag Count       10   Bwd Avg Bytes/Bulk
5    CWE Flag Count       11   Bwd Avg Packets/Bulk
6    ECE Flag Count       12   Bwd Avg Bulk Rate
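As a small illustration of how constant columns such as those in Table 2 can be dropped, the sketch below applies sklearn's `VarianceThreshold` to a toy matrix. The data and the threshold of 0.0 are assumptions, since the text does not state the threshold that was used.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is all zeros, like the flag counts in Table 2.
X = np.array([
    [0.0, 1.0, 5.0],
    [0.0, 2.0, 3.0],
    [0.0, 3.0, 8.0],
])

# threshold=0.0 removes only the columns whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Boolean mask of the columns that were kept.
kept = selector.get_support()
```

Only `X` is passed to the selector; the target column plays no role in this step.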

Feature Scaling
► Feature scaling transforms the features using normalization. It is used to improve the performance of the ML
system.
► The CICAndMal2017 dataset includes features with very different values, ranges, and scales.
► This biases feature selection techniques towards features with larger values over features with smaller
values.
► A data normalization technique is therefore used to solve this problem.
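The text does not say which normalization was applied. The sketch below assumes min-max scaling, x' = (x - min) / (max - min), via sklearn's `MinMaxScaler`, with made-up numbers. The scaler is fitted on the training part only and then reused on the test part.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales (illustrative values).
X_train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
X_test = np.array([[2.0, 4000.0]])

scaler = MinMaxScaler()                      # maps each training column to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)     # reuses the training min and max
```

After scaling, both columns span the same range, so neither dominates a score or distance merely because of its units.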

Feature Selection
► Feature selection is the operation of locating and choosing the subset of input variables that are most strongly
related to the target label, thereby reducing the statistical and mathematical processing required.
► Using a feature selection technique is very important for improving the quality of the dataset and so increasing
the efficiency of the classification and detection system, especially when the dataset is very large and
high-dimensional, which may lead to a complex classification model.
Univariate Feature Selection Technique
► Univariate feature selection uses statistical tests to choose the columns that have the strongest relationship with
the output variable. Each feature is evaluated on its own, without regard to the other features, by comparing it to
the target to see whether there is any relationship between them. Each feature is given a score, and all of the
scores are finally compared.
► The f-test (F-statistic) is a method used when the input data is numerical and the output is categorical. The
features with the highest f-test scores were chosen.
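One way to run such an f-test selection is sklearn's `SelectKBest` with the `f_classif` score function. This is a sketch under assumptions: the text does not name the API that was used, and the synthetic data and `k=1` are chosen only to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# One column tied to the label and three columns of pure noise.
informative = y + rng.normal(scale=0.1, size=100)
noise = rng.normal(size=(100, 3))
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# Score every column independently with the ANOVA F-test and keep the best one.
selector = SelectKBest(score_func=f_classif, k=1)
X_best = selector.fit_transform(X, y)
best_index = int(np.argmax(selector.scores_))
```

`selector.scores_` holds one F-score per column, so the ranking of all features can be inspected before fixing `k`.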
Select from model technique
SelectFromModel is a technique used with an estimator (model) that exposes a feature importance attribute. The
best features, those that are the most crucial according to the feature weights, are chosen.
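A minimal sketch of this technique with sklearn's `SelectFromModel` is shown below. The choice of a random forest as the estimator, the `"mean"` threshold, and the synthetic data are assumptions, since the text does not state which estimator or threshold was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Column 1 carries the label information, the other columns are noise.
informative = y + rng.normal(scale=0.1, size=100)
noise = rng.normal(size=(100, 3))
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# Keep the features whose importance is at least the mean importance.
estimator = RandomForestClassifier(n_estimators=50, random_state=0)
selector = SelectFromModel(estimator, threshold="mean").fit(X, y)
X_selected = selector.transform(X)
importances = selector.estimator_.feature_importances_
```

Unlike the univariate f-test, the importances here come from a fitted model, so interactions between features can influence which columns are kept.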
Implementation of Machine Learning classifiers
Logistic Regression (LR)
► The LR algorithm is founded on statistical learning. Despite its name, it is used for classification problems.
LR is a probability-based prediction method that transforms the output using the sigmoid function and returns a
probability value. It creates a decision boundary between the samples to divide them.
► The decision is made when LR analyzes new samples and determines which side of the hyperplane they are
situated on.
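The link between the sigmoid, the probability, and the side of the hyperplane can be checked on toy data with the sketch below. The data and the 0.5 cut-off are assumptions made for the example; `sigmoid` is a helper written here, not part of sklearn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)), rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)

# decision_function returns w.x + b: its sign tells the side of the hyperplane.
scores = clf.decision_function(X)
probs = sigmoid(scores)               # probability of class 1
preds = (probs >= 0.5).astype(int)    # probability 0.5 corresponds to score 0
```

The manually computed `probs` match the second column of `clf.predict_proba(X)`, which is how the model itself reports the class-1 probability.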
Decision Tree (DT)
► DT is a straightforward classification and regression technique. It is a supervised machine learning (ML)
sequential model in which the data is repeatedly split according to a given parameter through a series of tests.
► Each branch of the tree indicates a test result, and each leaf node holds a class label.
► The maximum depth of the tree, which in this research is 50, is the input for the decision tree algorithm. The
root of the DT is the most significant feature, and the other features are distributed from the top to the bottom
of the DT based on the answers to a series of questions (decisions) and on the resulting information gain.
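The sketch below illustrates the idea that the most significant feature ends up at the root. It assumes that the value 50 mentioned in the text is the maximum depth and uses the entropy criterion to represent information gain. Both are readings of the text rather than confirmed settings, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)

# Column 1 separates the classes, columns 0 and 2 are noise.
noise = rng.normal(size=(200, 2))
signal = y * 3.0 + rng.normal(scale=0.3, size=200)
X = np.column_stack([noise[:, 0], signal, noise[:, 1]])

# criterion="entropy" picks the split with the largest information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=50, random_state=0)
tree.fit(X, y)

root_feature = int(tree.tree_.feature[0])   # index of the feature tested at the root
depth = tree.get_depth()
```

`max_depth` is only an upper bound: on this easily separable data the tree stops after a single split.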
Random Forest (RF)
► RF is an ensemble method built from DTs and bagging procedures.
► Bagging requires each DT to be trained on a portion of the entire dataset. Each tree produces a classification,
and the final prediction is decided by a majority vote among the trees.
► The number of decision trees and the maximum tree depth used in training are the inputs for the random forest
technique.
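A random forest with these two inputs can be sketched as follows. The values `n_estimators=100` and `max_depth=10` and the synthetic dataset are assumptions, since the text does not report the settings used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the flow features.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# bootstrap=True trains every tree on a resampled portion of the training set (bagging).
forest = RandomForestClassifier(n_estimators=100, max_depth=10, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
tree_depths = [t.get_depth() for t in forest.estimators_]
```

`forest.estimators_` exposes the individual trees, which makes it possible to verify that none exceeds the requested depth.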
K-Nearest Neighbor (k-NN)
k-NN is a supervised machine learning technique used for regression and classification tasks. It presumes that
similar things are located near each other.
It stores the training data and makes a prediction only when it receives test data. When a test instance arrives,
the prediction process starts: it searches the training data for the k most similar neighbors.
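The neighbor search can be made concrete with a tiny example. The points and `k=3` are assumptions chosen so that the result is easy to verify by hand.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two small clusters: class 0 near the origin, class 1 near (5, 5).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

# fit() only stores the training data; the search happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])

# Indices of the 3 training points closest to the first query point.
distances, indices = knn.kneighbors([[0.5, 0.5]])
```

The predicted label is the majority label among the returned neighbors.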
XGBoost (XGB)
Many machine learning (ML) applications have recently used this open-source library. It offers a high-performance
implementation of gradient boosted DTs to address a variety of data science problems quickly and accurately. To
improve the prediction, many weak DTs are trained in successive steps. A single weak DT model performs well on
only a portion of the training data, but when numerous weak learners are selectively combined, they form a much
more powerful learning model.
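The boosting idea can be illustrated without the XGBoost library itself. The sketch below swaps in sklearn's `GradientBoostingClassifier`, a different implementation of gradient boosted trees, and tracks training accuracy as weak trees are added. The dataset and all hyperparameters are assumptions. With the xgboost package installed, its `XGBClassifier` follows the same fit/predict pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# Shallow trees (max_depth=2) act as the weak learners, added one step at a time.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X, y)

# Training accuracy after 1, 2, ..., 100 boosting steps.
staged_acc = [float((pred == y).mean()) for pred in gb.staged_predict(X)]
```

Comparing `staged_acc[0]` with `staged_acc[-1]` shows how much the combined model gains over the first weak tree alone.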
Multi-Layer Perceptron (MLP)
The MLP is a type of feedforward artificial neural network (ANN) used in deep learning. For training, it applies the
supervised learning method of backpropagation. It has at least three layers of nodes (input, hidden, and output),
whose nodes apply an activation function (linear or nonlinear). Each node in each layer is linked to every node in
the following layer.
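The three-layer structure can be inspected on a small example with sklearn's `MLPClassifier`. The hidden layer size of 16, the ReLU activation, and the synthetic data are assumptions, since the text does not give the network configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 4)), rng.normal(2.0, 1.0, size=(100, 4))])
y = np.array([0] * 100 + [1] * 100)

# One hidden layer of 16 nodes: input (4) -> hidden (16) -> output (1).
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu", max_iter=1000, random_state=0)
mlp.fit(X, y)

# One weight matrix per pair of consecutive layers (fully connected).
layer_shapes = [w.shape for w in mlp.coefs_]
```

Each weight matrix has one row per node in the earlier layer and one column per node in the following layer, which reflects the full connectivity described above.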
Performance Evaluation Measurements and Tools
This study used seven metrics, based on the confusion matrix, to evaluate the ML classifiers. All of these metrics
take values between 0 and 1, and the closer the value is to 1, the better the performance.
• Precision: the ratio of data correctly classified as attack to all data classified as attack.
• Recall (Sensitivity): the ratio of data correctly classified as attack to all attack data.
• F-Measure: the harmonic mean of Precision and Recall.
• True Negative Rate (TNR): the ratio of correctly predicted negative samples to all negative samples, i.e. the
percentage of benign instances that the algorithm correctly identifies.
• AUC (Area Under the Curve): the area under the ROC curve. The ROC (Receiver Operating Characteristic) curve
is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead
of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against
the false positive rate.
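These definitions translate directly into a few lines of arithmetic on the confusion-matrix counts. In the sketch below the helper `confusion_metrics` and the counts are made up for illustration, and the AUC is computed with sklearn's `roc_auc_score` on a four-sample toy example.

```python
from sklearn.metrics import roc_auc_score

def confusion_metrics(tp, fp, tn, fn):
    """Compute the metrics above from confusion-matrix counts (attack = positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # harmonic mean
        "tnr": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Illustrative counts, not results from the study.
m = confusion_metrics(tp=90, fp=10, tn=880, fn=20)

# AUC: fraction of (attack, benign) pairs in which the attack sample scores higher.
auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With these counts, TNR and FPR sum to 1, which is a handy consistency check when reporting both.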

Part Four 4
Result
► The final outcomes of the algorithms are compared in this section.
► F1-Measure and AUC serve as the primary metrics for comparison. F1-Measure was chosen over accuracy
because it provides a better indicator of cases that were erroneously classified.
Result
Conclusion
This study aimed to use the CICAndMal2017 dataset to compare six machine learning algorithms, find the best
one, and develop a system for detecting various types of ransomware attacks. It is important to emphasize how
this research differs from other studies that use only a particular type of network traffic: this study focuses on
the best features from the entire feature set. The best features were chosen for the proposed model using a
variety of preprocessing techniques, and ML classifiers were then applied to these features. The results showed
that the detection accuracy was 99.80% and 99%, and the FPR was 0.08% and 0.08%, for XGB and RF
respectively. Therefore, the CICAndMal2017 dataset can be used to build the proposed system employing XGB
and Random Forest.
Future Work
The CICAndMal2017 dataset contains various types of ransomware attacks, so studying all types of attacks inside
the dataset will help us to detect and identify all of them.
1- Multi-class classification can be used to classify samples into specific types of ransomware attack. Another
direction is using network traffic features to detect and classify other types of Android malware.
2- Applying the ML models to one or more other types of features, such as logs, API calls, memory dumps,
permissions, etc.
3- Designing a novel detection framework: a real system that captures network packets and tests them to
determine whether they are benign or ransomware, in order to stop attack packets from entering the network.
This would be another future task for our study that would make its findings more accurate.
