Powerpoint Templates Page 1


An Intro to Expert Systems
• Rule-based decision system
applied to a given application
domain in the real world.
• When the nature of the data is
clear and conforms to known
models, there is no advantage
in using ML algorithms instead
of pre-defined models.


Types of ML
• Supervised Learning
– Classification algorithms (e.g., spam classification)
– Regression algorithms
– Common supervised algorithms:
• Regression (linear and logistic)
• k-Nearest Neighbors (k-NNs)
• Support vector machines (SVMs)
• Decision trees and random forests
• Neural networks (NNs)


Types of ML
• Unsupervised Learning
– Use cases: identifying new forms of malware attacks, fraud, and email spam campaigns
– Dimensionality reduction
• Principal component analysis (PCA)
• Kernel PCA
– Clustering
• k-means
• Hierarchical cluster analysis (HCA)

• Reinforcement Learning
– Trial and error approach
– Learning Process: Positive Reward / Negative Reward
• Markov processes (HMMs: used against polymorphic malware threats)
• Q-learning
• Temporal difference (TD) methods
• Monte Carlo methods


Quality vs Quantity

– What types of malware can we consider most
representative of the most probable risks and threats
to our company?
– How many example cases (samples) should we
collect and administer to the algorithms in order to
obtain a reliable result in terms of both effectiveness
and predictive efficiency of future threats?


Python ML Libraries

– NumPy
– Pandas
– Matplotlib
– scikit-learn
– Seaborn


AI in Cyber Security
– Classification: Identify types of similar attacks / malware
belonging to the same family with common characteristics
and behavior, even if their signatures are distinct
(polymorphic malware).
– Spam filtering: classify emails, distinguishing spam from
legitimate emails.
– Clustering: automatically identify the classes to which the
samples belong when information about classes is not
available in advance (malware analysis and forensic
analysis)
– Predictive analysis: NNs and DL are used to identify threats.


AI in Cyber Security
– Network protection: ML enables highly sophisticated
Intrusion Detection Systems (IDS) for network
perimeter protection.
– Endpoint protection: Threats such as ransomware can
be adequately detected by adopting algorithms that
learn the behaviors of malware, thus overcoming the
limitations of traditional antivirus software.
– Application security: Attacks on web applications,
including Server-Side Request Forgery (SSRF), SQL
injection, Cross-Site Scripting (XSS), and Distributed
Denial of Service (DDoS) attacks, can be adequately
countered by using AI and ML tools and algorithms.


AI in Cyber Security
• Suspicious user behavior: Identifying attempts at fraud
or at compromising applications by malicious users at the
very moment they occur is one of the emerging areas
of application of DL.


Detecting spam with Perceptrons
• SpamAssassin: an open-source spam-filtering tool
• Neural networks (NNs): the simplest and most basic
form is the Perceptron.
• NNs conceptually mimic the behavior of the human brain
• The Perceptron is one of the first successful
implementations of a neuron in the field of AI.


Spam Filters in a Nutshell
• Categorize emails based on the presence or the
absence of particular keywords occurring within the text
of the emails with a certain frequency
• Number of occurrences of the suspicious keywords
• Assign a score to the individual messages identified as
spam, based on the number of occurrences of
identified keywords to classify subsequent email
messages.
• If score > threshold value, the email will be classified as
spam; otherwise, classified as ham.
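
The counting-and-threshold idea above can be sketched in a few lines of Python. This is only an illustration: the keyword list, the threshold of 2, and the helper names `keyword_counts` and `classify` are assumptions made for the example, not values taken from a real filter.

```python
import re

# Hypothetical keyword list and threshold, chosen only for illustration.
SUSPICIOUS_KEYWORDS = ("buy", "sex")
THRESHOLD = 2


def keyword_counts(text):
    """Count how often each suspicious keyword occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {kw: words.count(kw) for kw in SUSPICIOUS_KEYWORDS}


def classify(text, threshold=THRESHOLD):
    """Label a message as spam when its keyword score exceeds the threshold."""
    score = sum(keyword_counts(text).values())
    return "spam" if score > threshold else "ham"


print(classify("Buy now! Buy cheap, buy more"))
print(classify("Can you buy milk?"))
```

A fixed keyword list like this is exactly the kind of static rule that spammers learn to evade, which is why the later slides move to learned weights.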


Spam Filters in Action

• Spammers are well aware of our attempts to filter
unwanted messages.
• The first spam detection solutions made use of static
rules, using regular expressions to identify predefined
patterns of suspicious words in the email text.
– Static rules quickly proved to be ineffective
• A dynamic approach allows the spam filter to learn
from the continuous innovations introduced by
spammers.


Spam Filters in Action

• Use the variable B in place of the word buy, and the
variable S in place of the word sex.
– Scoring Function: y = B + S
• If both words are present in the text of the email, the
probability of it being spam increases.
• Attribute a lower weight of 2 to the B variable and a
greater weight of 3 to the S variable.
– Corrected Scoring Function: y = 2B + 3S
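
As a quick check of the two scoring functions, the sketch below evaluates y = B + S and y = 2B + 3S for a few inputs. The function name `spam_score` and the sample values of B and S are assumptions made for the example.

```python
def spam_score(b, s, w_b=2, w_s=3):
    """Weighted score y = w_b * B + w_s * S (defaults give y = 2B + 3S)."""
    return w_b * b + w_s * s


# Compare the unweighted function y = B + S with the weighted one.
for b, s in [(1, 0), (0, 1), (1, 1)]:
    unweighted = spam_score(b, s, w_b=1, w_s=1)
    weighted = spam_score(b, s)
    print(f"B={b}, S={s}: y = B + S -> {unweighted}, y = 2B + 3S -> {weighted}")
```

With equal weights the two keywords are indistinguishable; with weights 2 and 3 a message containing only S scores higher than one containing only B.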


Detecting spam with linear classifiers

• Determine the score to be associated with every
single email message.


Detecting spam with linear classifiers
• Generalize this formalization by shifting the θ threshold value to the left side of the equation

• The formulation of the linear classifier then takes its definitive form

• The index i now takes values from 0 to n
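
The three steps can be written out as follows. The symbols w_i (weights) and x_i (features), and the convention w_0 = -θ with x_0 = 1, are assumed standard notation rather than something stated on the slides; they are consistent with the earlier scoring function y = 2B + 3S, where w_1 = 2, x_1 = B, w_2 = 3, x_2 = S.

```latex
\begin{align}
  % Score compared against the threshold \theta
  y = \sum_{i=1}^{n} w_i x_i &\ge \theta \\
  % Shift \theta to the left-hand side
  \sum_{i=1}^{n} w_i x_i - \theta &\ge 0 \\
  % Absorb the threshold: w_0 = -\theta, x_0 = 1, so i runs from 0 to n
  \sum_{i=0}^{n} w_i x_i &\ge 0
\end{align}
```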


How the Perceptron Learns
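
One common way to state the Perceptron learning rule is: predict with the current weights, and whenever the prediction is wrong, move the weights toward the correct label. The sketch below is a minimal pure-Python version under stated assumptions: the labels follow a -1 (spam) / +1 (ham) convention, the learning rate is 1, and the (buy, sex) keyword counts are invented toy data.

```python
def predict(weights, x):
    """Return +1 (ham) if w_0 + sum(w_i * x_i) >= 0, otherwise -1 (spam)."""
    activation = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if activation >= 0 else -1


def train_perceptron(samples, labels, eta=1.0, epochs=20):
    """Classic Perceptron rule: adjust the weights only on misclassified samples."""
    weights = [0.0] * (len(samples[0]) + 1)  # weights[0] is the bias term
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            update = eta * (target - predict(weights, x))
            if update != 0:
                weights[0] += update
                for i, xi in enumerate(x):
                    weights[i + 1] += update * xi
                errors += 1
        if errors == 0:  # a full pass without mistakes: stop early
            break
    return weights


# Hypothetical (buy, sex) keyword counts; -1 = spam, +1 = ham.
samples = [(0, 0), (1, 0), (0, 1), (2, 1), (1, 2), (3, 2)]
labels = [1, 1, 1, -1, -1, -1]

weights = train_perceptron(samples, labels)
print("Learned weights:", weights)
print("Predictions:", [predict(weights, x) for x in samples])
```

The toy data is linearly separable, so the loop stops after a pass with no errors; on data that is not separable it would simply run for the full number of epochs.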


Perceptron-based Spam Filter

• Use the scikit-learn library to create a simple spam
filter based on the Perceptron.
• Dataset: https://archive.ics.uci.edu/dataset/228/sms+spam+collection
– Downloaded in CSV format
– Text transformed into numerical values
– Only the messages containing the buy and sex
keywords are selected, counting, for each message, the
number of occurrences of each keyword in the text of the
message.
– The data is loaded from the sms_spam_perceptron.csv file
with the pandas library, and the values are extracted from
the pandas DataFrame through the iloc indexer.
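
The loading and training steps could look roughly like the sketch below. Because sms_spam_perceptron.csv is not reproduced here, a small in-memory DataFrame stands in for it; its column layout (label first, then two keyword-count columns), the -1/+1 label encoding, and the Perceptron hyperparameters are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# Hypothetical stand-in for pd.read_csv("sms_spam_perceptron.csv").
# The column names and values are invented for this example.
df = pd.DataFrame({
    "type": ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"],
    "sex":  [0, 0, 1, 0, 2, 1, 3, 2],
    "buy":  [0, 1, 0, 0, 1, 2, 1, 3],
})

# Column 0 holds the label, columns 1 and 2 the keyword counts.
y = np.where(df.iloc[:, 0].values == "spam", -1, 1)
X = df.iloc[:, [1, 2]].values

# Hyperparameters are illustrative, not tuned.
clf = Perceptron(max_iter=40, eta0=0.1, random_state=0)
clf.fit(X, y)

y_pred = clf.predict(X)
print("Training accuracy:", accuracy_score(y, y_pred))
```

Evaluating on the training data, as done here for brevity, overstates real performance; a held-out test set gives a fairer estimate.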


SVM based Spam Filter

• SVM: one of the most popular supervised learning
algorithms, used for classification as well as regression
problems.
• The goal is to create the best line or decision
boundary that can segregate n-dimensional space
into classes.

https://roadmap.sh/cyber-security


SVM based Spam Filter

• Load the data with pandas, associating the class
labels with the corresponding values -1 (spam) and 1 (ham)
• Split the original dataset into 30% test data and 70%
training data
• Import the SVC class from the sklearn.svm package
• Using sklearn.metrics, evaluate the accuracy of the
predictions
– Prediction Accuracy: 84%
– Number of incorrect classifications: 7 cases
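
A minimal sketch of these steps with scikit-learn follows. The data is synthetic (randomly generated keyword counts standing in for the real CSV), and the linear kernel and C value are assumptions, so the accuracy it prints will not match the figures quoted above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic keyword counts, invented for illustration only.
rng = np.random.default_rng(0)
ham = rng.integers(0, 2, size=(30, 2))    # few suspicious keywords
spam = rng.integers(1, 5, size=(30, 2))   # more suspicious keywords
X = np.vstack([ham, spam])
y = np.array([1] * 30 + [-1] * 30)        # 1 = ham, -1 = spam

# 70% training data, 30% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Kernel and C are illustrative choices, not tuned values.
svm = SVC(kernel="linear", C=1.0, random_state=0)
svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)
print("Misclassified samples:", int((y_test != y_pred).sum()))
print("Accuracy:", accuracy_score(y_test, y_pred))
```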


Image spam detection with SVM

• Hackers use images as a vehicle for spreading
spam, instead of simple text.
• Image-based spam detection solutions:
– Content-based filtering:
• Pattern recognition techniques leveraging Optical Character
Recognition (OCR) technology to extract text from images
– Non-content-based filtering:
• Identify specific features of spam images
• Advanced recognition techniques based on NNs and deep
learning (DL) are used to extract the features.
https://scholarworks.sjsu.edu/etd_projects/486/
• Image-based spam detection solution
– https://medium.com/@yesprabhakaran98/email-spam-classification-92b661d3b700


Linear Regression: Pros & Cons

Pros:
• Implementation is simple
• Linear regression works better with continuous
intervals of values

Cons:
• Can manage only quantitative data
• The linearity assumption is often unrealistic (e.g., age & weight)
• Greater classification errors
• Systematically distorted predictions


Linear vs Logistic


Logistic Regression

• Used to predict the category of a dependent variable
based on the values of the independent variables. It
outputs a probability between 0 and 1, which is mapped
to a class label of 0 or 1.
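
Logistic regression arrives at a 0-or-1 label by passing a weighted sum of the features through the sigmoid function and thresholding the resulting probability. The sketch below shows that step in isolation; the weights, bias, and 0.5 threshold are hypothetical values chosen for the example, not fitted parameters.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (probability, label), where label is 1 if probability >= threshold."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical weights and bias, for illustration only.
weights, bias = [2.0, 3.0], -4.0
for features in ([0, 0], [1, 1]):
    probability, label = predict_class(features, weights, bias)
    print(features, round(probability, 3), label)
```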

Classifying Credit Cards using Logistic Regression


Phishing Detector using Logistic Regression

• https://archive.ics.uci.edu/dataset/327/phishing+websites
• The dataset is converted to CSV format using the data
wrangling technique known as one-hot encoding
• Consists of records containing 30 features that
characterize phishing websites.
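
A phishing detector along these lines could be sketched as below. The real dataset is not loaded here: the 30 feature columns are filled with random values in {-1, 0, 1}, and the labels are generated by an invented rule purely so that the model has something to learn. The split ratio and `max_iter` are also assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing dataset: 200 records, 30 features.
rng = np.random.default_rng(42)
X = rng.choice([-1, 0, 1], size=(200, 30))

# Invented labeling rule (not from the real data): 1 = phishing, -1 = legitimate.
y = np.where(X[:, :3].sum(axis=1) > 0, 1, -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy on held-out data:", accuracy_score(y_test, y_pred))
```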


Logistic Regression: Pros & Cons

• Pros:
– The model can be trained very efficiently even in the
presence of a large number of features
– The algorithm has a high degree of scalability, due to the
simplicity of its scoring function

• Cons:
– Requires the features to be linearly independent
– Requires more training samples
– Less powerful in minimizing the prediction errors


Types of Malware

• Trojans: Executables that appear legitimate and
harmless, but once they are launched, they execute
malicious instructions in the background
• Botnets: Malware that has the goal of compromising
as many hosts of a network as possible, in order to put
their computational capacity at the service of the
attacker
• Downloaders: Malware that downloads malicious
libraries or portions of code from the network and
executes them on victim hosts
• APTs (advanced persistent threats): Tailored attacks that
exploit specific vulnerabilities on the victimized hosts


Types of Malware

• Rootkits: Malware that compromises hosts at the operating
system level and, therefore, often comes in the form
of device drivers, making the various
countermeasures (antiviruses) ineffective
• Ransomware: Malware that encrypts files stored
inside the host machines, asking for a ransom from
the victim to obtain the decryption key which is used
for recovering the original files
• Zero days: Malware that exploits vulnerabilities not yet
disclosed to the community of researchers and analysts,
whose characteristics and impacts are not yet
known, and which therefore goes undetected by antivirus software


Decision Tree
https://youtu.be/LDRbO9a6XPU

• Decision trees use binary trees to analyze and
process data for predictions.
• They accept both numerical values and qualitative
information as input data.
Implementation Steps:
• Subdivide the original dataset into two child
subsets (binary condition is verified or falsified)
• The child subsets are further subdivided on the basis of
further conditions
– At each step, the condition that provides the best bipartition
of the original subset is chosen
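
The "best bipartition" step can be sketched as a search over candidate conditions of the form feature <= threshold. The slides do not say how a split is scored; the sketch below assumes Gini impurity, which is one common choice, and uses invented toy rows and labels.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            # Binary condition: row[f] <= threshold is verified or falsified.
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # not a real bipartition
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


# Hypothetical samples with two numeric features each.
rows = [(0, 5), (1, 3), (0, 8), (3, 4), (4, 7), (5, 2)]
labels = ["benign", "benign", "benign", "malware", "malware", "malware"]

print(best_split(rows, labels))
```

A full decision tree would apply `best_split` recursively to each child subset until the subsets are pure or a depth limit is reached.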
