DIET SecurityandMachineLearningBruhadeshwar

Machine Learning for Security
Problems
Bruhadeshwar Bezawada
Mahindra Ecole Centrale, Hyderabad
Bru@mechyd.ac.in
Outline
Why? What? How?

Case studies
Internet-of-Things
Phishing Detection
New challenges and the case for deep
learning
And so,
Why use machine learning for security

problems?
What are security problems anyway?
How to use machine learning for security
problems?
Why?
Seems like the natural thing to do these

days
Every problem can be solved with
machine learning, right?
But...
What?
Are security problems?

A security problem, briefly stated, is a
scenario that degrades the integrity and
reliability of a (computing) system
Examples:
A virus deleting files on a computer
An attacker stealing passwords
A phishing website stealing credit card info
A denial-of-service attack crashing Amazon
But...How?
Machine learning problems need lots of
data
Most security incidents are kept hidden or
secret => No data!
Then, how to use machine learning?
Seems a mis-matched tool for this class of
problems!!
Can't stop progress … or
curiosity (that kills cats:)
Major attempts to merge these two

domains
Network traffic analysis for detecting anomalies
Malware signature matching
Profiling DDoS attacks
...so on
Case Studies
Internet-of-Things
Behavioral Fingerprinting
of IoT Devices
Bruhadeshwar Bezawada, Maalvika Bachani, Jordan Peterson, Hossein Shirazi,
Indrakshi Ray and Indrajit Ray
Colorado State University and Mahindra Ecole Centrale

Motivation (1)
IoT environment is experiencing

tremendous growth and security is a
major concern
Weakly secured smart devices are facilitators of
large scale attacks
Vulnerabilities exist in device firmware
Difficult to perform OTA firmware upgrades
Low capabilities of devices (power, processor,
memory etc.) make stronger security difficult and
expensive
Motivation (2)
IoT device identification (what device it is)

and authentication (is the device the one
that it claims to be) are equally important
for security
Most common techniques have major
issues
Authentication: uses public key certificates
(expensive for many IoT devices)
Identification: uses artificial identity softly tied to
device (easily masqueraded)
Motivation (3)
Being able to strongly identify device type

can greatly enhance security
E.g., checking that a device is behaving the way devices
of its type are supposed to
Device type only needs permission x, y and z
Being able to strongly identify device
instances can greatly simplify authentication
and dependent security services
This is the smart camera in John’s home talking to
John’s Alexa
Device Type vs. Device Instance
[Figure: device hierarchy tree. Light Bulb > Monochrome, Hue > TCP Light, TP Link, Philips, AWOX > A-19, A-21 > S.No: 1, S.No: 2]
Device Category: Light
Device Sub-category: Hue Light
Device Type: Light Bulb | Monochrome | TCP Light | A-21
Device Instances of A21: S.No:1 and S.No:2
Identifying a device type or a device
instance by observing information from
the device is still a largely unexplored
problem in the IoT domain
We describe our efforts to perform remote
device type fingerprinting based on
monitoring the network traffic to and from
the device
Key Observations
An IoT device-type usually has a well-

defined set of functions/services
The corresponding data of a function is
the network “behavior”
Behavioral patterns (and anomalies) of a
known device-type can be captured and
learned
Question is, how accurately?
Past Fingerprinting Efforts (Non
IoT)... Many
General Device Fingerprinting: Measuring inter-

arrival times of packets per device
Wireless Device Fingerprinting: Analysis of
periodicity of link layer scans due to variations
in device driver implementations
Physical Layer Fingerprinting: Analyze
radiometric variations due to component
imperfections/variability, clock-skews
Protocol Specific Fingerprinting: Measure
implementation variations of same protocols
across different devices and/or analyze set of
messages using protocol grammar syntax
Technical Challenges in IoT Area
Multitude of protocols:
Communication standards: 802.15.4 based Zigbee,
ISA100.11a, WirelessHART, MiWi, SNAP, Bluetooth,
WiFi, Ethernet, LPWAN, LoRaWAN, RFID, 3GPP
Data protocols: REST, HTTP/2, SOAP, MQTT, MQTT-
SN, CoAP, SMCP, STOMP, XMPP, XMPP-IoT, Mihini,
AMQP, DDS, LLAP, LWM2M…
Network protocols: 6LowPAN, 6TiSCH, RPL,
IPv4/v6…
Discovery protocols: Physical Web, mDNS, DNS-
SD…
Our Work
Fingerprinting via IP packet captures

This is most feasible since most IoT devices connect
to the Internet or use IP data to communicate with
each other or with the end user’s mobile phone
Attempted earlier by IoTSentinel
Captures packet header features at device
registration
Do we have to do it at device registration?
Uses machine learning classifiers
Achieved between 50-100% identification rate
Can we improve by including other features?
What other features to include?
Threat Model
Device replacement with cheaper or

malicious devices
These devices exhibit normal behavior to an end user
but can secretly steal information or perform
clandestine network activities
Device compromise with the ability to
spoof soft identities like IP address, MAC
address
The compromised device tries to pose as
another device and attempts to bypass local
security policies
Dichotomy of Behavioral
Features
NETWORK LAYER FEATURES
Link Layer protocol –(Static) ARP
Network Layer protocol –(Static) IP/ICMP/ICMPv6, EAPOL
IP Options –(Static) Padding/Router Alert
Transport Layer protocol –(Static) TCP/UDP
Application Layer protocol –(Static) HTTP/SSL/TLS/DHCP/MDNS/DNS/NTP/SSDP
Payload Dependent –(Dynamic) Payload length, Payload Entropy and TCP window size
Identifying Other Packet
Features
Entropy: Denotes amount of information

content. Device-types use different
message formats, causing variation in this
feature
Payload length: Small devices send small
messages and larger devices send larger
messages
TCP window size: Small devices using
TCP have small window sizes; the window size
is proportionate to device capability
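To make the entropy feature concrete, here is a minimal sketch that computes the byte-level Shannon entropy of a packet payload. The slides do not say which entropy definition or normalization was used, so the bits-per-byte formulation and the function name `payload_entropy` are assumptions.

```python
import math
from collections import Counter


def payload_entropy(payload: bytes) -> float:
    """Shannon entropy of the payload in bits per byte (0.0 if empty)."""
    if not payload:
        return 0.0
    counts = Counter(payload)
    total = len(payload)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# A repetitive payload carries little information, while a payload that
# uses every byte value equally often reaches the 8 bits-per-byte maximum.
low = payload_entropy(b"AAAAAAAA")
high = payload_entropy(bytes(range(256)))
payload_length = len(b"AAAAAAAA")  # the payload-length feature is the byte count
```

Encrypted or compressed payloads would be expected to sit near the upper end of this scale, and short plain-text control messages near the lower end.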
ECDF (Empirical Cumulative
Distribution Function) of Payload
Entropy
ECDF of Payload Length
ECDF of TCP Window Size
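For readers unfamiliar with ECDF plots, the following sketch shows how an empirical cumulative distribution function can be built from observed feature values. The payload lengths used here are made up for illustration and are not taken from the experiments.

```python
from bisect import bisect_right


def make_ecdf(values):
    """Return F where F(x) is the fraction of observations <= x."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one observation")
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many observations are <= x
        return bisect_right(ordered, x) / n

    return ecdf


# Hypothetical payload lengths (bytes) observed for one device-type.
payload_lengths = [10, 20, 20, 40]
F = make_ecdf(payload_lengths)
fractions = [F(5), F(20), F(40)]
```

Plotting F for each device-type on the same axes shows whether the feature separates the device-types: curves that rise at different values indicate a discriminating feature.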
Approach
Behavioral Model
A device behavior is a set of distinct command-
response sequences
A command-response sequence is a “session”
Device behavior is a collection of sessions
Any given session data corresponds to a
“fingerprint” of the device
Approach - Issues
Observing/encoding entire session data is

infeasible
Sampling is required
Different sessions have different data
lengths
Need to find best sample size, i.e., number of
packets per sample
Determining Sample Length
Experimented with several devices to
observe average session lengths
Device          Total Packets   Sessions   Packets / Session
AWOX Speaker    12755           3274       3.89
D-Link Camera   8600            1390       6.18
MUSAIC Player   1346            305        4.41
OMNA Camera     8253            1608       5.13
TP-Link Light   1660            175        9.48
Fixed the sample size to be 5
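The packets-per-session column can be recomputed from the packet and session totals, as in the sketch below. The slides do not state how the sample size of 5 was derived from these averages; the median is shown only as one plausible summary that lands close to 5.

```python
from statistics import median

# device -> (total packets, sessions), copied from the table above
captures = {
    "AWOX Speaker": (12755, 3274),
    "D-Link Camera": (8600, 1390),
    "MUSAIC Player": (1346, 305),
    "OMNA Camera": (8253, 1608),
    "TP-Link Light": (1660, 175),
}

packets_per_session = {
    device: packets / sessions
    for device, (packets, sessions) in captures.items()
}

# One possible summary of the typical session length across devices.
typical_session_length = median(packets_per_session.values())
```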

Fingerprinting Approach – Extract
Behavioral Features & Train ML
Classifiers
Step 1 : Capture the network data of a
device-type through interaction/observation
Step 2 : Split the data into samples of 5
consecutive packets
Step 3 : Each sample is used to generate a
machine learning feature vector with 100
features (each packet contributes 20
features)
Step 4: Train a machine learning classifier with
this data
Step 5: Each device-type’s fingerprint is
encoded in a classifier
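Steps 2 and 3 can be sketched as follows. The slides only say that samples consist of 5 consecutive packets, so the choice of non-overlapping windows and of dropping a trailing partial window is an assumption, as are the function and variable names.

```python
PACKETS_PER_SAMPLE = 5     # sample size fixed in the slides
FEATURES_PER_PACKET = 20   # features contributed by each packet


def to_feature_vectors(packet_features):
    """Flatten each run of 5 consecutive packets into one 100-feature vector.

    packet_features: list of per-packet feature lists, in capture order.
    Assumption: windows do not overlap; a trailing partial window is dropped.
    """
    for features in packet_features:
        if len(features) != FEATURES_PER_PACKET:
            raise ValueError("each packet must contribute exactly 20 features")
    vectors = []
    last_start = len(packet_features) - PACKETS_PER_SAMPLE
    for start in range(0, last_start + 1, PACKETS_PER_SAMPLE):
        window = packet_features[start:start + PACKETS_PER_SAMPLE]
        vectors.append([value for packet in window for value in packet])
    return vectors


# Hypothetical capture: 12 packets, each with 20 placeholder feature values.
capture = [[float(i)] * FEATURES_PER_PACKET for i in range(12)]
vectors = to_feature_vectors(capture)
```

Each resulting vector, labeled with its device-type, would then be one training row for the classifier in Step 4.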
Experimental Setup
[Figure: experimental setup, showing the Packet Capture & Analysis Module, the Fingerprinting Module, the Internet, the switching network, wireless IoT devices, and wired IoT devices]

Data Capturing
• Emulated the normal usage of the device
• Device is controlled by a smartphone app and/or
• Device performs some actions without control
messages
Data Collection Method
Device is booted up and allowed to perform any

initial configuration or firmware upgrade
Contacted the device through its smart app if
needed and started interacting with the device
Allowed periods of idle time for the device to
perform some communication without user
intervention
Depending on the device activity, captured 1000
to 10000 packets of network traffic from each
device
Machine Learning Classifiers
Used several classifiers from the scikit-learn
toolkit
k-nearest-neighbors (kNN)
Decision tree
Majority voting
Gradient Boosting
This classifier gave consistently good results
across all the experiments
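As an illustration of the majority-voting idea in the list above, the sketch below combines the per-sample predictions of several classifiers. It is written in plain Python rather than with scikit-learn, the labels are hypothetical, and the slides do not say which base classifiers fed the vote.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (aligned by sample) into one list.

    Ties go to the label encountered first, i.e. the earliest classifier.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical predictions of three classifiers for three samples.
knn_pred = ["camera", "light", "speaker"]
tree_pred = ["camera", "camera", "speaker"]
boost_pred = ["light", "camera", "speaker"]
final = majority_vote([knn_pred, tree_pred, boost_pred])
```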
Devices Tested, Operations and
Data Instances
Experiments
Device-type fingerprinting: Trained

classifier per device-type and tested with
cross-validation
Device-category fingerprinting:
Grouped device-types into categories
Trained classifiers with data from same
categories and tested with cross-validation
Device-instance fingerprinting: Trained
against one device instance data and
tested with another device instance data
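Cross-validation, used in the first two experiments, can be sketched as a k-fold split over sample indices. The number of folds for these experiments is not given in the slides, so `k=5` below is an assumption, and shuffling is omitted to keep the sketch short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical run: 10 feature vectors, 5 folds.
folds = list(k_fold_indices(10, 5))
```

Every sample is used for testing exactly once, and a classifier is trained from scratch on the remaining samples for each fold. Device-instance fingerprinting differs in that the test data comes from a separate physical device rather than from a held-out fold.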
Evaluation Metrics
True Positive Rate : Ability of the classifier to
correctly identify a data instance when
presented with all positive instances
TP/(TP+FN)
Accuracy: Ability of the classifier to
distinguish positive and negative instances
(TP+TN)/(TP+FP+FN+TN)
Positive Predictive Value: Ability of the
classifier to correctly identify positive
instances when presented with mixed data of
positive and negative instances
TP/(TP+FP)
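The three metrics can be computed directly from confusion-matrix counts, as in this short sketch. Accuracy counts both true positives and true negatives as correct decisions. The counts in the example are invented for illustration.

```python
def true_positive_rate(tp, fn):
    """TP / (TP + FN): share of positive instances that were identified."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + FP + FN + TN): share of all decisions that were right."""
    return (tp + tn) / (tp + fp + fn + tn)


def positive_predictive_value(tp, fp):
    """TP / (TP + FP): share of positive predictions that were right."""
    return tp / (tp + fp)


# Hypothetical confusion-matrix counts for one device-type classifier.
tp, fn, fp, tn = 90, 10, 5, 95
tpr = true_positive_rate(tp, fn)
acc = accuracy(tp, tn, fp, fn)
ppv = positive_predictive_value(tp, fp)
```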
Device Type Fingerprinting - TPR
Device Type Fingerprinting - PPV
Device Type Fingerprinting -
Accuracy
D-Link Camera Cross Classifier
Comparison
Device Category Fingerprinting
Generated a data set from the original
data set by grouping devices into device
categories, e.g., light bulbs
Feature Robustness
We performed experiments to determine

the robustness and importance of the
new payload features: payload size,
entropy, TCP window size
Feature Robustness - TPR
Feature Robustness - PPV
Conclusions
IoT device-type fingerprinting is very important

in the context of security
Our experiments with behavioral fingerprinting
showed that it is possible to fingerprint device-
types with high true positive rate of 99%
The high accuracy reported by our
experiments shows that it is possible to reduce
false positives during device fingerprinting
even in the presence of several other devices
Conclusions
Fingerprinting categories of devices is an
entirely different challenge and we
demonstrated some promising results in
this direction
We are trying to identify features that
would enable device instance
fingerprinting
Ultimate goal
“Kn0w Thy Doma1n Name”:
Unbiased Phishing Detection Using Domain Name
Based Features
Hossein Shirazi Bruhadeshwar Bezawada Indrakshi Ray

Colorado State University
Fort Collins, USA
Introduction
Phishing is a major security problem

The financial and privacy implications are
tremendous
Phishing attacks are quite resilient and
adaptive
• A game between phishing defenses and phishers
The several proposed machine learning

approaches have demonstrated the
resilient nature of machine learning
Problem Statement
Problem : Determine if a website is a

phishing website based only on the
information available from the website
Challenge : The content of a phishing
website is textually and visually similar to
some legitimate website
We focus on characterizing the nature of
such websites using only the information
from the website
Limitations of Past Work
Three limitations: efficiency, privacy, and

bias in/from data sets
The content-based approaches perform
in-depth analysis of content and build
classifiers to detect phishing websites
• Several features used in these approaches do not
accurately model the phishing phenomenon
Using third-party servers violates user

privacy by revealing the user’s browsing
history
Bias in/from Datasets
Two reasons for bias in/from datasets: dataset

usage and URL-based features
Dataset usage:
• Researchers used the Alexa.com website to create the list of
legitimate websites.
• They used anti-phishing sites like PhishTank.com for phishing
websites.
Key difference : Alexa.com publishes highly
ranked domains whereas the anti-phishing
sites list the entire URLs of the phishing web
pages.
URL-based features cause bias because
phishing URLs are typically longer than
legitimate URLs and might contain special
features
• However, these days legitimate URLs have similar features
Proposed Approach
Our work is the first solution to be entirely

focused on the domain name of the
phishing website
Intuition: The domain name of the
phishing websites is a key indicator of a
phishing attack
Our approach differs as it explores the
relationship of the domain name to its
intent for phishing
Domain Name Importance: Legitimate vs Phishing
[Figure: a legitimate website and a phishing website compared side by side, annotated with the domain name, its frequency on the page, title match vs. mismatch, and copyright match vs. no copyright logo with the domain name]
Key Contributions
We describe a machine learning (ML) based
approach for phishing detection that relies
entirely on domain name based features
Achieves 97% accuracy on a set of 2000 URLs
with five-fold cross-validation
Achieves 97-99.7% detection rate on live
blacklist data from different sources
Run-time detection speed of our approach is
4 times faster for legitimate websites and 10
times faster than the state-of-the-art work in
this domain
We demonstrate the bias induced by features
like URL length, which raises the question of
revisiting many of the existing works in
literature
Domain Name Based Feature
Design
Our feature design attempts to be data

set agnostic
Feature design aims to model the
principles of phishing attacks
• Reduce the dependence of the features on specific data values
All features depend on the domain name

of the website and the relation of the
domain with respect to the content of the
website
Non-binary Features
Feature 1 (New): Domain Length
• Attackers who want to register a domain for phishing have to choose
a longer domain name than that of the legitimate website
Feature 2 (Existing): URL Length
• Phishing URLs tend to be longer.
• We describe this feature here primarily to highlight the issue of dataset
bias
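A minimal sketch of how Features 1 and 2 could be extracted from a URL follows. The slides do not specify whether "domain" means the full host name or only the registered domain, so measuring the full host name is an assumption, and the URLs are made-up examples on reserved test domains.

```python
from urllib.parse import urlparse


def domain_length(url: str) -> int:
    """Feature 1: number of characters in the host name (assumed definition)."""
    return len(urlparse(url).hostname or "")


def url_length(url: str) -> int:
    """Feature 2: number of characters in the whole URL."""
    return len(url)


# Hypothetical URLs, not taken from any dataset.
short_url = "https://example.com/login"
long_url = "http://secure-login.example-verify.test/a/b?id=1"
features = [(domain_length(u), url_length(u)) for u in (short_url, long_url)]
```

A legitimate list that contains only bare domains and a phishing list that contains full page URLs would differ sharply on `url_length` regardless of any phishing behavior, which is the dataset bias this feature is meant to highlight.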
Feature Validation
Non-binary Features (contd.)
Feature 3 (Existing): Link Ratio in BODY
• This feature is defined as the ratio of the number of hyper-links pointing to
the same domain to the total number of hyper-links on the web page
Feature 4 (New): Frequency of Domain
Name
• The number of times the domain name appears as a word in the
visible text of the web page
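Features 3 and 4 can be sketched with the standard-library HTML parser. Several choices here are assumptions rather than the authors' method: links are compared by exact host name, text inside `<script>`, `<style>` and `<title>` is treated as not visible, and the caller supplies the word to count as the hypothetical parameter `domain_word`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

HIDDEN_TAGS = ("script", "style", "title")


class PageScanner(HTMLParser):
    """Collects link targets and visible text from an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.text.append(data)


def link_ratio_and_frequency(page_url, html, domain_word):
    """Return (same-host link ratio, count of domain_word in visible text)."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    page_host = urlparse(page_url).hostname
    # Resolve relative links against the page URL before comparing hosts.
    hosts = [urlparse(urljoin(page_url, h)).hostname for h in scanner.hrefs]
    ratio = sum(h == page_host for h in hosts) / len(hosts) if hosts else 0.0
    words = re.findall(r"[a-z0-9]+", " ".join(scanner.text).lower())
    return ratio, words.count(domain_word.lower())


# Hypothetical page on a reserved example domain.
page = (
    "<html><head><title>Example</title></head><body>"
    "<p>Welcome to Example. Example keeps your account safe.</p>"
    '<a href="/help">Help</a>'
    '<a href="https://example.com/about">About</a>'
    '<a href="https://other.test/">Partner</a>'
    "<script>var example = 1;</script></body></html>"
)
ratio, frequency = link_ratio_and_frequency("https://example.com/", page, "example")
```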
Feature validation
Binary Valued Features
Feature 5 (Existing): HTTPS Present
Feature 6 (New): Non-alphabetical
Characters in Domain Name
Feature 7 (New): Domain Name with
Copyright Logo
• Many legitimate websites use the copyright logo to indicate the
trade-mark ownership on their organization name
• None of the phishing websites placed their actual domain names
along with the copyright logo
Feature 8 (New): Page Title and Domain
Name Match
• Many legitimate websites repeat the domain name in the title of
the web page
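The four binary features can be approximated with simple string checks, as sketched below. The exact matching rules are not given in the slides, so everything here is an assumption: any character other than a letter or a dot counts as non-alphabetical, the copyright check looks for the domain word on a line containing a copyright marker, and `domain_word` is a hypothetical parameter supplied by the caller.

```python
from urllib.parse import urlparse

COPYRIGHT_MARKERS = ("©", "(c)", "copyright")


def binary_features(url, title, visible_text, domain_word):
    """Return Features 5 to 8 as booleans (heuristic, assumed definitions)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    word = domain_word.lower()
    copyright_lines = [
        line for line in visible_text.lower().splitlines()
        if any(marker in line for marker in COPYRIGHT_MARKERS)
    ]
    return {
        "https_present": parsed.scheme == "https",
        # Dots separate labels, so they are not counted as special characters.
        "non_alpha_in_domain": any(not (ch.isalpha() or ch == ".") for ch in host),
        "domain_with_copyright": any(word in line for line in copyright_lines),
        "title_domain_match": word in title.lower(),
    }


# Hypothetical pages on reserved example domains.
legit = binary_features(
    "https://example.com/", "Example - Sign in",
    "Welcome back\n© 2019 Example Inc.", "example",
)
phish = binary_features(
    "http://examp1e-login.test/", "Sign in",
    "Welcome back\n© 2019 Example Inc.", "examp1e-login",
)
```

In this made-up pair, the phishing page copies the legitimate footer, so its own domain name never appears next to the copyright marker or in the title.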
Statistical analysis of binary
features
Experimental Methodology
We conducted two sets of experiments to
assess the performance
The first set of experiments was conducted on a
prepared dataset
The second set of experiments was conducted
on a live, unknown phishing dataset from
OpenPhish.com
• Only one past work demonstrated a similar result on unknown
datasets with a detection rate of 95%.
• In contrast, our approach achieves much higher detection
accuracy, close to 99.7%
Tested using multiple ML classifiers including
kNN, Decision tree, and Gradient Boosting
Data sets
DS-1: Rows 1 and 2 are combined and used for both
training and testing with 5-fold cross-
validation
DS-2: Rows 1 and 2 are used for training
and Row 3 is used for testing
Results without URL Length
Feature
• Our domain name based approach achieves a high 97%
accuracy and validates our basic hypothesis
• A maximum accuracy of 99.55% and an average
accuracy of 97.74%
• The TPR is high, 98.12% and 97.46%, respectively, compared to
prior art TPR
• The true positive rate is higher than the previously reported best of 97%
• The average accuracy of 97.7% is high compared to existing
works with larger feature sets
Results: TPR and TNR
Results of experiments Without including URL Length feature

Results: Accuracy
Results: TPR and TNR
Results of experiments including feature URL Length

Results: Accuracy
Results for Live Detection (DS-2)
Timing Analysis For Feature
Extraction
Time Analysis
Conclusion(s)
The first approach towards the design of
only domain name based features for
detection of phishing websites using
machine learning
Elimination of the possible bias in
classification due to differently chosen
datasets of phishing and legitimate pages
Difficult to bypass for attacker as our
features explore the content found in the
visible space of the web page
Demonstrated the shortcoming of using
features such as URL length
Low feature extraction and classification
time suitable for real-world deployment
Summary
Machine learning may not be a good fit for
all possible problems, like security problems
Some security problems don't require deep
learning techniques due to various reasons
Some security problems are still to be
modeled and solved using deep learning
methods, with no guarantee of success
Peripheral methods might be a way to
analyze security problems
