
Machine Learning for Security Problems
Bruhadeshwar Bezawada
Mahindra Ecole Centrale, Hyderabad
Bru@mechyd.ac.in
Outline

Why? What? How?


Case studies
Internet-of-Things
Phishing Detection
New challenges and the case for deep
learning
And so,

Why use machine learning for security
problems?
What are security problems anyway?
How to use machine learning for security
problems?
Why?

Seems like the natural thing to do these
days
Every problem can be solved with
machine learning, right?
But...
What?

Are security problems?


A security problem, briefly stated, is a
scenario that degrades the integrity and
reliability of a (computing) system
Examples:
A virus deleting files on a computer
An attacker stealing passwords
A phishing website stealing credit card info
A denial-of-service attack crashing Amazon
But...How?

Machine learning problems need lots of
data
Most security incidents are kept hidden or
secret => No data!
Then, how to use machine learning?
Seems a mis-matched tool for this class of
problems!!
Can't stop progress … or
curiosity (that kills cats:)

Major attempts to merge these two
domains
Network traffic analysis for detecting anomalies
Phishing Detection
Malware signature matching
Profiling DDoS attacks
...so on
Case Studies

Internet-of-Things
Phishing Detection
Behavioral Fingerprinting
of IoT Devices
Bruhadeshwar Bezawada, Maalvika Bachani, Jordan Peterson, Hossein Shirazi,
Indrakshi Ray and Indrajit Ray

Colorado State University and Mahindra Ecole Centrale


Motivation (1)

The IoT environment is experiencing
tremendous growth and security is a
major concern
Weakly secured smart devices are facilitators of
large scale attacks
Vulnerabilities exist in device firmware
Difficult to perform OTA firmware upgrades
Low capabilities of devices (power, processor,
memory etc.) make stronger security difficult and
expensive
Motivation (2)

IoT device identification (what device it is)
and authentication (is the device the one
that it claims to be) are equally important
for security
Most common techniques have major
issues
Authentication: uses public key certificates
(expensive for many IoT devices)
Identification: uses artificial identity softly tied to
device (easily masqueraded)
Motivation (3)

Being able to strongly identify device type
can greatly enhance security
E.g., check that a device is behaving the way devices of
its type are supposed to
Device type only needs permission x, y and z
Being able to strongly identify device
instances can greatly simplify authentication
and dependent security services
This is the smart camera in John’s home talking to
John’s Alexa
Device Type vs. Device Instance

[Figure: device hierarchy tree. Light Bulb -> Monochrome / Hue -> TCP Light / TP Link / Philips / AWOX -> A-19 / A-21 -> S.No: 1 / S.No: 2]
Device Category: Light
Device Sub-category: Hue Light
Device Type: Light Bulb | Monochrome | TCP Light | A-21
Device Instances of A-21: S.No: 1 and S.No: 2
Behavioral Fingerprinting

Identifying a device type or a device
instance by observing information from
the device
Is still a largely unexplored problem in the
IoT domain
We describe our efforts to perform remote
device type fingerprinting based on
monitoring the network traffic to and from
the device
Key Observations

An IoT device-type usually has a well-defined
set of functions/services
The corresponding data of a function is
the network “behavior”
Behavioral patterns (and anomalies) of a
known device-type can be captured and
learned
The question is: how accurately?
Past Fingerprinting Efforts (Non
IoT)... Many

General Device Fingerprinting: Measuring
inter-arrival times of packets per device
Wireless Device Fingerprinting: Analysis of
periodicity of link layer scans due to variations
in device driver implementations
Physical Layer Fingerprinting: Analyze
radiometric variations due to component
imperfections/variability, clock-skews
Protocol Specific Fingerprinting: Measure
implementation variations of same protocols
across different devices and/or analyze set of
messages using protocol grammar syntax
Technical Challenges in IoT Area

Multitude of protocols:
Communication standards: 802.15.4 based Zigbee,
ISA100.11a, WirelessHART, MiWi, SNAP, Bluetooth,
WiFi, Ethernet, LPWAN, LoRaWAN, RFID, 3GPP
Data protocols: REST, HTTP/2, SOAP, MQTT, MQTT-
SN, CoAP, SMCP, STOMP, XMPP, XMPP-IoT, Mihini,
AMQP, DDS, LLAP, LWM2M…
Network protocols: 6LowPAN, 6TiSCH, RPL,
IPv4/v6…
Discovery protocols: Physical Web, mDNS, DNS-
SD…
Our Work

Fingerprinting via IP packet captures


This is most feasible since most IoT devices connect
to the Internet or use IP data to communicate with
each other or with the end user’s mobile phone
Attempted earlier by IoTSentinel
Captures packet header features at device
registration
Do we have to do it at device registration?
Uses machine learning classifiers
Achieved identification rates between 50% and 100%
Can we improve by including other features?
What other features to include?
Threat Model

Device replacement with cheaper or
malicious devices
The replacement exhibits normal behavior to an end user but
can secretly steal information or perform
clandestine network activities
Device compromise with the ability to
spoof soft identities like IP address, MAC
address
The compromised device tries to pose as
another device and attempts to bypass local
security policies
Dichotomy of Behavioral
Features

NETWORK LAYER FEATURES

Link Layer protocol (Static): ARP
Network Layer protocol (Static): IP/ICMP/ICMPv6, EAPOL
IP Options (Static): Padding/Router Alert
Transport Layer protocol (Static): TCP/UDP
Application Layer protocol (Static): HTTP/SSL/TLS/DHCP/MDNS/DNS/NTP/SSDP
Payload Dependent (Dynamic): Payload length, Payload Entropy and TCP window size
Identifying Other Packet
Features

Entropy: Denotes the amount of information
content. Device-types use different
message formats, causing variation in this
feature
Payload length: Small devices send small
messages and larger devices send larger
messages
TCP window size: Small devices using
TCP have small window sizes; the window size is
proportional to device capability
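The three payload-dependent features can be sketched in a few lines. The slides do not say how entropy is computed, so this sketch assumes byte-level Shannon entropy (bits per byte); the function names and the `tcp_window` argument are illustrative, not taken from the original work.

```python
import math
from collections import Counter


def payload_entropy(payload: bytes) -> float:
    """Shannon entropy of the payload in bits per byte (0.0 if empty).

    Assumption: entropy is measured over byte values; the slides do not
    specify the exact definition.
    """
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def payload_features(payload: bytes, tcp_window: int) -> dict:
    """Collect the three dynamic features for one packet."""
    return {
        "payload_length": len(payload),
        "payload_entropy": payload_entropy(payload),
        "tcp_window_size": tcp_window,
    }
```

A payload of one repeated byte has entropy 0, while a payload that uses all 256 byte values equally often reaches the maximum of 8 bits per byte, which is roughly what compressed or encrypted traffic looks like.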
ECDF (Empirical Cumulative
Distribution Function) of Payload
Entropy
ECDF of Payload Length
ECDF of TCP Window Size
Behavioral Fingerprinting
Approach

Behavioral Model
A device behavior is a set of distinct command-
response sequences
A command-response sequence is a “session”
Device behavior is a collection of sessions
Any given session data corresponds to a
“fingerprint” of the device
Behavioral Fingerprinting
Approach - Issues

Observing/encoding entire session data is
infeasible
Sampling is required
Different sessions have different data
lengths
Need to find best sample size, i.e., number of
packets per sample
Determining Sample Length
Experimented with several devices to
observe average session lengths

Device          Total Packets   Sessions   Packets / Session
AWOX Speaker    12755           3274       3.89
D-Link Camera   8600            1390       6.18
MUSAIC Player   1346            305        4.41
OMNA Camera     8253            1608       5.13
TP-Link Light   1660            175        9.48

Fixed the sample size to be 5
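The slides do not state how the value 5 was derived from these measurements. As a rough sanity check only, the sketch below recomputes packets per session from the table's packet and session counts and looks at two plausible summaries (pooled average and per-device median); both choices are assumptions, not the authors' stated method.

```python
from statistics import median

# (total packets, sessions) per device, copied from the table above.
observations = {
    "AWOX Speaker": (12755, 3274),
    "D-Link Camera": (8600, 1390),
    "MUSAIC Player": (1346, 305),
    "OMNA Camera": (8253, 1608),
    "TP-Link Light": (1660, 175),
}

# Packets per session for each device.
per_device = {
    name: packets / sessions
    for name, (packets, sessions) in observations.items()
}

# Pooled average: all packets divided by all sessions.
pooled = (
    sum(packets for packets, _ in observations.values())
    / sum(sessions for _, sessions in observations.values())
)

# Median of the per-device averages.
typical = median(per_device.values())
```

Both summaries round to 5 packets, which is consistent with the chosen sample size.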


Fingerprinting Approach – Extract
Behavioral Features & Train ML
Classifiers
Step 1 : Capture the network data of a
device-type through interaction/observation
Step 2 : Split the data into samples of 5
consecutive packets
Step 3 : Each sample is used to generate a
machine learning feature vector with 100
features (each packet contributes 20
features)
Step 4: Train a machine learning classifier with
this data
Step 5: Each device-type’s fingerprint is
encoded in a classifier
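Steps 2 and 3 can be sketched as follows. The sketch assumes non-overlapping windows of 5 consecutive packets and drops a trailing incomplete window; the slides do not say whether windows overlap or how leftovers are handled, and the function and constant names are illustrative.

```python
from typing import List, Sequence

PACKETS_PER_SAMPLE = 5    # sample size fixed in the slides
FEATURES_PER_PACKET = 20  # each packet contributes 20 features


def build_samples(packet_features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn per-packet feature rows into 100-feature sample vectors.

    Assumption: windows do not overlap and an incomplete trailing
    window is discarded.
    """
    usable = len(packet_features) - len(packet_features) % PACKETS_PER_SAMPLE
    samples = []
    for start in range(0, usable, PACKETS_PER_SAMPLE):
        vector: List[float] = []
        for packet in packet_features[start:start + PACKETS_PER_SAMPLE]:
            if len(packet) != FEATURES_PER_PACKET:
                raise ValueError("each packet must contribute 20 features")
            vector.extend(packet)  # concatenate the 5 packets' features
        samples.append(vector)
    return samples
```

Each returned vector has 5 x 20 = 100 entries and can be passed, together with a device-type label, to whichever classifier is being trained in Step 4.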
Experimental Setup

[Figure: experimental setup. Components shown: Packet Capture & Analysis Module, Fingerprinting Module, Internet, Switching Network, Wireless IoT Devices, Wired IoT Devices]


Data Capturing
• Emulated the normal usage of device
• Device is controlled by a smartphone app and/or
• Device performs some actions without control
messages
Data Collection Method

Device is booted up and allowed to perform any
initial configuration or firmware upgrade
Contacted the device through its smart app if
needed and started interacting with the device
Allowed periods of idle time for the device to
perform some communication without user
intervention
Depending on the device activity, captured 1000
to 10000 packets of network traffic from each
device
Machine Learning Classifiers

Used several classifiers from the Scikit-learn
library
k-nearest-neighbors (kNN)
Decision tree
Majority voting
Gradient Boosting
This classifier gave consistently good results
across all the experiments
Devices Tested, Operations and
Data Instances
Experiments

Device-type fingerprinting: Trained
classifier per device-type and tested with
cross-validation
Device-category fingerprinting:
Grouped device-types into categories
Trained classifiers with data from same
categories and tested with cross-validation
Device-instance fingerprinting: Trained
against one device instance data and
tested with another device instance data
Evaluation Metrics
True Positive Rate : Ability of the classifier to
correctly identify a data instance when
presented with all positive instances
TP/(TP+FN)
Accuracy: Ability of the classifier to
distinguish positive and negative instances
(TP+TN)/(TP+FP+FN+TN)
Positive Predictive Value: Ability of the
classifier to correctly identify positive
instances when presented with mixed data of
positive and negative instances
TP/(TP+FP)
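The three metrics follow directly from confusion-matrix counts. The sketch below uses the standard definitions, with accuracy computed as (TP + TN) / (TP + FP + FN + TN); the counts in the example are hypothetical and are not results from the experiments.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """TPR = TP / (TP + FN)."""
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """PPV = TP / (TP + FP)."""
    return tp / (tp + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts for one device-type classifier.
tp, fn, fp, tn = 90, 10, 5, 95
tpr = true_positive_rate(tp, fn)        # 0.9
ppv = positive_predictive_value(tp, fp)
acc = accuracy(tp, tn, fp, fn)          # 0.925
```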
Device Type Fingerprinting - TPR
Device Type Fingerprinting - PPV
Device Type Fingerprinting -
Accuracy
D-Link Camera Cross Classifier
Comparison
Device Category Fingerprinting

Generated a data set from the original
data set by grouping devices into device
categories, e.g., light bulbs
Feature Robustness

We performed experiments to determine
the robustness and importance of the
new payload features: payload size,
entropy, TCP window size
Feature Robustness - TPR
Feature Robustness - PPV
Conclusions

IoT device-type fingerprinting is very important
in the context of security
Our experiments with behavioral fingerprinting
showed that it is possible to fingerprint device-
types with high true positive rate of 99%
The high accuracy reported by our
experiments shows that it is possible to reduce
false positives during device fingerprinting
even in the presence of several other devices
Conclusions

Fingerprinting categories of devices is an
entirely different challenge and we
demonstrated some promising results in
this direction
We are trying to identify features that
would enable device instance
fingerprinting
Ultimate goal
“Kn0w Thy Doma1n Name”:
Unbiased Phishing Detection Using Domain Name
Based Features

Hossein Shirazi Bruhadeshwar Bezawada Indrakshi Ray


Colorado State University
Fort Collins, USA
Introduction

Phishing is a major security problem


The financial and privacy implications are
tremendous
Phishing attacks are quite resilient and
adaptive
• A game between phishing defenses and phishers

The several proposed machine learning
approaches have demonstrated the
resilient nature of machine learning
Problem Statement

Problem: Determine if a website is a
phishing website based only on the
information available from the website
Challenge : The content of a phishing
website is textually and visually similar to
some legitimate website
We focus on characterizing the nature of
such websites using only the information
from the website
Limitations of Past Work

Three limitations: efficiency, privacy, and
bias in/from data sets
The content-based approaches perform
in-depth analysis of content and build
classifiers to detect phishing websites
• Several features used in these approaches do not
accurately model the phishing phenomenon

Using third-party servers violates user
privacy by revealing the user’s browsing
history
Bias in/from Datasets

Two reasons for bias in/from datasets: dataset
usage and URL-based features
Dataset usage:
• Researchers used the Alexa.com website to create the list of
legitimate websites.
• They used anti-phishing sites like PhishTank.com for phishing
websites.
Key difference: Alexa.com publishes highly
ranked domains whereas the anti-phishing
sites list the entire URLs of the phishing web
pages.
URL-based features cause bias because
phishing URLs are typically longer than
legitimate URLs and might contain special
features
• However, these days legitimate URLs have similar features
Proposed Approach

Our work is the first solution to be entirely
focused on the domain name of the
phishing website
Intuition: The domain name of the
phishing websites is a key indicator of a
phishing attack
Our approach differs as it explores the
relationship of the domain name to its
intent for phishing
Domain Name Importance: Legitimate vs Phishing

[Figure: a legitimate website compared with a phishing website. Annotations: domain name, domain name frequency, title match vs. mismatch, copyright match vs. no copyright logo with domain name]
Key Contributions
We describe a machine learning (ML) based
approach for phishing detection that relies
entirely on domain name based features
Achieves 97% accuracy on a set of 2000 URLs
with five-fold cross-validation
Achieves 97-99.7% detection rate on live
blacklist data from different sources
Run-time detection speed of our approach is
4 times faster for legitimate websites and 10
times faster than the state-of-the-art work in
this domain
We demonstrate the bias induced by features
like URL length, which raises the question of
revisiting many of the existing works in
literature
Domain Name Based Feature
Design

Our feature design attempts to be data
set agnostic
Feature design aims to model the
principles of phishing attacks
• Reduce the dependence of the features on specific data values

All features depend on the domain name
of the website and the relation of the
domain with respect to the content of the
website
Non-binary Features

Feature 1 (New): Domain Length
• Attackers who want to register a domain for phishing have to choose
a longer domain name than that of the legitimate website
Feature 2 (Existing): URL Length
• Phishing websites tend to use longer URLs.
• We describe this feature here primarily to highlight the issue of dataset
bias
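Both length features can be read off the URL. The slides do not define exactly what counts as the "domain", so this sketch assumes it is the URL's hostname with a leading `www.` removed; a production version would likely need a public-suffix list. The function names are illustrative.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Hostname of the URL, lowercased, without a leading 'www.'.

    Assumption: this approximates the 'domain name' used in the slides.
    Returns '' when the URL has no scheme/host part.
    """
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def length_features(url: str) -> dict:
    """Feature 1 (domain length) and Feature 2 (URL length)."""
    return {
        "domain_length": len(domain_of(url)),
        "url_length": len(url),
    }
```

For example, `length_features("https://www.example.com/login?x=1")` reports a domain length of 11 (for `example.com`) and the length of the full URL string.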
Feature Validation
Non-binary Features (contd.)

Feature 3 (Existing): Link Ratio in BODY
• This feature is defined as the ratio of the number of hyperlinks pointing to
the same domain to the total number of hyperlinks on the web page
Feature 4 (New): Frequency of Domain Name
• The number of times the domain name appears as a word in the
visible text of the web page
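Features 3 and 4 can be sketched with the standard-library HTML parser. Several choices here are assumptions rather than details from the slides: relative links count as same-domain, hosts are compared by exact hostname, text inside `script`, `style`, and `title` is treated as not visible, and `name` is a hypothetical argument holding the domain's name label (for example `example` for `example.com`).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

HIDDEN_TAGS = {"script", "style", "title"}


class _PageScanner(HTMLParser):
    """Collects link targets and (approximately) visible text."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.text_parts.append(data)


def link_ratio_and_frequency(html: str, page_url: str, name: str):
    """Return (same-domain link ratio, count of `name` as a word in text)."""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    page_host = urlsplit(page_url).hostname
    # Resolve relative links against the page URL before comparing hosts.
    hosts = [urlsplit(urljoin(page_url, h)).hostname for h in scanner.hrefs]
    same = sum(1 for h in hosts if h == page_host)
    ratio = same / len(hosts) if hosts else 0.0
    words = re.findall(r"[a-z0-9]+", " ".join(scanner.text_parts).lower())
    return ratio, words.count(name.lower())
```

A page with no links yields a ratio of 0.0 in this sketch; whether that is the right default for a classifier input is a design choice the slides do not address.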
Feature validation
Binary Valued Features

Feature 5 (Existing): HTTPS Present
Feature 6 (New): Non-alphabetical
Characters in Domain Name
Feature 7 (New): Domain Name with
Copyright Logo
• Many legitimate websites use the copyright logo to indicate the
trademark ownership of their organization name
• None of the phishing websites placed their actual domain names
along with the copyright logo
Feature 8 (New): Page Title and Domain
Name Match
• Many legitimate websites repeat the domain name in the title of
the web page
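The four binary features can be approximated as simple string checks. This is only a sketch under stated assumptions: `name` is a hypothetical argument holding the domain's name label, "with the copyright logo" is approximated as the name appearing on the same text line as `©` or `(c)`, and the title match is a case-insensitive substring test. The slides do not give the exact matching rules.

```python
from urllib.parse import urlsplit


def binary_features(url: str, title: str, visible_text: str, name: str) -> dict:
    """Features 5-8 as booleans, using simplified matching rules."""
    name = name.lower()
    # Lines of visible text that carry a copyright mark.
    copyright_lines = [
        line.lower()
        for line in visible_text.splitlines()
        if "©" in line or "(c)" in line.lower()
    ]
    return {
        # Feature 5: the page is served over HTTPS.
        "https_present": urlsplit(url).scheme == "https",
        # Feature 6: any character outside a-z (digit, hyphen, ...).
        "non_alpha_in_domain": any(not ("a" <= ch <= "z") for ch in name),
        # Feature 7: name appears next to a copyright mark.
        "domain_with_copyright": any(name in line for line in copyright_lines),
        # Feature 8: name appears in the page title.
        "title_matches_domain": name in title.lower(),
    }
```

With hypothetical inputs, a page at `https://example.com/` titled "Example Domain" with the footer "© 2018 Example Inc." sets Features 5, 7, and 8, while a look-alike at `http://examp1e-login.net/` copying the same title and footer sets only Feature 6.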
Statistical analysis of binary
features
Experimental Methodology

We conducted two sets of experiments to
assess the performance
The first set of experiments was conducted on a
prepared dataset
The second set of experiments was conducted
on a live, unknown phishing dataset from
OpenPhish.com
• Only one past work demonstrated a similar result on unknown
datasets, with a detection rate of 95%.
• In contrast, our approach achieves much higher detection
accuracy, close to 99.7%
Tested using multiple ML classifiers including
kNN, Decision tree, and Gradient Boosting
Data sets

DS-1: combines rows 1 and 2 for both
training and testing with 5-fold cross-validation
DS-2: Rows 1 and 2 are used for training
and Row 3 is used for testing
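The 5-fold protocol used for DS-1 partitions the data into five folds and tests on each fold in turn while training on the other four. The sketch below shows only the index bookkeeping and assumes contiguous, unshuffled folds; the slides do not say whether the folds were shuffled or stratified, and the function name is illustrative.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumption: folds are contiguous blocks in the original order.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

For a set of 2000 URLs, each of the five rounds trains on 1600 samples and tests on the remaining 400, and every sample is tested exactly once.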
Results without URL Length
Feature

• Our domain name based approach achieves a high 97%
accuracy and validates our basic hypothesis
• A maximum accuracy of 99.55% and an average
accuracy of 97.74%
• The TPR is high, 98.12% and 97.46% respectively, compared to
prior art TPR
• The true positive rate is higher than the previously reported best of 97%
• The average accuracy of 97.7% is high compared to existing
works with larger feature sets
Results: TPR and TNR

Results of experiments Without including URL Length feature


Results: Accuracy
Results: TPR and TNR

Results of experiments including feature URL Length


Results: Accuracy
Results for Live Detection (DS-2)
Timing Analysis For Feature
Extraction

Time Analysis
Conclusion(s)
The first approach towards the design of
only domain name based features for
detection of phishing websites using
machine learning
Elimination of the possible bias in
classification due to differently chosen
datasets of phishing and legitimate pages
Difficult to bypass for attacker as our
features explore the content found in the
visible space of the web page
Demonstrated the shortcoming of using
features such as URL length
Low feature extraction and classification
time suitable for real-world deployment
Summary

Machine learning may not be a good fit for
all possible problems, like security problems
Some security problems don't require deep
learning techniques due to various reasons
Some security problems are still to be
modeled and solved using deep learning
methods; there is no guarantee of success
Peripheral methods might be a way to
analyze security problems
