
Narus Company Confidential 1

Summer 2011 Company Meeting


CyberEagle: Automated Discovery, Attribution, Analysis and Risk
Assessment of Information Security Threats
Saby Saha, Narus
Lei Liu, Michigan State University
Prakash Mandayam, Michigan State University
CyberEagle
Motivation and Challenges
Project Layout
Architecture
Statistical Machine Learning / Data Mining
Results
Conclusion & Future Work
Increasing Security Threats
Continuous and increased attacks on infrastructure
Threats to business, national security
Huge financial stakes (Conficker: 10 million machines, $9.1 billion in losses)
Attacks are becoming more advanced and sophisticated
Honeypots, IDS/IPS, Email/IP Reputation Systems are
inadequate
Zeus: 3.6 million machines [HTML Injection]
Koobface: 2.9 million machines [Social Networking Sites]
TidServ: 1.5 million machines [Email spam attachment]

More Sophisticated Attacks
Host Based Security
Complete monitoring of end-host behavior and the state of the system
Often analyzes a malware program in a controlled environment to build a model of its behavior
Pros
Information-rich view: high detection rate with a low false positive rate
Reverse-engineers the properties of the threat
Cons
After-the-fact approach
Requires malicious code for analysis
Fails to identify evolved threats
Not effective at identifying zero-day threats


Network Security
Firewall systems
IDS/IPS
Network behavior anomaly detection (NBAD)
Pros
Complete macro view of the network
With knowledge of good traffic, it can identify anomalies
Able to identify new threats as anomalies
Cons
Generates a large number of false positives
Unsupervised approach, lacks ground truth
Bringing Them Together
Leverage the advantages of both approaches
Host security tags flows with threat signatures
Generates ground truth associated with the flows
Network security can learn a rich statistical model for all threats using the flow data tagged with ground truth
Develop a comprehensive end-to-end data security system for real-time discovery, analysis, and risk assessment of security threats
Enhanced Comprehensive Security System
Discover common and persistent behavioral patterns for all security threats
Even when sessions are encrypted (where IDS/IPS fails)
Generate precise threat alerts in real time
Reduce the false positive rate
Identify new threats that have some similarities with previous ones
Newly evolved version of a threat
New threat with a similar behavioral pattern
Inform host security about the newly identified threat
System Overview
Model Generation
Extract Set of Transport Layer Features
Generation of Statistical Models
Classification
Flush Out Model to Streaming Classification Path
Redirect Packets Matching Model to Binary Analysis Module
Validation
Assessment
Extract Executable and Execute Executable
Analysis of Information Touched
Assess the Risk
Increase Confidence and Alert
Information Flow
Supervised Threat Classification
Data
Network flow features
Kernel
Define similarity between different flows
Classifier
Binary to separate good from bad
Multiclass to further separate bad flows
Scalability issues
Hierarchy
Challenges
Irregular data
Missing values
Imbalanced data
Heterogeneous
Non-applicable features
Large number of classes (number of threats reaches hundreds of thousands)
New classes
Noise in the data
All threat classes may not be captured
Minimize false positives
Preprocessing
Normalization

Deal with missing values
Case deletion method
Mean imputation
Over all classes
Within each individual class
Median imputation
Over all classes
Within each individual class
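The missing-value options listed above can be sketched in a few lines. This is only an illustration, not the project's code: the function names, the use of None for a missing value, and the toy flows are all assumptions made for the example.

```python
from statistics import mean, median

def case_deletion(rows, labels):
    """Drop every sample that has at least one missing (None) value."""
    kept = [(r, y) for r, y in zip(rows, labels) if None not in r]
    return [r for r, _ in kept], [y for _, y in kept]

def impute(rows, labels, stat=mean, per_class=False):
    """Replace None by the mean/median of the observed values in that column.

    per_class=False uses all samples; per_class=True uses only samples that
    share the label of the incomplete sample (falling back to all samples
    when that class has no observed value for the column).
    Assumes every column has at least one observed value.
    """
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        overall = [r[j] for r in rows if r[j] is not None]
        by_class = {}
        for r, y in zip(rows, labels):
            if r[j] is not None:
                by_class.setdefault(y, []).append(r[j])
        for i, y in enumerate(labels):
            if rows[i][j] is None:
                pool = by_class.get(y) if per_class else None
                filled[i][j] = stat(pool or overall)
    return filled

# Toy flows with two features (hypothetical data); None marks a missing value.
rows = [[1.0, None], [3.0, 10.0], [None, 20.0], [8.0, 40.0]]
labels = ["a", "a", "b", "b"]
overall_mean = impute(rows, labels)
class_mean = impute(rows, labels, per_class=True)
overall_median = impute(rows, labels, stat=median)
complete_rows, complete_labels = case_deletion(rows, labels)
```

Per-class imputation keeps class-specific structure but needs labels, so it is only available on the training data.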


Classifier Framework

[Figure: classifier framework]
Flows pass through SNORT, which splits them into
Bad Flows: 13935 flows in 76 different classes (Class 1, Class 2, ..., Class 76), e.g. Shellcode, Spambot_Proxy_Control_Channel, Exploit_Suspected_PHP_Injection_Attack
Unknown Flows: 44427 flows
Supervised Classifier (Learning/Training)
Macro-Level Classifier: Unknown vs. Bad
Micro-Level Classifier: CL_A, CL_B, ..., CL_N
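A minimal sketch of the macro/micro routing in the framework above. The nearest-centroid classifiers are a stand-in chosen only to keep the example self-contained (the slides use SVMs and other classifiers), and the feature vectors and helper names are hypothetical.

```python
from math import dist

def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def train_nearest_centroid(samples_by_label):
    """Return a predict function that picks the label with the closest centroid."""
    centers = {label: centroid(pts) for label, pts in samples_by_label.items()}
    def predict(x):
        return min(centers, key=lambda label: dist(x, centers[label]))
    return predict

def train_two_level(bad_by_class, unknown):
    """Macro level separates bad from unknown; micro level names the threat."""
    all_bad = [x for pts in bad_by_class.values() for x in pts]
    macro = train_nearest_centroid({"bad": all_bad, "unknown": unknown})
    micro = train_nearest_centroid(bad_by_class)
    def classify(x):
        if macro(x) == "unknown":
            return "unknown"
        return micro(x)  # only flows judged bad reach the micro level
    return classify

# Hypothetical two-feature flows; class names are taken from the slide.
bad_by_class = {
    "Shellcode": [[9.0, 1.0], [10.0, 2.0]],
    "Spambot_Proxy_Control_Channel": [[9.0, 9.0], [10.0, 10.0]],
}
unknown = [[0.0, 0.0], [1.0, 1.0]]
classify = train_two_level(bad_by_class, unknown)
label_a = classify([9.5, 1.0])
label_b = classify([0.2, 0.4])
```

Splitting the decision this way means the many-class micro-level classifier only runs on flows the macro level has already flagged as bad.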
Binary Classifier Results
Biased SVM performance comparison with different kernels (values in %)
                  Linear Kernel   RBF Kernel   Poly Kernel
Precision good    79.75           87.46        78.70
Recall good       87.07           90.42        97.79
F1 good           83.25           88.9347      87.2126
Precision bad     79.75           69.33        79.78
Recall bad        37.17           62.55        24.81
F1 bad            42.74           65.7657      34.8495
Accuracy          74.08           83.26        78.79
G-mean            56.89           75.21        49.25
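For reference, the F1 and G-mean rows can be derived from the precision and recall rows. The sketch below assumes the usual definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of the two per-class recalls); the function names are chosen for this example.

```python
from math import sqrt

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def g_mean(recall_good, recall_bad):
    """Geometric mean of the per-class recalls."""
    return sqrt(recall_good * recall_bad)

# RBF kernel column of the table (values in percent).
f1_bad_rbf = f1_score(69.33, 62.55)
g_mean_rbf = g_mean(90.42, 62.55)
print(round(f1_bad_rbf, 4), round(g_mean_rbf, 2))
```

With the RBF column as input, these definitions give an F1 bad that matches the reported 65.7657 and a G-mean close to the reported 75.21. G-mean is useful here because the data is imbalanced: it drops sharply when either class has poor recall, even if overall accuracy looks acceptable.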
Kernel Learning
Binary Classifier Results
Parameter selection for Biased SVM with RBF Kernel
When gamma = 10 and C+/C- = 0.5, the best F1_bad = 0.6494
When gamma = 10 and C+/C- = 0.55, the best F1_bad = 0.657657
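For context, the two tuned parameters appear in the standard biased (cost-sensitive) SVM formulation as follows. This is the textbook form, written here as an assumption about what the slides use; which of the two classes (good or bad) is treated as the positive class is not stated in the slides.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w \rVert^2
  + C_{+} \sum_{i:\, y_i = +1} \xi_i
  + C_{-} \sum_{i:\, y_i = -1} \xi_i
\qquad \text{s.t.}\quad
  y_i \bigl( w^{T} \phi(x_i) + b \bigr) \ge 1 - \xi_i,\quad \xi_i \ge 0

K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j) = \exp\bigl( -\gamma \lVert x_i - x_j \rVert^2 \bigr)
```

The ratio C+/C- sets how much more a misclassified positive sample costs than a misclassified negative one, and gamma controls the width of the RBF kernel.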
Binary Classifier Results
F1 bad comparison of the methods for Binary classifier
F1 best performance with/without noise: 79.07/88.7 %
[Figure: two bar charts, F1 bad comparison with noise and without noise]
F1 bad (%)              With noise   Without noise
Bagging SMO             45.57        51.7
Adaboost                45.57        51.7
SMO                     46.41        53.4
KNN                     63.74        79.43
Biased SVM              65.7657      67.55
Decision Tree           76.01        86.8
Bagging Decision Tree   79.07        88.7
Preprocessing (Multiclass)
Tree-based generated features
For each class k, do
  Repeat c times (i = 1, ..., c)
    Collect samples from class k, label them +1
    Collect samples from the complement of class k (k^c), label them -1
    Build a regression tree on the above binary data
    Store the tree as T_ik
  End
End
Example: transformation from original features to tree-based features
Original features
Home owner   Marital status   Annual income   Number of children   Age
-            married          125K            -                    41
No           Not married      70K             N/A                  22
No           -                59K             1                    55
yes          Not married      -               N/A                  23
yes          married          100K            1                    -
Tree-based features
Tree 1   Tree 2     Tree 3   Tree 4     Tree 5
-0.25    -1         -0.5     -1         -0.14286
-0.25    -1         -0.5     -0.33333   -0.14286
-1       0.2        1        1          0.142857
0.5      0.714286   0.5      0.25       -1
-0.25    -0.33333   -0.5     0.777778   -0.14286
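The loop above can be sketched as follows. To keep the example short, each regression tree is replaced by a depth-1 regression stump, and "collect samples" is taken to mean sampling with replacement; both are simplifying assumptions for illustration, as are the function names and the toy labels.

```python
import random

def fit_stump(X, y):
    """Depth-1 regression tree: best (feature, threshold) by squared error.
    Samples whose split feature is missing (None) get the overall mean."""
    overall = sum(y) / len(y)
    best = None
    for j in range(len(X[0])):
        obs = [(x[j], t) for x, t in zip(X, y) if x[j] is not None]
        miss_sse = sum((t - overall) ** 2 for x, t in zip(X, y) if x[j] is None)
        for thr in sorted({v for v, _ in obs}):
            left = [t for v, t in obs if v <= thr]
            right = [t for v, t in obs if v > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right) + miss_sse)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:
        return lambda x: overall  # no usable split: constant prediction
    _, j, thr, lm, rm = best
    def predict(x):
        if x[j] is None:
            return overall
        return lm if x[j] <= thr else rm
    return predict

def tree_features(X, labels, c=3, seed=0):
    """Fit c stumps per class (class k as +1, its complement as -1) and
    return a function mapping a sample to the vector of stump outputs."""
    rng = random.Random(seed)
    trees = []
    for k in sorted(set(labels)):
        pos = [x for x, l in zip(X, labels) if l == k]
        neg = [x for x, l in zip(X, labels) if l != k]
        for _ in range(c):
            p = rng.choices(pos, k=len(pos))  # resample class k
            n = rng.choices(neg, k=len(neg))  # resample the complement
            trees.append(fit_stump(p + n, [1.0] * len(p) + [-1.0] * len(n)))
    return lambda x: [t(x) for t in trees]

# Numeric columns of the example table (income in K, children, age);
# the class labels "a"/"b" are hypothetical.
X = [[125.0, None, 41.0], [70.0, None, 22.0], [59.0, 1.0, 55.0],
     [None, None, 23.0], [100.0, 1.0, None]]
labels = ["a", "a", "a", "b", "b"]
transform = tree_features(X, labels)
new_features = [transform(x) for x in X]
```

Each generated feature lies in [-1, 1] and is always defined, so the transformed data no longer contains missing or non-applicable values.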
Preprocessing
Multiclass results comparison with
Original features
Tree based generated features

Performance of 6 majority classes (values in %)
            Original features               Tree-based features
Class ID    Precision   Recall   F1         Precision   Recall   F1
24          77.65       78.30    77.97      86.12       88       87.05
25          63.62       70.02    66.67      79.3        82       80.63
28          99.36       99.70    99.53      100         100      100
48          82.16       73.95    77.84      79.68       77.9     78.78
68          69.05       71.38    70.20      67.7        76       71.61
76          66.58       71.23    68.83      68.45       66.6     67.51
[Figure: bar chart, average performance of 6 majority classes]
Average     76.40       77.43    76.84      80.21       81.75    80.93
Multi-class Classification
Identify individual threats
Identify new classes and provide properties
Classifiers
K-Nearest Neighbor
No training involved
Computationally intensive for testing
Ensemble methods
Fail to scale up to a huge number of classes
Sphere-based SVM
Encapsulate each class in a hypersphere
Transform the data into an appropriate space such that each class clusters into a single cohesive unit
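To illustrate the hypersphere idea, here is a simplified sketch in which each class is enclosed by a sphere around its centroid with radius equal to the farthest training sample. This is not a sphere-based SVM (which learns the sphere by optimization, in a kernel space); it only shows how classification against spheres works. The names and toy points are hypothetical.

```python
from math import dist

def fit_spheres(samples_by_class):
    """One (center, radius) pair per class: centroid plus farthest sample."""
    spheres = {}
    for label, pts in samples_by_class.items():
        center = [sum(col) / len(pts) for col in zip(*pts)]
        radius = max(dist(p, center) for p in pts)
        spheres[label] = (center, radius)
    return spheres

def classify(x, spheres):
    """Return the class whose sphere contains x most deeply, else None."""
    best_label, best_margin = None, None
    # One distance computation per class, regardless of training set size.
    for label, (center, radius) in spheres.items():
        margin = dist(x, center) - radius  # <= 0 means x is inside the sphere
        if margin <= 0 and (best_margin is None or margin < best_margin):
            best_label, best_margin = label, margin
    return best_label  # None: outside every sphere, a candidate new class

spheres = fit_spheres({
    "A": [[0.0, 0.0], [2.0, 0.0]],
    "B": [[10.0, 10.0], [12.0, 10.0]],
})
inside_a = classify([1.0, 0.5], spheres)
inside_b = classify([11.0, 10.0], spheres)
outside = classify([5.0, 5.0], spheres)
```

A point that falls outside every sphere is left unlabeled, which gives a natural hook for flagging previously unseen classes.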
Building Kernel
Let $(X_i, Y_i)$ be the data points, where $Y_i \in \{+1, -1\}$
Construct the ground truth kernel $K$
$K_{ij} = Y_i Y_j$
Now learn a parametric kernel as follows
$K_{ij} = f(X_i, X_j)$

Home owner   Marital status   Annual income   Number of children   Age   Y
-            married          125K            -                    41    +1
No           Not married      70K             N/A                  22    +1
No           -                59K             1                    55    +1
yes          Not married      -               N/A                  23    -1
yes          married          100K            1                    -     -1
-            Married          -               2                    32    -1
$K_{ij} \approx f(X_i, X_j)$
Once $f$ is learned, it can be applied to the test set.
$K = y y^T$
class
     1    2    3    4    5    6
1   +1   +1   +1   -1   -1   -1
2   +1   +1   +1   -1   -1   -1
3   +1   +1   +1   -1   -1   -1
4   -1   -1   -1   +1   +1   +1
5   -1   -1   -1   +1   +1   +1
6   -1   -1   -1   +1   +1   +1
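The ground truth kernel is simply the outer product of the label vector with itself. A minimal sketch, using the six labels from the example table (the function name is chosen for this example):

```python
def ground_truth_kernel(y):
    """K[i][j] = y[i] * y[j]: +1 for same-class pairs, -1 otherwise."""
    return [[yi * yj for yj in y] for yi in y]

y = [+1, +1, +1, -1, -1, -1]  # column Y of the example table
K = ground_truth_kernel(y)
for row in K:
    print(" ".join(f"{v:+d}" for v in row))
```

The result has the block structure of the matrix above: same-class pairs get +1 and cross-class pairs get -1, which is the similarity pattern the learned kernel is meant to approximate.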
Kernel for Multi Class
For each class, do the following
Collect samples belonging to the class and label them +1
Collect samples from the rest of the data and label them -1
Build a separate kernel for each class
class
     1    2    3    4    5    6
1   +1   +1   +1   -1   -1   -1
2   +1   +1   +1   -1   -1   -1
3   +1   +1   +1   -1   -1   -1
4   -1   -1   -1   +1   +1   +1
5   -1   -1   -1   +1   +1   +1
6   -1   -1   -1   +1   +1   +1
$K_{ij} \approx f(X_i, X_j)$
Boosted Trees for Kernel Learning
Output of tree 1: $y_1 = (+1, -1, -1, +1, -1, +1)^T$
Kernel matrix for tree 1
     1    2    3    4    5    6
1   +1   -1   -1   +1   -1   +1
2   -1   +1   +1   -1   +1   -1
3   -1   +1   +1   -1   +1   -1
4   +1   -1   -1   +1   -1   +1
5   -1   +1   +1   -1   +1   -1
6   +1   -1   -1   +1   -1   +1
Ground truth kernel
     1    2    3    4    5    6
1   +1   +1   +1   -1   -1   -1
2   +1   +1   +1   -1   -1   -1
3   +1   +1   +1   -1   -1   -1
4   -1   -1   -1   +1   +1   +1
5   -1   -1   -1   +1   +1   +1
6   -1   -1   -1   +1   +1   +1
Output of tree 2: $y_2$
Kernel matrix for tree 2
     1    2    3    4    5    6
1   +1   -1   +1   +1   -1   +1
2   -1   +1   -1   +1   +1   -1
3   -1   +1   +1   -1   +1   -1
4   +1   -1   -1   +1   -1   +1
5   -1   +1   +1   -1   +1   -1
6   +1   -1   -1   +1   -1   +1
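A sketch of how tree outputs turn into kernel matrices: each tree's output vector gives a rank-one kernel via its outer product, and the kernels of several trees can be added together. The unweighted sum and the sign-agreement score are assumptions made for this illustration (the slides do not give the combination rule), and the tree 1 output vector is read off the first row of the tree 1 kernel matrix.

```python
def outer(y):
    """Rank-one kernel y y^T for one tree's output vector."""
    return [[a * b for b in y] for a in y]

def combine(tree_outputs):
    """Unweighted sum of the per-tree kernels (an assumed combination rule)."""
    n = len(tree_outputs[0])
    K = [[0] * n for _ in range(n)]
    for y_t in tree_outputs:
        K_t = outer(y_t)
        for i in range(n):
            for j in range(n):
                K[i][j] += K_t[i][j]
    return K

def sign_agreement(K, target):
    """Fraction of entries of K that have the same sign as the target kernel."""
    n = len(K)
    hits = sum(1 for i in range(n) for j in range(n) if K[i][j] * target[i][j] > 0)
    return hits / (n * n)

y_true = [+1, +1, +1, -1, -1, -1]   # labels behind the ground truth kernel
y_tree1 = [+1, -1, -1, +1, -1, +1]  # tree 1 output, read off its kernel matrix
target = outer(y_true)
K1 = outer(y_tree1)
agreement_tree1 = sign_agreement(K1, target)
K_sum = combine([y_tree1, y_true])
```

A single weak tree agrees with the ground truth kernel on only part of the sample pairs; adding further trees is meant to move the combined kernel toward the ground truth block structure.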
Multi class Results
Spheres require only K = 6 comparisons (K is the number of classes), whereas KNN requires N comparisons (one per training sample).

Classification + New Class Detection
Build a separate kernel for each class
Find transformation to separate class + from rest of data
Find transformation to separate class x from rest of data
Find transformation to separate class -- from rest of data
Find transformation to separate class ^ from rest of data
New Class Generation
Conclusion
CyberEagle: An enhanced comprehensive security system
Bringing host and network security together to fight security threats
Identify threats that IDS/IPS fail to detect (encrypted, evolved)
Identify new threats at the earliest stage
Generate signatures for the new threats and alert the host security system in an automated way


Future Work
Improve classification accuracy
Scale up to a huge number of classes
Reduce computation during classification
Learn class hierarchy
Increase speed without sacrificing accuracy
Validate with diverse data
Reputation analysis of IP addresses
Online update of the classifier
MapReduce implementations

Thank You
Prakash, Lei, Saby