
Intrusion Detection and Malware Analysis

Malware analysis

Pavel Laskov
Wilhelm Schickard Institute for Computer Science
What do we want to know about malware?

Is it recognized by existing antivirus products?


What is its functionality?
How is malware distributed?
What other harmful functions does malware carry out?
What are the relationships between various classes of malware?
Do they share common techniques? How do they evolve?
How much of malware is unknown?

Experiment:
Current instances of malware were collected from a
Nepenthes honeypot.
Files were scanned with Avira AntiVir.
Results:
First scan:  76% detected, 24% undetected
Second scan: 85% detected, 15% undetected

After four weeks, 15% of malware instances were still not recognized!
Traditional malware analysis tools

Binary disassemblers (IDA Pro, OllyDbg)


(+) precise understanding of malware behavior
(−) extensive manual effort
Static flow analysers (BinDiff, BinNavi)
(+) visual representation of malware program flow
(−) vulnerable to obfuscation
Modern malware analysis tools

Sandboxes
Behavior analysers
Clustering and classification methods
Signature generation tools
CWSandbox: key ideas

API hooking: intercept Windows API calls by overwriting the API functions' code loaded in the process memory.
DLL injection: insertion of hook functions into the running processes.
Recording of persistent system changes collected by a hook in an XML report.

[Figure 1. In-line code overwriting: (a) the original Kernel32.dll CreateFileA function without a hook; (b) a JMP instruction overwrites the API function's first block (1) and transfers control to the hook function whenever the to-be-analyzed application calls the API function; (2) the hook function performs the desired operations and then calls the original API function's saved stub; (3) the saved stub performs the overwritten instructions and branches to the original API function's unmodified part.]
CWSandbox: system architecture

During the initialization of a malware binary, cwmonitor.dll is injected into its memory to carry out API hooking.
The DLL intercepts all API calls and reports them to CWSandbox via a communication channel.
The same procedure is repeated for any child or infected process.

Code injection builds on functions of the Windows API: kernel32.dll offers ReadProcessMemory and WriteProcessMemory to read and write an arbitrary process's virtual memory, and VirtualAllocEx and VirtualProtectEx to allocate new memory regions or change already allocated ones. Code can be executed in another process's context in at least two ways: suspend one of the target application's running threads, copy the to-be-executed code into the target's address space, set the thread's instruction pointer to the copied code's location and resume the thread; or copy the code into the target's address space and create a new thread in the target process with the code location as the start address. The most popular technique is DLL injection: all custom code is put into a DLL and the target process is directed to load this DLL into its memory, so the hook functions end up in the target's address space and the API hooks are installed in the DLL's initialization routine.

The XML report provides data for each process and its associated actions, e.g. one subsection for all accesses to the file system and another for all network operations; particular effort is spent on extracting and evaluating the network connection data for bot analysis. After analyzing an API call's parameters, the sandbox routes the call back to the original API function, so the malware's execution is not blocked.

[Figure 2. CWSandbox overview: cwsandbox.exe creates a new process image for the to-be-analyzed malware binary and injects cwmonitor.dll into the target application's address space; the DLL performs API hooking and sends all observed behavior via the communication channel back to cwsandbox.exe; the same procedure is used for child or infected processes.]
Case study: Malware classification

Goals:
Detect variations of known malware families.
Detect previously unknown families.
Determine essential features of specific families.
Main ideas:
Use AV scanners to assign labels to malware binaries.
Use machine learning to classify binaries among known
malware families.

[Figure: analysis pipeline. Wild --exploits--> Collection --binary--> Monitoring --report--> Classification --label; AV scanner --label--> Learning --model--> Classification]
Malware classification: an overview

Acquisition of training data: creation of balanced data sets for multi-class classification.
Feature extraction: embedding of reports in a feature space.
Training: optimization and model selection.
Classification: combining results of single family classifiers.

[Figure: processing chain from the report corpus through acquisition of training data, feature extraction, training and classification to a label]
Malware classification: data acquisition

Binaries are collected from Nepenthes and from spam-traps, and are analyzed by CWSandbox.
Labels are assigned by running Avira AntiVir; unrecognized binaries are discarded.
14 classes are considered for training: 1 backdoor, 2 trojans and 11 worms.
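A minimal sketch of the label-assignment step (hypothetical helper, not the original tooling): each binary keeps the family label reported by the AV scanner, unrecognized binaries are discarded, and only the most frequent families are retained for training.

from collections import Counter

def build_training_set(scan_results, n_families=14):
    """scan_results: dict mapping binary id -> AV family label, or None if unrecognized."""
    labeled = {b: fam for b, fam in scan_results.items() if fam is not None}
    # keep only the n_families most frequent families (14 in the case study)
    top = {fam for fam, _ in Counter(labeled.values()).most_common(n_families)}
    return {b: fam for b, fam in labeled.items() if fam in top}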
Malware classification: feature extraction

Operational features: the set of all strings contained between the delimiters "<" and ">".
"Wildcarding": removal of potentially random attributes.

Example excerpt from a CWSandbox report:

<copy_file filetype="File" srcfile="c:\1ae8b19ecea1b65705595b245f2971ee.exe"
  dstfile="C:\WINDOWS\system32\urdvxc.exe"
  creationdistribution="CREATE_ALWAYS" desiredaccess="FILE_ANY_ACCESS"
  flags="SECURITY_ANONYMOUS" fileinformationclass="FileBasicInformation"/>
<set_value key="HKEY_CLASSES_ROOT\CLSID\{3534943...2312F5C0&}"
  data="lsslwhxtettntbkr"/>
<create_process commandline="C:\WINDOWS\system32\urdvxc.exe /start"
  targetpid="1396" showwindow="SW_HIDE"
  apifunction="CreateProcessA" successful="1"/>
<create_mutex name="GhostBOT0.58b" owned="1"/>
<connection transportprotocol="TCP" remoteaddr="XXX.XXX.XXX.XXX"
  remoteport="27555" protocol="IRC" connectionestablished="1" socket="1780"/>
<irc_data username="XP-2398" hostname="XP-2398" servername="0"
  realname="ADMINISTRATOR" password="r0flc0mz" nick="[P33-DEU-51371]"/>
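A minimal sketch of the feature-extraction step, assuming a simple regex-based tokenizer and a hand-picked list of attributes to wildcard (neither is the original implementation): every string between "<" and ">" becomes one operational feature, and potentially random attribute values are replaced by "*".

import re

# Attributes whose values are likely random (hashes, PIDs, sockets); assumed list.
RANDOM_ATTRS = ("srcfile", "targetpid", "socket", "data")

def operational_features(report):
    """Extract the set of wildcarded strings between '<' and '>' from an XML report."""
    features = set()
    for token in re.findall(r"<([^>]+)>", report):
        for attr in RANDOM_ATTRS:
            token = re.sub(attr + r'="[^"]*"', attr + '="*"', token)
        features.add(token)
    return features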
Malware classification: training

For each known malware family, train a Support Vector Machine separating this family from the others:

    min_{w,b}  ½ ||w||² + C ∑_{i=1}^{M} ξ_i
    s.t.  y_i ((w · x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., M.

Determine the optimal parameter C by cross-validation.
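A minimal sketch of this step with scikit-learn (not the original toolchain; the feature matrix X and family labels y are assumed as input): one linear SVM per family in a one-vs-rest setup, with C selected by cross-validation.

from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

def train_family_classifiers(X, y):
    """X: embedded behavior reports, y: malware family labels (14 classes)."""
    svm = OneVsRestClassifier(LinearSVC())            # one SVM separating each family from the rest
    grid = {"estimator__C": [0.01, 0.1, 1, 10, 100]}  # candidate regularization values
    search = GridSearchCV(svm, grid, cv=5)            # pick C by 5-fold cross-validation
    return search.fit(X, y)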
Classification of unknown malware binaries

Maximum distance: a label is assigned to a new report based on the highest score among the 14 classifiers:

    d(x) = (w · x) + b

Maximum "likelihood": estimate the conditional probability of class "+1" as:

    P(y = +1 | d(x)) = 1 / (1 + exp(A d(x) + B))

where the parameters A and B are estimated by a logistic regression fit on an independent training data set.
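A minimal sketch (hypothetical function names) of the two decision rules: picking the family with the highest SVM score, and calibrating a score into P(y = +1 | d(x)) with a logistic fit on held-out data; scikit-learn's LogisticRegression fits the same sigmoid up to the sign convention of A and B.

import numpy as np
from sklearn.linear_model import LogisticRegression

def max_distance_label(scores):
    """scores[k] = d_k(x) = (w_k · x) + b_k for each of the family classifiers."""
    return int(np.argmax(scores))

def fit_platt(d_holdout, y_holdout):
    """Fit P(y = +1 | d) = 1 / (1 + exp(A·d + B)) on an independent data set."""
    return LogisticRegression().fit(np.asarray(d_holdout).reshape(-1, 1), y_holdout)

def platt_probability(model, d):
    """Calibrated probability that the report belongs to the classifier's family."""
    return float(model.predict_proba([[d]])[0, 1])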


Results: known malware instances

Test binaries are drawn from the same 14 families recognized by AntiVir.

[Figure: (a) accuracy per malware family; (b) confusion matrix of true vs. predicted malware families]

Average accuracy: 88%.


Results: unknown malware instances

Test binaries are drawn from the same 14 families, recognized by AntiVir four weeks later.

[Figure: (c) accuracy per malware family; (d) confusion matrix of true vs. predicted malware families]

Average accuracy: 69%.


Botzilla: analysis of malware communication

Operational goals:
Generation of signatures for malware communication
Detection of "drop-zones" and compromised machines

[Figure: a malware binary is executed repetitively in a sandnet; the resulting malware traffic, together with regular traffic, is fed into signature generation, producing signatures for the communication between compromised hosts and the bot master over the Internet]
Execution of malware in a sandnet

A sandnet is a monitored network of sandbox machines with various network configurations.
Captured malware binaries are executed multiple times in a
sandnet using various network settings (IP addresses,
operating systems, date and time).
All outgoing communication is captured with tcpdump.
Security precautions:
Redirection of SMTP traffic to a network sink
Blocking of DCOM and SMB requests (known vulnerabilities)
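A minimal sketch of the repeated-execution loop (interface name, sandbox launcher and timeout are assumptions, not the authors' sandnet controller): each run of the binary uses a different network setting and its outgoing traffic is recorded with tcpdump.

import subprocess

NETWORK_CONFIGS = ["10.0.0.5", "192.168.1.7"]   # hypothetical IP settings per run

for run, ip in enumerate(NETWORK_CONFIGS):
    pcap = f"capture_run{run}.pcap"
    # record all traffic of the sandbox machine on the (assumed) interface eth0
    dump = subprocess.Popen(["tcpdump", "-i", "eth0", "-w", pcap])
    try:
        # hypothetical launcher that configures the sandbox IP and runs the sample
        subprocess.run(["./run_in_sandbox.sh", "--ip", ip, "sample.exe"], timeout=300)
    except subprocess.TimeoutExpired:
        pass
    finally:
        dump.terminate()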
Signature generation

[Figure: malware traffic X+ and regular traffic X− are fed into token extraction and signature assembly, yielding a signature]

Natural clustering: multiple executions of the same binary


Extraction of (quasi)-distinct tokens using suffix trees
Signature assembly: false positive check on regular traffic
(real or simulated)
Signature filtering: elimination of redundant signatures from
similar binaries
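A simplified sketch of the token extraction and false-positive check; it uses plain substring search over the payloads of repeated executions instead of the suffix trees mentioned above, and all names and the minimum token length are assumptions.

def candidate_tokens(malware_payloads, min_len=8):
    """Substrings of the first capture that recur in every other capture of the same binary."""
    base = malware_payloads[0]
    tokens = set()
    for i in range(len(base) - min_len + 1):
        t = base[i:i + min_len]
        if all(t in p for p in malware_payloads[1:]):
            tokens.add(t)
    return tokens

def assemble_signature(malware_payloads, regular_payloads):
    """Keep only tokens that never occur in regular traffic (false-positive check)."""
    return {t for t in candidate_tokens(malware_payloads)
            if not any(t in r for r in regular_payloads)}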
Offline evaluation of Botzilla

Procedure:
Signatures are generated for 43 classes of malware collected
from a honeypot.
Test traffic is generated by running the same malware again
and merging it with normal traffic from two sources (DARPA
and University of Erlangen).
Results:
                           DARPA    Erlangen
Average detection rate     0.81     0.72
Average false alarm rate   0        0.51 · 10⁻⁶
Online evaluation of Botzilla

Procedure:
Generated signatures were deployed on a large network at the
University of Erlangen (ca. 50,000 hosts) using a Vermont
network monitoring sensor.
Traffic statistics: ca. 840M flows in 4 days.
Alerts were manually evaluated.
Results:
# malicious flows detected 219
# false alarms 295
Detection rate 0.43
False alarm rate 3.51 · 10⁻⁷
Signature examples

Banload keylogger:

Storm worm:
Lessons learned

Malware analysis significantly facilitates malware understanding and the development of protection mechanisms.
The main technique of malware analysis is execution of
malware in a specially instrumented environment.
Further analysis techniques require extensive machine
learning and string matching instrumentation.
Recommended reading

Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Düssel, and Pavel
Laskov.
Learning and classification of malware behavior.
In Detection of Intrusions and Malware, and Vulnerability Assessment, Proc.
of 5th DIMVA Conference, pages 108–125, 2008.

Konrad Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz, and Pavel
Laskov.
Botzilla: Detecting the "phoning home" of malicious software.
In Proc. of 25th ACM Symposium on Applied Computing (SAC), March
2010.
(to appear).

Carsten Willems, Thorsten Holz, and Felix Freiling.
CWSandbox: Towards automated dynamic binary analysis.
IEEE Security and Privacy, 5(2):32–39, 2007.
