
Intrusion Detection and Malware Analysis

Malware analysis

Pavel Laskov
Wilhelm Schickard Institute for Computer Science
What do we want to know about malware?

Is it recognized by existing antivirus products?


What is its functionality?
How is malware distributed?
What other harmful functions does malware carry out?
What are the relationships between various classes of malware?
Do they share common techniques? How do they evolve?
How much of malware is unknown?

Experiment:
Current instances of malware were collected from a
Nepenthes honeypot.
Files were scanned with Avira AntiVir.
Results:
First scan:  76% detected, 24% undetected
Second scan: 85% detected, 15% undetected

After four weeks, 15% of malware instances were still not recognized!
Traditional malware analysis tools

Binary disassemblers (IDA Pro, OllyDbg)


(+) precise understanding of malware behavior
(−) extensive manual effort
Static flow analysers (BinDiff, BinNavi)
(+) visual representation of malware program flow
(−) vulnerable to obfuscation
Modern malware analysis tools

Sandboxes
Behavior analysers
Clustering and classification methods
Signature generation tools
CWSandbox: key ideas

API hooking: intercept Windows API calls by overwriting the API functions' code loaded in the process memory.
DLL injection: insertion of hook functions into the running processes.
Recording of persistent system changes collected by a hook in an XML report.

[Figure 1. In-line code overwriting: (a) the original Kernel32.dll CreateFileA function without a hook; (b) a JMP instruction overwrites the API function's first block (1) and transfers control to the hook function whenever the to-be-analyzed application calls the API function; (2) the hook function performs the desired operations and then calls the original API function's saved stub; (3) the saved stub performs the overwritten instructions and branches to the original API function's unmodified part.]
CWSandbox: system architecture

During the initialization of a malware binary, cwmonitor.dll is injected into its memory to carry out API hooking.
The DLL intercepts all API calls and reports them to CWSandbox via a communication channel.
The same procedure is repeated for any child or infected process.

Code injection builds on functions of the Windows API: kernel32.dll offers ReadProcessMemory and WriteProcessMemory to read and write an arbitrary process's virtual memory, and VirtualAllocEx and VirtualProtectEx to allocate new memory regions or change already allocated ones. Code can be executed in another process's context in at least two ways: suspend one of the target application's running threads, copy the to-be-executed code into the target's address space, set the thread's instruction pointer to the copied code's location and resume the thread; or copy the code into the target's address space and create a new thread in the target process with the code location as the start address. The most popular technique is DLL injection: all custom code is put into a DLL and the target process is directed to load this DLL into its memory, so the hook functions end up in the target's address space and the API hooks are installed in the DLL's initialization routine.

The XML report provides data for each process and its associated actions, e.g. one subsection for all accesses to the file system and another for all network operations; particular effort is spent on extracting and evaluating the network connection data for bot analysis. After analyzing an API call's parameters, the sandbox routes the call back to the original API function, so the malware's execution is not blocked.

[Figure 2. CWSandbox overview: cwsandbox.exe creates a new process image for the to-be-analyzed malware binary and injects cwmonitor.dll into the target application's address space; the DLL performs API hooking and sends all observed behavior via the communication channel back to cwsandbox.exe; the same procedure is used for child or infected processes.]
Case study: Malware classification

Goals:
Detect variations of known malware families.
Detect previously unknown families.
Determine essential features of specific families.
Main ideas:
Use AV scanners to assign labels to malware binaries.
Use machine learning to classify binaries among known
malware families.

[Figure: analysis pipeline. Wild --exploits--> Collection --binary--> Monitoring --report--> Classification --label; AV scanner --label--> Learning --model--> Classification]
Malware classification: an overview

Acquisition of training data: creation of balanced data sets for multi-class classification.
Feature extraction: embedding of reports in a feature space.
Training: optimization and model selection.
Classification: combining results of single family classifiers.

[Figure: processing chain from the report corpus through acquisition of training data, feature extraction, training and classification to a label]
Malware classification: data acquisition

Binaries are collected from Nepenthes and from spam-traps, and are analyzed by CWSandbox.
Labels are assigned by running Avira AntiVir; unrecognized binaries are discarded.
14 classes are considered for training: 1 backdoor, 2 trojans and 11 worms.
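A minimal sketch of the label-assignment step (hypothetical helper, not the original tooling): each binary keeps the family label reported by the AV scanner, unrecognized binaries are discarded, and only the most frequent families are retained for training.

from collections import Counter

def build_training_set(scan_results, n_families=14):
    """scan_results: dict mapping binary id -> AV family label, or None if unrecognized."""
    labeled = {b: fam for b, fam in scan_results.items() if fam is not None}
    # keep only the n_families most frequent families (14 in the case study)
    top = {fam for fam, _ in Counter(labeled.values()).most_common(n_families)}
    return {b: fam for b, fam in labeled.items() if fam in top}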
Malware classification: feature extraction

Operational features: the set of all strings contained between the delimiters "<" and ">".
"Wildcarding": removal of potentially random attributes.

Example excerpt from a CWSandbox report:

<copy_file filetype="File" srcfile="c:\1ae8b19ecea1b65705595b245f2971ee.exe"
  dstfile="C:\WINDOWS\system32\urdvxc.exe"
  creationdistribution="CREATE_ALWAYS" desiredaccess="FILE_ANY_ACCESS"
  flags="SECURITY_ANONYMOUS" fileinformationclass="FileBasicInformation"/>
<set_value key="HKEY_CLASSES_ROOT\CLSID\{3534943...2312F5C0&}"
  data="lsslwhxtettntbkr"/>
<create_process commandline="C:\WINDOWS\system32\urdvxc.exe /start"
  targetpid="1396" showwindow="SW_HIDE"
  apifunction="CreateProcessA" successful="1"/>
<create_mutex name="GhostBOT0.58b" owned="1"/>
<connection transportprotocol="TCP" remoteaddr="XXX.XXX.XXX.XXX"
  remoteport="27555" protocol="IRC" connectionestablished="1" socket="1780"/>
<irc_data username="XP-2398" hostname="XP-2398" servername="0"
  realname="ADMINISTRATOR" password="r0flc0mz" nick="[P33-DEU-51371]"/>
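A minimal sketch of the feature-extraction step, assuming a simple regex-based tokenizer and a hand-picked list of attributes to wildcard (neither is the original implementation): every string between "<" and ">" becomes one operational feature, and potentially random attribute values are replaced by "*".

import re

# Attributes whose values are likely random (hashes, PIDs, sockets); assumed list.
RANDOM_ATTRS = ("srcfile", "targetpid", "socket", "data")

def operational_features(report):
    """Extract the set of wildcarded strings between '<' and '>' from an XML report."""
    features = set()
    for token in re.findall(r"<([^>]+)>", report):
        for attr in RANDOM_ATTRS:
            token = re.sub(attr + r'="[^"]*"', attr + '="*"', token)
        features.add(token)
    return features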
Malware classification: training

For each known malware family, train a Support Vector Machine separating this family from the others:

    min_{w,b}  ½ ||w||² + C ∑_{i=1}^{M} ξ_i
    s.t.  y_i ((w · x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., M.

Determine the optimal parameter C by cross-validation.
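A minimal sketch of this step with scikit-learn (not the original toolchain; the feature matrix X and family labels y are assumed as input): one linear SVM per family in a one-vs-rest setup, with C selected by cross-validation.

from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

def train_family_classifiers(X, y):
    """X: embedded behavior reports, y: malware family labels (14 classes)."""
    svm = OneVsRestClassifier(LinearSVC())            # one SVM separating each family from the rest
    grid = {"estimator__C": [0.01, 0.1, 1, 10, 100]}  # candidate regularization values
    search = GridSearchCV(svm, grid, cv=5)            # pick C by 5-fold cross-validation
    return search.fit(X, y)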
Classification of unknown malware binaries

Maximum distance: a label is assigned to a new report based on the highest score among the 14 classifiers:

    d(x) = (w · x) + b

Maximum "likelihood": estimate the conditional probability of class "+1" as:

    P(y = +1 | d(x)) = 1 / (1 + exp(A d(x) + B))

where the parameters A and B are estimated by a logistic regression fit on an independent training data set.
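A minimal sketch (hypothetical function names) of the two decision rules: picking the family with the highest SVM score, and calibrating a score into P(y = +1 | d(x)) with a logistic fit on held-out data; scikit-learn's LogisticRegression fits the same sigmoid up to the sign convention of A and B.

import numpy as np
from sklearn.linear_model import LogisticRegression

def max_distance_label(scores):
    """scores[k] = d_k(x) = (w_k · x) + b_k for each of the family classifiers."""
    return int(np.argmax(scores))

def fit_platt(d_holdout, y_holdout):
    """Fit P(y = +1 | d) = 1 / (1 + exp(A·d + B)) on an independent data set."""
    return LogisticRegression().fit(np.asarray(d_holdout).reshape(-1, 1), y_holdout)

def platt_probability(model, d):
    """Calibrated probability that the report belongs to the classifier's family."""
    return float(model.predict_proba([[d]])[0, 1])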


Results: known malware instances

Test binaries are drawn from the same 14 families recognized by AntiVir.

[Figure: (a) accuracy per malware family; (b) confusion matrix of true vs. predicted malware families]

Average accuracy: 88%.


Results: unknown malware instances

Test binaries are drawn from the same 14 families, recognized by AntiVir four weeks later.

[Figure: (c) accuracy per malware family; (d) confusion matrix of true vs. predicted malware families]

Average accuracy: 69%.


Botzilla: analysis of malware communication

Operational goals:
Generation of signatures for malware communication
Detection of "drop-zones" and compromised machines

[Figure: a malware binary is executed repetitively in a sandnet; the resulting malware traffic, together with regular traffic, is fed into signature generation, producing signatures for the communication between compromised hosts and the bot master over the Internet]
Execution of malware in a sandnet

A sandnet is a monitored network of sandbox machines with various network configurations.
Captured malware binaries are executed multiple times in a
sandnet using various network settings (IP addresses,
operating systems, date and time).
All outgoing communication is captured with tcpdump.
Security precautions:
Redirection of SMTP traffic to a network sink
Blocking of DCOM and SMB requests (known vulnerabilities)
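A minimal sketch of the repeated-execution loop (interface name, sandbox launcher and timeout are assumptions, not the authors' sandnet controller): each run of the binary uses a different network setting and its outgoing traffic is recorded with tcpdump.

import subprocess

NETWORK_CONFIGS = ["10.0.0.5", "192.168.1.7"]   # hypothetical IP settings per run

for run, ip in enumerate(NETWORK_CONFIGS):
    pcap = f"capture_run{run}.pcap"
    # record all traffic of the sandbox machine on the (assumed) interface eth0
    dump = subprocess.Popen(["tcpdump", "-i", "eth0", "-w", pcap])
    try:
        # hypothetical launcher that configures the sandbox IP and runs the sample
        subprocess.run(["./run_in_sandbox.sh", "--ip", ip, "sample.exe"], timeout=300)
    except subprocess.TimeoutExpired:
        pass
    finally:
        dump.terminate()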
Signature generation

[Figure: malware traffic X+ and regular traffic X− are fed into token extraction and signature assembly, yielding a signature]

Natural clustering: multiple executions of the same binary


Extraction of (quasi)-distinct tokens using suffix trees
Signature assembly: false positive check on regular traffic
(real or simulated)
Signature filtering: elimination of redundant signatures from
similar binaries
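A simplified sketch of the token extraction and false-positive check; it uses plain substring search over the payloads of repeated executions instead of the suffix trees mentioned above, and all names and the minimum token length are assumptions.

def candidate_tokens(malware_payloads, min_len=8):
    """Substrings of the first capture that recur in every other capture of the same binary."""
    base = malware_payloads[0]
    tokens = set()
    for i in range(len(base) - min_len + 1):
        t = base[i:i + min_len]
        if all(t in p for p in malware_payloads[1:]):
            tokens.add(t)
    return tokens

def assemble_signature(malware_payloads, regular_payloads):
    """Keep only tokens that never occur in regular traffic (false-positive check)."""
    return {t for t in candidate_tokens(malware_payloads)
            if not any(t in r for r in regular_payloads)}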
Offline evaluation of Botzilla

Procedure:
Signatures are generated for 43 classes of malware collected
from a honeypot.
Test traffic is generated by running the same malware again
and merging it with normal traffic from two sources (DARPA
and University of Erlangen).
Results:
                           DARPA    Erlangen
Average detection rate     0.81     0.72
Average false alarm rate   0        0.51 · 10⁻⁶
Online evaluation of Botzilla

Procedure:
Generated signatures were deployed on a large network at the
University of Erlangen (ca. 50,000 hosts) using a Vermont
network monitoring sensor.
Traffic statistics: ca. 840M flows in 4 days.
Alerts were manually evaluated.
Results:
# malicious flows detected 219
# false alarms 295
Detection rate 0.43
False alarm rate 3.51 · 10⁻⁷
Signature examples

Banload keylogger:

Storm worm:
Lessons learned

Malware analysis significantly facilitates malware understanding and the development of protection mechanisms.
The main technique of malware analysis is execution of
malware in a specially instrumented environment.
Further analysis techniques require extensive machine
learning and string matching instrumentation.
Recommended reading

Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Düssel, and Pavel
Laskov.
Learning and classification of malware behavior.
In Detection of Intrusions and Malware, and Vulnerability Assessment, Proc.
of 5th DIMVA Conference, pages 108–125, 2008.

Konrad Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz, and Pavel
Laskov.
Botzilla: Detecting the "phoning home" of malicious software.
In Proc. of 25th ACM Symposium on Applied Computing (SAC), March
2010.
(to appear).

Carsten Willems, Thorsten Holz, and Felix Freiling.
CWSandbox: Towards automated dynamic binary analysis.
IEEE Security and Privacy, 5(2):32–39, 2007.
