Dynamic binary instrumentation

TRACING

Peter Hlavaty aka @zer0mem, specialized software engineer at
ESET
Many kinds of software essentially need a tracer: emulators, dynamic
unpackers and fuzzers, to name a few. Of course a number of tracers
already exist, built on different methods. Some emulate the
instruction set (bochs), some use binary translation (qemu), some
alter the binary and patch its control flow (pin), and some attach a
debugger (paimei, IDA based). And there are also some other options!
WHY TO TRACE ?
I divide the reasons for tracing into 3 groups : Control Flow, Data
Flow and tracking OS interaction.
Control flow tracking helps with understanding a binary at runtime
and with overcoming obfuscation. When it is used in fuzzers, a good
tracer can provide code coverage. In AV software it can be used for
tracing through a binary and building patterns at some point of the
trace, or for dynamically unpacking a binary, etc.
Tracing can be done at various levels : on every instruction, on
basic blocks, or only on some interesting functions. It is commonly
implemented by inserting pre / post instrumentation, a.k.a. patching
the binary's control flow at sensitive places. Another method is
simply attaching a debugger and handling traps or breakpoints. A less
commonly used way is applying CPU features. One interesting feature
is the Intel MSR BTF flag, which allows tracing at basic block level –
on branches :

18.6.5 Single-Stepping on Branches, Exceptions, and Interrupts

When software sets both the BTF flag in the MSR_DEBUGCTLA MSR
and the TF flag in the EFLAGS register, the processor generates a
single-step debug exception the next time it takes a branch, services
an interrupt, or generates an exception.
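
The two bits involved can be written down concretely. The sketch
below only computes register values and does not touch hardware. The
MSR index 0x1D9 (listed as IA32_DEBUGCTL in current manuals) and the
bit positions follow the Intel manual; the helper names are
hypothetical, and actually writing the MSR requires ring0 (WRMSR).

```python
from typing import Tuple

# Bit layout as documented by Intel; helper names are illustrative only.
IA32_DEBUGCTL = 0x1D9      # MSR index of the debug-control register
DEBUGCTL_BTF = 1 << 1      # BTF: single-step on branches
EFLAGS_TF = 1 << 8         # TF: trap flag in EFLAGS / RFLAGS


def enable_branch_step(debugctl: int, eflags: int) -> Tuple[int, int]:
    """Return (debugctl, eflags) with branch single-stepping armed."""
    return debugctl | DEBUGCTL_BTF, eflags | EFLAGS_TF


def disable_branch_step(debugctl: int, eflags: int) -> Tuple[int, int]:
    """Return (debugctl, eflags) with both flags cleared again."""
    return debugctl & ~DEBUGCTL_BTF, eflags & ~EFLAGS_TF


if __name__ == "__main__":
    ctl, fl = enable_branch_step(0x0, 0x202)
    print(hex(ctl), hex(fl))
```

With only TF set the processor traps after every instruction; setting
BTF as well reduces the number of debug exceptions to one per taken
branch, interrupt or exception.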

Data flow tracking is useful for unpacking code and for tracking how
some sensitive data is processed. By monitoring the processing of
various data, object misuse and overflows can be detected, and it
can also be used to save & restore context during tracing. Commonly
it is done by disassembling the whole binary, finding all read &
store instructions, parsing them and resolving the destination
address at runtime. Another way is setting virtual memory protection
through some API and handling the memory access violations. A less
commonly used way is to alter the in-memory tables themselves and
use the Page Tables directly.

OS interaction tracking is valuable for filtering registry access,
monitoring process interaction, file modifications, monitoring
target specific API calls… It is commonly implemented by API hooking -
inserting trampolines, inline hooks, import hooks, setting
breakpoints. Another way is abusing the SYSCALL mechanism. Basically
each API that alters the OS ends up as a wrapper around some SYSCALL.

another path - ENTER RING0
The mentioned features require entering Ring0. In supervisor mode
there are also some features offered by the OS itself : the
LoadNotify, ThreadNotify and ProcessNotify routines. They help with
collecting load & unload information about the targeted process,
like the list of modules, thread stack ranges, child processes, etc.
A second round of features can include a memory dumper by MDL, a
process memory monitor by VAD, a system interaction monitor by
nt!KiSystemCall64, and intercepting memory accesses and traps by IDT.

The VAD tree is an AVL tree used for maintaining information about
the process memory address space, and it is also used when it comes
to initializing the PTE for a particular memory page.
A very interesting article related to the VAD tree is : “The VAD tree:
A process-eye view of physical memory” by Brendan Dolan-Gavitt
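
To illustrate why a search tree fits this job, here is a simplified
model of a VAD-style lookup: each node covers a range of virtual page
numbers and the tree is ordered by range start. The VadNode class and
its fields are hypothetical and do not mirror the real Windows
structure layout; ranges are assumed not to overlap, and AVL
rebalancing is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VadNode:
    start_vpn: int                      # first virtual page number in range
    end_vpn: int                        # last virtual page number (inclusive)
    protection: str                     # e.g. "RW", "RX" (illustrative)
    left: Optional["VadNode"] = None
    right: Optional["VadNode"] = None


def insert(root: Optional[VadNode], node: VadNode) -> VadNode:
    """Plain BST insert keyed by start_vpn (no AVL rebalancing here)."""
    if root is None:
        return node
    if node.start_vpn < root.start_vpn:
        root.left = insert(root.left, node)
    else:
        root.right = insert(root.right, node)
    return root


def find(root: Optional[VadNode], address: int) -> Optional[VadNode]:
    """Return the node whose range contains address, or None."""
    vpn = address >> 12                 # 4 KiB pages
    while root is not None:
        if vpn < root.start_vpn:
            root = root.left
        elif vpn > root.end_vpn:
            root = root.right
        else:
            return root
    return None
```

A memory monitor can use such a lookup to decide whether a faulting
address belongs to a region of the target that it cares about.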



The SYSCALL mechanism is a fast way to switch the CPL from UserMode
to SuperVisor, and this is how a usermode application alters the OS.


As I proposed, monitoring memory access can be done by the memory
protection mechanism, but doing it in user mode through some API is
quite a performance overkill. The key point is that memory protection
is based on the MMU mechanism – Paging. Altering the PageTable in
kernel mode is straightforward; on a memory violation the processor
generates a PageFault exception and control flow is redirected to the
IDT[PageFault] handler. Intercepting the PageFault handler therefore
gives a fast callback on the desired memory access to selected pages.
This works because the processor can only access pages marked as
Valid (paged in memory); otherwise a PageFault exception is
generated, which is already intercepted. That means that when the
Valid flag of a selected memory page is purposely cleared (the page
looks paged out), every access to this memory invokes the PageFault
handler, and this kind of access can be easily filtered and handled
(invoking a callback to the tracer, and setting the Valid flag of the
particular PTE again).
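
The mechanism can be modelled in a few lines. This is a toy
simulation, not kernel code: the FakeMmu class and its method names
are hypothetical, and the only architectural fact used is that bit 0
of an x86 PTE is the Valid (present) bit.

```python
from typing import Callable, Dict, Set

PAGE_SHIFT = 12
PTE_VALID = 1 << 0          # bit 0 of an x86 PTE: present / valid


class FakeMmu:
    """Toy model of one page table plus an intercepted PageFault handler."""

    def __init__(self, on_access: Callable[[int], None]) -> None:
        self.ptes: Dict[int, int] = {}       # virtual page number -> PTE
        self.watched: Set[int] = set()       # pages the tracer monitors
        self.on_access = on_access           # tracer callback

    def map_page(self, vpn: int) -> None:
        self.ptes[vpn] = PTE_VALID

    def watch(self, vpn: int) -> None:
        """Clear the Valid bit on purpose so the next access faults."""
        self.watched.add(vpn)
        self.ptes[vpn] &= ~PTE_VALID

    def page_fault(self, address: int) -> None:
        vpn = address >> PAGE_SHIFT
        if vpn in self.watched:
            self.on_access(address)          # notify the tracer
            self.ptes[vpn] |= PTE_VALID      # let the access proceed
        else:
            raise MemoryError(f"genuine fault at {address:#x}")

    def access(self, address: int) -> None:
        vpn = address >> PAGE_SHIFT
        if not (self.ptes.get(vpn, 0) & PTE_VALID):
            self.page_fault(address)         # the CPU would raise #PF here
```

Note that once the Valid flag is restored, further accesses to the
page pass silently. To keep monitoring, the page has to be re-armed
by calling watch() again; when to do that (for example after a
single step over the faulting instruction) is an assumption of this
sketch and is not prescribed by the text above.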


go deeper - ENTER VMM!
In the previous section I proposed some dirty methods in kernel mode.
Hooking is not the correct way: besides the fact that I don’t like
it, Microsoft doesn’t like it either, and they introduced PatchGuard
to mitigate such actions. Fortunately there is another way to
intercept PageFaults, Traps or SYSCALLs! But being a hypervisor
comes with some cons and at the same time with pros as well.
Cons : Not just a single app is virtualized, but the whole system –
the CPU core. The VMMExit switch comes with a performance impact, as
does the hypervisor code executed per VMMExit.
On the other side, pros : You are more privileged than the
supervisor, and of course you get the set of callbacks offered by
the virtualization technology.
The VMM can be minimalistic – a micro VMM that implements just the
necessary handling with minimal code :
https://github.com/zer0mem/ShowMeYourGongFu/tree/x64userland/src/HyperVisor
Some of the callbacks offered by Intel VT-x :

Instead of hooking the IDT for traps, they can be handled directly
through the debug exception in the VMM. The same holds for
intercepting page faults, either through the PageFault exception in
the VMM or by implementing EPT.
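
In VT-x the selection of intercepted exceptions is expressed through
the exception bitmap of the VMCS: a set bit n requests a VM exit for
exception vector n. The sketch below only builds that value; the
function names are hypothetical, and writing the VMCS field itself is
outside its scope. For page faults the processor additionally
consults the page-fault error-code mask and match fields, which are
not modelled here.

```python
# Exception vectors as defined by the x86 architecture.
VECTOR_DB = 1    # debug exception (single step, BTF, DRx)
VECTOR_BP = 3    # breakpoint (INT3, opcode 0xCC)
VECTOR_PF = 14   # page fault


def exception_bitmap(*vectors: int) -> int:
    """Build a 32-bit exception bitmap: bit n set -> VM exit on vector n."""
    bitmap = 0
    for vector in vectors:
        if not 0 <= vector < 32:
            raise ValueError(f"vector out of range: {vector}")
        bitmap |= 1 << vector
    return bitmap


def causes_vm_exit(bitmap: int, vector: int) -> bool:
    """True if the given exception vector is intercepted by the bitmap."""
    return bool((bitmap >> vector) & 1)


if __name__ == "__main__":
    bitmap = exception_bitmap(VECTOR_DB, VECTOR_BP, VECTOR_PF)
    print(hex(bitmap))
```

Intercepting only these three vectors keeps the number of VM exits,
and with it the performance cost named above, limited to the events a
tracer actually cares about.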

RESULTS
Implementing this approach comes with some features :
- Target is almost untouched
  o For tracing (single step / branch trace) the TRAP flag is
    inserted
  o Address breakpoints by 0xCC, or usage of DRx
  o Monitoring memory by altering process PageTables
  o No patching of the binary
- It can be used as a module for tracing from another app
- Multiple applications can be traced simultaneously
- Multiple threads per application can be traced
- Implementation of fast calls for switching CPL
Separating the tracer from the target process space into another
process comes with nice pros : it can be used as a separate module
and can have bindings for python, ruby, … But at the same time it
comes with an unpleasant drawback – a heavy performance impact
(inter-process communication : reading from another process's
memory, waiting on event mechanisms … ).
To speed up tracing, it is necessary to move the logic into the
target address space for fast access to target resources (memory /
stack / register context), and optionally also to drop the VMM
because of the performance impact of the VMMExit switch, introducing
Trap and PageFault IDT hooks instead. On the other hand,
virtualization technology in future processors will probably have a
smaller performance impact. Virtualization itself can also be used
in tracing on a bigger scale than I proposed here, so other pros can
balance the performance cost.

HIDDEN FEATURE - DBI for kernel code
Switching to a kernel code tracer is quite straightforward, and the
same principles remain:
- Tracing by TRAP
- Memory monitoring by altering PageTables
- Tracer callbacks delivered to usermode app
- No patching binary of Target!
The main feature is that there is no need to patch the binary
itself, together with the ability to trace (fuzz, unpack ..) from
usermode (e.g. from a python based tracer). Of course, for better
performance it is possible to trace directly from kernel mode as
well.



On the other side, these features come with responsibilities :
- Address space of driver is not its own!
o In-memory fuzzing is not so easy
- Bad manipulation of RIP, regs, memory .. ends very badly
o You have to know what you are fuzzing or tracing!
- Various IRQLs have to be kept in mind during the whole
tracing!
- Exception handling

But “Simplicity is the key to brilliance” – Bruce Lee
Separation from the target, together with encapsulation as a module,
brings high scalability and allows cooperation with other modules in
a more complex tool. Highlights from the Python arsenal:
- IDA python – binary detailed info
- LLVM bindings for python
- Dbghelp for symbols
- Disassemblers – capstone engine, bea engine
- Many many others – python arsenal


This snapshot of python code is responsible for watching over the 3
kinds of access (RWE) to selected memory:

And this piece of code steps the application at branch level and
skips processing of instructions outside the main module:
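
The filtering idea on its own can be sketched independently of the
DbiFuzz bindings. In the block below the branch trace is a plain
list of (source, destination) address pairs, and the function names,
module base and module size are made-up values for illustration; a
real tracer would receive these events from its branch-step callback.

```python
from typing import Iterable, Iterator, Tuple

Branch = Tuple[int, int]            # (source address, destination address)


def in_module(address: int, base: int, size: int) -> bool:
    """True if address lies inside the half-open range [base, base+size)."""
    return base <= address < base + size


def main_module_branches(
    trace: Iterable[Branch], base: int, size: int
) -> Iterator[Branch]:
    """Yield only branch events whose source lies inside the main module."""
    for src, dst in trace:
        if in_module(src, base, size):
            yield src, dst


if __name__ == "__main__":
    BASE, SIZE = 0x400000, 0x10000          # hypothetical main module
    trace = [
        (0x401000, 0x401020),               # inside the main module
        (0x401024, 0x7FF00000),             # call out to a library
        (0x7FF00010, 0x7FF00100),           # library code: skipped
        (0x7FF00104, 0x401029),             # return from library: skipped
        (0x401030, 0x401000),               # back in the main module
    ]
    for src, dst in main_module_branches(trace, BASE, SIZE):
        print(f"{src:#x} -> {dst:#x}")
```

Filtering on the source address keeps calls out of the module
visible while dropping everything executed inside foreign modules.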

DbiFuzz
The DbiFuzz framework shows how to trace a binary in another manner.
Some of the known tools use instrumentation, which is a fast
solution, but on the other side it is invasive and does not keep the
integrity of the targeted binary itself. DbiFuzz keeps the target
almost untouched, just altering the PTE and BTF and inserting the
Trap flag. The other side of this approach is that every interesting
event invokes an interrupt : a ring3 – ring0 – ring3 gate. But the
DbiFuzz approach also means straightforward altering of the target
context and control flow. Due to this it is easy to write tools
using DbiFuzz (even in python!), and they have an authentic view of
and access to the target binary and its resources.
When is it showtime ?
There are different reasons for tracing, and due to this, different
dbi approaches can be useful per case. The DbiFuzz framework can
come into use for example :
- On the fly needs to trace code
- Unpacking of a binary, tracing through the malware envelope
- Monitoring the processing of sensitive data
- In-memory fuzzing, easy to monitor & alter flow
- Use in some different tools, not necessarily written in C
It is no problem to turn on DbiFuzz on the fly: just set up the Trap
or INT3 hook and tracing can start anytime! Not touching the binary
code of the target itself means no problem with integrity checks,
and the trap flag can be substituted by MTF. Monitoring of sensitive
data is no problem either: just set the appropriate PTE and your
monitor is ready! Python/Ruby/.. tools ? Just implement bindings and
go!
This framework has its own drawbacks but at the same time comes with
some handy features. It is up to you to play with the DbiFuzz idea
(PoC), adapt your tools to your needs, and trace everything you
want :)
CODE
The PoC implementation is hosted at github as the DbiFuzz framework,
and it is open source:
https://github.com/zer0mem/ShowMeYourGongFu/tree/x64userland
It is just a demonstration of the described idea, implemented for
usermode targets, and so far tested only on win8 CP. The project
also implements some features which can help someone in further
kernel development (VAD walker, PageTable walker, AutoLoks, msdn
containers, micro VMM). Sooner or later I will start focusing on
switching the project to a kernel code tracer. Meanwhile feel free
to use the sources, improve the idea and experiment with new
ideas …
Some materials related to this project can be found at :
- http://2013.zeronights.org/program#hlavaty
- http://www.slideshare.net/PeterHlavaty/dbifuzz-framework-zeronights-e0x03-slides
- http://www.zer0mem.sk/



REFS
Blogs :
http://www.ivanlef0u.tuxfamily.org/?p=120
http://www.openrce.org/blog/view/535/Branch_Tracing_with_Intel_MSR_Registers
http://gynvael.coldwind.pl/?id=148

intel :
http://download.intel.com/products/processor/manual/326019.pdf
http://download.intel.com/products/processor/manual/253669.pdf
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

VAD related :
http://www.dfrws.org/2007/proceedings/p62-dolan-gavitt.pdf
http://pc.fk0.name/pub/books/windows/insideW2k/html/ch07h.htm
http://technet.microsoft.com/en-us/sysinternals/bb963901.aspx
https://www.reactos.org/

virtualization :
http://linux.linti.unlp.edu.ar/images/f/f1/Vtx.pdf
http://fdbg.x86asm.net/hdbg/hdbg.html
http://code.google.com/p/hyperdbg/
http://www.blackhat.com/presentations/bh-usa-06/BH-US-06-Rutkowska.pdf

python modules [disassemblers] :
http://beatrix2004.free.fr/BeaEngine/index1.php
http://www.capstone-engine.org/