
2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications

Wire – A Formal Intermediate Language for Binary Analysis

Silvio Cesare and Yang Xiang


School of Information Technology
Deakin University
Burwood, Victoria 3125, Australia
{scesare, yang}@deakin.edu.au

Abstract: Wire is an intermediate language that enables static program analysis on low level objects such as native executables. It has practical benefit in analysing the structure and semantics of malware, or for identifying software defects in closed source software. In this paper we describe how an executable program is disassembled and translated to the Wire intermediate language. We define the formal syntax and operational semantics of Wire and discuss our justifications for its language features. We use Wire in our previous work Malwise, a malware variant detection system. We also examine the applications that a formally defined intermediate language makes possible. Our results include showing the semantic equivalence between obfuscated and non-obfuscated code samples. These examples stem from the obfuscations commonly used by malware.

Keywords: Binary analysis, intermediate language, semantics.

I. INTRODUCTION
Static program analysis is a useful tool with many applications. In summary, static analysis identifies the runtime behaviour of software. It does this analysis statically, meaning that the program is not executed. Applications of static analysis include detecting plagiarism of software code, optimising code during compilation, verifying software by proving the absence of certain bug classes, or, in a weakened form, identifying software bugs. Static analysis is generally performed at the source level, but applications exist when we only have access to low level object code. The applications of low level static analysis include the analysis and detection of malware, detecting the theft of proprietary or licensed software, and detecting bugs in binaries which are the result of compilation or link-time conditions.

Malware analysis and detection is a large motivation for low level static analysis. Traditional static malware detection employed in commercial Antivirus has ignored program structure and semantics. Instead, pattern recognition on the raw byte-level content has been the dominant technique in signature based detection. However, program structure, such as that exhibited by the static control and data flow of the malware, yields more robust and predictive characteristics. These characteristics or fingerprints are often invariant in large malware families and strains. Thus, by employing static analysis techniques, signature based detection becomes much more robust at detecting variants such as polymorphic and metamorphic malware. Moreover, the use of program structure and semantics to extract robust features allows machine learning to detect novel samples of malware that we can predict as being malicious, but that do not belong to known families of malicious software. Malware is almost always in binary form, so a low level static analysis system that examines the binary content of executables is required.

Software theft detection is another motivation for low level static analysis. Detecting unauthorized use of software code is desirable to protect industry investment. Similar to the malware variant detection problem, software theft detection extracts program structure and semantics and identifies unauthorized software copies by finding those same features in illegitimate software. It is necessary then to be able to examine closed source software by using low level static analysis.

A further motivation is detecting the presence of software bugs in binaries. The purpose of this form of bug detection is not to replace traditional source level analysis, but to complement it by providing an increased level of assurance. Source level analysis by definition examines an unfinished form of the software, one that lacks detail of how the program will be physically executed after assembly and linking. Bug detection in binaries by nature has access to the final form of the program, where assembling and link time editing have been performed. This also provides additional assurance that the compiler has done what it was designed to do. This type of assessment is not only useful for development and quality assurance; it is also beneficial to system auditors who by requirements do not have access to software source.

Analysing binaries is hard. Many simple problems such as separating code from data are undecidable. Our first motivation stems from the desire to represent a binary in a manner that makes analysis easier. The native assembly in a binary is unfavourable for analysis. The reasons that native assembly is difficult to use are:
• Native CISC assemblies such as x86 have hundreds of instructions, which requires significant and duplicated effort to model for each class of static analysis.
• Native assemblies have instructions with side effects, which make analyses depend on hidden information and assumptions.

978-0-7695-4745-9/12 $26.00 © 2012 IEEE
DOI 10.1109/TrustCom.2012.301
• Native assemblies are platform dependent, which requires separate static analysis implementations for each architecture.
This motivates us to use an intermediate language to represent native assembly. The intermediate language should be low level enough so that translation from assembly is not complex. It should also be high level enough so that traditional static analysis techniques can be applied.

We have implemented Wire and use it as the intermediate representation in performing static analysis on binaries and to detect malware variants in our previous research system Malwise [1-3]. This paper represents our first formal description of the intermediate language we have implemented.

A. Innovation
The contributions of our paper are as follows:
• We propose a new low level intermediate language and define its formal operational semantics.
• We propose methods to translate native assembly into our intermediate language.
• We propose applications of a formally defined intermediate language and demonstrate that operational semantics can be used to show equivalence between metamorphic malware codes.
• We use our language as the basis for Malwise, a malware variant detection system.

B. Paper Outline
The structure of this paper is as follows: Section II discusses related work in low level static analysis and some existing frameworks. Section III describes how low level objects and code are translated to our intermediate language. Section IV defines the formal syntax and semantics of Wire. Section V examines applications of our formal language. Finally, Section VI concludes the paper.

II. RELATED WORK
Static analysis has a long history when applied at the source level. In compiler theory [4], intermediate representations or languages are used when code is generated from an abstract syntax tree after parsing. A common low level intermediate language is three address code, which is defined by a 4-tuple consisting of an opcode and three operands. A compiler uses the intermediate code to perform platform independent optimisations.

Static analysis on binaries does not have as long a history as source level analysis, but there are several areas of related work. Binary translation [5] is one such area which has employed static analysis. Retargetable binary translators translate machine code and assembly language into a platform independent intermediate language. Dynamic binary instrumentation [6, 7] is related to dynamic binary translation, and Valgrind [8] is an example tool which employs the intermediate language Vex. Vex has been used in other systems including the BitBlaze project [9], which specifically focuses on binary analysis for security related applications.

Decompilation can be modelled in a similar manner to a compiler, and systems such as DCC [10] and Boomerang [11] have employed intermediate languages and static analysis. The commercial Interactive Disassembler Pro (IDA Pro) [12] and its decompiler plugin HexRays provide intermediate representations of binaries, but do not provide a monotone framework or exposure to their intermediate language.

Vine and BIL are the intermediate languages and static analysis framework of the BitBlaze project. These systems are closely related to our work. However, our intermediate language contains some higher level constructs such as function calls, function arguments and dynamic memory allocation. A focus of our work is to narrow the gap between traditional analyses and binary analyses, so we have incorporated these instructions in our language. The instructions can only be successfully constructed using decompilation techniques.

The Reverse Engineering Intermediate Language (REIL) [13] is a commercial offering in a similar vein to Vine and BIL. MonoREIL is a monotone framework built using REIL. REIL is a very low level language and lacks a number of high level instructions that our language supports.

III. TRANSLATING NATIVE CODE
The input to our system is an object file. The most typical case is an x86 binary. For Windows this is a Portable Executable (PE) object, and under Linux an Executable and Linkable Format (ELF) object. The system can also partially process Java class files, and C source code for the GNU compiler (GCC). Our system is designed as modular software that allows plugin extensions to inspect or modify the object file or the results of a static analysis. An XML configuration file determines which plugins will be loaded, the order in which they are processed, and at which stage of object file processing and static analysis they will be called.

The first stage is object file parsing. PE and ELF binaries contain information on how to access the object code and the dynamic linking information such as imported and exported functions. The object code is extracted and processed. For x86 binaries, a disassembly is performed. Our system can also translate C source code; however, this is not the focus of this paper.

The native representation contains instruction level information. These native instructions are translated to an intermediate language. All further static analyses operate on the intermediate language, which by its construction is easier to analyse. Our implementation consists of 10,000 lines of C++ code for the disassembly to be translated to the intermediate language.

A. Disassembly
Disassembly is the process of translating machine code to assembly language [14]. This is the first stage of a static analysis. Static disassembly parses the entire binary image statically without execution. In static disassembly, there are two main algorithms. In the Linear Sweep algorithm, the instructions are disassembled one instruction after another,

starting from the beginning of code. The disadvantage of this method is that data introduced into the instruction stream may be erroneously disassembled.

The other main algorithm to perform disassembly is the Recursive Traversal algorithm. This algorithm decodes each instruction following the order of the control flow. This resolves the issue of embedded data, but may miss decoding instructions that are the target of indirect jumps, or other situations in which it is hard to resolve control flow statically.

Speculative Disassembly attempts to remedy the problems of the Recursive Traversal algorithm by first performing the Recursive Traversal, and then performing a Linear Sweep in regions that are not decoded. We employ speculative disassembly in our framework.

Let A be the set of addresses of a machine. A native instruction in an executable is located in memory and is represented by an ordered pair of its address and the instruction. A disassembly is the set of such ordered pairs.
\[ \mathit{disassembly} = \{(a, \mathit{instruction})\},\quad a \in A \]

Execution transfers from one instruction to another and is identified using speculative disassembly in Wire. There are two types of control transfers. The first type is when execution transfers from one instruction to the subsequent or fall through instruction without following a branch or a call. The second type is when a branch or call is taken.
\[ \mathit{fallthrough} = \{(a, b) \mid a, b \in A\} \]
\[ \mathit{branch} = \{(a, b) \mid a, b \in A\} \]

B. Abstract Machines
The intermediate language used for the intermediate code runs on an abstract machine that has a correspondence to the actual machine. Typical models of computation for the abstract machine are register machines or random access machines. In Wire we use a register machine which has the following components:
• An unlimited number of uniquely labelled registers (in practice this number is limited by a 32 bit representation).
• A small number of instructions roughly divided into arithmetic and control.
• An instruction pointer.
• A sequence of labelled instructions.
• A random access memory.
• An entry point.

C. Intermediate Code Generation
A simple approach to transforming assembly into an intermediate language is to translate each instruction without maintaining intermediate state. This approach has been used successfully in the Reverse Engineering Intermediate Language (REIL) [13]. We also use this approach, and in our framework we translate native assembly into three address code. This part of our system is not formally verified and we assume the translation is correct. The generated three address code is a list of ordered intermediate instructions.
\[ \mathit{intermediate\ code} \to n,\ (\mathit{opcode}, \mathit{operand}_1, \mathit{operand}_2, \mathit{operand}_3),\quad n \in \mathbb{N} \]

D. Register Mapping between Native Architectures and Wire
Wire assigns register labels using a 32 bit number. Wire's registers overlap the native registers for the x86 architecture. That is, the 8 x86 registers numbered 0 to 7 in the native disassembly are reserved and map to the first 8 registers of the Wire intermediate language.

E. Label Generation
Native assembly memory addresses are not used in the intermediate language. Nor do all instructions have a memory location. Instead, a label is assigned at the beginning of a basic block. The labels contain an address to identify the location of a basic block. We make two passes over the assembly to generate label addresses. In the first pass, all branch targets are identified, and then a Wire label address is assigned to each native address. Finally, the native addresses are eliminated and labels are used to replace them.
[Equations not recoverable from the extracted text: the set of native addresses that receive labels, defined from the branch and fallthrough relations, and the mapping from those addresses to Wire labels.]

Like the execution flow in disassembly, labelled basic blocks in the intermediate language have an execution flow.
\[ \mathit{fallthrough} = \{(l_1, l_2) \mid l_1, l_2 \in \mathit{labels}\} \]
\[ \mathit{branch} = \{(l_1, l_2) \mid l_1, l_2 \in \mathit{labels}\} \]

F. Condition Code Generation
Condition codes represent arithmetic conditions. For example, an arithmetic instruction performing an assignment may store the fact that the operand is zero. In x86 assembly, arithmetic instructions such as subtraction also store information on inequalities such as one operand being less than, greater than, or equal to the other. In Wire, each possible condition is stored in a separate register. That is, there is a register storing equality, another storing less than, another storing zero status, and so on. Each arithmetic instruction sets these registers based on the operands of the instruction. These registers are set using Wire's mkbool instructions, which assign a register a Boolean value (a numeric 1 or 0) based on an inequality and its parameters.

G. Decompilation
Native instructions are translated into the Wire intermediate language, but after construction, the intermediate code is analysed to generate additional or replacement code. For example, Wire uses the PUSHARG instruction to give procedure calls arguments; however, generating this information requires decompilation. Decompilation is used for the following components:
• Local variable reconstruction
• Procedure argument reconstruction
• Condition code elimination
The use of decompilation to generate IL instructions enables high level static analysis to be employed. Traditional

source level analyses such as bug detection can use the decompiled results. This feature distinguishes Wire from most other intermediate languages for reverse engineering, except those specifically used for decompilation.

Local variable reconstruction transforms stack based memory access into much simpler register based variables.

Procedure argument reconstruction extends the stack based memory analyses to identify arguments which are on the stack at call sites. This is done by reconstructing what the stack looks like at a call site and unwinding values from it.

Condition code elimination transforms the explicit use of condition codes and a branch on a condition code into a simpler branch on condition. The approach is to look at the reaching definition of the condition code at a branch on condition code, then to propagate the definition and transform the branch into a branch on condition.

H. Intermediate Code Optimisation
The generation of the intermediate language produces very verbose and inefficient code. We transform this into simpler code by using compiler style optimisations. The optimisations we employ are:
• Dead code elimination
• Constant propagation
• Constant folding
• Copy propagation
Dead code elimination, or more correctly dead store elimination, removes stores which are never subsequently read before they are redefined. Constant propagation and constant folding simplify expressions and assignments using constants, such that their result is calculated when possible during the optimisation pass. Copy propagation eliminates extraneous copies/assignments that are often used as temporary placeholders for further expressions.

IV. FORMAL SYNTAX AND SEMANTICS
In this section we define our intermediate language's syntax, the abstract machine it runs on, and its operational semantics. We believe formally defining Wire is important because it allows formal reasoning about the assembly language it represents. One application that becomes possible is the ability to prove semantic equivalence between two different syntactical representations. The problem of semantic equivalence is central to the problem of metamorphic malware detection. We give a detailed description of the Wire language to make these proofs possible and also to give insight into the language features required to represent assembly language.

A. Syntax
Program       p ::= p i | i

Instruction   i ::= m | m t

Type          t ::= u8_t
                  | u16_t
                  | u32_t
                  | s8_t
                  | s16_t
                  | s32_t

Instructions  m ::= *(r3) := r1
                  | r3 := (*r1)
                  | r3 := r1
                  | r3 := n
                  | r3 := uop r1
                  | r3 := r1 bop r2
                  | r3 := r1 bop n
                  | mkbool r1 ucond
                  | mkbool r1 bcond r2
                  | nop
                  | halt
                  | label l
                  | jmp l
                  | ijmp r
                  | if r1 cond1 jmp l
                  | if r1 cond2 r2 jmp l
                  | lcall s
                  | cast(r1, t)
                  | r3 := getpc()
                  | r3 := returnaddress()
                  | pusharg(n, r)
                  | r3 := malloc(r)
                  | free(r)
                  | r3 := alloca(r)

Operations    uop ::= -|~|!
              bop ::= +,-,*,/,%,>>,<<,|,&,^
Conditions    ucond ::= == 0 | != 0
              bcond ::= == | != | > | >= | < | <=
Operands      v ::= n (an integer literal)
                    r (a register)
                    l (a label)
                    s (a symbol)

B. Functions
Instructions   I ::= n → i
Heap           H ::= n × n → n
Memory         M ::= n → n
Register       R ::= r → n
Labels         L ::= l → pc
AllocAMemory   V ::= n × n → n

Instructions: (maps instruction number to instruction)
Heap: (maps heap address and memory size to non-overlapping memory addresses)
Register: (maps register name to numeric value)
Memory: (maps address to numeric value)
Labels: (maps label to instruction address pc)
AllocAMemory: (maps alloca address and memory size to non-overlapping memory addresses)

Note that we assign each instruction a unique program counter address that is used internally to describe the semantics.

C. Abstract Machine State
Call Stack C ::= stack of (l,pc,A,V)

Argument Stack A ::= stack of (n,r) The CJMP-T instruction implements a conditional branch
Process State P ::= (I,L,H,M,C,A,V,pc) on a true condition to a branch target specified by a label.
There are a number of possible conditions including less
CallStack: Where l is the current function label, pc is the than, greater than, less than or equal to and so forth.
return address, A is the argument stack for function l, and V ( 1) → 1
is the alloca memory mappings for function l. ( 2) → 2
ArgumentStack: (argument stack for callee of current ( 1, 2) ≠ 0
function) Where n is the argument index and r is the register −
( 1 2 , )⇒ [ = + 1]
argument.
The CJMP-F implements a conditional branch on
D. Operational Semantics of Core Instructions condition false.
Operational semantics [15] describe the state transitions ( , )⇒ [ = + 1]
that occur from execution of a program. We follow the
following format: The LABEL instruction specifies a location in the
1 instruction sequence. Wire does not assign individual
. addresses to instructions to specify locations, so whenever an
. instruction is the target of a branch a label must be specified.
. ( , )⇒ [ = + 1]
The NOP instruction implements a no operation.
( , )⇒ ′ 2) Arithmetic Instructions
The arithmetic instructions handle unary and binary
Where i is the current instruction, P is the current state
operations. The binary operation instructions have a version
and P’ is the next state following execution of the instruction
where one of the arguments is a constant.
i.
( 1) → 1
For simplicity, in this section we only show instructions
of a single typing. In practice we have separate instructions 3= 1
for 8, 16, and 32 bit types. ( 3≔ 1, ) ⇒ [ = + 1, [ 3 ↦ 3]]
1) Control Flow Instructions
The control flow instructions handle conditional and ( 1) → 1
unconditional branches. ( 2) → 2
3= 1 2
( )→ ′ ( 3≔ 1 2) ⇒ [ = + 1, [ 3 ↦ 3]]
( , )⇒ [ = ′]
The OP instruction implements the arithmetic
The JMP instruction implements an unconditional instructions. It is a function that takes 3 operands and
branch. It simply changes the program counter to the target modifies those operands as necessary. In practice, the 3rd
of the branch. In the case above, it is a direct branch to a operand is kept as a destination register when possible.
label. ( 2) → 2
( )→ ′ 3= 1 2
( , ) ⇒ [ = ′] ( 3≔ 1 2, ) ⇒ [ = + 1, [ 3 ↦ 3]]
The IJMP instruction also implements an unconditional The OPC instructions implements the OP instruction
branch, but uses register contents as the branch target. except 2 of the operands are registers and the 3rd operand is a
( 1) → 1 constant.
( 1) → 0 3) Boolean Instructions
()→ ′ ( 1) → 1
− ( 1) = 0
( 1 , ) ⇒ [ = ′] −
( 3≔ 1, ) ⇒
( 1) → 1 [ = + 1, [ 3 ↦ 1]]
( 1) ≠ 0
− ( 1) → 1
( 1 , )⇒ [ = + 1]
1( 1) ≠ 0
( 1) → 1 −
( 3≔ 1 1, ) ⇒
( 2) → 2 [ = + 1, [ 3 ↦ 0]]
( 1, 2) → 0
()→ ′

( 1 2 , )⇒ [ = ′]

( 1) → 1 ( ′, ′
,
, ′) → ′
( )
( 2) → 2 ( )→ ′
( 1, 2) = 0 ( , ) ⇒ [ = ′, = ′, = ∅]

( 3≔ 1 2 2, ) ⇒
[ = + 1, [ 3 ↦ 1]] The RETURN instruction implements a return from a
procedure. The return address is stored at the top of the call
( 1) → 1 stack. The memory allocated by ALLOCA instructions
2( 1, 2) ≠ 0 becomes freed after a return. Likewise, the argument stack is
− emptied.
( 3≔ 1 2, ) ⇒
[ = + 1, [ 3 ↦ 0]] E. Operational Semantics of Decompiled Instructions
A number of instructions in Wire are only generated after
4) Transfer Instructions a stage that decompiles the specified object file.
The transfer instructions handle assignments of either 1) Address Instructions
registers or constants. ( 3≔ (), ) ⇒ [ = + 1, [ 3 ↦ ]]
( 1) → 1
The GETPC instructions returns the address of the
( 3 ≔ 1, ) ⇒ [ = + 1, [ 3 ↦ 1]] current instruction in the binary being analysed.
( 3 ≔ 1, ) ⇒ [ = + 1, [ 3 ↦ 1]] ( ′ , ′ , ′ , ′) → ( )
( 3≔ , )⇒
5) Memory Access Instructions [ = + 1, [ 3 ↦ ′ ]]
The memory access instructions handle reading and
writing to memory. The RETURNADDRESS returns the return address of
the current procedure.
( 1) → 1 2) Memory Allocation Instructions
( 1) → 2 [ℎ, ℎ + 1) ∉ , ℎ ∈ ℕ

( 3: =∗ ( 1), ) ⇒ [ = + 1, [ 3 ↦ 2]] → ∪ [ℎ, ℎ + 1)
′ (ℎ,
1) →
The LOAD instruction implements a memory read.
( 1) → 1 ( )→
( 3) → 2 ( 3≔ ( 1), ) ⇒
[ = + 1, = ′ , [ 3 ↦ ]]
(∗ 3): = 1, ⇒ [ = + 1, [ 2 ↦ 1]]
The MALLOC instruction implements dynamic memory
The STORE instruction implements a memory write. allocation. It stores the allocation information on the heap
6) Casting Instructions (H).
The CAST instruction is an assignment instruction (ℎ, ) →
between operands of different types. ( )→
( 1) → 1
− [ℎ, ℎ + ) → ′
3= _ ( 1, _ , _ ))
( ( ), ) ⇒ [ = + 1, = ′ ]
( 3: = ( 1, ), ) ⇒ [ = + 1, [ 3 ↦ 3]]
7) Procedural Instructions The FREE instruction frees dynamically allocated
( , )⇒ [ = + 1, = ∅] memory.
( ) → ( ′, ′, ′, )
The LCALL instruction implements an API or library
[ , + 1) ∉ , ∈ ℕ
call. ′

= ℎ( , ( , + 1, , 0)) → ∪ [ , + 1)
′(
()→ ′ , 1) →
( , ) ⇒ [ = ′ , = ′ , = ∅] ( )→
( ′, ′, ′, ′) → ′
The CALL instruction implements a procedure call ( 3≔ ( 1), ) ⇒
instruction to a label target. The return address (pc+1) is
[ = + 1, = ′ , = ′ , [ 3 ↦ ]]
pushed onto the call stack.
( 3) → ′ The ALLOCA instruction performs dynamic memory
()→ ′ allocation for the current procedure. The memory does not
ℎ , ( , + 1, , 0) → ′ require freeing and will be done so automatically when the
procedure returns.
( 3, ) ⇒ [ = ′ , = ′ , = ∅] 3) Procedural Instructions
The ICALL instruction implements an indirect procedure ℎ ,( , ) → ′
call to a register target. ( ℎ ( , ), ) ⇒ [ = + 1, = ′ ]

The PUSHARG instruction pushes the contents of a signature based detection and classification that is routinely
register onto the argument stack. The argument stack is employed by traditional Antivirus. Metamorphism borrows
passed into the next called procedure. The PUSHARG many of the techniques from the field of program
instructions are generated as a result of decompilation to obfuscation.
identify procedure arguments. 1) Dead Code Insertion
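As an illustration only, the transition rules of Sections D and E can be read as a small-step interpreter. The sketch below is not the implementation described in this paper. It assumes a state reduced to a program counter and a register map, treats unread registers as zero, and wraps values to 32 bits. The names State, step and run are hypothetical, as are the JMP, LABEL and NOP mnemonics; ASSIGNC, BOPADD, BOPCADD and BOPCSUB follow the three address mnemonics used in the figures of Section V, with tuples ordered as (opcode, operand, operand, destination).

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF  # assumption: registers hold 32-bit unsigned values


@dataclass
class State:
    """Reduced process state: program counter plus register map R."""
    pc: int = 0
    regs: dict = field(default_factory=dict)


def step(instr, state, labels):
    """Apply one transition (i, P) => P' and return the new state."""
    op, a, b, dst = instr
    regs = dict(state.regs)              # copy, so earlier states stay intact
    if op == "ASSIGNC":                  # r3 := n
        regs[dst] = a & MASK32
    elif op == "BOPADD":                 # r3 := r1 + r2
        regs[dst] = (regs.get(a, 0) + regs.get(b, 0)) & MASK32
    elif op == "BOPCADD":                # r3 := r1 + n
        regs[dst] = (regs.get(a, 0) + b) & MASK32
    elif op == "BOPCSUB":                # r3 := r1 - n
        regs[dst] = (regs.get(a, 0) - b) & MASK32
    elif op == "JMP":                    # hypothetical mnemonic for "jmp l"
        return State(labels[dst], regs)
    elif op not in ("LABEL", "NOP"):     # hypothetical mnemonics, pc + 1 only
        raise ValueError(f"unsupported opcode: {op}")
    return State(state.pc + 1, regs)


def run(program, state=None):
    """Execute until the program counter leaves the instruction list."""
    state = state or State()
    # L: label -> pc, built from the LABEL instructions in the program
    labels = {ins[3]: pc for pc, ins in enumerate(program) if ins[0] == "LABEL"}
    while 0 <= state.pc < len(program):
        state = step(program[state.pc], state, labels)
    return state


program = [
    ("ASSIGNC", 7, None, 0),        # r0 := 7
    ("JMP", None, None, "end"),     # skip the next instruction
    ("ASSIGNC", 99, None, 0),       # never executed
    ("LABEL", None, None, "end"),
    ("BOPCADD", 0, 50, 0),          # r0 := r0 + 50
]
final = run(program)
print(final.regs)  # {0: 57}
```

Returning a fresh State from each step mirrors the P ⇒ P' form of the rules and keeps every intermediate state available for comparison.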
F. Three Address Code Dead code is also known as junk code and a semantic
nop [16]. Dead code is semantically equivalent to a nil
The high level syntax we have described is not used operation. Insertion of this type of code has no semantic
internally by Wire. For that we employ a three address code. impact on the malware. The insertion increases the size of
The semantic equivalence between the high level syntax and the malware and modifies the byte and instruction level
three address code is shown using the semantic function A content of the malware.
for the high level syntax and the semantic function B for the An example of dead code insertion is shown below. The
three address code. intermediate code is also shown. For simplicity we assume
⟦∗ ( 3) ≔ 1⟧ = ⟦ 1, −, 3⟧ that the condition codes are not required as is the case when
⟦ 3 ≔∗ ( 1)⟧ = ⟦ 1, −, 3⟧ a future arithmetic instruction overrides earlier ones.
⟦ 3 ≔ 1⟧ = ⟦ 1, −, 3⟧
⟦ 3 ≔ 1⟧ = ⟦ 1, −, 3⟧ add $50,%eax
⟦ 3 ≔ 1( 1)⟧ =
mov $0,%eax
1 1, −, 3 sub $50,%eax
⟦ 3 ≔ 2( 1, 2)⟧ = 2 1, 1, 3 mov $0,%eax
⟦ 3 ≔ 3( 1, 1)⟧ = 2 2, 1, 3
⟦ ⟧= ⟦ −, −, −⟧
⟦ ⟧= ⟦ ASSIGNC $0,-,%eax
, −, −⟧
⟦ ⟧= ⟦ −, −, ⟧
⟦ ⟧= ⟦ −, −, ⟧ BOPCADD %eax,$50,%eax
⟦ 1 1 ⟧= ⟦ 1 1, −, ⟧ BOPCSUB %eax,%50,%eax
⟦ 1 2 2 ⟧= ⟦ 2 1, 2, ⟧ ASSIGNC $0,,%eax
⟦ ⟧= ⟦ −, −, ⟧
⟦ 3≔ ( 1, 1)⟧ = ⟦ 1 1, −, 3⟧
⟦ 3≔ ()⟧ = ⟦ −, −, 3⟧ Figure 1. Dead code insertion.
⟦ 3≔ ()⟧
= ⟦ −, −, 3⟧ In the proof that we perform we show the equivalence
⟦ ℎ ( , )⟧ = ⟦ , , −⟧ between code using dead code and code that is not using
⟦ 3≔ ( )⟧ = ⟦ , −, 3⟧ dead code. The proof is carried out by simulating execution
of each code sample and showing that the program states for
⟦ ( )⟧ = ⟦ , −, −⟧
both sequences are the same once complete.
⟦ 3≔ ( )⟧ = ⟦ , −, 3⟧ Firstly, we map register names to register indices that
⟦ 3≔ )⟧ will be used in all proofs in this section of the paper.
= ⟦ , −, 3⟧
⟦ 3≔ 1 2 2)⟧ Reg_name(“eax”) = 0
= ⟦ 1, 2, 3⟧ Reg_name(“ebx”) = 1
Reg_name(“zf”) = 100
V. APPLICATIONS
One application of a formally defined language is to In the first part of the dead code equivalence proof we
prove properties of its programs. One type of proof that can execute the instructions without the dead code.
be performed is an equivalence proof. Equivalence proofs 1=0
are useful and we will examine the particular case of (" 0, −,0", ) ⟹ ′
equivalence between obfuscated codes which is a commonly
seen occurrence in malware. Our proofs work on the ′= = + 1, [0 ↦ 1]
intermediate code only and assume the intermediate code
generation has been performed correctly. ′= = + 1, [0 ↦ 0]

Semantic Equivalence of Obfuscated Code In the second part of the proof we execute the
instructions with the dead code.
A syntactic metamorphic malware technique is a method (0) → 1
that changes the syntactic structure of the malware [16].
3 = 1 + 50
Though the syntactic structure changes in polymorphic
(" 0, $50,0", ) ⟹ ′
malware, the malware semantically remains identical. The
technique is predominantly used to evade byte level ′
= [ = + 1, [0 ↦ 3]]


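\[ t' = t[\mathit{pc} = \mathit{pc} + 1,\ R[0 \mapsto v_1 + 50]] \]

\[ \frac{R(0) \to v_1 \qquad v_3 = v_1 - 50}{(\texttt{BOPCSUB 0, \$50, 0},\ t') \Rightarrow t''} \]
\[ t'' = t[\mathit{pc} = \mathit{pc} + 2,\ R[0 \mapsto v_3]] \]
\[ t'' = t[\mathit{pc} = \mathit{pc} + 2,\ R[0 \mapsto (v_1 + 50) - 50]] \]

\[ \frac{R(0) \to v_1 \qquad v_3 = 0}{(\texttt{ASSIGNC \$0, -, 0},\ t'') \Rightarrow t'''} \]
\[ t''' = t[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto v_3]] \]
\[ t''' = t[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto 0]] \]

Now we can see that t''' - pc = s' - pc, which means they are semantically equivalent when ignoring the effect the code has on the program counter. We also note that s' and s'' are semantically equivalent. We have thus proven that the obfuscated and deobfuscated code samples are equivalent.

This approach to proving semantic equivalence between code samples is useful to a malware researcher who wants to identify malware instances and variants.

2) Code Reordering
Code reordering [17] changes the syntactic order of the code in the malware [16]. The actual or semantic execution path of the program does not change. However, the syntactic order as present in the malware image is altered.

We show an example of code reordering and the intermediate code generated from each sequence.

    First sequence                 Second sequence
    mov $2,%eax                    mov $1,%ebx
    mov $1,%ebx                    mov $2,%eax
    add %eax,%ebx                  add %eax,%ebx

    ASSIGNC $0x2,,%eax             ASSIGNC $0x1,-,%ebx
    ASSIGNC $1,,%ebx               ASSIGNC $2,-,%eax
    BOPADD %ebx,%eax,%ebx          BOPADD %ebx,%eax,%ebx

Figure 2. Code reordering

For the first part of the proof we execute the first instruction sequence.
\[ \frac{v_1 = 2}{(\texttt{ASSIGNC \$2, -, 0},\ s) \Rightarrow s'} \]
\[ s' = s[\mathit{pc} = \mathit{pc} + 1,\ R[0 \mapsto v_1]] \]
\[ s' = s[\mathit{pc} = \mathit{pc} + 1,\ R[0 \mapsto 2]] \]

\[ \frac{v_1 = 1}{(\texttt{ASSIGNC \$1, -, 1},\ s') \Rightarrow s''} \]
\[ s'' = s[\mathit{pc} = \mathit{pc} + 2,\ R[0 \mapsto 2, 1 \mapsto v_1]] \]
\[ s'' = s[\mathit{pc} = \mathit{pc} + 2,\ R[0 \mapsto 2, 1 \mapsto 1]] \]

\[ \frac{R(0) \to v_1 \qquad R(1) \to v_2 \qquad v_3 = v_1 + v_2}{(\texttt{BOPADD 1, 0, 1},\ s'') \Rightarrow s'''} \]
\[ s''' = s[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto 2, 1 \mapsto v_3]] \]
\[ s''' = s[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto 2, 1 \mapsto 3]] \]

For the second part of the proof we execute the second instruction sequence.
\[ \frac{v_1 = 1}{(\texttt{ASSIGNC \$1, -, 1},\ t) \Rightarrow t'} \]
\[ t' = t[\mathit{pc} = \mathit{pc} + 1,\ R[1 \mapsto v_1]] \]
\[ t' = t[\mathit{pc} = \mathit{pc} + 1,\ R[1 \mapsto 1]] \]

\[ \frac{v_1 = 2}{(\texttt{ASSIGNC \$2, -, 0},\ t') \Rightarrow t''} \]
\[ t'' = t[\mathit{pc} = \mathit{pc} + 2,\ R[0 \mapsto v_1, 1 \mapsto 1]] \]
\[ t'' = t[\mathit{pc} = \mathit{pc} + 2,\ R[0 \mapsto 2, 1 \mapsto 1]] \]

\[ \frac{R(0) \to v_1 \qquad R(1) \to v_2 \qquad v_3 = v_1 + v_2}{(\texttt{BOPADD 1, 0, 1},\ t'') \Rightarrow t'''} \]
\[ t''' = t[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto 2, 1 \mapsto v_3]] \]
\[ t''' = t[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto 2, 1 \mapsto 3]] \]

Thus we see that t''' - pc = s''' - pc, and therefore the two instruction sequences are semantically equivalent.

3) Opaque Predicate Insertion
An opaque predicate [18] is a predicate that always evaluates to the same result. An opaque predicate is constructed so that it is difficult for an analyst or automated analysis to know the predicate result. Opaque predicates can be used to insert superfluous branching in the malware's control flow. They can also be used to assign variables values which are hard to determine statically. The use of opaque predicates is primarily for code obfuscation, and to prevent understanding by an analyst or automated static analysis.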
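To make the idea concrete, the following sketch models the x86 pair `xor %eax,%eax` followed by `jnz`. The xor of a register with itself is always zero, so the zero flag is always set and the `jnz` branch is never taken, whatever value the register held. The function names are hypothetical, and testing sampled inputs is only a sanity check under the assumption of 32-bit registers; it is not the proof technique used in this paper.

```python
import random

MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def xor_self(eax):
    """Model `xor %eax,%eax`: return (new_eax, zero_flag)."""
    result = (eax ^ eax) & MASK32
    return result, int(result == 0)


def jnz_taken(zero_flag):
    """`jnz` branches only when the zero flag is clear."""
    return zero_flag == 0


def branch_ever_taken(samples):
    """True if any sampled input value makes the guarded branch execute."""
    return any(jnz_taken(xor_self(value)[1]) for value in samples)


rng = random.Random(0)  # fixed seed so the check is repeatable
samples = [0, 1, 0x7FFFFFFF, 0x80000000, MASK32]
samples += [rng.getrandbits(32) for _ in range(1000)]

print(branch_ever_taken(samples))  # False
```

A symbolic argument, such as the operational-semantics proof in this section, is what establishes the result for every input; the sampling above merely fails to find a counterexample.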

    First sequence                     Second sequence
    xor %eax,%eax                      xor %eax,%eax
    mov $2,%eax                        jnz $0x80482000
                                       mov $2,%eax

    BOPXOR %eax,%eax,%eax              BOPXOR %eax,%eax,%eax
    UMKBOOLIsZero %eax,,%zf            UMKBOOLIsZero %eax,,%zf
    ASSIGNC $2,-,%eax                  UCJMPIsNotZero %zf,,$target
                                       ASSIGNC $2,-,%eax

Figure 3. An opaque predicate.

In the first part of the proof we execute the first code sequence.
\[ \frac{R(0) \to v_1 \qquad R(0) \to v_2 \qquad v_3 = v_1 \oplus v_2}{(\texttt{BOPXOR 0, 0, 0},\ s) \Rightarrow s'} \]
\[ s' = s[\mathit{pc} = \mathit{pc} + 1,\ R[0 \mapsto v_3]] \]
\[ s' = s[\mathit{pc} = \mathit{pc} + 1,\ R[0 \mapsto 0]] \]

\[ \frac{R(0) \to v_1}{(\texttt{UMKBOOLIsZero 0, -, 100},\ s') \Rightarrow s''} \]
\[ s'' = s[\mathit{pc} = \mathit{pc} + 2,\ R[0 \mapsto 0, 100 \mapsto 1]] \]

\[ \frac{v_1 = 2}{(\texttt{ASSIGNC \$2, -, 0},\ s'') \Rightarrow s'''} \]
\[ s''' = s[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto v_1, 100 \mapsto 1]] \]
\[ s''' = s[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto 2, 100 \mapsto 1]] \]

In the second part of the proof we execute the second code sequence.
\[ \frac{R(0) \to v_1 \qquad R(0) \to v_2 \qquad v_3 = v_1 \oplus v_2}{(\texttt{BOPXOR 0, 0, 0},\ t) \Rightarrow t'} \]
\[ t' = t[\mathit{pc} = \mathit{pc} + 1,\ R[0 \mapsto v_3]] \]
\[ t' = t[\mathit{pc} = \mathit{pc} + 1,\ R[0 \mapsto 0]] \]
\[ R(0) \to v_1 \]

We see that register 100 is set, which makes the conditional branch in the following instruction use a false condition.
\[ \frac{R(100) \to v_1}{(\texttt{UCJMPIsNotZero 100, -, target},\ t'') \Rightarrow t'''} \]
\[ t''' = t[\mathit{pc} = \mathit{pc} + 3,\ R[0 \mapsto 0, 100 \mapsto 1]] \]

\[ \frac{v_1 = 2}{(\texttt{ASSIGNC \$2, -, 0},\ t''') \Rightarrow t''''} \]
\[ t'''' = t[\mathit{pc} = \mathit{pc} + 4,\ R[0 \mapsto v_1, 100 \mapsto 1]] \]
\[ t'''' = t[\mathit{pc} = \mathit{pc} + 4,\ R[0 \mapsto 2, 100 \mapsto 1]] \]

Thus we see that s''' - pc = t'''' - pc, and this proves semantic equivalence.

B. Assisted and Automated Theorem Proving
The manual proofs shown in the previous section are useful. However, a more automated approach is beneficial. Algebraic specification [19] has been used in previous research to combine algebraic semantics [20] and theorem proving. Our work is different and uses operational semantics. Proof assistants may be used by an analyst. An alternative is to use automated theorem provers such as those for Satisfiability Modulo Theories (SMT). These solvers can solve first order logic problems in a number of theories including bit vectors. Public solvers are freely available [21]. SMT solvers have been used in the past to perform semantic NOP detection [16] and to show equivalence between the code in basic blocks of two programs [22]. Our work gives a semantic basis and theory for these solvers to be used.

VI. CONCLUSION
Wire is an intermediate language that enables analysis of executable programs. Wire has unique features including the ability to integrate the results of decompilation into the core language. While this makes the translation possibly unsound, for the majority of programs the translation is effective and useful for analysis. A formal definition of the operational semantics of the language enables researchers to formally reason about assembly code. We demonstrated proofs of program equivalence between obfuscated and non-obfuscated code samples. This reinforces our belief that a formal approach to describing Wire has practical benefits.

REFERENCES
[1] S. Cesare and Y. Xiang, "Classification of Malware Using Structured Control Flow," in 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2010), 2010.
[2] S. Cesare and Y. Xiang, "A Fast Flowgraph Based Classification System for Packed and Polymorphic Malware on

(" 0,0,100", ′) ⟹ ′′ the Endhost," in IEEE 24th International Conference on
Advanced Information Networking and Application (AINA
′′ = = + 2, [0 ↦ 0,100 ↦ 1]
2010), 2010.

523
[3] S. Cesare and Y. Xiang, "Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs," in IEEE Trustcom, 2011.
[4] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: principles, techniques, and tools. Reading, MA: Addison-Wesley, 1986.
[5] F. Bellard, "QEMU, a fast and portable dynamic translator," in USENIX Annual Technical Conference, 2005, pp. 41–46.
[6] V. Bala, E. Duesterwald, and S. Banerjia, "Dynamo: a transparent dynamic optimization system," presented at the Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation, 2000.
[7] C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," presented at the Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, 2005.
[8] N. Nethercote and J. Seward, "Valgrind: A Program Supervision Framework," Electronic Notes in Theoretical Computer Science, vol. 89, pp. 44-66, 2003.
[9] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena, "BitBlaze: A new approach to computer security via binary analysis," presented at the Information Systems Security, 2008.
[10] C. Cifuentes, "Reverse compilation techniques," Queensland University of Technology, 1994.
[11] M. J. Van Emmerik, "Static single assignment for decompilation," The University of Queensland, 2007.
[12] S. Hex-Rays, "IDA Pro Disassembler," ed, 2008.
[13] T. Dullien and S. Porst, "REIL: A platform-independent intermediate representation of disassembled code for static code analysis," ed: CanSecWest, 2009.
[14] C. Kruegel, W. Robertson, F. Valeur, and G. Vigna, "Static disassembly of obfuscated binaries," in USENIX Security Symposium, 2004, pp. 18-18.
[15] H. R. Nielson and F. Nielson, Semantics with applications: an appetizer: Springer Verlag, 2007.
[16] M. Christodorescu, J. Kinder, S. Jha, S. Katzenbeisser, and H. Veith, "Malware normalization," University of Wisconsin, Madison, Wisconsin, USA Technical Report #1539, 2005.
[17] M. Christodorescu and S. Jha, "Testing malware detectors," presented at the Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis, Boston, Massachusetts, USA, 2004.
[18] C. Linn and S. Debray, "Obfuscation of executable code to improve resistance to static disassembly," presented at the Proceedings of the 10th ACM conference on Computer and communications security, Washington D.C., USA, 2003.
[19] M. Webster and G. Malcolm, "Detection of metamorphic computer viruses using algebraic specification," Journal in Computer Virology, vol. 2, pp. 149-161, 2006.
[20] J. Goguen and G. Malcolm, Algebraic semantics of imperative programs: The MIT Press, 1996.
[21] V. Ganesh and D. L. Dill, "A decision procedure for bit-vectors and arrays," presented at the Proceedings of the 19th international conference on Computer aided verification, Berlin, Germany, 2007.
[22] D. Gao, M. K. Reiter, and D. Song, "Binhunt: Automatically finding semantic differences in binary programs," in Information and Communications Security, 2008, pp. 238–255.
