Professional Documents
Culture Documents
Research Article
COOPS: A Code Obfuscation Method Based on Obscuring
Program Semantics
Yang Li , Fei Kang , Hui Shu , Xiaobing Xiong, Zihan Sha, and Zhonghang Sui
State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450000, China
Copyright © 2022 Yang Li et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
As reverse engineering technology develops rapidly, the financial loss caused by software security issues is urgent. Therefore, how
to effectively protect software is a critical problem to solve. The software protection method based on code obfuscation is an
effective way, and constructing an effective obfuscation algorithm can increase the cost of reverse software. It is conspicuous that
current development of code obfuscation focuses on increasing the complexity of the code structure without paying much
attention to the protection of program semantic information, which may help experienced attackers improve their analysis
efficiency. This paper proposes COOPS for protecting software based on program semantic information, in which functions are
regarded as basic semantic units. The switch relationship between the intrafunction control flow and the interfunction calling is
established. The interfunction calling can be hidden in the intrafunction control flow, and in reverse, the intrafunction control
flow can also be converted to interfunction calling. In this way, considering intraprogram function semantic unit level discrete,
this method reconstructs the intraprogram semantic relationship. To determine the relative effectiveness, we have evaluated
COOPS on OpenSSL and SpecInt-2000 test sets. For both of them, the function calling graphs before and after obfuscation differ
more than 90%, which means COOPS significantly changes the control flow of the program. The evaluation shows that compared
with O-LLVM, COOPS manifests strong resistance to Asm2vec and other program similarity analysis techniques and significantly
improves the level of software protection rather than necessitating time-consuming and heavyweight problems.
threaded mutually exclusive execution. The method creates to these techniques. The main contributions of this paper are
multiple threads in a program. It assigns different functions summarized as follows:
to different threads for execution, so that the program ex-
(i) We initially propose the idea that code obfuscation
ecution trajectory with functions is converted as a unit from
based on semantic obscurity. A function is regarded
calling execution within a single thread to repeated
as a semantic unit, and the switch relationship
switching execution between multiple threads. Yadav et al.
between the intrafunction control flow and the
[27] and Hataba et al. [28] implement code protection at the
interfunction calling is established. In the level of
source code level, where the former uses a pointer multi-
control flow inside and outside the function, the
plication mechanism to replace integer operations, and the
intraprogram semantic hierarchy is changed.
latter diversifies the generated software by changing the
Through experimental comparison and analysis, it
control flow of the input program at the compiler level.
can effectively resist the similarity analysis tech-
However, the abovementioned code obfuscation algo-
niques such as Asm2vec.
rithm mainly focuses on how to increase the complexity of
the internal structure of the code, and the complex code (ii) We apply function inlining and outlining tech-
structure fail to hide the semantic information behind the nology to code obfuscation, and three outlining
structure. The stage semantic information obtained by the methods are proposed novelly.
attacker during the reverse analysis process is still available. (iii) We implemented an automatic code obfuscation
Experienced attackers can effectively improve the analysis ef- system-based on LLVM, which can automatically
ficiency by using the semantic information. Therefore, how to obfuscate the input source code program. The source
improve the protection of program semantics by code ob- code and the experimental results are available at
fuscation technology is our necessary technique in this paper. GitHub (https://github.com/Rookiellvm/COOPS).
From the perspective of protecting program semantics,
this paper proposes a code obfuscation method named 2. Motivation
COOPS to obscure program semantics. It is important to
note that the code obfuscation is converted on the basis of Figure 1 presents a high-level abstraction how an attacker
the equivalence of overall semantics of a certain program. can perform analysis on a program using semantic infor-
The obscure program semantics in this paper does not mation. The figure below on the left depicts the program
change the overall semantics of the program, but obscures under the protection of conventional obfuscation tech-
the intraprogram semantic layers and the relationship be- niques, where the semantic level inside the program has not
tween levels. To destroy the interfunction semantic rela- changed. The functions themselves and interfunction rela-
tionship and the semantic information of the function itself, tionship have clear semantics. The attacker deduces the
we regard the function as a basic semantic unit, and semantic behavior of the program through obtained semantic in-
obscurity is implemented both the control flow inside and formation and semantic relationship of each function. The
outside the function. So that attackers are faced with more design goal of the obfuscation method in this paper is to
abstract function units and calling relationships. The core destroy interfunction semantic information and the function
approach is based on function inlining and outlining. The itself, and change the intraprogram semantic level. The
initial mention of function inlining and outlining in code figure below on the right depicts that the internal semantic
obfuscation is in the Collberg article [7]. This article only level of the program is obfuscated by COOPS, where
describes the obfuscation effect of this method, but there are interfunction semantic logic is disordered. It becomes in-
no specific implementation details. In this paper, we initially feasible for the attacker to understand whether the analyzed
combine the function inlining and outlining technology with function has the correct semantic function, let alone deeply
the obfuscation strategy and present a novel control flow understand the program.
obfuscation method.
In order to verify the effectiveness of the obfuscation 3. System Design
algorithm, we designed a code obfuscation system based on
In this chapter, we describe the obfuscation system design
the LLVM [29] compiler. The system considers software
method in this paper. In Section 3.1, we provide an overview
programs written in C/C++ as input and PE binary files after
of the obfuscation system. In Section 3.2, we describe the
obfuscation as outputs. We choose OpenSSL and SpecInt-
control flow transformation methods used in the obfusca-
2000 as test set. Extensive tests are used to evaluate the
tion system. In Section 3.3, we describe how to reconstruct
effectiveness of the system in terms of code protection and
the data relationship, that is, broken during control flow
the costs of operation caused by the obfuscation process.
transformation. Combining control flow transformation
Experiments show that the system has effective obfuscation
methods with obfuscation strategies to implement program
and is suitable for various complex code structures. After
control flow protection is described in Section 3.4.
obfuscation, the difference in the calling graph of the
software is more than 90%, and the complexity of the calling
graph circle also increases by about 20%, which significantly 3.1. Overview. The overview of COOPS is shown in Figure 2.
improves the analysis difficulty. We select Asm2vec [30], It is mainly divided into two stages. The first stage analyzes
DeepBindiff [31], Safe [32] and BinDiff [33] for similarity the calling relationship of the original program and inline
analysis, and the results show the system is strongly resistant the functions with higher importance. The so-called inlining
Security and Communication Networks 3
Behavior
???
Fun Deduce
Fun Deduce
Relation
???
Fun
Fun
Fun
Semantic Non-semantic Fun
logic logic
Fun Fun
Semantic
Function
Non-semantic
Function
M M M
A B C A B F C A
E E B F
B'
C D E F H D H D
G H I L G I K L G' G'' I K L
K
Figure 2: Overview of our approach. (a) Original call graph. (b) Inline call graph. (c) Outline call graph.
is to hide the called function into the caller’s function, combines the obfuscation strategy to achieve the purpose of
thereby eliminating the function calling relationship, which control flow obfuscation.
corresponds to the process of Figures 2(a) and 2(b). As-
suming that functions C, E, F, and H are functions of high
3.2. Control Flow Transformation Method. Function inlining
importance, we inline these functions into A, B, and D. In the
and outlining is a way to implement control flow trans-
second stage, on the basis of inlining transformation,
formation. In this paper, function inlining and outlining are
function outlining transformation is performed on the
used to implement control flow transformation. Combining
program. The so-called outlining is to transfer the code
inlining and outlining can effectively change the control flow
snippet to the outside of the function. COOPS consists of
structure of a program.
three outlining methods, corresponding to the process of
Figures 2(b) and 2(c). The split outlining corresponds to
function G being converted into G′ and G″. The parent-child 3.2.1. Inlining Transformation. Inlining transformation is a
outlining corresponds to functions B, I, K, and L being control flow transformation technique that eliminates
converted into subfunctions B′, I′, K′, and L′. The cross function callings by inlining the called function into the
outlining corresponds to the outline code snippets of caller’s function. The inlined program will make the inlined
functions B′ and D being converted to functions K and L, in function and its related callings disappear, and the semantic
which there originally is not calling relationship between function unit composed of the original instruction set will be
functions B′\D and K\L. Section 3.2 of this chapter intro- destroyed. In reverse analysis, it is only necessary to un-
duces inlining obfuscation methods and strategies, and derstand the function semantic unit, but it becomes the
Section 3.3 introduces three outlining methods and strate- understanding of the instruction semantic unit. It is much
gies. Three outlining methods are applied at different stages more difficult for a reverse analysis program to understand a
of the obfuscation process. COOPS uses inlining and out- piece of instruction than to understand an independent
lining as the control flow transformation method, and function unit, because it is not clear that this piece of code
4 Security and Communication Networks
A F1 F2
Caller Entry
CallInst
Entry
Callee
B Callee B Call F2
C D C D
InLine
E E D End
End
F F
B
use[Inst]
� {v|the set of variables referenced when Inst is executed}.
F2 C (2)
D
Definition 2. If variable v ∈ def[Insti ] and v ∈ use[Instj ],
Insti ≠ Instj , where Instj and Insti have an executable path.
End On this path, there is no statement to redefine v; hence,
instruction Instj directly depends on Insti or instruction
Figure 6: Split outlining. Insti directly dominates Instj on the data stream of variable
Instj , which is denoted as
child outlining can increase the complexity of the function DDInstj , Insti , v. (3)
calling relationship.
The abovementioned three outlining transformation As shown in Figure 9, “Code Snippet” is a code snippet
methods move the code snippet intrafunction to the outside obfuscated by outlining. Each basic block is represented as a
for execution. As shown in Figure 7, the outlining destroys the set of instructions, in which each instruction is represented
control relationship between basic blocks within a function as a set of corresponding variables. The dependent in-
and generates a new interfunction call. In the process of struction data in the outlining basic block set are analyzed to
outlining transformation, the control relationship of basic determine the return value and the parameters of the out-
blocks A to B is converted into CallInst instruction, in which lining function.
data dependencies are passed through parameters; the control For each instruction Insti in the outlining code snippet,
relationship between basic blocks F to G is converted into the set of variables referenced by instruction use[Insti ] are
Return instruction, in which data dependencies are passed calculated (Phase ①). For each instruction Instj in the
through return values; the control relationship within the predecessor instruction set (Prev in Figure 9) of the outlining
outlining code snippet remains unchanged. Therefore, the code snippet, the variable set def[Instj ] (Phase ②) defined
conversion from intrafunction control relations to inter- by the instruction Instj are calculated. Each variable v in
function calls is proved equivalent. The construction of pa- use[Insti ] is used as the parameter of the outlining function
rameters and return values is described in Section 3.3. (Phase ③), if the variable v is in def[Instj ].
For each instruction Insti in the outlining code snippet,
the variable set def[Insti ] (Phase ④) defined by the in-
3.3. Data Relationship Reconstruction. During inlining, the struction Insti are calculated. For each instruction Instj of
entire code segment of the function is inlined at the calling the subsequent basic block set (Succ in Figure 9) of the
point. However, this process destroys the original data re- outlining code snippet, the variable set use[Instj ] (Phase ⑤)
lationship, which needs to be reconstructed. Figure 8 shows referenced by the instruction Instj are calculated. Each
the implementation process of inlining obfuscation. First, variable v in def[Insti ] is used as the return value of the
the code in the inline function is replicated to the calling outlining function, if the variable v is in use[Instj ]. All
point, and then variable remapping and basic block logic variables used as the return value are stored in the Ret Struct
processing are performed. When the parameters inlined by (Phase ⑥).
the function are processed, the process of calling function
parameters is transformed into the process of instruction
assignment. Since the exit of the function is not unique, the 3.4. Control Flow Obfuscation Strategy. Two types of control
return value is uncertain. Hence, all the exit basic blocks at flow transformation methods are described in Section 3.2. In
the intermediate semantic level are collected, and a tag summary, inlining can hide function calls into intrafunction
variable is added when processing the return value. Based on control flow. Outlining can transfer the intrafunction to
the runtime result, the return value is determined at the interfunction call. The design goal of this paper is to destroy
switch basic block according to the tag scalar. Finally, the the internal semantic unit level of the program as much as
calling instruction and the function prototype are removed. possible, and obfuscate the control flow inside and outside the
When the system performs outlining, outlining code program. This section designs the control flow transformation
snippets to form functions. The outlining of the code will strategy in stages, applies the abovementioned control flow
destroy the data dependencies during program execution, so transformation methods to different stages to achieve the
it is necessary to analyze the intraprogram data flow, find the destruction of the internal control flow of the program, and
relevant dependencies and complete the reconstruction. then protects semantic information of the program.
6 Security and Communication Networks
A
Caller
CallInst
Callee
B B
Snippet
C D C D
E E
F F
Return
Caller Entry
Ret
Ret
Stage 1. We hide highly important functions by inline vertex. Let vi represent the i function in the program, then
transformation. The inlining transformation hides the the calculation formula of the importance can be expressed
function calling into the intrafunction control flow, and the as
inlining of the function with complex calling relationship
Sp vi � Ip vi × Op vi , (4)
can destroy the control flow logic to the greatest extent.
Therefore, the object of the inline transformation is the where Ip (vi ) represents the number of paths ending with
important function in the program calling relationship. vertex vi , and Op (vi ) represents the number of all paths
Function calling graphs can be abstracted into directed starting with vertex vi . Different from the conventional
graphs with explicit pointing relationships, so vertex im- directed graph, the function calling graph has two partic-
portance can be used to evaluate the importance of functions ularities: (1) the function calling graph is a directed graph
in function calling relationships. The connectivity between with rings; (2) the function calling graph has a definite
vertices and other vertices is of great significance for judging starting point and end point.
the importance of vertices. That is, the more paths through a For the former, the ring of the directed graph may cause
vertex, the more important that vertex is. Therefore, the the vertex path to explode, so the back-edge in the ring is
path-based vertex importance is specifically defined as: pruned to convert the graph with rings into an acyclic graph.
For the latter, since the function calling graph has a
Definition 3. For a vertex, the number of all paths passing definite starting point (main function) and a definite end
through the vertex is used to judge the importance of the point (leaf node), the calculation formula for the number of
Security and Communication Networks 7
Parameter
def[Instj]
~~~ ~~~
~~~ ~~~ Code
~~~ Snippet
~~~ ~~~
~~~ Arguments
~~~ ~~~ ~~~
~~~ ~~~
Prev
~~~ ~~~
~~~
~~~
BasicBlock ~~~ ~~~
~~~ Instruction
~~~
Succ Variable
~~~ ~~~
~~~ ~~~
~~~ Ret Struct
~~~ ~~~ ~~~
~~~ ~~~ ~~~ ~~~
~~~
~~~
def[Insti]
use[Instj] DD (Insti, Instj, v)
Ret Values
vertex paths can be quickly converged after the pruning of would be an abstract solution for implementing function
the backward edges is completed. It can be seen from for- outlining: suppose the function calling graph is shown on the
mulas (2) and (3) that the number of paths passing through a left of Figure 10. Traverse the function calling graph, gen-
vertex can be solved by calculating Ip (vi ) and Op (vi ) re- erate a multi-fork tree A that describes the function calling
spectively, and this calculation process is recursive: relationship, and build a target spanning tree B on the basis
of the multi-fork tree A. The tree B has the following re-
Ip vi � Ip (u), quirements: (1) complete multi-fork tree; (2) tree B has more
u∈I(v)
(5) layers than tree A. In this example, the number of layers of
Op vi � Op (u). the target spanning tree B is 3, and the number of forks is 3.
u∈O(v) The nodes generated in the process of constructing the target
spanning tree B are used as candidates for parent-child
Among them, I(v) represents the set of vertices that have a outlining to provide a variety of choices for the third stage.
direct calling relationship with vertex vi , and O(v) represents
the set of vertices that have a direct calling relationship with
Stage 4. The real function calling graph has back-edges, and
vertex vi . Since the function calling graph has a definite start
Cross outlining is implemented on the constructed target
and end point, the recursive calculation process is determin-
spanning tree with the help of the Barabasi‒Albert model
istically convergent. The function with high importance in the
[34] to generate the final function calling graph. The Bar-
program is obtained by formula (1) for inline transformation.
abasi‒Albert model is a graph generation algorithm for the
preferential connection model. Preferential attachment
Stage 2. The split outlining can split functions into different means that the more connections between vertex, the greater
functions. Therefore, the split outlining is performed on the the possibility of accepting new connections. The core idea of
function that was inlined in the first stage, and the inline the algorithm is to add an edge between the vertex and the
code and the function interleaved by the inline function are vertex in the original graph for a new vertex in the graph. The
split into different ones. As a result, the function semantic probability of connection is ki / ni�1 ki , where ki represents
information with high importance is discrete. the degree of vertex i in the original graph. Referring this
model to the calling graph, the more connections between
Stage 3. The function call spanning tree is constructed to vertexes, the greater the importance of functions. However,
implement parent-child outlining, and the function call the purpose of the obfuscation in this paper is to hide the
process is similar in form to the depth-first-based multi-fork importance of the function, and the probability of trans-
tree node traversal. Therefore, building a multi-fork tree forming the model connection is as follows:
8 Security and Communication Networks
F1
F1
F2 F3 F2 F1.1 F3
(LLVM bitcode)
(LLVM bitcode)
Orig-obf.bc
Orig.bc
5.1. Evaluation Platform and Benchmarks. In this section, we Obfuscator-LLVM (O-LLVM) [35] is applied beyond
explain how the COOPS obfuscation method is tested and LLVM’s intermediate language (IR) to generate more
verified. The machine used for testing has an Intel Core i7- complex binary files by transforming IR. O-LLVM includes
9700 CPU @ 3.00 GHz with 32 GB memory and Windows 10 three different obfuscation techniques: control flow flat-
as the operating system. tening (FLA), instruction substitution (SUB), and bogus
We evaluated the test sets commonly used in the field, control flow (BCF).
including the cryptographic algorithms RSA and AES in FLA flattens the control flow of the program, adding new
OpenSSL, Bzip, Gzip, Twoif, and Parser in SpecInt-2000 [36]. conditional predicates to hide the control logic of the
Table 1 shows the basic information of the test suite. The first original program through switch-case. FLA greatly changes
column is the test program name, the second column is the the internal control flow structure of the program.
size of the program, the third column is a brief description of BCF adds a large number of conditional judgments to the
the program function, the fourth column is the number of program by adding opaque predicates and garbage in-
program instructions, and the fourth column is the number structions to the program. The program after BCF obfus-
of program functions. cation adds a lot of conditional branches, which
differentiates between the control flow and the original
program significantly.
5.2. Evaluation of System Protection Effect. In this section, we SUB is to replace the operation in the program with a
analyze the antianalysis capability of the software under the more complex operation, for example, replace the AND
protection of the obfuscated system. Antianalysis ability is an operation with a � (b^∼c) & b. SUB significantly changes the
important indicator to evaluate the increase of the cost of sequence of instructions in the program.
software analysis by obfuscation methods. In this section, we O-LLVM has greatly changed the control flow structure
intend to evaluate from two aspects. One is obfuscation of the original program. Based on the abovementioned
system resistance to similarity analysis; the other is obfus- similarity analysis techniques and tools, the following 4
cation strength. experiments were set up.
Similarity analysis can eliminate the confounding factors
faced by reverse analysis. A common situation is that a
Experiment 1. To compare similarities, we use Asm2vec,
statically linked lib library is carried during software de-
Safe, and DeepBindiff.
velopment, and a large number of library functions can be
Experimental setup: we considered the original AES
excluded by similarity analysis with known library function
program as the benchmark comparison object, and the
codes to improve analysis efficiency. This section evaluates
compare the similarities between RSA, Bzip, Gzip, Parser,
the resistance of COOPS to similarity analysis using
Twoif and obfuscated AES (AES-Obf ).
Asm2vec, DeepBindiff, and Safe academic similarity analysis
Experimental results: the experimental results are shown
techniques in 5.2.1, as well as commercial-grade BinDiff.
in Figure 12. The similarity between AES-Obf and AES is at a
The obfuscation strength is used to evaluate the com-
low level, indicating that COOPS significantly changes the
plexity increment of the obfuscated program. Anckaert et al.
internal control flow structure of the program and can ef-
[37] proposed to evaluate obfuscation strength based on
fectively resist the similarity comparison technology such as
complexity measure in software engineering. In this section,
Asm2vec. The similarity comparison results obtained by
we adopt the software cyclomatic complexity in Section 5.2.2
AES-Obf and the complex program Twoif are similar, in-
to evaluate the obfuscation strength.
dicating that the degree of difference of the programs after
COOPS obfuscation has reached the level of complex pro-
5.2.1. Program Similarity Analysis. Asm2vec, Safe and grams. Technologies such as Asm2vec extract the semantic
DeepBindiff are the techniques used in the academic com- information of the assembler by analyzing the assembly
munity to perform binary similarity analysis. By matching code. Since both RSA and AES are cryptographic algorithms,
known functions through similarity analysis, the influence of the similarity between the two is high, while the similarity of
obfuscation is eliminated and the analysis efficiency is im- the obfuscated AES-Obf is low, which also reflects that
proved. BinDiff is a tool used by the business community for COOPS obscures the semantic information of the program.
similarity comparison analysis. It compares the similarity of
binary files from five dimensions: Function, Calls, Basic- Experiment 2. Comparing COOPS with O-LLVM using
Block, Jumps, and Instructions. Asm2vec.
10 Security and Communication Networks
120
120
90
80
Percentage (%)
40 60
0
30
Experimental setup: we use BCF, FLA, SUB, MIX Figure 13: Comparing COOPS with O-LLVM using Asm2vec.
(BCF + FLA + SUB) and COOPS to obfuscate the 6 appli-
cations respectively, and use Asm2vec to compare the
similarities between the programs before and after the 30
obfuscation.
Experimental results: the experimental results are shown
in Figure 13. Compared with the other four obfuscation
methods, COOPS has better control flow obfuscation effect, 20
Percentage (%)
30
As shown in Table 2, the difference degree of calling
graph before and after obfuscation reaches more than 90%.
As the number of functions of the obfuscated program
increases, the difference degree of calling graph is greater.
0 On the whole, the obfuscated calling graph is close to re-
AES RSA Gzip Bzip Parser Twoif
construction, which is significantly different from the
BCF COOPS original program calling graph. At the same time, the
MIX SUB
cyclomatic complexity of the program has also increased,
FLA
which shows that the increased complexity of the callings
Figure 15: Comparing COOPS with O-LLVM using BinDiff. between the functions of the COOPS makes the calling
relationship more complicated after obfuscation, and ef-
measure the complexity of a software’s internal structure. In fectively improves the anti-analysis ability of the program.
this section, the cyclomatic complexity is used to measure COOPS can provide multiple rounds of iterative ob-
the obfuscation potency, and the difference of the calling fuscation, and multiple COOPS can be applied to improve
graph before and after obfuscation is included in the ob- the obfuscation intensity and increase the diversity and
fuscation potency evaluation. complexity of the function calling graph. The complexity of
Cycle complexity Cp is used to measure the complexity calling graphs and the cost of running time after applying
of a module’s decision structure, whose value is linearly COOPS for multiple times are summarized in Figure 16. It
independent of the number of paths, i.e., the minimum can be seen that under the time overhead of not more than
number of paths to be tested for reasonable error pre- 50%, when the number of iterations is 3 and 4, the cyclo-
vention. Cp can be obtained by calculating the cyclomatic matic complexity of the program increases by about 50%. In
complexity of ϑ (GP ) and C(GP ), i.e., V(GF ) � C − F + 2P, particular, the cyclomatic complexity of the complex pro-
where C and F respectively are the number of edges and gram Parser and Twoif calling graphs increases by more than
vertexes in the calling graph, and P is the number of 90%. The number of obfuscated functions in complex
connected components of the graph, i.e., the maximum set programs is large, and the complexity of the generated
of connected vertexes. Since the control flow graphs are all calling graph is high, so the COOPS has better obfuscation
connected, P is equal to 1. Therefore, the obfuscation can be effect on complex programs.
expressed as
V ϑ GP 5.3. Evaluation of System Efficiency. This section compares
Cp ϑ, GP � 100 − 1. (7) COOPS and O-LLVM with the abovementioned three ob-
V GP
fuscation options to evaluate program execution efficiency
The difference degree Dp , i.e., the difference of each and code inflation rate.
calling graph ϑ (GP ) after obfuscation compared to GP before The execution efficiency of the program Tcost is the
obfuscation, can be defined as obfuscated program execution time T(ϑ(P)) divided by the
original program execution time T(P), i.e.,
F+S
Dp ϑ, GP � 100 . (8) T(ϑ(P))
T Tcost ϑ, P � . (9)
T(P)
Here,
Similarly, the code inflation degree Scost is the division of
(i) F � |C(ϑ(GP )) − C(GP )| represents that the figure program disk overhead after obfuscation S(ϑ(P)) to the
belongs to ϑ (GP ) but does not belong to C(GP ) (the original program disk overhead S(P):
cardinality of the edge)
S(ϑ(P))
(ii) S � |C(GP ) − C(ϑ(Gp ))| represents that the figure Scost ϑ, P � . (10)
belongs to C(GP ) but does not belong to ϑ(GP ) (the S(P)
cardinality of the edge) Each program in the test set is obfuscated. To accurately
(iii) T � F + S + |C(GP ) ∩ C(ϑ(GP ))|, where |C(GP ) ∩ C measure the running time of each program, they are exe-
(ϑ(GP ))| is the cardinalate of intersection set be- cuted one hundred times, and the average running time
tween ϑ (GP ) and C(GP ) before and after obfuscation is calculated.
12 Security and Communication Networks
50 50 50 50
Runtime (%)
0 0 0 0
2.0
1.52 1.56
1.51 1.48 1.48
1.5 1.38 1.38 1.42
1.32 1.35
Average Runtime
1.28
1.24 1.21 1.22 1.23 1.25
1.18
1.1 1.13 1.12
1.04 1.05 1.04 1.02
1.0
0.5
0.0
AES RSA Bzip Gzip Parser Twoif
SUB BCF
OCALLING FLA
Figure 17: Comparison of average cost of runtime.
Figure 17 shows the time cost comparison between the 6. Related Works
obfuscation methods in COOPS and OLLVM before and
after the obfuscation. The x-axis represents each test pro- COOPS is a code obfuscation method based on control flow
gram, and the y-axis represents the ratio of cost of time of transformation, which effectively protects the program
postobfuscation to preobfuscation. The results show that the control flow. In recent years, the new direction of new
cost of time of COOPS is close to that of OLLVM, and it has control flow obfuscation is to expand the obfuscation
almost no effect on obfuscation of simple programs such as process from intraprocedure to interprocedure. The control
AES and RSA, and the overall the cost of obfuscation time is flow transfer becomes the program and the system to co-
less than 25%. operate with. The state-of-art research summarized below is
Figure 18 compares the obfuscation methods in similar to the obfuscation method of function calling rela-
COOPS and OLLVM in terms of the cost of disk before tionship proposed in this paper.
and after the obfuscation. The x-axis represents each test In [38], Miguel et al. proposed a mechanism based on the
program, and the y-axis represents the ratio of cost of disk routing of function calls to realize the static obfuscation and
of postobfuscation to preobfuscation. The results show diversification of the function calling graph. Compared to
that the cost of disk of COOPS is close to that of OLLVM, the original software, the obfuscated calling graph had an
but the cost of disk of complex programs increases average difference of 25% in structure, and the fuzziness was
significantly. increased from 20% to 30%. In all cases, only 3% of the
Security and Communication Networks 13
2.5
1.5 1.47
1.4 1.39
1.29
1.11 1.16
1.03 1.04 1.05 1.05 1.09
1.0
0.5
0.0
AES RSA Bzip Gzip Parser Twoif
SUB BCF
OCALLING FLA
Figure 18: Comparison of code size.
execution time was spent. This mechanism allowed the ef- program. When the program was executed, the obfuscation
ficient reconstruction of the entire calling graph, thereby program looked up the address table and restores the
improving the protection level without significantly affecting function calling relationship.
the software performance. In [45], the authors proposed an obfuscating method
In [39, 40], the authors proposed an obfuscating method based on Return Oriented Programming (ROP). This method
to extract code snippets from original functions and store focused on transforming the direct control flow into an
them in another function, i.e., each function had code indirect one, dividing the code in the binary file into a basic
snippets of other functions. The correct execution of the block, and converting that basic block into code snippets
program was realized by unconditionally jumping to the through binary tools. All identified direct control flows were
address of the basic block stored in the new function. In [40], hidden by RET instruction, and all code snippets were added
Vivek et al. guaranteed the correct execution of the in- to the original file to regenerate the obfuscated binary file.
struction by returning the instruction to the end function of The control flow mixing algorithms mentioned above can
the basic block. Therefore, there were many return in- be divided into two groups. One is obfuscating a specific
structions in the main body of the function, so the reverse structure example in the function, as in [39, 40, 45], for
tool was unable to correctly identify the boundary of the jumping between basic blocks. The others, e.g., [40–44],
function. However, the program could execute correctly, scramble and obfuscate the function calling graph. The dif-
which effectively obfuscated the calling relationship between ference between COOPS and the studies covered above is that
the functions. our proposal obfuscates the jumps between basic blocks and
In [41], the authors obfuscated the branching condition function-calling graphs at the same time. Besides, according
containing the control flow logic of the program. They used to dynamic and static program analysis, there occurs strategic
lambda calculus to hide the computational semantics of the obfuscation. To conclude, COOPS provides more diversified
original condition and replaced the conditional jump in- software calling graphs, achieving a better hiding effect on the
struction with the Lambda calculus function calling. This key structures in the programs. This richer diversity can better
obfuscation method can protect the key branching condi- protect software from code reverse engineering.
tions from symbolic execution technology with only mod-
erate overhead. 7. Conclusions and Future Work
Tatsuya et al. imitated the obfuscation of the control flow
in functions, flattened and obfuscated the function calling This paper introduces COOPS, a code protection method
graph, employed the switchfunc function, and used random based on the obscure semantics. COOPS starts from pro-
numbers to ensure the proper scheduling of functions [42]. gram semantics, regards functions as basic semantic units.
However, this method was easily influenced by dynamic The switch relationship between the intrafunction control
analysis and had certain obfuscating features. flow and the interfunction calling is established, in which the
The algorithm proposed in [43] generating the bogus original semantic level of the program is destroyed. Ex-
function calling graph, which changed the called function periments show that COOPS can effectively change the
before running the target program, and the program called control flow structure of the program, and has excellent
the Hook method at runtime to ensure its correct execution. resistance to similarity analysis techniques such as Asm2vec.
Although this method effectively resisted the static analysis The difference in the degree of program calling relationship
of programs, it failed to resist the dynamic analysis. between before and after obfuscation reaches more than
In [44], the authors removed the edge in the function 90%. COOPS makes the reverse analyzer have to face more
calling graph, i.e., the address of the calling function was abstract functional units and significantly increases the cost
stored in a table at a different position from the obfuscated of code reverse engineering.
14 Security and Communication Networks
However, COOPS only implements inlining and out- [12] X. Xie, “Mixed obfuscation of overlapping instruction and
lining, and the obfuscation strategy is relatively simple. We self-modify code based on hyper-chaotic opaque predicates,”
can expand it in future work to increase the diversification of in Proceedings of the 2014 Tenth International Conference on
its obfuscation techniques. Computational Intelligence and Security, IEEE, Yunnan,
China, November 2014.
[13] V. Sergeichik and I. Alexander, “Implementation of opaque
Data Availability predicates for FPGA designs hardware obfuscation,” Journal
of Information, Control and Management Systems, vol. 12,
The source code and the experimental results are available at
no. 2, 2014.
GitHub (https://github.com/Rookiellvm/COOPS). [14] Y. Yubo, F. Wenqing, H. Wei, X. Guoai, and Y. Yixian, “The
research of multi-point function opaque predicates obfus-
Conflicts of Interest cation algorithm,” Applied Mathematics & Information Sci-
ences, vol. 8, pp. 3063–3070, 2014.
The authors declare that they have no conflicts of interest. [15] D. Xu, M. Jiang, and D. Wu, “Generalized dynamic opaque
predicates: a new control flow obfuscation method,” in
Proceedings of the International Conference on Information
Acknowledgments
Security, Springer, Bali, Indonesia, September 2016.
This work was supported by the National Key Research and [16] K. Yakdan, “No More Gotos: Decompilation Using Pattern-
Development Project (2016YFB08011601). independent Control-Flow Structuring and Semantic-Pre-
serving Transformations,” in Proceedings of the The 2015
Network and Distributed System Security (NDSS) Symposium,
References NDSS, San Diego, CA, USA, February 2015.
[17] T. László and Á. Kiss, “Obfuscating C++ programs via control
[1] M. Al Kadri, M. Nassar, and H. Safa, “Transfer learning for flow flattening,” Annales Universitatis Scientarum Budapes-
malware multi-classification,” in Proceedings of the 23rd In-
tinensis de Rolando Eötvös Nominatae Sectio Computatorica,
ternational Database Applications & Engineering Symposium,
vol. 30, no. 1, pp. 3–19, 2009.
Athens, Greece, June 2019.
[18] J. Cappaert and B. Preneel, “A general model for hiding
[2] M. Nassar and H. Safa, “Throttling malware families in 2d,”
2019, https://arxiv.org/abs/1901.10590. control flow,” in Proceedings of the Tenth Annual ACM
[3] Y. Awad, M. Nassar, and H. Safa, “Modeling malware as a Workshop on Digital Rights Management, Chicago Illinois
language,” in Proceedings of the 2018 IEEE International USA, October 2010.
Conference on Communications (ICC), IEEE, Kansas City, [19] J. Yi, “A security model and implementation of embedded
MO, USA, May 2018. software based on code obfuscation,” in Proceedings of the
[4] H. Safa, M. Nassar, and Wael Al Rahal Al Orabi, “Bench- 2020 IEEE 19th International Conference on Trust, Security
marking convolutional and recurrent neural networks for and Privacy in Computing and Communications (TrustCom),
malware classification,” in Proceedings of the 2019 15th In- IEEE, Guangzhou, China, January 2020.
ternational Wireless Communications & Mobile Computing [20] B. Johansson, P. Lantz, and M. Liljenstam, “Lightweight
Conference (IWCMC), IEEE, Tangier, Morocco, June 2019. dispatcher constructions for control flow flattening,” in
[5] S. Schrittwieser and S. Katzenbeisser, “Code obfuscation Proceedings of the 7th Software Security, Protection, and Re-
against static and dynamic reverse engineering,” International verse Engineering/Software Security and Protection Workshop,
Workshop on Information Hiding, Springer, Berlin, Heidel- Orlando FL USA, December 2017.
berg, 2011. [21] P. Rajba and W. Mazurczyk, “Data hiding using code ob-
[6] K. Kuang, Z. Tang, X. Gong, D. Fang, X. Chen, and Z. Wang, fuscation,” in Proceedings of the The 16th International
“Enhance virtual-machine-based code obfuscation security Conference on Availability, Reliability and Security, Vienna,
through dynamic bytecode scheduling,” Computers & Secu- Austria, August 2022.
rity, vol. 74, pp. 202–220, 2018. [22] P. Ahire and J. Abraham, “Secure Cloud Model for Intellectual
[7] C. Collberg, C. Thomborson, and D. Low, A Taxonomy of Privacy protection of Arithmetic Expressions in Source Codes
Obfuscating transformations, Department of Computer Sci- Using Data Obfuscation Techniques,” Theoretical Computer
ence, The University of Auckland, New Zealand, 1997.
Science, vol. 922, 2022.
[8] C. Collberg, T. Clark, and L. Douglas, “Manufacturing cheap,
[23] M. I. Sharif, “Impeding malware analysis using conditional
resilient, and stealthy opaque constructs,” in Proceedings of the
code obfuscation,” NDSS, San Diego, CA, USA, February
25th ACM SIGPLAN-SIGACT Symposium on Principles of
2008.
Programming Languages, Boston, MA, USA, January 1998.
[24] X. Xie, B. Lu, D. Gong, X. Luo, and F. Liu, “Random table and
[9] G. Myles and C. Collberg, “Software watermarking via opaque
predicates: implementation, analysis, and attacks,” Electronic hash coding-based binary code obfuscation against stack trace
Commerce Research, vol. 6, no. 2, pp. 155–171, 2006. analysis,” IET Information Security, vol. 10, no. 1, pp. 18–27,
[10] H. Xu, “Manufacturing resilient bi-opaque predicates against 2016.
symbolic execution,” in Proceedings of the 2018 48th Annual [25] C. Linn and S. Debray, “Obfuscation of executable code to
IEEE/IFIP International Conference on Dependable Systems improve resistance to static disassembly,” in Proceedings of the
and Networks (DSN), IEEE, Luxembourg, June 2018. 10th ACM Conference on Computer and Communications
[11] A. Moser, C. Kruegel, and E. Kirda, “Limits of static analysis Security, Singapore, March 2003.
for malware detection,” in Proceedings of the Twenty-Third [26] Z. Sha, “Model of execution trace obfuscation between
Annual Computer Security Applications Conference (ACSAC threads,” IEEE Transactions on Dependable and Secure
2007), IEEE, Miami Beach, FL, USA, October 2007. Computing, 2021.
Security and Communication Networks 15
[27] K. Yadav, “Source code obfuscation: novel technique and Distributed Computing (SNPD), IEEE, Las Vegas, NV, USA,
implementation,” ICT Systems and Sustainability, pp. 197– September 2014.
204, Springer, Singapore, 2022. [44] L. Jones, “Flowtables: program skeletal inversion for defeat of
[28] M. Hataba, A. Sherif, and R Elkhouly, “Enhanced obfuscation interprocedural analysis with unique metamorphism,” in
for software protection in autonomous vehicular cloud Proceedings of the 5th Program Protection and Reverse En-
computing platforms,” IEEE Access, vol. 10, Article ID 33943, gineering Workshop, Los Angeles, CA, USA, December 2015.
2022. [45] D. Mu, “ROPOB: obfuscating binary code via return oriented
[29] C. Lattner and V. Adve, “LLVM: a compilation framework for programming,” in Proceedings of the International Conference
lifelong program analysis & transformation,” in Proceedings of on Security and Privacy in Communication Systems, Springer,
the International Symposium on Code Generation and Opti- Washington, WA, USA, September 2017.
mization, IEEE, San Jose, CA, USA, March 2004.
[30] S. Ding, C. M. Benjamin, and P. Charland, “Asm2vec:
boosting static representation robustness for binary clone
search against code obfuscation and compiler optimization,”
in Proceedings of the 2019 IEEE Symposium on Security and
Privacy (SP), IEEE, San Francisco, CA, USA, May 2019.
[31] L. Massarelli, “Safe: self-attentive function embeddings for
binary similarity,” in Proceedings of the International Con-
ference on Detection of Intrusions and Malware, and Vul-
nerability Assessment, Springer, Lisbon, Portugal, June 2019.
[32] Y. Duan, “DeepBindiff: Learning Program-wide Code Rep-
resentations for Binary Diffing,” in Proceedings of the Network
and Distributed System Security Symposium, San Diego,
California, USA, February 2020.
[33] BinDiff, “BinDiff,” 2001, https://www.zynamics.com/
software.html.
[34] A. L. Barabási, “Network science,” Philosophical Transactions
of the Royal Society A: Mathematical, Physical & Engineering
Sciences, vol. 371, Article ID 20120375, 1987 pages, 2013.
[35] P. Junod, “Obfuscator-LLVM--software protection for the
Masses,” in Proceedings of the 2015 IEEE/ACM 1st Interna-
tional Workshop on Software Protection, IEEE, Florence, Italy,
May 2015.
[36] J. L. Henning, “Spec CPU2000: measuring CPU performance
in the new millennium,” Computer, vol. 33, no. 7, pp. 28–35,
2000.
[37] B. Anckaert, “Program obfuscation: a quantitative approach,”
in Proceedings of the 2007 ACM Workshop on Quality of
protection, Alexandria, VA, USA, October 2007.
[38] M. Rodrı́guez-Veliz, Y. Nuñez-Musa, and R. Sepúlveda-Lima,
“Call graph obfuscation and diversification: an approach,” IET
Information Security, vol. 14, no. 2, pp. 241–252, 2020.
[39] V. Balachandran, W. K. Ng, and S. Emmanuel, “Function level
control flow obfuscation for software security,” in Proceedings
of the 2014 Eighth International Conference on Complex, In-
telligent and Software Intensive Systems, IEEE, Birmingham,
July 2014.
[40] V. Balachandran, S. Emmanuel, and W. K. Ng, “Return
oriented obfuscation,” in Proceedings of the Eighth Int. Conf.
On Networks & Communications (NETCOM-2016), Sydney,
NSW, Australia, October 2016.
[41] P. Lan, “Lambda obfuscation,” in Proceedings of the Inter-
national Conference on Security and Privacy in Communi-
cation Systems, Springer, Washington, WA, USA, October
2017.
[42] T. Toyofuku, T. Tabata, and K. Sakurai, “Program obfuscation
scheme using random numbers to complicate control flow,”
in Proceedings of the International Conference on Embedded
and Ubiquitous Computing, Springer, Berlin, Heidelberg,
March 2005.
[43] K. Fukuda and H. Tamada, “An obfuscation method to build a
fake call flow graph by hooking method calls,” in Proceedings
of the 15th IEEE/ACIS International Conference on Software
Engineering, Artificial Intelligence, Networking and Parallel/