Professional Documents
Culture Documents
Abstract—Because of the financial and other gains attached up a keyboard logger and stealing personal information etc.
with the growing malware industry, there is a need to automate Antimalware software detects and neutralizes the effects
the process of malware analysis and provide real-time malware of a malware. There are two basic detection techniques
detection. To hide a malware, obfuscation techniques are used.
One such technique is metamorphism encoding that mutates [13]: anomaly-based and signature-based. Anomaly-based
the dynamic binary code and changes the opcode with every detection technique uses the knowledge of the behavior of
run to avoid detection. This makes malware difficult to detect a normal program to decide if the program under inspection
in real-time and generally requires a behavioral signature for is malicious or not. Signature-based detection technique
detection. In this paper we present a new framework called uses the characteristics of a malicious program to decide
MARD for Metamorphic Malware Analysis and Real-Time
Detection, to protect the end points that are often the last if the program under inspection is malicious or not. Each
defense, against metamorphic malware. MARD provides: (1) of the techniques can be performed statically (before the
automation (2) platform independence (3) optimizations for program executes), dynamically (during or after the program
real-time performance and (4) modularity. We also present a execution) or both statically and dynamically (hybrid).
comparison of MARD with other such recent efforts. Experi- Detecting whether a given program is a malware is
mental evaluation of MARD achieves a detection rate of 99.6%
and a false positive rate of 4%. an undecidable problem [6], [16]. Antimalware software
detection techniques are limited by this theoretical result.
Keywords-End Point Security; Metamorphism; Malware Malware writers exploit this limitation to avoid detection.
Analysis and Detection; Control Flow Analysis; Automation;
In the early days the malware writers were hobbyists
but now the professionals have become part of this group
I. I NTRODUCTION
because of the financial gains [5] attached to it.
End point protection is often the last defense against a One of the basic techniques used by malware writers
security threat. An end point can be a desktop, a server, a is obfuscation [15]. Such a technique obscures a code to
laptop, a kiosk or a mobile device that connects to a net- make it difficult to understand, analyze and detect malware
work (Internet). Recent statistics by the ITU (International embedded in the code.
Telecommunications Union) [14] show that the number of Initial obfuscators were simple and were detected by
Internet users in the world have increased from 20% in 2006 simple signature-based detectors. To counter these detectors
to 35% (almost 2 billion in total) in 2011. According to the obfuscation techniques have evolved in sophistication
the 2011 Symantec Internet security threat report [8] there and diversity. Currently there are three categories of
was an 81% increase in the malware attacks over 2010, and obfuscation techniques [21]: packing, polymorphism and
403 million new malware were created, a 41% increase over metamorphism.
2010. In 2012 there was a 42% increase in the malware
attacks over 2011. With these increases and the anticipated Packing is a technique where a malware is packed
future increases, these end points pose a new security (compressed) to avoid detection. Unpacking needs to
challenge [23] to the security professionals and researchers be done before the malware can be detected. Current
in industry and in academia, to devise new methods and antimalware tools normally use entropy analysis [21] to
techniques for malware detection and protection. detect packing, but to unpack a program they must know
A broad definition of malware, also called malicious code, the packing algorithm used to pack the program. Packing
is used in the literature that includes viruses, worms, spy- is also used by legitimate software companies to distribute
wares and trojans. Here we use one of the earliest definitions and deploy their software products.
by Gary McGraw and Greg Morrisett [19]: Malicious code
is any code added, changed, or removed from a software Polymorphism is an encryption technique that mutates
system in order to intentionally cause harm or subvert the the static binary code to avoid detection. When an infected
intended function of the system. A malware carries out program executes, the malware is decrypted and written
activities such as setting up a back door for a bot, setting to memory for execution. With each run of the infected
program, a new version of the malware is encrypted and how parallelization and CFG reduction reduces the runtime
stored for the next run. This results in a different malware of MARD. In Section IV we evaluate the performance
signature with each new run of the program. The changed of MARD using an existing malware dataset. Section V
malware keeps the same functionality (i.e the opcode is describes the most recent efforts of detecting metamorphic
semantically the same for each instance). It is possible malware and compares them to MARD. We conclude in
for a signature-based technique to detect this similarity of Section VI.
signatures at runtime.
II. T HE MARD F RAMEWORK
Metamorphism is a technique that mutates the dynamic Our primary goal is to detect automatically in real-time
binary code to avoid detection. It changes the opcode with both known and unknown malware. We present in this sec-
each run of the infected program and does not use any tion the design principles underlying the MARD framework
encryption or decryption. The malware never keeps the and describes the proposed malware detection approach.
same sequence of opcodes in the memory. This is also
called dynamic code obfuscation. There are two kinds of A. MARD Design
metamorphic malware defined in [21] based on the channel Most existing malware use binaries to infiltrate a computer
of communication used: Closed-world malware, that do system. Binary analysis is the process of automatically
not rely on external communication and can generate the analysing the structure and behavior of a binary program.
newly mutated code using either a binary transformer or a We use binary analysis for malware detection.
metalanguage. Open-world malware, that can communicate To achieve platform independence, automation and opti-
with other sites on the Internet and update themselves with mization a binary program is first disassembled and trans-
new features. lated to an intermediate language. We introduce in [3], a new
intermediate language for malware analysis named MAIL.
As is clear from the above discussion out of the three MAIL is the acronym for Malware Analysis Intermediate
malware groups mentioned above, metamorphic malware are Language. In [2], we explain in detail with examples how to
getting more complex and pose a special threat and new translate a x86 and an ARM assembly program into a MAIL
challenges to the end point security. We propose in this paper program. For indirect jumps and calls (branches whose target
a new framework for real-time detection of metamorphic is unknown or cannot be determined by static analysis) only
malware using subgraph isomorphism. The proposed frame- a change in the source code can change them, so it is safe
work is named MARD for Metamorphic Malware Analysis to ignore these branches for malware analysis where the
and Real-Time Detection. change is only carried out in the machine (binary) code.
To provide continuous protection to an end point a secu- We ignore indirect jumps and mark them as UNKNOWN
rity software needs to be operated and threats need to be when translating them to MAIL. The MAIL program is then
detected in real-time. Antimalware provides protection from annotated with patterns. We then build a CFG (control flow
malware in two ways: graph) of the annotated MAIL program.
1) They can provide protection by detecting a malware Some of the major characteristics and advantages of the
before the software is installed. All the incoming MARD framework are as follows:
network traffic is monitored and scanned for malware.
Depending on the methods used this continuous mon- Platform independence: To make modern compilers [1],
itoring and scanning slows down a computer consider- [20] platform independent they are divided into two major
ably, which is not practical and desirable. This is one components: a front end and a back end. The design of the
of the main reasons this type of protection is not very MARD framework follows the same principle. In compilers
popular. the same C++ front end can be used with different back
2) They can provide protection by detecting a malware ends to generate code for different platforms such as x86,
during or after the software installation. A user can ARM and PowerPC etc. In the case of MARD the same
scan different files and parts of the computer as and back end can be used with different front ends to detect
when he/she desires. This type of protection is much malware for different platforms (Windows and Linux etc).
easier to use and is more popular. For example, we can implement a front end for the PE
In this paper our emphasis is on the second option. The (Windows executable) files and another front end for the
general problem of detecting metamorphic malware is NP- ELF (Linux executable) files. Both these front ends generate
complete [26]. That means there exist some metamorphic their output in our intermediate language (i.e. MAIL) that
malware whose detection is NP-complete. is used by the back end. Programs compiled for different
architectures such as Intel x86 and ARM (the two most
The remainder of the paper is organized as follows. popular architectures) can be translated to MAIL. Examples
Section II describes MARD in detail. In Section III we show of such translations are described in [2]. So we only need
to implement one back end that is able to perform analysis Data Set
independence.
Template
Malware
Optimization: The main purpose of the optimizations Templates
analysis such as NOP instructions etc. (2) Generating an Unpacked binary ACFG2
malware. We also use parallelization and CFG reduction to Optimized code Results
MAIL1 Report
After a program sample is translated to MAIL, an an- NOTE: The component “Unpacker” is not implemented in this version of the Malware Detector
Table I
d i j RUNTIME OPTIMIZATION BY PARALLELIZING THE Subgraph
Matching COMPONENT USING DIFFERENT NUMBER OF THREADS
e k m 2 Cores 4 Cores
Number of Runtime Number of Runtime
Threads Reduced By Threads Reduced By
exit exit
2 3.20 times 4 3.98 times
4 5.92 times 8 4.11 times
(a) A malware (b) The malware embedded in-
sample side a benign program 8 7.20 times 32 5.95 times
Figure 2. An example of subgraph matching. The graph in Figure (a) 16 5.72 times 64 7.76 times
is matched as a subgraph of the graph in Figure (b). 32 5.64 times 128 7.53 times
250 5.41 times 250 6.37 times
III. RUNTIME O PTIMIZATION 8 14.87 times 64 14.53 times
There are two components in the MARD framework that Machines used:
consumes most of its runtime; one is the Translator (part of Intel Core i5 CPU M 430 (2 Cores) @ 2.27 GHz with 4GB of RAM
the MAIL Generator in Figure 1) and the other is the Sub- and Windows 8 Professional installed
Intel Core 2 Quad CPU Q6700 (4 Cores) @ 2.67 GHz with 4GB of
graph Matching (part of the Similarity Detector in Figure 1). RAM and Windows 7 Professional installed
The Translator reads each assembly instruction one by one
and translates it to the MAIL language. The translation time
increases with the size of the binary, and provide limited As we increased the number of threads the reduction
avenue for optimization. The Subgraph Matching component in the runtime also increased upto a certain number (8 in
matches the CFG against all the malware sample graphs. As 2 Cores and 64 in 4 Cores) of threads, after which the
the number of nodes in the graph increases the Subgraph reduction in the runtime started decreasing. As we increased
Matching runtime increases. The runtime also increases with the number of benign samples from 30 to 1387 (the last
the increase in the number of malware samples but provide row in Table I) the runtime only reduced by almost 2 times.
some options for optimization. We use two techniques to Beside other factors such as the number of CPUs/Cores and
improve the runtime of MARD, namely, parallelization and the memory available to the application, the other reasons
CFG reduction, described in the following. for this reduction in the runtime are the fact that more time
was spent on thread management and the increased sharing
A. Parallelization of information among threads.
Multicore processors are becoming popular nowadays. All Based on this experiment we developed Equation 1 to
the current desktops, laptops and even the energy efficient compute the maximum number of threads to be used by the
small mobile devices contain a multicore processor. Intel Subgraph Matching component. We also give an option to
has recently announced its Single-Chip Cloud Computer the user to choose the maximum number of threads used by
[18], a processor with 48 cores on a chip. Keeping in view the tool for the Subgraph Matching component.
b9
b10
b11 b13
b14
b12
b15
b35
3 b16
b185
b279
b270
b271
b273
b260
b34
b33
b32
b28
b19
b29
b17
b18
b31
b30
b184
b183
b371
b243
b372
b373
b197
b190
b187
b188
b189
b207
b199
b244
b198 b245
b114 b325
b278 b191
b178
b303 b113 b181 b342
b272 b193
b179 b324 b246
b293 b301 b115
b120 b121 b247
b300 b277
b343
b215 b302
b275 b180 b251
B. CFG Reduction
b475 b396 b233 b110
b146
b1 b109
b145 b395
Entry
b2 b111 b112
b232
b0
b169
b142 b231 b3
b170 b165
b143
b171 b144 b4 b5 b8
b172 b136 b7
b77 b230 b91 b167 b6
b175 b166
b393 b78
b327 b168
b394 b76
b326
b364 b75
b62
b317
b49 Exit
b50
called CFG shrinking. We reduce the number of blocks in (a) Before shrinking (b) After shrink-
ing
a CFG by merging them together. Two blocks are merged
only if the merging does not change the control flow of a Figure 4. Example of a CFG, of one of the functions of one of the
program. samples of the MWOR class of malware, before and after shrinking.
The CFG has been reduced from 484 nodes to 145 nodes.
Given two blocks A and B in a CFG, if all the paths that
reach node B pass through block A, and all the children of b545
b546
b455
b232
b329
b297
b296
b544b543
b542
b435
b272
b270
b271
b44
b43
b41
b42
b328
the graphs before and after shrinking are the same. More
b1279 b1162 b125 b12 b157
b766 b779 b1167 b1273
b667
b633
b632 b1278 b1166
b1161 b61b65b319b321b56b323b326b270b47b60b282
b666 b287 b284b41
b668 b686 b764 b1153 b1272 b355 b111b86
b109 b46b135
b1333 b728 b780 b1169 b127 b94
b665 b122 b121 b733 b837 b126 b105
b1332 b112 b158
b663 b727 b507 b114
b120 b726 b1170 b506 b267 b320 b57 b59b269b48b50
b874 b338b62b64 b283 b42 b33
b119 b725 b1171 b1173 b505 b356 b87b89 b32
b110 b45
b235 b875 b1172 b1175 b504 b128 b108 b286
b306 b876 b106 b160
b305 b877 b838 b1174 b113 b159 b26
b234 b268 b43 b100
b878 b10 b839 b131 b63 b58 b34 b134
b9 b83 b88 b49 b122
b879 b508 b103 b37 b132 b121
b8 b11 b515 b514 b8 b129 b35
b513 b102 b107 b44 b133 b120
b1322 b1225 b7 b27
b12 b775
b776 b1155 b104 b99 b18
b832
b833
b834 b512 b123
b160 b840 b1226 b1154 b516 b257 b17
b13 b161 b1320 b519 b130 b431 b29
b783 b1156 b509 b36 Entry b119
b15 b68 b1284 b781 b1222 b521 b263 b101
b14 b1323 b1321 b1223 b1211 b167 b84
b162 b159 b157 b841 b510 b517 b20
∼salam/PhD/cfgs.html.
b427 b1089 b1202
b1254 b1294 b1293 b1071 b1091 b1117 b368 Exit
b1200 b369
b1121 b367
b610 b845 b1070 b1088 b433
b1295 b1237 b374 b370 b426
b1092 b432 b327
b611 b1253 b1296 b1072 b1122 b80
b1298 b1301 b1302 b1239 b1025 b424 b427
b609 b1331 b1330 b1238 b1069 b1087 b221 b328 b329
b1297 b1300 b1024 b425
b614 b1304 b1093
b1236 b1228 b1073 b172 b206 b163
b1306 b987
b1299 b1308 b1303 b1026 b1068 b259 b161 b205
b1309 b1361 b1023 b354 b162 b164
b1329 b1240 b949 b1029 b1094 b353 b168 b428
b1305 b1310 b681 b846 b986 b1074 b352 b377 b220 b169
b1307 b1311 b1067 b351 b429
b1227 b988 b349 b171 b170
b999 b1360 b1027 b1022 b607 b350
b948 b396 b397
b1077 b395 b398 b166 b165
b682 b929 b1075 b399
b1362 b950 b207 b187 b430 b175
b1000 b680 b1030 b608 b198 b197 b237
b928 b1021 b186 b193 b192 b184
b1328 b947 b1028 b1078
b1359 b1076 b409 b2 b3
b997 b1181 b922 b1033
b1020 b418
b1001 b998 b678 b1363 b904 b930 b1031 b400 b238 b236 b337
b990 b679 b675
b676 b204
b946 b417 b401 b217 b322
b989 b651 b1186 b921 b291 b219
b240
b1327 b634
b996 b644 b923 b1032 b296 b293 b292 b402 b239
b1002 b1358 b1180 b903 b931 b203 b294 b416 b227
b945 b218 b235 b73
b991 b905 b205 b295 b241
b677 b594 b937 b387 b290 b253 b300
b386 b652 b388 b53 b228 b216 b301 b72
b994 b650 b1187 b635 b298 b403 b226
b1003 b1326 b1182
b1185 b924 b289 b225 b251 b254
b995 b645 b932 b297 b415 b242 b75
b1354 b288 b404 b302
b1251 b992 b1357 b593 b936 b85 b405 b196 b229 b74
b7 b563 b944 b390 b255 b303
b385 b1098 b940 b406 b179 b1
b1004 b993 b646 b595 b664 b925 b407 b408 b243 b224 b215 b252
b647 b1188 b939 b389 b195 b149 b148
b649 b1183
b1184 b396 b636 b230 b214 b234
b1252 b648
b1355 b562 b933 b178 b150
b372 b592 b943 b249
b1347 b1097 b244
b1005 b384 b1356 b253 b564 b1099 b176 b247 b380 b151
b601 b938 b177 b248
b395 b596 b637 b941 b376 b213
b239 b591 b245 b136
b397 b580 b934 b250
b233 b137
b1006 b382 b1348 b6 b225 b373 b603 b246 b381 b141
b1346 b24 b211 b1096 b375 b419
b1013 b383 b93 b199 b252 b565 b600 b420 b138
b602 b140
b37 b106 b124 b133 b164
b193
b166
b238
b254 b597
b942 b203 b181 b256
b54 b145
b143
b30 b1007
b1008 b240 b102
b398
b582
b581
b590
b606
b935 b357 b382 b182
b393
b259 b52 b146 b139
b375 b5 b224 b575 b604 b1095 b212b232b194
b231 b4 b144
b1012 b251 b599 b199 b142
b25 b23 b210b226
b198 b264
b574
b204
b180
b191 b394 b392 b55 b147
b94b92 b579 b383 b202
b107
b105 b194 b200
b163 b167
b255 b566 b598 b589 b174
b183
b185 b421
b346
b1009 b125
b123
b134 b132 b165 b237 b583 b91b391
b985 b200 b412
b399 b605 b410
b29 b241b3 b576 b97
b38 b32 b376
b374b4
b26 b29 b100b223
b227 b275 b577
b573 b384 b190
b211
b210 b92 b98
b22b95b91 b197 b263
b265 b578 b173 b201 b422 b347 b90
b927b419 b201 b256 b567 b411
b195 b236 b308 b5 b96
b104 b126 b338 b561 b169 b261 b93
b359 b131 b984 b571 b385
b31 b302
b242 b274 b572 b189
b413
b209
b276 b423 b332
b348
b27b28 b228b99 b276 b570
b266 b307 b6
b257 b568 b306
b926 b202 b619 b312
b28 b83
b420
b418 b423
b103 b360 b358 b361 b339 b342b585b560 b260
b262 b39
b305
b324
b243 b275 b325
b40 b76 b39 b229
b301
b303 b277 b414 b38
b208 b280b277 b310
b309
b267 b569 b313
b258 b314
b620 b618 b40 b315 b304
b417 b424 b422 b584
b586
b355 b357 b340 b343 b341 b311b316
b244 b14 b274
b304 b16b19 b278b25
b279
b281
b230 b268
b84
b41 b77 b425
b 426
b617 b66 b318
b317
b356
b354 b344 b347 b587 b15 b273
b27 b11
b13 b95b271
b231 b269 b68
b612b616 b69
b67
b588
b82 b33 b353 b345b 348b 346 b9 b10
b78
b71
b613b615 b70
b43
b36
b25 b35
b75
b46
b21
b22
b23
b24
b89
b90
b91
Exit
(a) Before shrinking (b) After shrinking
b47 b20 b88
b87
b86
b85
b19
b44
b45
b18
b17
b79
b80
Figure 5. Example of a CFG, of one of the functions of the Windows
b48
b49
b50
b51
b57
b16
b60
b61
b62 b45
disk free space utility program df.exe, before and after shrinking. The
b53 b15 b63
b26 b44
b12
b68
b20
b43
b27 b14
b69 b13
b11
b21
b70
b12
b15
b10
b71
b11
b72 b28
b9
b7
b8
b74
b73
b29
b10
b9
b16
b17
(for smaller dataset) and 100 times (for larger dataset), and
b5
b6
b30
b31 b32
b8
b7
b33
b34
still achieved a detection rate of 99.6% with a false positive
b35
b6
b5
b3 b36
b4
b2
b3 b37
b1
b2
b1
b0
b38
b0
Entry
Entry
b40
b39
Table VI
M ALWARE DETECTION RESULTS FOR SMALLER DATASET
Table VII
M ALWARE DETECTION RESULTS FOR LARGER DATASET
The perfect results of 100% and 0% should be validated with more number of samples than tested in the papers. The DR and FPR values for
Opcode-Graph are not directly mentioned in the paper. We computed these values by picking a threshold of 0.5 from the similarity score in the paper.
[12] Mahboobe Ghiasi, Ashkan Sami, and Zahra Salehi. Dynamic [22] B.B. Rad, M. Masrom, and S. Ibrahim. Opcodes Histogram
Malware Detection Using Registers Values Set Analysis. In for Classifying Metamorphic Portable Executables Malware.
Information Security and Cryptology, pages 54 – 59, 2012. In ICEEE, pages 209 – 213, September 2012.
[13] Nwokedi Idika and Aditya P. Mathur. A Survey of Malware [23] Thomas Raschke. The New Security Challenge: Endpoints.
Detection Techniques. Purdue University, 2007.
c International Data Corporation, 2005.
[14] International Telecommunications Union (ITU). The World [24] Neha Runwal, Richard M. Low, and Mark Stamp. Opcode
in 2011: ICT Facts and Figures.
c ITU, 2011. Graph Similarity and Metamorphic Detection. Journal in
Computer Virology, 8(1-2):37 – 52, May 2012.
[15] Cullen Linn and Saumya Debray. Obfuscation of Executable
Code to Improve Resistance to Static Disassembly. In ACM [25] Fu Song and Tayssir Touili. Efficient Malware Detection
CCS, pages 290 – 299, New York, NY, USA, 2003. ACM. Using Model-Checking. In Dimitra Giannakopoulou and Do-
minique Mry, editors, FM: Formal Methods, volume 7436 of
[16] David M. Chess and Steve R. White. An Undetectable Lecture Notes in Computer Science, pages 418–433. Springer
Computer Virus. Virus Bulletin Conference, September 2000. Berlin Heidelberg, 2012.
[17] Sudarshan Madenur Sridhara and Mark Stamp. Metamorphic [26] D. Spinellis. Reliable Identification of Bounded-Length
worm that carries its own morphing engine. Journal of Com- Viruses is NP-Complete. Information Theory, IEEE Trans-
puter Virology and Hacking Techniques, 9(2):49–58, 2013. actions on, 49(1):280 – 284, January 2003.
[18] Timothy G. Mattson, Michael Riepen, Thomas Lehnig, Paul [27] AnnieH. Toderici and Mark Stamp. Chi-squared Distance and
Brett, Werner Haas, Patrick Kennedy, Jason Howard, Sriram Metamorphic Virus Detection. Journal in Computer Virology,
Vangal, Nitin Borkar, Greg Ruhl, and Saurabh Dighe. The pages 1 – 14, 2012.
48-core SCC Processor: the Programmer’s View. In SC’10,
pages 1–11, Washington, DC, USA, 2010. IEEE Computer [28] P. Vinod, V. Laxmi, M.S. Gaur, and G. Chauhan. MOMEN-
Society. TUM: Metamorphic Malware Exploration Techniques Using
MSA Signatures. In IIT, pages 232 – 237, March 2012.
[19] Gary McGraw and Greg Morrisett. Attacking Malicious
Code: A Report to the Infosec Research Council. IEEE Softw., [29] Heng Yin and Dawn Song. Privacy-Breaching Behavior
17(5):33 – 41, September 2000. Analysis. In Automatic Malware Analysis, SpringerBriefs in
Computer Science, pages 27–42. Springer New York, 2013.
[20] Steven S. Muchnick. Advanced Compiler Design and Imple-
mentation. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 1997.