Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more
Standard view
Full view
of .
Save to My Library
Look up keyword
Like this
0 of .
Results for:
No results containing your search query
P. 1
Malwise - An Effective and Efficient Classification System for Packed and Polymorphic Malware

Malwise - An Effective and Efficient Classification System for Packed and Polymorphic Malware

Ratings: (0)|Views: 1,235 |Likes:
Published by Silvio Cesare

More info:

Published by: Silvio Cesare on Apr 05, 2012
Copyright:Attribution Non-commercial


Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less





An Effective and EfficientClassification System for Packed andPolymorphic Malware
Silvio Cesare,
Student Member, IEEE 
, Yang Xiang,
Member, IEEE 
,and Wanlei Zhou,
Senior Member, IEEE 
Signature based malware detection systems have been a much used response to the pervasive problem ofmalware. Identification of malware variants is essential to a detection system and is made possible by identifying invariantcharacteristics in related samples. To classify the packed and polymorphic malware, this paper proposes a novel system,named Malwise, for malware classification using a fast application level emulator to reverse the code packing transformation,and two flowgraph matching algorithms to perform classification. An exact flowgraph matching algorithm is employed that usesstring based signatures, and is able to detect malware with near real-time performance. Additionally, a more effectiveapproximate flowgraph matching algorithm is proposed that uses the decompilation technique of structuring to generate stringbased signatures amenable to the string edit distance. We use real and synthetic malware to demonstrate the effectiveness andefficiency of Malwise. Using more than 15,000 real malware, collected from honeypots, the effectiveness is validated byshowing that there is an 88% probability that new malware is detected as a variant of existing malware. The efficiency isdemonstrated from a smaller sample set of malware where 86% of the samples can be classified in under 1.3 seconds.
Index Terms
Computer security, malware, control flow, structural classification, structured control flow, unpacking.
1 I
alware, short for malicious software, means a vari-ety of forms of hostile, intrusive, or annoying soft-ware or program code. Malware is a pervasiveproblem in distributed computer and network systems.According to the Symantec Internet Threat Report[1],499,811 new malware samples were received in the sec-ond half of 2007. F-
Secure additionally reported, “As
much malware [was] produced in 2007 as in the previous
20 years altogether“
[2]. Detection of malware is impor-tant to a secure distributed computing environment.The predominant technique used in commercial anti-malware systems to detect an instance of malware isthrough the use of malware signatures. Malware signa-tures attempt to capture invariant characteristics or pat-terns in the malware that uniquely identifies it. The pat-terns used to construct a signature have traditionally de-
rived from strings of the malware’s machine code and
raw file contents[3, 4]. String based signatures have re-mained popular in commercial systems due to their highefficiency, but can be ineffective in detecting malwarevariants.Malware variants often have distinct byte level repre-sentations while in principal belong to the same family of malware. The byte level content is different because smallchanges to the malware source code can result in signifi-cantly different compiled object code. In this paper wedescribe malware variants with the umbrella term of polymorphism. Polymorphism describes related malwaresharing a common history of code. Code sharing amongvariants can be derived from autonomously self mutatingmalware, or manually copied by the malware creator toreuse previously authored code.
1.1 Existing Approaches and Motivation
Static analysis incorporating n-grams[5, 6], edit distances[7], API call sequences[8], and control flow[9-11]have been proposed to detect malware and their polymorphicvariants. However, they are either ineffective or inefficientin classifying packed and polymorphic malware.A malware's control flow information provides a char-acteristic that is identifiable across strains of malwarevariants. Approximate matchings of flowgraph basedcharacteristics can be used in order to identify a greaternumber of malware variants. Detection of variants is pos-sible even when more significant changes to the malwaresource code are introduced.Control flow has proven effective[9, 11, 12], and fastalgorithms have been proposed to identify exact isomor-phic whole program control flow graphs[13]and relatedinformation[14], yet approximate matching of programstructure has shown to be expensive in runtime costs[15].Poor performance in execution speed has resulted in theabsence of approximate matching in endhost malwaredetection.To hinder the static analysis necessary for control flowanalysis, the malware's real content is frequently hiddenusing a code transformation known as packing[16]. Pack-ing is not solely used by malware. Packing is also used insoftware protection schemes and file compression for le-gitimate software, yet the majority of malware also usesthe code packing transformation. In one month during2007, 79% of identified malware was packed[17]. Addi-tionally, almost 50% of new malware in 2006 were re-packed versions of existing malware[18].
xxxx-xxxx/0x/$xx.00 © 200x IEEE
Cesare, Y. Xiang, and W. Zhou are with the School of InformationTechnology, Deakin University, Victoria, Australia. E-mail:s.cesare@deakin.edu.au, yang@deakin.edu.au, wanlei@deakin.edu.au.
Unpacking is a necessary component to perform staticanalysis and to reveal the hidden characteristics of mal-ware. In the problem scope of unpacking, it can be seenthat many instances of malware utilise identical or similarpackers. Many of these packers are also public, and mal-ware often employs the use of these public packers. Manyinstances of malware also employ modified versions of public packers. Being able to automatically unpack mal-ware in any of these scenarios, in addition to unpackingnovel samples, provides benefit in revealing the mal-
ware’s real content – 
a necessary component for staticanalysis and accurate classification.Automated unpacking relies on typical behaviour seenin the majority of packed malware
hidden code is dy-namically generated and then executed. The hidden codeis naturally revealed in the process image during normalexecution. Monitoring execution for the dynamic genera-
tion and execution of the malware’s hidden code can be
achieved through emulation[19]. Emulation provides asafe and isolated environment for malware analysis.Malware detection has been investigated extensively,however shortcomings still exist. For modern malwareclassification approaches, a system must be developedthat is not only effective against polymorphic and packedmalware, but that is also efficient. Unless efficient systemsare developed, commercial Antivirus will be unable toimplement the solutions developed by researchers. Webelieve combining effectiveness with real-time efficiencyis an area of research which has been largely ignored. Forexample, the malware classification investigated in[5, 6, 9-11]has no analysis or evaluation of system efficiency.We address that issue with our implementation andevaluation of Malwise.In this paper we present an effective and efficient sys-tem that employs dynamic and static analysis to auto-matically unpack and classify a malware instance as avariant, based on similarities of control flow graphs.
1.2 Contributions
This paper makes the following contributions. First, wepropose using string based signatures to represent controlflow graphs and a fast classification algorithm based onset similarity to identify related programs in a database.Second, we propose using two algorithms to generatestring signatures for exact and approximate identificationof flowgraphs. Exact matching uses a graph invariant as astring to heuristically identifiy isomorphic flowgraphs.Approximate matching uses decompiled structured flow-graphs as string signatures. These signatures have theproperty that similar flowgraphs have similar strings.String similarity is based on the string edit distance.Third, we propose and evaluate automated unpackingusing application level emulation that is equally capableof desktop Antivirus integration. The automated un-packer is capable of unpacking known samples and isalso capable of unpacking unknown samples. We alsopropose an algorithm for determining when to stop emu-lation during unpacking using entropy analysis. Finally,we implement and evaluate our ideas in a novel proto-type system called Malwise that performs automated un-packing and malware classification.
1.4 Structure of the Paper
The structure of this paper is as follows: Section 2 de-scribes related work in automated unpacking and mal-ware classification; Section 3 refines the problem defini-tion and our approach to the proposed malware classifica-tion system; Section 4 describes the design and imple-mentation of our prototype Malwise system; Section 5evaluates Malwise using real and synthetic malwaresamples; finally, Section 8 summarizes and concludes thepaper.
2 R
2.1 Automated Unpacking
Automated unpacking employing whole system emula-tion was proposed in Renovo[19]and Pandora's Bochs[20]. Whole system emulation has been demonstrated toprovide effective results against unknown malware sam-ples, yet is not completely resistant to novel attacks[21].
Renovo and Pandora’s Bochs both detect execution of 
dynamically generated code to determine when unpack-ing is complete and the hidden code is revealed. An alter-native algorithm for detecting when unpacking is com-plete was proposed using execution histograms in Hump-and-dump[22]. The Hump-and-dump was proposed aspotentially desirable for integration into an emulator.Polyunpack [16]proposed a combination of static anddynamic analysis to dynamically detect code at runtimewhich cannot be identified during an initial static analy-sis. The main distinction separating our work from previ-ously proposed automated unpackers is our use of appli-cation level emulation and an aggressive strategy to de-termine that unpacking is complete. The advantage of application level emulation over whole system emulationis significantly greater performance. Application levelemulation for automated unpacking has had commercialinterest[23]but has realized few academic publicationsevaluating its effectiveness and performance.Dynamic Binary Instrumentation was proposed as analternative to using an instrumented emulator[24]em-
 ployed by Renovo and Pandora’s Bochs. Omnipack 
[25] and Saffron[24]proposed automated unpacking usingnative execution and hardware based memory protectionfeatures. This results in high performance in comparisonto emulation based unpacking. The disadvantage of un-packing using native execution is evident on E-Mailgateways because a virtual machine or emulator is re-quired to execute the malware. A virtual machine ap-proach to unpacking, using x86 hardware extensions, wasproposed in Ether[26]. The use of such a virtual machineand equally to a whole system emulator is the require-ment to install a license for each guest operating system.This restricts desktop adoption which typically has a sin-gle license. Virtual machines are also inhibited by slowstart-up times, which again are problematic for desktopuse. The use of a virtual machine also prevents the systembeing cross platform, as the guest and host CPUs must bethe same.
2.2 Polymorphic Malware Classification
Malware classification has been proposed using a vari-ety of techniques[5, 6,27]. A variation of n-grams, coined n-perms has been proposed[6]to describe malware char-acteristics and subsequently used in a classifier. An alter-native approach is using the basic blocks of unpackedmalware, classified using edit distances, inverted indexesand bloom filters[7]. The main disadvantage of these ap-proaches is that minor changes to the malware sourcecode can result in significant changes to the resultingbytestream after compilation. This change can signifi-cantly impact the classification. Abstract interpretationhas been used to construct tree structure signatures[28] and model malware semantics[29]. However, these ap-proaches do not perform inexact matching. Windows APIcall sequences have been proposed[8]as an invariantcharacteristic, but correctly identifying the API calls canbe problematic[20]when code packing obscures the re-sult. Flowgraph based classification is an alternativemethod that attempts to solve this issue by looking atcontrol flow as a more invariant characteristic betweenmalware variants.Flowgraph based classification utilises the concept of approximating the graph edit distance, which allows theconstruction of similarity between graphs. The commer-cial system Vxclass[30]presents a system for unpackingand malware classification based on similarity of flow-graphs. The algorithm in Vxclass is based on identifyingfixed points in the graphs and greedily matchingneighbouring nodes[9-11]. BinHunt[31]provides a more thorough test of flowgraph similarity by soundly identify-ing the maximum common subgraph, but at reduced lev-els of performance and without application to malwareclassification. Identifying common subgraphs of fixedsizes can also indicate similarity and has better perform-ance[32]. SMIT[15]identifies malware with similar call graphs using minimum cost bipartite graph matching andthe Hungarian algorithm, improving upon the greedyapproach to graph matching investigated in earlier re-search. SMIT additionally uses metric trees to improveperformance on graph based database searches. Treeautomata to quickly recognize whole program controlflow graphs has also been proposed[13]but this ap-proach only identifies isomorphisms or subgraph ismor-phisms, and not approximate similarity. Context freegrammars have been extracted from malware to describecontrol flow[14], but this research only identifies wholeprogram equivalence. Clustering is related to classifica-tion and has been proposed for use on call graphs usingthe graph edit distance to construct similarity matricesbetween samples[33].
2.3 The Difference between Malwise and PreviousWork
Our research differs from previous flowgraph classifi-cation research by using a novel approximate control flowgraph matching algorithm employing structuring. We arethe first to use the approach of structuring and decompi-lation to generate malware signatures. This allows us touse string based techniques to tackle otherwise infeasiblegraph problems. We use an exact matching algorithmwhich performs in near real-time while still being able toidentify approximate matches at a whole program level.The novel set similarity search we perform enables thereal-time classification of malware from a large database.No prior related research has performed in real-time. Theclassification systems in most of the previous researchmeasure similarity in the call graph and control flowgraphs, whereas our work relies entirely on the controlflow graphs at a procedure level. Additionally distin-guishing our work is the proposed automated unpackingsystem, which is integrated into the flowgraph basedclassification system.
3 P
The problem of malware classification and variant detec-tion is defined in this Section. The problem summary is touse instance based learning and perform a similaritysearch over a malware database. Additionally defined inthis Section is an overview of our approach to design theMalwise system.
3.1 Problem Definition
A malware classification system is assumed to have ad-vance access to a set of known malware. This is for con-struction of an initial malware database. The database isconstructed by identifying invariant characteristics ineach malware and generating an associated signature tobe stored in the database. After database initialization,normal use of the system commences. The system has asinput a previously unknown binary that is to be classifiedas being malicious or non malicious. The input binaryand the initial malware binaries may have additionallyundergone a code packing transformation to hinder staticanalysis. The classifier calculates similarities between theinput binary and each malware in the database. The simi-larity is measured as a real number between 0 and 1 - 0indicating not at all similar and 1 indicating an identicalor very similar match. This similarity is a based on thesimilarity between malware characteristics in the data-base. If the similarity exceeds a given threshold for anymalware in the database, then the input binary is deemeda variant of that malware, and therefore malicious. If identified as a variant, the database may be updated toincorporate the potentially new set of generated signa-tures associated with that variant.
3.2 Our Approach
Our approach employs both dynamic and static analysisto classify malware. Entropy analysis initially determinesif the binary has undergone a code packing transforma-tion. If packed, dynamic analysis employing applicationlevel emulation reveals the hidden code using entropyanalysis to detect when unpacking is complete. Staticanalysis then identifies characteristics, building signa-tures for control flow graphs in each procedure. The simi-larities between the set of control flow graphs and thosein a malware database accumulate to establish a measureof similarity. A similarity search is performed on the mal-ware database to find similar objects to the query.

Activity (3)

You've already reviewed this. Edit your review.
1 hundred reads
1 thousand reads

You're Reading a Free Preview

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->