Welcome to Scribd, the world's digital library. Read, publish, and share books and documents. See more ➡
Standard view
Full view
of .
Add note
Save to My Library
Sync to mobile
Look up keyword
Like this
0 of .
Results for:
No results containing your search query
P. 1
Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs

Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs

Ratings: (0)|Views: 2,949|Likes:
Published by Silvio Cesare
2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11

Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs

Silvio Cesare and Yang Xiang
School of Information Technology Deakin University Burwoord, Victoria 3125, Australia {scesare, yang}@deakin.edu.au

Abstract— Static detection of polymorphic malware variants plays an important role to improve system security. Control flow has shown to be an effective characteristic that represents polymor
2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11

Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs

Silvio Cesare and Yang Xiang
School of Information Technology Deakin University Burwoord, Victoria 3125, Australia {scesare, yang}@deakin.edu.au

Abstract— Static detection of polymorphic malware variants plays an important role to improve system security. Control flow has shown to be an effective characteristic that represents polymor

More info:

Categories:Types, Research
Published by: Silvio Cesare on Jan 17, 2012
Copyright:Attribution Non-commercial


Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See More
See less





Malware Variant Detection Using Similarity Search over Sets of Control FlowGraphs
Silvio Cesare and Yang Xiang
School of Information TechnologyDeakin UniversityBurwoord, Victoria 3125, Australia{scesare, yang}@deakin.edu.au
Static detection of polymorphic malware variantsplays an important role to improve system security. Controlflow has shown to be an effective characteristic that representspolymorphic malware instances. In our research, we propose asimilarity search of malware using novel distance metrics of malware signatures. We describe a malware signature by theset of control flow graphs the malware contains. We proposetwo approaches and use the first to perform pre-filtering.Firstly, we use a distance metric based on the distance betweenfeature vectors. The feature vector is a decomposition of the setof graphs into either fixed size k-subgraphs, or q-gram stringsof the high-level source after decompilation. We also propose amore effective but less computationally efficient distancemetric based on the minimum matching distance. Theminimum matching distance uses the string edit distancesbetween programs’ decompiled flow graphs, and the linearsum assignment problem to construct a minimum sum weightmatching between two sets of graphs. We implement thedistance metrics in a complete malware variant detectionsystem. The evaluation shows that our approach is highlyeffective in terms of a limited false positive rate and our systemdetects more malware variants when compared to the detectionrates of other algorithms.
 Keywords-computer security; malware classification; staticanalysis; control flow; structuring; decompilation
Malicious software presents a significant challenge tomodern desktop computing. According to the SymantecInternet Threat Report [1], 499,811 new malware sampleswere received in the second half of 2007. F-Secureadditionally reported, “As much malware [was] produced in2007 as in the previous 20 years altogether“ [2]. Detection of malware before it adversely affects computer systems ishighly desirable. Static detection of malware is still thedominant technique to secure computer networks andsystems against untrusted executable content.Detecting malware variants improves signature baseddetection methods. The size of signature databases isgrowing exponentially, and detecting entire families of related malicious software can prevent the blowout in thenumber of stored malware signatures.
Malware classification and detection can be divided intothe tasks of detecting novel instances of malware, anddetecting copies or variants of known malware. Both tasksrequire suitable feature extraction, but the class of features to be extracted is often dependant on which problem is trying to be solved. Detecting novel samples primarily uses statisticalmachine learning. On the contrary, malware variant detectionuses the concept of similarity searching to query a databaseof known instances. These similarity queries or nearestneighbour searches are known in machine learning asinstance-based learning. Instance-based learning usesdistance functions to show dissimilarity and hence similarity between objects. If the distance function has themathematical properties of a metric, then algorithms existthat enable more efficient searching than an exhaustive set of queries over the database.Traditional and commercial malware detection systemshave predominantly utilised static string signatures [3-4] toquery a database of known malware instances. Static stringsignatures capture sections of the malwares’ raw file contentthat uniquely identifies them. String signatures have beenemployed because they have desirable performancecharacteristics that enable real-time use [5]. However, stringsignatures perform poorly when faced with polymorphicmalware variants. Exact string matching also ineffectivelyhandles closely related but non-identical signatures.Polymorphic malware variants have the property that the byte level content of the malware changes between instances.This can be the result of source code modifications or self mutation and obfuscation to the malware. Signatures thatrely on fixed byte level content are unable to capture theinvariant characteristics between these polymorphicinstances.Control flow has also been used to overcome thelimitations of byte level and instruction level classification[6]. Control flow has the desirable property that instructionlevel changes do not affect the resulting flowgraphs. Controlflow is observed to be more invariant in polymorphicmalware [7].Control flow represents the execution path a programmay take. Control flow information appears in two mainforms. The call graph represents the inter-procedural controlflow. The intra-procedural control flow is represented as a
2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11
978-0-7695-4600-1/11 $26.00 © 2011 IEEEDOI 10.1109/TrustCom.2011.26181
set of control flow graphs with one graph per procedure.Most malware analysis using control flow has focused onanalysing the call graph.The challenge of using graphs to show similarity is thataccurately measuring similarity such as when using thegraph edit distance does not perform in polynomial time.Therefore, research must investigate methods that makeusing graphs feasible for large scale malware detection. Thereal or near real-time constraints of Antivirus software makethis challenge even more significant. The challenge increasesagain when complex graph based objects are considered suchas the set of graphs signature our research investigates.
Our work is based on control flow classification but wemake the following contributions:
We propose a system that performs similaritysearching of sets of control flow graphs. We perform the search in close to real-time in theexpected case. No other system has demonstratednear real-time performance for this use of controlflow based signature.
We propose using fixed size k-subgraphs toconstruct a feature vector approximating a set of graphs. Using a vector representation improvesefficiency significantly and has not been used before.
We also propose the novel use of a polynomial timealgorithm to generate q-gram features of decompiled control flow graphs to construct afeature vector. These features are shown to havemore accuracy than k-subgraphs and can beconstructed faster than k-subgraphs. K-subgraphfeature construction is not known to take polynomial time. Q-grams of decompiled graphshave not been used before for malwareclassification.
We propose a distance metric between two sets of graphs based on the minimum matching distance.The minimum matching distance uses the linear sum assignment problem. It has been used previously with sets of vectors, but not sets of graphs. The minimum matching distance has not been used before in malware classification.We implement these ideas in a complete prototypesystem and perform an evaluation on a set of benign binariesand on real malware, including those malware that are packed and polymorphic. The evaluation demonstrates thesystem is effective and fast enough for potential desktopadoption.
Structure of the Paper 
The structure of this paper is as follows: Section IIdescribes the related work in static malware classification.Section III defines the malware classification problem andour approach. Section IV describes the unpacking andgeneral static analysis component of the system. Section Vdescribes the pre-filtering stage used in classification. This isa course grained classification process. Section VI describesthe fine grained classification algorithms. Section VII performs an evaluation using benign and malicious samples.Section VIII summarizes and concludes the paper.II.
 The related work reviewed in this paper is limited tostatic classification approaches. Dynamic approachesexamining run-time behaviour are not closely related to this paper and are therefore not examined.Automatic extraction of string signatures has been proposed and employed by commercial Antivirus software[3-4]. In these systems, all possible signatures of fixed sizeare extracted and then culled to eliminate strings that appear in benign samples. However, polymorphic malware makestring signatures prone to failure when the byte level contentchanges due to mutation, recompilation, and source codemodification of the malware.Code normalization has been investigated to canonizemalware before string scanning [8-9] and improve detectionrates of malware variants. In [8], static analysis eliminatedsuperfluous control flow. Additionally, instruction sequencesthat had no effect were also removed using a decision procedure to perform the analysis. Malware normalizationimproves detection but does not always effectively canonizea program to a unique form which affects the effectivenessand efficiency of the system.Fingerprinting malware based on opcode distributionshas shown to be partly effective [10]. An improved approachwas to examine byte sequences and opcode sequences in ann-gram analysis. N-grams and n-perms were proposed toidentify similarity between malicious programs to buildevolutionary trees [11]. By extracting relevant n-grams as particular features, feature vectors could be compared inorder to construct similarity. Machine learning andclassification extended these systems to detect unknownmalware in [12-13]. These systems improve the effectivenessof static string signatures, but instruction level classificationhas similar problems when the instruction stream changes toany significant degree.Malware classification using the basic blocks of a program has been investigated in [14]. A basic block represents a sequence of instructions without an interveningcontrol transfer instruction. The edit distance between basic blocks identifies similarity. Existence of a basic block in amalicious sample can be determined using an inverted indexor bloom filters. Previous research demonstrated that this iseffective at detecting some malware variants, but is noteffective when byte and instruction level changes occur. Inthe systems that implemented basic block classification,efficiency was not shown to be sufficient for desktopadoption.The ordering of system API calls of a program can beextracted and used for malware classification. [15] proposedusing association mining of API calls to detect unknownmalicious programs. The problem of this approach is thatcode packing can obscure the API calls [16].An approach to malware classification that is moreresistant to byte and instruction level changes is by usingcontrol flow as a feature. Combining data flow analysis and
control flow analysis was proposed in [17-18]. Annotationswere made to the control flow graphs to incorporateabstractions of the instructions and data flow. Theseannotated flowgraphs were compared to signatures, or automata, that describe the malware. If the malwaresignature is present in the query program, a malware instancecan be detected. In [19], value set analysis was used as aspecific data flow analysis to construct signatures. Our control flow graph based system is more effective atdetecting modified variants because it allows for errors in thesignature generation.Interprocedural control flow using the call graphs of a program have been compared to show similarity to existingmalware [6, 20-22]. Procedures or nodes in the call graph arefingerprinted and common nodes between two program callgraphs are identified. By identifying matching nodes, anapproximation is made to the graph edit distance. The editdistance allows similarity to be measured. An approach totransform the interprocedural control flow information into acontext free grammar, allowing for homomorphism testingusing string equality was also proposed in [23], but thistechnique does not allow for approximate matches.An alternative approach to using the call graph is usingthe control flow graphs. Ignoring the edges in the call graphand identifying exact or isomorphic control flow graphs was proposed for a real-time system and shown to be effective[24]. Intraprocedural control flow using the control flowgraphs have been compared by decompiling the graph tosource code and using the string edit distance [25], but wasnot as efficient as exact matching. Identifying ismorphismsand the maximum common subgraph of a whole programcontrol flow graph using tree automata was proposed in [26].A worm detection system fingerprinted control flow usingsubgraphs of a small and fixed size as a feature in [7].Our research employs a novel approach to finding thedistance between malware based on individual distances or similarity between control flow graphs. We allow for efficiently approximate matching of sets of graphs whereas previous research only provided efficient exact matching.Our research additionally extends the concept of usesubgraphs as a feature by incorporating it into larger classification system based representing a set of graphs asfeature vectors and consequently using similarity searching.We also demonstrate a more effective feature using the q-grams of decompiled control flow graphs.III.
 Problem Statement 
 New programs that are discovered on the host system areinspected to determine if they are malicious or benign.Unknown malware are detected by calculating their similarity to existing malware. A high similarity identifies amalicious variant. Existing malware are collected fromhoneypots and other malicious sources to construct adatabase of malware signatures. The described malwarevariant detection problem is equivalent to the softwaresimilarity search problem.The software similarity problem is to determine if  program p is a copy or derivative of program q. It is based onthe definition proposed in [27]. A workflow is shown in Fig.1. The software similarity problem is extended to operateover a database of programs. We use the nearest neighbour search.
Our Approach
Our approach builds a signature or birthmark of amalware based on the set of control flow graphs it has. Wecompare signatures using distance metrics to showsimilarity. We take advantage of the efficiency of usingfeature vectors to represent a program’s control flowinformation. Because a feature vector is a course grainedview of the program’s control flow and introduces falsedetections, we incorporate an additional and more accuraterepresentation using the minimum matching problem tocompare two sets of graphs.Malware is first unpacked to remove obfuscations.Control flow is reconstructed and the control flow graphsdecompiled and structured into strings. Malware variants aredetected by identifying existing malware the query programsare related to. Pre-filtering is used to provide a list of  potentially related malware. The pre-filtering algorithm is based on constructing a feature vector to represent the query programs and malware. Either of two algorithms can be usedto extract features. Firstly, subgraphs of size k are used torepresent features. Alternatively, q-grams are extracted fromthe strings representing the structured graphs. Using either algorithm for feature extraction, the most relevant featuresare used to construct a feature vector. The pairwise similarity between two feature vectors employs a distance function onthe pair of vectors. Vectors that are close to each other areindexed to the same bucket. To identify candidates with highsimilarity to existing malware, a metric similarity search is performed using Vantage Point trees [28].To compare the query to the candidate malware, a moreaccurate pairwise distance function is used. Each controlflow graph from one program is assigned a unique mappingto a flowgraph from the other program. This mappingintuitively shows the flowgraphs represent the same procedure. The mapping is assigned a weight and themappings chosen by considering it as an optimization problem. The mappings are chosen to minimize the sum of all weights associated with the mappings. The weight is thedistance between flowgraphs and is based on the stringdistance between structured graphs. This sum weight is
Program pProgram qBirthmarkBirthmarkSimilar?MATCH!Different
Figure 1. The software similarity problem.

Activity (5)

You've already reviewed this. Edit your review.
1 thousand reads
1 hundred reads
slowlove liked this
fatullahdonie liked this

You're Reading a Free Preview

/*********** DO NOT ALTER ANYTHING BELOW THIS LINE ! ************/ var s_code=s.t();if(s_code)document.write(s_code)//-->