Survey in Static Detection of Malware

Silvio Cesare
School of Information Technology Deakin University Burwood, Victoria 3125, Australia

<silvio.cesare@gmail.com>

ABSTRACT
Malware continues to be a significant problem facing computer use in today’s world. Historically, Antivirus software has employed static signatures to detect instances of known malware. Signature based detection has fallen out of favour with many, and detection techniques based on identifying malicious program behaviour are now part of the Antivirus toolkit. However, static approaches to malware detection have been heavily researched and can employ modern fingerprints that significantly improve on the simple string signatures used in the past. Instance-based learning can allow the detection of an entire family of malware variants based on a single signature of static features. Statistical machine learning can turn the extracted features into a predictive Antivirus system able to detect novel and previously unseen malware samples. This paper surveys the approaches and techniques used in static malware detection.

Keywords
Malware classification.

1. INTRODUCTION
Malicious software is a significant problem that threatens the security of users on the internet. Today, malware is created by criminal gangs for financial gain. These criminals employ malware to steal credit card information to commit fraud, or to obtain illegal use of a computer to launch spam campaigns. A simple approach often used by criminals is to have unsuspecting users open a malicious email attachment.

To protect users from malware, detection of the threat before it is allowed to execute its malicious intent is a necessity. Behaviour blocking is a useful approach, but relying solely on the dynamic behaviour of a program may allow unwanted actions to be performed before the malware is detected. Running a program in a virtual machine or isolated sandbox to detect its intent is not always effective. Dynamic analysis can never reason about all potential behaviours. If the malware performs differently while being analysed, or can detect the analysis itself, then the malware has a high probability of escaping detection. Static analysis and detection provides a possible solution in the arsenal of defences.

Static signature based detection has been a dominant feature in Antivirus. Because of performance constraints, the most widely used signature is a string containing patterns of the raw file content [1, 2]. This allows for a string search [3] to quickly identify patterns associated with known malware. However, these patterns can easily be invalidated because minor changes to the malware source code have significant effects on the malware’s raw content. Thus, traditional signatures can prove ineffective when dealing with unknown variants. Modern approaches to signature generation involve less fragile and more versatile fingerprints. Program features are extracted that enable a more robust representation to detect an entire family of malware variants. Machine learning and statistical classification using those same program features can allow the detection of novel and unknown malware not belonging to previously identified families.

Static program analysis is undecidable for many problems concerning binaries, and a transformation of a compiled program known as code packing is often used by malware authors to hide the intent of the malware and make analysis more difficult. The packing process encrypts, compresses, or obfuscates the malware. The original unobfuscated code is restored at run time, or in the case of instruction virtualization, a byte code representing the original code is executed. In most cases, unpacking is a requirement for effective static malware classification and use of signatures. Automated unpacking has been partly successful, but for those cases where it cannot be achieved, it is sometimes better to mark those programs as likely to be malicious. Thus, even with packed samples, static detection of malware can still be an effective tool.

1.1 Structure of the Paper
This paper is structured as follows: Section 2 describes the taxonomy of program features useful to malware classification. Section 3 compares those features. Section 4 describes the approaches that can be employed in static malware classification and Section 5 describes the specific techniques, organised by feature. Section 6 identifies future trends. Finally, Section 7 concludes the survey.

2. TAXONOMY OF STATIC PROGRAM FEATURES
Malware classification and detection involves the extraction of features which are subsequently used to characterize the malware. Features may be extracted dynamically or statically. Dynamic approaches to malware classification involve monitoring execution of the programs and extracting features based on their behaviour. Static approaches extract features without program execution.

2.1 Object File Header Attributes
The object file header contains attributes which are often custom written during link editing and binary rewriting.

    lea    0x4(%esp),%ecx
    and    $0xfffffff0,%esp
    pushl  -0x4(%ecx)
    push   %ebp
    mov    %esp,%ebp
    push   %ecx
    sub    $0x24,%esp
    call   4011b0 <___main>
    movl   $0x0,-0x8(%ebp)
    jmp    40115f <_main+0x2f>

    movl   $0x4020a0,(%esp)
    call   4011b8 <_puts>
    addl   $0x1,-0x8(%ebp)

    cmpl   $0x9,-0x8(%ebp)
    jle    40114f <_main+0x1f>

    add    $0x24,%esp
    pop    %ecx
    pop    %ebp
    lea    -0x4(%ecx),%esp
    ret

Figure 1. Instructions and basic blocks.

2.2 Bytes
One of the simplest features that can be extracted from a program is the raw byte level content of the malware executable file [4]. An alternative source of content comes from the individual program sections in the binary, including the code and data segments.

2.3 Instructions
An executable program is constructed of code and data. The code is represented as assembly language, and extracting the assembly is the process of disassembly. The instruction level content of a program can be a more resilient representation than the byte level content if the instructions are considered by their type or mnemonic [5].

2.4 Basic Blocks
A basic block is a straight line sequence of code without an intervening control transfer instruction [6]. The basic block may be treated at the byte level, or at the instruction level. Additionally, data dependency within the basic block may be examined to construct a directed acyclic graph [7]. The basic blocks may also be grouped to form a set, or they may have additional structure imposed by the control flow graph.

2.5 Control Flow Graphs
The control flow graph is a directed graph, where the nodes are basic blocks [8]. The edges in the graph represent the possible control flow of the associated procedure. The control flow graph represents the intra-procedural control flow. A program may be considered a set of control flow graphs, or the control flow graphs may have additional structure as dictated by the call graph. Alternatively, control flow graphs may represent inter-procedural and intra-procedural control flow in a single graph. In this case, the graph is known as the inter-procedural control flow graph. It is possible to construct alternative or abstracted representations of the control flow graph. Loop nest trees, dominator trees, and control dependency graphs, which are different ways of representing control flow, can also be constructed [7].

2.6 Call Graph
Call graphs, like control flow graphs, model the possible execution paths and control flow in a program [9]. The call graph is a directed graph representing the inter-procedural control flow. As with the control flow graph, alternative or abstracted representations, such as dominator trees, are possible.

Figure 2. Control flow graph (left) and call graph (right).
[Figure: the control flow graph connects the basic blocks of the procedure shown in Figure 1; the call graph connects procedures Proc_0 to Proc_4.]

2.7 API Calls
Programs interface with the underlying operating system and libraries. The invocation of an API function from a known library can often be identified statically [10]. The API call sequence gives insight into the behaviour of the program.

2.8 Data Flow
The data flow of a program represents the set of possible values data may hold during program execution [11]. Many types of data flow analyses exist, including live variable analysis, reaching definitions, and value-set analysis. Each analysis looks at a particular property of the data at specific program points. Modelling the data flow requires that the control flow be successfully identified. A simpler model of data dependencies can be constructed as described in the section on basic blocks.

2.9 Procedure Dependence Graph
A procedure dependence graph combines the control dependencies and data dependencies of a procedure into a single graph.

2.10 System Dependence Graph
The system dependence graph is a collection of procedure dependence graphs, one for each procedure in the program.

3. COMPARISON OF STATIC PROGRAM FEATURES
Malware may be polymorphic, but certain static program features are known to be invariant under different polymorphic techniques. Byte and instruction level program features perform poorly when faced with polymorphic variations and mutations. Recompiling source code using different compile time options may result in syntactic changes including variable renaming and instruction substitution. Code normalization can sometimes reverse the effects of syntactic polymorphism and can work in practice, but it is not based on a sound technique. Additionally, the byte and instruction stream may change when minor semantic alterations are made to the malware source code. The advantage of byte level content as a program feature is that it does not depend on accurate static analysis of the program's semantics or structure. If the instruction stream is used, additional challenges are presented because it is known that perfect disassembly of an unknown image is undecidable on the x86 platform [12].

To avoid the problems of syntactic polymorphism, higher level abstractions of the program can be used. The control flow features, including control flow graphs and call graphs, are considered more invariant in polymorphic malware than byte and instruction level content [8]. However, opaque predicates (conditions that always evaluate to the same result but are hard to determine statically) may result in these features being altered. The detection of opaque predicates has been investigated, but it is not evident that this is entirely satisfactory, and a sound method of detection against all unknown predicates is not possible. For example, some constructions used to build opaque predicates are only strongly conjectured, not proven, to always evaluate to the same result. This implies that an automated approach to proving that such a predicate is constant is hard. The presence of pointers and indirection in assembly language also presents problems to static analyses, which may not have the precision required to construct a control flow graph or call graph with the degree of accuracy required for malware classification. For all its disadvantages, control flow has been shown to be an effective feature that is invariant in most current malware.

The use of API calls is another approach to solve the syntactic polymorphism problem. This approach has problems with malware that obscures the use of those calls, as is the case with the stolen bytes technique [13] introduced by code packing tools.

Data flow analysis is another high level abstraction, but it suffers from the same problems that static analyses face in the presence of pointers. The procedure and system dependence graphs have similar problems with pointers and indirection even when data dependencies of pointers are ignored. The dependence graphs also depend on accurate modelling of the instruction sequence. Representing the data dependency as a graph avoids problems such as register reassignment. The problem lies with the modelled instructions used in the data dependencies, which may be polymorphic and variant. Polymorphism is not handled effectively in this situation, although code normalization may help.

4. STATIC APPROACHES TO MALWARE CLASSIFICATION
Malware classification is the process of determining if an unknown binary belongs to the class of malicious programs or the class of benign programs.

4.1 Statistical Classification
A data mining approach to malware detection is to employ statistical classification. Each classification algorithm constructs a model, using machine learning, to represent the benign and malicious classes. In this approach, a labeled training set is required to build the class models during a process of supervised learning. Many statistical classification algorithms exist including Naive Bayes, Neural Networks, and Support Vector Machines. The key to statistical classification is to represent the malicious and benign samples in an appropriate manner to enable the classification algorithms to work effectively. Feature extraction is an important component of effective classification, and an associated feature vector that can accurately represent the invariant characteristics in the training sets and query samples is highly desirable.

4.2 Instance-Based Learning
Instance-based learning is a related and traditionally popular approach that can be employed wherein the query program is classified by identifying a high similarity to a known instance of malware in the training set. Traditional Antivirus utilises this approach when it performs signature based detection. The key component to perform classification using instance-based learning is a distance or similarity function between the objects representing samples and queries. For a distance function to be effective between objects, the objects must be modeled by a limited set of features that capture the invariant characteristics of the malicious and benign programs. In some cases, the distance function is replaced with a test for equality. However, testing only for equality reduces the effectiveness of the classification process when dealing with malware variants. Instance-based learning can additionally identify high similarity to benign or white-listed samples, depending on the aims of the classification.
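As a minimal sketch of instance-based classification, the following example compares a query's feature set with each known malware instance and reports the closest one if its similarity exceeds a threshold. The choice of the Jaccard similarity, the threshold of 0.5, the family names, and the API-name features are all assumptions made for illustration; they are not taken from any particular system surveyed here.

```python
from typing import Dict, FrozenSet, Optional, Tuple


def jaccard_similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Similarity in [0, 1]: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(query: FrozenSet[str],
             known_malware: Dict[str, FrozenSet[str]],
             threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (closest known instance, similarity), or (None, similarity)
    when no known instance is similar enough to the query."""
    best_name, best_score = None, 0.0
    for name, features in known_malware.items():
        score = jaccard_similarity(query, features)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Hypothetical training instances described by sets of static features.
known = {
    "family_a": frozenset({"CreateFile", "WriteFile", "RegSetValue", "connect"}),
    "family_b": frozenset({"OpenProcess", "WriteProcessMemory", "CreateRemoteThread"}),
}
# A variant of family_a with one feature changed.
variant = frozenset({"CreateFile", "WriteFile", "RegSetValue", "send"})
label, score = classify(variant, known)
```

Replacing `jaccard_similarity` with a test for equality would miss the variant, which illustrates why equality alone is less effective against malware variants.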

5.2.2 Kolmogorov Complexity
Kolmogorov complexity is a theoretical measure of the computational complexity, or minimum string length in a universal description language, required to represent an object or set of data. It is a theoretical measure that is not computable. To estimate the Kolmogorov complexity, an object may be compressed and concatenated with the associated decompression routine, to give the approximate minimum string length to describe the object. The observation, when this theory is related to malware, is that similar malware have similar measures of Kolmogorov complexity. This form of analysis occurs on the malware's raw file or section content. Estimating Kolmogorov complexity was proposed in peHash [14] by identifying the compression ratio of a malicious sample, which was subsequently used for clustering malware families. Another measure of similarity related to Kolmogorov complexity is the Normalized Compression Distance (NCD). The NCD was used in [15] to cluster worms into families. This approach, like peHash [14], was not used to classify samples as being benign or malicious, but to cluster malicious samples only. It was the observation in [16] that malware and benign programs can be classified according to a likeness to a compression model for each of the malicious and benign classes. In this research, it was proposed that two compression models be constructed from two training sets, one of malicious samples, and one of benign samples. To classify a query sample as being malicious or benign, the number of bits required to encode the query was calculated for each compression model. The query was classified by identifying the class that requires the least data to encode the query.
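The Normalized Compression Distance can be approximated with any real compressor. The sketch below uses zlib from the Python standard library as a stand-in compressor, which is an assumption of this example and not necessarily the compressor used in [15]. For inputs x and y with compressed sizes C(x), C(y) and C(xy), it computes NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The byte strings are synthetic placeholders for file or section content.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def pseudo_random_bytes(seed: int, length: int) -> bytes:
    """Deterministic, poorly compressible placeholder content."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(length))


base = pseudo_random_bytes(1, 2000)                    # stands in for a malware sample
variant = base[:1000] + b"\x90" * 16 + base[1000:]     # same content with inserted padding
unrelated = pseudo_random_bytes(2, 2000)               # stands in for a different program

distance_to_variant = ncd(base, variant)
distance_to_unrelated = ncd(base, unrelated)
```

Because the variant shares almost all of its content with the original, compressing the concatenation costs little more than compressing the original alone, so the distance to the variant is much smaller than the distance to the unrelated sample.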

4.3 The Similarity Search Used in Instance-Based Learning
A search of a database to find similar, but not necessarily identical objects to a query is known as a similarity search. The similarity search is a central aspect of instance-based learning when applied to malware detection and classification using a large number of malware signatures and training instances. Distance functions between objects that have the properties of a metric can employ the use of Metric Access Methods. A similarity search using metric access methods performs faster than exhaustive linear search and enables significantly larger databases without being restricted by an equivalent increase in running time. Metric access methods may use either static or dynamic databases. In dynamic Metric Access Methods, dynamic database operations, such as object insertion, can be effectively performed with reasonable performance expectations.
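To illustrate how a metric access method avoids an exhaustive scan, the following sketch implements a BK-tree, one simple dynamic index for integer-valued metrics. The BK-tree is chosen here only as an example and is not named in the surveyed work; the Hamming distance over equal-length fingerprint strings and the fingerprints themselves are likewise hypothetical. The triangle inequality lets the range search skip subtrees that cannot contain a match.

```python
from typing import Callable, Dict, List, Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings (a metric)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    """Dynamic metric index: children are keyed by their distance to the parent."""

    def __init__(self, distance: Callable[[str, str], int]) -> None:
        self.distance = distance
        self.root: Optional[Tuple[str, Dict[int, tuple]]] = None

    def insert(self, item: str) -> None:
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def search(self, query: str, radius: int) -> List[Tuple[int, str]]:
        """Return all (distance, item) pairs within the given radius of the query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append((d, item))
            # Triangle inequality: matches can only lie in these subtrees.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


fingerprints = ["aaaa", "aaab", "abbb", "bbbb", "abab"]  # hypothetical signatures
tree = BKTree(hamming)
for fp in fingerprints:
    tree.insert(fp)
near = tree.search("aaaa", 1)
```

Insertion is supported at any time, which corresponds to the dynamic database operations mentioned above.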

5.2.3 String Signatures
Static string signatures have been the dominant technique used in traditional Antivirus. String signatures represent patterns in a malware’s raw content used to uniquely identify it. In [1, 2] it was proposed to automatically extract the string signatures used by the detection system. The set of all likely string signatures of fixed size is extracted from the malware, and those that result in high similarity to a corpus of benign programs are removed from the signature candidate list. String signatures may use fast string matching algorithms to detect malicious instances [3]. Wildcards and regular expressions present extensions of string matching that can be used in the detection of malware variants. String signatures are efficient, and more effective than file hashing, but can be ineffective when faced with polymorphic malware variants.
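The candidate extraction and filtering step can be sketched as follows. This is a simplification under stated assumptions: exact substring occurrence in the benign corpus stands in for the "high similarity" test used in [1, 2], the signature size of 8 bytes is arbitrary, and the byte strings are made-up placeholders.

```python
from typing import Iterable, Set


def candidate_signatures(malware: bytes,
                         benign_corpus: Iterable[bytes],
                         size: int = 8) -> Set[bytes]:
    """All fixed-size byte windows of the malware that occur in no benign sample."""
    benign = list(benign_corpus)
    windows = {malware[i:i + size] for i in range(len(malware) - size + 1)}
    return {w for w in windows if not any(w in sample for sample in benign)}


def matches(sample: bytes, signatures: Set[bytes]) -> bool:
    """True if any signature occurs anywhere in the sample."""
    return any(sig in sample for sig in signatures)


# Hypothetical content: a header shared with benign programs plus a unique payload.
shared_header = b"MZ\x90\x00COMMONHEADER"
malware = shared_header + b"EVILPAYLOAD!"
benign_corpus = [shared_header + b"hello world"]

signatures = candidate_signatures(malware, benign_corpus)
```

A production scanner would match many signatures at once with a multi-pattern algorithm such as that of [3] instead of the linear `any` scan used here.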

5. CLASSIFICATION APPROACHES BY FEATURE
5.1 Object File Header Attributes
Identifying object file discrepancies to detect malware has sometimes been used by commercial Antivirus to detect suspicious binaries. peHash [14] proposed a similar technique by hashing object file features to cluster malware. The advantage of this approach is that it is highly efficient. The disadvantage is that using object file attributes predominantly identifies those attributes of the packing tool used when packing the malware. The concept is very similar to the techniques used when identifying code packing using object file features. Classifying a sample as malicious, based on the packer, is not necessarily accurate. It is not necessarily true that the presence of code packing indicates the malicious intent of a program.

5.2.4 Malware Normalization
To improve the effectiveness of string based signatures, malware normalization has been proposed. Normalizing malware before passing it to Antivirus software was investigated by Christodorescu et al. in [17]. Static analysis was carried out to eliminate unnecessary control flow as indicated by superfluous unconditional branches. Semantic nops were also removed from the malware by using decision procedures. At this point the malware was passed, now in a more canonical form, to Antivirus software. Another approach to the code normalization problem was to rewrite sequences of code using compiler optimisation techniques [18]. Expression propagation, dead code elimination, and expression simplification using algebraic identities were used. The intuition is that the process of an optimising compiler removes the redundancy of the original code and improves the terseness, resulting in a normalized representation. An approach using term rewriting was proposed in [19] where rewrite rules were constructed to model the malware transformations that occur during polymorphic and metamorphic mutation. From these, a normalizing rule-set was constructed that could rewrite the malware to a canonical or near canonical representation.

5.2 Byte Level Approaches
5.2.1 File Hashing
The simplest approach to malware detection is hashing the contents of the file and comparing that hash against a blacklist. This approach is widely used in commercial Antivirus. The disadvantage of using this approach on its own is that it fails to detect malware that has incurred any byte level alteration. However, the blacklisting of specific and unaltered malware instances is a useful technique that is easily and efficiently implemented.

The main disadvantage of classifying malware by the edit distance between basic blocks is that minor changes to the malware source code can result in significant changes to almost all basic blocks. Changes in compiler configuration and optimisations can equally result in large changes.

5.4.2 Membership Testing - Inverted Index and Bloom Filters
Gheorghescu also proposed finding identical basic block features shared between malware [6]. The inverted index and bloom filters provided for faster searching of exact matches in a database. An inverted index is an associative mapping between the content and the source of that content. A bloom filter allows for fast set membership queries with allowable false positives and no false negatives. To make this an approximate search, feature extraction was performed on the basic block. Bloom filters were shown to be fast enough to run on a desktop system, but not fast enough for desktop Antivirus. The disadvantage of this approach is even more pronounced than using the edit distance between basic blocks. Minor changes to the malware source or compiler configuration can change almost all basic blocks.
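The membership property of a bloom filter can be illustrated with a short sketch. The bit-array size, the number of hash functions, and the use of salted SHA-256 digests to derive bit positions are assumptions of this example and are not details from [6]. The basic block contents are hypothetical byte strings.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical basic blocks taken from known malware, stored as raw bytes.
known_blocks = [b"\x55\x89\xe5\x83\xec\x24", b"\x83\x45\xf8\x01", b"\xc9\xc3"]
bloom = BloomFilter()
empty = BloomFilter()
for block in known_blocks:
    bloom.add(block)
```

Every block that was added is always reported as present. A block that was never added is usually reported as absent, but may occasionally be reported as present, which is the allowable false positive.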

5.3 Instruction Distributions
5.3.1 Opcode Distributions
Instruction level content of malware can provide a more resilient representation than the byte level data. This is especially true if the instruction arguments or operands are ignored, leaving only the opcodes to be examined. To determine the instructions and opcodes, a disassembly of the malware is required. A classification technique using the statistical distribution of opcodes as a predictor of malware was proposed in [5]. The investigation found that rarely occurring opcodes were a stronger predictor of malware than frequently occurring opcodes. The disadvantage of opcode distributions is that polymorphism can change the distributions.
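An opcode distribution is simply the relative frequency of each mnemonic in the disassembly. The sketch below computes it for the instruction listing of Figure 1; using mnemonics as the opcode abstraction, and the helper name `opcode_distribution`, are assumptions of this example.

```python
from collections import Counter
from typing import Dict, Iterable


def opcode_distribution(mnemonics: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each opcode mnemonic, ignoring operands."""
    counts = Counter(mnemonics)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


# Mnemonics of the procedure shown in Figure 1, operands discarded.
listing = ("lea and pushl push mov push sub call movl jmp "
           "movl call addl cmpl jle add pop pop lea ret").split()

distribution = opcode_distribution(listing)
```

The resulting mapping can be used directly as a feature vector for a statistical classifier, with absent opcodes treated as frequency zero.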

5.5 Control Flow Graphs
5.5.1 Whole Program Control Flow Graph Isomorphism Recognition Using Tree Automata
A fast approach to detecting whole program control flow graph isomorphism and subgraph isomorphism was proposed in [22]. This approach constructed a spanning tree based structure from the control flow graph, and then built a tree automaton for graph recognition. This approach appears to have reasonable performance. However, this technique is not effective at detecting malware variants that alter the control flow or have semantic changes.

5.3.2 Byte and Instruction Level N-grams
An approach to classify malware using evolutionary trees and phylogeny based on n-grams and n-perms was proposed by Karim et al. in [20]. Their approach first established a similarity between malware samples, after which phylogeny, which generates evolutionary trees, could be used for taxonomy. N-grams of byte and instruction level content were extracted as features from each binary and represented as feature vectors. These vectors were compared to establish a similarity using a variety of metrics. One such metric was cosine similarity. N-perms extended the concept of n-grams to group permutations of each n-gram as a single feature. N-grams are more resistant to polymorphic changes than string based signatures, but are not highly effective when faced with techniques such as instruction substitution or register reassignment.
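The following sketch shows n-gram and n-perm feature vectors compared with cosine similarity. Representing an n-perm as the sorted tuple of the n-gram's elements is one straightforward reading of "grouping permutations of each n-gram", and the two short mnemonic sequences are hypothetical. The second sequence differs from the first only by a reordering of its first two instructions.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def ngrams(seq: Sequence, n: int) -> Counter:
    """Count every contiguous window of n elements."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def nperms(seq: Sequence, n: int) -> Counter:
    """Like n-grams, but windows that are permutations of each other are one feature."""
    return Counter(tuple(sorted(seq[i:i + n])) for i in range(len(seq) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


original = ["push", "mov", "sub", "call", "add", "pop", "ret"]
reordered = ["mov", "push", "sub", "call", "add", "pop", "ret"]

ngram_similarity = cosine_similarity(ngrams(original, 2), ngrams(reordered, 2))
nperm_similarity = cosine_similarity(nperms(original, 2), nperms(reordered, 2))
```

In this example four of the six 2-grams are shared, while five of the six 2-perms are shared, so the n-perm similarity is the higher of the two. The same functions accept a `bytes` object for byte level n-grams.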

5.5.2 Common k-subgraphs
Decomposing control flow graphs into subgraphs was proposed by Kruegel et al. in [8] to classify polymorphic worms. The control flow graphs were decomposed into the set of all subgraphs of fixed size k, where k is the number of nodes in the subgraph. The k-subgraphs were subsequently transformed into their canonical labelled form. The adjacency matrix of the canonically labelled graph was transformed into a string. This string represented the k-subgraph feature of the control flow being analysed. Worm detection and classification occurs through identifying the prevalence of k-subgraph features between worm like executable content and unclassified executable programs. The performance of this system was reasonable. Because the classification only occurs on network streams identified as potential worms, it is hard to determine the accuracy of the classification when applied to a larger set of malware.
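A small sketch may clarify how a k-subgraph becomes a string feature. It makes several simplifying assumptions that are not taken from [8]: subgraphs are the connected induced subgraphs on every choice of k nodes, and the canonical form is found by brute force as the lexicographically smallest adjacency-matrix string over all node orderings, which is only practical for small k. The example graph follows the loop structure of the procedure in Figure 1.

```python
from itertools import combinations, permutations
from typing import Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def is_connected(nodes: Iterable[Hashable], edges: Set[Edge]) -> bool:
    """Weak connectivity of the subgraph induced by the given nodes."""
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        for a, b in edges:
            if a == n and b in nodes:
                stack.append(b)
            elif b == n and a in nodes:
                stack.append(a)
    return seen == nodes


def canonical_string(nodes: Iterable[Hashable], edges: Set[Edge]) -> str:
    """Smallest row-major adjacency-matrix string over all node orderings."""
    best = None
    for order in permutations(nodes):
        s = "".join("1" if (a, b) in edges else "0" for a in order for b in order)
        if best is None or s < best:
            best = s
    return best


def k_subgraph_features(edges: Iterable[Edge], k: int) -> Set[str]:
    """Canonical strings of all connected induced k-node subgraphs."""
    edges = set(edges)
    nodes = sorted({n for e in edges for n in e})
    return {canonical_string(subset, edges)
            for subset in combinations(nodes, k)
            if is_connected(subset, edges)}


# Entry block 0, loop body 1, loop condition 2, exit block 3.
loop_cfg = [(0, 2), (2, 1), (1, 2), (2, 3)]
# The same graph with different node names, and a straight-line graph.
renamed_cfg = [("a", "b"), ("b", "c"), ("c", "b"), ("b", "d")]
chain_cfg = [(0, 1), (1, 2), (2, 3)]

loop_features = k_subgraph_features(loop_cfg, 3)
```

Because the canonical string does not depend on node names or addresses, the renamed graph yields exactly the same feature set as the original.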

5.3.3 N-gram Analysis and Machine Learning
N-gram analysis using machine learning was proposed by Perdisci et al. to classify malware in McBoost [21]. A similar method also involving machine learning and classification was proposed in [4]. The classification used an algorithm similar to the n-gram analysis that McBoost used to detect packed binaries. The most informative n-grams of a training set were selected, and the occurrence of those n-grams in each binary was represented as a vector. Those vectors were used to train a statistical classifier. Automated unpacking was performed on the malware, if necessary, using a method similar to the Renovo unpacking system.

5.4 Basic Blocks
5.4.1 Edit Distances
The edit distance between basic blocks was used to classify malware by Gheorghescu in [6]. The edit distance describes the number of insertions, deletions and substitutions required to convert one string to another. To classify malware using edit distance, each malware sample was statically disassembled and the basic blocks extracted. These basic blocks were then considered as strings. Classification then proceeded to build a similarity between the malware's basic blocks and the binary being examined using a similarity ratio based on the edit distance.
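The edit distance can be computed with the standard dynamic programming recurrence. In the sketch below, the similarity ratio is defined as one minus the edit distance divided by the length of the longer sequence; this particular ratio is an assumption for illustration and may differ from the one used in [6]. Basic blocks are modelled as sequences of mnemonics.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (item_a != item_b)))  # substitution
        previous = current
    return previous[-1]


def similarity_ratio(a: Sequence, b: Sequence) -> float:
    """1.0 for identical sequences, 0.0 for sequences with nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Two hypothetical versions of a loop body: one instruction was substituted.
block_a = ["movl", "call", "addl"]
block_b = ["movl", "call", "incl"]
ratio = similarity_ratio(block_a, block_b)
```

The same functions work on plain strings, for example when a basic block is treated at the byte level.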

5.5.3 Structured Control Flow Graphs Using Decompilation
An approach that decompiles the control flow into a high level source code like representation was proposed in [23]. Comparing two control flow graphs is performed by using the string edit distance on their decompiled sources. The similarities of each control flow graph are accumulated to give a similarity taking into account the entire program. A related approach is to decompose the decompiled strings into q-grams [24].

5.6 Call Graphs
5.6.1 Whole Program Context-Free Control Flow
It was proposed in [25] that the inter-procedural control flow information could be represented as a context free grammar with a limited loss of information. A string could represent the grammar, and string equality could be used to show equivalence between grammars and hence between the inter-procedural control flow they represented. The advantage of this approach is that string based representations allow for fast searches in a malware database using a dictionary search. The disadvantage of the approach investigated in this research is that it did not employ approximate matching of the inter-procedural control flow. For polymorphic malware variants that alter the control flow through source code modification, an approximate match is necessary for detection of the malware.

An alternative to representing API call sequences as vectors was proposed in the IMDS malware detection system [10]. IMDS employed the data mining technique known as association mining, which associates sequences of API calls with the benign or malicious class in order to classify query samples.

5.8 Data Flow
Combining data flow analysis and control flow analysis was proposed in [11, 30]. Annotations were made to the control flow graphs to incorporate abstractions of the instructions and data flow. These annotated flowgraphs were compared to signatures, or automata, that described the malware. If the malware signature was present in the query program, a malware instance had been detected. In [31], value set analysis was used as a specific data flow analysis to identify fixed points that were subsequently used to construct signatures.

5.6.2 Flowgraph Based Classification using Fixed Points
Carrera proposed an approximate flowgraph matching algorithm in [9] by identifying fixed points in the flowgraphs and successively matching surrounding nodes in the graph. Carrera built a similarity index between malware and used this to build phylogeny trees for taxonomy. Dullien and Rolles expanded the approximate graph comparison algorithm in [26] to identify identical nodes between call graphs and control flow graphs. Their algorithm worked by identifying nodes, or fixed points, between binaries that have uniquely identifiable features. Features for a node in the call graph include the number of basic blocks, control flow edges, and number of subfunction calls. Carrera also proposed an estimation of a control flow graph isomorphism based on string equality and a string signature of the graph representing a graph traversal. Once a set of fixed points was known, their neighbouring nodes could be examined. Identifying neighbours sharing common and unique features iteratively allowed greater parts of the flowgraph to be identified. The advantage of this approach is that it allows for moderately fast pair-wise comparison between graphs. However, the approach does not perform efficiently for a database of graphs and is not fast enough for desktop Antivirus use.
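The initial fixed point selection can be sketched as follows. Each call graph node is described by a tuple of (number of basic blocks, number of control flow edges, number of subfunction calls), and two nodes are paired when that tuple is unique within both binaries. The procedure names and feature values are hypothetical, and the iterative matching of neighbouring nodes described above is deliberately left out of this sketch.

```python
from collections import Counter
from typing import Dict, Tuple

Features = Tuple[int, int, int]  # (basic blocks, control flow edges, subfunction calls)


def fixed_points(graph_a: Dict[str, Features],
                 graph_b: Dict[str, Features]) -> Dict[str, str]:
    """Pair nodes whose feature tuple occurs exactly once in each graph."""
    count_a = Counter(graph_a.values())
    count_b = Counter(graph_b.values())
    # Only consulted for features that are unique in graph_b.
    node_by_feature_b = {features: node for node, features in graph_b.items()}
    return {node: node_by_feature_b[features]
            for node, features in graph_a.items()
            if count_a[features] == 1 and count_b.get(features) == 1}


# Hypothetical procedures of a malware sample and of a suspected variant.
sample = {
    "sub_401000": (4, 4, 2),
    "sub_401100": (1, 0, 0),
    "sub_401200": (1, 0, 0),   # same features as sub_401100, so ambiguous
    "sub_401300": (7, 9, 3),
}
suspect = {
    "sub_402000": (4, 4, 2),
    "sub_402100": (1, 0, 0),
    "sub_402200": (7, 9, 3),
    "sub_402300": (2, 1, 1),   # no counterpart in the sample
}
matched = fixed_points(sample, suspect)
```

The fraction of nodes that end up matched, after neighbour propagation, could then serve as a similarity measure between the two binaries.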

6. TRENDS
Malware obfuscation has been increasingly addressed by researchers, and deobfuscation will continue to be developed and incorporated into malware detection systems. These deobfuscation techniques have increasingly borrowed from formal program analyses in an attempt to make sound analyses possible in regards to their given constraints. Malware classification has employed statistical techniques to detect unknown malware. We believe research will continue using this approach and new features will be developed that can more accurately characterize malware. Instance-based learning will also be developed with particular research opportunity in working with large scale datasets. Static program features have been extracted at increasing levels of abstraction, and we expect this to continue in future research. Abstraction has the benefit of being resistant to lower level polymorphic changes. The performance of these research systems has not been fully investigated, and we expect that future research opportunity lies in making classification systems practical for industrial and widespread use.

5.6.3 Approximating the Graph Edit Distance
An alternative approach to approximate graph matching was proposed in the SMIT system [27]. SMIT employed bipartite graphs and the Hungarian, or Munkres, algorithm to find matching nodes between two call graphs being compared in O(N^3) running time. The strength of their matching algorithm was that it could be used as an approximation to the graph edit distance. The graph edit distance between two graphs is the minimum number of edit operations required to convert one graph to the other. The graph edit distance gives a sound basis for similarity and dissimilarity between graphs. The graph edit distance is known to have the properties of a metric, which allows the use of metric access methods to search a database of objects.
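The node matching step is a minimum-cost bipartite assignment. The sketch below solves that problem by exhaustive search over all permutations, which is a deliberately simple stand-in: it returns the same optimum that the Hungarian algorithm would, but in O(N!) time instead of O(N^3), so it is only suitable for tiny examples. The cost matrix is hypothetical; how SMIT derives node-to-node costs is not shown here.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def min_cost_matching(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Exhaustively find the assignment of rows to columns with the least total cost.

    Returns (assignment, total) where assignment[i] is the column matched to row i.
    """
    n = len(cost)
    best_assignment, best_total = None, float("inf")
    for columns in permutations(range(n)):
        total = sum(cost[row][columns[row]] for row in range(n))
        if total < best_total:
            best_assignment, best_total = list(columns), total
    return best_assignment, best_total


# cost[i][j]: hypothetical cost of matching node i of one call graph
# to node j of the other (0 means the nodes look identical).
cost = [
    [0, 2, 3],
    [2, 1, 4],
    [3, 4, 0],
]
assignment, total = min_cost_matching(cost)
```

The total cost of the optimal assignment plays the role of the approximate graph edit distance between the two call graphs.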

7. CONCLUSION
Detecting malware before it is allowed to execute is an important feature of Antivirus and system security. Static analysis techniques extract features from programs, and machine learning can use those features to identify both variants of known malware and novel samples. Malware packing, which hides the code from analysis, remains the main sticking point for static detection, and it can be hard to reverse all packers automatically. If unpacking is achievable, malware detection using static analysis is quite feasible, and we expect the accuracy and efficiency of such systems to continue to improve as research continues.

REFERENCES
[1] K. Griffin, S. Schneider, X. Hu, and T. Chiueh, "Automatic Generation of String Signatures for Malware Detection," in Recent Advances in Intrusion Detection: 12th International Symposium, RAID 2009, Saint-Malo, France, 2009.
[2] J. O. Kephart and W. C. Arnold, "Automatic extraction of computer virus signatures," in 4th Virus Bulletin International Conference, 1994, pp. 178-184.

5.7 API Calls
API calls are a feature used in several malware classification and detection systems. The SAVE system [28] disassembled the malware image and extracted the API call sequence as it appeared in the disassembled output. The API call sequence was used to construct a vector. Similarity between vectors was measured using the cosine similarity measure, the Jaccard index [29], and Pearson's correlation measure.
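The three similarity measures named above can be sketched in a few lines. This is only an illustration: the text does not specify how SAVE encodes the API call sequence as a vector, so the sketch assumes a simple call-frequency vector over a shared vocabulary. The helper names and the two call sequences are hypothetical.

```python
import math
from collections import Counter


def api_vector(calls, vocabulary):
    """Count how often each API in the vocabulary appears in the sequence."""
    counts = Counter(calls)
    return [counts[name] for name in vocabulary]


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def jaccard_index(calls_a, calls_b):
    """Size of the intersection over size of the union of the API sets."""
    a, b = set(calls_a), set(calls_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def pearson_correlation(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no variance, so the correlation is reported as 0.
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


# Hypothetical API call sequences taken from two disassembled samples.
seq_a = ["CreateFileA", "WriteFile", "WriteFile", "CloseHandle"]
seq_b = ["CreateFileA", "WriteFile", "CloseHandle",
         "RegSetValueExA", "RegSetValueExA"]

vocabulary = sorted(set(seq_a) | set(seq_b))
vec_a = api_vector(seq_a, vocabulary)
vec_b = api_vector(seq_b, vocabulary)

cosine = cosine_similarity(vec_a, vec_b)
jaccard = jaccard_index(seq_a, seq_b)
pearson = pearson_correlation(vec_a, vec_b)
```

Cosine similarity and Pearson's correlation operate on the frequency vectors, while the Jaccard index here compares only which APIs are present, ignoring how often and in what order they are called.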

[3] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, vol. 18, p. 340, 1975.
[4] J. Z. Kolter and M. A. Maloof, "Learning to detect malicious executables in the wild," in International Conference on Knowledge Discovery and Data Mining, 2004, pp. 470-478.
[5] D. Bilar, "Opcodes as predictor for malware," International Journal of Electronic Security and Digital Forensics, vol. 1, pp. 156-168, 2007.
[6] M. Gheorghescu, "An automated virus classification system," in Virus Bulletin Conference, 2005, pp. 294-300.
[7] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: principles, techniques, and tools. Reading, MA: Addison-Wesley, 1986.
[8] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, "Polymorphic worm detection using structural information of executables," Lecture Notes in Computer Science, vol. 3858, p. 207, 2006.
[9] E. Carrera and G. Erdélyi, "Digital genome mapping – advanced binary malware analysis," in Virus Bulletin Conference, 2004, pp. 187-197.
[10] Y. Ye, D. Wang, T. Li, and D. Ye, "IMDS: intelligent malware detection system," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007.
[11] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, "Semantics-aware malware detection," in Proceedings of the 2005 IEEE Symposium on Security and Privacy (S&P 2005), Oakland, California, USA, 2005.
[12] R. N. Horspool and N. Marovac, "An approach to the problem of detranslation of computer programs," The Computer Journal, vol. 23, pp. 223-229, 1979.
[13] L. Boehne, "Pandora's Bochs: Automatic Unpacking of Malware," University of Mannheim, 2008.
[14] G. Wicherski, "peHash: A Novel Approach to Fast Malware Clustering," in Usenix Workshop on Large-Scale Exploits and Emergent Threats (LEET'09), Boston, MA, USA, 2009.
[15] S. Wehner, "Analyzing worms and network traffic using compression," Journal of Computer Security, vol. 15, pp. 303-320, 2007.
[16] Y. Zhou and W. M. Inge, "Malware detection using adaptive data compression," in Proceedings of the 1st ACM Workshop on AISec (AISec '08), 2008, pp. 53-60.
[17] M. Christodorescu, J. Kinder, S. Jha, S. Katzenbeisser, and H. Veith, "Malware normalization," University of Wisconsin, Madison, Wisconsin, USA, Technical Report #1539, 2005.

[18] D. Bruschi, L. Martignoni, and M. Monga, "Using code normalization for fighting self-mutating malware," in Proceedings of International Symposium on Secure Software Engineering, 2006.
[19] W. Andrew, M. Rachit, R. C. Mohamed, and L. Arun, "Normalizing Metamorphic Malware Using Term Rewriting," in Proceedings of the Sixth IEEE International Workshop on Source Code Analysis and Manipulation, 2006.
[20] M. E. Karim, A. Walenstein, A. Lakhotia, and L. Parida, "Malware phylogeny generation using permutations of code," Journal in Computer Virology, vol. 1, pp. 13-23, 2005.
[21] R. Perdisci, A. Lanzi, and W. Lee, "McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables," in Proceedings of the 2008 Annual Computer Security Applications Conference, 2008, pp. 301-310.
[22] G. Bonfante, M. Kaczmarek, and J. Y. Marion, "Morphological Detection of Malware," in International Conference on Malicious and Unwanted Software, IEEE, Alexandria, VA, USA, 2008, pp. 1-8.
[23] S. Cesare and Y. Xiang, "Classification of Malware Using Structured Control Flow," in 8th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2010), 2010.
[24] S. Cesare and Y. Xiang, "Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs," in IEEE Trustcom, 2011.
[25] R. T. Gerald and A. F. Lori, "Polymorphic malware detection and identification via context-free grammar homomorphism," Bell Labs Technical Journal, vol. 12, pp. 139-147, 2007.
[26] T. Dullien and R. Rolles, "Graph-based comparison of Executable Objects (English Version)," in SSTIC, 2005.
[27] X. Hu, T. Chiueh, and K. G. Shin, "Large-Scale Malware Indexing Using Function-Call Graphs," in Computer and Communications Security, Chicago, Illinois, USA, pp. 611-620.
[28] A. H. Sung, J. Xu, P. Chavez, and S. Mukkamala, "Static analyzer of vicious executables (SAVE)," 2004, pp. 326-334.
[29] G. Salton and M. J. McGill, Introduction to modern information retrieval. New York: McGraw-Hill, 1983.
[30] M. Christodorescu and S. Jha, "Static analysis of executables to detect malicious patterns," in Proceedings of the 12th USENIX Security Symposium, 2003.
[31] F. Leder, B. Steinbock, and P. Martini, "Classification and Detection of Metamorphic Malware using Value Set Analysis," in Proc. of 4th International Conference on Malicious and Unwanted Software (Malware 2009), Montreal, Canada, 2009.
