2 IEEE TRANSACTIONS ON COMPUTERS
Unpacking is a necessary step in performing static analysis and revealing the hidden characteristics of malware. Within the problem scope of unpacking, many instances of malware utilise identical or similar packers. Many of these packers are public, and malware often employs them either unmodified or in modified form. Being able to automatically unpack malware in any of these scenarios, in addition to unpacking novel samples, reveals the malware's real content, which is necessary for static analysis and accurate classification.
Automated unpacking relies on typical behaviour seen in the majority of packed malware: hidden code is dynamically generated and then executed. The hidden code is naturally revealed in the process image during normal execution. Monitoring execution for the dynamic generation and execution of the malware's hidden code can be achieved through emulation. Emulation provides a safe and isolated environment for malware analysis.
Malware detection has been investigated extensively; however, shortcomings still exist. For modern malware classification approaches, a system must be developed that is not only effective against polymorphic and packed malware, but that is also efficient. Unless efficient systems are developed, commercial Antivirus will be unable to implement the solutions developed by researchers. We believe combining effectiveness with real-time efficiency is an area of research which has been largely ignored. For example, the malware classification investigated in [5, 6, 9-11] has no analysis or evaluation of system efficiency. We address that issue with our implementation and evaluation of Malwise.
In this paper we present an effective and efficient system that employs dynamic and static analysis to automatically unpack and classify a malware instance as a variant, based on similarities of control flow graphs.
This paper makes the following contributions. First, we propose using string based signatures to represent control flow graphs and a fast classification algorithm based on set similarity to identify related programs in a database. Second, we propose using two algorithms to generate string signatures for exact and approximate identification of flowgraphs. Exact matching uses a graph invariant as a string to heuristically identify isomorphic flowgraphs. Approximate matching uses decompiled structured flowgraphs as string signatures. These signatures have the property that similar flowgraphs have similar strings. String similarity is based on the string edit distance. Third, we propose and evaluate automated unpacking using application level emulation that is also suitable for desktop Antivirus integration. The automated unpacker is capable of unpacking both known and unknown samples. We also propose an algorithm that uses entropy analysis to determine when to stop emulation during unpacking. Finally, we implement and evaluate our ideas in a novel prototype system called Malwise that performs automated unpacking and malware classification.
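To make the approximate matching idea concrete, the following is a minimal sketch of comparing string signatures by edit distance and then scoring two programs by how many signatures they share. It is an illustration under stated assumptions, not the Malwise implementation: the function names, the normalisation by the longer string, the 0.9 threshold and the rule used to combine per-signature matches into a program score are all hypothetical choices made here for the example.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def program_similarity(sample: Iterable[str], candidate: Iterable[str],
                       threshold: float = 0.9) -> float:
    """Hypothetical set similarity: the fraction of the sample's signatures
    that have a sufficiently similar signature in the candidate program."""
    sample = list(sample)
    candidate = list(candidate)
    if not sample:
        return 0.0
    matched = sum(
        1 for s in sample
        if any(similarity(s, c) >= threshold for c in candidate)
    )
    return matched / len(sample)
```

Each program is treated as a collection of signature strings, one per flowgraph. A database query would call `program_similarity` against each stored program and report those scoring above a chosen cut-off; a practical system would need an index to avoid this exhaustive comparison.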
1.4 Structure of the Paper
The structure of this paper is as follows: Section 2 describes related work in automated unpacking and malware classification; Section 3 refines the problem definition and our approach to the proposed malware classification system; Section 4 describes the design and implementation of our prototype Malwise system; Section 5 evaluates Malwise using real and synthetic malware samples; finally, Section 8 summarizes and concludes the paper.
2.1 Automated Unpacking
Automated unpacking employing whole system emulation was proposed in Renovo and Pandora's Bochs. Whole system emulation has been demonstrated to provide effective results against unknown malware samples, yet is not completely resistant to novel attacks.
Renovo and Pandora's Bochs both detect execution of dynamically generated code to determine when unpacking is complete and the hidden code is revealed. An alternative algorithm for detecting when unpacking is complete was proposed using execution histograms in Hump-and-dump. Hump-and-dump was proposed as potentially desirable for integration into an emulator. Polyunpack proposed a combination of static and dynamic analysis to dynamically detect code at runtime which cannot be identified during an initial static analysis. The main distinction separating our work from previously proposed automated unpackers is our use of application level emulation and an aggressive strategy to determine that unpacking is complete. The advantage of application level emulation over whole system emulation is significantly greater performance. Application level emulation for automated unpacking has had commercial interest but has realized few academic publications evaluating its effectiveness and performance.
Dynamic Binary Instrumentation was proposed as an alternative to the instrumented emulator employed by Renovo and Pandora's Bochs. Omnipack and Saffron proposed automated unpacking using native execution and hardware based memory protection features. This results in high performance in comparison to emulation based unpacking. The disadvantage of unpacking using native execution is evident on E-Mail gateways because a virtual machine or emulator is required to execute the malware. A virtual machine approach to unpacking, using x86 hardware extensions, was proposed in Ether. A disadvantage of such a virtual machine, and equally of a whole system emulator, is the requirement to install a license for each guest operating system. This restricts desktop adoption, since a desktop typically has a single license. Virtual machines are also inhibited by slow start-up times, which again are problematic for desktop use. The use of a virtual machine also prevents the system being cross platform, as the guest and host CPUs must be the same.
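The completion test attributed above to Renovo and Pandora's Bochs, detecting execution of dynamically generated code, can be illustrated with a small sketch. It assumes a hypothetical emulator that reports every memory write and every instruction fetch through callbacks; the class name, the callback names and the byte-granular tracking are assumptions made for this example and do not describe the interface of any of the cited tools.

```python
class WriteThenExecuteMonitor:
    """Flags the first instruction fetched from memory that was written
    earlier in the same emulated run."""

    def __init__(self) -> None:
        self.written: set[int] = set()       # addresses modified so far
        self.unpacked_entry: int | None = None

    def on_write(self, address: int, size: int = 1) -> None:
        # Hypothetical callback: the emulator reports each memory write.
        self.written.update(range(address, address + size))

    def on_execute(self, address: int) -> bool:
        # Hypothetical callback: the emulator reports each instruction fetch.
        # Returns True once control has reached previously written memory.
        if self.unpacked_entry is None and address in self.written:
            self.unpacked_entry = address
        return self.unpacked_entry is not None


# Usage with made-up addresses: the unpacking stub runs from 0x400000,
# writes four bytes at 0x401000, then jumps into them.
monitor = WriteThenExecuteMonitor()
monitor.on_execute(0x400000)
monitor.on_write(0x401000, size=4)
if monitor.on_execute(0x401000):
    print(f"dynamically generated code reached at {monitor.unpacked_entry:#x}")
```

A single trigger of this kind is not sufficient for multi-stage packers, where freshly written code goes on to write and execute further layers, so a separate criterion is still needed to decide that unpacking is complete.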