
HoneySift: A fast approach for a low-interaction client-based honeypot

Damien Forest
a0066246@nus.edu.sg
National University of Singapore

Choo Weisen Christopher Ledesma
a0066276@nus.edu.sg
National University of Singapore

Kng Poh Leong Vincent
g0906179@nus.edu.sg
National University of Singapore

Ho Yeow Siang
g0806342@nus.edu.sg
National University of Singapore

ABSTRACT
In this paper we present a comprehensive, low-interaction, client-based honeypot that combines open-source software with a static analysis engine. The engine performs code obfuscation detection based on a classification technique, signature detection, and similarity analysis against known malicious shellcodes in order to classify a webpage as malicious or benign. This paper discusses the background of how malware propagates, the current landscape of honeypot solutions, and the static analysis techniques we used to tackle the key challenges confronting honeypot developers.

General Terms
Algorithms, Measurement, Experimentation, Security.

Keywords
Honeypot, low-interaction, signature, classifier, obfuscation, disassembly, shellcode, similarity analysis.

1. INTRODUCTION
Millions of web pages are compromised every day by malware that embeds itself into vulnerable websites. Once these web pages have been compromised, they are used as a staging platform to launch attacks on the browsers of unsuspecting visitors through drive-by download attacks. Drive-by download attacks happen without the knowledge of the client. Simply by visiting a compromised website with a vulnerable browser, the client may inadvertently download spyware, computer viruses, and other types of malware without giving explicit consent. Due to the covert nature of these attacks, clients may be compromised even by viewing an e-mail message or by clicking on a deceptive popup window. In recent years, the number of client-side attacks has grown significantly. People who are less technically inclined are unaware of the need to keep their computers up to date with the latest software updates, making them susceptible to such attacks. Client-based honeypots are specially designed to gather information on client-side attacks, as shown in Figure 1. This information is then used to develop defences against attacks such as drive-by downloads.


Figure 1: Typical implementation of a honeypot solution (source: Honeynet.org)

After studying various implementations of open-source client-based low-interaction honeypots, we noticed that there was no comprehensive solution that featured a dictionary of malicious websites, performed similarity analysis, malicious code classification, signature detection, and code obfuscation detection. This spurred us to develop a solution by putting together a combination of open-source programs and our own static analysis engine. In Section 4 we present the architecture of this solution and its objectives. In the following sections we present in detail our methodology for the implementation of three modules: web crawling, obfuscation detection, and signature detection with similarity analysis.


2. DEFINITIONS
Client-based honeypots are systems that gather information on client-side attacks by crawling the Internet, searching for malicious websites and studying how they interact with a user's web browser. They are typically characterized by the level of interaction they have with the websites they visit. Those that emulate a particular operating system and browser intensively are known as high-interaction honeypots, whereas those that are more passive and have less emulation are known as low-interaction honeypots. Both have their benefits and disadvantages. High-interaction honeypots produce more accurate results as they emulate a fully functional system, but are slower due to the level of interaction involved. Low-interaction honeypots do not deliver results of the same quality, but are useful for crawling the web more rapidly due to their lower level of interaction.

Static analysis denotes that the analysis is performed without compiling or executing the code. In our case it involves the review of HTML code, including JavaScript, prior to execution. This analysis helps to determine whether the code is likely to capitalize on a vulnerability within the web browser to gain elevated access privileges on the client's system.

Similarity analysis refers to techniques that are used to estimate the similarity of a given program to another one. In our case it is used for the detection of hexadecimal shellcode within JavaScript encapsulated in an HTML webpage.

Code obfuscation is the practice of making code unintelligible or hard to understand. The process involves transformations of the code, changing its physical appearance while preserving the black-box specifications of the program [23]. The intention is to hide the true meaning of the code or to increase the cost of reverse engineering, so as to seriously hinder detection techniques based on static analysis.

3. RELATED WORK
Many have undertaken work to improve the accuracy of low-interaction client-based honeypots. There are two principal approaches. The first involves the static analysis of network flows; using this approach, Chinchani and van den Berg proposed a method to distinguish between data and executable code [24]. Others, such as S. Almotairi et al., used principal component analysis to characterize honeypot traffic and classify activities. The second approach involves the analysis of HTTP responses from web servers, which is the primary approach taken in this paper. Various implementations are discussed below to illustrate the need for a comprehensive solution that encompasses the most important features that distinguish each solution.

3.1 Overview
We studied various low-interaction client-based honeypot implementations currently available online. These include HoneyC, PhoneyC, Microsoft's HoneyMonkey, and HoneyWare. We mapped their key features in Table 1 to determine the components that we would put together in our own implementation. In particular, we looked at the web crawler function, dictionary of malicious websites, obfuscation detection, and analysis engine to understand the building blocks of such a honeypot.

3.2 Comparison
3.2.1 Existing solutions
| Feature | HoneyC | PhoneyC | HoneyMonkey | HoneyWare |
| Web crawler | Yes | Yes | Yes | Yes (emulating several web browsers) |
| Keyword search | Yes | No | No | Yes |
| Dictionary of malicious websites | No | No | Manual | No |
| Obfuscation detection | No | Partial | Partial | No |
| Analysis engine | Signature detection | Detection based on code entropy | Malware detection programs, then manual analysis | Large variety of scan engines |

Table 1: Comparison of existing solutions

3.2.2 HoneyC
We studied HoneyC, a low-interaction client-based honeypot, to better understand how such systems work. To achieve partial emulation, simulated clients are deployed instead of a fully functional system to interact with potentially rogue servers. To identify these servers, HoneyC employs the Yahoo Search API to retrieve servers' Uniform Resource Locators (URLs). Next, HTTP requests are made to these servers. Upon receiving the responses, different modes of analysis can be applied. For the analysis, HoneyC compares the responses against Snort signatures, which may contain the payload of known malicious attacks. The performance of signature-based detection depends on signature quality; hence, low-quality signatures are likely to result in some false alarms. HoneyC can detect some malicious responses (e.g. time bombs) that often elude traditional high-interaction client honeypots. On the other hand, these low-interaction client honeypots cannot match the detection capability of their high-interaction counterparts. Therefore, they are used in tandem, offering good performance and detection rates.


3.2.3 PhoneyC
PhoneyC [2] was set up in order to review a static analysis approach. A malicious HTML webpage is loaded into PhoneyC, which offers two basic static analysis methods. Older versions of PhoneyC use the Linux ClamAV [3] package for malicious signature verification, whereas newer releases use information entropy and shellcode detection. Information entropy enables the detection of heap sprays [17] through NOP sleds. Shellcode detection is done using Libemu [5], an external Linux shellcode library. Libemu is effective even on polymorphic shellcode.

3.2.4 HoneyMonkey
HoneyMonkey's strength lies in its exploit detection system, which can be divided into three stages. During the first stage, multiple URLs are visited simultaneously inside one unpatched virtual machine (VM). Upon exploit detection, dedicated VMs are assigned to each of the associated URLs to determine which are exploit URLs. Next, recursive redirection analysis is performed to identify all web pages involved in exploit activities and determine their relationships. Lastly, fully patched VMs are used on the Stage-2 results to detect the latest attacks. For exploit detection, monkey programs are run. These launch web browsers to visit the URLs, with an appropriate amount of time allocated for any code downloading. Meanwhile, the detection system monitors changes to executable files and registry entries. This offers an advantage over approaches limited to the detection of known vulnerability exploits. The robustness of exploit detection depends on the scan rate: if insufficient time is given during the waiting period, some exploit pages may not be able to perform their attacks. Similarly, running multiple browsers concurrently may lead to excessive slowdowns of the monkey programs.

3.2.5 HoneyWare
To retrieve URLs for scanning, Honeyware employs the Yahoo and MSN search engines. Clients are simulated in the form of different web browsers, including Internet Explorer, Firefox, Opera, Chrome, Safari and Konqueror. These tools interact with target servers to return the downloaded files and examine them. A suite of scan engines (AVG, F-Prot, Avast, ClamAV and AntiVir) is used to analyse the files. Each scan engine searches for different threats, providing a more comprehensive solution. Honeyware is designed to overcome some challenges faced by low-interaction client honeypots. Some malicious servers track visitors' Internet Protocol (IP) addresses, and the identity of client honeypots will be exposed when servers observe an abnormal number of visits within a timeframe. To address this shortcoming, Honeyware attempts to mimic human surfing patterns. In some cases, visitors are treated differently depending on their geographical locations; in other words, malicious attacks may be directed at visitors from certain countries. A basic search is conducted to understand the distribution of malicious attacks in different countries, and Honeyware plants its client in the country which experiences the highest number of attacks.

3.3 Challenges
3.3.1 Limitations of signature-based techniques
There are several challenges in the signature-based techniques used to detect malicious code. Among these challenges are the difficulty of maintaining an updated database of signatures, dealing with obfuscated malicious code, and handling new variants that are not yet captured in the database. New forms of malicious code are more easily detected through high-interaction honeypots, which track the state of an entire system. The high degree of monitoring is computationally intensive and slow, but allows more effective detection. In the case of low-interaction honeypots, this risk can only be mitigated by guessing whether certain types of code are likely to cause harm to a computer system. Keeping an updated database of signatures requires the support of a large community that regularly submits new forms of malicious code found in the wild. This is achieved fairly well by some open-source implementations such as Snort's Intrusion Detection System (IDS). For low-interaction honeypots, this requires a persistent Internet connection to a reliable source for the database. Lastly, obfuscated code presents a challenge that is discussed in this paper. Problems encountered include code substitution, where a certain malicious function is written in another form to mask its true purpose.

3.3.2 Difficulties of web crawling


Crawling the web in search of malicious websites has its fair share of difficulties. For instance, crawlers are normally expected to obey robot exclusion protocols while traversing a domain. Within each website, a file (robots.txt) specifies the sections which are accessible to web robots; the same file also lists the prohibited sections of a web site. Compromised websites could have malicious code hidden within the restricted sections, reducing the possibility that the crawler would discover the webpage containing the malicious code. Moreover, malicious websites could be coded to detect the fingerprint of non-browser activity, and refuse access to the website or simply return a non-malicious webpage. Such countermeasures may be deployed by hackers to make it difficult for low-interaction honeypots to retrieve and analyze the actual malicious contents of a webpage. Another issue with crawlers is that they may search a website too quickly and trigger an alarm from an in-built IDS. Due to the speed of low-interaction honeypots, crawling a website too quickly may be misinterpreted as a Denial-of-Service (DoS) attack, and could lead to the blacklisting of the honeypot's IP address.


4. MODEL ARCHITECTURE
Our solution is based on the assumption that most exploits are spread through JavaScript. Therefore, we limited our implementation to the detection of obfuscated and malicious JavaScript. Moreover, the similarity analysis and signature detection we propose are limited to the detection of hexadecimal shellcodes, and should therefore be complemented by other tools to detect different forms of exploits. This design decision led to a software architecture consisting of four main components:
1. Web crawling (Section 5)
2. JavaScript extraction (Section 6)
3. Obfuscation detection (Section 6)
4. Similarity analysis and signature detection (Section 7)
Due to the high computational cost and data requirements of network flow analysis [24], we chose to restrict our scope to static analysis and signature-based methods, which we believe can be improved to provide results comparable to network flow analysis with fewer requirements. We also avoided network flow analysis because our main intention was to ascertain whether a website was malicious, not whether a computer within a network was receiving malicious content; network flow analysis is better accomplished by dedicated IDS systems or specialized firewalls. Compared to high-interaction honeypots, our program provides less information for a complete understanding of the methods used by malicious coders. However, our solution offers advantages such as limited computation time, a fully automated architecture, and the generation of reports that could prove useful to search engines, which need reliable statistics in order to filter their results [25].

5. WEB-CRAWLER
One of the shortfalls of HoneyC was that it relied on the results of popular search engines to generate a list of URLs to search. However, modern search engines employ their own filtering mechanisms to ensure that malicious sites are not displayed, and the efficacy of the honeypot is adversely affected if it is unable to find malicious websites for analysis. Hence, our preferred approach was to rely on website listings found in popular security blacklists published by companies such as Symantec and Sophos; we used these listings as seeds for the web crawler. This led us to an open-source solution known as WebSPHINX (Website-Specific Processors for HTML Information eXtraction), a Java class library and interactive development environment for web crawlers. The following figure illustrates the operation of the crawling module.

Figure 2: Workflow diagram of our solution


5.1 Configuration
Several configurations were applied to the module to customize WebSPHINX for our use. Firstly, multiple threads were employed to speed up the process. Secondly, the searching algorithm can be either depth-first search (DFS) or breadth-first search (BFS): DFS performs the crawl within a domain, whereas BFS crawls across different domains. Next, the crawl depth is set appropriately so as to restrict the number of results generated. In addition, the crawler can be configured to obey robot exclusion protocols.

The Multipurpose Internet Mail Extensions (MIME) parameter determines which file types are retrieved. For simplicity, it is limited to "text/html"; other common types (e.g. image/jpeg, video/mpeg) are ignored, since we are looking only for malicious code within HTML documents. Other parameters include maximum page size, crawl timeout, and download timeout, among others. To handle dynamic web pages, pages are saved as they are crawled.
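As an illustration, a crawler with this configuration can be sketched as follows. This is a minimal sketch, not our production module: the WebSPHINX setter names (setRoot, setDepthFirst, setMaxDepth, the DownloadParameters change* methods) and the Page accessors are assumptions to be checked against the library's documentation, and AnalysisRunner.analyze is a hypothetical hook into our engine.

import websphinx.Crawler;
import websphinx.DownloadParameters;
import websphinx.Link;
import websphinx.Page;

// Minimal crawler sketch: visits pages seeded from a security blacklist and
// hands every fetched HTML page to the analysis engine.
public class HoneySiftCrawler extends Crawler {

    public HoneySiftCrawler(String seedUrl) throws Exception {
        setRoot(new Link(seedUrl));   // seed URL taken from a blacklist listing
        setDepthFirst(false);         // BFS, to crawl across different domains
        setMaxDepth(3);               // restrict the number of results generated

        DownloadParameters dp = getDownloadParameters();
        dp = dp.changeMaxThreads(8)             // multiple threads to speed up the crawl
               .changeObeyRobotExclusion(true)  // honour robots.txt
               .changeMaxPageSize(512);         // maximum page size, in kilobytes
        setDownloadParameters(dp);
    }

    // Called by WebSPHINX for every downloaded page.
    public void visit(Page page) {
        // Pages are handed over as they are crawled, so that dynamic content
        // can be analyzed offline by our engine (hypothetical hook).
        AnalysisRunner.analyze(page.getURL().toString(), page.getContent());
    }
}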


6. DETECTION OF OBFUSCATION
6.1 Utility and Challenges


Obfuscation was identified by M. Egele et al. [13] as one of the challenges of detecting malicious code in webpages. Although it was formerly a legitimate means for software engineers to protect their systems from attacks, it was soon adopted by exploit coders to bypass signature detection systems. Therefore, although legitimate obfuscated webpages exist, we believe it is useful to collect statistics on obfuscation, first in order to adapt our detection methods, and over the long term to help develop new heuristics for malicious code detection. A complete presentation of obfuscation techniques is beyond the scope of this paper, so we will only note that it is possible to encrypt attack code in such a way that the encryption key depends on the source code of the decryption function; modifying the decryption routine by adding debugging instructions then modifies the key and results in distorted and invalid output. In this respect we chose to restrict our analysis to the detection of obfuscation, not its decryption.



6.2 A supervised machine learning approach


6.2.1 Approach
Though there has been a recent attempt by M. Dalla Preda et al. [30] to devise a formalism specific to obfuscated codes, due to implementation difficulties in a real-world context, machine-learning-oriented approaches remain the standard and most promising methodology for code obfuscation detection [20], [26], and perhaps even for malicious code detection. Based on these works, our approach was to look for metrics that differ between normal and obfuscated code. The rationale is that obfuscated code strongly differs from normal code in readability: coders use automated programs [31] to obfuscate their code, which results in less readable code using more special characters, encrypted strings, and longer strings, in order to hide the true meaning of the operations being performed.

From that observation we chose three metrics and verified their distribution properties (a computation sketch follows this list):
1. N-grams based on the extracted strings: how many times each byte code (ASCII) is used in the strings. We used the restricted ASCII code table [32] (restricted to human-readable characters). The justification for this choice was the observation that obfuscated code uses special characters (/, +, etc.) in excess.
2. Density of entropy of each string (entropy/length). By obfuscating the code we observe a diminution of the information contained in each string (a direct effect of hiding the meaning of the code), so the entropy contained in each string tends to differ between normal and obfuscated code.
3. String lengths. Obfuscated code often uses long encrypted strings.
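The sketch below is a minimal, self-contained illustration of how these three metrics can be computed; the class and method names are ours for illustration and do not reproduce our production code.

// Illustrative computation of the three obfuscation metrics.
public final class ObfuscationMetrics {

    // Metric 1: ASCII code frequencies (0-127), normalized by the maximum frequency.
    public static double[] normalizedAsciiFrequencies(String s) {
        double[] freq = new double[128];
        for (char c : s.toCharArray()) {
            if (c < 128) freq[c]++;
        }
        double max = 0;
        for (double f : freq) max = Math.max(max, f);
        if (max > 0) {
            for (int i = 0; i < 128; i++) freq[i] /= max;
        }
        return freq;
    }

    // Metric 2: density of entropy = Shannon entropy of the byte
    // distribution divided by the string length.
    public static double entropyDensity(String s) {
        if (s.isEmpty()) return 0.0;
        int[] counts = new int[256];
        for (char c : s.toCharArray()) counts[c & 0xFF]++; // low byte of each char
        double entropy = 0.0;
        for (int count : counts) {
            if (count == 0) continue;
            double p = (double) count / s.length();
            entropy -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return entropy / s.length();
    }

    // Metric 3: average string length over all extracted strings of one script.
    public static double averageLength(java.util.List<String> strings) {
        return strings.stream().mapToInt(String::length).average().orElse(0.0);
    }
}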

6.2.2 Elements of justification

N-grams
To choose which types of characters are significant indicators of obfuscation, we studied their normalized distribution (divided by the maximum frequency) for normal and obfuscated code. While the use of alphabetic characters (ASCII 65-90 and 97-122) does not strongly differ between normal and obfuscated code, we observe an abnormal concentration of hexadecimal-escape characters (u: 117, x: 120), special characters (0-46, 58-64, 91-96, 123-127) and numbers (48-57) in obfuscated code. A typical %u-encoded fragment looks like:
%u03eb%ueb59%ue805%ufff8%uffff%u4949%u3749%u4949%u4949%u4949%u4949%u4...

Figure 3: Obfuscated code normalized frequencies (x-axis: ASCII code, y-axis: normalized frequency)

| N-gram category | Decimal ASCII codes |
| u, x | 117, 120 |
| Special characters (! # $ % & * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { } ~, among others) | 0-46, 58-64, 91-96, 123-127 |
| Numbers | 48-57 |

Table 2: ASCII codes used for N-grams

To confirm this trend we performed supervised machine learning over 100 obfuscated and normal webpages, representing around 600 JavaScript scripts. We then plotted the empirical probability density functions of our metrics.

Figure 4: pdf estimation for the frequency of characters u, x

Figure 5: pdf estimation for number frequency

Figure 6: Estimation of pdf for special characters

Density of Entropy and Average Length
Our second metric is a density of entropy. In information theory, the entropy represents the information contained in a sequence of characters:

E = - sum over i of (f_i / T) * log2(f_i / T)

where f_i is the frequency of the i-th byte value and T refers to the total count of bytes in the string; the density of entropy is E divided by the string length. For strings making too much use of the same characters, the density of entropy will tend to zero.

Figure 7: Estimation of the density of entropy

Similarly, for the average length we observe differences between normal and obfuscated code.

Figure 8: Estimation of pdf for density of entropy

6.3 Mathematical Assumptions


With the estimates of the above-mentioned metrics for a given code, we use a Naive Bayes classifier [34] to classify the code as obfuscated or normal. This classification method relies on the assumption that the distributions of the metrics are independent. Over a few samples we did not observe strong correlations between them, which allowed us, as a first approximation, to use this methodology.

In summary, defining x = (x1, x2, x3) as the estimated values of our three metrics for a given code, we test the hypothesis H on the estimated probability densities:

H = P(obfuscated | x) - P(normal | x), where P(C | x) = P(C) * p(x1 | C) * p(x2 | C) * p(x3 | C) / sum over classes C' of [ P(C') * p(x1 | C') * p(x2 | C') * p(x3 | C') ]

The probability density functions p(xi | C) are estimated with the kernel density estimation method [27], [33]. The sign and magnitude of H determine the verdict:

| H > 0 | Abs(H) | Detection |
| Yes | >= 10% | Obfuscated |
| Yes or No | < 10% | Suspicious |
| No | >= 10% | Non-obfuscated |
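To make the procedure concrete, the following self-contained sketch (illustrative only: the class and method names are ours, and the Gaussian kernel with a fixed bandwidth is a simplifying assumption) computes H from per-class training samples using kernel density estimation and applies the decision rule above.

import java.util.List;

// Illustrative Naive Bayes decision with Gaussian kernel density estimation.
public final class ObfuscationClassifier {

    // Kernel density estimate of p(x) from training samples, Gaussian kernel.
    static double kde(List<Double> samples, double x, double bandwidth) {
        double sum = 0.0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
        }
        return sum / (samples.size() * bandwidth);
    }

    // H = P(obfuscated | x) - P(normal | x), assuming independent metrics.
    // obf.get(i) and norm.get(i) hold the training samples of metric i per
    // class; x[i] is the metric vector of the code under test.
    static double hypothesis(List<List<Double>> obf, List<List<Double>> norm,
                             double[] x, double priorObf, double bandwidth) {
        double likeObf = priorObf, likeNorm = 1.0 - priorObf;
        for (int i = 0; i < x.length; i++) {
            likeObf *= kde(obf.get(i), x[i], bandwidth);
            likeNorm *= kde(norm.get(i), x[i], bandwidth);
        }
        double evidence = likeObf + likeNorm; // normalizing constant
        if (evidence == 0) return 0.0;        // no density mass: undecided
        return (likeObf - likeNorm) / evidence; // value in [-1, 1]
    }

    static String verdict(double h) {
        if (Math.abs(h) < 0.10) return "Suspicious";
        return h > 0 ? "Obfuscated" : "Non-obfuscated";
    }
}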







6.4 Algorithms
We applied our method only to the parameters of dangerous functions, by establishing a list of JavaScript functions that could potentially harm vulnerable systems. Our routine extracts all strings related to these functions; a sketch of this extraction step is given at the end of this subsection.

MaliciousFunctions = {
  "eval", "escape", "unescape", "document.write",
  "AddRouteEntry", "OpenURL", "obj.GetHistory",
  "obj.deleteReport", "iframe", "obj.saveNessusRC",
  "obj.addsetConfig", "obj.AddFolder",
  "obj.ExecuteStr", "storm.rawParse",
  "o2obj.LaunchApp", "initx", "GetRegValue",
  "SetRegValue", "SaveToFile", "Install",
  "target.Update", "SaveFile",
  "PTZCamPanelCtrl.ConnectServer",
  "BD.initx", "qvod.url", "Register",
  "RecordSend.SetPort", "open", "Open" };

Figure 9: JavaScript dangerous functions

The analysis routine consists of three steps:
1. String_Extraction(): extract all JavaScript in the webpage; search the JavaScript for malicious functions; extract the parameters of these functions; extract all strings related to the previously extracted parameters.
2. Metrics_Calculation(): N-grams; density of entropy of each string; average length of strings for the whole JavaScript.
3. Classification(): pdf estimations for each metric; test H; write the report.



For supervised machine learning, only string extraction and metrics calculation are invoked.
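A minimal sketch of the string-extraction step follows, assuming a simple regular-expression approach over the script text; the class, the pattern, and the truncated function list are illustrative, not our exact parser.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative extraction of string literals passed to dangerous functions.
public final class StringExtraction {

    static final String[] MALICIOUS_FUNCTIONS = { "eval", "unescape", "document.write" /* ... */ };

    // Returns the quoted string arguments of each dangerous function call.
    public static List<String> extract(String script) {
        List<String> strings = new ArrayList<>();
        for (String fn : MALICIOUS_FUNCTIONS) {
            // e.g. unescape("...") : capture the literal between the quotes
            Pattern p = Pattern.compile(Pattern.quote(fn) + "\\s*\\(\\s*[\"']([^\"']*)[\"']");
            Matcher m = p.matcher(script);
            while (m.find()) {
                strings.add(m.group(1));
            }
        }
        return strings;
    }
}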

6.5 Future Work
For future work we suggest that our model be refined with more metrics and trained on a very large dataset. However, the assumption of independence is quite strong and might not hold over a large number of metrics. In this respect we suggest other forms of modeling, such as copula theory [35], in order to take possible dependencies between metrics into consideration. Moreover, we had very encouraging results with the density of entropy, which makes us believe that machine learning methods could be adapted to reach the next level: malicious code detection itself. Indeed, this approach provides quite decent results in language processing, which encourages us to investigate the matter further for malicious code detection. We present the results of our method in Section 8.2.

7. SIGNATURE AND SIMILARITY ANALYSIS
7.1 Approach
As mentioned earlier, signature detection techniques are crippled by code obfuscation. In this respect we decided to combine a signature detection engine with another kind of analysis that can deal with obfuscation. In [9], A. Karnik et al. proposed a formalism to test the similarity between different programs. This approach can also be used as a detection mechanism for malicious code, provided that we have a database of malicious code. In summary, we chose to apply this method to hexadecimal shellcodes embedded in JavaScript: if a shellcode presents a great similarity with known malicious shellcodes, it is tagged as malicious.

7.2 Implementation
The implementation of our engine for similarity analysis consists of detecting the presence of any binary or hexadecimal shellcode inside the JavaScript of an HTML webpage. Our implementation was done in Java in three classes: AnalysisRunner.java, AnalysisEngine.java and Global.java. The AnalysisRunner class performs script extraction from downloaded HTML webpages, using the open-source Jericho HTML parser [6].

Figure 10: AnalysisRunner and AnalysisEngine (getTitle() and getScript() feed the extracted script to extractShellCode(), which passes the hex shellcode to nopCheck(), signatureCheck() and similarityCheck())

Figure 11: Signature dictionary check (the extracted hex shellcode is matched against the SQLite signature database; a match marks the page as malicious, otherwise it is considered benign)

Upon extraction, we can get the requested HTML tag string that we specify, such as the title of the webpage. For our analysis we extract the content of the script tags (the JavaScript), which is then sent to the AnalysisEngine. The AnalysisEngine class is the core of the static analysis. It contains a function for binary shellcode extraction that extracts any embedded shellcode found within the script portion of the HTML webpage by identifying the \x or %u escape characters. The extracted shellcode is then passed to various other functions to check whether it is malicious. Three functions perform the three portions of the analysis: the first is a NOP-sled check that examines the shellcode for the possible presence of a NOP string; the second and the third perform the signature check and the similarity check (obfuscation detection was presented in Section 6).
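A minimal sketch of the extraction and NOP-sled check follows, under the assumption that shellcode appears as runs of \xNN or %uNNNN escapes; the decoding details of our actual extractShellCode() may differ.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative shellcode extraction and NOP-sled check.
public final class ShellcodeExtraction {

    // Matches runs of \xNN or %uNNNN escapes inside a script string.
    static final Pattern HEX_RUN =
        Pattern.compile("((?:\\\\x[0-9a-fA-F]{2}|%u[0-9a-fA-F]{4})+)");

    // Decode one run of escapes into raw bytes (%uNNNN is decoded low byte first).
    public static byte[] decode(String run) {
        List<Byte> out = new ArrayList<>();
        for (int i = 0; i < run.length(); ) {
            if (run.startsWith("\\x", i)) {
                out.add((byte) Integer.parseInt(run.substring(i + 2, i + 4), 16));
                i += 4;
            } else { // %uNNNN: low byte, then high byte
                int v = Integer.parseInt(run.substring(i + 2, i + 6), 16);
                out.add((byte) (v & 0xFF));
                out.add((byte) (v >> 8));
                i += 6;
            }
        }
        byte[] bytes = new byte[out.size()];
        for (int i = 0; i < bytes.length; i++) bytes[i] = out.get(i);
        return bytes;
    }

    // NOP-sled check: flag a long run of 0x90 (x86 NOP) bytes.
    public static boolean hasNopSled(byte[] code, int minRun) {
        int run = 0;
        for (byte b : code) {
            run = (b == (byte) 0x90) ? run + 1 : 0;
            if (run >= minRun) return true;
        }
        return false;
    }
}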

Figure 12: Similarity check function flow (dataModeling() disassembles the hex shellcode with x86dis and counts opcode occurrences via Global.java; the maximum similarity against the signature database is compared to a minimum threshold to decide between malicious and benign)

The signature check is a standard verification against a set of known malicious codes that are compared to the extracted binary shellcode: if the extracted shellcode matches one of the malicious signatures, the page is tagged as malicious. The malicious signatures are stored locally in an SQLite [7] database and loaded via JDBC [8]. Of course, this method does not cope with obfuscation techniques; the third function tries to address this problem. The third function is similarityCheck, which uses the method of cosine similarity [9]. Before estimating the similarity, the machine instructions corresponding to the hexadecimal shellcode must be identified. The dataModeling function is written for this purpose: it first makes use of an external disassembler, x86dis [10], a Linux-based disassembler, to disassemble the shellcode. The dataModeling function then counts the number of occurrences of each opcode in the disassembled code (the number of mov, add, etc.); the Global class stores the list of x86 opcodes [11] to be counted. The dataModeling function processes both the shellcode and the signature and returns the opcode counts, which are then used to estimate the similarity to known malicious shellcodes. The maximum cosine similarity is recorded and compared to a minimum threshold.


7.3 Similarity Measures


Based on the work of [9], we use three measures of similarity, where i denotes a machine instruction, the sums run over all the machine instructions of the x86 architecture, and x_i and y_i are the opcode counts of the two codes being compared:
1. Cosine similarity = sum_i(x_i * y_i) / [ (sum_i x_i^2)^(1/2) * (sum_i y_i^2)^(1/2) ]
2. Jaccard correlation = sum_i(x_i * y_i) / [ sum_i x_i^2 + sum_i y_i^2 - sum_i(x_i * y_i) ]
3. Pearson correlation = sum_i (x_i - mean(x)) * (y_i - mean(y)) / [ sum_i (x_i - mean(x))^2 * sum_i (y_i - mean(y))^2 ]^(1/2)

For instance, the procedure for the cosine similarity is (accumulating over each instruction count):

// Accumulate the dot product and squared norms over the opcode counts
double sumXY = 0, sumXX = 0, sumYY = 0;
for (int x = 0; x < targetA.length; x++) {
    sumXY += targetA[x] * targetB[x];
    sumXX += targetA[x] * targetA[x];
    sumYY += targetB[x] * targetB[x];
}
double cosineSim = sumXY / Math.sqrt(sumXX * sumYY);

The targetA array is extracted from each signature in our signature dictionary database, whereas the targetB array is extracted from the shellcode, which is in turn extracted by our self-written parser. Finally, we keep only the highest measure of similarity and compare it to a threshold defined by the user.
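For completeness, the two other measures can be sketched in the same style (illustrative code, operating on the same targetA/targetB opcode-count arrays):

// Jaccard correlation over the opcode-count arrays.
static double jaccard(double[] a, double[] b) {
    double xy = 0, xx = 0, yy = 0;
    for (int i = 0; i < a.length; i++) {
        xy += a[i] * b[i];
        xx += a[i] * a[i];
        yy += b[i] * b[i];
    }
    return xy / (xx + yy - xy);
}

// Pearson correlation over the same arrays.
static double pearson(double[] a, double[] b) {
    double meanA = 0, meanB = 0;
    for (int i = 0; i < a.length; i++) { meanA += a[i]; meanB += b[i]; }
    meanA /= a.length;
    meanB /= b.length;
    double cov = 0, varA = 0, varB = 0;
    for (int i = 0; i < a.length; i++) {
        cov  += (a[i] - meanA) * (b[i] - meanB);
        varA += (a[i] - meanA) * (a[i] - meanA);
        varB += (b[i] - meanB) * (b[i] - meanB);
    }
    return cov / Math.sqrt(varA * varB);
}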

7.4 Limitations and Future Work

We present an outline of the results in Section 8. By definition, the measures of similarity cope well with instruction permutation, and we modified the procedure so that NOP code insertion does not change the estimation of similarity. However, in the case of instruction substitution (such as substituting sub -1 for add 1), the measure is hindered. As future work we suggest trying the most common permutations of instructions and keeping the highest value of similarity, but this would require an in-depth study of the 259 instructions of the x86 architecture. We also suggest expanding to other measures, such as the Levenshtein distance, or defining measures that intrinsically take the possibility of substitution into account (e.g. the counts of add and sub are correlated). Finally, the setting of the threshold could be refined by machine learning approaches.

8. RESULTS
8.1 Verification of similarity measures

Two shellcodes were extracted from Shell-Storm [12] and inserted into the JavaScript of the shell_1 and shell_2 HTML webpages. File shell_1_o is an obfuscated version of shell_1 via NOP dead-code insertion. File shell_2_o is an obfuscated version of shell_2 via instruction replacement of xor ecx, ecx with mov ecx, 0x00000000.

//----shell_1.html----
xor eax, eax
mov al, 0x25
push 0xFF
pop ebx
mov cl, 0x09
int 0x80

//----shell_1_o.html---- NOP dead code insertion
xor eax, eax
nop
mov al, 0x25
push 0xFF
pop ebx
mov cl, 0x09
nop
int 0x80

//----shell_2.html----
push 0x0B
pop eax
cdq
push edx
push 0x68732F2F
push 0x6E69622F
mov ebx, esp
xor ecx, ecx
int 0x80

//----shell_2_o.html---- Instruction replacement
push 0x0B
pop eax
cdq
push edx
push 0x68732F2F
push 0x6E69622F
mov ebx, esp
mov ecx, 0x00000000
int 0x80

Testing shell_1.html gives the following output:


Script 2:
var haimeng14=unescape("\x31\xc0\xb0\x25\x6a\xff\x5b\xb1\x09\xcd\x80");
var pkblack = haimeng16+haimeng16;
var headhaimeng = 20;
var i="i";
var ii="i";
var pkjuhaimeng = headhaimeng+heiba.length;
while (pkblack.length<pkjuhaimeng) pkblack+=pkblack;
fillhaimeng = pkblack.substring(0, pkjuhaimeng);
codejuhaimeng = pkblack.substring(0, pkblack.length-pkjuhaimeng);
while(codejuhaimeng.length+pkjuhaimeng<0x40000) codejuhaimeng = codejuhaimeng+codejuhaimeng+fillhaimeng;
modehaimeng = new Array();
for (x=0; x<300; x++) modehaimeng[x] = codejuhaimeng + heiba;
var buffer = '';
pkpkhaimeng.Register(11,buffer);

Extracted ShellCode: \x31\xc0\xb0\x25\x6a\xff\x5b\xb1\x09\xcd\x80
Connected to signature database...

shell_1.html
Extracted ShellCode: \x31\xc0\xb0\x25\x6a\xff\x5b\xb1\x09\xcd\x80
00000000  31 C0   xor eax, eax
00000002  B0 25   mov al, 0x25
00000004  6A FF   push 0xFF
00000006  5B      pop ebx
00000007  B1 09   mov cl, 0x09
00000009  CD 80   int 0x80
similarity: 1.0
similarity: 0.9486832980505138
similarity: 0.9799118698777318
similarity: 0.9354143466934853
similarity: 0.33541019662496846
similarity: 0.7071067811865475
similarity: 0.9185586535436918
similarity: 0.8745458870034072
similarity: 0.9733285267845753
similarity: 0.8890008890013334
similarity: 0.7941013883159839
similarity: 0.17437145811572893
similarity: 0.625
similarity: 0.7092081432669752
similarity: 0.9192388155425117
similarity: 0.0
similarity: 0.5111630125684566
similarity: 0.0
Max_sim: 1.0
-----------Analysis for this script = Malicious Shell Code Detected by signature dictionary- Malicious Shell Code Detected by cosine similarity-

The above output shows the JavaScript extracted from shell_1.html and the shellcode extracted from that JavaScript. The output of the disassembler is also shown, and the similarity is calculated for each signature in the SQLite database. It can be seen that the similarity is 1.0 (exact match), as the shellcode is the same as the first signature of the database. The shellcode is malicious and is detected both by the signature dictionary and by the cosine similarity method. With the obfuscated version shell_1_o.html we get the output:

shell_1_o.html
Extracted ShellCode: \x31\xc0\x90\xb0\x25\x6a\xff\x5b\xb1\x09\x90\xcd\x80
00000000  31 C0   xor eax, eax
00000002  90      nop
00000003  B0 25   mov al, 0x25
00000005  6A FF   push 0xFF
00000007  5B      pop ebx
00000008  B1 09   mov cl, 0x09
0000000A  90      nop
0000000B  CD 80   int 0x80
similarity: 1.0
similarity: 0.9486832980505138
similarity: 0.9799118698777318
similarity: 0.9354143466934853
similarity: 0.33541019662496846
similarity: 0.7071067811865475
similarity: 0.9185586535436918
similarity: 0.8745458870034072
similarity: 0.9733285267845753
similarity: 0.8890008890013334
similarity: 0.7941013883159839
similarity: 0.17437145811572893
similarity: 0.625
similarity: 0.7092081432669752
similarity: 0.9192388155425117
similarity: 0.5443310539518174
similarity: 0.5111630125684566
similarity: 0.0
Max_sim: 1.0
-----------Analysis for this script = Malicious Shell Code Detected by cosine similarity-

The maximum cosine similarity obtained is 1, from the first signature of the database, which corresponds to the unobfuscated version of the shellcode; thus the page is tagged as malicious despite the obfuscation. This is expected, as NOP dead-code insertion is handled by our algorithm. Signature detection, however, fails as it requires an exact match. We run our static analysis engine again on shell_2 and obtain:

shell_2.html
Extracted ShellCode: \x6a\x0b\x58\x99\x52\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x31\xc9\xcd\x80
00000000  6A 0B            push 0x0B
00000002  58               pop eax
00000003  99               cdq
00000004  52               push edx
00000005  68 2F 2F 73 68   push 0x68732F2F
0000000A  68 2F 62 69 6E   push 0x6E69622F
0000000F  89 E3            mov ebx, esp
00000011  31 C9            xor ecx, ecx
00000013  CD 80            int 0x80
similarity: 0.9486832980505138
similarity: 1.0


similarity: 0.9534625892455924
similarity: 0.8451542547285166
similarity: 0.282842712474619
similarity: 0.6708203932499369
similarity: 0.7745966692414834
similarity: 0.8733337646093731
similarity: 0.9233805168766388
similarity: 0.722897396012249
similarity: 0.6954006683576303
similarity: 0.22056438662814232
similarity: 0.6324555320336759
similarity: 0.6210590034081188
similarity: 0.8049844718999243
similarity: 0.0
similarity: 0.4913975701062781
similarity: 0.0
similarity: 0.9486832980505138
Max_sim: 1.0
-----------Analysis for this script = Malicious Shell Code Detected by signature dictionary- Malicious Shell Code Detected by cosine similarity-

shell_2_o.html
Extracted ShellCode: \x6a\x0b\x58\x99\x52\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\xB9\x00\x00\x00\x00\xcd\x80
00000000  6A 0B            push 0x0B
00000002  58               pop eax
00000003  99               cdq
00000004  52               push edx
00000005  68 2F 2F 73 68   push 0x68732F2F
0000000A  68 2F 62 69 6E   push 0x6E69622F
0000000F  89 E3            mov ebx, esp
00000011  B9 00 00 00 00   mov ecx, 0x00000000
00000016  CD 80            int 0x80
similarity: 0.9354143466934853
similarity: 0.8451542547285166
similarity: 0.8864052604279183
similarity: 0.8571428571428571
similarity: 0.23904572186687872
similarity: 0.7559289460184544
similarity: 0.9819805060619657
similarity: 0.6888949638271721
similarity: 0.9538209664765319
similarity: 0.8146130799625568
similarity: 0.816278935560424
similarity: 0.18641092980036
similarity: 0.6681531047810609
similarity: 0.46656947481584343
similarity: 0.9071147352221454
similarity: 0.0
similarity: 0.4808814966867716
similarity: 0.0
similarity: 0.9354143466934853
Max_sim: 0.9819805060619657
-----------Analysis for this script = Malicious Shell Code Detected by cosine similarity-

In this case the code still appears as malicious if the threshold is set below 0.98. However, this is only because the obfuscation made the shellcode more similar to another signature present in the database. As mentioned earlier, our approach does not deal well with instruction substitution and would require an in-depth study of the possible substitutions of machine instructions (e.g. substituting add 1 with sub -1).

8.2 Results of Obfuscation Detection


Our training dataset was limited to obfuscated malicious code on the one hand, and to normal malicious code as well as normal legitimate code on the other; therefore our static analysis engine is more liable to detect only malicious obfuscated code and to miss legitimate obfuscated code. Producing an accurate estimate of the precision and recall of our approach requires an important investment of time (as this is supervised machine learning limited to dangerous functions), so we only give an outline of the results. Over our training dataset we obtained a precision of 100% and a recall of 84%. On other webpages we do not have the data to provide an accurate estimate; for instance, Facebook uses obfuscation, but not with functions we consider dangerous, so the obfuscated strings are not taken into account by our program. To test our program, please refer to the Annexes.

9. CONCLUSION AND FUTURE WORK


Through this paper, we have illustrated the benefits of our method, which combines the best-of-breed features of existing low-interaction client-based honeypots. Our implementation provides a web crawler coupled with a static analysis engine that performs obfuscation detection and similarity analysis, delivering a product that is usable and extensible. The low-interaction client-based honeypot we have built demonstrates the ability to detect various forms of malicious code through static analysis methods involving similarity analysis and obfuscation detection. However, more can be done to develop a more robust system that builds upon the developments in this paper. For instance, a more comprehensive list of seed URLs could be obtained by adding more security blacklists, or by automating the aggregation process to update the seed URLs on a regular basis; this would allow the honeypot to traverse more recent malicious websites and improve the quality of the information received. Another possible improvement would be to introduce more browser emulation techniques, allowing the analysis of webpage content customized for different browsers, and the detection of malicious websites that present benign content to web crawlers. Regarding our static analysis engine, an online signature database would be necessary in order to move to a production environment. Also, as mentioned in Section 4, our analysis is limited to the detection of hexadecimal shellcodes and should therefore be complemented by other software to provide a more comprehensive solution. Finally, from the research point of view, we hope that others will be able to build on our work to improve code obfuscation detection and similarity analysis with the guidelines we provided in Sections 6 and 7.


10. ACKNOWLEDGMENTS
We thank Dr. Liang Zhenkai for introducing us to systems security issues, thus giving us the basic materials that were necessary to conduct this study.

11. REFERENCES
[1]. Bowman, M., Debray, S. K., and Peterson, L. L. 1993. Reasoning about naming systems. ACM Trans. Program. Lang. Syst. 15, 5 (Nov. 1993), 795-825. DOI= http://doi.acm.org/10.1145/161468.16147
[2]. PhoneyC, http://code.google.com/p/phoneyc/
[3]. ClamAV, http://www.clamav.net/lang/en/
[4]. Niels Provos, Dean McNamee, Panayiotis Mavrommatis, Ke Wang and Nagendra Modadugu. The Ghost In The Browser: Analysis of Web-based Malware.
[5]. Libemu, http://libemu.carnivore.it/
[6]. Jericho HTML parser, http://jericho.htmlparser.net/docs/index.html
[7]. SQLite SQL database, http://www.sqlite.org/
[8]. JDBC, http://www.oracle.com/technetwork/java/overview141217.html
[9]. Abhishek Karnik, Suchandra Goswami and Ratan Guha. Detecting Obfuscated Viruses Using Cosine Similarity Analysis, 2007. http://www.cs.ucf.edu/courses/cis4363/spr2007/Lectures/Detecting%20Obfuscated%20Code%20Using%20Cosine%20Similarity.ppt
[10]. x86dis, http://www.linuxcertif.com/man/1/x86dis/
[11]. x86 opcodes listing, http://asm.inightmare.org/opcodelst/
[12]. Shell-Storm, http://www.shell-storm.org/
[13]. Manuel Egele, Engin Kirda, and Christopher Kruegel. Mitigating Drive-by Download Attacks: Challenges and Open Problems.
[14]. M. Polychronakis, K. G. Anagnostakis, and E. P. Markatos. Emulation-based detection of non-self-contained polymorphic shellcode. In Recent Advances in Intrusion Detection, 10th International Symposium (RAID), pages 87-106, 2007.
[15]. W. K. Robertson, G. Vigna, C. Kruegel, and R. A. Kemmerer. Using generalization and characterization techniques in the anomaly-based detection of web attacks. In Proceedings of the
[16]. Network and Distributed System Security Symposium, NDSS 2006, San Diego, California, USA, 2006.
[17]. M. Egele, E. Kirda, and C. Kruegel. Defending browsers against drive-by downloads: Mitigating heap-spraying code injection attacks. In Detection of Intrusions and Malware, and Vulnerability Assessment, 6th International Conference, DIMVA 2009, 2009.
[18]. M. Egele, M. Szydlowski, E. Kirda, and C. Kruegel. Using static program analysis to aid intrusion detection. In DIMVA, pages 17-36, 2006.
[19]. Vinod P., V. Laxmi, and M. S. Gaur. Survey on Malware Detection Methods. Malaviya National Institute of Technology.
[20]. Peter Likarish, Eunjin (EJ) Jung, and Insoon Jo. Obfuscated Malicious Javascript Detection using Classification Techniques.
[21]. Michalis Polychronakis, Kostas G. Anagnostakis, and Evangelos P. Markatos. Comprehensive Shellcode Detection using Runtime Heuristics.
[22]. Marco Cova, Christopher Kruegel, and Giovanni Vigna. Detection and Analysis of Drive-by-Download Attacks and Malicious JavaScript Code.
[23]. Arini Balakrishnan and Chloe Schulze. Code Obfuscation Literature Survey.
[24]. Ramkumar Chinchani and Eric van den Berg. A Fast Static Analysis Approach to Detect Exploit Code inside Network Flows.
[25]. Niels Provos, Dean McNamee, Panayiotis Mavrommatis, Ke Wang and Nagendra Modadugu. The Ghost In The Browser: Analysis of Web-based Malware. Google, Inc.
[26]. YoungHan Choi, TaeGhyoon Kim, and SeokJin Choi. Automatic Detection for JavaScript Obfuscation Attacks in Web Pages through String Pattern Analysis. The Attached Institute of ETRI.
[27]. Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/
[28]. JOpt Simple, a Java library for parsing command line options, http://jopt-simple.sourceforge.net/
[29]. M. Qassrawi and Hongli Zhang. Client Honeypots: Approaches and Challenges.
[30]. M. Dalla Preda and R. Giacobazzi. Control Code Obfuscation by Abstract Interpretation. Dipartimento di Informatica, Università di Verona.
[31]. Automated JavaScript obfuscation, http://javascriptobfuscator.com/default.aspx
[32]. Restricted ASCII codes table, http://www.ascii-code.com/
[33]. Kernel density estimation, http://en.wikipedia.org/wiki/Kernel_density_estimation
[34]. Naive Bayes classifier, http://en.wikipedia.org/wiki/Naive_Bayes_classifier
[35]. Copula theory, http://en.wikipedia.org/wiki/Copula_(statistics)

12. ANNEXES
Our program is written in Java and consists of one executable JAR file that can be run on any platform; the signature database MalDB is necessary for the analysis engine. Please refer to README.txt for detailed instructions.


Figure 13: HoneySift command line interface

