Professional Documents
Culture Documents
Webshell Detection Techniques in Web Applications: Truong Dinh Tu, Cheng Guang, Guo Xiaojun, Pan Wubin
Webshell Detection Techniques in Web Applications: Truong Dinh Tu, Cheng Guang, Guo Xiaojun, Pan Wubin
Abstract - With widely adoption of online services, the hackers use webshell to browse files, upload tools, and
malicious web sites have become a malignant tumor of run commands; after that, they escalate privileges and pivots
the Internet. Through system vulnerabilities, attackers to other targets. Thus, the role of web server managers is to
can upload malicious files (which are also called ensure the privacy and security for customers’ web pages. A
webshells) to web server to create a backdoor for web page may contain thousands of code lines or more in
hackers’ further attacks. Therefore, finding and each file; therefore, it is very difficult and sometimes
detecting webshell inside web application source code impossible to review the source code manually to detect
are crucial to secure websites. In this paper, we propose malicious file such as webshell. So, finding and detecting
a novel method based on the optimal threshold values to webshell inside webpage application are necessary for
identify files that contain malicious codes from web securing websites.
applications. Our detection system will scan and look for
malicious codes inside each file of the web application, There are some known tools to find webshell, such as
and automatically give a list of suspicious files and a NeoPI [1], PHP Shell Detector [2], and Grep (which is a
detail log analysis table of each suspicious file for unix command for searching files for lines matching a given
administrators to check further. The Experimental regular expression). However, these tools still remain
results show that our approach is applicable to identify advantages and disadvantages (as discussed in Section IV).
webshell more efficient than some other approaches. In this paper, we propose a novel method based on the
optimal threshold values that we computed to detect
Keywords: backdoor, webshell, webshell detection, webshell from web application’s source code. Moreover, our
malicious web detection. method also solves their existing disadvantages.
Therefore, we can calculate and choose a threshold value On the other hand, we also collected 26,445 files of
for MS and MF approaches. open sources web pages, such as Oscommerce, VBB,
Joomla, Wordpress, PhpNuke, Phpbb, etc., that were shared
For example, in order to choose a threshold value for MF, on the Internet (these web pages may or may not contains
we scan each file of a large set of benign web pages, count malicious). We named this testing file is T2.
the appearance frequency of the MF samples (which are
stored in XMLDB) within benign files, and calculate the B. Experimental results
total score of each file bs {xi | i 1, n} , in which xi is 1. ROC Curve
th
the total score of i benign file, n is the number of files in A Receiver Operating Characteristic (ROC) curve is
benign web page. Moreover, we also scan and count the normally used to measure effectiveness of the detection
appearance frequency of those MFs in the file webshell, and method. It indicates how the detection rate changes, as the
compute the total score for each file webshell internal threshold is varied to generate more or fewer false
alarms. It plots detection accuracy against false positive
ws { yi | i 1, m} , in which, yi is the total score of ith
probability. Fig.3 depicts the ROC curve for our approaches.
file webshell, m is the number of file webshell. From there,
we choose the best threshold value to identify webshell. ROC curves signify the tradeoff between false positive
Similarly, we also can determine the best threshold value for rate (FPR) and true positive rate (TPR), which means that
MS. any increase in the TPR is accompanied by a decrease in the
FPR. Then, as shown by the ROC curves, lines in the
To determine a threshold value for the LW, we conduct diagram is the closest to the left-hand border and the top
a scan to look for the LW from a set of samples encrypted border compared to other diagrams, indicated that it offers
webshells given and a set of benign web sources. From the the finest result among the other methods. The experimental
list of values obtained, we decide an appropriate threshold result in Fig.3(c) shows that method based on MS performed
value for the LW. the best result.
For example, in order to choose an optimal threshold Furthermore, the area under the curve (AUC) is used to
value for the LW, we scan each file of the benign web pages measure the accuracy. An area of 1 means a perfect result
and calculate the value of the LW for each file while a 0.5 value is a worthless result. The AUC point
lw {z i | i 1, n} , in which zi is the value of the LW of system is as follows: 0.90 – 1.00 = excellent, 0.80 - 0.90 =
ith benign file, n is the number of files in benign web pages. good, 0.70 - 0.80 = fair, 0.60 - 0.70 = poor and 0.50 - 0.60 =
Similarly, we also scan and compute the value of the LW in fail. Fig.3 presents the experimental AUC values, it is found
that the MS-based approach (Fig.3(c)) has the best AUC
a set of encrypted webshells lws {hi | i 1, m} , in value (1.00) compared to the other approaches. The MF-
th
which hi is the value of the LW of i file webshell, m is based approach (Fig.3(a)) is in second place with 0.937,
the number of files in webshell. Then, we choose the best LW-based approach (Fig.3(b)) is in third place with 0.829
threshold value for the LW. AUC value, more efficient than NeoPI approach only with
0.77 AUC .
IV. EXPERIMENTAL AND EVALUATION
2. The optimal threshold value
A. Data set
The threshold value is a configurable value to adjust
For our testing, we collected total 13,151 files on the
heuristic sensitivity (TPR) that the system can classify
test dataset in which include 12,982 files of benign web
whether a file is webshell or benign. With high threshold
pages (with high confidence) that do not contain malicious
values, all webshell are classified as benign, resulting in a
codes and 169 malicious webshell files.
TPR of 0 and a FPR of 0. Lowering the threshold will
Our collecting method: for benign web pages, we increase both the TPR and FPR. For the lowest possible
downloaded directly from open sources websites which are threshold, both TPR and FPR are 1 (as shown in Fig.3).
prestigious and high rank based on Alexa [18]; for malicious
To determine an optimal threshold value (also called
webshell files, we downloaded and collected from several
optimal cut-off point), we considered the Youden Index (J)
security forums that were published and discussed on
[21].
Internet [17, 19]. We named this testing file is T1.
Fig.3 The ROC curve - false positive rate vs. true positive rate.
The red dot represents the optimal threshold used.
2) NeoPI has the advantage that it can find and detect V. CONCLUSION
obfuscated and encrypted contents within text and script
In summary, in this paper we proposed a novel method
files by calculating and listing some indices such as
based on the optimal threshold values of MS and MF
coincidence, Entropy, LongestWord, Signature; after that, it
samples as well as LW to identify webshell. Our detection
will give a rank for all files and top ten files with highest
system will scan and look for malicious codes inside each
rank[1]. However, its disadvantage is that it does not
file of web application, and automatically give a list of
provide a threshold value for these indices in order to
suspicious files as well as a detail log analysis table of each
determine which files is webshell. Therefore, we have to
suspicious file for administrator to examine further. The
refer to analysis and judgement from experts.
evidence from the experimental results shown that our
3) PHP Shell Detector (SD): Basically, SD has a known approach has relatively high recognition ratio and identify
webshells signature database saved in web server. The webshell better than recognition ratio of some other AV
advantage of this method is that its true positive can be up to softwares.
99% for known webshells in the database [2]. However, it
Generally, our current system can help to minimize the
has a disadvantage that if a new webshell has not been
time and costs for administrators by reviewing a source code
updated in database, this approach can not detect.
automatically and give warnings if there is any file
Our proposed method helps to solve these above suspected to be malicious.
disadvantages:
However, we aware that our work may have two
1) We provided a relatively completed list of MS and MF limitations. First, the number of MS and MF samples that
samples normally used in webshells, and each MS or we collected may be missing. Second, our system could not
MF was assigned a score according to its danger and differentiate between encrypted webshell files and some
can be updated to add/delete at any time within the file legal files which programmers used a few malicious
XML database (see Section III_A). functions for purposes to install or to encrypt sensitive
2) We calculated and gave the optimal threshold values in information such as passwords or personal information.
order to determine whether that file is webshell or not Thus, our future studies should examine and address above
(see Section III_G). This approach helps to reduce the limitations.
FPR compared to above methods (as shown in Fig.3(b)).
ACKNOWLEDGMENT
3) We proposed supporting techniques in order to improve
This work was supported by the Technology Support
the webshell detecting process that current tools have
Program (Industry) of Jiangsu Province (No. BE2011173);
not been mentioned or have been ignored. Those
The Future Network Proactive Program of Jiangsu Province
techniques are: (1) to find blacklist keyword in file
(No.BY2013095-5-03); The Six major talent Summit
name and in file comment lines; and (2) to check and
Project of Jiangsu Province (No. 2011ˉDZ024);
compare current file structure with its original web
application (see Section III_E, III_F).