You are on page 1of 7

IEEE - 33044

Webshell Detection Techniques in Web Applications


Truong Dinh Tu1,2,3 , Cheng Guang1,3 , Guo Xiaojun1,3, Pan Wubin1,3
1
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China.
2
Department of information technology, Tuyhoa Industrial College, Phuyen 620900, Vietnam.
3
Key Laboratory of Computer Network and Information IntegrationˈSoutheast University, Ministry of Education.

Abstract - With widely adoption of online services, the hackers use webshell to browse files, upload tools, and
malicious web sites have become a malignant tumor of run commands; after that, they escalate privileges and pivots
the Internet. Through system vulnerabilities, attackers to other targets. Thus, the role of web server managers is to
can upload malicious files (which are also called ensure the privacy and security for customers’ web pages. A
webshells) to web server to create a backdoor for web page may contain thousands of code lines or more in
hackers’ further attacks. Therefore, finding and each file; therefore, it is very difficult and sometimes
detecting webshell inside web application source code impossible to review the source code manually to detect
are crucial to secure websites. In this paper, we propose malicious file such as webshell. So, finding and detecting
a novel method based on the optimal threshold values to webshell inside webpage application are necessary for
identify files that contain malicious codes from web securing websites.
applications. Our detection system will scan and look for
malicious codes inside each file of the web application, There are some known tools to find webshell, such as
and automatically give a list of suspicious files and a NeoPI [1], PHP Shell Detector [2], and Grep (which is a
detail log analysis table of each suspicious file for unix command for searching files for lines matching a given
administrators to check further. The Experimental regular expression). However, these tools still remain
results show that our approach is applicable to identify advantages and disadvantages (as discussed in Section IV).
webshell more efficient than some other approaches. In this paper, we propose a novel method based on the
optimal threshold values that we computed to detect
Keywords: backdoor, webshell, webshell detection, webshell from web application’s source code. Moreover, our
malicious web detection. method also solves their existing disadvantages.

I. INTRODUCTION This paper is organized as follows: Section II describes


overview about webshell and classic webshell attacks;
Currently, the attacking Web server is one of the most
Section III presents methods and techniques that we used in
frustrating problems, not only for businesses and individuals,
this study; Section IV presents the results, discussion that
but also for the governments and organizations. A malicious
we obtained; and finally, Section V presents the conclusions
file that hackers planted on web server through system
of this paper.
vulnerabilities or web pages open source to create a
backdoor for them to come back next time is called
II. OVERVIEW OF WEBSHELL
Webshell. Once the webshell is run, it provides a web
interface for remote operations on the server with attacker Based on identification purposes, webshells can be
functionalities such as file transfer, command execution, divided into two main types: Non-encrypted and Encrypted
network reconnaissance, database connectivity, SQL webshells.
manager, etc. The technologies used to build web
x Non-encrypted Webshells: The source code of this
applications are often developed in languages such as PHP,
webshell is stored in clearly normal text. From the
Active Server Page (ASP), Java, Python, Ruby, Perl, etc. It
source code files, we can see functions that being used
not only supports full functions to access files, consoles,
in it.
networks, and database, but also runs and manages the
process in the system. This is quite handy through in x Encrypted Webshells: The source code of this kind of
building management applications in the web environment. webshell usually contains characters that are obfuscated
However, this is also the point that hackers aim to create a and not meaningful, but it still properly executes its
backdoor. One of the usual ways to attack web server is that features. The popular encryption functions that most

5th ICCCNT - 2014


July 11 - 13, 2014, Hefei, China
IEEE - 33044

attackers often use in their webshells are


base64_encode, base64_de-code, etc [14].

The attacker often finds a vulnerability in a hosted web


application and uploads a malicious dynamic web page to a
vulnerable web server [9, 10]; then (s)he uses the webshell
to browse files, upload tools, and run commands [11-13].
After that, (s)he escalates privileges and pivots to other
targets as allowed.

III. THE PROPOSED METHOD


Our proposed system includes following main modules
(as shown in Fig.1): (A) Calculate score for malicious
signature (MS) and malicious functions (MF) samples, (B)
Find MS samples, (C) Find MF samples, (D) Find Fig.1 The proposed framework to detect webshells
obfuscated and encrypted contents, (E) Find dangerous/-
low dangerous level may be assigned by 1-5.
blacklist keywords, (F) Compare with original application,
and finally, (G) Determine the optimal threshold value. Each In the following sections, we will present a detail for
module performs a specific task for scanning and matching components of the proposed system.
samples in web application source code with a known MS
and MF database. We collect a set of MF and MS samples B. Find MS samples
based on common tasks that are often used in the webshells, This module performs a search of matching MS
such as file management functions, database server access samples with files of web page source code by scanning
functions, filesystem functions, etc [12, 14-16]. every files in web application in order to find the MS
samples in those files (as shown in Fig.2, the sample.type =
To decrease the false positive, we give an extra method
1 implies that it is a MS sample). After that, we count the
of calculating score for the MS and MF samples based on
appearance frequency of those signatures and compute the
different dangerous levels, the popularity that those
total score of those files based on known database file which
functions are used in the webshells and in legitimate web
was calculated in Section III_A. If the signature total score
applications. Lists of the MF and MS samples and scores are
(STS) exceeds a threshold value (How to determine a
stored in a database file (XMLDB) to be easily updated and
threshold value, see Section III_G) of MS (Thre_MS), that
further developed.
file is marked as suspicious one.
A. Calculate score for MS and MF samples
C. Find MF samples
The scoring system is proposed based on a numerical
MF samples can be classified into the encoded/decoded
scale, such as 1-10, which corresponds to low, medium, and
support functions, command execution functions, filesystem
high dangerous level, (low <5, medium = 5-9, and high
functions, compressed function patterns, etc. The attackers
>=10). Thus, a high score of 10 or more may indicate almost
often use these MF samples for their webshells. Therefore,
no false positives; a medium score of 6-9 may imply a
finding out the MF samples is necessary to detect webshells.
possibility of some false positives and higher sensitivity;
and a low score of 5 or less may imply audit level In this module, we perform scanning each file to filter
sensitivity. MF name within files and check their dangerous levels by
matching with MF samples on known database file (in Fig.2,
For example, if the functions have a highly dangerous
the MF samples is sample.type = 2). The next step is to
level, they almost do not appear in benign web application
count the appearance frequency and calculate malicious
source codes; then the score for these functions may be
function total score (MFTS) of a file. If the MFTS exceeds a
assigned by 10 points or higher. However, if the functions
threshold value (How to determine a threshold value, see
have a medium dangerous level and they are sometimes
Section III_G) of MF (Thre_MF), that file is also marked as
used in the normal websites; then the scores for these
suspicious one.
functions may be assigned by 6-9; and the functions have a

5th ICCCNT - 2014


July 11 - 13, 2014, Hefei, China
IEEE - 33044

with obfuscated or encrypted codes.

Scott el al. [1] used this method in their NeoPI project


to find obfuscated and encrypted contents within text or
script files. However, if it is scanned on a set of source
codes that contain many images, videos, rich-text-format, js,
css, the false alert rate will be very high. Thus, we tried to
test NeoPI on a set of benign source codes, the result
showed a relatively high false alert rate. This will be
difficult for administrators to analyze.

Our approach in this module searches only and return


back the longest word (LW) that starts and ends by header
tags, such as “<?Php...?> for php language, "<\%\@ Page
Language>" for aspx language, similarly for other languages
because the source code is only executed when it is enclosed
in header tags. As experimental result in Fig.3(b) showed
that our this approach significantly reduces false positive
rate compared with the approach of NeoPI [1] from 24.5%
to 6.7%.

E. Find dangerous/blacklist keywords


The file webshells are often written and developed by
many hackers, in which there are comment lines with
suspicious keywords on the source codes. Therefore
scanning to look for threat keywords that are contained in
blacklist, such as “Web Shell by *”, “Hack by *”,
“developed by *”, “r57”, “c99”, “n3shell”, “TrYaG Team”,
http://cctea-m.ru/update/c999shell, http://ccteam.ru/files/c9-
Fig.2 Flow diagram of proposed detection algorithms
99sh_sources, etc., [17] will help in finding webshell. These
D. Find obfuscated or encrypted contents above keywords are rarely used in benign website source, so
their appearance inside source code are able to make higher
Obfuscated or encrypted contents are often stored as an malicious and may be marked as a suspicious file.
uninterrupted long string of encoded text within a file; after
that, it decodes strings into an actual source and executes. F. Compare with original application
Finding longest uninterrupted string existing in the files is
very useful to identify obfuscated or encrypted contents. If any webshell is able to bypass above techniques by
There are many popular encoding methods. For example, the hidden techniques of hackers, we will compare the
base64 encoding() will generate a long string without space scanned application code with the original one within this
characters; the following code illustrates the use of function module. After comparing, if any file is different, it will be
eval() and base64_decode(). marked as suspicious for further examination.
eval(base64_decode('cGFyc2Vfc3RyKCRfU0VSVkVSW
ydIVFRQX1JFRkVSRVInXSwkYSk7IGlmKHJlc2V0KCRhKT G. Determine the optimal threshold value
09J3FhJyAmJiBjb3VudCgkYSk9PTkpIHsgZWNobyAnPHp3 In order to determine a threshold value for each approach of
c3hlZGM+JztldmFsKGJhc2U2NF9kZWNvZGUoc3RyX3Jlc our proposed system, we collect a set of MS and MF
GxhY2UoIiAiLCAiKyIsIGpvaW4oYXJyYXlfc2xpY2UoJGEs samples based on common tasks that are often used in the
Y291bnQoJGEpLTMpKSkpKTtlY2hvICc8L3p3c3hlZGM+Jz
webshell such as file management functions, database server
t9'));?>
access functions, filesystem functions, etc., [12, 14-16].
In the above code, the hacker uses base64_decode() to After obtaining a set of necessary MS and MF samples, we
decode an encoded the string. Typically, the source code is scan a large amount of benign source codes, and calculate
written in relatively short length of words. So, identifying total score of the frequency of using these functions.
files with unusual long strings can help to recognize files

5th ICCCNT - 2014


July 11 - 13, 2014, Hefei, China
IEEE - 33044

Therefore, we can calculate and choose a threshold value On the other hand, we also collected 26,445 files of
for MS and MF approaches. open sources web pages, such as Oscommerce, VBB,
Joomla, Wordpress, PhpNuke, Phpbb, etc., that were shared
For example, in order to choose a threshold value for MF, on the Internet (these web pages may or may not contains
we scan each file of a large set of benign web pages, count malicious). We named this testing file is T2.
the appearance frequency of the MF samples (which are
stored in XMLDB) within benign files, and calculate the B. Experimental results
total score of each file bs {xi | i 1, n} , in which xi is 1. ROC Curve
th
the total score of i benign file, n is the number of files in A Receiver Operating Characteristic (ROC) curve is
benign web page. Moreover, we also scan and count the normally used to measure effectiveness of the detection
appearance frequency of those MFs in the file webshell, and method. It indicates how the detection rate changes, as the
compute the total score for each file webshell internal threshold is varied to generate more or fewer false
alarms. It plots detection accuracy against false positive
ws { yi | i 1, m} , in which, yi is the total score of ith
probability. Fig.3 depicts the ROC curve for our approaches.
file webshell, m is the number of file webshell. From there,
we choose the best threshold value to identify webshell. ROC curves signify the tradeoff between false positive
Similarly, we also can determine the best threshold value for rate (FPR) and true positive rate (TPR), which means that
MS. any increase in the TPR is accompanied by a decrease in the
FPR. Then, as shown by the ROC curves, lines in the
To determine a threshold value for the LW, we conduct diagram is the closest to the left-hand border and the top
a scan to look for the LW from a set of samples encrypted border compared to other diagrams, indicated that it offers
webshells given and a set of benign web sources. From the the finest result among the other methods. The experimental
list of values obtained, we decide an appropriate threshold result in Fig.3(c) shows that method based on MS performed
value for the LW. the best result.
For example, in order to choose an optimal threshold Furthermore, the area under the curve (AUC) is used to
value for the LW, we scan each file of the benign web pages measure the accuracy. An area of 1 means a perfect result
and calculate the value of the LW for each file while a 0.5 value is a worthless result. The AUC point
lw {z i | i 1, n} , in which zi is the value of the LW of system is as follows: 0.90 – 1.00 = excellent, 0.80 - 0.90 =
ith benign file, n is the number of files in benign web pages. good, 0.70 - 0.80 = fair, 0.60 - 0.70 = poor and 0.50 - 0.60 =
Similarly, we also scan and compute the value of the LW in fail. Fig.3 presents the experimental AUC values, it is found
that the MS-based approach (Fig.3(c)) has the best AUC
a set of encrypted webshells lws {hi | i 1, m} , in value (1.00) compared to the other approaches. The MF-
th
which hi is the value of the LW of i file webshell, m is based approach (Fig.3(a)) is in second place with 0.937,
the number of files in webshell. Then, we choose the best LW-based approach (Fig.3(b)) is in third place with 0.829
threshold value for the LW. AUC value, more efficient than NeoPI approach only with
0.77 AUC .
IV. EXPERIMENTAL AND EVALUATION
2. The optimal threshold value
A. Data set
The threshold value is a configurable value to adjust
For our testing, we collected total 13,151 files on the
heuristic sensitivity (TPR) that the system can classify
test dataset in which include 12,982 files of benign web
whether a file is webshell or benign. With high threshold
pages (with high confidence) that do not contain malicious
values, all webshell are classified as benign, resulting in a
codes and 169 malicious webshell files.
TPR of 0 and a FPR of 0. Lowering the threshold will
Our collecting method: for benign web pages, we increase both the TPR and FPR. For the lowest possible
downloaded directly from open sources websites which are threshold, both TPR and FPR are 1 (as shown in Fig.3).
prestigious and high rank based on Alexa [18]; for malicious
To determine an optimal threshold value (also called
webshell files, we downloaded and collected from several
optimal cut-off point), we considered the Youden Index (J)
security forums that were published and discussed on
[21].
Internet [17, 19]. We named this testing file is T1.

5th ICCCNT - 2014


July 11 - 13, 2014, Hefei, China
IEEE - 33044

Fig.3 The ROC curve - false positive rate vs. true positive rate.
The red dot represents the optimal threshold used.

J =Max(TPR - FPR) To demonstrate the ability of detecting unknown


=Max(TPR + TNR - 1), webshells, we also scanned on T2 and detected total 174
i.e., the maximum value of the sum of the TPR and True suspicious files, through manually analyze again and
Negative Rate (TNR) minus 1. The optimal value of the cut- submited suspicious files to websecure.co.il 1 and VirusTotal
off point t is thus obtained when the sum TPR+TNR is at its [20] for checking and analyzing. In which, we determined
maximum. Our experiments found the optimal threshold exactly 17 webshell files that were possibly inserted on web
values at which the total TPR and TNR achieve maximum application’s source code by hackers or web developers for
value (as shown in Table 1). evil purposes before sharing on internet. However, there are
Table 1. The optimal threshold values for our detection also 157 files that are false alert because programmers used
system and Area Under the Curve (AUC) a few malicious functions for purposes to install or to
encrypt sensitive information such as passwords or personal
Optimal TP TN information.
AUC
value (%) (%)
Thre_LW 83 74.5 93.3 0.829 C. Discussion and Evaluation
Thre_MF 13 88.8 89.9 0.937 Comparing advantages and disadvantages of some other
Thre_MS 0 100 100 1
detection tools, we have:

1) Grep is a unix command for searching files for lines


3. Compare with other detection tools
matching a given regular expression, such as:
We ran our tool to scan on data set T1 with the optimal
threshold values that have been determined (as shown in grep -RPnl "(gzinflate|eval|base64_decode)" /var/www/
Table 1). The results in Fig.4 shown that when we used The above command will search for files in directory
optimal the threshold values of Thre_LW=83, Thre_MS=0 “/var/www/” which contains files including lines matching
and Thre_MF=13, our system got 79.8% TPR with 2.0% the given “gzinflate|eval| base64_decode”. The advantage of
FPR. It means that in 13,151 scanned files, our system found this method is that it is able to easily find files containing
correctly 135/169 as webshells; and 259/12,982 non- the keywords. However, the disadvantage of this method is
webshells but were alerted as webshells. that it does not have a list of given dangerous functions.
Thus, users must have experience and know which MF
Compare the test results of our system with some other
should be found to detect webshell. Therefore, this method
AV software packages from virus total [20] on our data set
has many false positives since many of MF samples are also
T1, the results in Fig.4 show that our tool has relatively high
used by legitimate web applications.
recognition ratio and determine webshell better than
recognition ratio of some other AV softwares (compare
results in Fig.4 were scanned on May 19, 2014). 1
It is a team which shall carry out an audit to determine whether a detected
suspicious file is an unfounded suspicion or a real threat.

5th ICCCNT - 2014


July 11 - 13, 2014, Hefei, China
IEEE - 33044

Fig.4 Comparison of detecting results between our system


and some other anti-virus (AV) software packages

2) NeoPI has the advantage that it can find and detect V. CONCLUSION
obfuscated and encrypted contents within text and script
In summary, in this paper we proposed a novel method
files by calculating and listing some indices such as
based on the optimal threshold values of MS and MF
coincidence, Entropy, LongestWord, Signature; after that, it
samples as well as LW to identify webshell. Our detection
will give a rank for all files and top ten files with highest
system will scan and look for malicious codes inside each
rank[1]. However, its disadvantage is that it does not
file of web application, and automatically give a list of
provide a threshold value for these indices in order to
suspicious files as well as a detail log analysis table of each
determine which files is webshell. Therefore, we have to
suspicious file for administrator to examine further. The
refer to analysis and judgement from experts.
evidence from the experimental results shown that our
3) PHP Shell Detector (SD): Basically, SD has a known approach has relatively high recognition ratio and identify
webshells signature database saved in web server. The webshell better than recognition ratio of some other AV
advantage of this method is that its true positive can be up to softwares.
99% for known webshells in the database [2]. However, it
Generally, our current system can help to minimize the
has a disadvantage that if a new webshell has not been
time and costs for administrators by reviewing a source code
updated in database, this approach can not detect.
automatically and give warnings if there is any file
Our proposed method helps to solve these above suspected to be malicious.
disadvantages:
However, we aware that our work may have two
1) We provided a relatively completed list of MS and MF limitations. First, the number of MS and MF samples that
samples normally used in webshells, and each MS or we collected may be missing. Second, our system could not
MF was assigned a score according to its danger and differentiate between encrypted webshell files and some
can be updated to add/delete at any time within the file legal files which programmers used a few malicious
XML database (see Section III_A). functions for purposes to install or to encrypt sensitive
2) We calculated and gave the optimal threshold values in information such as passwords or personal information.
order to determine whether that file is webshell or not Thus, our future studies should examine and address above
(see Section III_G). This approach helps to reduce the limitations.
FPR compared to above methods (as shown in Fig.3(b)).
ACKNOWLEDGMENT
3) We proposed supporting techniques in order to improve
This work was supported by the Technology Support
the webshell detecting process that current tools have
Program (Industry) of Jiangsu Province (No. BE2011173);
not been mentioned or have been ignored. Those
The Future Network Proactive Program of Jiangsu Province
techniques are: (1) to find blacklist keyword in file
(No.BY2013095-5-03); The Six major talent Summit
name and in file comment lines; and (2) to check and
Project of Jiangsu Province (No. 2011ˉDZ024);
compare current file structure with its original web
application (see Section III_E, III_F).

5th ICCCNT - 2014


July 11 - 13, 2014, Hefei, China
IEEE - 33044

REFERENCES [11] Garg, A. and S. Singh, A Review on Web Application


Security Vulnerabilities [J]. International Journal, Vol.3,
[1] Scott, B. and H. Ben, Web Shell Detection Using NeoPI. No.1, pp.222-226, 2013.
[R/OL]:http://resources.infosecinstitute.com-/web-shell- [12] Mirdula, S. and D. Manivannan, Security Vulnerabilities in
detection/ Web Application-An Attack Perspective [J]. International
[2] Luczko, P., PHP Shell Detector [R/OL]:https://github- Journal of Engineering and Technology, Vol.5, No.2, pp.
.com/emposha/PHP-Shell-Detector. 1806-1811, 2013.
[3] Mingkun, X., C. Xi and H. Yan, Design of Software to Search [13] Cova, M., C. Kruegel and G. Vigna. Detection and analysis of
ASP Web Shell [J]. Procedia Engineering 29, pp. 123-127, drive-by-download attacks and malicious JavaScript code [C].
2012 In Proceedings of the 19th international conference on World
[4] Jiankang, H., et al., Research of Webshell Detection Based on wide web, ACM, pp. 281-290, 2010.
Decision Tree [J]. Network of new media technology ISTIC, [14] Exploitable PHP functions [EB/OL]:http://stackoverf-
Vol.1, No.6, pp. 15-19, 2012. low.com/questions/3115559/exploitable-php-functions.
[5] Sasi, R., Effectiveness of Antivirus in Detecting Web [15] Canali, D. and D. Balzarotti. Behind the scenes of online
Application Backdoors [R]. 2013. attacks: an analysis of exploitation behaviors on the web [C].
[6] Hou, Y., et al., Malicious web content detection by machine In Proceedings of the 20th Annual Network & Distributed
learning [J]. Expert Systems with Applications, Vol.37, No.1, System Security Symposium, 2013.
pp.55-60, 2010 [16] Verma, A. and D.S. Insan, Signature Based Detection of Web
[7] Koo, T., et al. Malicious Website Detection Based on Application Attacks [J]. International Journal of Advanced
Honeypot Systems [C]. International Conference on Research in Computer Science and Software Engineering
Advances in Computer Science and Engineering (CSE 2013). Vol.3, Issue-8, pp. 117, August-2013
Atlantis Press, 2013 [17] Certified Ethical Hacker [EB/OL]:http://ceh.vn/@4rum/-
[8] Agbefu, R.E., Y. Hori and K. Sakurai, Domain Information forumdisplay.php?fid=10.
Based Blacklisting Method For The Detection Of Malicious [18] Alexa-The Web Information Company. Availa-ble from
Webpages [J]. International Journal of Cyber-Security and [EB/OL]: http://www.alexa.com.
Digital Forensics (IJCSDF), Vol. 2, No.2, pp. 36-47, 2013. [19] Project Hosting on Google Code provides a free collaborative
[9] Jakobsson, M. and Z. Ramzan, Crimeware: understanding development environment for open source projects [EB/OL]:
new attacks and defenses [M]. Addison-Wesley Professional, http://code.google.com/.
608 pages, 2008. [20] VirusTotal - Free Online Virus, Malware and URL Scanner
[10] Canali, D., D. Balzarotti and A. Francillon. The role of web [EB/OL]: https://www.virustotal.com/.
hosting providers in detecting compromised websites [C]. In [21] Schisterman E F& Perkins N. Confidence Intervals for the
Proceedings of the 22nd international conference on World Youden Index and Corresponding Optimal Cut-Point[J].
Wide Web. International World Wide Web Conferences Communications in Statistics - Simulation and Computation,
Steering Committee, pp.177-188, 2013. Vol.36, No.3, pp. 549-563, 2007.

5th ICCCNT - 2014


July 11 - 13, 2014, Hefei, China

You might also like