You are on page 1of 15

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/272295945

SQLiDDS: SQL Injection Detection using Query Transformation and Document


Similarity

Conference Paper · February 2015


DOI: 10.1007/978-3-319-14977-6_41

CITATIONS READS
20 2,352

3 authors, including:

Debabrata Kar
Silicon Institute of Technology
6 PUBLICATIONS   122 CITATIONS   

SEE PROFILE

All content following this page was uploaded by Debabrata Kar on 16 February 2015.

The user has requested enhancement of the downloaded file.


SQLiDDS: SQL Injection Detection using
Query Transformation and Document Similarity

Debabrata Kar1 , Suvasini Panigrahi2 , and Srikanth Sundararajan3


1
Department of Computer Science and Engineering
Silicon Institute of Technology, Bhubaneswar, India
debabrata.kar@silicon.ac.in
2
Department of Computer Science and Engineering
VSS University of Technology, Burla, Sambalpur, India
suvasini26@gmail.com
3
Indian Institute of Technology Bhubaneswar, India
(currently at Helion Advisors, Bangalore, India)
sundararajan.srikanth@gmail.com

Abstract. SQL Injection Attack has been a major security threat to


web applications since last 15 years. Nowadays, hackers use automated
tools to discover vulnerable websites and launch mass injection attacks.
Accurate run-time detection of SQL injection has been a challenge in
spite of extensive research in this area. This paper presents a novel
approach for real-time detection of SQL injection attacks using query
transformation and document similarity measure. Acting as a database
firewall, the proposed system named SQLiDDS, can protect multiple
web applications using the database server. With additional inputs from
human expert, SQLiDDS can also become more robust over time. Our
experimental results confirm that this approach can effectively detect
and prevent all types of SQL injection attacks with good accuracy yet
negligible impact on system performance. The approach was tested on
web applications built using PHP and MySQL, however it can be easily
adopted in other platforms with minimal changes.

Keywords: sql injection detection, query transformation, sql injection


prevention, document similarity, database firewall, phrase similarity

1 Introduction
Web applications are exposed to various types of security threats among which
SQL Injection attack is predominantly used against web databases. The Open
Web Application Security Project (OWASP) ranks it on top among the Top-10
security threats [1]. According to TrustWave 2012 Global Security Report [2],
SQL injection was the number one attack method for four consecutive years.
Nowadays, attackers use sophisticated Botnets [3] which automatically discover
vulnerable web pages from search engines like Google and launch mass SQL
injection attacks from distributed sources. About 97% of data breaches across
the world occur due to SQL injection alone [4].
2 D. Kar, S. Panigrahi, S. Sundararajan

Research on SQL injection in the literature can be broadly categorized into:


defensive coding practices, vulnerability testing approaches, and prevention based
approaches. Defensive coding practices [5–7] consist of techniques applied dur-
ing application development and require programming in a specific way, but in
practice, programmers often ignore the security aspects. On the other hand, vul-
nerability testing approaches [8–10] rely on discovering possible SQL injection
hotspots so that they can be fixed before deployment. The effectiveness of this
approach is limited by the number of security issues identified. Majority of the
research on SQL injection has been done under prevention based approaches. In
general, prevention based approaches like [11–15] consist of preparing a model of
SQL queries and/or the application’s behavior during normal-use in a secured
environment and then utilize the model at run-time to prevent SQL injection
attempts. Any modifications or enhancements to the application code requires
rebuilding the normal-use model, which is a major disadvantage of this approach.
Further, these are mostly designed to protect a single web application and usu-
ally suitable for a specific language and database platform.
In this paper, we present a novel technique to detect SQL injection attacks
by applying document similarity on transformed queries. Document similarity
is typically used in information retrieval domain. We implemented the tech-
nique in a tool we named SQLiDDS (SQL i njection Detection using Document
S imilarity) and validated the approach on five sample web applications. The
query transformation scheme and application of document similarity concept for
detecting SQL injection attacks are the main contributions of our work. Based on
interesting observations made during the course of research, we postulate that,
it is sufficient to examine only the WHERE clause part of run-time queries for
detecting injection attacks, which is another significant contribution. SQLiDDS
can protect multiple web applications interacting with a database server (as in
shared hosting environments), which is an advantage over existing methods.
Rest of the paper is organized as follows. Section 2 explains the mechanism
of SQL injection attack by example. Section 3 discusses our approach in detail
covering the query transformation scheme, document similarity measure, and
architecture of SQLiDDS. Experimental results are discussed in Sect. 4 along
with assessment of performance overhead. Quoting related works in Sect. 5, we
conclude the paper in Sect. 6 with future directions of research.

2 SQL Injection Attack


SQL injection attacks occur due to a commonly found vulnerability of using
raw input data received through web forms or URL parameters without proper
validation to construct dynamic SQL queries. Attackers exploit this vulnerability
by inserting carefully crafted SQL keywords and values, effectively altering the
semantics of dynamic queries, and causing them to return results intended by
the attacker. To understand the basic mechanism, consider a product detail page
of an E-commerce website accessed by clicking a link to the URL:
http://www.somewebstore.com/product_details.php?pid=24
SQL injection Detection using Document Similarity 3

The script product_details.php becomes vulnerable to SQL injection if the


programmer uses the parameter pid to construct an SQL query without proper
validation or type-checking. For example, in the following PHP code:
$query = "SELECT * FROM products WHERE prod_id = ".$_GET['pid'];
$result = mysql_query($query, $dbconn);

the parameter pid has been used directly for constructing the dynamic query
assuming that an integer value (e.g., 24) will be always received. However, if
an attacker adds “ OR 1 = 1” to the URL then the string value “24 OR 1 =
1” will be used in the code which will generate the query as “SELECT * FROM
products WHERE prod id = 24 OR 1 = 1”. Upon execution, it will return the
entire products table as the query evaluates to true for all rows. This is an
example of the simplest type of SQL injection known as tautological attack. There
are several other types of SQL injection attacks [16], using which an attacker
can steal sensitive information from other database tables.
Commercial Intrusion Detection Systems (IDS) and Web Application Fire-
walls (WAF) mostly rely on regular expression based filters created from known
attack signatures. However, signature-based filters can be circumvented using
a number of bypassing techniques [17, 18]. For example, consider the following
attacks, showing only the injection code added by an attacker:
OR 'A' = 'A'
OR -5.24 = 17.43 - 22.67
OR 419320 = 0x84B22 - 0x1E52A
OR 'ABC' = CONCAT('A', 'B', 'C')
OR 'ABC' = CoNcAt(cHaR(0x28 + 25), cHaR(0x42), cHAr(80 - 0x0d))
OR 'XYZ' = sUbStRiNg(cOncAt('AB', 'CX', 'YZ', 'EF'), 4, 3)
OR 'XYZ' = /*!SuBsTrInG(*/CoNcAt('aB',/*!'cX','YZ',*/'eF')/*!,4,3)*/

All of these expressions evaluate to true, hence they are tautological attacks.
Similar bypassing techniques can also be used in other types of SQL injection
attacks. SQL injection attack expressions can be formed in a number of ways
to the extent that constructing regular expression based filters becomes almost
impossible. Signature based systems can be eluded by the attacker with creative
changes to the injected expressions with a little effort.

3 Proposed Approach
Two key ideas, the query transformation scheme and application of document
similarity measure, constitute the core of our approach. The system begins with
an initial set of SQL injection attacks collected from honeypot web applications.
These are converted into text form using a transformation scheme and then
grouped into clusters by their document similarity. Attack vectors in each cluster
are merged into a document. Each document thus contains a set of highly similar
attack vectors. At run-time, SQL injection attacks are detected by comparing
the document similarity of an incoming query with these documents. Rest of
this section describes our approach and architecture of SQLiDDS in detail.
4 D. Kar, S. Panigrahi, S. Sundararajan

3.1 Query Transformation Scheme


We developed an extension of the transformation scheme proposed in [19], so that
it normalizes an SQL query into a sentence-like form, and facilitates application
of document similarity measure. The extended scheme uses only capital A-Z for
all tokens and space character as token separator. All symbols and special charac-
ters are also transformed into words except the underscore (_) character, because
it is frequently used in MySQL system databases, system tables and functions,
e.g., information_schema, table_priv, CURRENT_USER(), LAST_INSERT_ID(),
etc. Splitting such tokens at the underscore character would result in over tok-
enization and negatively affect the similarity values. The step-by-step process of
transforming an SQL query has been detailed in Table 1.

Table 1. The Extended Query Transformation Scheme

Step Token/Symbol Transform


1. Newline characters (\r or \n) if any Remove
2. Inline Comments (/*. . . */) if any Remove
3. Anything within single/double quotes
(a) Hexadecimal value HEX
(b) Decimal value DEC
(c) Integer value INT
(d) IP address IPADDR
(e) Single alphabet character CHR
(f) General string (none of the above) STR
4. Anything outside single/double quotes
(a) Hexadecimal value HEX
(b) Decimal value DEC
(c) Integer value INT
(d) IP address IPADDR
5. System objects
(a) System databases SYSDB
(b) System tables SYSTBL
(c) System table column SYSCOL
(d) System variable SYSVAR
(e) System views SYSVW
(f) System stored procedure SYSPROC
6. User-defined objects
(a) User databases USRDB
(b) User tables USRTBL
(c) User table column USRCOL
(d) User-defined views USRVW
(e) User-defined stored procedures USRPROC
(f) User-defined functions USRFUNC
7. SQL keywords, functions and reserved words To Uppercase
8. Any token/word not transformed so far
(a) Single alphabet CHR
(b) Alpha-numeric without space STR
9. Other symbols and special characters As per Table 2
10. The entire query To Uppercase
11. Multiple spaces Single Space

Each step of query transformation is a find-and-replace operation, done


by appropriately using the preg_replace() and str_replace() built-in func-
tions available in PHP. For example, preg_replace("/\b0x[0-9a-f]+\b/i",
"HEX", $query) converts hexadecimal numbers in $query into HEX in a case-
SQL injection Detection using Document Similarity 5

Table 2. Transformation of Special Characters and Symbols

Symbol Name Transform Symbol Name Transform


` Backtick Remove ( Opening Parenthesis LPRN
!= or <> Not Equals NEQ ) Closing Parenthesis RPRN
&& Logical AND AND { Opening Brace LCBR
|| Logical OR OR } Closing Brace RCBR
~ Tilde TLDE [ Opening Bracket LSQBR
! Exclaimation EXCLM ] Closing Bracket RSQBR
@ At-the-rate ATR \ Back Slash BSLSH
# Pound HASH : Colon CLN
$ Dollar DLLR ; Semi-colon SMCLN
% Percent PRCNT " Double Quote DQUT
^ Caret XOR ' Single Quote SQUT
& Ampersand BITAND < Less Than LT
| Pipe or Bar BITOR > Greater Than GT
* Asterisk STAR , Comma CMMA
- Hyphen/Minus MINUS . Stop or Period DOT
+ Addition/Plus PLUS ? Question Mark QSTN
= Equals EQ / Forward Slash SLSH

insensitive manner. All symbols and special characters are transformed in Step–9
as per the scheme given in Table 2. The last two steps convert the transformed
query to uppercase and reduce multi-spaces to single spaces. To visualize how the
transformation scheme works, consider the following SQL queries, intentionally
written in mixed-case to show the effect of transformation.

sEleCt * fRoM products wHeRe price > 10.00 aNd discount < 8
SeLeCt email FrOm customers WhErE fname LIKE 'john%'

By applying the transformation scheme, the above queries are transformed


into the following form respectively:

SELECT STAR FROM USRTBL WHERE USRCOL GT DEC AND USRCOL LT INT
SELECT USRCOL FROM USRTBL WHERE USRCOL LIKE SQUT STR PRCNT SQUT

Consider an injected query generated due to a cleverly crafted attack to


bypass detection (taken from the examples given in Sect. 2):

SELECT * FROM products WHERE prod_id = 24 OR 'ABC' = CoNcAt(cHaR(0x28 +


25), cHaR(0x42), cHAr(80 - 0x0d))

This query contains a number of symbols and operators, which is not suitable
for application of a document similarity measure. The transformation scheme
converts it completely into text form as:

SELECT STAR FROM USRTBL WHERE USRCOL EQ INT OR SQUT STR SQUT EQ CONCAT
LPRN CHAR LPRN HEX PLUS INT RPRN CMMA CHAR LPRN HEX RPRN CMMA CHAR LPRN
INT MINUS HEX RPRN RPRN

Any SQL query, irrespective of its complexity, is thus transformed into a


series of words separated by spaces like a sentence in English. The structural
form of the query is correctly maintained by the transformation scheme.
6 D. Kar, S. Panigrahi, S. Sundararajan

3.2 Document Similarity Measure

Several document similarity measures are available in the literature, out of


which the vector space model (VSM) using term-frequency (TF) and inverse-
document-frequency (IDF) weighting is widely used in information retrieval
[20, Chap. 6]. In this model, two documents di and dj are represented in vec-
tor form by the TF-IDF term weights as d~i = hwt1 ,di , wt2 ,di , . . . , wtn ,di i and
d~j = hwt1 ,dj , wt2 ,dj , . . . , wtn ,dj i. Similarity between the documents is computed
as the cosine of the angle θ between the document vectors. Formally,

Pn
d~i · d~j wt ,d wt ,d
Sim(di , dj ) = cos(θ) = = qP k=1 kqiP k j (1)
kd~i kkd~j k n
w2
n
k=1 w2 tk ,di k=1 tk ,dj

Although TF-IDF weighting is popular in information retrieval domain and


has been used for intrusion detection in [21,22], our experiments did not produce
acceptable results using this measure. A major shortcoming of this measure is
that it ignores the order of occurrence of terms in the document. For example,
“John is older than Mary” and “Mary is older than John”, are determined
as identical documents. The loss of term-order can be overcome by considering
phrases instead of individual terms. By phrase, we mean a contiguous sequence of
terms, not the syntactical or structural clauses of SQL. A phrase in this context
is same as an N-gram where each gram corresponds to a term.
Considering that a document d is a sequence of terms (t1 , t2 , . . . , tn ), phrases
are extracted by sliding a window of size w across the document, i.e., every
phrase contains w terms. This is termed as phrase length. For example, if the
phrase length is 2, then the phrases extracted are (t1 t2 , t2 t3 , t3 t4 , . . . , tn−1 tn ).
The number of phrases in a document therefore equals to (n − w + 1). Let the
phrases extracted from two documents di and dj be Pi = (pi1 , pi2 , . . . , pim )
and Pj = (pj1 , pj2 , . . . , pjn ) respectively. Then, the set of unique phrases in
both the documents is P = Pi ∪ Pj = {p1 , p2 , . . . , pk } where k ≤ m + n; the
upper bound occurs when the documents are dissimilar. The documents can now
be represented in vector form in k-dimensions by the frequency of the phrases
they contain as d~i = hfp1 ,di , fp2 ,di , . . . , fpk ,di i and d~j = hfp1 ,dj , fp2 ,dj , . . . , fpk ,dj i.
Following (1), phrase-based cosine similarity can be computed by:
Pk
q=1 fpq ,di fpq ,dj
Sim(di , dj ) = qP qP (2)
k 2 k
q=1 fpq ,di q=1 fp2q ,dj

The similarity computed by (2) is indirectly dependent on the phrase-length


w. Taking w = 1 is same as computing similarity by unweighted term frequencies.
A natural question arises – how to choose the appropriate phrase length? A
rational answer is obtained by looking at the classic injection attack “OR 1 =
1” which gets transformed to “OR INT EQ INT” – containing four words. In rest
of the paper, we adopt the phrase length w = 4 for computing similarity.
SQL injection Detection using Document Similarity 7

3.3 Detection Strategy of SQLiDDS

Design of SQLiDDS is strategically guided by two interesting observations which


we were unable to find mentioned anywhere in the literature. The first obser-
vation surfaced while looking for an answer to “where do SQL injections most
commonly occur in an injected query? ” Looking at the general coding practices
for developing web applications, we find that, dynamic SQL queries are usu-
ally constructed in two parts: (1) a static part hard coded by the programmer,
and (2) a dynamic part produced by concatenation of SQL keywords, delimiters
and received input values. Since the main objective of a dynamic query is to
fetch different set of records depending on the input, the parameter values must
be used in the WHERE clause to specify the selection criteria. This compulsive
programming need as well as the de facto practice of using input values in the
WHERE clause part of dynamic queries resolves that SQL injections happen after
the WHERE keyword, almost in all cases. In fact, SQL injection before the WHERE
keyword is extremely rare – not a single instance was found in over 16,500 queries
containing SQL injection attacks we examined for this study.
The second interesting observation stems from the basic intention behind an
SQL injection attack – stealing sensitive information from the back-end database,
such as credit card numbers. By carefully formulating the injection code, the
attacker tries to get the intended data displayed on the web page itself, which
is the only method for data breach through SQL injection. Data is fetched by
SELECT queries in SQL, which implies that unless the injectable parameter is
used to construct a dynamic SELECT query that delivers data displayed on the
web page, it is not useful for data breach. This points to another assertion that
SQL injections mostly occur through dynamically generated SELECT queries.
The above observations lead to the strategic inference that, it is sufficient to
examine the part of an SQL query after the WHERE keyword in order to detect the
presence of any injection attack, which significantly narrows down the scope of
processing. Another plausible strategy may be to intercept only SELECT queries
for examination, however, this may enable the attacker to cause damage to
the data, if not breach. As such, by considering the part of a query after the
WHERE keyword, we automatically include UPDATE and DELETE queries in the
investigation, because they also support a WHERE clause by SQL syntax.
A cognizant question may be raised concerning a special kind of attack,
known as Second Order SQL Injection, which is initiated through INSERT queries.
In this method, the attacker submits values containing injected code through
form fields (e.g., a registration form) which is not executed at that time, but
gets stored in the database as normal data. It takes effect when another part of
the web application uses that stored value in a dynamic query. It is interesting
to note that: 1) the victim dynamic query must be a SELECT query fetching data
for display on the web page, and 2) the stored value containing injected code
must be used in its WHERE clause. By detecting injection in the WHERE clause, the
secondary attack would render ineffective. Therefore, the strategy to examine
only the portion of queries after the WHERE keyword is correct and sufficient for
detecting SQL injection attacks, including second order injections.
8 D. Kar, S. Panigrahi, S. Sundararajan

3.4 Architecture of SQLiDDS

Figure 1 shows the detail architecture of our approach. SQLiDDS operates in


two phases: (1) offline phase, and (2) run-time phase. The offline phase begins
with a collection of SQL injected queries (discussed in Sect. 4.1). The portion of
the injected queries after the first WHERE keyword are extracted and converted to
text form using the transformation scheme. A major advantage of the transfor-
mation scheme is that several queries get converted into the same form, which
greatly reduces space and processing. The distinct transformed structures are
stored in the injected structures database. The MD5 hash value of each structure
is computed and stored separately in a reference hash table, which helps avoid-
ing similarity computation for every incoming query at run-time. The distinct
injected structures are also clustered into groups based on their phrase similarity
(discussed in Sect. 4.2). Each cluster is merged into a single document and saved
in a document repository. Each document thus contains a set of highly similar
injection patterns against which queries will be compared at run-time.
The offline phase completes with creation of the reference hash table and the
documents of injected structures, making SQLiDDS ready for deployment as a
database firewall between the web server and database server. Two thresholds,
(1) Rejection Threshold (Rth ), and (2) Suspicious Threshold (Sth ) are defined
in SQLiDDS configuration, such that Rth > Sth .

Fig. 1. Architecture of SQLiDDS (dashed: offline, solid: run-time)

At run-time, SQLiDDS intercepts queries issued by web applications to the


database server. The portion of a run-time query after the first WHERE keyword
is extracted and normalized using the query transformation scheme. If its MD5
hash is found in the reference hash table, it is confirmed as an injection attack
and the query is rejected without any further check. When a match is not found,
its phrase similarity with the documents is computed. If the similarity with any
SQL injection Detection using Document Similarity 9

document is ≥ Rth , the query is determined as an injection attack and rejected.


It is added to the injected structures database and its MD5 hash is added to the
reference hash table. As a new structure is added, clustering is redone and the
set of documents is regenerated by merging structures of each cluster.
If the highest similarity with the documents is below Rth but ≥ Sth , the
query is allowed to execute, but tagged as suspicious and logged for scrutiny
by the DBA. If the DBA marks the query as malicious, then its MD5 hash
is added to the reference hash table, which ensures that next time the same
attack is instantly detected. In addition, the pattern is added to the injected
structures database, clustering process is repeated, and a fresh set of documents
are generated by merging the injection patterns in each cluster. This increases
the probability of detecting similar injection attacks in future. In this manner,
the system learns new injection patterns and becomes more robust over time.

4 Experimental Setup and Evaluation


The experimental setup consisted of a standard desktop computer with Intel®
Core-i3™ 2100 CPU @ 3.10GHz and 2GB RAM, running CentOS 5.3 Server
OS, Apache 2.2.3 web server with PHP 5.3.28 configured as Apache module,
and MySQL 5.5.29 database server. Five web applications; namely, Bookstore
(e-commerce), Forum, Classifieds, NewsPortal, and JobPortal, developed using
PHP and MySQL, were installed on the server computer simulating a shared
hosting environment. Each web application was built with features similar to
real websites, however the code was intentionally written without any input
validation so that they are entirely vulnerable to SQL injection attacks.

4.1 Collection of Injected Queries


SQLiDDS requires an initial collection of injected queries to begin the offline
phase. Due to unavailability of any standard real-world data set, we generated
the same using a honeypot based technique. First, we switched on the General
Query Log4 option of the MySQL database server, which enables logging of every
incoming SQL query. Three out of the five sample web applications, namely
Bookstore, Classifieds, and JobPortal, were subjected to SQL injection attacks
using a number of automated vulnerability scanners and SQL injection tools
downloaded from the Internet, such as: HP Scrawler, WebCruiser Pro, Wapiti,
Skipfish, NetSparker (community edition), SQL Power Injector, BSQL Hacker,
NTO SQL Invader, sqlmap, sqlsus, The Mole, IronWasp, jSQL Injector, etc. An
excellent penetration testing platform based on Debian, known as Kali Linux5 ,
comes bundled with several additional scanners and exploitation tools such as
grabber, bbqsql, nikto, w3af, vega, etc., which were also applied.
Some SQL injection scripts were also downloaded from hacker and com-
munity sites such as Simple SQLi Dumper (Perl), darkMySQL (Python), SQL
4
http://dev.mysql.com/doc/refman/5.5/en/query-log.html
5
Formerly known as BackTrack Linux: http://www.kali.org/
10 D. Kar, S. Panigrahi, S. Sundararajan

Sentinel (Java), etc., and applied on the three websites. Each tool was applied
multiple times with different settings to ensure that all possible types of SQL
injection attack patterns are captured. We also applied injection attacks on the
three websites using several tips and tricks collected from tutorials, forums, and
black hat sites. During the entire process, all SQL queries generated by the web
applications were recorded by MySQL server in the general query log.
Over 4.52 million lines were written into the log files from which 3.35 million
SQL queries were extracted. After removing duplicates, INSERT queries, and
queries without a WHERE clause, total 245,356 nos. of unique SELECT, UPDATE,
and DELETE queries were obtained, containing genuine as well as injected queries.
Each query was manually examined and the injected queries were separated. Out
of the 49,273 injected queries identified, 16,702 unique queries were obtained,
containing a natural mix of all types of SQL injection attacks. The portion after
the first WHERE keyword of these queries were extracted and normalized using the
transformation scheme producing 3,896 distinct injection attack patterns. The
patterns were stored in the injected structures database and their MD5 hashes
were stored in the reference hash table.

4.2 Clustering of the Injection Patterns


Various document clustering algorithms used in information retrieval have been
described in [20, Ch.16-17]. Since the number of clusters cannot be guessed
apriori, algorithms like k-means are not useful for our approach. Hierarchical
clustering algorithm, on the other hand, does not require the number of clusters
to be prespecified. Though it has at least Θ(n2 ) complexity, it is not a bottleneck
because: (1) the data set is not very large, (2) clustering is done once in the offline
phase, and (3) re-clustering is required only when a new injection pattern is
added to the database. We used Hierarchical Agglomerative Clustering (HAC)
using the average-link method. The cutoff similarity was determined as 0.664
which yielded 106 clusters including 19 singletons. The injection patterns in each
cluster were merged and saved as documents. The phrases of each document were
pre-extracted for efficiency and stored in comma separated format.

4.3 Run-time Results


The five sample web applications were used for run-time evaluation without any
changes to their input handling. The thresholds Rth and Sth were empirically
determined and set as 0.83 and 0.68 respectively. With SQLiDDS activated to
intercept incoming queries, the web applications were attacked by using the
automated tools and manual techniques mentioned in Sect. 4.1. The results of
the first run (i.e., without DBA involvement) are shown in Table 3, where the
numbers are unique instances of queries extracted from SQLiDDS logs.
The experimental results confirm that SQLiDDS exhibits good accuracy of
detection for all test applications. The precision for all test applications is above
99% while the recall varies from 83.07% for NewsPortal to 94.66% for Classi-
fieds. The false positive rate (FPR) is below 2% except for the Forum application.
SQL injection Detection using Document Similarity 11

Table 3. Results of SQL injection Detection by SQLiDDS

Web # Queries
Application Intercepted TP FN FP TN TPR FPR Accuracy
Bookstore 4477 3862 52 6 557 98.67% 1.07% 98.70%
Forum 3580 2986 97 11 486 96.85% 2.21% 96.98%
Classifieds 1987 1587 21 7 372 98.69% 1.85% 98.59%
NewsPortal 1544 1223 54 2 265 95.77% 0.75% 96.37%
JobPortal 3114 2644 33 4 433 98.77% 0.92% 98.81%

Out of the false positives and negatives, total 162 queries (56.4%) were logged as
suspicious queries for the DBA to examine. The overall accuracy of the system
comes out as 98.05%. Interestingly, the accuracy for the Forum and NewsPortal
are 96.85% and 95.77% respectively in the first run, even if they were not used
as honeypots for collecting injection patterns (see Sect. 4.1). This is very encour-
aging and proves that injected queries collected from few honeypot applications
can be used to protect multiple web applications, because input parameters are
used in the WHERE clause of queries in similar manner across web applications.

4.4 Performance Overhead

Processing overhead by SQLiDDS consists of four components: (1) extracting the


WHERE clause, (2) query transformation, (3) lookup of the reference hash table,
and (4) similarity matching with clustered documents. Out of these, (1) and (3)
consume negligible time and are ignored. Average time for query transformation
and computing phrase based similarity were measured as 0.612 ms and 0.413
ms respectively. Considering that similarity with half of the documents needs
to be computed on average, total processing overhead comes to 22.501 ms per
incoming query. Assuming that a standard web page issues 10 queries on average,
the total delay is ' 225 ms per page. Since the page load times are generally in
the order of several seconds over the Internet, a delay of 225 ms is insignificant
and would not affect the end-users’ browsing experience.
To assess impact on performance in practical usage scenario, load testing
on the sample web applications was conducted using Pylot6 with 10 to 100
concurrent user-agents, configured with request interval of 5 ms and 5 seconds
ramp-up time, covering all public web pages as test cases. Each test was run for
30 minutes with and without SQLiDDS intercepting queries, and the difference
in average response time of the web applications was measured. Figure 2 shows
the delay introduced by SQLiDDS for each application. Clearly, the delay is
proportional to the average number of queries issued per page and increases with
the number of concurrent users. Overall, the delay introduced by SQLiDDS was
found to account for 4.5–5.8% of the average response time of the web server,
which is almost imperceptible in online environment over the Internet.
6
Open source website performance testing tool: http://www.pylot.org/
12 D. Kar, S. Panigrahi, S. Sundararajan

800
Bookstore
700 Classifieds
Forum
Delay in Milliseconds
600 Job Portal
News Portal
500

400

300

200

100
0 20 40 60 80 100
Number of Concurrent Users

Fig. 2. Performance Overhead of SQLiDDS

5 Related Work

To the best of our knowledge, document similarity measures have not been so far
proposed specifically for detection of injection attacks at the SQL query level in
the literature. However, a few studies have used similarity measures on HTTP
requests or payloads for classification of web attacks including SQL injection,
which may be considered as related to our work to some extent.
Small et al. [23] used similarity measure and string alignment techniques for
detecting malicious HTTP payloads. At the same time, Gallagher et al. [24] also
used term frequency based similarity measure for classification of HTTP requests
to identify and classify attacks. Ulmer et al. [25] used similarity on HTTP data
for a configurable hardware classifier for web attacks. Later in [26], they proposed
parallel acceleration for the document-similarity classifier on multi-core proces-
sor architecture, focusing mainly on hardware implementation. Choi et al. [27]
used N-gram splitting and classification using Support Vector Machine (SVM)
to detect SQL injection and XSS attacks. However, their approach to extract
N-grams ignores the symbols and operators completely, which are important
syntactic elements in a query. Application of document similarity measures in
conjunction with query transformation for detecting SQL injection attacks is
proposed for the first time in this paper to the best of our belief.

6 Conclusions and Future Work

This paper presented a novel approach to detect all types of SQL injection
attacks by applying document similarity measure on normalized queries, imple-
mented in a tool named SQLiDDS. We adopted the strategy to examine only
the WHERE clause part of run-time queries and ignore INSERT queries, which was
based on two interesting observations made during the course of research. The
SQL injection Detection using Document Similarity 13

experimental results obtained on five fully vulnerable web applications developed


with PHP and MySQL, are very encouraging and confirm effectiveness of our
approach. The system acts as a database firewall and is able to protect multiple
web applications hosted on a shared server, which is an advantage over existing
methods. Though SQLiDDS requires an initial set of injected queries, these can
be collected from few honeypot applications and reused for further deployments.
The system can evolve with inputs from DBA and become more robust over
time. Performance overhead of the system is almost imperceptible over Internet.
The approach can also be ported to other web application development platforms
without requiring major modifications.
Further research is required on four main aspects: (1) improving the query
transformation scheme to harden against bypassing attempts, (2) evaluating
other document similarity measures available in the literature and determine the
best measure for accuracy and performance, (3) developing a weighting scheme
for frequently occurring phrases in injected queries for lowering the false negative
and false positive rates, and (4) using an efficient text clustering algorithm that
produces less number of high quality clusters.

References
1. OWASP: Top 10 Security Threats 2013. https://www.owasp.org/index.php/
Top_10_2013-A1-Injection (2013) Accessed: 2013-11-15.
2. TrustWave: Executive Summary: Trustwave 2012 Global Security Report. https:
//www.trustwave.com/global-security-report (2012) Accessed: 2013-06-24.
3. Maciejak, D., Lovet, G.: Botnet-Powered Sql Injection Attacks: A Deeper Look
Within. In: Virus Bulletin Conference. (09 2009) 286–288
4. Curtis, S.: Barclays: 97 percent of data breaches still due to SQL injec-
tion. http://news.techworld.com/security/3331283/barclays-97-percent-
of-data-breaches-still-due-to-sql-injection/ (01 2012)
5. Livshits, B., Erlingsson, Ú.: Using Web Application Construction Frameworks to
Protect Against Code Injection Attacks. In: Proceedings of the 2007 Workshop on
Programming Languages and Analysis for Security, ACM (2007) 95–104
6. Boyd, S., Keromytis, A.: SQLrand: Preventing SQL injection attacks. In: Applied
Cryptography and Network Security, Springer (2004) 292–302
7. Johns, M., Beyerlein, C., Giesecke, R., Posegga, J.: Secure Code Generation for
Web Applications. Engineering Secure Software and Systems (2010) 96–113
8. Benedikt, M., Freire, J., Godefroid, P.: VeriWeb: Automatically Testing Dynamic
Web Sites. In: In Proceedings of 11th International World Wide Web Conference
(WWW’2002), Citeseer (2002)
9. Kals, S., Kirda, E., Kruegel, C., Jovanovic, N.: Secubat: A Web Vulnerability
Scanner. In: Proceedings of the 15th International Conference on World Wide
Web, ACM (2006) 247–256
10. Wassermann, G., Yu, D., Chander, A., Dhurjati, D., Inamura, H., Su, Z.: Dy-
namic Test Input Generation for Web Applications. In: Proceedings of the 2008
International Symposium on Software Testing and Analysis, ACM (2008) 249–260
11. Halfond, W., Orso, A.: AMNESIA: Analysis and Monitoring for NEutralizing SQL-
Injection Attacks. In: Proceedings of the 20th IEEE/ACM International Confer-
ence on Automated Software Engineering, ACM (2005) 174–183
14 D. Kar, S. Panigrahi, S. Sundararajan

12. Buehrer, G., Weide, B., Sivilotti, P.: Using Parse Tree Validation to Prevent SQL
Injection Attacks. In: Proceedings of the 5th International Workshop on Software
Engineering and Middleware, ACM (2005) 106–113
13. Bisht, P., Madhusudan, P., Venkatakrishnan, V.: CANDID: Dynamic Candidate
Evaluations for Automatic Prevention of SQL Injection Attacks. ACM Transac-
tions on Information and System Security (TISSEC) 13(2) (2010) 14
14. Lee, I., Jeong, S., Yeo, S., Moon, J.: A Novel Method for SQL Injection Attack
Detection based on Removing SQL Query Attribute Values. Mathematical and
Computer Modelling 55 (2011) 58–68
15. Wang, Y., Li, Z.: SQL Injection Detection via Program Tracing and Machine
Learning. In: Internet and Distributed Computing Systems. Springer (2012) 264–
274
16. Halfond, W., Viegas, J., Orso, A.: A Classification of SQL-injection Attacks and
Countermeasures. In: International Symposium on Secure Software Engineering
(ISSSE). (2006) 12–23
17. Maor, O., Shulman, A.: SQL Injection Signatures Evasion (White paper). Im-
perva, Inc. (04 2004) Online: http://www.issa-sac.org/info_resources/ISSA_
20050519_iMperva_SQLInjection.pdf.
18. Dahse, J.: Exploiting hard filtered SQL Injections. http://websec.wordpress.
com/2010/03/19/exploiting-hard-filtered-sql-injections/ (03 2010)
19. Kar, D., Panigrahi, S.: Prevention of SQL Injection Attack Using Query Trans-
formation and Hashing. In: Proceedings of the 3rd IEEE International Advance
Computing Conference (IACC), IEEE (2013) 1317–1323
20. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval.
Volume 1. Cambridge University Press, Cambridge (2008) Online: http://nlp.
stanford.edu/IR-book/pdf/irbookonlinereading.pdf.
21. Garcı́a Adeva, J.J., Pikatza Atxa, J.M.: Intrusion Detection in Web Applications
using Text Mining. Engg. Appl. of Artificial Intelligence 20(4) (2007) 555–566
22. Liao, Y., Vemuri, V.R.: Using Text Categorization Techniques for Intrusion De-
tection. In: USENIX Security Symposium. Volume 12. (2002) 51–59
23. Small, S., Mason, J., Monrose, F., Provos, N., Stubblefield, A.: To Catch a Preda-
tor: A Natural Language Approach for Eliciting Malicious Payloads. In: USENIX
Security Symposium. (2008) 171–184
24. Gallagher, B., Eliassi-Rad, T.: Classification of HTTP Attacks: A Study on the
ECML/PKDD 2007 Discovery Challenge. In: Center for Advanced Signal and
Image Sciences (CASIS) Workshop. (2008)
25. Ulmer, C., Gokhale, M.: A Configurable-Hardware Document-Similarity Classifier
to Detect Web Attacks. In: Parallel & Distributed Processing, Workshops and Phd
Forum (IPDPSW), 2010 IEEE International Symposium on, IEEE (04 2010) 1–8
26. Ulmer, C., Gokhale, M., Gallagher, B., Top, P., Eliassi-Rad, T.: Massively parallel
acceleration of a document-similarity classifier to detect web attacks. Journal of
Parallel and Distributed Computing 71(2) (2011) 225–235
27. Choi, J., Kim, H., Choi, C., Kim, P.: Efficient Malicious Code Detection Using
N-Gram Analysis and SVM. In: Network-Based Information Systems (NBiS), 2011
14th International Conference on, IEEE (2011) 618–621

View publication stats

You might also like