

MACRo 2015- 5th International Conference on Recent Achievements in
Mechatronics, Automation, Computer Science and Robotics

DATALEAK: Data Leakage Detection System


Adrienn SKROP
Department of Computer Science and Systems Technology, Faculty of Information Technology,
University of Pannonia, Hungary, e-mail: skrop@dcs.uni-pannon.hu

Manuscript received January 23, 2015, revised February 9, 2015.

Abstract: Data leakage is an uncontrolled or unauthorized transmission of classified
information to the outside. It poses a serious problem to companies, as the cost of
incidents continues to increase. Many software solutions have been developed to provide
data protection. However, data leakage detection systems cannot provide absolute
protection. Thus, it is essential to discover data leakage as soon as possible. The purpose
of this research is to design and implement a data leakage detection system based on
special information retrieval models and methods. In this paper a semantic information-
retrieval based approach and the implemented DATALEAK application are presented.

Keywords: Data leakage, mathematical model, vector space, semantic similarity.

1. Introduction
Data leakage can be defined as an event in which classified information, e.g.
sensitive, protected or confidential data, has been viewed, stolen or used by
somebody who is not authorized to do so. Data leakage causes serious and
expensive problems to companies and organizations, and the number of
incidents continues to rise.
Data leak prevention helps ensure that confidential data such as customer
information, personal employee information, trade secrets, financial data and
research and development data remains safe and secure. Data leak prevention
solutions protect confidential data by securing the data itself [1]. Once the most
critical data and its location are identified on the network, it can be monitored to
determine who is accessing and using it, and where it is being sent, copied, or
transmitted. A number of methods and systems have been developed to prevent
data leakage, e.g. [11][4][5]. However, data leakage detection systems cannot
provide absolute protection. The problems are as follows. On the one hand,


according to a Symantec study, more than 40 per cent of data breaches are
due to insider negligence¹. On the other hand, as data continuously changes,
discovery and classification across large organizations add complexity.
Nowadays 80 to 90 per cent of an organization's data is unstructured².
Unstructured data can be any data stored in different files, e.g. text documents,
emails, blogs, videos, Excel worksheets, instant messages, webpages,
presentations etc. Unstructured data is everywhere and its volume is exploding.
Unlike application-oriented data, which is usually well structured and has means
of protection, unstructured data is loose, out of control and hard to protect.

¹ Source: https://www4.symantec.com/Vrt/offer?a_id=108920
² Source: http://breakthroughanalysis.com/?s=unstructured+data
In this paper DATALEAK, a semantic information-retrieval (IR) based
application, is presented to address the problem of Web data leakage detection.
The paper is organized as follows. In Section 2 the architecture of the data
leakage detection system is presented. Section 3 presents the methods that are
used in the system. Section 4 presents the DATALEAK application under
implementation. Section 5 concludes the paper.

2. System Description
The goal of the DATALEAK system is to monitor the Web and collect
information about Web documents according to users’ preferences. The
collected Web data sources are compared with the user's confidential documents.
If a document turns up on the Web that is semantically similar to confidential
user documents, the system indicates potential data leakage. Figure 1 shows the
abstract model of the system.
Figure 1: Data leakage detection system (confidential documents and Web data
sources are fed into the system, which reports either data leakage or no problem).


Usually, the similarity of documents is determined using a repetition-based
hard similarity metric. This approach ignores all potential semantic correlations
between different words. In our system not the pure content but the meaning of
Web documents and user documents is compared. The system consists of a
number of modules [2]. In this section, these modules are introduced briefly.
The modules are represented in Figure 2.

Figure 2: Modules of the system (Search module with Web query and links,
Document collection, Cryptographic module, Text mining module, Scoring module).


The document collection contains sensitive, protected or confidential
information. In order to protect these documents, encryption is required. The
Cryptographic module is responsible for preparing an encrypted version of the
documents. Because of the confidential content of the documents, the
Cryptographic module resides on the client's server. All the other services
reside in a cloud.
The Search module is responsible for discovering Web pages that might
indicate data leakage. The Search module incorporates a Crawler module that
investigates the structure of Web sites, determines those pages of Web sites that
contain relevant data, and indexes these pages using keywords. The Text mining
module converts Web documents into their appropriate mathematical
representation. The Scoring module matches the mathematical representations
of Web documents and confidential user documents.

3. Methods
In this section, the methods used in the Text mining, Cryptographic and
Scoring modules are presented. Given a search query, the retrieved Web
documents and confidential user documents, the scoring module computes a
relevance score that measures the similarity between these documents. The
scoring module uses different mathematical representations of documents.
Different mathematical methods are implemented to satisfy different
information needs or user preferences.

A. Vector Space Model


In IR the vector space model (VSM) is widely used as a mathematical model
of retrieval systems. In the VSM documents are represented by vectors in an
n-dimensional vector space, where n is the number of keywords or index
terms [6].
In the data leakage detection system the VSM is proposed as the basis of the
Cryptographic and Text mining modules. As a result, the Scoring module can
match the vector space based mathematical representations of Web documents
and confidential user documents. The VSM representation of documents is as
follows.
On the one hand, a set of keywords C = {c_1, ..., c_i} is extracted from
confidential documents during indexing. On the other hand, another set of
keywords W = {w_1, ..., w_j} is extracted from Web documents during indexing.
Web documents and confidential user documents are represented using the
VSM over C ∪ W as follows.
Given a finite set T = C ∪ W of index terms, T = {t_1, ..., t_n}, any Web
document D_j is assigned a vector v_j of finite real numbers:

$$v_j = (w_{ij})_{i=1,\dots,n} = (w_{1j}, \dots, w_{ij}, \dots, w_{nj}) \qquad (1)$$

Confidential user documents U_k also have to be represented as vectors v_k of
finite real numbers:

$$v_k = (w_{ik})_{i=1,\dots,n} = (w_{1k}, \dots, w_{ik}, \dots, w_{nk}) \qquad (2)$$
The weight w_ij is interpreted as the extent to which the index term t_i
characterizes a document. Web document vectors and user document vectors
are compared according to some similarity measure. The disadvantage of this
representation is that, owing to the potential diversity of Web documents, the
dimensionality of the vector space can be fairly high, making the computation
of the similarity measure costly.
A Web document D_j is reported to a user having confidential document U_k
if they are similar enough, i.e., if a similarity measure S_jk between the Web
document vector v_j and the confidential user document vector v_k is over
some threshold K:

$$S_{jk} = s(v_j, v_k) > K \qquad (3)$$

In the classical VSM different weighting schemes and similarity measures
can be used.
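As a minimal illustration of the representation in Eqs. (1)-(2), the sketch below (an assumption, not code from the paper; the function and variable names are hypothetical) builds document vectors over the shared index term set T = C ∪ W:

```python
def to_vector(term_freqs, vocabulary):
    """Represent a document as a weight vector over the index term set
    T = C ∪ W, following Eqs. (1)-(2); `term_freqs` maps a document's
    keywords to their weights, and absent terms get weight 0."""
    return [float(term_freqs.get(t, 0.0)) for t in vocabulary]

# The shared vocabulary is the union of the confidential and Web keyword sets:
# vocabulary = sorted(set(confidential_terms) | set(web_terms))
```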

Index term weights express how important a term or keyword is for
describing the content of a document. The most commonly used weighting
schemes are as follows. The simplest weighting scheme is the binary one. It
indicates whether or not a keyword is present in a document. Consequently,
these weights are generally not good enough, because they do not differentiate
between frequent and infrequent keywords. The frequency of occurrence of a
keyword in the document is the number of times that this keyword appears in
the text of the document. It is called the term-frequency weight. The logarithm-
based weighting scheme is used to dampen the within-document frequency. It is
applied because terms appearing frequently in a document are not necessarily
more important than special terms that appear only once in that document.
Document sizes might vary widely. To compensate for this undesirable effect,
length normalization is used, i.e., the rank of the document is divided by its
length.
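The sketch below illustrates the four weighting schemes just described. It is a minimal illustration rather than the paper's implementation; the function name and the exact logarithmic dampening formula are assumptions.

```python
import math
from collections import Counter

def term_weights(tokens, scheme="tf"):
    """Weight the index terms of one document; `tokens` is its list of
    stemmed keywords."""
    tf = Counter(tokens)
    length = len(tokens) or 1
    if scheme == "binary":   # 1 if the keyword is present, 0 otherwise
        return {t: 1.0 for t in tf}
    if scheme == "tf":       # raw within-document frequency
        return {t: float(f) for t, f in tf.items()}
    if scheme == "log":      # dampened within-document frequency
        return {t: 1.0 + math.log(f) for t, f in tf.items()}
    if scheme == "norm":     # length-normalized frequency
        return {t: f / length for t, f in tf.items()}
    raise ValueError(f"unknown weighting scheme: {scheme}")
```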
Given the weighted vector space representation of documents, they can be
compared by computing the distance between the points representing the
documents. In other words, a similarity measure is used so that the documents
with the highest scores are the most similar to each other. The three typical
normalized similarity measures are defined as follows [3].
• Cosine similarity:

$$S_{jk} = \frac{v_j \cdot v_k}{\|v_j\|\,\|v_k\|} \qquad (4)$$

• Jaccard similarity:

$$S_{jk} = \frac{v_j \cdot v_k}{\sum_{i=1}^{n}\left(w_{ij} + w_{ik} - w_{ij}\,w_{ik}\right)} \qquad (5)$$

• Dice similarity:

$$S_{jk} = \frac{2\,(v_j \cdot v_k)}{\sum_{i=1}^{n}\left(w_{ij} + w_{ik}\right)} \qquad (6)$$
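A minimal sketch of measures (4)-(6) and the decision rule (3), over dense vectors as produced by the earlier `to_vector` sketch, might look as follows. The function names are hypothetical and the threshold value is an arbitrary placeholder.

```python
import math

def dot(vj, vk):
    """Inner product v_j · v_k of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(vj, vk))

def cosine(vj, vk):          # Eq. (4)
    norm = math.sqrt(dot(vj, vj)) * math.sqrt(dot(vk, vk))
    return dot(vj, vk) / norm if norm else 0.0

def jaccard(vj, vk):         # Eq. (5)
    denom = sum(vj) + sum(vk) - dot(vj, vk)
    return dot(vj, vk) / denom if denom else 0.0

def dice(vj, vk):            # Eq. (6)
    denom = sum(vj) + sum(vk)
    return 2.0 * dot(vj, vk) / denom if denom else 0.0

def indicates_leak(vj, vk, K=0.8, sim=cosine):
    """Eq. (3): flag potential data leakage when the similarity of a Web
    document and a confidential document exceeds the threshold K."""
    return sim(vj, vk) > K
```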
Usually, the categoricity of the system can be varied by changing both the
weighting scheme and the similarity measure.

B. Interaction Information Retrieval


Besides the VSM, other techniques are also considered for determining the
similarity of documents. The interaction information-retrieval (I2R) model
[7][8] has also been selected for implementation. In the data leakage detection
system the I2R model can be implemented as follows.

Any Web document is represented by an object. Each object o_i, i = 1, 2, ...,
M, is associated with a vector of predefined index terms t_i = (t_ik), k = 1, 2,
..., n_i. Between any pair (o_i, o_j), i ≠ j, of objects, links are established. The
links are both weighted and directed, with two links per direction. One directed
link carries the relative frequency w_ijp of an index term [7][8]. The relative
frequency is defined as:

$$w_{ijp} = \frac{r_{ijp}}{n_i}, \qquad p = 1, \dots, n_j \qquad (7)$$

where r_ijp denotes the relevance of index term t_jp in Web document object
o_i, and n_i is the number of index terms in Web document object o_i.

Figure 3: Connections between object pairs (objects o_i and o_j with index terms
t_ik and t_jp, joined by the directed links w_ikj, w_ijp, w_jpi and w_jik).


The other directed link carries the inverse document frequency w_ikj [7][8]. It
is defined as:

$$w_{ikj} = r_{ikj}\,\log\frac{2M}{df_{ik}} \qquad (8)$$

where r_ikj denotes the relevance of index term t_ik in o_j, df_ik is the number
of Web documents that contain the keyword t_ik, and M is the number of Web
document objects.
As Figure 3 shows, the other direction also has two links with the same
meaning. The investigated Web documents are represented as a complete graph
of objects. The confidential user document is represented by an object, too. It is
linked to the other Web documents that are in the complete graph. Linking this
new object to the graph changes the weights of the already existing connections.
The weight of the link pair between any pair (o_i, o_j), i ≠ j, of objects, and
thus between the confidential document and a Web document object o_i, is
defined as the sum of the corresponding weights [7][8]:

$$K_{ij} = \sum_{p=1}^{n_j} w_{jpi} + \sum_{k=1}^{n_i} w_{jik} \qquad (9)$$

This complete graph of objects can be regarded as an artificial neural
network. In this network the activation of neurons obeys the winner-takes-all
strategy. The activation starts at the confidential user document and spreads
over to the most active neuron. After a number of steps, the spreading of
activation reaches an object that was already affected, closing a cycle. The Web
documents that belong to this cycle are similar to the confidential user
document. These documents may indicate data leakage. The advantages of the
interaction model are as follows [7][9]. On the one hand, this method avoids
costly computation: both the computation of the weights and the spreading
process take polynomial time. On the other hand, the interaction retrieval
method allows for a relatively high precision of 50%-70%. Evaluation based on
standard test collections showed that this method is useful when high precision
is favoured at low to middle recall values [7].

4. Application
The DATALEAK application is made up of different program modules. The
following subsections present the modules under development.

A. Search Module
The Search module is responsible for discovering Web pages that might
indicate data leakage. It is implemented as a conventional keyword-based
metasearch engine. Metasearch engines submit the search query to several
other search engines and combine the results into a single list. This strategy
improves the effectiveness of the results and also gives the search a broader
scope [10]. The implemented metasearch engine uses the hit lists of Google
and Bing. The title page of the search engine is shown in Figure 4. On the
title page the three main functions of the metasearch engine are available:
search configuration, results management and search terms management.

Figure 4: Metasearch engine title page.
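How the engine combines the two hit lists is not detailed in the paper; a minimal merging sketch could look as follows, with hypothetical names and the per-engine fetching left abstract.

```python
def merge_hit_lists(*hit_lists):
    """Combine ranked hit lists from several search engines into a single
    deduplicated list, keeping the best rank each URL achieved."""
    best = {}
    for hits in hit_lists:
        for rank, url in enumerate(hits):
            best[url] = min(rank, best.get(url, rank))
    return sorted(best, key=best.get)

# e.g. merge_hit_lists(google_hits, bing_hits), where each argument is a
# ranked list of result URLs returned by one engine
```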



The first function is Search configuration, where search terms can be defined
and search options can be set. Query terms can be imported from an XML
file or entered freely from the keyboard in the Search term textbox. During
search configuration, results can be limited according to when they were
published on the Web: either predefined intervals or a custom range can be set.
Scheduling specifies how often the search is repeated. The metasearch engine
can return the hit list of Bing or Google. The returned hit list can be viewed in
the Results menu. A fraction of the Results menu is shown in Figure 5.

Figure 5: Results management screen of metasearch engine.


In this example the search term data leakage detection was queried using the
Google search engine. Before text mining, results can be checked by the user.
Only accepted hits are sent to the Text mining module.
Search terms can be managed in the Manage menu. This menu is shown in
Figure 6.

Figure 6: Search term management screen.



Search terms can be edited, exported or deleted. The previously specified
search scheduling can be modified. It is not necessary to check the metasearch
engine for new results: notifications can be received via e-mail alerts.
Figure 7 shows a fraction of a notification e-mail.

Figure 7: New results e-mail notification of metasearch engine.

B. Text Mining Module


The Text mining module is responsible for the automated processing of Web
pages that were identified by the Search module. During automated processing,
relevant keywords or index terms are extracted from Web documents. Using the
extracted keywords, Web documents can be represented in a vector space. The
input of the Text mining module is the hit list received from the Search
module. Each result from the hit list is processed in the following order.

Figure 8: Sample Web page.



The Web page is downloaded to a temporary storage database. A sample
Web page selected from the hit list is presented in Figure 8. The first step of
text processing is parsing. Parsing is used to remove surplus markup and
extract the pure text from the Web page. The result of parsing is represented
in Figure 9.

Figure 9: Text extracted from a sample Web page.


Having the pure text content of a Web page, the next step is lexical analysis,
or tokenizing. Tokenizing is an important step. Tokens are usually the same as
words. However, special cases also have to be considered, like hyphenated
forms, capitalized words, apostrophes etc. After tokenizing, stopwords are
removed from the set of words. Stopwords are those function words that have
little meaning, e.g. the, a, that, over etc. Removing these words generally
increases the retrieval effectiveness of the system. The final step is stemming. It
captures the relationship between different variations of a word. Stemming
reduces distinct words to their common grammatical root. A fraction of the text
processing output of the Text mining module is represented in Figure 10.

Figure 10: Output of the Text mining module: keywords and their frequency.

Keywords are extracted from the Web document and the term-frequency of
the keywords is determined. The output of the Text mining module can be used
to create a vector space representation of any Web document.
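A minimal sketch of this parse-tokenize-stopword-stem pipeline, assuming the BeautifulSoup and NLTK libraries (the paper does not name its tools), might be:

```python
from collections import Counter
from bs4 import BeautifulSoup              # parsing: strip markup
from nltk import word_tokenize             # lexical analysis
from nltk.corpus import stopwords          # function words to drop
from nltk.stem import PorterStemmer        # reduce words to their stems

def extract_keywords(html):
    """Run the Text mining pipeline on one Web page and return its
    keywords with their term frequencies."""
    text = BeautifulSoup(html, "html.parser").get_text()
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    keywords = [stemmer.stem(t) for t in tokens if t not in stop]
    return Counter(keywords)               # keyword -> frequency
```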

C. Cryptographic Module
The Cryptographic module is responsible for preparing an encrypted version
of the documents. It works similarly to the Text mining module. The input can
be any user document. The output is a set of keywords and their frequencies.
The output of the Cryptographic module can be used to create a mathematical
representation of any user document.
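The paper does not specify the encryption scheme. One plausible approach, sketched here purely as an assumption, is to replace each keyword with a keyed hash, so that the cloud-hosted Scoring module can match terms without ever seeing their plaintext:

```python
import hashlib
import hmac

def encrypt_keywords(term_freqs, key):
    """Map each keyword of a confidential document to its HMAC-SHA256
    digest, keeping the frequencies; `key` is a client-held secret
    (bytes). The HMAC construction is an assumption, not the paper's
    specified scheme."""
    return {
        hmac.new(key, term.encode("utf-8"), hashlib.sha256).hexdigest(): freq
        for term, freq in term_freqs.items()
    }
```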

D. Scoring Module
The Scoring module matches the mathematical representations of Web
documents and confidential user documents. Web documents and confidential
user document vectors can be compared using either the VSM or the I2R model.
Figure 11 shows the cosine VSM similarity of a test user document and test
Web documents. If Web documents are similar enough to confidential user
documents, data leakage can be assumed.

Figure 11: Scoring module example.



5. Conclusion
In this paper we presented an information-retrieval based approach to
address the problem of data leakage detection. The DATALEAK system
architecture is composed of Web searching, text mining and scoring modules.
The idea of the system is to monitor the Web and collect information about
Web documents according to users' preferences. If a document turns up on the
Web that is semantically similar to confidential user documents, the system
indicates potential data leakage. The system is under implementation. After
implementing the system, experiments have to be carried out to evaluate its
effectiveness and precision.

Acknowledgements
This research has been supported by the European Union and Hungary and
co-financed by the European Social Fund through the project TÁMOP-4.2.2.C-
11/1/KONV-2012-0004 - National Research Center for Development and
Market Introduction of Advanced Information and Communication
Technologies.

References
[1] A. Shabtai, Y. Elovici, and L. Rokach, A Survey of Data Leakage Detection and Prevention
Solutions. Springer, 2012, ISBN: 978-1-4614-2052-1.
[2] A. Skrop, "Data leakage detection using information retrieval methods," in IMMM 2014,
The Fourth International Conference on Advances in Information Mining and Management,
2014, pp. 74-78.
[3] C. T. Meadow, Text Information Retrieval Systems. Academic Press, 2000, ISBN:
0124874053.
[4] E. Gessiou, Q. H. Vu, and S. Ioannidis, "IRILD: an information retrieval based method for
information leak detection," in Proceedings of the European Conference on Computer
Network Defense, IEEE, 2011, pp. 33-40.
[5] P. Papadimitriou and H. Garcia-Molina, "Data leakage detection," IEEE Transactions on
Knowledge and Data Engineering, vol. 23, no. 1, 2011, pp. 51-63.
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and
Technology behind Search, 2nd ed. ACM Press Books, Addison-Wesley Professional, 2011,
ISBN: 0321416910.
[7] S. Dominich, "Connectionist interaction information retrieval," Information Processing &
Management, vol. 39, no. 2, 2003, pp. 167-193, doi: 10.1016/S0306-4573(02)00046-8.
[8] S. Dominich, "Interaction information retrieval," Journal of Documentation, vol. 50, no. 3,
1994, pp. 197-212, doi: 10.1108/eb026930.
[9] S. Dominich, A. Skrop, and Zs. Tuza, "Formal theory of connectionist Web retrieval," in
Soft Computing in Web Information Retrieval, Studies in Fuzziness and Soft Computing,
vol. 197, 2006, pp. 163-194.
[10] W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in
Practice. Addison-Wesley, 2010.
[11] Y. Liu, C. Corbett, K. Chiang, R. Archibald, B. Mukherjee, and D. Ghosal, "SIDD: A
framework for detecting sensitive data exfiltration by an insider attack," in Proceedings of
the Hawaii International Conference on System Sciences (HICSS'09), IEEE, 2009, pp. 1-10.
