This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2851228, IEEE Access
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
ABSTRACT Data leakage is a growing insider threat in information security among organizations and individuals. A series of methods have been developed to address the problem of data leakage prevention (DLP). However, large amounts of unstructured data need to be tested in the Big Data era. As the volume of data grows dramatically and the forms of data become more complex, it is a new challenge for DLP to deal with large amounts of transformed data. We propose an Adaptive weighted Graph Walk model (AGW) to solve this problem by mapping it to the dimension of weighted graphs. Our approach solves this problem in three steps. First, adaptive weighted graphs are built to quantify the sensitivity of the tested data based on its context. Then, improved label propagation is used to enhance the scalability for fresh data. Finally, a low-complexity score walk algorithm is proposed to determine the ultimate sensitivity. Experimental results show that the proposed method can detect leaks of transformed or fresh data quickly and efficiently.

INDEX TERMS data leaks, weighted graphs, data transformation, label propagation, score walk.
DATA leakage (or data loss) is a term used in the information security field to describe unwanted disclosures of information [1]. Unlike traditional security threats from the outside, data leakage is mainly caused by insiders who may leak data to unauthorized outside entities. Thus, traditional security measures (i.e., firewalls, intrusion detection systems, anti-virus) are no longer valid due to their lack of understanding of data semantics. However, data leakage incidents happen from time to time, continuously bringing serious damage. In November 2010, WikiLeaks leaked over 251,287 U.S. diplomatic cables [2], revealing the largest number of classified documents at that time. In June 2013, Edward Snowden leaked senior confidential documents of the US National Security Agency (NSA), which exposed the US spy program [3]. In October 2017, Yahoo reported that 3 billion accounts had been breached [4], covering every Yahoo account at the time. The security of data has attracted wide attention in many fields, for example, the recently developed Internet of Things [5]. According to the Kaspersky Lab IT Security Risks Report 2016 [6], data protection is the main area of concern, with 80% of businesses saying it is their major concern. 43% of companies have experienced data loss.

To address the problem of data leakage detection (DLD), plenty of research work has been done with the use of hash fingerprinting, n-grams [7], statistical methods [8], [9] and so on. With the rapid development of the Internet, many new communication technologies (e.g., Device-to-Device technology [10]) have emerged. As a result, the volume of data grows dramatically and the forms of data become more complex. This brings new challenges to DLD. Therefore, a new DLD method is required with better tolerance of data transformation and higher efficiency to deal with large amounts of unstructured data in long patterns. Consider a common example scenario: an employee of an organization tries to leak a sensitive file to unauthorized outsiders, and he may deliberately apply some transformation to the file (e.g., add some non-sensitive content to the original file, or modify or delete some content) to escape detection. Existing methods mainly rely on the appearance of sensitive keywords and ignore the context, so a sensitive document may be detected as non-sensitive because the file has been transformed by modifying some of its content, resulting in low detection accuracy on transformed data. Some work has been done to detect transformed data [7], [11], [12]; however,
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
most of them only consider simple cases, such as adding or deleting some content from the original file. The data transformation in real application scenarios is much more complex. As data transformation is quite common in real situations, a large amount of sensitive data may be leaked once the detection cannot tolerate transformed data. Hence, how to better tolerate data transformation is of vital importance for DLD.

In this paper, a novel model named the Adaptive weighted Graph Walk model (AGW) is proposed to detect transformed data. In this model, all the documents are represented by graphs. Sensitive context weights, in the form of node weights and edge weights, are defined in the graph to improve the detection accuracy on transformed data. The context weight is able to quantify the sensitivity of the keywords adaptively based on the context around the keywords. The proposed solution aims to detect large amounts of newly generated, extensively transformed data accurately and efficiently. The main contributions are as follows.

• To better tolerate long transformed data, we define an adaptive context weight mechanism to quantify the sensitivity of a keyword based on its context. The complex documents are further represented by weighted context graphs, containing both key terms and contextual information.
• To make up for the limitation of the template and increase the scalability, we also take the data semantics into consideration. An improved label propagation algorithm (LPA) is used to tag the same label on the highly relevant terms of the tested graphs. We also merge the context graphs of each file into a general one, the template graph, to preserve more correlation information between key terms and enhance the overall sensitive context.
• To deal with large amounts of data, we propose a low-complexity algorithm. With a weight reward and penalty mechanism, the algorithm quantifies the sensitivity of the tested documents by one walk on their graphs, which allows the detection to be implemented in real time.

The remainder of this paper is organized as follows. In Section II, related work about data leak detection is summarized. In Section III, the proposed model is presented and the related algorithms are discussed in detail. The performance evaluation is shown in Section IV. Finally, Section V summarizes the paper.

II. RELATED WORK
In this section, related research work about data leakage detection/prevention is summarized. In view of the serious damage of data leakage, the research on data leakage detection is far from sufficient and is relatively new. Researchers have paid more and more attention to data leak detection recently. Existing research can mainly be divided into two categories: content analysis and context analysis methods.

A. CONTENT ANALYSIS
The content analysis methods include rule-based methods, fingerprinting-related methods and statistical methods. Among the rule-based methods, predefined regular expressions are widely used in DLD. They scan the tested text with predefined rules. The sensitivity of the text depends on the extent to which it matches the regular expression. Rule-based methods can detect exact data leakage rapidly, but they will fail to detect new data. The fingerprinting methods calculate the hash value of the tested documents and compare it with the hash value of the trained template. The detection phase will be accurate and fast if the file is the original file. But once the file changes in any way, its fingerprint will be completely different from the original one, making it hard to detect. Some work is devoted to improving the method for robust fingerprinting. Blocking hash is used instead of an overall hash to tolerate text transformation [13].

The statistical methods are mainly based on the statistical features of the text (e.g., terms and their frequencies). The word n-gram based classification method is another type of content analysis based DLD method, which calculates the offset of the grams between the tested file and the template file, sorted by their frequencies [7]. The smaller the offset, the more sensitive the tested file is. But it is incomplete to represent a document only by its high frequency terms. To better choose the key terms to represent a document, term weighting is used to indicate the semantic importance of a word. Term frequency-inverse document frequency (TF-IDF) is the best known method used to detect the data semantics to prevent data leakage [14].

Machine learning classification methods are well explored in the areas of information retrieval, text categorization and spam filtering, such as Naive Bayes [8] and support vector machines (SVM). These methods also have limitations in DLD. An automatic text classification method based on SVM is proposed in [9]. It maps the text into a vector space, and trains the classifier with the features of terms and their frequencies. The documents are then classified as either sensitive or non-sensitive. However, different from text classification tasks, the number of sensitive files in the training set is usually very small while the number of non-sensitive files is large, so the machine learning methods mentioned above will learn the most significant statistical features, ignoring the sensitive context weight of terms, which will lead to a high false alarm rate.

All in all, content analysis methods such as rule-based methods and fingerprinting methods can hardly deal with data transformation, while statistical methods only learn the most significant statistical characteristics, ignoring the sensitive context weight. However, with the development of the Internet and the emergence of cloud service providers (CSPs) [15], a large amount of data in various forms is produced and needs to be monitored. The above methods can hardly cope with such transformed data.
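The rank-offset idea behind the word n-gram method [7] discussed above can be sketched as follows. The function names and the averaging of per-gram offsets are our own assumptions for illustration; the exact aggregation scheme of [7] may differ.

```python
from collections import Counter

def rank_by_frequency(words, n=1):
    """Rank the n-grams of a token list by descending frequency; gram -> rank."""
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    ranked = [g for g, _ in Counter(grams).most_common()]
    return {g: r for r, g in enumerate(ranked)}

def rank_offset(tested_words, template_words, n=1):
    """Average rank offset of the n-grams shared by the two documents.
    A smaller offset means the tested file resembles the template more
    closely (and, against a sensitive template, is more sensitive)."""
    t_rank = rank_by_frequency(tested_words, n)
    s_rank = rank_by_frequency(template_words, n)
    shared = set(t_rank) & set(s_rank)
    if not shared:
        return float("inf")  # no overlap at all
    return sum(abs(t_rank[g] - s_rank[g]) for g in shared) / len(shared)
```

Two identical token streams yield an offset of zero, and documents with no shared grams yield an infinite offset, matching the intuition that the offset measures dissimilarity of frequency profiles.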
B. CONTEXT ANALYSIS
The recently studied context analysis methods attempt to take the sensitive context of terms into consideration. The adaptable n-gram [16] is proposed to classify documents into different detailed sensitive topics according to a simple overall statistical context. The improved n-gram method [11] is also proposed, using subsequence-preserving sampling to preserve some of the context information. Although these methods compensate for some limitations of traditional n-gram methods, they still consider little or very simple context information in data leak detection. Some studies have already tried to use graphs instead of word vectors to represent text [17]–[19] in summarization, information retrieval and DLD tasks, which has yielded good performance. An improved approach called CoBAn [12] reduces false positives by considering the context of key terms in graphs for classification. CoBAn has achieved good performance. It assigns one from a set of clusters to match a document. But the context it defines is too rigid to be satisfied when processing transformed data. In addition, the computational complexity of existing context analysis methods can still be further improved to accommodate big data scenarios.

Recently, some new challenges are faced by DLD with the transformation of data. Research work has been done to handle the detection of transformed data. The word n-gram based classification method proposed in [7] tested the cases of document modification by subtracting and adding words to the original text body, and most cases can be classified correctly. The AlignDLD method proposed in [11] can detect modified leaks caused by WordPress, which substitutes every space with a "+". CoBAn [12] can also detect documents with modified sensitive sections. However, only simple transformation scenarios are considered in most of the papers. The assumed data transformation is either embedding some sensitive content in the original text or replacing some of the original text. How to tolerate data transformation to a greater degree is still a huge challenge in DLD.

In this paper, AGW is proposed to handle extensively transformed data. We use the adaptive context weight mechanism to better preserve the information of key terms and their context, which can tolerate a large degree of data transformation. We also propose a low-complexity algorithm with a weight reward and penalty mechanism to deal with large amounts of data in big data scenarios. Thus, AGW can improve the detection accuracy on transformed data and has a low computational complexity.

III. MODEL
In this section, the problem formulation of data leak detection in this paper is given first. Then we present the overview and details of our solution and its analysis. The proposed AGW consists of two phases: adaptive weighted graphs learning and score walk matching detection. The adaptive weighted graphs are learned from the training set, retaining the key terms and their context characteristics to represent a document comprehensively. The score walk matching detection analyzes the tested documents and matches them to the trained template to determine whether the tested documents are sensitive. The details are described in the following subsections.

A. PROBLEM FORMULATION
Data leakage happens when data is distributed from the inside of an organization to unauthorized outsiders, intentionally or accidentally, which may cause serious damage to the organization. Consider such a common scenario: an employee of an organization tries to leak sensitive files from the local network to unauthorized outsiders. To prevent data leakage, a detection agent is needed to inspect the outgoing traffic content. The detection agent is usually deployed at the outlet of the local network. The overall application scenario is shown in Fig. 1.

FIGURE 1: The overview application scenario of AGW (a Data Holder whose outgoing traffic to the Internet passes through a DLD Agent)

Given a dataset D as the training set with documents d1, d2, ..., dm. The documents contain two categories, sensitive D_S and non-sensitive D_N, with R = {D_S, D_N} representing the classified dataset. D' = {d1, d2, ..., dn} is an unclassified dataset, denoting the outgoing data to be tested; that is, whether the documents in D' are sensitive or not is unknown. The goal of this paper is to establish a mapping M : D → R on D. The mapping M can then be used to build D' → R', where R' = {D'_S, D'_N}. Thus, sensitive files in D' can be detected to determine if there is a data leak.

The problem becomes how to classify the test set D' into R' : D' → R' based on the training set D. The key point in solving this problem is to build an appropriate mapping model M. For any di ∈ D', there are three possible cases: 1) di ∈ D, meaning there exists a dj ∈ D such that dj = di; 2) di ∉ D, but there exists a dj ∈ D which is partly similar to di; 3) di ∉ D, nor is there any dj ∈ D that is similar to di. The above cases illustrate two challenges that need to be tackled in this scenario:

• How to handle di ∈ D'_S in case 2, that is, how to detect the transformed sensitive data?
• How to deal with di ∈ D'_S in case 3, which means how to detect completely transformed unknown data?

What's more, the scale of data is usually large in Big Data scenarios, so how to perform the detection more efficiently also deserves much consideration.
B. OVERVIEW OF AGW
We address the above challenges in AGW. AGW mainly contains two phases: adaptive weighted graphs learning and score walk matching detection. The adaptive weighted graphs learning phase builds the template graphs from the training set. It retains the key information of the training set as much as possible. The score walk matching detection calculates the sensitivity scores of tested documents, and determines if there is a data leak based on the sensitivity scores. The learning phase can be performed locally at the data holder's side, while the detection phase can be performed by one or several third-party agents. What's more, the DLD agent may be semi-honest for some reasons (e.g., attacked or controlled by malicious users) and may try to learn the sensitive data from the data holder. To solve this problem, we apply privacy-preserving graph masking before the sensitive data is sent to the agent. The graph masking is performed after the template graphs are constructed in the learning phase, and again after the tested graphs are constructed in the score walk matching phase. The original content of the graph nodes (sensitive terms) can be concealed by using graph masking. The adaptive weighted graphs learning, the privacy-preserving graph masking and the score walk matching constitute the key modules of AGW. The overview of AGW is shown in Fig. 2.

FIGURE 2: The overview of AGW (Learning: Training set D → Documents preprocessing → Adaptive weighted graphs learning → Privacy-preserving graph masking; Detection: Test set D' → Masked test graphs → Score walk matching detection → Results)

C. ADAPTIVE WEIGHTED GRAPHS LEARNING
The word vector space model represents text by vectors and is widely used in many text-related tasks, such as information retrieval and sentiment analysis. Graphs are another method which has also been used for text representation in many text analysis tasks. Different from the tasks of natural language processing (NLP), the sensitivities of terms are of more concern than their meanings in data leak detection. Graphs can better keep the structural information of a document besides its content. We use adaptive weighted graphs to retain the sensitive semantics of documents.

Fig. 3 shows the learning phase of adaptive weighted graphs. We define the adaptive weighted graph based on traditional graphs in Definition 1.

FIGURE 3: The learning of adaptive weighted graphs (Document set D → Documents preprocessing → Analyze key terms and their semantic correlation → Construct the adaptive weighted graphs → Merge into one template graph → Nodes partition by improved LPA)

Definition 1 (Adaptive Weighted Graphs): Let V be the set of nodes (key terms); each node has a text-term value and a weight w_n. The set of node weights is W_N. An edge e connecting two nodes denotes that they have a certain degree of correlation, and each edge has a weight w_e to quantify this correlation degree. Let W_E be the set of edge weights. The weight matrix A consists of elements w ∈ W_N ∪ W_E, which contains the entire structure and weight information of the graph. An adaptive weighted graph is denoted as G = {V, A}.

An adaptive weighted graph is a summary of a document which keeps the key terms and their contextual information. In the learning phase, the main concern is how to retain as much key information as possible while simplifying the graph to minimize the amount of subsequent calculation. The adaptive weighted graph provides a balance between these two points. Moreover, what DLD is concerned with is not the word semantics, but the sensitivity of the content. We call this sensitivity semantics to distinguish it from the word semantics used in NLP. The sensitivity semantics defined in AGW focuses more on the appearance of key terms and their contextual correlations.

The complete detailed steps of the weighted graphs learning phase are described below.

1) Documents preprocessing
The adaptive weighted graphs learning starts with documents preprocessing. There are two categories of documents: sensitive and non-sensitive. Documents are first "cleaned" by removing the stop words and any meaningless word parts (e.g., 'the' and the 'ing' suffix). Then the documents are segmented into terms of various lengths, where each term may contain one or more words.
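As a concrete illustration of Definition 1, an adaptive weighted graph can be held in a small container like the following sketch. The class, field and method names here are our own, not part of AGW.

```python
from dataclasses import dataclass, field

@dataclass
class AdaptiveWeightedGraph:
    """G = {V, A}: key-term nodes with weights w_n, and weighted edges w_e
    quantifying the correlation between two terms (Definition 1)."""
    node_weights: dict = field(default_factory=dict)  # term -> w_n  (the set W_N)
    edge_weights: dict = field(default_factory=dict)  # frozenset{t1, t2} -> w_e  (W_E)

    def add_node(self, term, w_n):
        self.node_weights[term] = w_n

    def add_edge(self, t1, t2, w_e):
        # An edge asserts a degree of correlation between two key terms;
        # frozenset makes the edge undirected.
        self.edge_weights[frozenset((t1, t2))] = w_e

    def weight_info(self):
        """All elements w in W_N ∪ W_E, i.e. the content of weight matrix A."""
        return list(self.node_weights.values()) + list(self.edge_weights.values())
```

A graph built this way carries exactly the two kinds of information the learning phase needs: which key terms appear, and how strongly they co-occur.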
where Σ_{t′∈d} n_{t′,d} denotes the total number of terms in d. The inverse document frequency IDF is

    idf(t, D_N) = log( |D_S| / (1 + |{d′ ∈ D_N : t ∈ d′}|) ),    (2)

where |D_S| is the total number of sensitive documents and |{d′ ∈ D_N : t ∈ d′}| is the number of non-sensitive documents in which the term t appears. Thus the weight w (adjusted TF-IDF) of term t is defined as Eq. (3), which reflects the sensitivity of term t in the document:

    w_n = tf(t, d) · idf(t, D_N).    (3)

[...]

    G(V, A) = G(V_d1, A_d1) ∪ ··· ∪ G(V_di, A_di)
            = G(V_d1 ∪ ··· ∪ V_di, A_d1 ∪ ··· ∪ A_di)    (7)

If two or more nodes belonging to different graphs have the same text value, the merged weight is the mean

    w_n = (1/n) Σ_{i=1}^{n} w_{n_i}.    (8)

The same holds if two or more edges connect the same two nodes; the mean merged edge weight is

    w_e = (1/n) Σ_{j=1}^{n} w_{e_j}.    (9)
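The adjusted TF-IDF of Eqs. (2)-(3) and the mean merging of Eqs. (8)-(9) can be sketched as below. Eq. (1) is not reproduced in this excerpt, so the relative-frequency TF implied by the surrounding text is assumed, and the function names are our own.

```python
import math
from collections import Counter

def adjusted_tf_idf(term, doc, sensitive_docs, non_sensitive_docs):
    """Adjusted TF-IDF weight of a term: TF within the document, IDF taken
    against the NON-sensitive corpus and scaled by |D_S| (Eqs. (2)-(3))."""
    counts = Counter(doc)
    tf = counts[term] / sum(counts.values())           # n_{t,d} / sum_{t'} n_{t',d}
    df_n = sum(term in d for d in non_sensitive_docs)  # |{d' in D_N : t in d'}|
    idf = math.log(len(sensitive_docs) / (1 + df_n))   # Eq. (2)
    return tf * idf                                    # Eq. (3)

def merge_weights(weights):
    """Mean merge of duplicate node or edge weights (Eqs. (8)-(9))."""
    return sum(weights) / len(weights)
```

A term frequent in the tested document but rare in the non-sensitive corpus thus receives a high weight, which is exactly what makes it a good sensitivity indicator.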
Algorithm 1 The improved label propagation
Input: template graph G(V, A), stop criterion η, maximum number of iterations C
Output: labeled template graph G(V, A, L)
1: Initialize: L = ∅, iterations t = 0, randomly assign a label to each node
2: while η is not met and iterations t < C do
3:   for each node i in G do
4:     calculate the label of node i: select the label with maximum w_{i,l} from its neighbors
5:   end for
6: end while
7: return G(V, A, L)

… holder's side, consisting of the node set and the weight matrix. We apply irreversible encryption (a hash function) to the original value of each node to encrypt the content of the sensitive terms, while the weights and edge correlations are preserved. The masked template graphs, containing nothing about the sensitive content, are provided to the detection agent by the data holder. The agent performs the detection based on the masked template graphs as profiles. Since our detection mainly focuses on value matching, the encryption will not affect the performance of detection. The overall masking process is shown in Fig. 4.

FIGURE 4: The overall masking process (Weighted Context Graph M → Masked Graph M')
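The masking step can be sketched as follows. The paper specifies only "irreversible encryption (hash function)"; the choice of SHA-256 and the function name here are our assumptions.

```python
import hashlib

def mask_graph(node_weights, edge_weights):
    """Privacy-preserving graph masking: irreversibly hash each node's term
    while keeping all node weights, edge weights and the edge structure
    intact, so the agent can still do value matching on hashed terms."""
    def h(term):
        return hashlib.sha256(term.encode("utf-8")).hexdigest()
    masked_nodes = {h(t): w for t, w in node_weights.items()}
    masked_edges = {frozenset(h(t) for t in e): w for e, w in edge_weights.items()}
    return masked_nodes, masked_edges
```

Because the same term always hashes to the same digest, equality tests between a masked tested graph and a masked template graph behave exactly as on the plaintext graphs, while the agent never sees a sensitive term.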
FIGURE 5: The detection phase (Improved label propagation → Score walk matching → Detection results)

Algorithm 3 provides the details of the score walk algorithm. The sensitivity score of a weighted graph is calculated by a one-time score walk on the graph, which allows the detection to be performed quickly and efficiently. The final sensitivity is determined by comparing the two scores toward the sensitive template G_S and the non-sensitive template G_N. What's more, through the mechanism of reward and punishment, it performs a general match between the tested graphs and the template graph, which can tolerate transformed or fresh nodes and edges in the tested graphs. Thus, the score walk algorithm can deal with large amounts of transformed data and fresh data.

The score walk algorithm is proposed to solve the problem of graph matching between the tested graphs G' and the template graph G. For each graph G'_d ∈ G' to be tested, its nodes and edges are walked through by breadth-first traversal to calculate a matching score. A reward and punishment mechanism is used to calculate the score f_w() during the process of walking. For a node n_i ∈ G'_d, if n_i ∈ G, a node bonus f_w(n_i, +) as shown in Eq. (11a) is added. Then all nodes n_j connecting to n_i through an edge e_ij are traversed. If an edge e_ij ∈ G, an edge bonus f_w(e_ij, +) as shown in Eq. (11b) is added. If e_ij ∉ G but there exists in G a node also connecting to n_i with the same label as n_j.label in G', a label bonus f_w(e_ij, lb, +) as shown in Eq. (11c) is added. Otherwise a label penalty f_w(e_ij, lb, −) as shown in Eq. (11d) is subtracted. If n_i ∉ G, a node penalty f_w(n_i, −) as shown in Eq. (11e) and its edge penalties f_w(e_ij, −) as shown in Eq. (11f) are subtracted together, to punish the mismatched node and its correlations. The detailed rewards and penalties mentioned above are listed in Eq. (11):

    f_w(n_i, +) = w_{n_i} + α · (w_{n_i} / w^T_{n_i})    (11a)
    f_w(e_ij, +) = w_{e_ij} + β · (w_{e_ij} / w^T_{e_ij})    (11b)
    f_w(e_ij, lb, +) = w_{e_ij} + β · w_{e_ij} · γ    (11c)

Algorithm 3 The score walk
Input: weighted graph G'_d, template graph G
Output: the sensitivity of graph G'_d: f_w(G'_d)
1: Initialize: f_w(G'_d) = 0
2: Breadth-first traversal of G'_d
3: while node n_i in G'_d do
4:   if n_i ∈ G then
5:     f_w(G'_d) = f_w(G'_d) + f_w(n_i, +)
6:     while e_ij connects n_i and n_j do
7:       if e_ij ∈ G then
8:         f_w(G'_d) = f_w(G'_d) + f_w(e_ij, +)
9:       else
10:        if n_j.label ∈ G then
11:          f_w(G'_d) = f_w(G'_d) + f_w(e_ij, lb, +)
12:        else
13:          f_w(G'_d) = f_w(G'_d) − f_w(e_ij, lb, −)
14:        end if
15:      end if
16:    end while
17:  else
18:    f_w(G'_d) = f_w(G'_d) − f_w(n_i, −)
19:    while e_ij connects n_i and n_j do
20:      f_w(G'_d) = f_w(G'_d) − f_w(e_ij, −)
21:    end while
22:  end if
23: end while
24: return f_w(G'_d)
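The control flow of the score walk can be sketched as follows, under strong simplifications: flat `bonus` and `penalty` constants stand in for the weight-dependent terms of Eq. (11), and the label bonus/penalty branch is omitted. This illustrates the walk's reward-and-punishment structure, not the paper's full scoring.

```python
def score_walk(tested, template, bonus=1.0, penalty=1.0):
    """Simplified one-pass score walk over a tested graph.
    `tested` and `template` are adjacency dicts {node: set_of_neighbors}."""
    score = 0.0
    for ni in tested:                      # visit each node once (the paper
        if ni in template:                 # uses breadth-first order)
            score += bonus                 # node reward  f_w(n_i, +)
            for nj in tested[ni]:          # traverse the edges e_ij of n_i
                if nj in template[ni]:     # edge reward  f_w(e_ij, +)
                    score += bonus
                else:                      # mismatched edge penalty
                    score -= penalty
        else:
            score -= penalty               # node penalty f_w(n_i, -)
            score -= penalty * len(tested[ni])  # plus its edge penalties
    return score
```

A tested graph identical to the template accumulates only rewards, while transformed or fresh nodes and edges reduce the score instead of invalidating the match outright, which is what gives the walk its tolerance to transformation.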
detection agent in real time. So the complexity of the detection phase matters much more than that of the learning phase, as it is directly related to the efficiency of real-time detection. We mainly analyze the complexity of the score walk detection. For a document d to be tested, the score walk detection mainly has two steps: (a) learning the weighted graph G'_d, and (b) score walk detection on G'_d.

To analyze the complexity of the detection phase, we assume that the template graph G has already been constructed. The weighted graphs are stored in a HashMap, whose cost of looking up a value is O(1). Assume there are C terms in document d, the top N terms are selected as nodes, the window size for establishing an edge is K, and the number of edges is M. The complexities of steps (a) and (b) are as follows:

• Learning of G'_d: each word of document d is read once to record its count and index position. The cost of this process is O(C). Then the adjusted TF-IDF value of each term is calculated at a cost of O(C). The top N terms are selected as nodes. For each node, its adjacent nodes are searched within the window of size K, so the complexity is O(N·K). In the label propagation process, the initialization of each node label requires O(N) time, and each iteration of the label propagation algorithm takes time linear in the number of edges, O(M) [21]. The overall complexity is O(2C + N·K + N + M). For each node in G'_d, the number of its edges is less than K, as its edges are set within the window of size K. Thus M < K·N, and the complexity reduces to O(K·N).
• Detection of G'_d: the sensitivity score of graph G'_d is calculated through a one-time score walk on G'_d. This process takes time linear in the total number of nodes and edges, O(N + M). Since M < K·N, the complexity simplifies to O(K·N).

To sum up, the complexity is O(2K·N) = O(K·N). The steps described above are applied to each document d ∈ D'. Each repetition takes time linear in the complexity of one document, so the overall complexity is O(K·N). As we can see, the overall complexity is linear in the number of nodes and the size of the edge window, and the two parameters are both constant values in our implementation. Therefore, it can be concluded that AGW is capable of being implemented in real time.

IV. EVALUATION
In this section, the performance of AGW is evaluated by a series of experiments.

A. IMPLEMENTATION AND EXPERIMENT SETUP
Since there are few published sensitive datasets available due to the confidentiality of sensitive data, the Reuters dataset [22] is adopted as the raw dataset for our experiments. The dataset is a series of articles that appeared on the Reuters newswire; it consists of 11592 training documents and 4140 testing documents in 116 categories, each category in one folder. It is also a multi-label dataset, which means each document may belong to many classes. It must be pointed out that data leak detection is different from the task of document classification. What DLD does is assess the sensitivity of documents according to sensitivity semantics. Thus, it concerns the exact category of the documents, while the multi-label cases are ignored by DLD. Since many categories in the basic dataset are too small (e.g., one or two documents only) to perform further evaluation, we select the categories whose number of documents is more than 100, both in the training set and in the test set. 7 qualified categories (the number of documents ranges from hundreds to thousands) are screened out from the 116 categories as the basic dataset. The basic dataset consists of 6611 training documents and 2579 testing documents. To better simulate real-world data leak scenarios, we specify the sensitive data by assuming one category as sensitive and the others as non-sensitive. For a more detailed evaluation, we combine one category as sensitive and another as non-sensitive, chosen randomly from the basic dataset each time. Finally, 42 sets of data are generated for the following statistical experiment, each with a training set and a test set containing thousands of documents.

In the detection phase, we use the differential score S_d = S_{d,G_S} − S_{d,G_N} instead of a fixed score threshold to determine the final sensitivity of document d, where S_{d,G_S} denotes the calculated score of d toward the sensitive template G_S, and S_{d,G_N} the score toward G_N. A tested document d is determined to be sensitive only when S_d > 0. We use the relative differential score rather than a fixed threshold to avoid the impact of a threshold on the final detection results.

The effect of detection is measured by accuracy, recall and the receiver operating characteristic (ROC) curve. The confusion matrix in Table 1 presents the detailed analysis parameters of the detection results. The accuracy (Acc) measures the rate of correctly detected cases, calculated by Acc = (TP + TN) / (TP + FP + TN + FN). The recall measures the rate of sensitive documents that are also detected as sensitive, calculated by Recall = TP / (TP + FN), which reflects the ability to detect sensitive data leaks. The ROC curve is plotted with TP against FP at various thresholds, illustrating the diagnostic ability of our proposed model.

TABLE 1: The confusion matrix

                          Actual sensitive    Actual non-sensitive
Predicted sensitive       TP                  FP
Predicted non-sensitive   FN                  TN

The experiments involve 42 sets of data, which range in size from a few hundred to several thousand documents. Scenarios of detecting unmodified data and detecting transformed data are tested, with different sizes of dataset. The documents of the same category in the test set are quite different from those in the training set, so the test set is used as the modified or fresh data to be detected. The experiments in this section aim to answer the following questions.

• Can the proposed method deal with large amounts of data efficiently?
8 VOLUME 1, 2018
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2018.2851228, IEEE Access
•Can the proposed method tolerate the transformed or of dataset increases, which is consistent with our previous
fresh data in detection? complexity analysis in Section III. It can also be seen from
• How does our method compare to the state-of-the-art Fig. 7 that the running time of test phase is much less than
method on the same dataset in terms of accuracy, recall, the training phase. As mentioned before, the training phase
ROC? is performed locally and off-line, while the test phase has a
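The differential-score decision rule and the two metrics defined in the setup above can be sketched as follows (a minimal Python sketch with hypothetical counts for illustration; not the authors' implementation):

```python
def is_sensitive(score_sensitive: float, score_nonsensitive: float) -> bool:
    """Differential-score rule: S_d = S_d,GS - S_d,GN; sensitive iff S_d > 0."""
    return (score_sensitive - score_nonsensitive) > 0

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Rate of correctly detected cases (both classes)."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp: int, fn: int) -> float:
    """Rate of sensitive documents that are also detected as sensitive."""
    return tp / (tp + fn)

# Hypothetical counts for one test group, for illustration only.
print(accuracy(tp=95, fp=3, fn=5, tn=97))  # 0.96
print(recall(tp=95, fn=5))                 # 0.95
```

The relative comparison against both templates means no absolute threshold has to be tuned per dataset, which is the motivation stated above for preferring Sd > 0 over a fixed cut-off.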
The performance of AGW in two scenarios is evaluated in our experiments. Firstly, experiments for detecting unmodified data are conducted as the basic scenario. Then, experiments for detecting modified data are conducted to find out how AGW performs when faced with transformed data. This simulates the real-world transformed data leakage scenario.

B. DETECTING UNMODIFIED LEAKS
The accuracy and recall of 40 groups are both 100%, while the other two […] leaks.

TABLE 2: The accuracy and recall of 42 tested groups

Number of groups   Accuracy   Recall
40                 100%       100%
1                  99.91%     100%
1                  91.64%     98.51%

Fig. 6 presents the ROC curves of AGW and CoBAn [12]. Both methods achieve high performance in the ROC curve, while AGW is slightly higher. The high performance is expected, for the DLD task here is similar to a basic classification problem in the unmodified scenario. It indicates that detecting unmodified data leakage is much easier, because the data to be tested changes little from the template data.

FIGURE 6: The ROC curves of CoBAn and AGW

The training time, testing time and total time of AGW are shown in Fig. 7. The time increases linearly as the size of the dataset increases, which is consistent with our previous complexity analysis in Section III. It can also be seen from Fig. 7 that the running time of the test phase is much less than that of the training phase. As mentioned before, the training phase is performed locally and off-line, while the test phase has a high demand for time efficiency. It can be concluded from the results that the proposed method can deal with large amounts of data efficiently.

FIGURE 7: The running time of AGW in different sizes of test set

C. DETECTING TRANSFORMED LEAKS
In order to test the capability of our method in tolerating data transformation, transformed data should be used as the data to be detected. Instead of simply adding some content to or partly modifying the original documents, the documents in the test set are completely different and new compared with those in our training set, though they belong to the same categories. The new documents in our test set are chosen as the transformed sensitive data to verify the ability of AGW to detect completely transformed data. A series of datasets of various sizes have been tested. The overall statistical results of all 42 test sets are presented in Fig. 8. As we can see, among the 42 groups, AGW performs better in most cases.

More specifically, Fig. 9 shows the average detection accuracy of AGW and CoBAn, denoting the rate of correctly classified (sensitive or not) cases. The recalls of AGW and CoBAn are provided in Fig. 10; recall is the proportion of documents correctly detected as sensitive among all sensitive documents, denoting the capability of the method to detect leaks of sensitive data. As shown in Fig. 9 and Fig. 10, AGW achieves higher detection accuracy and recall on various test sets of different sizes. The accuracy and recall of AGW change little as the size of the test set increases, while the accuracy and recall of CoBAn fluctuate widely in some cases. This difference may be due to the fact that the documents of the same category in the test set and the training set are quite different. AGW catches more correlation information of key terms in the document by using weighted edges, while CoBAn uses only context terms to represent the correlation.
Thus, AGW is more suitable for dealing with transformed data.

FIGURE 8: The accuracy and recall of CoBAn and AGW in 42 test sets

FIGURE 10: The recall of CoBAn and AGW in different sizes of test set
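The linear running-time behavior reported above follows from the window-K edge bound used in the Section III analysis (M < K · N). A toy sketch of how a windowed term graph keeps the edge count linear is given below (the data structures and term list are hypothetical, not the authors' implementation):

```python
from collections import defaultdict

def build_window_graph(terms, K):
    """Connect each term occurrence to the terms within a window of K.

    Every position contributes at most K new edges, so the total edge
    count M is bounded by K * N, giving the O(K * N) cost bound.
    (Toy sketch: nodes are term strings, edges are undirected.)
    """
    edges = defaultdict(set)
    for i in range(len(terms)):
        # Look ahead at most K positions; add symmetric edges.
        for j in range(i + 1, min(i + K + 1, len(terms))):
            edges[terms[i]].add(terms[j])
            edges[terms[j]].add(terms[i])
    return edges

terms = ["data", "leak", "graph", "walk", "score", "label"]
g = build_window_graph(terms, K=2)
M = sum(len(neighbors) for neighbors in g.values()) // 2  # undirected edges
assert M <= 2 * len(terms)  # M < K * N with K = 2, N = 6
```

Each label-propagation iteration and each score walk then visits every edge at most once, i.e. O(M) ⊆ O(K · N) per pass, which is why the measured time in Fig. 7 grows linearly with the dataset size.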
V. CONCLUSION
In this paper, we present a new method for data leak detection named AGW. Our method consists of two phases: adaptive weighted graph learning and score walk matching. The proposed method makes three main contributions. First, it can tolerate long transformed data by using weighted context graphs. Second, it increases the scalability to deal with fresh data by applying an improved label propagation algorithm. Third, it can detect large amounts of data with a weight reward and penalty mechanism named score walk matching. The evaluation shows the effectiveness of AGW in terms of accuracy, recall and running time. It can be concluded that our method can perform fast detection of transformed or fresh data leaks in real time.

REFERENCES
[1] S. Alneyadi, E. Sithirasenan, and V. Muthukkumarasamy, "A survey on data leakage prevention systems," Journal of Network and Computer Applications, vol. 62, pp. 137–152, 2016.
[2] Wikileaks, "Secret US embassy cables," https://wikileaks.org/plusd/, 2010.
[3] BBC News, "Edward Snowden: Leaks that exposed US spy programme," http://www.bbc.com/news/world-us-canada-23123964, 2013.
[4] Wikipedia, "Yahoo! data breaches," https://en.wikipedia.org/wiki/Yahoo!_data_breaches, 2016.
[5] J. Huang, X. Li, C.-C. Xing, W. Wang, K. Hua, and S. Guo, "DTD: A novel double-track approach to clone detection for RFID-enabled supply chains," IEEE Transactions on Emerging Topics in Computing, vol. 5, no. 1, pp. 134–140, 2017.
[6] Kaspersky Lab, "IT security risks report 2016," https://www.kaspersky.com/blog/security_risks_report_perception, 2016.
[7] S. Alneyadi, E. Sithirasenan, and V. Muthukkumarasamy, "Word n-gram based classification for data leakage prevention," in Proc. 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom). IEEE, 2013, pp. 578–585.
[8] S. Xu, "Bayesian naïve Bayes classifiers to text classification," Journal of Information Science, vol. 44, no. 1, pp. 48–59, 2016.
[9] M. Hart, P. Manadhata, and R. Johnson, "Text classification for data loss prevention," in Proc. International Symposium on Privacy Enhancing Technologies. Springer, 2011, pp. 18–37.
[10] J. Huang, S. Huang, C.-C. Xing, and Y. Qian, "Game-theoretic power control mechanisms for device-to-device communications underlaying cellular system," IEEE Transactions on Vehicular Technology, 2018.
[11] X. Shu, J. Zhang, D. D. Yao, and W.-C. Feng, "Fast detection of transformed data leaks," IEEE Transactions on Information Forensics and Security, vol. 11, no. 3, pp. 528–542, 2016.
[12] G. Katz, Y. Elovici, and B. Shapira, "CoBAn: A context based model for data leakage prevention," Information Sciences, vol. 262, pp. 137–158, 2014.
[13] X. Shu, D. Yao, and E. Bertino, "Privacy-preserving detection of sensitive data exposure," IEEE Transactions on Information Forensics and Security, vol. 10, no. 5, pp. 1092–1103, 2015.
[14] S. Alneyadi, E. Sithirasenan, and V. Muthukkumarasamy, "Detecting data semantic: A data leakage prevention approach," in Proc. Trustcom/BigDataSE/ISPA, vol. 1. IEEE, 2015, pp. 910–917.
[15] J. Huang, J. Zou, and C.-C. Xing, "Competitions among service providers in cloud computing: A new economic model," IEEE Transactions on Network and Service Management, 2018.
[16] S. Alneyadi, E. Sithirasenan, and V. Muthukkumarasamy, "Adaptable n-gram classification model for data leakage prevention," in Proc. 7th International Conference on Signal Processing and Communication Systems (ICSPCS). IEEE, 2013, pp. 1–8.
[17] G. Giannakopoulos, V. Karkaletsis, G. Vouros, and P. Stamatopoulos, "Summarization system evaluation revisited: N-gram graphs," ACM Transactions on Speech and Language Processing (TSLP), vol. 5, no. 3, p. 5, 2008.
[18] Y. Lu, X. Huang, Y. Ma, and M. Ma, "A weighted context graph model for fast data leak detection," in Proc. IEEE International Conference on Communications (ICC). IEEE, 2018.
[19] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548–1560, 2011.
[20] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
[21] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 55–67, 2008.
[22] D. D. Lewis, "Reuters dataset," http://www.daviddlewis.com/resources/testcollections/.