Data Leakage Detection Documentation

DATA LEACKAGE DETECTION
INTRODUCTION
Data leakage is the unauthorized transmission of data or information from
within an organization to an external destination or recipient . Data leakage is
defined as the accidental or intentional distribution of private or sensitive data
to an unauthorized entity. Sensitive data of companies and organization
includes intellectual property, financial information, patient information,
personal credit card data and other information depending upon the business
and the industry. Furthermore, in many cases, sensitive data shred among
various stakeholders such as employees working from outside the
organizational premises, business partners and customers. This increases the
risk of confidential information falling into unauthorized hands.
Furthermore, in many cases, sensitive data is shared among various
stakeholders such as employees working from outside the organizational
premises (e.g., on laptops), business partners and customers. This increases
the risk of confidential information falling into unauthorized hands.
In the course of doing business, sometimes data must be handed over to
supposedly trusted third parties for some enhancement or operations. Let’s
take the example;[11] a hospital may give patient records to researcher who will
devise new treatments. Similarly a company may have partnership with other
companies that require sharing of customer data. Another enterprise may
outsource its data processing, so data must be given to various other
companies.
DEPARTMENT OF CSE PAGE-1 TKRCET

Owner of data is termed as the distributor and the supposedly third parties are
called as the agents. In this project, our goal is to identify the guilty agent
when the distributor’s sensitive data have been leaked by some agents.
Perturbation and watermarking are techniques which can help in such
situations. Perturbation is a very useful technique where the data
is modified and made less sensitive before being handed to agents. For
example, one can add random noise to certain attributes or one can replace
exact values by ranges on the original record. However in some cases, it is not
important to alter the original record. Suppose if an outsourcer is doing our
payroll, he must have the exact salary and customer bank account numbers. If
medical researchers treating the patients (as opposed to simply computing
statistic) they may need accurate data for the patients . Traditionally, leakage
detection is handled by the watermarking. For example a unique code is
embedded in each distributed copy. If that copy is later found in the hands of
an unauthorized party, the leaker can be identified. Watermarks can be very
useful in some cases but again, involve some modification of the original data.
Furthermore, watermarks can sometimes be destroyed if the data recipient is
malicious [research paper].
In short, watermarking is suitable for all the application because it’s lost the
original data. There are some disadvantages of it. That is-
• It involves some modification of data that is
making the data less sensitive by altering
attributes of the data.

• The second problem is that these watermarks
can be sometimes destroyed if the recipient
is malicious.
In this paper, we develop an algorithm of data allocation strategies for finding
the guilty agents that improves the chances of identifying a leaker. We also
consider the option of adding fake objects to the distributed set. Such object do
not corresponds to real entities but appear realistic to the agents. Means that
fake objects act as a type of watermarks for the entire set, without modifying
any original data. If it turns out that an agent was given one or more fake
objects that were leaked, then the distributor can be more confident that agent
was guilty.

PROBLEM SETUP AND NOTATION
Entities and Agents
Let the distributor database owns a set S= {t1, t2,…., tm} which consists of
data objects. Let the no of agents be A1, A2, ..., An [6][10]. The distributor
distributes a set of records S to any agents based on their request such as
sample or explicit request.
• Sample request Ri= SAMPLE (T, mi): Any
subset of mi records from T can be given to
Ui [1].
• Explicit request Ri= EXPLICIT (T;condi):
Agent Ui receives all T objects that satisfy
Condition.
The objects in T could be of any type and size, e.g. they could be tuples in a
relation, or relations in a database. After giving objects to agents, the
distributor discovers that a set S of T has leaked. This means that some third
party called the target has been caught in possession of S. For example, [9] this
target may be displaying S on its web site, or perhaps as part of a legal
discovery process, the target turned over S to the distributor. Since the agents
(A1, A2, ..., An) have some of the data, it is reasonable to suspect them leaking
the data. However, the agents can argue that they are innocent, and that the S
data was obtained by the target through other means.
Example: Say T contains customer records for a given company A. Company A
hires a marketing agency U1 to do an on-line survey of customers. Since any

customers will do for the survey, U1 requests a sample of 1000 customer
records. At the same time, company A subcontracts with agent U2 to handle
billing for all California customers. Thus, U2 receives all T records that satisfy
the condition “state is California.”
Guilty Agents
Suppose that after giving objects to agents, the distributor discovers that a set
S _ T has leaked. This means that some third party called the target, has been
caught in possession of S. For example, this target may be displaying S on its
web site, or perhaps as part of a legal discovery process, the target turned over
S to the distributor. Since the agents U1; : : : ;Un have some of the data, it
is reasonable to suspect them leaking the data. However, the agents can argue
that they are innocent, and that the S data was obtained by the target through
other means. For example, say one of the objects in S represents a customer
X. Perhaps X is also a customer of some other company, and that company
provided the data to the target. Or perhaps X can be reconstructed from
various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the
agents as opposed to other sources. Intuitively, the more data in S, the harder
it is for the agents to argue they did not leak anything. Similarly, the “rarer”
the objects, the harder it is to argue that the target obtained them through

other means. Not only do we want to estimate the likelihood the agents leaked
data, but we would also like to find out if one of them in particular was more
likely to be the leaker. For instance, if one of the S objects was only given to
agent U1, while the other objects were given to all agents, we may suspect U1
more. The model we present next captures this intuition.
We say an agent Ui is guilty if it contributes one or more objects to the target.
We denote the event that agent Ui is guilty for a given leaked set S by GijS. Our
next step is to estimate PrfGijSg, i.e., the probability that agent Ui is guilty
given evidence S.
AGENT GUILT MODEL
To compute this PrfGijSg, we need an estimate for the probability that values in
S can be “guessed” by the target. For instance, say some of the objects in T are
emails of individuals. We can conduct an experiment and ask a person with
approximately the expertise and resources of the target to find the email of say
100 individuals. If this person can find say 90 emails, then we can reasonably
guess that the probability of finding one email is 0.9. On the other hand, if the
objects in question are bank account numbers, the person may only
discover say 20, leading to an estimate of 0.2. We call this estimate pt, the
probability that object t can be guessed by the target.
To simplify the formulas that we present in the rest of the paper, we assume
that all T objects have the same pt, which we call p. Our equations can be
easily generalized to diverse pt’s though they become cumbersome to display.

Next, we make two assumptions regarding the relationship among the various
leakage events. The first assumption simply states that an agent’s decision to
leak an object is not related to other objects.
Assumption 1:
For all t; t0 2 S such that t 6= t0 the provenance of t is independent of the
provenance of t0. To simplify our formulas, the following assumption states
that joint events have a negligible probability. As we argue in the example
below, this assumption gives us more conservative estimates for the guilt of
agents, which is consistent with our goals.
Assumption 2:
An object t 2 S can only be obtained by the target in one of two ways:
 A single agent Ui leaked t from his own Ri set; or
 The target guessed (or obtained through other means) t without the help
of any of the n agents.
In other words, for all t 2 S, the event that the target guesses t and the events
that agent Ui (i = 1; : : : ; n) leaks object t are disjoint.
Before we present the general formula for computing PrfGijSg, we provide a
simple example. Assume that sets T, R’s and S are as follows:
T = ft1; t2; t3g; R1 = ft1; t2g; R2 = ft1; t3g; S = ft1; t2; t3g:
In this case, all three of the distributor’s objects have been leaked and appear
in S.
Let us first consider how the target may have obtained object t1, which was
given to both agents. From Assumption 2, the target either guessed t1 or one of

U1 or U2 leaked it. We know that the probability of the former event is p, so
assuming that the probability that each of the two agents leaked t1 is the
same we have the following cases:
_ the leaker guessed t1 with probability p;
_ agent U1 leaked t1 to S with probability (1 􀀀 p)=2
_ agent U2 leaked t1 to S with probability (1 􀀀 p)=2
Similarly, we find that agent U1 leaked t2 to S with probability 1 􀀀 p since it is
the only agent that has this data object. Given these values, the probability
that agent U1 is not guilty, namely that U1 did not leak either object is:
PrfG_1jSg = (1 􀀀 (1 􀀀 p)=2) _ (1 􀀀 (1 􀀀 p))………….. (1)
Hence, the probability that U1 is guilty is:
PrfG1jSg = 1 􀀀 PrfG1_jSg……………………………… (2)
In the general case (with our assumptions), to find the probability that an
agent Ui is guilty given a set S, first we compute the probability that he leaks a
single object t to S. To compute this we define the set of agents Vt = fUijt 2 Rig
that have t in their data sets. Then using Assumption 2 and known probability
p, we have:
Prfsome agent leaked t to Sg = 1 􀀀 p: ……………… (3)
Assuming that all agents that belong to Vt can leak t to S with equal
probability and using Assumption 2 we obtain:
{
PrfUi leaked t to Sg = 1-p/ Vtj ; if Ui 2 Vt
}
0; otherwise ......................................... (4)

Data Allocation Problem
The main focus of our work is the data allocation problem: how can the
distributor “intelligently” give data to agents to improve the chances of
detecting a guilty agent? As illustrated in Figure 1, there are four instances of
this problem we address, depending on the type of data requests made by
agents (E for Explicit and S for Sample requests) and whether “fake objects” are
allowed (F for the use of fake objects, and F for the case where fake objects are
not allowed).
Fake objects are objects generated by the distributor that are not in set T. The
objects are designed to look like real objects, and are distributed to agents
together with the T objects, in order to increase the chances of detecting agents
that leak data. We discuss fake objects in more detail in [2].

Figure1. Leakage
problem instances
Optimization Problem
The distributor’s data allocation to agents has one constraint and one
objective. The distributor’s constraint is to satisfy agents’ requests, by
providing them with the number of objects they request or with all available
objects that satisfy their conditions. His objective is to be able to detect an
agent who leaks any of his data objects.
We consider the constraint as strict. The distributor may not deny serving an
agent request as in [3] and may not provide agents with different perturbed
versions of the same objects as in [4]. We consider fake object allocation as the
only possible constraint relaxation.

Our detection objective is ideal and intractable. Detection would be assured
only if the distributor gave no data object to any agent. We use instead the
following objective: maximize
the chances of detecting a guilty agent that leaks all his objects .
We now introduce some notation to state formally the distributor’s objective.
Recall that PrfGj jS = Rig or simply PrfGj jRig is the probability that agent Uj is
guilty if the distributor discovers a leaked table S that contains all Ri objects.
We define the difference functions delta(i; j) as:
Delta(i; j) = Pr{Gi/Ri} – Pr{Gj /Ri} i; j = 1………….. n(6)
Problem Definition:
Let the distributor have data requests from n agents. The distributor wants to
give tables R1; : : : ;Rn to agents U1; : : : ;Un respectively, so that:
_ he satisfies agents’ requests; and
_ he maximizes the guilt probability differences _(i; j) for all i; j = 1; : : : ; n and i
6= j.
Assuming that the Ri sets satisfy the agents’ requests, we can express the
problem as a multi-criterion optimization problem:
Maximize
(over R1; : : : ;Rn) (: : : ; _(i; j); : : : ) i 6|= j ………….(7)
Objective Approximation
We can approximate the objective of Equation 7 with Equation 8 that does not
depend on agents’ guilt probabilities and therefore on p.
minimize
(over R1; : : : ;Rn) ( ………Ri n Rj / Ri) i|=j…………………………(8)

Both scalar optimization problems yield the optimal solution of the problem of
Equation 8, if such solution exists. If there is no global optimal solution, the
sum-objective yields the Pareto optimal solution that allows the distributor to
detect the guilty agent on average (over all different agents) with higher
confidence than any other distribution. The max-objective yields the solution
that guarantees that the distributor will detect the guilty agent with a certain
confidence in the worst case. Such a guarantee may adversely impact the
average performance of the distribution.

RELATED WORK
The guilt detection approach we present is related to the data provenance
problem: tracing the lineage of an S object implies essentially the detection of
the guilty agents. It provides a good overview on the research conducted in this
field. Suggested solutions are domain specific, such as lineage tracing
for data Warehouses, and assume some prior knowledge on the way a data
view is created out of data sources. Our problem formulation with objects and
sets is more general and simplifies lineage tracing, since we do not consider
any data transformation from Ri sets to S.As far as the data allocation
strategies are concerned, our work is mostly relevant to watermarking that is
used as a means of establishing original ownership of distributed objects.
Watermarks were initially used in images, video and audio data whose digital
representation includes considerable redundancy. Our approach and
watermarking are similar in the sense of providing agents with some kind of
receiver identifying information. However, by its very nature, a watermark
modifies the item being watermarked. If the object to be watermarked cannot
be modified, then a watermark cannot be inserted. In such cases, methods that
attach watermarks to the distributed data are not applicable. Finally, there are

also lots of other works on mechanisms that allow only authorized users to
access sensitive data through access control policies. Such approaches
prevent in some sense data leakage by sharing information only with trusted
parties. However, these policies are restrictive and may make it impossible to
satisfy agents’ requests.
Preparation for data protection
While there are plenty of tools on the market for keeping mobile and stationary
data from leaving the company surreptitiously, the best ones use a
combination of prevention and detection methods, such as a detection engine
and a data blocker.
However, before doing anything, it's crucial to understand what data types are
being protected and the level of risk. You should create and codify data
classification levels for all of your company's data according to the
organization's IT security standards. Data types can be ranked on a scale from
low to high, based on the risk of its loss or exposure .
Some examples of high-risk data might include the following:
 Customer or employee information with names, addresses, social
security numbers and other identity-related information.
 Customer lists that could be used by a competitor for poaching clients.
 Trade secrets and intellectual property.
 Confidential engineering and manufacturing plans for products.

 Financial information or soon-to-be-released marketing plans for
upcoming products.
Data leakage prevention tools
Data leakage prevention tools can be roughly compared to application-level
firewalls. Likefirewalls, they examine the content of outbound data, rather than
just ports and packet types, and ultimately decide what can leave the
company. When investigating data leakage prevention tools, you'll find that the
three big players in the market are Vontu Inc., Reconnex Inc. and Vericept
Corp.
 The Vontu 6.0 suite contains a set of tools that can monitor all types of
Web traffic, including SSL, IM and Web mail. It detects malicious outbound
traffic with its three algorithms: Exact Data Matching, Indexed Document
Matching and Described Content Matching. Vontu 6.0 can be finely tuned
to target specific groups of employees, locations or types of content.
 Reconnex's iGuard platform consists of two useful devices. Reconnex's
iGuard is a network appliance that monitors the content of outbound traffic
and can also spot malicious activity. Their other product, Reconnex InSight
Console, is a database that makes detection easier by storing sensitive data
info. As with Vontu, the Reconnex platform can be tuned to suit a
company's needs.

 Vericept's 360-degree Visibility and Control is a customizable tool
predominantly used for content monitoring. It uses what it calls its
proprietary Intelligent Content Control Engine. Vericept not only monitors
the whole range of Web traffic -- like FTP, SSL, IM and P2P -- but also
monitors blog postings, chat rooms and Web sites, all places where sensitive
company data and secrets could end up.
 Two other vendors that may be useful are PortAuthority Technologies
Inc. and GTB Technologies Inc. Like the other products mentioned above,
these companies offer hardware appliances that monitor outbound IP traffic
for specific types of corporate data.
CONCLUSIONS
In a perfect world there would be no need to hand over sensitive data to agents
that may unknowingly or maliciously leak it. And even if we had to hand over
sensitive data, in a perfect world we could watermark each object so that we
could trace its origins with absolute certainty. However, in many cases we
must indeed work with agents that may not be 100% trusted, and we may not
be certain if a leaked object came from an agent or from some other source. In
spite of these difficulties, we have shown it is possible to assess the
likelihood that an agent is responsible for a leak, based on the overlap of its
data with the leaked data and the data of other agents, and based on the

probability that objects can be “guessed” by other means. Our model is
relatively simple, but we believe it captures the essential trade-offs.
REFERENCES
[1] N. Sandhya, G. Haricharan Sharma, K. Bhima, "Exerting Modern
Techniques for Data Leakage Problems Detect,” International Journal of
Electronics Communication and Computer Engineering (IJECCE), Vol. 3, Issue
(1) NCRTCST, ISSN 2249 –071X
[2] Sandip A. Kale C1, Prof. S. V. Kulkarni C2, “Data Leakage Detection: A
Survey,” IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661
Vol. 1, Issue 6 (July-Aug 2012), PP 32-35www.iosrjournals.org
[3] P. Saranya, "Online Data Leakage Detection And Analysis,” IJART, Vol. 2
Issue 2, March 2012
[4] P. Papadimitriou, H. Garcia-Molina, “Data Leakage Detection,” technical
report, Stanford University, 2008.
[5] Shivappa M. Metagar, Sanjaykumar J. Hamilpure ,B. P. Savukar, "Water
Marking Technique: An Unique Approach For Detecting The Data Leakage,"
Volume 2, Issue 5, Sept 2012
[6] Archana Vaidya, Prakash Lahange, Kiran More, Shefali Kachroo & Nivedita
Pandey, "DATA LEAKAGE DETECTION," Vol. 3, Issue 1, pp. 315-321.

Data Leakage Detection Documentation

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Leakage Detection Documentation

Uploaded by

Copyright:

Available Formats

DATA LEACKAGE DETECTION

within an organization to an external destination or recipient . Data leakage is

defined as the accidental or intentional distribution of private or sensitive data

to an unauthorized entity. Sensitive data of companies and organization

includes intellectual property, financial information, patient information,

various stakeholders such as employees working from outside the

organizational premises, business partners and customers. This increases the

risk of confidential information falling into unauthorized hands.

Furthermore, in many cases, sensitive data is shared among various

stakeholders such as employees working from outside the organizational

premises (e.g., on laptops), business partners and customers. This increases

the risk of confidential information falling into unauthorized hands.

In the course of doing business, sometimes data must be handed over to

supposedly trusted third parties for some enhancement or operations. Let’s

companies that require sharing of customer data. Another enterprise may

outsource its data processing, so data must be given to various other

DEPARTMENT OF CSE PAGE-1 TKRCET

Perturbation and watermarking are techniques which can help in such

situations. Perturbation is a very useful technique where the data

important to alter the original record. Suppose if an outsourcer is doing our

medical researchers treating the patients (as opposed to simply computing

detection is handled by the watermarking. For example a unique code is

an unauthorized party, the leaker can be identified. Watermarks can be very

Furthermore, watermarks can sometimes be destroyed if the data recipient is

malicious [research paper].

original data. There are some disadvantages of it. That is-

• It involves some modification of data that is

making the data less sensitive by altering

attributes of the data.

DEPARTMENT OF CSE PAGE-1 TKRCET

• The second problem is that these watermarks

can be sometimes destroyed if the recipient

In this paper, we develop an algorithm of data allocation strategies for finding

DEPARTMENT OF CSE PAGE-1 TKRCET

PROBLEM SETUP AND NOTATION

Entities and Agents

distributes a set of records S to any agents based on their request such as

sample or explicit request.

• Sample request Ri= SAMPLE (T, mi): Any

subset of mi records from T can be given to

• Explicit request Ri= EXPLICIT (T;condi):

Agent Ui receives all T objects that satisfy

relation, or relations in a database. After giving objects to agents, the

target may be displaying S on its web site, or perhaps as part of a legal

data was obtained by the target through other means.

Example: Say T contains customer records for a given company A. Company A

hires a marketing agency U1 to do an on-line survey of customers. Since any

DEPARTMENT OF CSE PAGE-1 TKRCET

customers will do for the survey, U1 requests a sample of 1000 customer

records. At the same time, company A subcontracts with agent U2 to handle

the condition “state is California.”

caught in possession of S. For example, this target may be displaying S on its

X. Perhaps X is also a customer of some other company, and that company

provided the data to the target. Or perhaps X can be reconstructed from

various publicly available sources on the web.

DEPARTMENT OF CSE PAGE-1 TKRCET

more. The model we present next captures this intuition.

We say an agent Ui is guilty if it contributes one or more objects to the target.

AGENT GUILT MODEL

emails of individuals. We can conduct an experiment and ask a person with

probability that object t can be guessed by the target.

easily generalized to diverse pt’s though they become cumbersome to display.

DEPARTMENT OF CSE PAGE-1 TKRCET