INTRODUCTION
Data leakage is the unauthorized transmission of data or information from within an organization to an external destination or recipient. Sensitive data may include intellectual property, financial information, personal credit card data, and other information, depending upon the business and the industry. Furthermore, in many cases sensitive data must be shared among supposedly trusted parties. To take an example [11], a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data.
The owner of the data is termed the distributor, and the supposedly trusted third parties are called the agents. In this project, our goal is to identify the guilty agent when the distributor’s sensitive data have been leaked by some agent.
Traditionally, leakage has been handled by perturbation, where the data is modified and made less sensitive before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges on the original record. However, in some cases it is important not to alter the original data. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients. Leakage detection has also traditionally been handled by watermarking, e.g., a unique code embedded in each distributed copy. If that copy is later found in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases but, again, involve some modification of the original data. In short, watermarking is not suitable for all applications, because it alters the data and the watermark can sometimes be destroyed if the data recipient is malicious.
In this work we develop a model for assessing the guilt of agents that improves the chances of identifying a leaker. We also consider the option of adding fake objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, fake objects act as a type of watermark for the entire set, without modifying any original data. If it turns out that an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.
Let the distributor own a set T = {t1, t2, …, tm} of valuable data objects, to be shared with agents U1, U2, …, Un [6][10]. The distributor gives agent Ui a subset of the objects Ri ⊆ T, determined either by a sample request or by an explicit request condition [1].
The objects in T could be of any type and size; e.g., they could be tuples in a relational database or records in a file. As an example of an explicit request [9], say agent U2 handles billing for all California customers. Thus, U2 receives all T records that satisfy the condition state = "CA".
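As a minimal sketch of such an explicit request (the record fields and values here are invented for illustration, not taken from the paper), the agent's share Ri is just the subset of T matching a predicate:

```python
# Toy distributor set T; field names are assumptions for this sketch.
T = [
    {"cust": "a", "state": "CA"},
    {"cust": "b", "state": "NY"},
    {"cust": "c", "state": "CA"},
]

# Agent U2's explicit request condition: state == "CA".
R2 = [t for t in T if t["state"] == "CA"]
print([t["cust"] for t in R2])  # ['a', 'c']
```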
Guilty Agents
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its web site, or perhaps, as part of a legal discovery process, the target turned over S to the distributor. Since the agents U1, …, Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent, and that the S data was obtained by the target through other means. For example, say one of the objects in S represents a customer X; perhaps X is also a customer of some other company, and that company provided the data to the target.
Our goal is to estimate the likelihood that the leaked data came from the
agents as opposed to other sources. Intuitively, the more data in S, the harder
it is for the agents to argue they did not leak anything. Similarly, the “rarer”
the objects, the harder it is to argue that the target obtained them through
other means. Not only do we want to estimate the likelihood the agents leaked
data, but we would also like to find out if one of them in particular was more
likely to be the leaker. For instance, if one of the S objects was only given to
agent U1, while the other objects were given to all agents, we may suspect U1 more.
We denote the event that agent Ui is guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi | S}, i.e., the probability that agent Ui is guilty given evidence S.
To compute Pr{Gi | S}, we need an estimate for the probability that values in S can be “guessed” by the target. For instance, say some of the objects in T are email addresses of individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the email addresses of, say, 100 individuals. If this person can find, say, 90 emails, then we can reasonably guess that the probability of finding one email is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that the target guesses object t.
To simplify the formulas that we present in the rest of this work, we assume that all T objects have the same pt, which we call p. Our equations can be easily generalized to diverse pt values, though they become cumbersome to display.
Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption simply states that an agent’s decision to leak an object is not related to other objects.

Assumption 1: For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the provenance of t′. As we argue below, this assumption gives us more conservative estimates for the guilt of the agents.

Assumption 2: An object t ∈ S can only be obtained by the target in one of two ways: a single agent Ui leaked t from its own Ri set, or the target guessed (or obtained through other means) t without the help of any of the n agents.

In other words, for all t ∈ S, the event that the target guesses t and the events that each of the agents Ui leaks t are disjoint.
For example, assume that the distributor set T, the agent sets Ri, and the leaked set S are:

T = {t1, t2, t3},  R1 = {t1, t2},  R2 = {t1, t3},  S = {t1, t2, t3}.

In this case, all three of the distributor’s objects have been leaked and appear in S.
Let us first consider how the target may have obtained object t1, which was given to both agents. From Assumption 2, the target either guessed t1 or one of U1 and U2 leaked it. Assuming that the probability that each of the two agents leaked t1 is the same, each leaked t1 to S with probability (1 − p)/2. Similarly, U1 leaked t2 to S with probability 1 − p, since he is the only agent that has this data object. Given these values, the probability that agent U1 is not guilty, namely that U1 did not leak either object, is:

Pr{¬G1 | S} = (1 − (1 − p)/2) × (1 − (1 − p))
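With our assumptions, U1 avoided leaking t1 with probability 1 − (1 − p)/2 and avoided leaking t2 with probability 1 − (1 − p), so Pr{¬G1 | S} is their product. Plugging in a concrete value of p (0.5 here, purely for illustration) shows how the expression behaves:

```python
p = 0.5  # assumed guess probability, chosen only for illustration

# t1 was given to both agents: U1 leaked it with probability (1 - p) / 2.
# t2 was given only to U1:     U1 leaked it with probability 1 - p.
pr_not_guilty = (1 - (1 - p) / 2) * (1 - (1 - p))
pr_guilty = 1 - pr_not_guilty
print(pr_not_guilty, pr_guilty)  # 0.375 0.625
```

So even a target that guesses each object correctly half the time leaves U1 guilty with probability 0.625 in this tiny example.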
In the general case (with our assumptions), to find the probability that an agent Ui is guilty given a set S, we first compute the probability that he leaks a single object t to S. To compute this, we define the set of agents Vt = {Ui | t ∈ Ri} that have t in their data sets. Then, using Assumption 2 and the known probability p, we have:
Assuming that all agents that belong to Vt can leak t to S with equal probability, we obtain:

Pr{Ui leaked t to S} = (1 − p) / |Vt|,   if Ui ∈ Vt
                       0,                otherwise        (4)
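Combining Equation 4 over the objects in S ∩ Ri (and using Assumption 1 to multiply the per-object probabilities) gives Pr{Gi | S} = 1 − Π over t ∈ S ∩ Ri of (1 − (1 − p)/|Vt|). The following sketch computes this; the set names follow the text, but the function itself is our illustration, not code from the paper:

```python
def guilt_probability(agent_sets, S, p):
    """Pr{Gi | S}: 1 minus the product over t in S ∩ Ri of (1 - (1-p)/|Vt|)."""
    result = {}
    for i, Ri in agent_sets.items():
        pr_not_guilty = 1.0
        for t in S & Ri:
            # Vt: the agents that were given object t.
            Vt = [j for j, Rj in agent_sets.items() if t in Rj]
            pr_not_guilty *= 1.0 - (1.0 - p) / len(Vt)
        result[i] = 1.0 - pr_not_guilty
    return result

# The example sets from above: R1 = {t1, t2}, R2 = {t1, t3}, S = {t1, t2, t3}.
R = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
probs = guilt_probability(R, S={"t1", "t2", "t3"}, p=0.5)
print(probs)  # both agents get 0.625, since the sets are symmetric
```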
The main focus of our work is the data allocation problem: how can the distributor “intelligently” give data to agents in order to improve the chances of detecting a guilty agent? There are four instances of this problem, depending on the type of data requests made by agents (E for Explicit and S for Sample requests) and whether fake objects are allowed (F for the use of fake objects, and F̄ for the case where fake objects are not allowed).
Fake objects are objects generated by the distributor that are not in set T. The objects are designed to look like real objects and are distributed to agents together with the T objects, in order to increase the chances of detecting agents that leak data.
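A minimal sketch of the idea (the record layout and IDs are invented for illustration): give each agent one unique fake record alongside the real data, then check which agent's fake appears in a leaked copy:

```python
T = [{"id": n} for n in range(5)]  # real objects (toy data)

def allocate_with_fakes(agents):
    """Give every agent all of T plus one unique, realistic-looking fake."""
    shares, fakes = {}, {}
    for k, name in enumerate(agents):
        fake = {"id": 1000 + k}  # fake object, not in T
        fakes[name] = fake
        shares[name] = T + [fake]
    return shares, fakes

shares, fakes = allocate_with_fakes(["U1", "U2"])
leaked = shares["U2"]  # suppose U2's copy turns up on the web
suspects = [a for a, f in fakes.items() if f in leaked]
print(suspects)  # ['U2']
```

Because each fake is given to exactly one agent (|Vt| = 1 and the target is unlikely to guess it), a leaked fake points at its recipient with high confidence.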
Figure 1. Leakage problem instances.
Optimization Problem
The distributor’s data allocation to agents has one constraint and one objective. The constraint is to satisfy agents’ requests, by providing them with the number of objects they request or with all available objects that satisfy their conditions. The objective is to be able to detect an agent who leaks any portion of his data.

We consider the constraint as strict: the distributor may not deny serving an agent request, as in [3], and may not provide agents with different perturbed versions of the same objects, as in [4]. We consider fake object allocation as the only possible constraint relaxation. Note that the guilt probabilities themselves cannot simply be minimized, since they would be 0 only if the distributor gave no data object to any agent. We use instead the following objective: maximize the chances of detecting a guilty agent that leaks all his objects.
Recall that Pr{Gj | S = Ri}, or simply Pr{Gj | Ri}, is the probability that agent Uj is guilty if the distributor discovers a leaked table S that contains all of the Ri objects.
Problem Definition:
Let the distributor have data requests from n agents. The distributor wants to give tables R1, …, Rn to agents U1, …, Un so that he satisfies the agents’ requests and maximizes the guilt probability differences Δ(i, j) = Pr{Gi | Ri} − Pr{Gj | Ri} for all i, j = 1, …, n with i ≠ j.

Assuming that the Ri sets satisfy the agents’ requests, we can express the problem as a multi-criterion optimization problem:

maximize (over R1, …, Rn)  ( …, Δ(i, j), … ),   i ≠ j        (7)
Objective Approximation
We can approximate the objective of Equation 7 with Equation 8, which does not depend on the agents’ guilt probabilities, and therefore on p:

minimize (over R1, …, Rn)  ( …, |Ri ∩ Rj| / |Ri|, … ),   i ≠ j        (8)
Both scalar optimization problems yield the optimal solution of the problem of Equation 8, if such a solution exists. If there is no global optimal solution, the sum-objective yields the Pareto optimal solution that allows the distributor to detect the guilty agent on average (over all different agents) with higher confidence than any other distribution. The max-objective yields the solution that guarantees that the distributor will detect the guilty agent with a certain confidence in the worst case. Such a guarantee may adversely impact the average performance of the distribution.
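The relative-overlap objective of Equation 8 is easy to evaluate for a candidate allocation. The following sketch computes the sum-objective (the function is our illustration, not code from the paper); lower values mean the agents' shares overlap less, so a leaked share points more clearly at one agent:

```python
def sum_objective(agent_sets):
    """Sum over i != j of |Ri ∩ Rj| / |Ri| (smaller = easier detection)."""
    total = 0.0
    for i, Ri in agent_sets.items():
        for j, Rj in agent_sets.items():
            if i != j:
                total += len(Ri & Rj) / len(Ri)
    return total

# Two candidate allocations for two agents that each request two objects:
full_overlap = {"U1": {"t1", "t2"}, "U2": {"t1", "t2"}}
min_overlap = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
print(sum_objective(full_overlap), sum_objective(min_overlap))  # 2.0 1.0
```

The second allocation halves the objective, matching the intuition that giving each agent some unique objects makes the leaker easier to single out.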
RELATED WORK
Our guilt detection approach is related to the data provenance problem: tracing the lineage of the leaked objects essentially amounts to detecting the guilty agents, and prior work on provenance provides a good overview of the research conducted in this area. Suggested solutions are domain specific, such as lineage tracing for data warehouses, and assume some prior knowledge of the way a data view is created out of data sources. Our problem formulation with objects and sets is more general and simplifies lineage tracing, since we do not consider any data transformation from the Ri sets to S. As far as data allocation strategies are concerned, our work is mostly relevant to watermarking, which is used as a means of establishing original ownership of distributed objects.
Watermarks were initially used in images, video, and audio data, whose digital representations include considerable redundancy. Our approach and watermarking are similar in the sense of providing agents with some kind of receiver-identifying information. However, by its very nature a watermark modifies the item being watermarked; if the objects cannot be modified, methods that attach watermarks to the distributed data are not applicable. Finally, there are also many other works on mechanisms that allow only authorized users to access sensitive data through access control policies. Such approaches prevent, in some sense, data leakage by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents’ data requests.
While there are plenty of tools on the market for keeping mobile and stationary data from leaving the company surreptitiously, the best ones use a content-aware approach. However, before doing anything, it’s crucial to understand what data types are being protected and the level of risk. You should create and codify data security policies covering areas such as customer information, intellectual property, and upcoming products.

Data leakage prevention tools sit at the network edge, much like firewalls. Like firewalls, they examine the content of outbound data, rather than just ports and packet types, and ultimately decide what can leave the
company. When investigating data leakage prevention tools, you'll find that the
three big players in the market are Vontu Inc., Reconnex Inc. and Vericept
Corp.
The Vontu 6.0 suite contains a set of tools that can monitor all types of
traffic with its three algorithms: Exact Data Matching, Indexed Document
Matching and Described Content Matching. Vontu 6.0 can be finely tuned
and can also spot malicious activity. Reconnex’s InSight product can likewise be tailored to a company’s needs. Vericept’s offering covers the whole range of Web traffic -- like FTP, SSL, IM and P2P -- and also monitors blog postings, chat rooms and Web sites, all places where sensitive data can leak. Other vendors in this space include GTB Technologies Inc.; like the products mentioned above, their tools examine outbound content for policy violations.
CONCLUSIONS
In a perfect world there would be no need to hand over sensitive data to agents that may unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect world we could watermark each object so that we could trace its origins with absolute certainty. However, in many cases we must indeed work with agents that may not be 100% trusted, and we may not be certain whether a leaked object came from an agent or from some other source. In spite of these difficulties, it is possible to assess the likelihood that an agent is responsible for a leak, based on the overlap of its data with the leaked data and the data of other agents, and based on the probability that the leaked objects could have been obtained through other means.
REFERENCES
[2] Sandip A. Kale and S. V. Kulkarni, “Data Leakage Detection: A …”
[3] P. Saranya, “Online Data Leakage Detection and Analysis,” IJART, Vol. 2.
[6] Archana Vaidya, Prakash Lahange, Kiran More, Shefali Kachroo, and Nivedita …