# Detecting Data Leakage

Hector Garcia-Molina
hector@cs.stanford.edu

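Leakage Problem

[Figure: user profiles (e.g. Name: Sarah, Sex: Female; Name: Mark, Sex: Male) for Kathryn, Jeremy, Sarah and Mark are shared with applications App. U1 and App. U2. A profile can also become known through other sources, e.g. Sarah’s Network.]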

Stanford Infolab


Outline
• Problem Description
• Guilt Models
  – Pr{U1 leaked data} = 0.7
  – Pr{U2 leaked data} = 0.2
• Distribution Strategies


Problem Entities
• Distributor: Facebook
• Dataset T: set of all Facebook profiles
• Agents U1, …, Un: Facebook apps
• Agent data R1, …, Rn: Ri is the set of profiles of people who have added the application Ui
• Leaker: S is the set of leaked profiles

Agents’ Data Requests
• Sample
  – 100 profiles of Stanford people
• Explicit
  – All people who added the application (the example used so far)
  – All Stanford profiles


Guilt Models (1/3)
• p: posterior probability that a leaked profile comes from other sources (e.g. Sarah’s Network)
• Guilty Agent: agent who leaks at least one profile
• Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S

Guilt Models (2/3)
• Agents leak each of their data items independently
• OR: agents leak all their data items or nothing

[Figure: the possible outcomes for two leaked profiles, labeled p², p(1-p), (1-p)p and (1-p)²]
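As a rough illustration of the independent-leak model, the sketch below estimates Pr{Gi|S} for each agent. The slides do not spell out a formula, so the one used here is an assumption: each leaked profile comes from an agent with probability (1 - p), that blame is split evenly among the agents holding the profile, and profiles are treated independently. The function name `guilt_probability` and the example sets are hypothetical.

```python
def guilt_probability(agent_set, leaked, all_agent_sets, p):
    """Estimate Pr{G_i | S} under an assumed independent-leak model.

    agent_set      -- R_i, the profiles given to agent U_i (must be in all_agent_sets)
    leaked         -- S, the set of leaked profiles
    all_agent_sets -- list of all R_1..R_n
    p              -- probability that a leaked profile came from other sources
    """
    not_guilty = 1.0
    for profile in leaked & agent_set:
        # Number of agents that were given this profile (at least one: U_i itself).
        holders = sum(1 for r in all_agent_sets if profile in r)
        # Assumption: an agent leaked the profile with probability (1 - p),
        # and each holder is equally likely to be the one who did.
        not_guilty *= 1.0 - (1.0 - p) / holders
    return 1.0 - not_guilty


# Hypothetical example: U1 holds both leaked profiles, U2 holds only one.
r1 = {"Sarah", "Mark"}
r2 = {"Mark"}
leaked_set = {"Sarah", "Mark"}
g1 = guilt_probability(r1, leaked_set, [r1, r2], p=0.2)
g2 = guilt_probability(r2, leaked_set, [r1, r2], p=0.2)
print(round(g1, 2), round(g2, 2))
```

In this toy case the agent that alone holds one of the leaked profiles receives a much higher estimate than the agent that only shares a profile with someone else.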

Guilt Models (3/3)

[Figure: two plots of Pr{G1} and Pr{G2}, one for the model where agents leak items independently and one for the model where they do NOT leak independently]


The Distributor’s Objective (1/2)

[Figure: agents U1, U2, U3 and U4 receive the sets R1, R2, R3 and R4; the leaked set S is compared against them]

• Pr{G1|S} >> Pr{G2|S}
• Pr{G1|S} >> Pr{G4|S}

The Distributor’s Objective (2/2)
• To achieve his objective, the distributor has to distribute sets R1, …, Rn that minimize

  $$\sum_{i=1}^{n} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \qquad i, j = 1, \ldots, n$$

• Intuition: minimized data sharing among agents makes leaked data reveal the guilty agents
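The quantity the distributor minimizes is, for every agent, the total overlap of that agent's set with all other agents' sets, divided by the size of the agent's own set, summed over all agents. A minimal sketch of that computation follows; the function name `overlap_objective` and the example sets are illustrative assumptions, and every set is assumed to be non-empty.

```python
def overlap_objective(agent_sets):
    """Sum over agents i of (1/|R_i|) * sum over j != i of |R_i & R_j|.

    agent_sets -- list of non-empty sets R_1..R_n
    """
    total = 0.0
    for i, r_i in enumerate(agent_sets):
        # Total number of items agent i shares with every other agent.
        shared = sum(len(r_i & r_j) for j, r_j in enumerate(agent_sets) if j != i)
        total += shared / len(r_i)
    return total


# Disjoint sets share nothing, so the objective is zero.
disjoint = [{"Kathryn", "Jeremy"}, {"Sarah", "Mark"}]
# These two sets share one profile out of two each: 1/2 + 1/2.
overlapping = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"}]
print(overlap_objective(disjoint), overlap_objective(overlapping))
```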

Distribution Strategies – Sample (1/4)
• Set T has four profiles:
  – Kathryn, Jeremy, Sarah and Mark
• There are 4 agents:
  – U1, U2, U3 and U4
• Each agent requests a sample of any 2 profiles of T for a market survey

Distribution Strategies – Sample (2/4)

[Figure: two allocations of the four profiles to agents U1, U2, U3 and U4: a poor one, and one that minimizes $\sum_{i \neq j} |R_i \cap R_j|$]

Distribution Strategies – Sample (3/4)
• Optimal Distribution

[Figure: allocation of the four profiles to agents U1, U2, U3 and U4]

• Avoid full overlaps and minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$
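For the four-profile, four-agent example, the search space is small enough to enumerate. The sketch below is a brute-force illustration, not the talk's algorithm: it tries every way of giving each agent 2 of the 4 profiles, minimizes the normalized overlap sum, and breaks ties by preferring allocations with fewer fully identical sets. All function names are hypothetical.

```python
from itertools import combinations, product

PROFILES = ["Kathryn", "Jeremy", "Sarah", "Mark"]


def overlap_objective(agent_sets):
    """Sum over agents i of (1/|R_i|) * sum over j != i of |R_i & R_j|."""
    total = 0.0
    for i, r_i in enumerate(agent_sets):
        shared = sum(len(r_i & r_j) for j, r_j in enumerate(agent_sets) if j != i)
        total += shared / len(r_i)
    return total


def full_overlaps(agent_sets):
    """Count pairs of agents that received exactly the same set."""
    return sum(1 for a, b in combinations(agent_sets, 2) if a == b)


def best_allocation(profiles, n_agents, sample_size):
    """Exhaustively search all allocations (feasible only for tiny inputs)."""
    choices = [frozenset(c) for c in combinations(profiles, sample_size)]
    return min(
        product(choices, repeat=n_agents),
        # Primary criterion: the overlap objective; tie-break: fewest full overlaps.
        key=lambda alloc: (overlap_objective(alloc), full_overlaps(alloc)),
    )


# A poor allocation: every agent gets the same two profiles.
poor = [frozenset({"Kathryn", "Jeremy"})] * 4
best = best_allocation(PROFILES, n_agents=4, sample_size=2)
print("poor:", overlap_objective(poor), "best:", overlap_objective(best))
for agent, profiles in zip(["U1", "U2", "U3", "U4"], best):
    print(agent, sorted(profiles))
```

Because 8 profile slots must be filled from only 4 profiles, some sharing is unavoidable here. The search shows that the sharing can still be spread so that no two agents hold identical sets.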

Distribution Strategies – Sample (4/4)

Distribution Strategies

Sample Data Requests
• The distributor has the freedom to select the data items to provide the agents with
• General Idea:
  – Provide agents with sets of data that are as disjoint as possible
• Problem: there are cases where the distributed data must overlap, e.g. |R1|+…+|Rn| > |T|

Explicit Data Requests
• The distributor must provide agents with the data they request
• General Idea:
  – Add fake data to the distributed data to minimize the overlap of distributed data
• Problem: agents can collude and identify fake data
• NOT COVERED in this talk
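For sample requests, the "as disjoint as possible" idea can be sketched with a simple greedy heuristic: each agent receives the items that have been handed out the fewest times so far. This is only an illustrative assumption, not the strategy presented in the talk, and the helper names are hypothetical. The first function checks the case where overlap is unavoidable because the requested sizes add up to more than |T|.

```python
def must_overlap(request_sizes, dataset_size):
    """True when the requested sizes add up to more than |T| (overlap is forced)."""
    return sum(request_sizes) > dataset_size


def greedy_sample_allocation(dataset, request_sizes):
    """Give each agent the items that have been distributed the fewest times.

    dataset       -- list of items in T
    request_sizes -- number of items each agent asked for, in agent order
    """
    times_given = {item: 0 for item in dataset}
    allocation = []
    for size in request_sizes:
        # sorted() is stable, so ties keep the original dataset order.
        chosen = sorted(dataset, key=lambda item: times_given[item])[:size]
        for item in chosen:
            times_given[item] += 1
        allocation.append(set(chosen))
    return allocation


T = ["Kathryn", "Jeremy", "Sarah", "Mark"]
two_agents = greedy_sample_allocation(T, [2, 2])
print(must_overlap([2, 2], len(T)), two_agents)
print(must_overlap([2, 2, 2, 2], len(T)))
```

The heuristic only balances how often each item is shared. It does not try to avoid two agents receiving identical sets, so it should be read as a starting point rather than an optimal strategy.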

Conclusions
• Data Leakage
• Modeled as a maximum likelihood problem
• Data distribution strategies that help identify the guilty agents

Thank You!