
Ramakrishnan Srikant

www.almaden.ibm.com/cs/people/srikant/

Motivation

Privacy & Data Mining: Adversaries forever?

The primary task in data mining: development of models about aggregated data.

Can we develop accurate models without access to precise information in individual data records?

1

Papers

R. Agrawal and R. Srikant, "Privacy Preserving Data Mining", Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, May 2000.

Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining", Crypto 2000, August 2000.

2

Talk Overview

Randomization protects information at the individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions in data mining algorithms, e.g. to build a decision-tree classifier.

How well does it work?

3

Privacy Surveys

[CRA99b] survey of web users:
- 17% privacy fundamentalists
- 56% pragmatic majority
- 27% marginally concerned

[Wes99] survey of web users:
- 82%: privacy policy matters
- 14% don't care

Users are not equally protective of every field:
- may not divulge certain fields at all;
- may not mind giving true values of certain fields;
- may be willing to give modified values, but not true values, of certain fields.

4

Related Work: Statistical Databases

Statistical databases: provide statistical information without compromising sensitive information about individuals (surveys: [AW89] [Sho82]).

Query Restriction:
- restrict the size of query results (e.g. [FEL72] [DDS79])
- control overlap among successive queries (e.g. [DJL79])
- keep an audit trail of all answered queries (e.g. [CO82])
- suppress small data cells (e.g. [Cox80])
- cluster entities into mutually exclusive atomic populations (e.g. [YC77])

Data Perturbation:
- replace the original database by a sample from the same distribution (e.g. [LST83] [LCL85] [Rei84])
- sample the result of a query (e.g. [Den80])
- swap values between records (e.g. [Den82])
- add noise to the query result (e.g. [Bec80])
- add noise to the values (e.g. [TYW84] [War65])

5

Related Work: Statistical Databases (cont.)

Negative results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89].

Contrast with privacy preserving data mining:
- We also want to prevent disclosure of confidential information.
- But it is sufficient to reconstruct the original distribution of data values, i.e. we are not interested in high quality point estimates.

6

Using Randomization to Protect Privacy

Return xi + r instead of xi, where r is a random value drawn from a distribution:
- Uniform
- Gaussian

Fixed perturbation: not possible to improve estimates by repeating queries.

Algorithm knows the parameters of r's distribution.

7
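The perturbation step above is simple to sketch. The helper below is illustrative only: the function name and the `alpha` scale parameter are assumptions, not from the paper; `alpha` is the half-width of the uniform interval or the Gaussian standard deviation.

```python
import random

def randomize(x, distribution="uniform", alpha=1.0):
    """Return x + r, where r is random noise.

    `alpha` (illustrative name) is the half-width of the uniform
    interval [-alpha, +alpha], or the Gaussian standard deviation.
    """
    if distribution == "uniform":
        r = random.uniform(-alpha, alpha)
    elif distribution == "gaussian":
        r = random.gauss(0.0, alpha)
    else:
        raise ValueError("unknown distribution: " + distribution)
    return x + r
```

Because the perturbation is fixed per record (each xi is randomized once), repeated queries return the same xi + r and cannot be averaged to cancel the noise.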

Talk Overview

Randomization protects information at the individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions to build a decision-tree classifier.

How well does it work?

8

Reconstruction Problem

Original values x1, x2, ..., xn:
- realizations of iid random variables X1, X2, ..., Xn,
- each with the same distribution as random variable X.

To hide these values, we use y1, y2, ..., yn:
- realizations of iid random variables Y1, Y2, ..., Yn,
- each with the same distribution as random variable Y.

Given
  x1 + y1, x2 + y2, ..., xn + yn
and the density function fY for Y,
estimate the density function fX for X.

9

Using Bayes' Rule

Assume we know both fX and fY .

Let wi xi + yi .

fX1 (a j X1 + Y1 = w1)

fX1+Y1 (w1 j X1 = a) fX1 (a)

=

fX1+Y1 (w1)

fX1+Y1 (w1 j X1 = a) fX1 (a)

= R1

1 fX1+Y1 (w1 j X1 = z ) fX1 (z ) dz

fY (w1 a) fX (a)

1 1

= R 1 f (w z ) f (z ) dz (Y1 independent of X1)

1 Y1 1 X1

fY (w1 a) fX (a)

= R1 (fX1 fX , fY1 fY )

1 Y 1

f ( w z ) f X ( z ) dz

1X

n

0

fX (a) n fXi (a j Xi + Yi = wi)

i=1

1X

n

R1 Y i

f (w a) fX (a)

=

n

i=1 1 fY (wi z ) fX (z ) dz

10

Reconstruction Method: Algorithm

fX^0 := uniform distribution
j := 0  // iteration number
repeat
    use the Bayes'-rule equation to compute a new estimate fX^{j+1}
    j := j + 1
until (stopping criterion met)

Stopping criterion: stop when the difference between successive estimates of the original distribution becomes very small (1% of the threshold of the χ² test).

11

Using Partitioning to Speed Computation

Approximate the distance (wi − z) by the distance between the mid-points of the intervals in which wi and z lie, and the density function fX(a) by the average of the density function over the interval in which a lies. The update

f'_X(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{f_Y(w_i - a)\, f_X(a)}
       {\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}

becomes

\Pr{}'(X \in I_p) = \frac{1}{n} \sum_{s=1}^{k} N(I_s)\,
  \frac{f_Y(m(I_s) - m(I_p))\, \Pr(X \in I_p)}
       {\sum_{t=1}^{k} f_Y(m(I_s) - m(I_t))\, \Pr(X \in I_t)}

where m(I) is the mid-point of interval I, N(I_s) is the number of perturbed values w_i that lie in interval I_s, and k is the number of intervals.

12
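The interval-based update above can be sketched as a small iterative routine. This is an illustrative implementation, not the authors' code; all names (`reconstruct`, `f_y`, `k`, `lo`, `hi`) are assumptions.

```python
def reconstruct(w, f_y, k, lo, hi, iterations=50):
    """Iteratively estimate Pr(X in I_p) for k equal-width intervals
    over [lo, hi], given perturbed values w and noise density f_y.
    Sketch of the interval-based Bayes update from the slide."""
    width = (hi - lo) / k
    mid = [lo + (p + 0.5) * width for p in range(k)]   # m(I_p)
    # N[s]: number of perturbed values w_i falling in interval I_s
    N = [0] * k
    for wi in w:
        s = min(max(int((wi - lo) / width), 0), k - 1)
        N[s] += 1
    n = len(w)
    pr = [1.0 / k] * k                # start from the uniform distribution
    for _ in range(iterations):
        # denominator: sum_t f_Y(m(I_s) - m(I_t)) Pr(X in I_t)
        denom = [sum(f_y(mid[s] - mid[t]) * pr[t] for t in range(k))
                 for s in range(k)]
        new = []
        for p in range(k):
            acc = sum(N[s] * f_y(mid[s] - mid[p]) / denom[s]
                      for s in range(k) if denom[s] > 0)
            new.append(pr[p] * acc / n)
        z = sum(new)                  # renormalize (guards rounding drift)
        pr = [v / z for v in new] if z > 0 else pr
    return pr
```

With all original values equal to 0.55 and uniform noise on [-0.25, 0.25], the reconstructed mass concentrates near the interval containing 0.55, mirroring the plots on the next slide.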

How well does this work?

Uniform random variable [-0.5, 0.5]

[Two plots of number of records vs. attribute value, each comparing the Original, Randomized, and Reconstructed distributions.]

13

Talk Overview

Randomization protects information at the individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions to build a decision-tree classifier.

How well does it work?

14

Decision Tree Classification

Classification: given a set of classes, and a set of records in each class, develop a model that predicts the class of a new record.

Partition(Data S)
begin
    if (most points in S are of the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
end

Initial call: Partition(TrainingData)

15
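The Partition pseudocode above can be sketched in runnable form. The slide does not specify the split criterion or the stopping threshold, so Gini impurity and a 95% purity cutoff are assumptions; all names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, attr):
    """Best threshold on numeric attribute `attr` by weighted Gini.
    Candidate thresholds are mid-points between distinct values."""
    best = (None, float("inf"))
    values = sorted({r[attr] for r in rows})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [l for r, l in zip(rows, labels) if r[attr] < t]
        right = [l for r, l in zip(rows, labels) if r[attr] >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[1]:
            best = (t, score)
    return best

def partition(rows, labels, attrs, purity=0.95):
    """Recursive Partition(S): stop when one class dominates, else
    split on the best (attribute, threshold) pair."""
    majority, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= purity:
        return majority               # leaf: predict the majority class
    a, t, _ = min(((a,) + best_split(rows, labels, a) for a in attrs),
                  key=lambda s: s[2])
    if t is None:
        return majority               # no usable split left
    left = [(r, l) for r, l in zip(rows, labels) if r[a] < t]
    right = [(r, l) for r, l in zip(rows, labels) if r[a] >= t]
    return (a, t,
            partition([r for r, _ in left], [l for _, l in left], attrs, purity),
            partition([r for r, _ in right], [l for _, l in right], attrs, purity))
```

On the six-record training set of the next slide, this produces a two-level tree: an age split whose older branch is refined by a salary split.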

Example: Decision Tree Classification

Training Set:

Age   Salary   Credit Risk
23    50K      High
17    30K      High
43    40K      High
68    50K      Low
32    70K      Low
20    20K      High

Decision Tree:

Age < 25?
  yes -> High
  no  -> Salary < 50K?
           yes -> High
           no  -> Low

16

Training using Randomized Data

Need to modify two key operations:
- Determining a split point.
- Partitioning the data.

When and how do we reconstruct the original distribution?
- Reconstruct using the whole data (globally), or reconstruct separately for each class?
- Reconstruct once at the root node, or reconstruct at every node?

17

Training using Randomized Data (cont.)

Determining split points:
- Candidate splits are interval boundaries.
- Use statistics from the reconstructed distribution.

Partitioning the data:
- Reconstruction gives an estimate of the number of points in each interval.
- Associate each data point with an interval by sorting the values.

18

Algorithms

Global:
- Reconstruct each attribute once at the beginning.
- Induce the decision tree using the reconstructed data.

ByClass:
- For each attribute, first split by class, then reconstruct separately for each class.
- Induce the decision tree using the reconstructed data.

Local:
- As in ByClass, split by class and reconstruct separately for each class.
- However, reconstruct at each node (not just once).

19

Talk Overview

Randomization protects information at the individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions to build a decision-tree classifier.

How well does it work?

20

Methodology

Compare the accuracy of Global, ByClass and Local against:
- Original: unperturbed data without randomization.
- Randomized: perturbed data, but without making any corrections for randomization.

Synthetic data generator from [AGI+92].

Training set of 100,000 records, split equally between the two classes.

21

Quantifying Privacy

If it can be estimated with c% confidence that a value x lies in the interval [x1, x2], then the interval width (x2 − x1) defines the amount of privacy at the c% confidence level.

Example: privacy level for Age in [10, 90] (range: 80)
- Given a perturbed value of 40,
- 95% confidence that the true value lies in [30, 50].
- Interval width 20 ⇒ 25% privacy level.

Uniform noise: drawn from [−α, +α]. Gaussian noise: mean 0 and standard deviation σ.

Confidence    50%        95%         99.9%
Uniform       0.5 × 2α   0.95 × 2α   0.999 × 2α
Gaussian      1.34 σ     3.92 σ      6.8 σ

22
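The privacy definition and table above can be turned into a small calculator. A sketch with illustrative names, using the rounded two-sided z-values that correspond to the slide's Gaussian row:

```python
def privacy_interval_width(noise, scale, confidence):
    """Interval width at a two-sided confidence level for additive noise.
    `scale` is alpha (half-width) for uniform noise, sigma for Gaussian.
    Function and parameter names are illustrative, not from the paper."""
    if noise == "uniform":
        # a c% confidence interval covers c% of the full width 2*alpha
        return confidence * 2 * scale
    if noise == "gaussian":
        # rounded z-values matching the slide's 1.34, 3.92, 6.8 sigma
        z = {0.50: 0.67, 0.95: 1.96, 0.999: 3.4}
        return 2 * z[confidence] * scale
    raise ValueError("unknown noise: " + noise)

def privacy_level(width, attribute_range):
    """Privacy as a percentage of the attribute range, e.g. 20/80 -> 25%."""
    return 100.0 * width / attribute_range
```

Using the slide's example: a width-20 interval over the Age range of 80 gives a 25% privacy level at 95% confidence.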

Synthetic Data Functions

Class A if the function is true, Class B otherwise.

F1: (age < 40) ∨ (60 ≤ age)

F2: ((age < 40) ∧ (50K ≤ salary ≤ 100K)) ∨
    ((40 ≤ age < 60) ∧ (75K ≤ salary ≤ 125K)) ∨
    ((age ≥ 60) ∧ (25K ≤ salary ≤ 75K))

F3: ((age < 40) ∧
      (((elevel ∈ [0..1]) ∧ (25K ≤ salary ≤ 75K)) ∨
       ((elevel ∈ [2..3]) ∧ (50K ≤ salary ≤ 100K)))) ∨
    ((40 ≤ age < 60) ∧
      (((elevel ∈ [1..3]) ∧ (50K ≤ salary ≤ 100K)) ∨
       ((elevel = 4) ∧ (75K ≤ salary ≤ 125K)))) ∨
    ((age ≥ 60) ∧
      (((elevel ∈ [2..4]) ∧ (50K ≤ salary ≤ 100K)) ∨
       ((elevel = 1) ∧ (25K ≤ salary ≤ 75K))))

F4, F5: thresholds on disposable income, of the form (... + 0.2 × equity − 10K > 0),
where equity = 0.1 × hvalue × max(hyears − 20, 0).

23
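For illustration, the simpler labeling functions above can be written as predicates. F1 is taken directly from the slide; the first clause of F2 is partially garbled in this copy, so its reconstruction here should be treated as an assumption.

```python
def classify_f1(age):
    """Label under F1: Class A if (age < 40) or (age >= 60)."""
    return "A" if age < 40 or age >= 60 else "B"

def classify_f2(age, salary):
    """Label under F2 (salary in dollars).  The first (age < 40)
    clause is an assumed reconstruction; the other two clauses are
    legible on the slide."""
    in_a = ((age < 40 and 50_000 <= salary <= 100_000) or
            (40 <= age < 60 and 75_000 <= salary <= 125_000) or
            (age >= 60 and 25_000 <= salary <= 75_000))
    return "A" if in_a else "B"
```

A record is assigned Class A when the predicate holds and Class B otherwise, matching the rule stated on the slide.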

Classification Accuracy

[Two bar charts of accuracy (%) on datasets Fn1-Fn5, comparing Original, Local, ByClass, Global, and Randomized: one at a privacy level of 25% of the attribute range, one at 100% of the attribute range.]

24

Change in Accuracy with Privacy

[Two line plots of accuracy (%) vs. privacy level (%) for functions Fn 1 and Fn 3, comparing Original, ByClass(G), ByClass(U), Random(G), and Random(U), where (G) and (U) denote Gaussian and uniform noise.]

25

Conclusions

Have your cake and mine it too!
- Preserve privacy at the individual level, but still build accurate models.

Future work:
- other data mining algorithms,
- characterize loss in accuracy,
- other randomization functions,
- better reconstruction techniques, ...

26
