Presented by Anubhav, Saurav, Ravi, Ashutosh (ASRA Group), CSE/2k7
Guided by Prof. Binod Kumar

•ASRA Group

•13/07/2011

•1

1. Introduction
2. Motivation
3. Achieving Anonymity via Clustering
4. Proposed Algorithm
5. Experimental Results
6. Conclusion
7. Future Work


Data holders and statistics offices face tremendous demand for person-specific data for applications such as:
• Data mining
• Cost analysis
• Fraud detection


“How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data can’t be re-identified while the data remains practically useful for survey work”.

k-Anonymity Model


Zipcode  Age  Gender  Disease
75275    22   Male    Flu
75277    23   Male    Cold
75278    24   Male    Diabetes
75275    33   Male    Flu
75275    38   Female  Arthritis
75275    36   Female  Heart problem

Quasi-identifiers (approximate foreign keys): Zipcode, Age, Gender. Uniquely identify you!
Sensitive: Disease

Mobile number  Name    Zipcode  Gender  Age  Disease
9905150112     Amit    75275    Male    22   Flu
9905121223     John    75277    Male    23   Cold
9431103097     Rajan   75278    Male    24   Diabetes
9334292352     Robin   75275    Male    33   Flu
9431109087     Ramesh  75275    Female  38   Arthritis
9421345678     Dhoni   75275    Female  36   Arthritis

Identifying: Mobile number, Name
Quasi-identifiers (approximate foreign keys): Zipcode, Gender, Age
Sensitive: Disease


Age  Gender  Zip code  Disease
22   Male    75275     Flu
23   Male    75277     Cold
24   Male    75278     Diabetes
33   Male    75275     Flu
38   Female  75275     Arthritis
36   Female  75275     Heart problem

Quasi-identifiers (approximate foreign keys): Age, Gender, Zip code
Sensitive: Disease

Zip Code  Gender  Age  Disease   Expense
75277     Male    22   Flu       100
75277     Male    23   Cancer    3000
75277     Male    24   HIV+      5000
75275     Male    33   Diabetes  2500
75275     Female  38   Diabetes  2800
75275     Female  36   Diabetes  2600

Quasi-identifiers: approximate foreign keys

Zip Code  Gender  Age      Disease   Expense
7527*     Person  [21-30]  Flu       100
7527*     Person  [21-30]  Cancer    3000
7527*     Person  [21-30]  HIV+      5000
7527*     Person  [31-40]  Diabetes  2500
7527*     Person  [31-40]  Diabetes  2800
7527*     Person  [31-40]  Diabetes  2600
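
The generalization in the table above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `generalize` and `is_k_anonymous` are hypothetical, and the generalization rules (mask the last zip digit, replace gender with "Person", bucket age into ten-year ranges starting at 21, 31, ...) are assumptions read off the table, not a rule stated on the slides.

```python
from collections import Counter

def generalize(record):
    """Generalize one (zip, gender, age) tuple of quasi-identifiers.

    Assumed rules, inferred from the table: mask the last zip digit,
    suppress gender to "Person", and bucket age into [21-30], [31-40], ...
    """
    zip_code, _gender, age = record
    low = (age - 1) // 10 * 10 + 1
    return (zip_code[:-1] + "*", "Person", f"[{low}-{low + 9}]")

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(records)
    return all(n >= k for n in counts.values())

# Quasi-identifier columns of the six-row example table.
original = [
    ("75277", "Male", 22), ("75277", "Male", 23), ("75277", "Male", 24),
    ("75275", "Male", 33), ("75275", "Female", 38), ("75275", "Female", 36),
]
generalized = [generalize(r) for r in original]

print(is_k_anonymous(original, 2))     # every original row is unique
print(is_k_anonymous(generalized, 3))  # two groups of three rows each
```

Under these assumptions the generalized table forms two groups of three identical quasi-identifier tuples, so it satisfies k-anonymity for k = 3 but not for k = 4.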


Zip Code  Gender  Age      Disease   Expense
7527*     Male    [21-25]  Flu       100
7527*     Male    [21-25]  Cancer    3000
7527*     Male    [21-25]  HIV+      5000
75275     Person  [31-40]  Diabetes  2500
75275     Person  [31-40]  Diabetes  2800
75275     Person  [31-40]  Diabetes  2600

Zipcode  Gender  Age      Disease
83100*   Person  [25-30]  Flu
82530*   Person  [10-15]  Obesity
83400*   Person  [30-35]  Cancer
83100*   Person  [25-30]  HIV+
82530*   Person  [15-20]  Cancer
83400*   Person  [30-35]  Diabetes
82530*   Person  [25-30]  Obesity
83100*   Person  [25-30]  Flu
83400*   Person  [30-35]  Flu


How to decide the number of clusters?


Distance between two numerical values
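
A minimal sketch of one common definition, assuming the normalized difference used by Byun et al. in "Efficient k-Anonymization Using Clustering Techniques": the distance between two numeric values is |v1 - v2| divided by the size of the attribute's domain (max minus min). The function name `numeric_distance` and its parameters are hypothetical, and the ages are taken from the example tables.

```python
def numeric_distance(v1, v2, domain_min, domain_max):
    """Normalized distance |v1 - v2| / |D| between two numeric values.

    |D| is taken to be domain_max - domain_min (an assumption); a domain
    with a single value yields distance 0 to avoid dividing by zero.
    """
    domain_size = domain_max - domain_min
    if domain_size == 0:
        return 0.0
    return abs(v1 - v2) / domain_size

ages = [22, 23, 24, 33, 38, 36]
lo, hi = min(ages), max(ages)
print(numeric_distance(22, 24, lo, hi))  # close values, small distance
print(numeric_distance(22, 38, lo, hi))  # the two domain extremes
```

Normalizing by the domain size keeps every numeric attribute on the same 0 to 1 scale, so that attributes with large raw ranges (such as expense) do not dominate attributes with small ones (such as age).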


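Distance between two categorical values

Taxonomy tree of Country, level by level from the root:
Country
America | Asia
North | South | East | West
USA | Canada | Brazil | Mexico | Iran | Egypt | India | Pakistan

Fig: Taxonomy tree of Country

$\delta_C(v_i, v_j) = H(\Lambda(v_i, v_j)) / H(T_D)$

Here $\Lambda(v_i, v_j)$ is the subtree rooted at the lowest common ancestor of $v_i$ and $v_j$, $T_D$ is the taxonomy tree of the domain $D$, and $H(T)$ is the height of tree $T$.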
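
The taxonomy-based distance can be sketched as follows. The parent links in `PARENT` are an assumption: the edges of the figure did not survive, so leaves are paired with regions in the order they are listed (two leaves per region). The function names are hypothetical.

```python
# Child -> parent links of the Country taxonomy (edges are assumed).
PARENT = {
    "America": "Country", "Asia": "Country",
    "North": "America", "South": "America",
    "East": "Asia", "West": "Asia",
    "USA": "North", "Canada": "North",
    "Brazil": "South", "Mexico": "South",
    "Iran": "East", "Egypt": "East",
    "India": "West", "Pakistan": "West",
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def height(node):
    """Height of the subtree rooted at node (a leaf has height 0)."""
    children = [c for c, p in PARENT.items() if p == node]
    return 0 if not children else 1 + max(height(c) for c in children)

def lowest_common_ancestor(a, b):
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("values are not in the same taxonomy tree")

def categorical_distance(a, b, root="Country"):
    """H(subtree at the lowest common ancestor) / H(whole tree)."""
    return height(lowest_common_ancestor(a, b)) / height(root)

print(categorical_distance("USA", "Canada"))  # same region
print(categorical_distance("USA", "Brazil"))  # same continent
print(categorical_distance("USA", "India"))   # only the root in common
```

With this tree of height 3, two countries in the same region are at distance 1/3, two in the same continent at 2/3, and two in different continents at 1. These three examples do not depend on how the Asian leaves are assigned to East and West.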

Function greedy_k_member_clustering (S, k)
  If ( |S| ≤ k )
    Return S;
  End if;
  Result = Ø; r = a randomly picked record from S;
  While ( |S| ≥ k )
    r = the furthest record from r;
    S = S - {r};
    C = {r};
    While ( |C| < k )
      r = find_best_record(S, C);
      S = S - {r};
      C = C ∪ {r};
    End while;
    Result = Result ∪ {C};
  End while;
  While ( |S| ≠ 0 )
    r = a randomly picked record from S;
    S = S - {r};
    C = find_best_cluster(Result, r);
    C = C ∪ {r};
  End while;
  Return Result;
End;

Function find_best_record (S, c)
Input: a set of records S and a cluster c
Output: a record r ∈ S such that IL(c ∪ {r}) is minimal
  n = |S|; min = ∞; best = null;
  For ( i = 1..n )
    r = i-th record in S;
    diff = IL(c ∪ {r}) - IL(c);
    If ( diff < min )
      min = diff; best = r;
    End if;
  End for;
  Return best;
End;


Function find_best_cluster (C, r)
Input: a set of clusters C and a record r
Output: a cluster c ∈ C such that IL(c ∪ {r}) is minimal
  n = |C|; min = ∞; best = null;
  For ( i = 1..n )
    c = i-th cluster in C;
    diff = IL(c ∪ {r}) - IL(c);
    If ( diff < min )
      min = diff; best = c;
    End if;
  End for;
  Return best;
End;
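
The three functions above can be sketched in runnable form as follows. This is only a sketch under stated assumptions: records are tuples of numeric attributes, and because the slides do not define IL, `information_loss` uses an assumed definition (cluster size times the sum of each attribute's normalized value range). The helper `record_distance`, the `widths` parameter, and the `seed` argument are hypothetical additions.

```python
import random

def information_loss(cluster, widths):
    """Assumed IL: |cluster| * sum over attributes of (max - min) / domain width."""
    if not cluster:
        return 0.0
    spread = sum((max(col) - min(col)) / w
                 for col, w in zip(zip(*cluster), widths))
    return len(cluster) * spread

def record_distance(r1, r2, widths):
    """Sum of normalized per-attribute differences (assumed record distance)."""
    return sum(abs(a - b) / w for a, b, w in zip(r1, r2, widths))

def find_best_record(S, c, widths):
    """Record in S whose addition to cluster c increases IL the least."""
    base = information_loss(c, widths)
    return min(S, key=lambda r: information_loss(c + [r], widths) - base)

def find_best_cluster(clusters, r, widths):
    """Cluster whose IL increases the least when record r is added."""
    return min(clusters, key=lambda c: information_loss(c + [r], widths)
               - information_loss(c, widths))

def greedy_k_member_clustering(records, k, seed=0):
    S = list(records)
    if len(S) <= k:
        return [S]  # one cluster, wrapped in a list for a uniform return type
    # Domain width per attribute; a constant attribute gets width 1.
    widths = [(max(col) - min(col)) or 1 for col in zip(*S)]
    rng = random.Random(seed)
    result = []
    r = rng.choice(S)
    while len(S) >= k:
        # Seed the next cluster with the record furthest from the previous r.
        r = max(S, key=lambda x: record_distance(x, r, widths))
        S.remove(r)
        c = [r]
        while len(c) < k:
            r = find_best_record(S, c, widths)
            S.remove(r)
            c.append(r)
        result.append(c)
    # Fewer than k records remain: attach each to its best existing cluster.
    while S:
        r = S.pop(rng.randrange(len(S)))
        find_best_cluster(result, r, widths).append(r)
    return result

# (age, expense) pairs; the first six come from the example table.
records = [(22, 100), (23, 3000), (24, 5000), (33, 2500),
           (38, 2800), (36, 2600), (30, 2700)]
clusters = greedy_k_member_clustering(records, k=3)
for c in clusters:
    print(len(c), sorted(c))
```

Every returned cluster holds at least k records, so generalizing each cluster to a common quasi-identifier value yields a k-anonymous table. Which records end up together depends on the random starting record and on the assumed IL.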


The time complexity of this algorithm is $O((n^2 \log n)/c)$, where $c$ is the average number of records in each cluster. This is better than the time complexity of the greedy k-member algorithm.

• It is difficult to decide a proper value for the user-defined threshold.
• This algorithm might delete many records, which in turn causes significant information loss.
• This algorithm is less sensitive to outliers.

The main goal of the experiments was to investigate the implementation of the k-anonymity model using a clustering algorithm. We mainly focus on data quality, k-anonymization, and scalability, which are the main considerations of the k-anonymity model.

Finally, data quality is the major problem in k-anonymization. We therefore focus on data quality rather than computational efficiency, since data quality should be the main consideration in the k-anonymity model. We are encouraged by our results, which demonstrate that our algorithm is flexible and able to produce a range of desired anonymizations.

• Encouraged by the experimental results, we are currently working on more efficient heuristics to improve the performance of our approach.
• We are also working to utilize this clustering algorithm to detect fraud.
