(IJCNS) International Journal of Computer and Network Security,Vol. 2, No. 10, 2010
Evaluating Clustering Performance for the Protected Data using Perturbative Masking Techniques in Privacy Preserving Data Mining
S. Vijayarani (1), Dr. A. Tamilarasi (2)
(1) School of Computer Science and Engg., Bharathiar University, Coimbatore, Tamilnadu, India
vijimohan_2000@yahoo.com
(2) Dept. of MCA, Kongu Engg. College, Erode, Tamilnadu, India
drtamil@kongu.ac.in
Abstract - Privacy preserving data mining has become a popular approach for protecting confidential knowledge extracted by data mining techniques. It is the study of how to produce valid mining models and patterns without disclosing private information. Several techniques are used for protecting sensitive data, including statistical, cryptographic and randomization methods, the k-anonymity model and l-diversity. In this work, we analyze two statistical disclosure control techniques, additive noise and micro aggregation, and examine their clustering performance. The experimental results show that the clustering performance of the additive noise technique is comparatively better than that of micro aggregation.
Keywords - Data Perturbation, Micro Aggregation, Additive Noise, K-means clustering
 
1. Introduction
The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users and the increasing sophistication of data mining algorithms that leverage this information. Many data mining applications, such as those involving financial transactions, health-care records and network communication traffic, deal with private, sensitive data. Data is an important asset that business organizations and governments analyze for decision making. Privacy regulations and other privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goal of privacy preservation and accurate data mining results.

The main consideration in privacy preserving data mining is twofold. First, sensitive raw data such as identifiers, names and addresses should be modified or trimmed out from the original database. Second, sensitive knowledge which can be mined from a database by using data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy.

The main objective in privacy preserving data mining is to develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process [8]. Data modification is one of the privacy preserving techniques. It modifies the sensitive or original information in a database that needs to be released to the public, and is intended to provide a high level of privacy protection.

The rest of this paper is organized as follows. In Section 2, we present an overview of micro data and masking techniques. Section 3 discusses different types of micro data protection techniques. Additive noise and micro aggregation techniques are discussed in Section 4. Section 5 gives the performance results of additive noise and micro aggregation. Conclusions are given in Section 6.
2. Micro Data
Micro data are static data on individual respondents, which must be protected before release. They can be represented as tables consisting of tuples (records) with values from a set of attributes. A micro data set V is a file with n records, where each record contains m attributes on an individual respondent [3]. The attributes can be classified into four categories, which are not necessarily disjoint:
- Identifiers. These are attributes that unambiguously identify the respondent. Examples are the passport number, social security number, name and surname, etc.
- Quasi-identifiers or key attributes. These are attributes which identify the respondent with some degree of ambiguity. Examples are address, gender, age, telephone number, etc.
- Confidential outcome attributes. These are attributes which contain sensitive information on the respondent. Examples are salary, religion, political affiliation, health condition, etc.
- Non-confidential outcome attributes. These are attributes which do not fall in any of the categories above.
 
3. Classification of micro data protection techniques (MPTs) [3]
Figure 1. Micro-data Protection Techniques
3.1 Masking Techniques
Protecting sensitive data is a significant issue for government, public and private bodies. Masking techniques are used to prevent the disclosure of confidential information in a table. They can operate on different data types, which can be categorized as follows.
- Continuous. An attribute is said to be continuous if it is numerical and arithmetic operations are defined on it. For instance, age and income are continuous attributes.
- Categorical. An attribute is said to be categorical if it can assume a limited and specified set of values and arithmetic operations do not make sense on it. For instance, marital status and sex are categorical attributes.

Masking techniques are classified into two categories:
- Perturbative
- Non-perturbative
3.1.1 Perturbative Masking
Perturbation alters an attribute value by replacing it with a new value. The data set is distorted before publication, so the protected data set may contain some errors. In this way, unique combinations of data items in the original dataset may disappear and new unique combinations may appear in the perturbed dataset. A perturbation method should be such that statistics computed on the perturbed dataset do not differ significantly from the statistics obtained on the original dataset [3]. Some of the perturbative masking methods are:
- Micro aggregation
- Rank swapping
- Additive noise
- Rounding
- Resampling
- PRAM
- MASSC, etc.
3.1.2 Non-Perturbative Masking
Non-perturbative techniques produce protected microdata by eliminating details from the original microdata. Some of the non-perturbative masking methods are:
- Sampling
- Local Suppression
- Global Recoding
- Top-Coding
- Bottom-Coding
- Generalization
3.2 Synthetic Techniques
The original set of tuples in a microdata table is replaced with a new set of tuples generated in such a way as to preserve the key statistical properties of the original data. The generation process is usually based on a statistical model, and key statistical properties that are not included in the model will not necessarily be preserved by the synthetic data. Since the released micro data table contains synthetic data, the re-identification risk is reduced. The techniques are divided into two categories: fully synthetic techniques and partially synthetic techniques. The first category contains techniques that generate a completely new set of data, while the techniques in the second category merge the original data with synthetic data.
3.2.1 Fully Synthetic Techniques
 
- Bootstrap
- Cholesky Decomposition
- Multiple Imputation
- Maximum Entropy
- Latin Hypercube Sampling
 
3.2.2 Partially Synthetic Techniques
- IPSO (Information Preserving Statistical Obfuscation)
- Hybrid Masking
- Random Response
- Blank and Impute
- SMIKe (Selective Multiple Imputation of Keys)
- Multiply Imputed Partially Synthetic Dataset [3]
4. Analysis of the SDC techniques
The main steps involved in this work are:
- Selecting a sensitive numerical data item from the database
- Modifying the sensitive data item using micro aggregation and additive noise
- Analyzing the statistical performance
- Analyzing the accuracy of privacy protection
- Evaluating the clustering accuracy
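The last step can be illustrated with a small sketch. The paper does not give its exact measure of clustering accuracy, so the code below assumes one plausible reading: run k-means on the original and on the masked values of a single numeric attribute, and report the fraction of records that keep the same cluster. The function names `kmeans_1d` and `cluster_agreement`, the deterministic initialization and the toy data are all hypothetical.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's k-means on a list of numbers; returns one label per value."""
    ordered = sorted(values)
    # Deterministic start (an assumption): spread initial centroids over the sorted data.
    centroids = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    # Renumber clusters by ascending centroid so labels are comparable across runs.
    rank = {c: r for r, c in enumerate(sorted(range(k), key=lambda c: centroids[c]))}
    return [rank[lab] for lab in labels]


def cluster_agreement(original, masked, k):
    """Fraction of records that keep the same cluster label after masking."""
    before = kmeans_1d(original, k)
    after = kmeans_1d(masked, k)
    same = sum(1 for b, a in zip(before, after) if b == a)
    return same / len(original)


original = [10, 12, 11, 50, 52, 51, 90, 91, 95]
masked = [11, 11, 11, 51, 51, 51, 92, 92, 92]
print(cluster_agreement(original, masked, k=3))
```

Ordering the clusters by centroid is a convenience that only works for one-dimensional data; with several attributes, the two labelings would need to be matched in some other way.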
 
4.1 Micro aggregation
Micro aggregation is an SDC technique that consists of aggregating individual data. It can be considered an SDC sub-discipline devoted to the protection of micro data. Micro aggregation can be seen as a clustering problem with constraints on the size of the clusters. It is related to other clustering problems (e.g., dimension reduction or minimum squares design of clusters). However, the main difference is that the micro aggregation problem does not fix the number of clusters to generate or the number of dimensions to reduce, but only the minimum number of elements that are grouped in the same cluster [9]. For any type of data, micro aggregation can be operationally defined in terms of the following two steps:
- Partition: The set of original records is partitioned into several clusters in such a way that records in the same cluster are similar to each other and the number of records in each cluster is at least k.
- Aggregation: An aggregation operator (for example, the mean for continuous data or the median for categorical data) is computed for each cluster and is used to replace the original records. In other words, each record in a cluster is replaced by the cluster's prototype.

From an operational point of view, micro aggregation applies:
- A clustering algorithm to a set of data, obtaining a set of clusters. Formally, the algorithm determines a partition of the original data. Micro aggregation then proceeds by calculating a cluster representative for each cluster.
- Finally, each original datum is replaced by the corresponding cluster representative [7].

After the values are modified, the k-means algorithm is applied to check whether the original value and the modified value in the micro aggregation table fall in the same cluster.
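The partition and aggregation steps can be sketched for a single numeric attribute. The paper does not say which partitioning heuristic was used, so the code below assumes the simplest one: sort the values, cut them into consecutive groups of k, and replace each value by its group mean. The function name `microaggregate` and the sample incomes are hypothetical.

```python
def microaggregate(values, k):
    """Univariate fixed-size micro aggregation; returns masked values in input order."""
    if k < 1 or len(values) < k:
        raise ValueError("need 1 <= k <= number of records")
    # Partition step: sort record indices by value so that neighbours are similar.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[start:start + k] for start in range(0, len(order), k)]
    # A trailing group smaller than k is merged into the previous one,
    # so every cluster holds at least k records.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    # Aggregation step: replace each record by its cluster mean (the prototype).
    masked = [0.0] * len(values)
    for group in groups:
        prototype = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = prototype
    return masked


incomes = [30, 10, 20, 100, 90, 80, 50]
print(microaggregate(incomes, k=3))
```

Because every value is replaced by the mean of its own group, the overall mean of the attribute is unchanged, while the variance can only stay the same or shrink.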
4.2 Additive Noise
Additive noise perturbs a sensitive attribute by adding a random variable with a given distribution to it, or by multiplying it with such a variable [2].
 
- Masking by uncorrelated noise addition. The vector of observations $x_j$ for the $j$-th attribute of the original dataset $X_j$ is replaced by a vector

  $$z_j = x_j + \varepsilon_j$$

  where $\varepsilon_j$ is a vector of normally distributed errors drawn from a random variable $\varepsilon_j \sim N(0, \sigma^2_{\varepsilon_j})$, such that $\mathrm{Cov}(\varepsilon_t, \varepsilon_l) = 0$ for all $t \neq l$. This does not preserve variances nor correlations.

- Masking by correlated noise addition. Correlated noise addition also preserves means and additionally allows preservation of correlation coefficients. The difference with the previous method is that the covariance matrix of the errors is now proportional to the covariance matrix of the original data, i.e. $\varepsilon \sim N(0, \Sigma_\varepsilon)$, where $\Sigma_\varepsilon = \alpha \Sigma$.

In this work we have used the additive noise algorithm given below.
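Before the algorithm itself, the moment properties claimed for the two noise models above can be checked with a short derivation. This is a sketch that assumes the noise is independent of the original data and has positive variance. The symbols $\mu_j$ (mean of attribute $j$), $\sigma_{jl}$ (entry of $\Sigma$) and $\rho$ (correlation coefficient) are notation introduced here, not taken from the paper.

```latex
% Uncorrelated noise: z_j = x_j + \varepsilon_j
\begin{align*}
E[z_j] &= E[x_j] + E[\varepsilon_j] = \mu_j, \\
\mathrm{Var}(z_j) &= \mathrm{Var}(x_j) + \sigma^2_{\varepsilon_j} > \mathrm{Var}(x_j), \\
\mathrm{Cov}(z_j, z_l) &= \mathrm{Cov}(x_j, x_l) \quad (j \neq l).
\end{align*}
% Variances grow while covariances stay fixed, so correlations shrink.

% Correlated noise: \varepsilon \sim N(0, \alpha\Sigma)
\begin{align*}
\mathrm{Cov}(z) &= \Sigma + \alpha\Sigma = (1 + \alpha)\Sigma, \\
\rho(z_j, z_l) &= \frac{(1 + \alpha)\,\sigma_{jl}}
                       {\sqrt{(1 + \alpha)\,\sigma_{jj}}\,\sqrt{(1 + \alpha)\,\sigma_{ll}}}
                = \frac{\sigma_{jl}}{\sqrt{\sigma_{jj}\,\sigma_{ll}}}
                = \rho(x_j, x_l).
\end{align*}
```

In both cases the mean is preserved because the noise has zero mean. Only the correlated variant keeps the correlation coefficients, since the factor $(1 + \alpha)$ cancels.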
 
- Consider a database $D$ consisting of $n$ tuples, $D = \{t_1, t_2, \ldots, t_n\}$. Each tuple consists of a set of attributes $T = \{A_1, A_2, \ldots, A_p\}$, where $A_i \in T$ and $t_i \in D$.
- Identify the sensitive or confidential numeric attribute $A_R$.
- Calculate the mean, $\text{mean} = \frac{1}{n} \sum_{i=1}^{n} A_{Ri}$.
- Initialize countgre = 0 and countmin = 0.
- If $A_{Ri} \geq \text{mean}$ (for $i = 1, \ldots, n$), store the value separately in group1 and set countgre = countgre + 1.
- Else, if $A_{Ri} < \text{mean}$, store the value separately in group2 and set countmin = countmin + 1.
- Calculate the noise1 value as 2*mean/countgre.
- Calculate the noise2 value as 2*mean/countmin.
- Subtract the noise1 value from each data item in group1.
- Add the noise2 value to each data item in group2.
- Release the new modified sensitive data.
- The mean of the modified data is the same as the mean of the original data.
- The total noise subtracted from group1 (countgre × noise1) and the total noise added to group2 (countmin × noise2) both equal 2 × mean, so the net noise added is 0.
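The steps above can be sketched in a few lines of Python. The function name `mean_split_noise` and the sample incomes are hypothetical. The handling of an empty input, and of the case where no value lies below the mean (which would make noise2 a division by zero), is an assumption added here, since the paper does not cover it.

```python
def mean_split_noise(values):
    """Mask values: subtract noise1 at or above the mean, add noise2 below it."""
    if not values:
        return []
    mean = sum(values) / len(values)
    # countgre counts group1 (values >= mean); countmin counts group2 (values < mean).
    countgre = sum(1 for v in values if v >= mean)
    countmin = len(values) - countgre
    if countmin == 0:
        # Assumption: all values are equal, so noise2 is undefined; return a copy.
        return list(values)
    noise1 = 2 * mean / countgre
    noise2 = 2 * mean / countmin
    # Total subtracted (countgre * noise1) equals total added (countmin * noise2),
    # so the sum, and therefore the mean, of the attribute is unchanged.
    return [v - noise1 if v >= mean else v + noise2 for v in values]


incomes = [10, 20, 30, 40, 100]
masked = mean_split_noise(incomes)
print(masked)
print(sum(incomes) / len(incomes), sum(masked) / len(masked))
```

In this toy input the mean is 40, noise1 is 40 and noise2 is 80/3, so the value 40 is released as 0. With few records in a group, the per-record shift can therefore be large relative to the data.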
5. Experimental Results
In order to conduct the experiments, a synthetic employee dataset with 500 records was created. From this dataset, we selected the sensitive numeric attribute income. Additive noise and micro aggregation techniques were used to modify the attribute income.

The following performance factors are considered for evaluating the two techniques.
5.1 Statistical Calculations
The statistical properties mean, standard deviation and variance of the modified data are compared with those of the original data. Both techniques produced the same results.
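A comparison of this kind can be sketched with the Python standard library. The helper names `summary` and `compare` and the sample values are hypothetical. The use of population rather than sample statistics is an assumption, since the paper does not say which was computed.

```python
import statistics


def summary(values):
    """Mean, population standard deviation and population variance of a list."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "variance": statistics.pvariance(values),
    }


def compare(original, masked):
    """Absolute difference of each statistic between original and masked data."""
    before, after = summary(original), summary(masked)
    return {name: abs(before[name] - after[name]) for name in before}


original = [10, 20, 30, 40, 100]
masked = [20, 20, 20, 70, 70]  # hypothetical masked values with the same mean
print(compare(original, masked))
```

A difference of zero for the mean alongside non-zero differences for the standard deviation and variance would indicate a masking method that preserves only the first moment.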
