

A Framework for Privacy Preserving Classification in Data Mining
Md. Zahidul Islam and Ljiljana Brankovic
School of Electrical Engineering and Computer Science
The University of Newcastle
Callaghan, NSW 2308, Australia
zahid@cs.newcastle.edu.au, lbrankov@cs.newcastle.edu.au

Abstract

Nowadays organizations all over the world are dependent on mining gigantic datasets. These datasets typically contain delicate individual information, which inevitably gets exposed to different parties. Consequently, privacy issues are constantly under the limelight, and public dissatisfaction may well threaten the exercise of data mining and all its benefits. It is thus of great importance to develop adequate security techniques for protecting the confidentiality of individual values used for data mining. In the last 30 years several techniques have been proposed in the context of statistical databases. It was noticed early on that careless noise addition introduces biases to statistical parameters, including means, variances and covariances, and sophisticated techniques that avoid these biases were developed. However, when these techniques are applied in the context of data mining, they do not appear to be bias-free. Wilson and Rosen (2002) suggest the existence of a Type Data Mining (DM) bias, which relates to the loss of underlying patterns in the database and cannot be eliminated by preserving simple statistical parameters. In this paper we propose a noise addition framework specifically tailored towards the classification task in data mining. It builds upon previous techniques that introduce noise to the class and the so-called innocent attributes, and extends them to the influential attributes; additionally, it caters for the preservation of variances and covariances, along with patterns, thus making the perturbed dataset useful for both statistical and data mining purposes. Our preliminary experimental results indicate that data patterns are highly preserved, suggesting the non-existence of the DM bias.

Keywords: Data mining, Statistical database, Data security, Privacy, Noise addition

Copyright © 2004, Australian Computer Society, Inc. This paper appeared at the Australasian Computer Science Week (ACSW 2004), Dunedin, New Zealand. Conferences in Research and Practice in Information Technology, Vol. 32. J. Hogan, P. Montague, M. Purvis and C. Steketee, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

1 Introduction

The advances in information processing technology and storage capacity have established data mining as a widely accepted technique in various organizations. The benefits of data mining include, but are not restricted to, improvement in the diagnosis of diseases, composition of drugs tailored towards an individual's genetic structure (US Department of Energy 2001), and automatic aid for deciding whether a loan application should be accepted or not. Nowadays organizations are extremely dependent on data mining in their everyday activities, and the paybacks include providing better service, achieving greater profit, and better decision-making. For these purposes organizations collect huge amounts of data to which they apply data mining techniques. For example, business organizations collect data about consumers for marketing and for improving business strategies, medical organizations collect medical records for better treatment and medical research, and national security agencies maintain criminal records for security purposes.

Typically, these data are collected with the consent of the data subjects, and the collector provides some assurance that the privacy of individual data will be protected. However, the secondary use of collected data is also very common; secondary use is any use for which the data were not collected initially. Additionally, some organizations sell the collected data to other organizations, which use these data for their own purposes. Thus the data get exposed to a number of parties, including collectors, owners, users and miners.

Individual records are often considered to be private and sensitive. For example, detailed credit card records can disclose a personal lifestyle with reasonable accuracy. Any misuse of such personal data can be uncomfortable for the individuals concerned and can occasionally cause them serious damage. Consequently, public concern about personal information is growing, which forces governments and law-enforcement agencies to introduce and implement new privacy-protecting laws and regulations. An example of such laws is the US Executive Order (2000) that protects federal employees from being discriminated against, on the basis of protected genetic information, for employment purposes (Clinton 2000). It is not unlikely that stricter privacy laws will be introduced in the future. On the other hand, without such laws individuals may become hesitant to share their personal information. Both scenarios may make data collection difficult and hence deprive organizations of the benefits of data mining, resulting in an inferior quality of services provided to the public. Such prospects equally concern collectors and owners of data, as well as researchers. A possible solution to this problem is to provide security techniques that would ensure the privacy of individual records while at the same time enabling data mining and statistical analysis. In recent years several such security techniques for statistical databases and data mining have been proposed.
It is deemed that the confidentiality of individual records can be maintained and, at the same time, the usefulness of the data for data mining preserved (Agrawal and Srikant 2000, Estivil-Castro and Brankovic 1999, Islam and Brankovic 2003). We observe that removing names and other identifiers may not guarantee the confidentiality of individual records, as it is often the case that a particular record can be uniquely identified from the combination of other attributes. Hence we need more sophisticated protection techniques. The datasets used for data mining purposes do not necessarily need to contain 100% accurate data; in fact, that is almost never the case, due to the existence of natural noise in datasets. In the context of data mining it is important to maintain the patterns in the dataset. Additionally, the maintenance of statistical parameters, namely the means, variances and covariances of attributes, is important in the context of statistical databases. However, maintaining only the statistical parameters does not necessarily preserve the patterns in a dataset. To illustrate this, consider two attributes, A and B. Their correlation over the whole population space may not be significantly high, yet they may have a very high correlation in a specific part of the dataset, say when the value of A is within a specific range. For example, A could be "Age" and B could be "Height": these two attributes are highly correlated when Age is in the range 0-10, while their correlation is very low in the range 20+. Such a high correlation between attributes in a particular part of the dataset can be seen as a pattern of the dataset. Maintaining the correlations among the attributes over the whole population space might not preserve these patterns.

In this paper we concentrate on classification and on preserving the patterns of a dataset while adding noise to it. At the same time we maintain the statistical parameters, in order to keep the dataset useful for statistical purposes. We propose a noise addition framework which uses a few previous techniques (Estivil-Castro and Brankovic 1999, Islam and Brankovic 2003, Muralidhar, Parsa and Sarathy 1999) and incorporates them with a new technique introduced in this paper. Our initial experimental results, presented in this paper, indicate that the proposed framework preserves the patterns of datasets while at the same time providing high privacy by perturbing the data. The framework perturbs both confidential and non-confidential attributes. Perturbation of non-confidential attributes is very valuable in providing a higher level of security, as it makes it difficult for an intruder to uniquely identify a record and thus compromise its confidential values.

2 Previous Work

Many privacy techniques have been proposed in the context of statistical databases (see, for example, Adam and Wortmann 1989, Muralidhar, Parsa and Sarathy 1999, Tendick and Norman 1987, Tendick and Matloff 1994), while few have been proposed in the context of data mining (see, for example, Agrawal and Srikant 2000, Du and Zhan 2002, Lindell and Pinkas 2000, Oliveira and Zaane 2002). An excellent survey of the existing privacy methods for statistical databases is presented in (Adam and Wortmann 1989). These methods are categorised into three main groups based on the approaches they take: query restriction, data perturbation, and output perturbation. Among these, data perturbation is the most straightforward to implement: we only need to perturb the data once, and can then use any existing software (e.g., a DBMS) to access the data without any further restrictions on processing. That is not the case with query restriction and output perturbation. Thus, we feel that this method is the most suitable for data mining applications, and to our knowledge no other methods have been investigated in this context.

The simplest version of additive data perturbation was proposed by Kim (1986). This method suffers from a bias related to the variances (Type A), a bias related to the relationships between confidential attributes (Type B), and a bias related to the relationships between confidential and non-confidential attributes (Type C). Several years later, Tendick and Matloff (1994) presented a modified version of random data perturbation which is free of the Type A and Type B biases. In 1999, Muralidhar, Parsa and Sarathy (1999) proposed a new data perturbation method called the General Additive Data Perturbation (GADP) method. For a database with a multivariate normal distribution, the GADP method is said to be the most secure and the least bias-prone; it is structured to be free of the so-called Type A, Type B, Type C and Type D biases.

In 2002, Wilson and Rosen (2002) compared the GADP method with a naïve noise addition method called the Simple Additive Data Perturbation (SADP) method in the context of data mining. They suggested the existence of a new bias, called the Type Data Mining (DM) bias, and thus attempted to show that the GADP method is not bias-free in the context of data mining.

Estivil-Castro and Brankovic (1999) proposed a data perturbation technique that adds noise to the class attribute. The technique emphasised pattern preservation rather than obtaining unbiased statistical parameters.

The majority of previous studies evaluate the quality of perturbed datasets by measuring the predictive accuracy of the classifiers built on them (Kohavi 1996, Lim, Loh and Shih 2000, Wilson and Rosen 2002). However, in this paper we follow our previous direction (Estivil-Castro and Brankovic 1999, Islam and Brankovic 2003, Islam, Barnaghi and Brankovic 2003) and evaluate the quality of the perturbed datasets by comparing the decision trees and the logic rules associated with the perturbed and the original datasets. Our previous work (Islam, Barnaghi and Brankovic 2003) indicates that the preservation of logic rules is highly correlated with the accuracy of neural network classifiers, which are known to be very sensitive to noise.

3 The Framework

In this section we present a noise addition framework for datasets that contain several numerical attributes and a single categorical attribute (the class). An example of such a dataset is the Wisconsin Breast Cancer (WBC) dataset, available from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html.
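The additive perturbation and the Type A (variance) bias discussed in Section 2 can be illustrated with a short sketch. Naive additive noise leaves the mean unbiased but inflates the sample variance by the noise variance, which a bias-aware estimator must subtract out. This is an illustrative sketch only, not any of the cited methods; the data and noise parameters are invented:

```python
import random

random.seed(0)

# A hypothetical confidential numerical attribute.
data = [random.gauss(50, 10) for _ in range(20000)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Simple additive perturbation: X' = X + e, with e ~ N(0, noise_sd^2).
noise_sd = 5.0
perturbed = [x + random.gauss(0, noise_sd) for x in data]

v_orig = variance(data)
v_pert = variance(perturbed)          # inflated by roughly noise_sd**2 (Type A bias)
v_corrected = v_pert - noise_sd ** 2  # subtracting the known noise variance removes the bias
```

The means of the two lists agree closely, but the raw perturbed variance overestimates the original by about noise_sd², illustrating why bias-correcting schemes such as GADP track the noise covariance structure explicitly.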
We first give a brief introduction to the WBC dataset, as we shall refer to it in examples throughout this section.

The WBC dataset has 10 numerical non-class attributes and one categorical class attribute. The class attribute has two categorical values, "2" and "4". Out of the 10 numerical attributes, one is the Record ID. The 9 numerical attributes (excluding the Record ID) are Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli and Mitoses. Each of these attributes takes an integer value from the range 1 to 10 inclusive.

We used Quinlan's C5 decision tree builder on 349 cases of the WBC dataset and obtained the decision tree shown in Figure 1, where ellipses represent internal nodes and circles represent the leaves.

We next present a framework for adding noise to all the attributes in a dataset, including the class, in such a way that the patterns discovered by the decision tree built on the original dataset are preserved. Additionally, our framework can be extended so as to preserve the correlations among the attributes. This extension makes the framework applicable to a wider range of datasets, both those used for classification and those used for statistical analysis. Our preliminary experimental results indicate that the patterns are very well preserved (see Section 4).

Figure 1. The decision tree obtained from 349 cases of the unperturbed WBC dataset.

The framework involves three steps, as described below.

Step 1: Add noise to the Leaf Influential Attributes (LINAs) of each leaf of the decision tree obtained from the unperturbed dataset, using the Leaf Influential Attribute Perturbation Technique (LINAPT).

Step 2: Add noise to the Leaf Innocent Attributes (LIAs) of each leaf of the decision tree obtained from the unperturbed dataset, using the Leaf Innocent Attribute Perturbation Technique (LIAPT).

Step 3: Add noise to the class attribute, for each of the heterogeneous leaves, using the Random Perturbation Technique (RPT).

In order to describe the details of the three framework steps, we first need to introduce some assumptions and definitions.

First of all, we note that we add noise to all of the attributes, regardless of whether they are considered confidential or non-confidential. Hence we consider all the attributes, including the class attribute, sensitive. The rationale behind this assumption is that treating some attributes as non-sensitive and not adding any noise to them can facilitate the identification of a particular record by an intruder. If identification is possible, the intruder can learn the values of all other attributes of that record. In that case it would be absolutely necessary to add a significant amount of noise to all confidential attributes in order to protect them, and adding a great deal of noise would affect the patterns in the dataset and, to some extent, the predictive accuracy of the classifier built on the perturbed dataset. On the other hand, if the identification of individual records is made difficult by adding noise to all the attributes, that is, if all the attributes are considered sensitive, then the same level of privacy can be achieved by adding less noise to the confidential attributes. By spreading the noise in this way, we may better preserve the patterns. This is particularly important in the context of influential attributes, to which we can only add a limited amount of noise, as described later in this section.

Secondly, for each leaf we split the non-class attributes into two groups, namely Leaf Innocent Attributes (LIAs) and Leaf Influential Attributes (LINAs). In a decision tree, if an attribute is not tested in any of the nodes on the path between the root and a leaf, then the attribute is called a Leaf Innocent Attribute (LIA) for that particular leaf (Islam and Brankovic 2003). Thus, every leaf of a decision tree maintains a set of LIAs. For example, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin and Mitoses are the LIAs for LEAF-1 and LEAF-2 of the decision tree shown in Figure 1.

On the contrary, an attribute that is tested at least once on the path between the root and a leaf is called a Leaf Influential Attribute (LINA) for that particular leaf. Every leaf of a decision tree thus also maintains a set of LINAs. For example, Clump Thickness, Uniformity of Cell Size and Normal Nucleoli are the LINAs for LEAF-1 and LEAF-2.

Finally, we define the domain of an attribute to be the set of legal values that the attribute can take (Date 2000). For example, the domain of each of the 9 numerical attributes in the WBC dataset is [1,10].

Now we are ready to explain the three steps of the framework in detail.
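The LIA/LINA split just defined is purely a set difference over the attributes tested on a root-to-leaf path. A minimal sketch, using hypothetical attribute names echoing Figure 1 (this is an illustration, not the authors' implementation):

```python
# All non-class attributes of a toy dataset (hypothetical WBC-style names).
all_attributes = {"ClumpThickness", "CellSize", "CellShape",
                  "NormalNucleoli", "Mitoses"}

# For each leaf: the attributes tested on the path from the root to that leaf.
paths = {
    "LEAF-1": {"CellSize", "ClumpThickness", "NormalNucleoli"},
    "LEAF-2": {"CellSize", "ClumpThickness", "NormalNucleoli"},
}

def linas_and_lias(leaf):
    """LINAs are the attributes tested on the leaf's path; LIAs are the rest."""
    linas = set(paths[leaf])
    lias = all_attributes - linas
    return linas, lias

linas, lias = linas_and_lias("LEAF-1")
# linas -> {"CellSize", "ClumpThickness", "NormalNucleoli"}
# lias  -> {"CellShape", "Mitoses"}
```

Every attribute is thus either a LINA or a LIA for a given leaf, and the same attribute can play different roles for different leaves.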
In Step 1 we introduce a new technique for perturbing the Leaf Influential Attributes (LINAs). We call this technique the Leaf Influential Attribute Perturbation Technique (LINAPT). For each leaf of the decision tree, LINAPT first identifies the LINAs of that leaf. LINAPT then adds noise to each record of that leaf, and only to the Leaf Influential Attributes. Let A be a LINA. After LINAPT adds noise to A, the perturbed value of the attribute will be A′ = A + ε, where ε is a discrete noise with mean µ and variance σ². The distribution of the noise can be chosen to suit a particular application.

An important characteristic of LINAPT is that all perturbed values remain within the range defined by the conditional values of the LINAs for that particular leaf. For example, LEAF-1 of Figure 1 has three LINAs, namely Uniformity of Cell Size, Clump Thickness and Normal Nucleoli. The range of Uniformity of Cell Size defined by the conditional value for LEAF-1 is 1 to 2 inclusive (see Figure 1). Thus LINAPT adds noise to this attribute, for the cases belonging to LEAF-1, in such a way that the perturbed value remains within the range 1 to 2. Similarly, the range of Uniformity of Cell Size for LEAF-10 is 3 to 4 inclusive, while the range of Uniformity of Cell Shape is 1 to 2 inclusive (see Figure 1). LINAPT perturbs all the LINAs of each leaf in the same way.

In Step 2 we perturb the LIAs according to the Leaf Innocent Attribute Perturbation Technique (LIAPT), proposed in (Islam and Brankovic 2003). LIAPT is used to perturb the LIAs within each leaf. LIAs are not influential in the sense that they play no role in the prediction of the class attribute for that particular leaf; in other words, they do not appear in any logic rule related to the leaf. Hence they are not considered important for the pattern preservation of the dataset. LIAPT first detects the LIAs from the original decision tree and then adds noise to each LIA. Let B be a LIA. After LIAPT adds noise to B, the perturbed value of the attribute will be B′ = B + ε, where ε is a discrete noise with mean µ and variance σ². Again, the distribution of the noise can be chosen to suit a particular application.

The LIAPT was first introduced in (Islam and Brankovic 2003). In the same paper we presented some experimental results evaluating the technique. We perturbed only the LIAs of the heterogeneous leaves of the WBC dataset, using an approximation of a normal distribution with mean µ = 0 and standard deviation σ = 27.6% of the attribute value. We then compared the decision tree produced from the perturbed dataset to the decision tree produced from the unperturbed dataset. We carried out the same experiment 10 times. In 8 out of 10 experiments the trees obtained from the perturbed datasets were very similar to the tree obtained from the unperturbed dataset. Only in 2 out of 10 experiments did we obtain decision trees slightly different from the original tree, but even these retained the same root and some of the logic rules.

In this paper we extend the experiments with LIAPT by perturbing the LIAs of both homogeneous and heterogeneous leaves. We present our experimental results in the next section.

In Step 3 we use a perturbation technique to preserve the confidentiality of the class attribute. Three class attribute perturbation techniques, namely the Random Perturbation Technique (RPT), the Probabilistic Perturbation Technique (PPT) and the All Leaves Probabilistic Perturbation Technique (ALPT), were proposed in (Islam and Brankovic 2003). Of these three, we found RPT to be the best (Islam and Brankovic 2003, Islam, Barnaghi and Brankovic 2003), and hence our framework uses RPT. In RPT we first focus on a heterogeneous leaf of the decision tree and find the number of cases with the minority class in that leaf; we denote this number by n. We then convert all n cases of the minority class to the majority class. Next, we randomly select n cases belonging to the same leaf and convert their class from the majority to the minority. We continue this perturbation of the class attribute for all heterogeneous leaves of the decision tree. For example, LEAF-1, LEAF-3, LEAF-6 and LEAF-12 are heterogeneous leaves of the decision tree shown in Figure 1. The minority class for LEAF-1 is "4", and there is one case with the minority class in the leaf. In RPT we first convert the class of that case from "4" to "2". We then select one case at random out of all 6 cases belonging to the leaf and convert its class from "2" to "4". A similar perturbation is performed for the remaining heterogeneous leaves of the decision tree.

In (Islam and Brankovic 2003) we presented experimental results on RPT. We produced a decision tree from 300 cases of the Boston Housing Price dataset, which is available from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. We then perturbed the dataset with RPT 5 times, and each time produced a decision tree from the perturbed dataset. Hence we obtained 5 decision trees from 5 perturbed datasets, and we compared them with the original decision tree. We found that the logic rules of the perturbed decision trees and those of the original decision tree were similar. Out of the four attributes tested in the original decision tree, three were tested in every perturbed decision tree, and the conditional values of these attributes were similar in the trees obtained from the original and the perturbed datasets.

We now propose an extension of our framework in order to maintain the statistical parameters of a dataset, as well as its patterns.

For each leaf, DO:
    Step 1: Add noise to the Leaf Influential Attributes (LINAs) and the Leaf Innocent Attributes (LIAs) using the GADP technique (Muralidhar, Parsa and Sarathy 1999), where the domain of each LINA is bounded by the conditional values for that leaf.
    Step 2: Add noise to the class attribute, for each of the heterogeneous leaves, by the Random Perturbation Technique (RPT).
END DO
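The RPT step used in both versions of the framework swaps class labels within a leaf without changing the leaf's class frequencies. A minimal sketch of the class-flipping idea, over a hypothetical leaf represented as a list of class labels (an illustration under stated assumptions, not the authors' code):

```python
import random
from collections import Counter

random.seed(1)

# One heterogeneous leaf of a hypothetical tree: class labels of its records.
leaf_classes = ["2", "2", "2", "2", "2", "4"]  # minority class "4", so n = 1

def rpt(classes):
    """Random Perturbation Technique for one heterogeneous leaf."""
    counts = Counter(classes)
    majority, minority = [c for c, _ in counts.most_common(2)]
    n = counts[minority]
    # Convert all n minority cases to the majority class...
    flipped = [majority] * len(classes)
    # ...then flip n randomly chosen cases of the leaf back to the minority class.
    for i in random.sample(range(len(classes)), n):
        flipped[i] = minority
    return flipped

perturbed = rpt(leaf_classes)
# The leaf's class frequencies are preserved: still five "2"s and one "4",
# but which record carries the minority label is now random.
```

Because the class distribution inside each leaf is unchanged, the logic rules of the tree are undisturbed, while any individual record's class value becomes deniable.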
In Step 1 we consider the collection of records that belong to the leaf under consideration to be a dataset in its own right. The domains of the LIAs remain the same as their domains in the original dataset. However, the domains of the LINAs are defined by the conditional values for the leaf. Thus, after the perturbation is completed, the values of the LINAs will remain in the range defined by the conditional values.

Step 2 is straightforward and remains the same as Step 3 of the original framework.

The extended framework effectively partitions the dataset into partitions defined by the leaves of the decision tree built from the unperturbed dataset. The GADP method is then applied to each leaf separately. This guarantees the preservation of the correlations among all the attributes, as well as the absence of any bias of Types A, B, C or D. Thus the correlations in the dataset as a whole will also be preserved, and the above-mentioned biases will not occur. The advantage of adding noise leaf by leaf is the preservation of all the patterns discovered by the original tree. The trade-off that we must bear is a certain decrease in security, as an intruder will have tight bounds on the original values of the LINAs. However, there are no such bounds on the LIAs.

4 Experimental Results

In this section we present the results of our experiments on the Leaf Influential Attribute Perturbation Technique (LINAPT) and the Leaf Innocent Attribute Perturbation Technique (LIAPT).

4.1 Experimental Results on LINAPT

In our experiments on LINAPT we used noise with mean µ = 0 and a standard deviation σ that depends on the new domains of the LINAs. For larger domains, σ can be approximated as 27.6% of the domain. In fact, we used the probability distribution of the noise shown in Figure 2. The horizontal axis represents the noise as a percentage of the range of the attribute value defined by the conditional value for a leaf of a decision tree; the vertical axis represents the probability of adding the corresponding noise. σ decreases with a decrease in the domain, due to the rounding factor.

Figure 2. Probability distribution of the noise added to LINAs.

We used the 349 cases of the WBC dataset for our experiments. We first produced a decision tree, shown in Figure 1, from the original 349 cases. We then applied LINAPT to each leaf of the original decision tree (Figure 1), thus producing a perturbed dataset, and obtained a decision tree from it. We repeated the experiment 15 times, producing 15 perturbed datasets and therefore 15 perturbed decision trees.

We compared the logic rules of the decision trees obtained from the perturbed datasets with the logic rules of the decision tree obtained from the original dataset. Although decision trees are generally known to be unstable with respect to noise (Li 2001), in 12 out of 15 experiments the decision tree obtained from the perturbed dataset was exactly the same as, or almost the same as, the original decision tree. More precisely, in 7 out of 15 experiments we obtained trees identical to the original tree. The trees obtained in a further 3 experiments were almost the same as the original tree. The only difference was in one node, at the very bottom of the tree, which affects only 12 to 13 cases out of 349. The remaining part of these 3 trees is exactly the same as in the original tree (see Figures 1 and 3 for illustration). Thus, all logic rules associated with these trees are exactly the same as the logic rules associated with the original tree, except for the two rules defined by the node that is different.

A further two trees are very similar to the original tree. They both test the same 7 attributes as the original tree and have 13 nodes each (the original tree has 12 nodes). Of these nodes, 10 are exactly the same as in the original tree, in the sense that they test the same attributes with the same conditional values and produce leaves with the same numbers of cases as the original tree. Ten out of the 14 logic rules associated with these 2 trees are the same as in the original tree, and they cover 329 out of 349 cases.

Figure 3. A decision tree obtained from a dataset perturbed by LINAPT. This decision tree is very similar to the decision tree obtained from the unperturbed dataset.

The remaining 3 trees are slightly different from the original tree. They have between 8 and 13 nodes and between 9 and 14 leaves, while the original tree has 12 nodes and 13 leaves. However, around half of the logic rules associated with the original tree also appear in all of the 3 trees, and these logic rules cover 316 out of 349 cases in each tree. All of these trees test at least 6 out of the 7 attributes tested in the original tree. Figure 4 shows one of these 3 trees.

Figure 4. A decision tree obtained from a dataset perturbed by LINAPT. This tree is considered not to be very similar to the original tree.

These experimental results indicate that LINAPT is excellent at preserving the patterns in the dataset. To fully appreciate this, a reader needs to remember that decision trees are very sensitive to noise. Our previous experiments (Islam and Brankovic 2003) show that when noise is added randomly to the records of a dataset, the decision trees typically suffer major changes with respect to the number of nodes and the logic rules associated with the tree.

4.2 Experimental Results on LIAPT

In our experiments on LIAPT we used mean µ = 0 and a standard deviation σ = 27.6% of the range of the attribute, with a probability distribution of the noise similar to that shown in Figure 2. We used the same LIAPT as in (Islam and Brankovic 2003), but there we perturbed only the LIAs of the heterogeneous leaves of the decision tree, whereas in the experiments presented in this paper we perturbed the LIAs of each and every leaf of the original decision tree. We thus produced a perturbed dataset and a perturbed decision tree, and repeated the experiment 10 times. In 7 out of 10 experiments the decision trees were exactly the same as the original decision tree. In the remaining 3 experiments we obtained trees that differ from the original tree; however, they still preserve many logic rules of the original decision tree.

5 Conclusion

In this paper we propose a noise addition framework for protecting the privacy of sensitive information used for data mining purposes. The framework does not distinguish between confidential and non-confidential attributes but rather adds noise to all of them. Adding noise to non-confidential attributes contributes to the overall security by preventing the unique identification of records by an intruder. The framework can be extended so that it incorporates the GADP method proposed by Muralidhar et al. This method is known to be secure while at the same time it does not introduce any of the four statistical biases (Muralidhar, Parsa and Sarathy 1999, Wilson and Rosen 2002). Our experiments indicate that, when used within our framework, GADP may also be considered free of the Type DM bias.

6 References

Adam, N. and Wortmann, J.C. (1989): Security-Control Methods for Statistical Databases: A Comparative Study, ACM Computing Surveys, 21(4): 515-556.

Agrawal, R. and Srikant, R. (2000): Privacy-Preserving Data Mining, in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, Dallas, TX.

Clinton, W.J. (2000): Executive Order 13145, http://www.dol.gov/oasam/regs/statutes/eo13145.htm. Accessed 7 Oct 2003.

Date, C.J. (2000): An Introduction to Database Systems, 7th edition, Addison-Wesley.

Du, W. and Zhan, Z. (2002): Building Decision Tree Classifier on Private Data, in Proceedings of the IEEE Workshop on Privacy, Security and Data Mining at ICDM 02, Conferences in Research and Practice in Information Technology, 14, Clifton, C. and Estivill-Castro, V., eds.

Estivil-Castro, V. and Brankovic, L. (1999): Data Swapping: Balancing Privacy Against Precision in Mining for Logic Rules, in Mohania, M. and Tjoa, A.M., eds., Data Warehousing and Knowledge Discovery DaWaK'99, LNCS 1676, 389-398.

Islam, M.Z. and Brankovic, L. (2003): Noise Addition for Protecting Privacy in Data Mining, in Proceedings of the 6th Engineering Mathematics and Applications Conference (EMAC 2003), Sydney, 85-90.

Islam, M.Z., Barnaghi, P.M. and Brankovic, L. (2003): Measuring Data Quality: Predictive Accuracy vs. Similarity of Decision Trees, accepted for publication in Proceedings of the 6th International Conference on Computer and Information Technology (ICCIT 2003), Jahangirnagar University, Bangladesh.

Kim, J. (1986): A Method for Limiting Disclosure in Microdata Based on Random Noise and Transformation, in Proceedings of the American Statistical Association on Survey Research Methods, American Statistical Association, Washington, DC, 370-374.

Kohavi, R. (1996): Scaling Up the Accuracy of Naïve-Bayes Classifiers: a Decision-Tree Hybrid, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.

Li, R.-H. (2001): Instability of Decision Tree Classification Algorithms, PhD Thesis, University of Illinois at Urbana-Champaign.

Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. (2000): A Comparison of Predictive Accuracy, Complexity and Training Time of Thirty-Three Old and New Classification Algorithms, Machine Learning Journal, 40, 203-228.

Lindell, Y. and Pinkas, B. (2000): Privacy Preserving Data Mining, in Bellare, M., ed., Proceedings of Advances in Cryptology - CRYPTO 2000, LNCS 1880.

Muralidhar, K., Parsa, R. and Sarathy, R. (1999): A General Additive Data Perturbation Method for Database Security, Management Science, 45(10), 1399-1415.

Oliveira, S.R.M. and Zaane, O.R. (2002): Foundations for an Access Control Model for Privacy Preservation in Multi-Relational Association Rule Mining, in Workshop on Privacy, Security and Data Mining at ICDM 02, Conferences in Research and Practice in Information Technology, 14, Clifton, C. and Estivill-Castro, V., eds.

Tendick, P. and Norman, N.S. (1987): Recent Results on the Noise Addition Method for Database Security, in Proceedings of the 1987 Joint Meetings, American Statistical Association / Institute of Mathematical Statistics, ASA/IMS, Washington, DC.

Tendick, P. and Matloff, N. (1994): A Modified Random Perturbation Method for Database Security, ACM Transactions on Database Systems, 19(1), 47-63.

US Department of Energy, Human Genome Program (2001): Genomics and Its Impact on Medicine and Society: A 2001 Primer, http://www.ornl.gov/hgmis/publicat/primer2001/.

Wilson, R.L. and Rosen, P.A. (2002): The Impact of Data Perturbation Techniques on Data Mining, in Proceedings of the 33rd Annual Meeting of DSI.
