Abstract— There has been a great amount of work in recent years on privacy preservation, but most of it considers anonymization as the principle for publishing data. k-anonymization [1, 2] is a popular privacy protection mechanism in data publishing. Such mechanisms protect the data only through generalization, suppression, or distortion as the anonymization technique. These methods publish data in an effective manner but involve information loss. Although the privacy guarantee is the most important data publication criterion, the published data must also provide a reasonable level of utility so that it remains useful for applications such as data mining or data analysis. Utility is a tradeoff for the privacy guarantee, since anonymization of data introduces information loss.

In this paper, a simple method for privacy-preserving publication of datasets that reduces information loss to a certain extent is introduced. Utility is given more importance, without compromising privacy. To handle the problem, we develop a simple approach that includes spurious tuples in the published dataset to satisfy the diversity property. The method gives the adversary no scope to make inferences from the published dataset. Experiments on a medical dataset showed that our method is efficient and can produce published data of high utility.

Index Terms—Privacy, Spurious tuples, Anonymity, Diversity

*-Ph.D Scholar, Dept. of Computer Science Engineering, JNTUK, Kakinada, India, reddysadi@gmail.com and **-Dept. of Computer Science and Systems Engineering, College of Engineering (A), Andhra University, Visakhapatnam-530003, India, vallikumari@gmail.com, kvsvn.raju@gmail.com.

I. INTRODUCTION

Information sharing has become a common phenomenon with the advent of technology and global networking. There is an enormous demand for person-specific data such as shopping habits, criminal records, medical history, credit records and many more. Data mining is used by several applications for analyzing such data. This data is an important asset to business organizations and their goals, and it helps in decision making as well as in providing social benefits such as medical research, crime reduction and national security [3]. Assume a hospital publishes patient information for statistical use, suppressing the identifying attributes. As voter registration lists are publicly accessible, a careful correlation of the attributes in the two lists would reveal sensitive information about an individual [2]. For example, a disease, which the individual did not wish to reveal, might get revealed. In data mining, ensuring privacy means showing that the results published or mined do not inherently disclose individual information. The goal of privacy-preserving data mining is that the knowledge present in the data is extracted for use, the individual's privacy is protected, and the data holder is protected against misuse or disclosure of data.

Studies show that 87% [2] of the US population can be uniquely identified from the combination of three attributes: date of birth, gender and 5-digit zip code. The set of attributes that, in combination, uniquely identify individuals when linked with external data are called quasi-identifiers. Several approaches have evolved to provide data privacy, such as k-anonymity [1, 2], l-diversity [13], t-closeness [16], personalized privacy preservation [14] and so on. s-Tuple inclusion based privacy preservation is proposed in this paper; it solves this privacy problem in a more attractive way than the above approaches in terms of both utility and privacy.

In summary, this paper contributes the following. Our study has shown that several existing techniques incur too much information loss while achieving privacy. We address the data privacy problem using the s-tuple inclusion approach, a new perspective on the privacy problem in data publishing. The method ensures more privacy and less information loss. This paper supports these claims through experimental results as well as theoretical evaluation.

The rest of the paper is organized as follows: Section 2 covers related work. Section 3 gives the problem definition. Section 4 contains a description of the privacy-preserving model. Section 5 describes the possible attacks. Section 6 discusses the experimental results. Section 7 concludes the paper with a discussion of what needs to be done further.

II. RELATED WORK

One privacy-concerned problem is publishing microdata for public use [4], which has been extensively studied. A large category of privacy attacks re-identifies individuals by joining the published table with external tables modeling the background knowledge of users. There are several works discussing privacy preservation in data mining. Data modification methods are the simplest ways to preserve privacy: perturbation [5] (altering an attribute value to a new value or adding noise), blocking [6] (replacing an existing attribute value with a "?"), aggregation [5] (combining several values into a coarser category), swapping [7, 8] (interchanging values of individual records), and sampling [9] (releasing data for only a sample of a population).

Attributes can be classified into two groups according to the business rules of an organization: the privacy disclosure set and the non-privacy disclosure set of attributes [3]. Any attribute that can pinpoint an individual or organization falls in the privacy disclosure (PD) set of attributes. The rest of the attributes fall in the non-privacy disclosure (NPD) set. Further, the NPD set is categorized into sensitive (NPDS) and non-sensitive (NPDNS) attributes. The sensitive attributes have a great impact on privacy when they are considered together with the attributes of the PD set; the non-sensitive attributes have lesser or no impact on privacy. This means that if any sensitive attributes of the non-privacy disclosure set are requested by the miner along with any attribute of the privacy disclosure set, it leads to privacy violation. This categorization of attributes into PD and NPD sets is done purely on the basis of business rules. The work in [3] preserves privacy by transforming the confidential attributes into fuzzy values.

k-anonymity [1, 2] is one widely discussed approach for achieving data privacy. The attributes that help reveal information when combined with other attributes are called quasi-identifier (QID) attributes, or the privacy disclosure (PD) set of attributes [3]. The attributes that hold private information about an individual and should not be disclosed in combination with any QID are called sensitive attributes [1, 2] and form the non-privacy disclosure sensitive attribute set (NPDS) [3]. Transforming the QID values, typically by generalization or suppression, so that each record becomes indistinguishable from at least k-1 others is called anonymization [10]. But several problems have been identified with k-anonymity [11, 12]: a k-anonymous table may still allow an adversary to derive the sensitive information of an individual with 100% confidence. The larger the value of k, the better the privacy is protected, but there is considerable information loss from the data.

k-anonymity allows an attacker to discover the values of sensitive attributes when there is little diversity in those sensitive attributes. To counter this, another scheme called l-diversity [13] was proposed; its effect is shown in Table 3. l-diversity provides privacy even when the data publisher does not know what kind of knowledge is possessed by the adversary. It ensures that all tuples sharing the same quasi-identifier values have l-diverse values for their sensitive attributes. Even l-diversity is prone to attacks by an adversary and may still allow a privacy breach [14]. Anatomy [15] is another l-diversity-specific method; though it does not violate the l-diversity property, it confirms that a particular individual is included in the data. t-closeness is another scheme, which recommends that the table-wide distribution of sensitive attribute values be reproduced within each anonymized group [16].
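The re-identification (linking) attack discussed above can be illustrated with a small, entirely hypothetical example in code; the names, records, and the `link` helper below are invented for illustration and are not from the paper:

```python
# A "de-identified" medical table still carries quasi-identifiers (Age,
# Gender, PIN code), so a join against a public voter list re-identifies
# the individuals. All records here are fabricated.

medical = [  # published without names
    {"Age": 29, "Gender": "Female", "PIN": 33000, "Disease": "flu"},
    {"Age": 28, "Gender": "Male",   "PIN": 37000, "Disease": "pneumonia"},
]

voters = [  # publicly accessible registration list
    {"Name": "Alice", "Age": 29, "Gender": "Female", "PIN": 33000},
    {"Name": "Bob",   "Age": 28, "Gender": "Male",   "PIN": 37000},
]

QID = ("Age", "Gender", "PIN")

def link(published, external):
    """Join the two tables on the quasi-identifier attributes."""
    index = {tuple(v[a] for a in QID): v["Name"] for v in external}
    return {index[q]: r["Disease"]
            for r in published
            if (q := tuple(r[a] for a in QID)) in index}

print(link(medical, voters))  # maps each re-identified voter to a disease
```

When every quasi-identifier combination is unique, as here, the join reveals each individual's sensitive value, which is exactly what anonymization schemes try to prevent.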
Table 1: Microdata

Age  Gender  PIN code  Disease
5    Male    12000     gastric ulcer
9    Male    14000     dyspepsia
6    Male    18000     pneumonia
8    Male    19000     bronchitis
12   Female  22000     pneumonia
19   Male    24000     pneumonia
29   Female  33000     flu
26   Female  36000     gastritis
28   Male    37000     pneumonia
21   Male    38000     flu

Table 3: 3-Diverse data

Age    Gender  PIN code     Disease
1-10   Male    10001-20000  gastric ulcer
1-10   Male    10001-20000  dyspepsia
1-10   Male    10001-20000  pneumonia
1-10   Male    10001-20000  bronchitis
11-30  Person  20001-40000  pneumonia
11-30  Person  20001-40000  pneumonia
11-30  Person  20001-40000  flu
11-30  Person  20001-40000  gastritis
11-30  Person  20001-40000  pneumonia
11-30  Person  20001-40000  flu
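The step from Table 1 to Table 3 can be sketched in code. This is only an illustration of standard generalization and suppression, not the paper's s-tuple inclusion method; the cut-offs below are simply read off the two tables:

```python
# Generalize Table 1 into Table 3: coarsen Age and PIN code into ranges,
# and suppress Gender to "Person" in the second equivalence class.

table1 = [
    (5,  "Male",   12000, "gastric ulcer"),
    (9,  "Male",   14000, "dyspepsia"),
    (6,  "Male",   18000, "pneumonia"),
    (8,  "Male",   19000, "bronchitis"),
    (12, "Female", 22000, "pneumonia"),
    (19, "Male",   24000, "pneumonia"),
    (29, "Female", 33000, "flu"),
    (26, "Female", 36000, "gastritis"),
    (28, "Male",   37000, "pneumonia"),
    (21, "Male",   38000, "flu"),
]

def generalize(age, gender, pin, disease):
    if age <= 10:  # first equivalence class: gender is uniform, so it is kept
        return ("1-10", gender, "10001-20000", disease)
    # second equivalence class: gender varies, so it is suppressed
    return ("11-30", "Person", "20001-40000", disease)

table3 = [generalize(*row) for row in table1]
for row in table3:
    print(*row)
```

The sensitive attribute (Disease) is left untouched; only the quasi-identifiers are generalized, which is what makes the published rows within a class indistinguishable.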
[Figure: The micro dataset is k-anonymized into equivalence classes (e.g., PIN code ranges 10001-20000, 20001-30000 and 30001-40000 under 20001-40000, age range 1-39); some of the resulting equivalence classes violate the diversity property while others satisfy it.]
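The workflow in the figure, partitioning a k-anonymized table into equivalence classes and then testing each class for diversity, can be sketched as follows. The distinct-value l-diversity test and the sample rows are illustrative assumptions, not the paper's algorithm:

```python
# Group rows into equivalence classes by quasi-identifier values, then flag
# classes whose sensitive attribute shows fewer than l distinct values.
from collections import defaultdict

def equivalence_classes(rows, qid_idx=(0, 1, 2)):
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in qid_idx)].append(row)
    return groups

def violates_diversity(group, l, sensitive_idx=3):
    """Distinct l-diversity: the class needs at least l distinct sensitive values."""
    return len({row[sensitive_idx] for row in group}) < l

published = [
    ("1-10",  "Male",   "10001-20000", "gastric ulcer"),
    ("1-10",  "Male",   "10001-20000", "dyspepsia"),
    ("1-10",  "Male",   "10001-20000", "pneumonia"),
    ("11-30", "Person", "20001-40000", "pneumonia"),
    ("11-30", "Person", "20001-40000", "pneumonia"),
    ("11-30", "Person", "20001-40000", "pneumonia"),
]

for qid, group in equivalence_classes(published).items():
    status = "violating" if violates_diversity(group, l=3) else "satisfying"
    print(qid, "->", status, "diversity")
```

Here the first class carries three distinct diseases and satisfies 3-diversity, while the second carries only one and violates it; the paper's s-tuple inclusion idea is to repair such violating classes by adding spurious tuples rather than by further generalization.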