Abstract— There has been a great amount of work in recent years on privacy preservation, but most of it considers anonymization as the principle for publishing data. k-anonymization [1, 2] is a popular privacy protection mechanism in data publishing. Such mechanisms protect the data only through generalization, suppression, or distortion as the anonymization technique. These methods publish data in an effective manner but involve information loss. Although the privacy guarantee is the most important data publication criterion, the published data must also provide a reasonable level of utility so that it remains useful for applications such as data mining or data analysis. Utility is a tradeoff for the privacy guarantee, since anonymization of data introduces information loss.

In this paper, a simple method for privacy-preserving publication of datasets that reduces information loss to a certain extent is introduced. Utility is given more importance, without compromising privacy. To handle the problem, we develop a simple approach that includes spurious tuples in the published dataset to satisfy the diversity property. The method gives the adversary no scope to make inferences from the published dataset. Experiments on a medical dataset showed that our method is efficient and can produce published data of high utility.

Index Terms—Privacy, Spurious tuples, Anonymity, Diversity

*-Ph.D Scholar, Dept. of Computer Science Engineering, JNTUK, Kakinada, India, reddysadi@gmail.com and **-Dept. of Computer Science and Systems Engineering, College of Engineering (A), Andhra University, Visakhapatnam-530003, India, vallikumari@gmail.com, kvsvn.raju@gmail.com.

I. INTRODUCTION

Information sharing has become a common phenomenon with the advent of technology and global networking. There is an enormous demand for person-specific data such as shopping habits, criminal records, medical history, credit records and many more. Data mining is used by several applications for analyzing such data. This data is an important asset to business organizations and their goals, and it helps in decision making as well as in providing social benefits such as medical research, crime reduction and national security [3]. Assume a hospital publishes patient information for statistical use, suppressing the identifying attributes. As voter registration lists are publicly accessible, a careful correlation of the attributes in the two lists would reveal sensitive information about an individual [2]. For example, a disease, which the individual did not wish to reveal, might get revealed. In data mining, ensuring privacy means showing that the results published or mined do not inherently disclose individual information. The goal of privacy-preserving data mining is that the knowledge present in the data is extracted for use, the individual's privacy is protected, and the data holder is protected against misuse or disclosure of data.

Studies show that 87% [2] of the US population can be uniquely identified from the combination of three attributes: date of birth, gender and 5-digit zip code. The set of attributes that, in combination, uniquely identify individuals when linked with external data are called quasi-identifiers. Several approaches have evolved to provide data privacy, such as k-anonymity [1, 2], l-diversity [13], t-closeness [16], personalized privacy preservation [14] and so on. s-Tuple inclusion based privacy preservation is proposed in this paper; it solves this privacy problem in a more attractive way than the above approaches in terms of both utility and privacy.

In summary, this paper contributes the following. Our study has shown that several existing techniques incur too much information loss while achieving privacy. We address the data privacy problem using the s-tuple inclusion approach, a new perspective on the privacy problem in data publishing. The method ensures more privacy and less information loss. This paper supports these claims through experimental results as well as theoretical evaluation.

The rest of the paper is organized as follows: Section 2 covers related work. Section 3 gives the problem definition. Section 4 contains a description of the privacy-preserving model. Section 5 describes the possible attacks. Section 6 discusses the experimental results. Section 7 concludes the paper with a discussion of what needs to be done further.

II. RELATED WORK

One privacy-concerned problem is publishing microdata for public use [4], which has been extensively studied. A large category of privacy attacks re-identifies individuals by joining the published table with external tables modeling the background knowledge of users. There are several works discussing privacy preservation in data mining. Data modification methods are the simplest ways to preserve privacy: perturbation [5] (altering an attribute value to a new value or adding noise), blocking [6] (replacing an existing attribute value with a "?"), aggregation [5] (combining several values into a coarser category), swapping [7, 8] (interchanging values of individual records), and sampling [9] (releasing data for only a sample of a population).

Attributes can be classified into two groups according to the business rules of an organization: the privacy disclosure set and the non-privacy disclosure set of attributes [3]. Any attribute that can pinpoint an individual or organization falls in the privacy disclosure (PD) set of attributes. The rest of the attributes fall in the non-privacy disclosure (NPD) set. Further, the NPD set is categorized into sensitive (NPDS) and non-sensitive (NPDNS) attributes. The sensitive attributes have a great impact on privacy when they are considered together with the attributes of the PD set; the non-sensitive attributes have lesser or no impact on privacy. This means that if any sensitive attributes of the non-privacy disclosure set are requested by the miner along with any attribute of the privacy disclosure set, it leads to privacy violation. This categorization of attributes into PD and NPD sets is done purely on the basis of business rules. The work in [3] preserves privacy by transforming the confidential attributes into fuzzy values.

k-anonymity [1, 2] is one widely discussed approach for achieving data privacy. The attributes that help reveal information when combined with other attributes are called quasi-identifier (QID) attributes, or the privacy disclosure (PD) set of attributes [3]. The attributes that hold private information about an individual and should not be disclosed in combination with any QID are called sensitive attributes [1, 2] and form the non-privacy disclosure sensitive attribute set (NPDS) [3]. Transforming the QID values, typically by generalization or suppression, so that each record becomes indistinguishable from at least k-1 others is called anonymization [10]. But several problems have been identified with k-anonymity [11, 12]: a k-anonymous table may still allow an adversary to derive the sensitive information of an individual with 100% confidence. The larger the value of k, the better the privacy is protected, but there is considerable information loss from the data.

k-anonymity allows an attacker to discover the values of sensitive attributes when there is little diversity in those sensitive attributes. To counter this, another scheme called l-diversity [13] was proposed; its effect is shown in Table 3. l-diversity provides privacy even when the data publisher does not know what kind of knowledge is possessed by the adversary. It ensures that all tuples sharing the same quasi-identifier values have l-diverse values for their sensitive attributes. Even l-diversity is prone to attacks by an adversary and may still allow a privacy breach [14]. Anatomy [15] is another l-diversity-specific method; though it does not violate the l-diversity property, it confirms that a particular individual is included in the data. t-closeness is another scheme, which recommends that the table-wide distribution of sensitive attribute values be reproduced within each anonymized group [16].
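The re-identification (linking) attack discussed above can be illustrated with a small, entirely hypothetical example in code; the names, records, and the `link` helper below are invented for illustration and are not from the paper:

```python
# A "de-identified" medical table still carries quasi-identifiers (Age,
# Gender, PIN code), so a join against a public voter list re-identifies
# the individuals. All records here are fabricated.

medical = [  # published without names
    {"Age": 29, "Gender": "Female", "PIN": 33000, "Disease": "flu"},
    {"Age": 28, "Gender": "Male",   "PIN": 37000, "Disease": "pneumonia"},
]

voters = [  # publicly accessible registration list
    {"Name": "Alice", "Age": 29, "Gender": "Female", "PIN": 33000},
    {"Name": "Bob",   "Age": 28, "Gender": "Male",   "PIN": 37000},
]

QID = ("Age", "Gender", "PIN")

def link(published, external):
    """Join the two tables on the quasi-identifier attributes."""
    index = {tuple(v[a] for a in QID): v["Name"] for v in external}
    return {index[q]: r["Disease"]
            for r in published
            if (q := tuple(r[a] for a in QID)) in index}

print(link(medical, voters))  # maps each re-identified voter to a disease
```

When every quasi-identifier combination is unique, as here, the join reveals each individual's sensitive value, which is exactly what anonymization schemes try to prevent.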
Table 1: Microdata

Age  Gender  PIN code  Disease
5    Male    12000     gastric ulcer
9    Male    14000     dyspepsia
6    Male    18000     pneumonia
8    Male    19000     bronchitis
12   Female  22000     pneumonia
19   Male    24000     pneumonia
29   Female  33000     flu
26   Female  36000     gastritis
28   Male    37000     pneumonia
21   Male    38000     flu

Table 3: 3-Diverse data

Age    Gender  PIN code     Disease
1-10   Male    10001-20000  gastric ulcer
1-10   Male    10001-20000  dyspepsia
1-10   Male    10001-20000  pneumonia
1-10   Male    10001-20000  bronchitis
11-30  Person  20001-40000  pneumonia
11-30  Person  20001-40000  pneumonia
11-30  Person  20001-40000  flu
11-30  Person  20001-40000  gastritis
11-30  Person  20001-40000  pneumonia
11-30  Person  20001-40000  flu
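The step from Table 1 to Table 3 can be sketched in code. This is only an illustration of standard generalization and suppression, not the paper's s-tuple inclusion method; the cut-offs below are simply read off the two tables:

```python
# Generalize Table 1 into Table 3: coarsen Age and PIN code into ranges,
# and suppress Gender to "Person" in the second equivalence class.

table1 = [
    (5,  "Male",   12000, "gastric ulcer"),
    (9,  "Male",   14000, "dyspepsia"),
    (6,  "Male",   18000, "pneumonia"),
    (8,  "Male",   19000, "bronchitis"),
    (12, "Female", 22000, "pneumonia"),
    (19, "Male",   24000, "pneumonia"),
    (29, "Female", 33000, "flu"),
    (26, "Female", 36000, "gastritis"),
    (28, "Male",   37000, "pneumonia"),
    (21, "Male",   38000, "flu"),
]

def generalize(age, gender, pin, disease):
    if age <= 10:  # first equivalence class: gender is uniform, so it is kept
        return ("1-10", gender, "10001-20000", disease)
    # second equivalence class: gender varies, so it is suppressed
    return ("11-30", "Person", "20001-40000", disease)

table3 = [generalize(*row) for row in table1]
for row in table3:
    print(*row)
```

The sensitive attribute (Disease) is left untouched; only the quasi-identifiers are generalized, which is what makes the published rows within a class indistinguishable.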
[Figure: The micro dataset is k-anonymized into equivalence classes (e.g., PIN code ranges 10001-20000, 20001-30000 and 30001-40000 under 20001-40000, age range 1-39); some of the resulting equivalence classes violate the diversity property while others satisfy it.]
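The workflow in the figure, partitioning a k-anonymized table into equivalence classes and then testing each class for diversity, can be sketched as follows. The distinct-value l-diversity test and the sample rows are illustrative assumptions, not the paper's algorithm:

```python
# Group rows into equivalence classes by quasi-identifier values, then flag
# classes whose sensitive attribute shows fewer than l distinct values.
from collections import defaultdict

def equivalence_classes(rows, qid_idx=(0, 1, 2)):
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in qid_idx)].append(row)
    return groups

def violates_diversity(group, l, sensitive_idx=3):
    """Distinct l-diversity: the class needs at least l distinct sensitive values."""
    return len({row[sensitive_idx] for row in group}) < l

published = [
    ("1-10",  "Male",   "10001-20000", "gastric ulcer"),
    ("1-10",  "Male",   "10001-20000", "dyspepsia"),
    ("1-10",  "Male",   "10001-20000", "pneumonia"),
    ("11-30", "Person", "20001-40000", "pneumonia"),
    ("11-30", "Person", "20001-40000", "pneumonia"),
    ("11-30", "Person", "20001-40000", "pneumonia"),
]

for qid, group in equivalence_classes(published).items():
    status = "violating" if violates_diversity(group, l=3) else "satisfying"
    print(qid, "->", status, "diversity")
```

Here the first class carries three distinct diseases and satisfies 3-diversity, while the second carries only one and violates it; the paper's s-tuple inclusion idea is to repair such violating classes by adding spurious tuples rather than by further generalization.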