
A Special Topic Seminar

On

Privacy Preservation Techniques in Data Mining


Submitted to Mumbai University

In partial fulfilment of the requirement for the degree of

MASTER OF ENGINEERING

In

INFORMATION TECHNOLOGY

By

Meena Talele

Under the guidance of

Prof. Smita Jangale and Dr. M. Vijayalakshmi

Department of Information Technology

Vivekanand Education Society’s Institute of Technology

Chembur, Mumbai-400074

2018-2019
VIVEKANAND EDUCATION SOCIETY’S INSTITUTE
OF TECHNOLOGY

DEPARTMENT OF INFORMATION TECHNOLOGY

Certificate
This is to certify that Meena Talele has satisfactorily carried out the special
topic seminar work entitled Privacy Preservation Techniques in Data Mining
for the degree of Master of Engineering in Information Technology of the
University of Mumbai.

Prof. Mrs. Smita Jangale
(Project Guide)

Dr. M. Vijayalakshmi
(Project Guide)

Dr. Mrs. Shalu Chopra
(Head of Department)

Dr. Mrs. J. M. Nair
(Principal)

(External Examiner)


Acknowledgement

I sincerely thank my internal guides Prof. Mrs. Smita Jangale and Dr. M.
Vijayalakshmi for their support, co-operation, guidance and, most importantly, for
motivating me during the course of this report. Their motivation and technical
acumen have been of immense help in the successful completion of this special
topic report.

I would also like to sincerely thank Dr. Mrs. Shalu Chopra, Head of
Department, Department of Information Technology, Vivekanand Education
Society's Institute of Technology, for all her guidance and technical expertise that
she has shared with me during the progression of this report.

I am also grateful to our principal Dr. Mrs. J. M. Nair and all the staff and
management of Vivekanand Education Society's Institute of Technology for
providing their valuable cooperation, without which it would not have been possible
to complete this work successfully.

A word of thanks to my family members and all my academician friends
who have, in their own special ways, directly and indirectly extended their
valuable support and contributed to the completion of my work.

Meena Talele
M.E (IT) (Sem.III)
VESIT

Abstract

In recent years, privacy preserving data mining (PPDM) has emerged as a very active
research area. This field of research studies how knowledge or patterns can be extracted from large
data stores while maintaining commercial or legislative privacy constraints. Quite often, these
constraints pertain to individuals represented in the data stores. While data collectors strive to
derive new insights that would allow them to improve customer service and increase their sales
by better understanding customer needs, consumers are concerned about the vast quantities of
information collected about them and how this information is put to use. Privacy preserving data
mining aims to settle these conflicting interests. How these two contrasting goals, mining new
knowledge while protecting individuals' privacy, can be reconciled is the focus of current
research, and improving the trade-off between privacy and utility when data is mined is the need
of the hour.

Let us consider three objective functions: the accuracy of the data mining model (e.g., the
expected accuracy of a resulting classifier, estimated by its performance on test samples), the
size of the mined database (the number of training samples), and the privacy requirement,
represented by a privacy parameter. In a given situation, one or more of these factors may be
fixed: a client may set a lower bound on the acceptable accuracy of a classifier, the database
may contain a limited number of samples, or a regulator may impose privacy restrictions. Within
the given constraints, we wish to improve the objective functions: achieve better accuracy with
fewer learning examples and better privacy guarantees.

Table of Contents

Acknowledgement ...................................................................................................................... I

Abstract ..................................................................................................................................... II

List of Figures ............................................................................................................................ V

List of Tables ........................................................................................................................... VI

Chapter 1 ...................................................................................................................................1

Introduction ................................................................................................................................1

1.1 Background...................................................................................................................1

1.2 Motivation ....................................................................................................................1

1.3 Privacy Preserving Techniques .....................................................................................2

1.3.1 Randomisation Technique ......................................................................................4

1.3.2 K-anonymity Technique.........................................................................................5

1.3.3 Cryptographic Technique .......................................................................6

1.4 Evaluation of Privacy Preserving Techniques ..............................................7

1.5 Summary ......................................................................................................7

Chapter 2 ...................................................................................................................................8

Literature Review........................................................................................................................8

Chapter 3 ................................................................................................................................. 11

Privacy Preservation.................................................................................................................. 11

3.1 Data Mining ................................................................................................................ 11

3.2 Privacy Preservation .................................................................................. 12

3.3 Privacy Preserving in Data Mining .............................................................. 13

3.4 Privacy Preserving Models .......................................................................... 14

3.5 Applications of PPDM ................................................................................ 15

3.6 PPDM Framework ...................................................................................... 15

3.7. Privacy Preserving Implementation Dimensions.......................................... 16

3.7.1. Data Distribution .................................................................... 16

3.7.2. Data Modification .................................................................... 17

3.7.3. Data Mining Algorithm............................................................ 17

3.7.4. Data or Rule Hiding ................................................................. 17

3.7.5. Privacy Preservation ................................................................ 17

Chapter 4 ................................................................................................................................. 18

Privacy Preservation Techniques ............................................................................................... 18

4.1 Overview .................................................................................... 18

4.2 Randomization based Techniques ............................................... 18

4.3 K-Anonymity ............................................................................. 21

4.3.1. Linkage Attack ................................................................... 22

4.3.2. k-Anonymization Method .................................................... 22

4.3.3 Attack on k-Anonymity........................................................................................ 26

4.3.4 Generalization and Suppression ........................................................................... 27

4.3.5. l-diversity ............................................................................................................ 28

4.4 Encryption (cryptographic Techniques) ....................................................................... 29

4.5 Summary .................................................................................................................... 31

Chapter 5 ................................................................................................................................. 32

Conclusion and Future Scope .................................................................................................... 32

5.1 Conclusion .................................................................................................................. 32

5.2 Future Scope ............................................................................................................... 32

References ................................................................................................................................ 34

List of Figures
Figure 1: Privacy Preservation......................................................................................2
Figure 2: Classification of Privacy Preservation Data Mining Techniques ....................4
Figure 3: Randomisation Technique .............................................................................5
Figure 4: Data Mining Functionalities ........................................................................ 12
Figure 5: PPDM Framework ...................................................................................... 15
Figure 6: Linkage Attack............................................................................................ 22
Figure 7: Domain and Value Generalization Hierarchy including Suppression ........... 27

List of Tables
Table 1: Comparison of the Data Perturbation Methods ............................................. 21
Table 2: Mortgage Company Data .............................................................................. 23
Table 3: Original Patient Table................................................................................... 26
Table 4: 3-Anonymous Version of Patient Table ........................................................ 26
Table 5: Advantages and Limitations of PPDM Techniques ....................................... 31
Table 6: Comparison of Various PPDM Techniques .................................................. 31

Chapter 1

Introduction

1.1 Background
The proliferation of information technologies and the Internet in the past two decades has
brought a wealth of individual information into the hands of commercial companies and
government agencies. As hardware costs go down, organizations find it easier than ever to keep
any piece of information acquired from the ongoing activities of their clients. Data owners
constantly seek to make better use of the data they possess, and utilize data mining tools to
extract useful knowledge and patterns from the data. As a result, there is a growing concern about
the ability of data owners, such as large corporations and government agencies, to abuse this
knowledge and compromise the privacy of their clients, a concern which has been reflected in the
actions of legislative bodies.

This concern is exacerbated by actual incidents that demonstrate how difficult it is to use
and share information while protecting individuals' privacy. One example is from August 2006
[10], when AOL published on their website a data set of 20 million web searches for research
purposes. Although the data set was believed to be anonymized, New York Times journalists
showed how the released information could be used to expose the identities of the searchers
and learn quite a lot about them. Another example relates to the Netflix prize contest that took
place between October 2006 and September 2009. Netflix published a dataset consisting of
more than 100 million movie ratings from over 480,000 of its customers, and invited the
research community to contend for improvements to its recommendation algorithm. To protect
customer privacy, Netflix removed all personal information identifying individual customers and
perturbed some of the movie ratings. Despite these precautions, researchers showed that, with
relatively little auxiliary information, anonymous customers can be re-identified.

1.2 Motivation
Privacy preserving data mining (PPDM) has emerged as a very active research area. This
field of research studies how knowledge or patterns can be extracted from large data stores while
maintaining commercial or legislative privacy constraints. Quite often, these constraints pertain
to individuals represented in the data stores. While data collectors strive to derive new insights
that would allow them to improve customer service and increase their sales by better
understanding customer needs, consumers are concerned about the vast quantities of information
collected about them and how this information is put to use. Privacy preserving data mining aims
to settle these conflicting interests. How these two contrasting goals, mining new knowledge
while protecting individuals' privacy, can be reconciled is the focus of current research, and
improving the trade-off between privacy and utility when data is mined is the need of the hour.
Let us consider three objective functions: the accuracy of the data mining model (e.g., the
expected accuracy of a resulting classifier, estimated by its performance on test samples), the
size of the mined database (the number of training samples), and the privacy requirement,
represented by a privacy parameter. In a given situation, one or more of these factors may be
fixed: a client may set a lower bound on the acceptable accuracy of a classifier, the database may
contain a limited number of samples, or a regulator may impose privacy restrictions. Within the
given constraints, we wish to improve the objective functions: achieve better accuracy with
fewer learning examples and better privacy guarantees [3].

1.3 Privacy Preserving Techniques

Figure 1: Privacy Preservation

Data mining and knowledge discovery in databases are two new research areas that
investigate the automatic extraction of previously unknown patterns from large amounts of data.
Recent advances in data collection, data dissemination and related technologies have inaugurated
a new era of research where existing data mining algorithms should be reconsidered from a
different point of view, that of privacy preservation. It is well documented that this new, limitless
explosion of information through the Internet and other media has reached a point where threats
against privacy are common on a daily basis and deserve serious thinking. Privacy preserving
data mining is a novel research direction in data mining and statistical databases, where data
mining algorithms are analysed for the side-effects they incur in data privacy.

The main consideration in privacy preserving data mining is twofold. First, sensitive raw
data like identifiers, names, addresses and the like should be modified or trimmed out from the
original database, in order for the recipient of the data not to be able to compromise another
person's privacy. Second, sensitive knowledge which can be mined from a database by using
data mining algorithms should also be excluded, because such knowledge can equally well
compromise data privacy, as we will indicate. The main objective in privacy preserving data
mining is to develop algorithms for modifying the original data in some way, so that the private
data and private knowledge remain private even after the mining process. The problem that
arises when confidential information can be derived from released data by unauthorized users is
commonly called the “database inference” problem. In this report, we provide a classification
and an extended description of the various techniques and methodologies that have been
developed in the area of privacy preserving data mining. Data mining is the process of extracting
or mining knowledge from large amounts of data. The extracted knowledge can be used for
decision making, process control, information management, query processing and so on.

Nowadays, data mining is used widely in many applications and huge volumes of data are
collected. Because data mining extracts information from large databases, it may make the data
vulnerable and lead to misuse. Some examples of sensitive data are credit card/debit card
details, criminal records, medical history, identity information, etc. Thus, it is necessary to have
a privacy policy to secure the sensitive personal data of individuals, and Privacy Preserving Data
Mining has emerged as a very important research area. Privacy Preserving Data Mining deals
with protecting an individual's sensitive data.

The classification of privacy preservation data mining techniques is illustrated in Figure 2.

Figure 2: Classification of Privacy Preservation Data Mining Techniques

1.3.1 Randomisation Technique


Randomisation techniques entail masking real values by adding additional values to the
original data, so that attackers cannot use the distorted published data to identify individual
records. Data distortion is an effective method recommended for privacy preserving data
collection, whereby the original data is perturbed to conceal the real data values. One approach
toward privacy protection in data mining is to perturb the input (the data) before it is mined
[7],[13]. Thus, it is claimed, the original data remains secret, while the added noise averages out
in the output.

This approach has the benefit of simplicity. At the same time, it takes advantage of the
statistical nature of data mining and directly protects the privacy of the data. The drawback of the
perturbation approach is that it lacks a formal framework for proving how much privacy is
guaranteed. This lack has been exacerbated by recent evidence that for some data, and some
kinds of noise, perturbation provides no privacy at all. Recent models for studying the privacy
attainable through perturbation offer solutions to this problem in the context of statistical
databases [7],[13]. Randomisation techniques are useful at data collection time, where they
conceal an individual's private data, and they are considered easy to implement efficiently.
However, the demerit of most randomisation techniques is a reduction of data utility, i.e. a
reduction in the ability to obtain meaningful information through data mining. There is therefore
a challenging trade-off between the privacy-preservation level and the information loss. Figure 3
shows the block diagram for the randomisation technique in privacy preservation [13].

Figure 3: Randomisation Technique

1.3.2 K-anonymity Technique


K-anonymity, a definition of privacy conceived in the context of relational databases, has
received a lot of attention in the past decade. It assumes that the owner of a data table can
separate the columns into public ones, known as quasi-identifiers, and private ones. Public
columns may appear in external tables, and thus be available to an attacker. Private columns
contain data which is not available in external tables and needs to be protected. The guarantee
provided by k-anonymity is that an attacker will not be able to link private information to groups
of less than k individuals [3]. This is enforced by making certain that every combination of
public attribute values in the release appears in at least k rows. The k-anonymity model of
privacy was studied intensively in the context of public data releases, where the database owner
wishes to ensure that no one will be able to link information gleaned from the database to the
individuals from whom the data was collected. The method was also leveraged to provide
anonymity in other contexts, such as anonymous message transmission and location privacy. In
recent years, the assumptions underlying the k-anonymity model have been challenged, and the
AOL and Netflix data breaches [10] demonstrated the difficulties in ensuring anonymity.

1.3.3 Cryptographic Technique


At the same time, a second branch of privacy preserving data mining was developed,
using cryptographic techniques to prevent information leakage during the computation of the
data mining model. This branch became hugely popular for two main reasons: first,
cryptography offers a well-defined model for privacy, which includes methodologies for proving
and quantifying it; second, there exists a vast toolset of cryptographic algorithms and constructs
for implementing privacy-preserving data mining algorithms.

Most encryption techniques developed for privacy preserving data mining conducted
jointly by multiple parties are founded on Secure Multiparty Computation (SMC), which aims to
provide a secure computation in which the parties know nothing but their own input and the
expected results. SMC is a special protocol that allows computation on private data from several
parties without compromising the data security and privacy of any participating party in the joint
data mining.

In homomorphic encryption, encrypted results produced by computation performed on
ciphertexts correspond to the results of the same operations on the plaintexts [2]. Computation is
done without having to decrypt the ciphertexts. Partially Homomorphic Encryption (PHE) and
Fully Homomorphic Encryption (FHE) are the two forms of homomorphic encryption. PHE
supports either additive or multiplicative homomorphism, but not both, whereas FHE exhibits
both. Additive homomorphism means that if two different plaintexts are encrypted separately
using the same keys and the ciphertexts are then combined, the resulting ciphertext is the same as
if the plaintexts were first added and then encrypted with the same key. Multiplicative
homomorphism is defined in the same way, except that the texts are multiplied instead of added.
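
To make additive homomorphism concrete, the following is a minimal Python sketch of the
Paillier cryptosystem, a well-known additively homomorphic scheme (the toy-sized primes are
for illustration only and provide no real security; Python 3.9+ is assumed for math.lcm and the
modular-inverse form of pow):

import math
import random

p, q = 293, 433                 # toy primes; real deployments use ~1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1                       # standard choice of generator
lam = math.lcm(p - 1, q - 1)    # Carmichael function lambda(n)
L = lambda x: (x - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)   # precomputed decryption constant

def encrypt(m):
    # fresh randomness r, coprime to n, for every ciphertext
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(17), encrypt(25)
# Multiplying ciphertexts corresponds to adding the plaintexts:
assert decrypt((c1 * c2) % n2) == 17 + 25

A party can thus aggregate values contributed by others without ever seeing them, which is
exactly the kind of property that SMC-style protocols exploit.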
However, recent work has pointed out that cryptography does not protect the output of a
computation; rather, it prevents privacy leaks in the process of computation. Thus, it falls short
of providing a complete answer to the problem of privacy preserving data mining [7].

1.4 Evaluation of Privacy Preserving Techniques

• Efficiency - The ability of the algorithm to execute with good performance with regard
to the available resources.
• Scalability - An evaluation of the efficiency of the algorithm as the dataset size is
increased.
• Data quality - The quality of the data after application of the privacy preserving
technique, considered both for the data itself and for the quality of the data mining results.
• Hiding failure - The portion of sensitive information that is not hidden by the application
of the privacy preserving technique.
• Privacy Level - An estimated degree of uncertainty according to which sensitive
information can still be predicted even after being hidden [3].

1.5 Summary

With the increased capability to collect and store huge amounts of data, invaluable insight
can be gained by using data mining techniques. The need for accurate data mining results while
preventing the disclosure of sensitive data and information has led to the emergence of Privacy-
Preserving Data Mining. Notwithstanding the challenge of balancing privacy and accuracy in
data mining, the past decade has seen serious efforts being made to protect privacy in data
mining. It is thus necessary to address the privacy/accuracy trade-off problem by considering the
privacy and algorithmic requirements simultaneously. Hence, algorithmic data mining processes,
and how privacy considerations may influence the way the data miner accesses and processes
the data, need to be investigated. Analyses and experimental evaluations confirm that
algorithmic decisions made with privacy considerations in mind may have a profound impact on
the accuracy of the resulting data mining models.

Chapter 2

Literature Review

The publication and dissemination of raw data are crucial elements in commercial,
academic, and medical applications. With an increasing number of open platforms, such as social
networks and mobile devices, from which data may be collected, the volume of such data has
also increased over time. Consequently, applying traditional algorithms to large collected data
sets is not efficient, and in some cases not feasible. It is common for large data sets to be
processed with distributed platforms such as the MapReduce framework in order to distribute a
costly process among multiple nodes and achieve considerable performance improvement. While
larger data sets pose greater challenges for anonymization from an efficiency perspective, they
also provide greater potential for using the larger number of individuals to hide the identity of
any one individual in this larger cohort. Published data must, after all, have enough quality and
utility to be considered useful. Clearly, anonymization degrades the quality and utility of the
underlying data. Therefore, the principal question in publishing such data is: “How to preserve
the privacy of individuals while publishing data of high utility?”

Privacy-preserving models broadly fall into two different settings, referred to as input
and output privacy. In input privacy, one is primarily concerned with publishing anonymized
data using models such as k-anonymity and l-diversity. In output privacy, one is generally
interested in problems such as association rule hiding and query auditing, where the output of
different data mining algorithms is either perturbed or audited in order to preserve privacy. This
report is primarily concerned with input privacy algorithms such as k-anonymity and l-diversity
because they are fundamental privacy models, and addressing these problems sets the stage for
other, more complex privacy models.

Much of the work in privacy has focused on the quality of privacy preservation
(vulnerability quantification) and the utility of the published data. However, the issue of
scalability has received little attention in the literature. Because of advances in hardware- and
software-based data collection technologies, it is now common to encounter big data sets of
terabyte or petabyte size. Examples of particularly sensitive data of large size include healthcare
data and web search logs. However, using many of the existing privacy-preserving algorithms to
anonymize such big data is impractical, as they do not scale and have unacceptable running
times. In particular, when the data is large enough, it necessitates the use of a distributed
framework such as MapReduce to make such intensive processing possible.

One solution is to simply divide the data into smaller parts, called fragments, and
anonymize each part independently. This is similar to anonymizing streaming data. However,
such an approach is naive when one has access to all the data at a given time: this additional
access can be used to reduce the level of generalization and randomization in the anonymization
process, and it is ignored by the naive solution of horizontally splitting the data into smaller parts
and then anonymizing each part in isolation.

A branch of privacy preserving data mining was developed using cryptographic
techniques to prevent information leakage during the computation of the data mining model. This
branch became hugely popular for two main reasons: first, cryptography offers a well-defined
model for privacy, which includes methodologies for proving and quantifying it; second, there
exists a vast toolset of cryptographic algorithms and constructs for implementing privacy-
preserving data mining algorithms. However, recent work has pointed out that cryptography does
not protect the output of a computation; rather, it prevents privacy leaks in the process of
computation. Thus, it falls short of providing a complete answer to the problem of privacy
preserving data mining.

K-anonymity assumes that the owner of a data table can separate the columns into public
ones (quasi-identifiers) and private ones. Public columns may appear in external tables, and thus
be available to an attacker. Private columns contain data which is not available in external tables
and needs to be protected. The guarantee provided by k-anonymity is that an attacker will not be
able to link private information to groups of less than k individuals. This is enforced by making
certain that every combination of public attribute values in the release appears in at least k rows.
The k-anonymity model of privacy was studied intensively in the context of public data releases,
where the database owner wishes to ensure that no one will be able to link information gleaned
from the database to the individuals from whom the data was collected. The method was also
leveraged to provide anonymity in other contexts, such as anonymous message transmission and
location privacy. Although k-anonymity can prevent identity attacks, it fails to protect against
attribute disclosure attacks. For instance, an attacker who knows the quasi-identifier attribute
values of an individual can find out that she suffers from myocarditis by examining a
2-anonymized table in which her equivalence class contains only that disease. This is because of
the lack of diversity in the sensitive attribute within the equivalence class. The l-diversity model
mandates that each equivalence class must have at least l well-represented sensitive values.

Chapter 3

Privacy Preservation
3.1 Data Mining
Data mining uses various data analysis tools to discover patterns/relationships in data to
make valid predictions Data Mining, also called Knowledge Discovery in Databases (KDD),
refers to nontrivial extraction of previously unknown/useful information from databases. Though
data mining and KDD are treated as synonyms, data mining is part of knowledge discovery
process Data mining is an iterative/interactive discovering something innovative. Data mining
differs from On-Line Analytical Processing (OLAP) as instead of verifying hypothetical
patterns; it uses data to uncover patterns. It is an inductive process Data mining uses advances in
artificial intelligence and statistics. Both disciplines were working on pattern recognition and
classification problems. Both contributed to the understanding and application of neural nets and
decision trees. Data mining is automated techniques to extract buried/unknown information from
large databases. Data mining is used for four purposes:

i. To improve customer acquisition/retention;

ii. To identify internal inefficiencies and then revamp operations;

iii. To reduce fraud, and

iv. To map unexplored internet terrain.

The primary tools used in data mining include neural networks (NN), decision trees, rule
induction, and data visualization.

Figure 4: Data Mining Functionalities

The first processing step is data preparation, often called “scrubbing the data”. Data is
selected, cleansed, and preprocessed under a domain expert's guidance and knowledge. In the
second step, the data mining algorithm processes the prepared data, compressing and
transforming it to make the valuable information easy to identify. The third phase is data
analysis, where the data mining output is evaluated to check whether any additional domain
knowledge has been discovered and to determine the relative importance of the facts generated
by the mining algorithm.

3.2 Privacy Preservation


Privacy preservation is defined as “each individual's ability to control the circulation of
information relating to him/her”. It is also defined as “the claim of single persons, groups, or
institutions to determine for themselves when, how, and to what extent information about them
is communicated to others”. Privacy also involves the confidentiality and security of the data.

The large aggregated data to be mined is stored on various servers for effortless and rapid
access, and information retrieval from this huge amount of data plays a crucial role in data
mining. But this extraction can harm the individual privacy of users, communities, etc. It is
therefore required to protect the sensitive records from the data miners.

3.3 Privacy Preserving in Data Mining


The recent exponential increase in generated data and excessive internet activity have
caused individuals' behaviour, preferences and other sensitive details to be traceable [1]. This
has become a growing privacy concern, since sophisticated statistical techniques may be
employed to harvest valuable insight from this data. The process of efficiently sifting through
huge data sets to identify previously unknown patterns is defined as data mining [2]. [4] claims
that 90% of the data available today was generated within the past couple of years and that an
exponential increase is expected; organisations may therefore benefit greatly from using data
mining techniques to acquire new knowledge that improves their decision making. Huge volumes
of data are generated from social media, online transactions, stock exchanges, organisations'
operational data and health-care patient data, to mention but a few. Since this data comes from
different sources, it tends to differ greatly in structure; hence the success of data mining
techniques depends greatly on robust database management, machine learning and complex
statistical methods. However, since data mining techniques reveal previously unknown insights,
the processes and results may well breach the privacy of both individuals and organizations.
Preserving the privacy of data and information in data mining has therefore become a genuine
concern among research communities, whose main focus is to obtain valuable insight from data
while being careful to protect private data and information [7], [8]. This area of research,
referred to as privacy-preserving data mining (PPDM), proposes to prohibit unauthorised data
access and/or usage that breaches privacy throughout all the stages of data mining [3]. Even
though various approaches have been proposed in the literature, no approach outperforms all
others. Furthermore, not enough attention has been given to classifying these techniques and
discussing the scenarios for which each approach is suited; hence a classification of PPDM
techniques is presented.

Privacy preserving data mining (PPDM) is a novel research direction in data mining and
statistical databases, where data mining algorithms are analysed for the side-effects they incur in
data privacy. The main consideration in privacy preserving data mining is twofold. First,
sensitive raw data like identifiers, names, addresses and the like should be modified or trimmed
out from the original database, in order for the recipient of the data not to be able to compromise
another person's privacy. Second, sensitive knowledge which can be mined from a database by
using data mining algorithms should also be excluded, because such knowledge can equally well
compromise data privacy. The main objective in privacy preserving data mining is to develop
algorithms for modifying the original data in some way, so that the private data and private
knowledge remain private even after the mining process.

The main goals of a PPDM algorithm are:

1. It should thwart the discovery of sensitive information.
2. It should be resistant to the various data mining techniques.
3. It should not compromise the access and use of non-sensitive data.
4. It should not have an exponential computational complexity.

3.4 Privacy Preserving Models


In data sharing, various approaches are used to design the different privacy preserving models.
Privacy-preserving data mining (PPDM) mainly considers four categories, as follows:

1. Trusted Third Party Model
A trusted third party performs the computation and delivers only the results of the
computation. The goal of secure protocols is to reach this same level of privacy
preservation without the problem of finding a third party that everyone trusts.
2. Semi-Honest Model
In the semi-honest model, every party acts as a semi-honest one: it follows the rules of
the protocol using its correct input, but it is free to use whatever it sees during the
execution of the protocol to compromise security.
3. Malicious Model
In the malicious model, no restrictions are placed on any of the participants; any party
is completely free to indulge in whatever actions it pleases. In general, it is quite
difficult to develop efficient protocols that are still valid under the malicious model.
4. Other Models - Incentive Compatibility
In the cryptographic community, the semi-honest and the malicious models have been
well researched.

3.5 Applications of PPDM
Privacy is an important issue in many data mining applications. It arises in application
fields such as:

1. Health care,
2. Security,
3. Financial and
4. Other types of sensitive applications.

3.6 PPDM Framework


The framework is constructed according to the stages in the data mining process, from
data collection and pre-processing to the final data mining procedure. The PPDM framework
contains three layers: the Data Collection Layer (DCL), the Data Pre-Process Layer (DPL) and
the Data Mining Layer (DML) [7], as shown in Figure 5. The first layer, the DCL, contains a
huge number of data providers that provide original raw data which could contain some sensitive
information. Privacy-preserving data collection can be carried out at data collection time. All the
data collected from the data providers is stored and processed in the data warehouse servers in
the DPL.

Figure 5: PPDM Framework

The second layer, the DPL, contains data warehouse servers that are responsible for
storing and pre-processing the raw data collected from the data providers. The raw data stored in
the data warehouse servers can be aggregated (as sums, averages, etc.) or pre-computed using
privacy-preserving methods in order to make the data aggregation or fusion process more
efficient. Privacy preservation in this layer concerns two aspects: one is privacy-preserving data
preprocessing for later data mining, and the other is the security of data access.

The third layer, the DML, consists of data mining servers and/or data miners, located
mostly on the Internet, that conduct the actual data mining and provide the mining results. In this
layer, privacy preservation concerns two aspects: one is improving or optimizing data mining
methods to enable privacy-preserving features, and the other is collaborative data mining over
the union of a number of data sets owned by multiple parties without revealing any private
information.

3.7. Privacy Preserving Implementation Dimensions


Many approaches have been adopted for privacy preserving data mining. They can be
classified based on the following dimensions:

1. Data distribution

2. Data modification

3. Data mining algorithm

4. Data or rule hiding

5. Privacy preservation

3.7.1. Data Distribution

The first dimension refers to the distribution of the data. Some of the privacy preserving
approaches have been developed for centralized data, while others refer to a distributed data
scenario. Distributed data scenarios can be further classified into horizontal data distribution and
vertical data distribution.

Horizontal distribution refers to the cases where different database records reside in
different places, while vertical distribution refers to the cases where all the values for different
attributes reside in different places.

3.7.2. Data Modification
The second dimension refers to the modification scheme of the data. Data modification
is mostly used to change the original values of a database that needs to be released to the public,
while ensuring strong protection of the private data. It is important that a data modification
technique be consistent with the privacy policy adopted by the organization. Methods of data
modification include:

I. Perturbation: the alteration of an attribute value with a new value (e.g., changing a
1-value to a 0-value, or adding noise);

II. Blocking: the replacement of an existing attribute value with an aggregation or
merging, which is the combination of several values;

III. Swapping: interchanging the values of individual records; and

IV. Sampling: releasing the data for only a sample of the population.
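
As a toy illustration of these four methods on a single attribute (a minimal Python sketch;
the age values are invented for illustration):

import random

random.seed(0)
ages = [29, 22, 27, 43, 52, 47, 30, 36, 32]

# Perturbation: distort each value with random noise
perturbed = [a + random.randint(-3, 3) for a in ages]

# Blocking/aggregation: replace exact values with coarser ranges
blocked = [f"{(a // 10) * 10}-{(a // 10) * 10 + 9}" for a in ages]

# Swapping: interchange values between records (a random permutation)
swapped = random.sample(ages, len(ages))

# Sampling: release the data of only a sample of the population
sampled = random.sample(ages, 3)

print(perturbed, blocked, swapped, sampled, sep="\n")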

3.7.3. Data Mining Algorithm


The third dimension refers to the data mining algorithm for which the data modification
is performed. It includes the problem of hiding data from a combination of data mining
algorithms. So far, the various data mining algorithms have mostly been considered in isolation
from each other. Among them, the most important ideas have been developed for classification
algorithms (such as decision tree inducers), clustering algorithms, rough sets, Bayesian networks,
and association rule mining algorithms.

3.7.4. Data or Rule Hiding


The fourth dimension refers to whether the raw data or the aggregated rules are hidden.
The complexity of hiding aggregated data in the form of rules is of course higher, and for this
reason the solutions developed are mostly heuristics. The data is modified so that the data miner
produces weaker inference rules that will not allow the inference of confidential values. This
process is also known as “rule confusion”.

3.7.5. Privacy Preservation


The last dimension, which is the most important, refers to the privacy preservation
technique used for the selective modification of the data. Selective modification is required in
order to achieve higher utility for the modified data while ensuring that privacy is not lost. The
important techniques of privacy preserving data mining are:

1. The randomization method
2. The anonymization method
3. The encryption method

Chapter 4

Privacy Preservation Techniques


4.1 Overview

In recent years, privacy preserving data mining (PPDM) has emerged as a very
active research area. This field of research studies how knowledge or patterns can be extracted
from large data stores while maintaining commercial or legislative privacy constraints. Quite
often, these constraints pertain to individuals represented in the data stores. While data collectors
strive to derive new insights that would allow them to improve customer service and increase
their sales by better understanding customer needs, consumers are concerned about the vast
quantities of information collected about them and how this information is put to use. Privacy
preserving data mining aims to settle these conflicting interests. How these two contrasting
goals, mining new knowledge while protecting individuals' privacy, can be reconciled is the
focus of current research, and improving the trade-off between privacy and utility when data is
mined is the need of the hour. Let us consider three objective functions: the accuracy of the data
mining model (e.g., the expected accuracy of a resulting classifier, estimated by its performance
on test samples), the size of the mined database (the number of training samples), and the privacy
requirement, represented by a privacy parameter. In a given situation, one or more of these
factors may be fixed: a client may set a lower bound on the acceptable accuracy of a classifier,
the database may contain a limited number of samples, or a regulator may impose privacy
restrictions. Within the given constraints, we wish to improve the objective functions: achieve
better accuracy with fewer learning examples and better privacy guarantees.

4.2 Randomization based Techniques


Data privacy preservation for data providers aims to protect raw data from disclosure
without the providers' permission. Because the raw data is collected directly from the data
providers, privacy preservation in the DCL can be seen as privacy preservation during data
collection. Most data modification methods used during data collection in the DCL can be
classified into two groups: value-based methods and dimension-based methods.

• Value-based methods

Random Noise Addition is the most common data perturbation method in the value-based
group [6]. It is regarded as a method of value distortion. Random Noise Addition is described in
[6] as:

xi' = xi + ri

where xi is the original data value of a one-dimensional distribution, ri is a random value drawn
from a certain distribution, and xi' is the perturbed value that is released. This method distorts
the original data values by adding random values as noise and publishing the processed values.
The distributions used in this method are usually Uniform or Gaussian. Random Noise Addition
as a method of data perturbation can be used for data pre-processing before performing the
actual mining while preserving data privacy, since the distributions, but not the individual
values, of the original data set can be reconstructed. The authors of [6] also showed that Gaussian
perturbation does better than Uniform perturbation in terms of achieving the same privacy
preservation level, but the Uniform perturbation is easier to deploy than the Gaussian. Therefore,
which distribution is selected as random noise depends on the application requirements.
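
As a minimal sketch of this method (assuming NumPy is available; the salary-like figures are
invented for illustration), Gaussian noise addition distorts individual values while leaving
aggregate statistics recoverable:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50_000, scale=10_000, size=10_000)   # original values x_i
r = rng.normal(loc=0.0, scale=5_000, size=x.size)       # random noise r_i
x_pert = x + r                                          # published values x_i' = x_i + r_i

# Individual values are distorted, but the distribution can still be studied:
print(round(x.mean()), round(x_pert.mean()))   # means stay close, since E[r] = 0
print(round(x.std()), round(x_pert.std()))     # std inflates to sqrt(var_x + var_r)

Distribution reconstruction algorithms such as the Bayesian approach of [6] would then
estimate the original distribution from the published values and the known noise distribution.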

• Dimension-based methods

The dimension-based methods were proposed to overcome the disadvantages of the
value-based methods. In real-life applications, data sets are usually multi-dimensional, which
can increase the difficulty of the data mining process and affect the data mining results,
especially in tasks where multidimensional information is crucial. However, most value-based
perturbation methods are only concerned with preserving the distribution information of a single
data dimension. Therefore, they have an inherent disadvantage in providing accurate mining
results for data-mining tasks that require information over multiple correlated data dimensions.
The dimension-based methods are concerned with keeping multidimensional information intact
when perturbing the data to preserve privacy.

The most common dimension-based methods used during data collection are Random Rotation
Transformation and Random Projection.

• Random Rotation Transformation

This method was proposed to decrease the loss of privacy while not affecting the quality
of data mining [7], and it is often used for privacy-preserving data classification. The authors in
[7] achieved this by multiplying a rotation matrix by the data set matrix:

g(X) = R · X

where R represents the rotation matrix and X is the original data set. The rotation is conducted in
a way that preserves the multi-dimensional geometric properties, such as Euclidean distances and
inner products, of the original data set. Because the rotation does not perturb all data points
equally, and points near the rotation centre may change very little, privacy protection over these
points can be weak. In order to resolve this vulnerability to rotation-oriented attacks, the rotation
centre is randomly selected in a normalized data space, so that the weakly perturbed points are
unpredictable. This Random Rotation Transformation method can provide a high level of
privacy protection and assure the expected accuracy of the data mining results.
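
A minimal sketch of the distance-preserving property (assuming NumPy; taking the Q factor of
a QR decomposition of a random Gaussian matrix is one common way to draw a random
orthogonal matrix):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 100))                  # 5 attributes x 100 records (columns)
R, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # Q factor: a random orthogonal matrix
Y = R @ X                                      # perturbed data g(X) = R.X

# Orthogonal transforms preserve Euclidean distances and inner products,
# so distance-based miners (k-NN, k-means, kernel methods) are unaffected:
d_orig = np.linalg.norm(X[:, 0] - X[:, 1])
d_pert = np.linalg.norm(Y[:, 0] - Y[:, 1])
assert np.isclose(d_orig, d_pert)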

• Random Projection

This is a promising dimension-based data perturbation method. The original idea was
proposed in order to reduce the dimensionality of the original data set by projecting the set of
data points from a high-dimensional space to a randomly chosen lower-dimensional subspace. It
was shown that the projection can approximately preserve the inner product, which is directly
related to several distance-based metrics, by conducting row-wise and column-wise projection of
the sample data. These properties guarantee that both the dimensionality and the exact value of
each element of the original data are kept confidential, as long as the data and random noise are
drawn from a continuous real domain and all involved participating parties are semi-honest.
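
A minimal sketch of approximate inner-product and norm preservation under random
projection (assuming NumPy; the 1/sqrt(k) scaling is the usual Johnson-Lindenstrauss-style
normalization):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 50))                 # 1000-dimensional records as columns
k = 100                                         # target (much lower) dimensionality
P = rng.normal(size=(k, 1000)) / np.sqrt(k)     # random projection matrix
Y = P @ X                                       # k-dimensional perturbed data

# Norms and inner products are preserved in expectation; the approximation
# tightens as k grows (Johnson-Lindenstrauss lemma):
print(np.linalg.norm(X[:, 0]), np.linalg.norm(Y[:, 0]))   # approximately equal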

Analysis and Comparison

The data perturbation methods are often evaluated by measuring the privacy preservation
level and the information loss [4]. The privacy preservation level indicates how difficult it is to
estimate the original data from the perturbed data [7]. The information loss refers to the loss of
critical information from the original data set after perturbation. Table 1 gives a comparison of
the three kinds of data perturbation methods described above.

For value-based data perturbation methods, including Random Noise Addition, data values are
often perturbed independently. Thus, the information loss depends on the amount of data
perturbed to achieve a certain privacy preservation level for a specific data-mining task.
Compared to the value-based data perturbation methods, the dimension-based methods usually
achieve lower information loss because they preserve more statistical information across
different dimensions through certain transformations, rotations or projections when the data
values are perturbed. However, their perturbation algorithms are generally more complicated
than the value-based methods.

Table 1: Comparison of the Data Perturbation Methods

Methods                  Privacy Preservation Level   Information Loss   Compatibility with Data Mining
Random Noise Addition    Low                          Medium             No
Random Rotation          High                         Lower              Yes
Random Projection        High                         Lower              Yes

4.3 K-Anonymity
The fundamental form of data in a table has the following attributes:

1. Explicit Identifiers - attributes that explicitly identify a record owner, e.g. ID number,
name.
2. Quasi-identifiers - a set of attributes that could potentially identify a record owner
when combined with publicly available data, e.g. zipcode, age.
3. Sensitive Attributes - a set of attributes containing sensitive person-specific
information, e.g. disease or salary.
4. Non-sensitive Attributes - a set of attributes that are not risky even if disclosed to
untrustworthy parties.

4.3.1. Linkage Attack

Data records are often made available by simply removing key identifiers such as name
and social-security number from the individual records. However, a combination of other
attributes, termed a quasi-identifier, can be used to exactly identify individual records. For
example, attributes such as ZIP, birth date and sex are available in public records such as voter
lists. When these attributes are also present in a given data set, such as medical data, they can be
used to infer the identity of the corresponding individual with high probability through a linking
operation, as shown in Figure 6.

Figure 6: Linkage Attack
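
As a minimal sketch of such an attack (assuming pandas; both tables and all names here are
invented for illustration), a simple join on the quasi-identifier re-identifies the "de-identified"
medical records:

import pandas as pd

# Public voter list (identities plus quasi-identifiers)
voters = pd.DataFrame({
    "Name":      ["Alice", "Bob"],
    "ZIP":       ["47677", "47602"],
    "BirthDate": ["1990-01-02", "1985-07-09"],
    "Sex":       ["F", "M"],
})

# "De-identified" medical release sharing the same quasi-identifiers
medical = pd.DataFrame({
    "ZIP":       ["47677", "47602"],
    "BirthDate": ["1990-01-02", "1985-07-09"],
    "Sex":       ["F", "M"],
    "Disease":   ["Heart Disease", "Cancer"],
})

# The linking operation: join on the quasi-identifier attributes
linked = voters.merge(medical, on=["ZIP", "BirthDate", "Sex"])
print(linked[["Name", "Disease"]])   # each name is now linked to a disease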

4.3.2. k-Anonymization Method


One definition of privacy which has received a lot of attention in the past decade is that of k-
anonymity. The guarantee given by k-anonymity is that no information can be linked to groups
of fewer than k individuals. One way to guarantee the k-anonymity of a data mining model is to
build it from a k-anonymized table. However, this poses two main problems. First, the
performance cost of the anonymization process may be very high, especially for large and sparse
databases; in fact, the cost of anonymization can exceed the cost of mining the data. Second, the
process of anonymization may inadvertently delete features that are critical for the success of
data mining and retain those that are useless; thus, it would make more sense to perform data
mining first and anonymization later. To demonstrate the second problem, consider the data in
Table 2, which describes the loan risk information of a mortgage company. The Gender, Married,
Age and Sports Car attributes contain data that is available to the public, while the Loan Risk
attribute contains data that is known only to the company. To get a 2-anonymous version of this
table, many practical methods call for the suppression or generalization of whole columns, an
approach termed single-dimension recoding. In the case of Table 2, the data owner would have
to choose between suppressing the Gender column and suppressing all the other columns.
Although the latter choice suppresses more data, the accuracy of the decision tree learned from
the table containing only the Gender column is better than that of the decision tree learned from
the table without the Gender column: without the Gender column, it is impossible to obtain a
classification better than 50% good loan risk, 50% bad loan risk, for any set of tuples.

Table 2: Mortgage Company Data

Name     Gender   Married   Age   Sports Car   Loan Risk
Ram      Male     Yes       45    No           High
Lucky    Male     No        21    Yes          High
Mona     Female   Yes       33    No           Low
Rashmi   Female   Yes       48    Yes          Low
Sachin   Male     No        60    Yes          High
Merry    Female   Yes       37    No           High
Ramesh   Male     No        25    No           Low

The k-anonymity model distinguishes three entities: the individuals, whose privacy needs
to be protected; the database owner, who controls a table in which each row (also referred to as a
record or tuple) describes exactly one individual; and the attacker. The k-anonymity model
makes two major assumptions:

• The database owner is able to separate the columns of the table into a set of quasi-
identifiers, which are attributes that may appear in external tables the database owner
does not control, and a set of private columns, the values of which need to be protected.
We prefer to term these two sets public attributes and private attributes, respectively.
• The attacker has full knowledge of the public attribute values of individuals, and no
knowledge of their private data. The attacker only performs linking attacks. A linking
attack is executed by taking external tables containing the identities of individuals, and
some or all of the public attributes. When the public attributes of an individual match the
public attributes that appear in a row of a table released by the database owner, we say
that the individual is linked to that row; specifically, the individual is linked to the
private attribute values that appear in that row. A linking attack succeeds if the
attacker is able to match the identity of an individual against the value of a private
attribute.

Under the k-anonymity model, the database owner retains the k-anonymity of individuals
if none of them can be linked with fewer than k rows in a released table. This is achieved by
making certain that in any table released by the owner there are at least k rows with the same
combination of values in the public attributes. In many cases, the tables that the data owners wish
to publish do not adhere to the k-anonymity constraints, and the data owners are therefore
compelled to alter the tables to conform to k-anonymity. Two main methods are used to this end:
generalization and suppression. In generalization, a public attribute value is replaced with a less
specific but semantically consistent value. In suppression, a value is not released at all. However,
it has been shown that anonymizing tables such that their contents are minimally distorted is an
NP-hard problem.

The limitations of the k-anonymity model stem from the two assumptions above. First, it
may be very hard for the owner of a database to determine which of the attributes are or are not
available in external tables. This limitation can be overcome by adopting a strict approach that
assumes much of the data is public. The second limitation is much harsher. The k-anonymity
model assumes a certain method of attack, while in real scenarios there is no reason why the
attacker should not try other methods, such as injecting false rows (rows that correspond to no
real individuals) into the database. A third limitation of the k-anonymity model published in the
literature is its implicit assumption that tuples with similar public attribute values will have
different private attribute values. Even if the attacker knows the set of private attribute values
that match a set of k individuals, the assumption remains that he does not know which value
matches any individual in particular. However, it may well happen that, since there is no explicit
restriction forbidding it, the value of a private attribute will be the same for an identifiable group
of k individuals. In that case, the k-anonymity model would permit the attacker to discover the
value of an individual's private attribute. The ability to expose sensitive information by analysing
the private attribute values linked to each group of individuals has motivated several works that
propose modifications of the privacy definition. Despite these limitations, k-anonymity has
gained a lot of traction in the research community, and it still provides the theoretical basis for
privacy related legislation. This is for several important reasons:

1. The k-anonymity model defines the privacy of the output of a process and not of the
process itself. This is in sharp contrast to the vast majority of privacy models that were
suggested earlier, and it is in this sense of privacy that clients are usually interested.
2. It is a simple, intuitive, and well-understood model. Thus, it appeals to the non-expert
who is the end client of the model.
3. Although the process of computing a k-anonymous table may be quite hard, it is easy to
validate that an outcome is indeed k-anonymous.

k-Anonymization Summary and Example

 Data anonymization is a type of information sanitization whose intent is privacy protection.

 It is the process of removing personally identifiable information from data sets so that the people whom the data describe remain anonymous.

 The information for each person contained in the released table cannot be distinguished from at least k-1 individuals whose information also appears in the release.

 Any quasi-identifier present in the released table must appear in at least k records.

 Achieved using generalization and suppression (not value perturbation).

Formal definition

 Let T(A1, …, An) be a table and QI be a quasi-identifier associated with it. T is said to satisfy k-anonymity with respect to QI if each sequence of values in T[QI] appears with at least k occurrences in T[QI] (see the sketch after this list).

 (T[QI] is the projection of T on quasi-identifier attributes)

 Each record must be indistinguishable from at least k-1 other records with respect to the quasi-identifier.

 Linking attack cannot be performed with confidence > 1/k

o While k-anonymity protects against identity disclosure, it does not provide sufficient protection against attribute disclosure.
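To make the formal definition concrete, the following Python sketch checks whether a table satisfies k-anonymity with respect to a chosen quasi-identifier. This is a minimal illustration: the function name, attribute names, and sample rows are hypothetical, with the rows mirroring the generalized Table 4 below.

    from collections import Counter

    def is_k_anonymous(rows, qi, k):
        """Check k-anonymity: every combination of quasi-identifier
        values (the projection T[QI]) must occur in at least k rows."""
        counts = Counter(tuple(row[a] for a in qi) for row in rows)
        return all(c >= k for c in counts.values())

    # Already-generalized toy table; ZIP code and age are the quasi-identifiers.
    table = [
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "4790*", "age": ">=40", "disease": "Flu"},
        {"zip": "4790*", "age": ">=40", "disease": "Heart Disease"},
        {"zip": "4790*", "age": ">=40", "disease": "Cancer"},
    ]

    print(is_k_anonymous(table, qi=("zip", "age"), k=3))  # True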

Table 3: Original Patient Table

ZIP Code Age Disease


1 47677 29 Heart Disease
2 47602 22 Heart Disease
3 47678 27 Heart Disease
4 47905 43 Flu
5 47909 52 Heart Disease
6 47906 47 Cancer
7 47605 30 Heart Disease
8 47673 36 Cancer
9 47607 32 Cancer

Table 4: 3-Anonymous Version of Patient Table

ZIP Code Age Disease


1 476** 2* Heart Disease
2 476** 2* Heart Disease
3 476** 2* Heart Disease
4 4790* ≥40 Flu
5 4790* ≥40 Heart Disease
6 4790* ≥40 Cancer
7 476** 3* Heart Disease
8 476** 3* Cancer
9 476** 3* Cancer

4.3.3 Attack on k-Anonymity


k-Anonymity is an attractive technique because of the simplicity of its definition and the numerous algorithms available to perform the anonymization. Nevertheless, the technique is susceptible to many kinds of attacks, especially when background knowledge is available to the attacker. Some such attacks are as follows:
 Homogeneity Attack: In this attack, all the values for a sensitive attribute within a group
of k records are the same. Therefore, even though the data is k-anonymized, the value of
the sensitive attribute for that group of k records can be predicted exactly.

 Background Knowledge Attack: In this attack, the adversary uses an association between one or more quasi-identifier attributes and the sensitive attribute to further narrow down the possible values of the sensitive field.
An example illustrates both attacks. Table 3 is the original data table, and Table 4 is an anonymized version of it satisfying 3-anonymity. The Disease attribute is sensitive. Suppose Alice knows that Bob is a 27-year-old man living in ZIP 47678 and that Bob's record is in the table. From Table 4, Alice can conclude that Bob corresponds to one of the first three records, and thus must have heart disease. This is the homogeneity attack. For an example of the background knowledge attack, suppose that, by knowing Carl's age and ZIP code, Alice can conclude that Carl corresponds to a record in the last equivalence class in Table 4. Furthermore, suppose that Alice knows that Carl has a very low risk of heart disease. This background knowledge enables Alice to conclude that Carl most likely has cancer.
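The homogeneity attack can be detected mechanically. The sketch below (hypothetical code, run here on rows modelled after Table 4) groups an anonymized table by its quasi-identifier and reports equivalence classes in which the sensitive attribute takes only one value; anyone linked to such a class has their sensitive value disclosed.

    from collections import defaultdict

    def homogeneous_classes(rows, qi, sensitive):
        """Return the quasi-identifier groups whose sensitive
        attribute is homogeneous (takes a single value)."""
        groups = defaultdict(set)
        for row in rows:
            groups[tuple(row[a] for a in qi)].add(row[sensitive])
        return [g for g, values in groups.items() if len(values) == 1]

    rows = [
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "3*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "3*", "disease": "Cancer"},
        {"zip": "476**", "age": "3*", "disease": "Cancer"},
    ]

    # Prints [('476**', '2*')]: Bob's class, where everyone has heart disease.
    print(homogeneous_classes(rows, qi=("zip", "age"), sensitive="disease"))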

4.3.4 Generalization and Suppression


In this method, individual attribute values are replaced with a broader category. Our approach to providing k-anonymity is based on the definition and use of generalization relationships between domains and between the values that attributes can assume. In a classical relational database system, domains are used to describe the sets of values that attributes assume. For example, there might be a ZIP code domain, a number domain, and a string domain. We extend this notion of a domain to make it easier to describe how to generalize the values of an attribute.

In the original database, where every value is as specific as possible, every attribute is in
the ground domain. For example, 02139 is in the ground ZIP code domain, Z0. To achieve k-
anonymity, we can make the ZIP code less informative. We do this by saying that there is a more
general, less specific, domain that can be used to describe ZIP codes, Z1, in which the last digit
has been replaced by a 0. There is also a mapping from Z0 to Z1, such as 02139→02130. This mapping between domains is stated by means of a generalization relationship.

Figure 7: Domain and Value Generalization Hierarchy including Suppression

Suppression

In this method, certain values of the attributes are replaced by an asterisk '*'. All or some values of a column may be replaced by '*'. Suppression is similar to generalization, but here the value of a quasi-identifier is hidden completely: for example, the Sex attribute (Male/Female) may be replaced by 'Any', or a specific profession may be suppressed and not released at all.

The different suppression types are defined as follows (a combined sketch of generalization and suppression follows this list):

1. Record level: the complete entry of a record is eliminated (suppressed) from the table.
2. Value level: all instances of a particular value in the table are suppressed.
3. Cell level: only some of the records containing a given value are suppressed.
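The following minimal sketch illustrates both operations on single values, following the '*'-masking convention of Table 4 (the Z1 domain described above instead replaces the last digit with 0); the function names are hypothetical.

    def generalize_zip(zip_code, level):
        """Replace the last `level` digits of a ZIP code with '*',
        e.g. '02139' -> '0213*' (one level) -> '021**' (two levels)."""
        return zip_code[: len(zip_code) - level] + "*" * level

    def suppress(value):
        """Cell-level suppression: the value is not released at all."""
        return "*"

    print(generalize_zip("02139", 1))  # 0213*
    print(generalize_zip("02139", 2))  # 021**
    print(suppress("Male"))            # *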

4.3.5 l-diversity

l-Diversity is a form of group-based anonymization used to preserve privacy in data sets by reducing the granularity of the data representation. The l-diversity model is an extension of the k-anonymity model, which reduces the granularity of data representation using techniques including generalization and suppression so that any given record maps onto at least k-1 other records in the data. The l-diversity model handles a weakness of the k-anonymity model: protecting identities to the level of k individuals is not equivalent to protecting the corresponding sensitive values that were generalized or suppressed, especially when the sensitive values within a group exhibit homogeneity. The l-diversity model adds the promotion of intra-group diversity for sensitive values to the anonymization mechanism.

Given the existence of such attacks where sensitive attributes may be inferred for k-anonymity
data, the l-diversity method was created to further k-anonymity by additionally maintaining the
diversity of sensitive fields.

Let a q*-block be a set of tuples such that its non-sensitive values generalize to q*. A q*-block
is l-diverse if it contains l "well represented" values for the sensitive attribute S. A table is l-
diverse, if every q*-block in it is l-diverse.

The l-diversity Principle – An equivalence class is said to have l-diversity if there are at
least l “well-represented” values for the sensitive attribute. A table is said to have l-diversity if
every equivalence class of the table has l-diversity.

Distinct l-diversity – The simplest definition, which ensures that there are at least l distinct values for the sensitive attribute in each equivalence class.

Recursive (c, l)-diversity – A compromise definition that ensures that the most common value does not appear too often and that less common values do not appear too rarely.
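A minimal sketch of checking distinct l-diversity, assuming equivalence classes are identified by their quasi-identifier values (the function and attribute names are illustrative):

    from collections import defaultdict

    def is_distinct_l_diverse(rows, qi, sensitive, l):
        """Distinct l-diversity: every equivalence class (q*-block)
        must contain at least l distinct sensitive values."""
        groups = defaultdict(set)
        for row in rows:
            groups[tuple(row[a] for a in qi)].add(row[sensitive])
        return all(len(values) >= l for values in groups.values())

    rows = [
        {"zip": "4790*", "age": ">=40", "disease": "Flu"},
        {"zip": "4790*", "age": ">=40", "disease": "Heart Disease"},
        {"zip": "4790*", "age": ">=40", "disease": "Cancer"},
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
        {"zip": "476**", "age": "2*", "disease": "Heart Disease"},
    ]

    # False: the second class has only one distinct disease value.
    print(is_distinct_l_diverse(rows, ("zip", "age"), "disease", l=2))

Note that Table 4 is 3-anonymous but not 2-diverse, which is exactly why the homogeneity attack on Bob succeeds.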

4.4 Encryption (Cryptographic Techniques)


The process of multiple parties collaborating to conduct data mining on the union of their
databases is referred to as distributed data mining [8]. Health institutions, for example, may take advantage of data mining using encryption-based techniques on their private data whilst protecting individuals' privacy. This is because most encryption-based privacy preserving algorithms developed for data mining aim to reveal to the participating parties nothing but the results, not individual records.

Since parties participating in joint data mining could be competitors or untrusted entities,
secure privacy preserving algorithms that provide accurate data mining results from the union of the parties' datasets (without compromising the privacy of individual datasets) are essential. Hence,
it is evident that cryptography suits multiparty distributed data mining to preserve privacy.

Most encryption techniques developed for privacy preserving data mining conducted jointly by multiple parties are founded on Secure Multiparty Computation (SMC), which aims to provide a secure computation in which parties learn nothing but their own inputs and the expected results. The primary concerns of secure computation protocols such as SMC are privacy and correctness. Malicious behaviour of adversarial entities that are also participants must be considered in the case of multiparty participation. Controls must be in place to ensure that adversaries are not able to cause results to deviate from the function that the parties have agreed to compute. 'Semi-honest' and 'malicious' are the two adversarial models that have received extensive research attention [8]. The semi-honest model assumes that adversarial entities follow the protocol genuinely but might attempt to gather useful information during the computation. No such assumption is made in the malicious model, where adversaries may do anything to infer secret information.

It is generally difficult to provide a higher privacy level under the malicious model. However, cryptography provides intelligent algorithms that attempt to reveal nothing to unauthorized parties, only to the parties for whom the information is intended. In distributed data mining, data may be partitioned horizontally or vertically among different collaborators. Horizontally partitioned data refers to records dispersed among multiple entities, e.g. hospitals performing data mining on the union of their data sets that share the same attributes. Vertically partitioned data has attributes distributed across various entities, e.g. universities, hospitals, and banks collecting different attributes about the same students.
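The two partitioning styles just described can be pictured with a toy example (the parties and records below are hypothetical): in the horizontal case each party holds different records under the same schema, while in the vertical case each party holds different attributes of the same individuals.

    # Horizontal partitioning: same attributes, disjoint sets of records.
    hospital_a = [{"id": 1, "age": 29, "disease": "Flu"}]
    hospital_b = [{"id": 2, "age": 52, "disease": "Cancer"}]
    union = hospital_a + hospital_b   # mining runs over the union of records

    # Vertical partitioning: same individuals, disjoint sets of attributes.
    university = {1: {"degree": "BSc"}, 2: {"degree": "MSc"}}
    bank       = {1: {"income": 40000}, 2: {"income": 65000}}
    joined = {i: {**university[i], **bank[i]} for i in university}
    print(joined[1])  # {'degree': 'BSc', 'income': 40000}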
The cryptography-based methods for preserving privacy in data mining are secure multiparty computation, homomorphic encryption, and secret sharing (homomorphic secret sharing).

 Secure Multiparty Computation:

Secure Multi-party Computation (SMC), as introduced above, is a special protocol that allows computation on private data from several parties without compromising the data security and privacy of any party participating in joint data mining. Only a party's private input and the aggregate results should be known to it after the mining process. SMC consists of several secure sub-protocols, such as Secure Sum, Secure Set Intersection, Secure Union, and the Dot Product protocol; a simplified sketch of Secure Sum follows.
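Below is a single-process simulation of the Secure Sum idea under the semi-honest model (the party values and modulus are hypothetical): the initiating party masks its running total with a random value, every party adds its own private input modulo a large number, and the initiator finally removes the mask, so no party ever sees another party's raw input.

    import random

    def secure_sum(private_values, modulus=10**9):
        """Ring-based Secure Sum: only a randomly masked running
        total is ever passed between parties."""
        mask = random.randrange(modulus)
        running = mask
        for v in private_values:           # each party adds its own value
            running = (running + v) % modulus
        return (running - mask) % modulus  # initiator removes the mask

    # Three hospitals jointly compute their total patient count.
    print(secure_sum([120, 75, 240]))  # 435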

 Homomorphic Encryption

In homomorphic encryption, encrypted results produced by computation performed on ciphertexts correspond to the results of the same operations performed on the plaintexts [8]. Computation is done without having to decrypt the ciphertexts. Partially Homomorphic Encryption (PHE) and Fully Homomorphic Encryption (FHE) are the two forms of homomorphic encryption. PHE supports either additive or multiplicative homomorphism but not both, whereas FHE supports both. Additive homomorphism means that if two different plaintexts are encrypted separately using the same key and the ciphertexts are then combined, the resulting ciphertext decrypts to the same value as if the plaintexts had first been added and then encrypted. Multiplicative homomorphism is defined in the same way, except that the plaintexts are multiplied instead of added.
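A minimal sketch of additive homomorphism using the Paillier cryptosystem, a well-known PHE scheme (the tiny primes are for illustration only; real deployments use moduli of 2048 bits or more, and the modular inverse via pow requires Python 3.8+):

    import math
    import random

    def keygen(p=1009, q=1013):
        """Toy Paillier key generation with illustrative primes."""
        n = p * q
        lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
        mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
        return n, (lam, mu)

    def encrypt(n, m):
        """E(m) = (n+1)^m * r^n mod n^2, with r a random unit mod n."""
        n2 = n * n
        while True:
            r = random.randrange(1, n)
            if math.gcd(r, n) == 1:
                break
        return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

    def decrypt(n, priv, c):
        """m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
        lam, mu = priv
        n2 = n * n
        return ((pow(c, lam, n2) - 1) // n) * mu % n

    n, priv = keygen()
    c1, c2 = encrypt(n, 42), encrypt(n, 58)
    c_sum = (c1 * c2) % (n * n)     # multiplying ciphertexts adds the plaintexts
    print(decrypt(n, priv, c_sum))  # 100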

4.5 Summary
Table 5: Advantages and Limitations of PPDM Techniques

Technique | Advantages | Limitations
Anonymization-based PPDM | Identity or sensitive data about record owners are hidden. | Vulnerable to linking attacks; heavy loss of information.
Randomization-based PPDM | Relatively simple and useful for hiding information about individuals; better efficiency compared to cryptography-based PPDM. | Loss of individuals' information; not suitable for multiple-attribute databases.
Cryptography-based PPDM | Transformed data are exact and protected; better privacy compared to the randomization approach. | Especially difficult to scale; multiple parties are involved.

Each class of techniques is evaluated against these criteria by composing the results obtained by other researchers and recording the performance on each parameter in Table 6. The performance values are qualitative: low, average, or high.

Table 6: Comparison of Various PPDM Techniques

PPDM Technique | Computational Cost | Accuracy | Privacy Preservation | Scalability
Cryptography | High | High | High | Low
Randomization | Low | High | High | High
Anonymization | Low | Average | Average | -

Chapter 5

Conclusion and Future Scope

5.1 Conclusion
Defining what privacy means in the context of data mining is a difficult task, and at large it is still an open research question. A PPDM framework was defined, and PPDM techniques were reviewed against this framework. An extensive survey of the diverse approaches to privacy preserving data mining was carried out.

Popular approaches to PPDM, such as randomization, k-anonymity, and cryptography, were presented. k-Anonymity and derived approaches rely on a syntactic definition to ensure that one released record is indistinguishable from a set of other records. The size k of the indistinguishability set is used as the privacy measure in k-anonymity. We extended the definition of k-anonymity to the realm of data mining and used the extended definition to generate data mining models that conform to k-anonymity. While k-anonymity is relatively easy to understand and verify, recent research has pointed out that it may not provide suitable protection when an adversary has access to abundant auxiliary information or even other k-anonymized data sets. A k-anonymous dataset permits strong attacks when the sensitive attributes lack diversity, so the l-diversity framework was built to give a stronger privacy guarantee.

In the context of k-anonymity, some of the anonymization algorithms rely on utility metrics to guide the anonymization process, such that the anonymized data is more useful for
future analysis, which may or may not be known in advance. Measures such as Loss Metric,
Discernibility Metric and Classification Metric were used as a benchmark for utility, with the
purpose of optimizing the metric within given anonymity constraints.

5.2 Future Scope


Different data mining algorithms take different approaches to data analysis and therefore present different utility considerations. Consequently, when it comes to improving the trade-off between privacy and utility, a one-size-fits-all solution would not be optimal. Further study can be carried out using any one of the existing techniques, using a combination of them, or by developing an entirely new technique. Since no single technique overcomes all privacy issues, research in this direction can make significant contributions. To reach better accuracy, a promising approach would be to adapt a specific data mining algorithm such that it provides a k-anonymous outcome. k-Anonymity should be extended to multiple sensitive attributes, and methods can be developed for continuous sensitive attributes. Methods should be built that are efficient and attain a balance between disclosure cost, computation cost, and communication cost. It is important to find privacy preserving techniques that are independent of the data mining task, so that after applying a privacy preserving technique a database can be released without being constrained to the original task. Suitable evaluation criteria must be identified, and benchmarks for selecting algorithms must be developed, to move ahead in PPDM research.

How to deploy privacy-preserving techniques in practical applications also requires further study. Finally, to achieve a good trade-off between privacy and accuracy, it may be better to develop a differentially-private version of the desired data mining algorithm.

References
[1] A. S. Shanthi and M. Karthikeyan, "A Review on Privacy Preserving Data Mining," 2012 IEEE International Conference on Computational Intelligence and Computing Research.

[2] A. Senosi and G. Sibiya, "Classification and Evaluation of Privacy Preserving Data Mining: A Review," IEEE AFRICON 2017 Proceedings.

[3] H. Vaghashia and A. Ganatra, "A Survey: Privacy Preservation Techniques in Data Mining," International Journal of Computer Applications (0975-8887), vol. 119, no. 4, June 2015.

[4] R. Mahesh and T. Meyyappan, "Anonymization Technique through Record Elimination to Preserve Privacy of Published Data," Proceedings of the 2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, February 21-22, 2013.

[5] B. Vishwakarma, H. Gupta, and M. Manoria, "A Survey on Privacy Preserving Mining Implementing Techniques," 2016 Symposium on Colossal Data Analysis and Networking (CDAN).

[6] A. S. Shanthi and M. Karthikeyan, "A Review on Privacy Preserving Data Mining," 2012 IEEE International Conference on Computational Intelligence and Computing Research.

[7] X. Li, Z. Yan, and P. Zhang, "A Review on Privacy-Preserving Data Mining," in Computer and Information Technology (CIT), 2014 IEEE International Conference on, 2014, pp. 769-774.

[8] C. Aggarwal and P. Yu, "A General Survey of Privacy-Preserving Data Mining Models and Algorithms," in Privacy-Preserving Data Mining: Models and Algorithms, Springer, 2008.

[9] P. R. Nivetha and Thamarai Selvi, "A Survey on Privacy Preserving Data Mining Techniques," Department of CSE, Dr. NGP Institute of Technology.

[10] M. Barbaro and T. Zeller Jr., "A Face Is Exposed for AOL Searcher No. 4417749," New York Times, August 9, 2006.

[11] T. Nandhini, D. Vanathi, and P. Sengottuvelan, "A Review on Privacy Preservation in Data Mining," International Journal of UbiComp (IJU), vol. 7, no. 3, July 2016.

[12] C. Aggarwal and P. S. Yu, "A General Survey of Privacy-Preserving Data Mining Models and Algorithms," IBM T. J. Watson Research Center, Hawthorne, NY, 2013.

[13] M. Dhanalakshmi and E. S. Sankari, "Privacy Preserving Data Mining Techniques - Survey," in Information Communication and Embedded Systems (ICICES), 2014 International Conference on, 2014, pp. 1-6.

[14] S. Tanja, "A Review on Privacy Preserving Data Mining: Techniques and Research Challenges," (IJCSIT) International Journal of Computer Science and Information Technologies, vol. 5, no. 2, 2014, pp. 2310-2315.

[15] M. B. Malik, M. A. Ghazi, and R. Ali, "Privacy Preserving Data Mining Techniques: Current Scenario and Future Prospects," in Computer and Communication Technology (ICCCT), 2012 Third International Conference on, Nov. 2012, pp. 26-32.

[16] K. Saranya, K. Premalatha, and S. Rajasekar, "A Survey on Privacy Preserving Data Mining," in Electronics and Communication Systems (ICECS), 2015 2nd International Conference on, 2015, pp. 1740-1744.

[17] S. Patel, G. Shah, and A. Patel, "Techniques of Data Perturbation for Privacy Preserving Data Mining," International Journal of Advent Research in Computer & Electronics (IJARCE), vol. 1, pp. 5-10.
