IEEE TRANSACTIONS ON COMPUTERS, VOL. 64, NO. 8, AUGUST 2015

Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud
Xuyun Zhang, Wanchun Dou, Jian Pei, Fellow, IEEE, Surya Nepal, Member, IEEE,
Chi Yang, Chang Liu, and Jinjun Chen, Senior Member, IEEE

Abstract—Cloud computing provides promising, scalable IT infrastructure to support the processing of a variety of big data
applications in sectors such as healthcare and business. Data sets like electronic health records in such applications often contain
privacy-sensitive information, which raises potential privacy concerns if the information is released or shared with third parties in the
cloud. A practical and widely adopted technique for data privacy preservation is to anonymize data via generalization so as to satisfy a
given privacy model. However, most existing privacy-preserving approaches are tailored to small-scale data sets and often fall short
when encountering big data, due to their insufficiency or poor scalability. In this paper, we investigate the local-recoding problem for
big data anonymization against proximity privacy breaches and attempt to identify a scalable solution to this problem. Specifically, we
present a proximity privacy model that allows semantic proximity of sensitive values and multiple sensitive attributes, and we model the
problem of local recoding as a proximity-aware clustering problem. A scalable two-phase clustering approach, consisting of a
t-ancestors clustering algorithm (similar to k-means) and a proximity-aware agglomerative clustering algorithm, is proposed to address
this problem. We design the algorithms with MapReduce to gain high scalability by performing data-parallel computation in the cloud.
Extensive experiments on real-life data sets demonstrate that our approach significantly improves the defense against proximity
privacy breaches as well as the scalability and time-efficiency of local-recoding anonymization over existing approaches.

Index Terms—Big data, cloud computing, MapReduce, data anonymization, proximity privacy

1 INTRODUCTION

CLOUD computing and big data, two disruptive trends at present, have a significant impact on the current IT industry and research communities [1], [2]. Today, a large number of big data applications and services have been deployed in or migrated to the cloud for data mining, processing or sharing. The salient characteristics of cloud computing, such as high scalability and the pay-as-you-go fashion, make big data cheaply and easily accessible to various organizations through public cloud infrastructure. Data sets in many big data applications often contain personal privacy-sensitive data like electronic health records and financial transaction records. As the analysis of these data sets provides profound insights into a number of key areas of society (e.g., healthcare, medical research, government services, e-research), the data sets are often shared with or released to third-party partners or the public. The privacy-sensitive information can be divulged with less effort by an adversary, as the coupling of big data with public cloud environments disables some traditional privacy protection measures in the cloud [3], [4]. This can bring considerable economic loss or severe social reputation impairment to data owners. As such, sharing or releasing privacy-sensitive data sets to third parties in the cloud raises potential privacy concerns, and therefore requires strong privacy preservation.

Data anonymization has been extensively studied and widely adopted for data privacy preservation in non-interactive data sharing and releasing scenarios [5]. Data anonymization refers to hiding identity and/or sensitive data so that the privacy of an individual is effectively preserved while certain aggregate information can still be exposed to data users for diverse analysis and mining tasks. A variety of privacy models and data anonymization approaches have been proposed and extensively studied recently [5], [6], [7], [8], [9], [10], [11], [12]. However, applying these traditional approaches to big data anonymization poses scalability and efficiency challenges because of the "3Vs", i.e., Volume, Velocity and Variety. Research on the scalability issues of big data anonymization has come into the picture [10], [13], [14], [15], but existing work is only applicable to the sub-tree or multidimensional schemes. Following this line, we investigate the local-recoding scheme herein and attempt to identify a scalable solution to big data local-recoding anonymization. Recently, differential privacy has attracted plenty of attention due to its robust privacy guarantee regardless of an adversary's prior knowledge [16].

• X. Zhang, C. Yang, C. Liu, and J. Chen are with the Faculty of Engineering and IT, University of Technology, Sydney, NSW 2007, Australia. E-mail: {xyzhanggz, chiyangit, changliu.it, jinjun}@gmail.com.
• W. Dou is with the State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China. E-mail: douwc@nju.edu.cn.
• J. Pei is with the School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada. E-mail: jpei@cs.sfu.ca.
• S. Nepal is with CSIRO Computational Informatics, CSIRO, NSW 2122, Australia. E-mail: surya.nepal@csiro.au.

Manuscript received 16 Dec. 2013; revised 29 July 2014; accepted 18 Aug. 2014. Date of publication 25 Sept. 2014; date of current version 10 July 2015. Recommended for acceptance by H. Jin.
Digital Object Identifier no. 10.1109/TC.2014.2360516

0018-9340 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

However, besides the drawbacks pointed out in [16], differential privacy also loses correctness guarantees because it produces noisy results to hide the impact of any single individual [17]. Hence, syntactic anonymity privacy models still have practical impact in general data publishing and can be applied in numerous real-world applications.

The local-recoding scheme, also known as cell generalization, groups data sets into a set of cells at the data-record level and anonymizes each cell individually. Existing approaches for local recoding [9], [11], [18], [19] can only withstand record linkage attacks by employing the k-anonymity privacy model [6], thereby falling short of defending against proximity privacy breaches [12], [20], [21]. In fact, combining local recoding and proximity privacy models is interesting and necessary when one wants an anonymous data set with both low data distortion and the ability to combat proximity privacy attacks. However, this combination is challenging because most proximity privacy models have the property of non-monotonicity [21], so local recoding cannot be accomplished in a top-down way, as will be detailed in Section 3.3 (Motivation and Problem Analysis).

In this paper, we model the problem of big data local recoding against proximity privacy breaches as a proximity-aware clustering (PAC) problem, and propose a scalable two-phase clustering approach accordingly. Specifically, we put forth a proximity privacy model based on [12] that reasonably allows multiple sensitive attributes and semantic proximity of categorical sensitive values. As the satisfiability problem of the proximity privacy model is proved to be NP-hard, it is interesting and practical to model the problem as a clustering problem that minimizes both data distortion and proximity among sensitive values in a cluster, rather than to find a solution satisfying the privacy model rigorously. Technically, a proximity-aware distance is introduced over both quasi-identifier and sensitive attributes to facilitate clustering algorithms. To address the scalability problem, we propose a two-phase clustering approach consisting of the t-ancestors clustering (similar to k-means [22]) and proximity-aware agglomerative clustering algorithms. The first phase splits an original data set into t partitions that contain similar data records in terms of quasi-identifiers. In the second phase, data partitions are locally recoded by the proximity-aware agglomerative clustering algorithm in parallel. We design the algorithms with MapReduce in order to gain high scalability by performing data-parallel computation over multiple computing nodes in the cloud. We evaluate our approach by conducting extensive experiments on real-world data sets. Experimental results demonstrate that our approach can preserve proximity privacy substantially, and can significantly improve the scalability and time-efficiency of local-recoding anonymization over existing approaches.

The major contributions of our research are fourfold. First, an extended proximity privacy model is put forth by allowing multiple sensitive attributes and semantic proximity of categorical sensitive values. Second, we model the problem of big data local recoding against proximity privacy breaches as a proximity-aware clustering problem. Third, a scalable and efficient two-phase clustering approach is proposed to parallelize local recoding on multiple data partitions. Fourth, several innovative MapReduce jobs are designed and coordinated to concretely conduct data-parallel computation for scalability.

The remainder of this paper is organized as follows. The next section reviews related work. In Section 3, we briefly introduce preliminaries and analyze the problems in detail. Section 4 models the proximity-aware clustering problem formally, and Section 5 elaborates the two-phase clustering approach and the MapReduce jobs. We empirically evaluate our approach in Section 6. Finally, we conclude this paper and discuss future work in Section 7.

2 RELATED WORK

Recently, data privacy preservation has been extensively investigated [5]. We briefly review existing approaches for local-recoding anonymization and privacy models that defend against attribute linkage attacks. In addition, research on scalability issues in existing anonymization approaches is briefly surveyed.

Clustering techniques have been leveraged to achieve local-recoding anonymization for privacy preservation. Xu et al. [9] studied the anonymization of the local-recoding scheme from the utility perspective and put forth a bottom-up greedy approach and its top-down counterpart. The former leverages the agglomerative clustering technique while the latter employs the divisive hierarchical clustering technique, both of which pose constraints on the size of a cluster. Byun et al. [19] formally modeled local-recoding anonymization as the k-member clustering problem, which requires that the cluster size be no less than k in order to achieve k-anonymity, and proposed a simple greedy algorithm to address the problem. Li et al. [18] investigated the inconsistency issue of local-recoding anonymization in data with hierarchical attributes and proposed the KACA (K-Anonymization by Clustering in Attribute hierarchies) algorithms. Aggarwal et al. [11] proposed a set of constant-factor approximation algorithms for two clustering-based anonymization problems, i.e., r-GATHER and r-CELLULAR CLUSTERING, where cluster centers are published without generalization or suppression. However, existing clustering approaches for local-recoding anonymization mainly concentrate on record linkage attacks, specifically under the k-anonymity privacy model, without paying attention to privacy breaches incurred by sensitive attribute linkage. In contrast, our research takes both privacy concerns into account. Wong et al. [23] proposed a top-down partitioning approach based on the Mondrian algorithm [24] to specialize data sets to achieve (α, k)-anonymity, which is able to defend against certain attribute linkage attacks. However, the data utility of the resultant anonymous data is heavily influenced by the choice of splitting attributes and values, while local recoding does not involve such factors. Our approach leverages clustering to accomplish local recoding because clustering is a natural and effective way to anonymize data sets at the cell level.

To preserve privacy against attribute linkage attacks, a variety of privacy models have been proposed for both categorical and numerical sensitive attributes. l-diversity and its variants [7] require each QI-group to include at least l well-represented sensitive values. Note that l-diverse data sets
are already l-anonymous. Some privacy models, such as (α, k)-anonymity [23], extend k-anonymity with the confidence bounding principle, which requires the confidence of associating a quasi-identifier with a sensitive value to be less than a user-specified threshold. In general, these integrated models are more flexible than l-diversity. Since the models above handle categorical attributes only, they fail to thwart proximity privacy breaches in numerical sensitive attributes. As a result, several privacy models such as (k, e)-anonymity [20], variance control [10], (ε, m)-anonymity [21] and t-closeness [8] have been put forth. (k, e)-anonymity requires that the difference between the maximum and minimum sensitive values in a QI-group be at least e, while the variance control principle demands that the variance of the sensitive values be not less than a threshold. (ε, m)-anonymity, a stronger model, requires that for any sensitive value in a QI-group, at most 1/m of the records can have sensitive values similar to that value, where ε determines the similarity. The stringent privacy model t-closeness [8], which mainly combats data distribution skewness attacks by requiring that the distribution of sensitive values in any QI-group be close to the distribution of the entire data set, incorporates semantics through the kernel smoothing technique to mitigate proximity breaches to a certain extent. Moreover, t-closeness is applicable to both categorical and numerical attributes as it only demands a pre-defined distance matrix. But t-closeness is insufficient to protect against proximity attacks, as pointed out in [21]. To cope with both categorical and numerical attributes, Wang et al. [12] proposed a general proximity privacy model, namely (ϵ, δ)^k-dissimilarity, where ϵ determines the similarity threshold, δ controls the least number of dissimilar sensitive values for any sensitive value in a QI-group, and k means that k-anonymity is integrated. The privacy model we propose herein can be regarded as an extended form of (ϵ, δ)^k-dissimilarity. Nevertheless, our model differs from it in that multiple sensitive attributes are taken into account and categorical sensitive attributes have semantic proximity in terms of their taxonomy trees.

Scalability issues of anonymization over large-scale data sets have drawn the attention of research communities. LeFevre et al. [10] addressed the scalability problem of the multidimensional anonymization scheme [24] by introducing scalable decision trees and sampling techniques. Iwuchukwu et al. [13] proposed an R-tree index-based approach that builds a spatial index over data sets, achieving high efficiency. Fung et al. [25], [26] proposed a top-down specialization approach, improving the efficiency of the sub-tree anonymization scheme by exploiting a data structure named Taxonomy Indexed PartitionS (TIPS). Our previous work [14], [15] addressed the scalability problem of the sub-tree scheme in big data scenarios by leveraging the MapReduce paradigm. However, the approaches above aim at either the multidimensional scheme or the sub-tree scheme, both of which are global recoding, thereby failing to work for the local-recoding scheme investigated herein.

3 PRELIMINARIES AND PROBLEM ANALYSIS

3.1 Local-Recoding Anonymization Scheme

To facilitate subsequent discussion, we briefly introduce the concept of local-recoding anonymization as background knowledge. Local recoding, also known as cell generalization, is one of the schemes outlined in [5]. Other schemes include full-domain, sub-tree and multidimensional anonymization. Local recoding generalizes a data set at the cell level, while global recoding generalizes it at the domain level. The last three schemes mentioned above are global recoding. Generally, local recoding minimizes the data distortion incurred by anonymization, and therefore produces better data utility than global recoding.

TABLE 1
Basic Symbols and Notations

  D      — A data set containing n data records.
  r      — An original data record, r ∈ D and r = ⟨v_1, ..., v_{mQI}, sv_1, ..., sv_{mS}⟩, where v_i (1 ≤ i ≤ mQI) is a quasi-identifier attribute value and sv_j (1 ≤ j ≤ mS) is a sensitive attribute value; mQI and mS are the numbers of the two types of attributes, respectively.
  TT_i   — The taxonomy tree of categorical attribute att_i.
  DOM_i  — The set of all domain values in TT_i for categorical attribute att_i, or of all domain intervals for numerical attribute att_i.
  V_i    — The set of attribute values of att^QI_i.
  SV_i   — The set of sensitive values of att^S_i.
  qid    — A quasi-identifier, qid = ⟨q_1, ..., q_{mQI}⟩, q_i ∈ DOM_i.
  QID    — The set of quasi-identifiers, QID = ⟨DOM_1, ..., DOM_{mQI}⟩.
  QIG    — The quasi-identifier group containing all records with the same quasi-identifier.

Table 1 lists some basic symbols and notations. Each record in D consists of both quasi-identifier attributes and sensitive attributes. Quasi-identifier attributes are the ones that can potentially be linked to individuals uniquely if combined with external data sets, e.g., age and sex. If a sensitive value is associated with an identified individual, economic loss or reputation damage to the person will probably occur. Thus, quasi-identifiers are usually anonymized to preserve privacy, while sensitive values are often kept in their original form for the sake of data mining or data analytics. The consequence of anonymization is that the data are partitioned into a set of groups, each represented by an anonymous quasi-identifier. Such a group is called a QI-group, denoted by QIG in Table 1. In this way, individual privacy is preserved while aggregate information remains available for data mining or analytics.

We consider both numerical and categorical attributes for local recoding herein, and assume that a taxonomy tree is given for each categorical attribute. To facilitate the discussion, it is assumed that the attributes are arranged in order, i.e., for quasi-identifier attributes, the scheme is ATT^QI = ⟨att^QI_1, ..., att^QI_{lQI}, att^QI_{lQI+1}, ..., att^QI_{mQI}⟩, where the first lQI attributes are numerical while the rest are categorical, and for sensitive attributes, the scheme is ATT^S = ⟨att^S_1, ..., att^S_{lS}, att^S_{lS+1}, ..., att^S_{mS}⟩, where the first lS attributes are numerical while the rest are categorical.

Based on the notions above, the local-recoding scheme is formally described as follows. Data records in D can be regarded as data points in a high-dimensional space. The local-recoding scheme defines a set of functions on mostly
overlapping multidimensional regions which cover V_1 × ··· × V_m, where region overlapping means that multiple regions may contain records with identical quasi-identifiers. Specifically, a function φ_l : R_l → QID is defined for a region R_l, where l is an arbitrary number indexing the region. In fact, a region corresponds to a QI-group. Therefore, the core sub-problems of local recoding are how to construct multidimensional regions and how to choose the functions {φ_l}. In our approach, we leverage the clustering technique to build multidimensional regions while keeping the proximity privacy of sensitive attributes in mind. For each region, the categorical quasi-identifier attribute values are generalized to their lowest common domain value in the taxonomy tree, and the numerical ones are replaced by an interval that covers them minimally. Unlike local recoding, the sub-tree and full-domain schemes have one function over each attribute, i.e., f_i : V_i → DOM_i, 1 ≤ i ≤ m, and the global multidimensional scheme has a single function over all the attributes, i.e., f : V_1 × ··· × V_m → QID.

Preserving privacy is one side of anonymization. The other is retaining aggregate information for data mining or analytics over the anonymous data. Several data metrics have been proposed to capture this [5], e.g., minimal distortion (MD) [6], ILoss [27] and the discernibility metric (DM) [7]. Given a data metric, the problem of optimal local recoding is to find the local-recoding solution that makes the metric optimal. However, theoretical analyses demonstrate that this problem is NP-hard under most non-trivial data utility metrics [5]. As a result, most existing approaches [9], [11], [18], [19] instead try to find a minimal local recoding to achieve practical efficiency and a near-optimal solution, where minimal local recoding means that no more partitioning operations are allowed when building multidimensional regions under a certain privacy model. Our proximity-aware two-phase clustering approach herein also follows this line.

So far, only the k-anonymity privacy model has been employed to preserve privacy against record linkage attacks in existing clustering-based anonymization approaches. The k-anonymity privacy model requires that for any qid ∈ QID, the size of QIG(qid) must be zero or at least k, so that a quasi-identifier cannot be distinguished from at least k − 1 other ones in the same QI-group [6]. Usually, it is assumed that an adversary already has the knowledge that an individual is definitely in a data set, which occurs in many real-life data sets like tax data sets. After local recoding, the upper-bound size of a QI-group is 2k − 1 under the k-anonymity privacy model. If there were a QI-group of size at least 2k, it should be split into two groups of size at least k to maximize data utility.

3.2 MapReduce Basics

MapReduce [28], a parallel and distributed large-scale data processing paradigm, has been extensively researched and widely adopted for big data applications recently [29]. Integrated with infrastructure resources provisioned by cloud systems, MapReduce becomes much more powerful, elastic and cost-effective due to the salient characteristics of cloud computing. A typical example is the Amazon Elastic MapReduce service.

Basically, a MapReduce job consists of two primitive functions, Map and Reduce, defined over a data structure named a key-value pair (key, value). Specifically, the Map function can be formalized as Map: (k1, v1) → (k2, v2), i.e., Map takes a pair (k1, v1) as input and outputs another, intermediate key-value pair (k2, v2). These intermediate pairs are consumed by the Reduce function as input. Formally, the Reduce function can be represented as Reduce: (k2, list(v2)) → (k3, v3), i.e., Reduce takes an intermediate key k2 and all its corresponding values list(v2) as input and outputs another pair (k3, v3). Usually, the list of (k3, v3) pairs is the result which MapReduce users attempt to obtain. Both Map and Reduce functions are specified by users according to their specific applications. An instance running a Map function is called a Mapper, and one running a Reduce function is called a Reducer.

3.3 Motivation and Problem Analysis

In this section, we analyze the problems of existing approaches for local-recoding anonymization from the perspectives of proximity privacy and scalability. Further, challenges of designing scalable MapReduce algorithms for proximity-aware local recoding are also identified.

Little attention has been paid to the local-recoding anonymization scheme under proximity-aware privacy models. As mentioned in Section 2, most existing local-recoding approaches concentrate on combating record linkage attacks by employing the k-anonymity privacy model. As demonstrated in existing work [7], [8], [12], [21], however, k-anonymity fails to combat attribute linkage attacks such as homogeneity attacks, skewness attacks and proximity attacks. For instance, if the sensitive values of the records in a QI-group of size k are identical or quite similar, adversaries can still link an individual with certain sensitive values with high confidence although the QI-group satisfies k-anonymity, resulting in privacy violation. Accordingly, a plethora of privacy models have been proposed to thwart such attacks, as shown in Section 2. But these models have rarely been exploited in the local-recoding scheme, except for the work in [23]. This phenomenon mainly results from two reasons, analyzed as follows.

The first is that, unlike in global-recoding schemes, k-anonymity based approaches for record linkage attacks cannot simply be extended to attribute linkage attacks. Since global-recoding schemes partition data sets according to domains, they can be fulfilled effectively in a top-down fashion. This property of global-recoding schemes ensures that k-anonymity based approaches can be extended to combat attribute linkage attacks through checking extra privacy satisfiability during each round of the top-down anonymization process [5]. However, the local-recoding scheme fails to share the same merits because it partitions data sets in a clustering fashion where the top-down anonymization property is inapplicable. Although Wong et al. [23] proposed a top-down approach for local recoding, the approach can only achieve partially local recoding because global recoding is exploited to partition data sets as the first step and local recoding is only conducted inside each partition. Consequently, their approach will incur more data distortion compared with the full potential of the local-recoding scheme.

The second reason is that most proximity-aware privacy models have the property of non-monotonicity [21], which makes such models hard to achieve in a top-down way,
even for global-recoding schemes. Formally, monotonicity


refers to that if two disjoint data subsets G1 and G2 of a data
set satisfy a privacy model, their union G1 [ G2 satisfies the
model as well. Monotonicity is a prerequisite for top-down
anonymization approaches because it ensures to find mini-
mally anonymized data sets. Specifically, if a data set does
not satisfy a privacy model, we can infer that any of its sub- Fig. 1. A taxonomy tree of attribute Disease.
sets will fail to satisfy the model. Thus, when anonymizing
data sets in a top-down fashion, we can terminate the pro-
4 PROXIMITY-AWARE CLUSTERING PROBLEM OF
cess if further partitioning a subset violates the privacy
model. However, most proximity-aware privacy models LOCAL-RECODING ANONYMIZATION
such as ð"; mÞ-anonymity and ð; dÞk -dissimilarity fail to pos- Due to the non-monotonicity property of proximity-aware
sess the property of monotonicity. As a consequence, most privacy models and characteristics of local recoding, clus-
existing anonymization approaches become inapplicable tering is a natural and promising way to group both quasi-
with such privacy models [21]. A two-step approach based identifier attributes and sensitive attributes. Hence, we pro-
on the Mondrian algorithm [24] is presented in [21] to pose to model the problem of local-recoding anonymization
obtain ð"; mÞ-anonymous data sets, via k-anonymizing a under proximity-aware privacy models as a clustering
data set first, and adjusting partitions to achieve the privacy problem in this section. Specifically, a proximity-aware pri-
requirements then. However, this approach targets the mul- vacy model is formulated in Section 4.1 and the clustering
tidimensional scheme, rather than local recoding investi- problem is formalized in Section 4.2.
gated herein. Furthermore, proximity is not integrated into
the search metric that guides data partitioning in the two- 4.1 Proximity-Aware Privacy Model
step approach, potentially incurring high data distortion. In big data scenarios, multiple sensitive attributes are often
Wang and Liu [30] proposed an anonymization model XCo- contained in data sets, while existing proximity-aware pri-
lor under the ð; dÞk -dissimilarity model, yet there is still a vacy models assume only one single sensitive attribute,
gap between its theoretical methodology and a practical either categorical or numerical. Hence, we assume multiple
algorithm as they acknowledged. sensitive attributes in our privacy model, including both
In terms of the analyses above, achieving the local-recod- categorical and numerical attributes. As the discussion of
ing scheme under proximity-aware privacy models is still a proximity privacy attacks stems from numerical attributes,
challenging problem. To our best knowledge, no previous existing proximity-aware privacy models assume that cate-
…categorical attribute values have no sense of semantic proximity [12], [21]. That is, categorical values are examined only as to whether they are exactly identical or different. Also, privacy models for categorical attributes aim only at avoiding exact reconstruction of sensitive values by limiting the number or distribution of sensitive values, without considering semantic proximity [7], [8]. However, sensitive categorical values often do have a sense of semantic proximity in real-life applications, because the values are usually organized in a taxonomy tree according to domain-specific knowledge. For instance, a taxonomy tree of diseases is presented in [27]. A similar one is depicted in Fig. 1 to facilitate our discussion.

Privacy breaches can still take place even if an anonymous data set already satisfies existing privacy models like l-diversity or t-closeness. For instance, an individual identified in a 3-diverse QI-group with sensitive values {syphilis, gonorrhea, HIV} will be associated with "venereal disease" with 100 percent confidence based on the taxonomy tree in Fig. 1. This inference can lead to a severe privacy breach. We call such an attack a "categorical proximity breach". The core of the breach is the semantic proximity among categorical values defined by the domain-specific knowledge, which is ignored in previous privacy models. With the notion of proximity of categorical sensitive values, we extend the proximity privacy model (ε, δ)^k-dissimilarity in [12] to multiple sensitive attributes, including both categorical and numerical ones. Our privacy model is named (ε⁺, δ)^k-dissimilarity, where "+" indicates that the proximity of categorical values is taken into account.

…work focuses on this problem. Motivated by this challenge, we propose a proximity-aware clustering approach for local-recoding anonymization.

Existing clustering approaches for anonymization are inherently sequential and assume that the data sets processed can fit into memory [9], [18], [19]. Unfortunately, this assumption often fails to hold in most big data applications in cloud nowadays. As a result, these approaches often suffer from scalability problems when encountering big data applications. Even if a single machine with huge memory could be offered, the I/O cost of reading and writing very large data sets in a serial manner would be quite high. Thus, parallelism is not merely an option but by far the best choice for big data applications. Utilizing a number of small, cheap computation nodes rather than one large, expensive machine is also more cost-effective, which coheres with the spirit of cloud computing, where computation is provisioned in the form of various virtual machines.

We attempt to leverage MapReduce in cloud to address the scalability problem of clustering approaches for anonymization. However, designing proper MapReduce jobs for complex applications is still a challenge, as MapReduce is a constrained programming paradigm. Usually, it is necessary to consider issues such as which parts of an application can be parallelized with MapReduce, how to design Map and Reduce functions so that they are scalable, and how to reduce network traffic among worker nodes. The answers to these questions often vary across applications. Hence, extensive research is still required to design MapReduce jobs for a specific application.
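The inference behind such a categorical proximity breach is just a lowest-common-ancestor (LCA) computation over the taxonomy tree. A minimal Python sketch, using a hypothetical fragment of a disease taxonomy standing in for Fig. 1 (encoded as a child → parent map), illustrates it:

```python
# Hypothetical fragment of a disease taxonomy (child -> parent),
# standing in for the tree of Fig. 1; the exact tree may differ.
TAXONOMY = {
    "syphilis": "venereal disease", "gonorrhea": "venereal disease",
    "HIV": "venereal disease", "flu": "respiratory infection",
    "venereal disease": "disease", "respiratory infection": "disease",
}

def ancestors(value):
    """Chain from a value up to the taxonomy root, inclusive."""
    chain = [value]
    while chain[-1] in TAXONOMY:
        chain.append(TAXONOMY[chain[-1]])
    return chain

def group_lca(values):
    """Lowest common ancestor of a QI-group's sensitive values: the most
    specific concept an adversary can infer with 100 percent confidence."""
    common = set(ancestors(values[0]))
    for v in values[1:]:
        common &= set(ancestors(v))
    # deepest common ancestor = the one with the longest chain to the root
    return max(common, key=lambda a: len(ancestors(a)))

print(group_lca(["syphilis", "gonorrhea", "HIV"]))  # venereal disease
```

Even though the three values are distinct (3-diverse), their LCA is a single, highly sensitive concept — exactly the inference the extended privacy model is designed to resist.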

Authorized licensed use limited to: GITAM University. Downloaded on November 08,2022 at 09:51:50 UTC from IEEE Xplore. Restrictions apply.
2298 IEEE TRANSACTIONS ON COMPUTERS, VOL. 64, NO. 8, AUGUST 2015

To capture the dissimilarity between sensitive values of two records, the distance metric should be defined first. Let sv, sv′ ∈ SV_i denote two sensitive values from two records, where the meaning of SV_i is already described in Table 1. To establish the distance metric over all sensitive attributes, distances are normalized in the subsequent definitions. For a numerical attribute, the normalized distance between sv and sv′, d_N(sv, sv′), is defined as:

$$d_N(sv, sv') = |sv - sv'| / R, \qquad (1)$$

where R = (sv_max − sv_min) denotes the range of the attribute domain, and sv_max and sv_min are the maximum and minimum values of the attribute, respectively. For a categorical attribute, the normalized distance between sv and sv′, d_C(sv, sv′), is defined as:

$$d_C(sv, sv') = L(sv, sv') / (2 \cdot H(TT_i)), \qquad (2)$$

where L(sv, sv′) is the shortest path length between sv and sv′, and H(TT_i) is the height of the taxonomy tree. Note that 2 · H(TT_i) is the maximum path length between any two nodes in the tree. Our definition is more flexible than that in [19], as it does not require all the leaf nodes of the taxonomy tree to have the same depth, which is quite common in real-life applications. L(sv, sv′) can be computed with ease by finding the Lowest Common Ancestor (LCA) of sv and sv′.

Let r^S = ⟨sv_1, …, sv_{m_S}⟩ denote the vector of sensitive values of a record r. For convenience, r^S is named the sensitive vector. With the definitions for single attributes, the distance between sensitive vectors r_x^S = ⟨sv_1, …, sv_{m_S}⟩ and r_y^S = ⟨sv′_1, …, sv′_{m_S}⟩, d(r_x^S, r_y^S), is defined as:

$$d(r_x^S, r_y^S) = \sum_{i=1}^{l_S} w_i^S\, d_N + \sum_{i=l_S+1}^{m_S} w_i^S\, d_C, \qquad (3)$$

where d_N and d_C are short for d_N(sv_i, sv′_i) and d_C(sv_i, sv′_i), respectively, and w_i^S, 1 ≤ i ≤ m_S, are weights for the sensitive attributes. The weights, satisfying 0 ≤ w_i^S ≤ 1 and Σ_{i=1}^{m_S} w_i^S = 1, indicate the importance of each sensitive attribute and are usually specified by domain experts for flexibility.

Let SV_qid = {r_1^S, …, r_n^S} be the set of sensitive vectors in a QI-group QIG(qid) of size n. Given a parameter ε⁺ ≥ 0, the "ε⁺-neighborhood" of a sensitive vector r^S, denoted as Φ_qid(r^S, ε⁺), is defined as: Φ_qid(r^S, ε⁺) = {r_x^S | r_x^S ∈ SV_qid and d(r^S, r_x^S) ≤ ε⁺}. A sensitive vector is regarded as similar to r^S if it lies in Φ_qid(r^S, ε⁺). Thus, the parameter ε⁺ controls the similarity between two sensitive vectors.

To preserve privacy against proximity attacks, it is expected that the sensitive vectors in a QI-group are as dissimilar to each other as possible. Accordingly, we define the (ε⁺, δ)^k-dissimilarity privacy model based on [12]. The model requires that for any QI-group QIG(qid) in an anonymous data set, its size is at least k, and any sensitive vector in SV_qid must be dissimilar to at least δ · (|QIG(qid)| − 1) other sensitive vectors in QIG(qid), where 0 ≤ δ ≤ 1. Parameter k controls the size of each QI-group to prevent record linkage attacks. Parameter δ specifies constraints on the number of ε⁺-neighbors that each sensitive vector can own to combat proximity attacks. As (ε⁺, δ)^k-dissimilarity is extended from (ε, δ)^k-dissimilarity, it shares the effectiveness and most characteristics of the latter, as described in [12]. But the monotonicity of (ε, δ)^k-dissimilarity has not been discussed in [12]. We formally establish the non-monotonicity property of the two models in the following theorem.

Theorem 1. Neither (ε, δ)^k-dissimilarity nor (ε⁺, δ)^k-dissimilarity is monotonic.

Proof. It suffices to prove this theorem by finding a counter-example for (ε, δ)^k-dissimilarity, where two QI-groups QIG_x and QIG_y each satisfy (ε, δ)^k-dissimilarity, but their union QIG_x ∪ QIG_y does not. A counter-example is as follows. Assume a QI-group QIG_x of size 3 has one-dimensional numerical sensitive values {1.0, 3.0, 5.0}, and another QI-group QIG_y of size 3 has sensitive values {2.0, 4.0, 6.0}. Let ε, δ and k be 1.5, 1 and 3, respectively. Both QI-groups satisfy (1.5, 1)^3-dissimilarity. After merging the two QI-groups, the set of sensitive values is {1.0, 2.0, 3.0, 4.0, 5.0, 6.0}. For the value 3.0, only the values {1.0, 5.0, 6.0} are dissimilar in the sense of ε = 1.5. But the model requires that 1 · (6 − 1) = 5 values be dissimilar to 3.0. So, the theorem is proved by this contradiction. □

4.2 Proximity-Aware Clustering Problem of Local-Recoding Anonymization

Due to the non-monotonicity property, the satisfiability problem of (ε⁺, δ)^k-dissimilarity is hard. The satisfiability problem asks whether there exists a partition that makes the anonymous data set satisfy (ε⁺, δ)^k-dissimilarity. Based on the results in [12], we have the following theorem.

Theorem 2. In general, the (ε⁺, δ)^k-dissimilarity satisfiability problem is NP-hard.

Proof. Since the (ε, δ)^k-dissimilarity satisfiability problem is proved to be NP-hard in [12], the conclusion holds as well for the (ε⁺, δ)^k-dissimilarity satisfiability problem. The reason is that (ε, δ)^k-dissimilarity can be regarded as (ε⁺, δ)^k-dissimilarity in a specific setting where categorical proximity is ignored and a single sensitive attribute rather than multiple ones is considered. □

In terms of the complexity result in Theorem 2, it is interesting and practical to find an efficient and near-optimal solution that makes the proximity among records in a QI-group as low as possible. Due to the non-monotonicity of (ε⁺, δ)^k-dissimilarity, top-down partitioning at domain levels fails to work for this privacy model. But clustering records with low proximity of sensitive values is still a promising way to make the proximity as low as possible. As clustering is also a natural and effective way to realize the local-recoding scheme, we propose a novel clustering approach for local recoding by integrating proximity among sensitive values. Specifically, our approach attempts to minimize both data distortion and proximity among sensitive values in a QI-group when conducting clustering, unlike all the existing k-member clustering approaches (kMCs) that
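The behaviour of the model on the counter-example in the proof of Theorem 1 can be checked with a short Python sketch (hypothetical helper names; a single numerical sensitive attribute, so Eq. (1) is the whole distance):

```python
def d_num(sv1, sv2, R=1.0):
    """Eq. (1): normalized distance between two numerical sensitive values;
    R = 1 reproduces the raw differences used in the proof of Theorem 1."""
    return abs(sv1 - sv2) / R

def dissimilar_count(values, v, eps, dist):
    """Number of values dissimilar to v, i.e., outside its eps-neighborhood."""
    return sum(1 for x in values if x is not v and dist(v, x) > eps)

def satisfies_dissimilarity(values, eps, delta, k, dist=d_num):
    """Single-attribute sketch of the (eps, delta)^k-dissimilarity check:
    group size >= k, and every sensitive value is dissimilar to at least
    delta * (|QIG| - 1) of the others."""
    if len(values) < k:
        return False
    need = delta * (len(values) - 1)
    return all(dissimilar_count(values, v, eps, dist) >= need for v in values)

# Counter-example from the proof: each group satisfies (1.5, 1)^3-dissimilarity...
assert satisfies_dissimilarity([1.0, 3.0, 5.0], 1.5, 1.0, 3)
assert satisfies_dissimilarity([2.0, 4.0, 6.0], 1.5, 1.0, 3)
# ...but their union does not -- the model is not monotonic.
assert not satisfies_dissimilarity([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 1.5, 1.0, 3)
```

The full model would sum weighted per-attribute distances as in Eq. (3); the sketch fixes one attribute to keep the non-monotonicity argument visible.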
consider the former only. The clustering problem we attempt to address is referred to as the Proximity-Aware Clustering (PAC) problem, which is essentially a two-objective clustering problem.

As minimizing proximity is one objective of the PAC problem, we formally define a proximity index to capture the proximity between two records. According to (1), the proximity index between two numerical sensitive values sv and sv′ can be defined as d_N*(sv, sv′) = 1 − d_N(sv, sv′). Likewise, the index between two categorical values can be defined as d_C*(sv, sv′) = 1 − d_C(sv, sv′). Then, the proximity index between two sensitive vectors r_x^S and r_y^S, denoted by prox(r_x^S, r_y^S), is defined as:

$$prox(r_x^S, r_y^S) = \sum_{i=1}^{l_S} w_i^S\, d_N^* + \sum_{i=l_S+1}^{m_S} w_i^S\, d_C^*, \qquad (4)$$

where d_N* and d_C* are short for d_N*(sv_i, sv′_i) and d_C*(sv_i, sv′_i), respectively. With the notion of the proximity index between two sensitive vectors, we define the proximity index of a QI-group QIG(qid) by:

$$prox(QIG(qid)) = \max_{r_x \neq r_y \in QIG(qid)} prox(r_x^S, r_y^S). \qquad (5)$$

Here we use the maximum rather than the sum, as the former captures the risk of proximity breach for a QI-group according to the risk analyses in [12].

Similar to existing approaches, the other objective of the PAC problem is to minimize the data distortion caused by generalization operations. We define the notion of information loss to measure the amount of data distortion. Recall that the quasi-identifiers of all records in a final cluster will be generalized to a more general one. Specifically, all quasi-identifier values of a categorical attribute are generalized to their lowest common ancestor in the taxonomy tree, and numerical values are generalized to a minimum interval covering all the values. Let R_i^QI be the domain of attribute att_i^QI. The information loss of a record r ∈ QIG(qid), denoted as IL(r), is defined by:

$$IL(r) = \sum_{i=1}^{l_{QI}} w_i^{QI}\, \frac{v_i^{max} - v_i^{min}}{R_i^{QI}} + \sum_{i=l_{QI}+1}^{m_{QI}} w_i^{QI}\, \frac{L(v, v_{LCA})}{H(TT_i)}, \qquad (6)$$

where v_i^max and v_i^min are the maximum and minimum values of the attribute within QIG(qid), respectively, and L(v, v_LCA) is the length of the path from v to the lowest common ancestor v_LCA. Weights w_i^QI, 1 ≤ i ≤ m_QI, indicate the importance of each attribute, and satisfy 0 ≤ w_i^QI ≤ 1 and Σ_{i=1}^{m_QI} w_i^QI = 1. These weights bring flexibility to the definition of information loss for different applications. Usually, they are specified by domain-specific experts. Further, the information loss of QIG(qid) is defined by:

$$IL(QIG(qid)) = \sum_{r \in QIG(qid)} IL(r). \qquad (7)$$

Similar to the k-member clustering problem defined in [19], the PAC problem is formally defined as follows.

Definition 1 (Proximity-Aware Clustering Problem). Given a data set D, the Proximity-Aware Clustering problem is to find a set of clusters, denoted as C = {C_1, …, C_m}, such that each cluster contains at least k (k > 0) data records, and both the proximity of sensitive values in each cluster and the overall data distortion are minimized. Formally, the PAC problem is formulated as:

$$\text{Minimize} \; \Big\{ \max_{C \in \mathcal{C}} Prox(C); \; \sum_{C \in \mathcal{C}} IL(C) \Big\}, \quad \text{s.t.:}$$

1) ∪_{i=1,…,m} C_i = D, and C_i ∩ C_j = ∅, 1 ≤ i ≠ j ≤ m;
2) ∀C ∈ C, |C| ≥ k.

Constraint 2) makes clustering approaches for anonymization different from traditional ones, which usually constrain the number of clusters rather than the size of a single cluster. Moreover, the PAC problem differs from the k-member clustering problem in essence, as the former has one more objective, namely minimizing max_{C∈C} Prox(C) in terms of Definition 1. To explore the computational complexity of the PAC problem, the proximity-aware clustering decision problem is defined as follows.

Definition 2 (Proximity-Aware Clustering Decision Problem). Given a data set D, the decision problem is to find a set of clusters C = {C_1, …, C_m}, where ∪_{i=1,…,m} C_i = D and C_i ∩ C_j = ∅, 1 ≤ i ≠ j ≤ m, subject to:

1) ∀C ∈ C, |C| ≥ k;
2) max_{C∈C} Prox(C) ≤ c_1, c_1 > 0;
3) Σ_{C∈C} IL(C) ≤ c_2, c_2 > 0.

Theorem 3. The PAC decision problem is NP-hard.

Proof. It suffices to prove that this problem in a specific setting is NP-hard. We observe that the k-member clustering decision problem articulated in [19] is a special case of the PAC decision problem obtained by setting c_1 = +∞, i.e., posing no constraints on the sensitive attributes. Byun et al. [19] proved that the k-member clustering decision problem is NP-hard. It therefore follows that the PAC decision problem is NP-hard as well. □

In terms of Theorem 3, the PAC clustering problem is also NP-hard. To address this problem, we transform it into a single-objective clustering problem. To this aim, a distance measure between two records should be defined that takes the proximity between sensitive values into account. Unlike the distance metrics employed in existing k-member clustering approaches, ours is required to consider both the similarity of quasi-identifiers and the proximity of sensitive values. Intuitively, we desire that records tend to go into the same cluster if their quasi-identifiers are similar and the proximity of their sensitive values is low, i.e., the sensitive values are dissimilar. Thus, it is reasonable to integrate proximity into the distance measure, and hence the distance measure herein is defined over both quasi-identifier and sensitive attributes.

As to quasi-identifiers, the distance metric can be utilized directly to capture the similarity among them. Similar to (1) and (2), the normalized distance between two numerical quasi-identifier values v and v′ is defined as d_N(v, v′) = |v − v′|/R, and the distance between two categorical quasi-identifier
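A compact Python sketch of (4), (5) and the numerical part of (6)–(7) (hypothetical helper names; per-attribute distances are assumed already normalized to [0, 1]):

```python
def prox_pair(svx, svy, w, d_norm):
    """Eq. (4): proximity index between two sensitive vectors.
    w: attribute weights summing to 1; d_norm: per-attribute normalized
    distance functions, as in Eqs. (1)/(2)."""
    return sum(wi * (1.0 - di(a, b))
               for wi, di, a, b in zip(w, d_norm, svx, svy))

def prox_group(vectors, w, d_norm):
    """Eq. (5): proximity index of a QI-group = max pairwise proximity."""
    return max(prox_pair(x, y, w, d_norm)
               for i, x in enumerate(vectors) for y in vectors[i + 1:])

def il_record_numeric(group_ranges, domains, w):
    """Numerical part of Eq. (6): sum of w_i * (v_max - v_min) / R_i,
    with (v_min, v_max) taken over the record's QI-group."""
    return sum(wi * (vmax - vmin) / R
               for wi, (vmin, vmax), R in zip(w, group_ranges, domains))

w = (0.5, 0.5)                                   # one numerical, one categorical
d_norm = (lambda a, b: abs(a - b),               # Eq. (1), domain already [0, 1]
          lambda a, b: 0.0 if a == b else 1.0)   # degenerate form of Eq. (2)
vecs = [(0.0, "flu"), (0.0, "flu"), (1.0, "hiv")]
assert abs(prox_pair(vecs[0], vecs[2], w, d_norm) - 0.0) < 1e-9  # fully dissimilar
assert abs(prox_group(vecs, w, d_norm) - 1.0) < 1e-9   # two identical vectors
assert abs(il_record_numeric([(20, 40)], [100], [1.0]) - 0.2) < 1e-9
```

Eq. (7) is then just the sum of the per-record information losses over the QI-group.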
values is defined as d_C(v, v′) = L(v, v′)/(2 · H(TT_i)). Then, the distance between two quasi-identifiers r_x^QI and r_y^QI is defined by:

$$d(r_x^{QI}, r_y^{QI}) = \sum_{i=1}^{l_{QI}} w_i^{QI}\, d_N + \sum_{i=l_{QI}+1}^{m_{QI}} w_i^{QI}\, d_C, \qquad (8)$$

where d_N and d_C are short for d_N(v, v′) and d_C(v, v′), respectively. The weights w_i^QI, 1 ≤ i ≤ m_QI, are the same as in (6).

Combining the distance function (8) and the proximity index function (4), a proximity-aware distance measure between two data records r_x = ⟨r_x^QI, r_x^S⟩ and r_y = ⟨r_y^QI, r_y^S⟩, denoted as dist(r_x, r_y), is defined by:

$$dist(r_x, r_y) = \omega_{QI} \cdot d(r_x^{QI}, r_y^{QI}) + \omega_S \cdot prox(r_x^S, r_y^S), \qquad (9)$$

where the two weights, satisfying 0 ≤ ω_QI, ω_S ≤ 1 and ω_QI + ω_S = 1, control how much the quasi-identifier attributes and the sensitive attributes contribute to the distance measure. Note that if ω_QI = 1 and ω_S = 0, the distance measure is the same as that in k-member clustering approaches. However, the distance measure dist(·,·) is not a distance metric if ω_S ≠ 0, because it fails to satisfy the coincidence axiom. The coincidence axiom, one of the conditions that a distance metric must satisfy, requires that dist(r_x, r_y) = 0 if and only if r_x = r_y. Nevertheless, dist(·,·) can still capture the degree of similarity between records and guide them into proper clusters when we conduct clustering. In fact, dist(r_x, r_y) = 0 holds if and only if the quasi-identifiers are the same and the proximity between the sensitive values is lowest, while dist(r_x, r_y) attains its maximum value if and only if the distance between the quasi-identifiers is maximal and the sensitive values are the same. So, the distance measure dist(r_x, r_y) possesses the desired property of guiding records with similar quasi-identifiers but dissimilar sensitive values into the same clusters. Armed with the distance measure, we model the single-objective proximity-aware clustering (SPAC) problem as follows.

Definition 3 (Single-Objective Proximity-Aware Clustering Problem). Given a data set D, the single-objective proximity-aware clustering problem is to find a set of clusters of size at least k, denoted as C = {C_1, …, C_m}, such that the sum of all intra-cluster distances is minimized. Formally, the SPAC problem is formulated as:

$$\text{Minimize} \sum_{l=1,\ldots,m} |C_l| \cdot \max_{r_x, r_y \in C_l} dist(r_x, r_y), \quad \text{s.t.:}$$

1) ∪_{i=1,…,m} C_i = D, and C_i ∩ C_j = ∅, 1 ≤ i ≠ j ≤ m;
2) ∀C ∈ C, |C| ≥ k.

In fact, max_{r_x,r_y∈C_l} dist(r_x, r_y) in the objective function is the diameter of the cluster C_l. Intuitively, a smaller diameter tends to incur lower data distortion, because it implies that the lowest common ancestors for categorical attributes lie at lower levels and the minimum covering intervals for numerical attributes are shorter. It also tends to produce lower proximity of sensitive values. The factor |C_l| indicates that smaller clusters are preferred, because less data distortion will be incurred with smaller clusters.

Theorem 4. The SPAC problem is NP-hard.

Proof. It suffices to prove that this problem in a specific setting is NP-hard. The conventional k-member clustering problem [19] is a special case of the SPAC problem obtained by setting ω_QI = 1 and ω_S = 0, i.e., the distance is determined by quasi-identifier attributes only, the same as in the k-member clustering problem. The k-member clustering problem is NP-hard because its decision problem is NP-hard [19]. It thus follows that the SPAC problem is NP-hard as well. □

Given the hardness of the SPAC problem, it is practical and interesting to find a near-optimal solution time-efficiently in big data scenarios, rather than to seek the optimal one. The next section shows how to achieve this.

5 TWO-PHASE PROXIMITY-AWARE CLUSTERING USING MAPREDUCE

Except where otherwise noted, the proximity-aware clustering problem refers to the single-objective proximity-aware clustering problem hereafter. To address the SPAC problem in big data scenarios, we propose a two-phase clustering approach in which the point-assignment clustering method and the agglomerative clustering method are employed in the first and second phases, respectively. We outline the sketch of the two-phase clustering approach in Section 5.1. Then, the algorithmic details of the two phases are elaborated in Sections 5.2 and 5.3, respectively. We illustrate the execution process of our approach and analyze its performance in Section 5.4.

5.1 Sketch of Two-Phase Clustering

In order to choose proper clustering methods for the SPAC problem, some observations about clustering problems for data anonymization should be taken into account. First, the parameter k in the k-anonymity privacy model is relatively small compared with the scale of a data set in big data scenarios. Since the upper bound of the size of a cluster for local-recoding anonymization is 2k − 1, the size of clusters is also relatively small. Accordingly, the number of clusters will be quite large. Second, under the condition that the size of any cluster is not less than k, the smaller a cluster is, the more it is preferred, because smaller clusters tend to incur less data distortion. Ideally, the size of every cluster is exactly k. Third, the intrinsic clustering structure of a data set is helpful for local-recoding anonymization, but building such a structure is not the final purpose.

Given the observations above, the agglomerative clustering method is suitable for local-recoding anonymization, as the stopping criterion can be set to whether the size of a cluster reaches k. Moreover, the agglomerative clustering method can achieve minimum data distortion in the sense of the defined distance measure. Most existing approaches for k-anonymity mentioned in Section 2 employ greedy agglomerative clustering. But they construct clusters in a greedy manner rather than combining the two clusters with the minimum distance in each round, which results in more data distortion. The optimal agglomerative clustering method, however, suffers from scalability problems when handling large-scale data sets. Its time complexity is O(n² log n) when a priority queue is utilized. Worse still, the agglomerative
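The two properties just described — dist violating the coincidence axiom, yet being zero exactly for identical quasi-identifiers with maximally dissimilar sensitive values — can be checked with a one-dimensional Python sketch (hypothetical helpers, all values pre-normalized to [0, 1]):

```python
def dist(rx, ry, w_qi, w_s, d_qi, prox):
    """Eq. (9): proximity-aware distance between records rx = (qi, sv)."""
    return w_qi * d_qi(rx[0], ry[0]) + w_s * prox(rx[1], ry[1])

d_qi = lambda a, b: abs(a - b)        # Eq. (8) with one numerical QI attribute
prox = lambda a, b: 1.0 - abs(a - b)  # Eq. (4) with one numerical sensitive attr

r = (0.3, 0.7)
# Identical records share a sensitive value, so dist(r, r) > 0:
# the coincidence axiom fails, and dist is a measure rather than a metric.
assert dist(r, r, 0.5, 0.5, d_qi, prox) == 0.5
# dist = 0 exactly when the QIs coincide and the sensitive values are
# maximally dissimilar -- the combination the clustering is steered towards.
assert dist((0.3, 0.0), (0.3, 1.0), 0.5, 0.5, d_qi, prox) == 0.0
```

Setting w_qi = 1 and w_s = 0 recovers the purely QI-based distance of k-member clustering, as noted in the text.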
method is serial, which makes it difficult to adapt to parallel environments like MapReduce.

From the perspective of scalability, the point-assignment method seems ideal for local-recoding anonymization in MapReduce. The point-assignment process initializes a set of data records to represent the clusters, one for each, and assigns the remaining records to these clusters. The process is repeated until certain conditions are satisfied. Point assignment can be conducted in a scalable and parallel fashion in MapReduce. However, the set of representative records will become quite large according to the observations above: approximately, its size will be 1/k of the original data set. This fact makes it a challenge to distribute the representative records to the MapReduce workers that conduct point assignment independently according to them. Another problem is that the size of each cluster is uncontrollable in the point-assignment process. Thus, the size of a cluster can exceed the upper bound 2k − 1 or fall below k, especially when a data set is highly skewed. Extra effort is then often required to adjust clusters to a proper size.

Given the pros and cons of the two clustering methods for local-recoding anonymization, we propose a two-phase approach that combines both methods based on MapReduce. In the first phase, we leverage the point-assignment clustering method to partition an original data set into t clusters, where t is not necessarily related to k. For convenience, a cluster produced in the first phase is named a b-cluster. In the second phase, the agglomerative clustering method is run on each b-cluster simultaneously as a 'plug-in', which is similar to [31]. In this way, our approach shares the merits of both methods but avoids their drawbacks. Specifically, the two-phase approach can produce quality anonymous data sets with the agglomerative clustering method and gain high scalability with the point-assignment method. In addition, no extra adjustment is required.

Algorithm 1 describes the main steps of the two-phase clustering approach. Similar in spirit to the t-means family [22], [32], we propose the t-ancestors clustering algorithm for the point-assignment method. To avoid confusion, we employ the term 't-means' rather than 'k-means', which is commonly used in the literature. The details of the t-ancestors algorithm and the agglomerative algorithm are presented in Sections 5.2 and 5.3, respectively.

Algorithm 1. Sketch of Two-Phase Clustering
Input: Data set D, anonymity parameter k.
Output: Anonymous data set D*.
1: Run the t-ancestors clustering algorithm on D, obtaining a set of b-clusters C^b = {C_1^b, …, C_t^b};
2: For each b-cluster C_i^b ∈ C^b, 1 ≤ i ≤ t: run the agglomerative clustering algorithm on C_i^b, obtaining a set of clusters C_i = {C_{i1}, …, C_{i m_i}};
3: For each cluster C_j ∈ C, where C = ∪_{i=1}^t C_i, generalize C_j to C_j* by replacing each attribute value with a more general one;
4: Generate D* as the union of all generalized clusters C_j*, whose total number is Σ_{i=1}^t m_i.

As t is usually required in advance, we roughly estimate it and demonstrate that the two-phase clustering algorithm is scalable in big data scenarios. Let N be the capacity of a MapReduce task worker, i.e., either a Mapper or a Reducer. Concretely, the capacity N here means that the worker can accomplish the agglomerative clustering on a data set of size N within an acceptable time. The value of t is estimated as t ≈ |D|/N. Then, the expected maximum size of b-clusters, |D|/t, can be kept below the worker capacity N. In fact, skewness in a data set will affect the maximum size of b-clusters. Thus, the higher the degree of skewness, the larger t should be. As k ≪ N according to the aforementioned observations, t will be much smaller than |D|/k, which would be the number of representatives if the point-assignment clustering method were exploited directly for anonymization. Hence, the set of t representative records is relatively small and can be distributed to each MapReduce worker efficiently. As such, our approach can handle large-scale data sets in a linear manner with respect to the number of MapReduce workers, which can be provisioned with ease in cloud environments due to their scalability.

5.2 t-Ancestors Clustering for Data Partitioning

One core problem in the point-assignment method is how to represent a cluster. Similar to t-medians [32], we propose to leverage the 'ancestor' of the records in a cluster to represent the cluster. More precisely, an ancestor of a cluster refers to a data record whose value for each categorical quasi-identifier is the lowest common ancestor of the original values in the cluster. Each numerical quasi-identifier of an ancestor record is the median of the original values in the cluster. The notion of an ancestor record attempts to capture the logical centre of a cluster, like t-means/medians, but t-ancestors clustering is more suitable for anonymization due to the categorical attributes in the clustering problem herein.

To facilitate t-ancestors clustering, we take quasi-identifier attributes, but not sensitive ones, into consideration. This will rarely affect the proximity of sensitive values in a final cluster, because the clustering granularity in the first phase is rather coarse. Accordingly, we leverage the distance measure (8) to calculate the distance between a data record and an ancestor. Usually, an ancestor is not a real data record in the data set, but (8) can still be employed to calculate the distance between two vectors of attribute values. Except where otherwise noted, a record r in this section refers to its quasi-identifier part.

Initially, the t ancestors in the first round of point assignment are t records that are carefully selected as seeds. In general, the selection of these t records influences the quality of clustering to a certain extent. To obtain a good set of seeds, we pick data records that are as far away from each other as possible. Concretely, we accomplish seed selection via a MapReduce job SeedSelection, which outputs a set of seeds S = {r_1, …, r_t}. The Map and Reduce functions of SeedSelection are described in Algorithm 2. In this job, only one Reducer is utilized for seed selection due to the serial nature of the algorithm. To make it scalable to big data, we sample the original data set by emitting a record to the Reducer with probability N/|D| in the Map function, so that only about N records in total go to the Reducer. In the Reduce function, the first seed is picked at random, and then we repeatedly pick the record whose minimum distance to the existing seeds is largest, until the number of seeds reaches t.
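The Reduce side of SeedSelection is a farthest-first traversal over the in-memory sample. A Python sketch (hypothetical 1-D records and distance):

```python
import random

def select_seeds(sample, t, dist, rng=random.Random(42)):
    """Farthest-first traversal, mirroring the SeedSelection Reduce step:
    pick a first seed at random, then repeatedly add the record whose
    minimum distance to the already-chosen seeds is largest."""
    seeds = [rng.choice(sample)]
    while len(seeds) < t:
        rest = [r for r in sample if r not in seeds]
        seeds.append(max(rest, key=lambda r: min(dist(r, s) for s in seeds)))
    return seeds

# Well-separated 1-D records: the three seeds land in the three regions
# regardless of which record is chosen first.
records = [0.0, 0.1, 5.0, 5.1, 10.0]
seeds = select_seeds(records, 3, lambda a, b: abs(a - b))
assert len(seeds) == 3
assert min(abs(a - b) for i, a in enumerate(seeds) for b in seeds[i + 1:]) >= 4.9
```

The Map side contributes only the N/|D| sampling, so the quadratic work of this traversal is bounded by the sample size N rather than |D|.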
Algorithm 2. SeedSelection Map and Reduce
Input: Data record r, r ∈ D.
Output: A set of seeds S = {r_1, …, r_t}.
Map: Generate a random value rand, where 0 ≤ rand ≤ 1; if rand ≤ N/|D|, emit (1, r).
Reduce:
1: Select a random record r from list(r), S ← r;
2: While |S| < t: find the r ∈ list(r) that maximizes min_{r′∈S} d(r, r′); S ← r;
3: Emit (null, S).

The t-ancestors clustering algorithm exploits the Lloyd-style iterative refinement technique [22]. Each round of iteration consists of two steps, namely expectation (E) and maximization (M). In the E step, data records are assigned to their nearest ancestor and constitute a b-cluster. In the M step, the ancestor of a b-cluster is recomputed according to the records in the cluster. The new set of ancestors is used in the E step of the next round. Ideally, it is expected that the iteration converges, i.e., the assignments no longer change after a finite number of rounds. However, a Lloyd-style clustering algorithm using a distance measure other than the Euclidean distance may fail to converge, or converge very slowly. In practice, two widely-adopted stopping criteria are therefore employed together: 1) the difference between the ancestors of two consecutive rounds is smaller than a predefined threshold; 2) the number of iteration rounds reaches a predefined maximum. Formally, let S^i and S^(i+1) be the two sets of seeds in rounds i and (i+1), respectively. The difference between them, denoted by d(S^i, S^(i+1)), is defined as the average distance between their records:

$$d(S^i, S^{(i+1)}) = \Big( \sum_{j=1}^{t} d\big(r_j^i, r_j^{(i+1)}\big) \Big) \Big/ t. \qquad (10)$$

The first stopping criterion is quantified by d(S^i, S^(i+1)) < τ, where τ is a predefined threshold. Let u denote the predefined maximum number of iteration rounds. The t-ancestors clustering algorithm stops if either of the criteria above is satisfied. The resulting procedure is described in Algorithm 3.

Algorithm 3. t-Ancestors Clustering
Input: Data set D; parameter t; thresholds τ, u.
Output: t b-clusters C^b = {C_1^b, …, C_t^b}.
1: Run job SeedSelection to get initial seeds S^0; i ← 0;
2: Run job AncestorUpdate to get ancestors S^(i+1); i ← i + 1; while d(S^i, S^(i+1)) ≥ τ and i ≤ u, repeat Step 2;
3: Return the b-clusters with ancestors S^(i+1).

In each round of the while-loop in Algorithm 3, a MapReduce job named AncestorUpdate is designed to fulfill the E and M steps. Specifically, the Map function of the job is responsible for the point assignments of the E step, while the Reduce function accomplishes the re-computation of ancestors in the M step. The Map and Reduce functions are described in Algorithm 4. Two subroutines, Median() and Ancestor(), are utilized in the Reduce function to calculate the medians of numerical attributes and the ancestors of categorical attributes, respectively. Note that the Reduce function is scalable when t is set appropriately, and one Reducer can process more than one b-cluster in sequence if t is large enough.

Algorithm 4. AncestorUpdate Map and Reduce
Input: Data record r, r ∈ D; seeds of round i, S^i = {r_1, …, r_t}.
Output: Seeds of round (i+1), S^(i+1) = {r_1^(i+1), …, r_t^(i+1)}.
Map:
1: d_min ← +∞;
2: For j: 1 to t: if d(r, r_j) < d_min, then d_min ← d(r, r_j) and j_min ← j;
3: Emit (j_min, r).
Reduce:
1: For l: 1 to m_QI: if att_l^QI is numerical, then v_l ← Median(list(r), l); else v_l ← Ancestor(list(r), l);
2: Emit (j, r_j^(i+1) = ⟨v_1, …, v_{m_QI}⟩).

5.3 Proximity-Aware Agglomerative Clustering

Unlike Section 5.2, we leverage the proximity-aware distance measure (9) for the agglomerative clustering in this section. In the agglomerative clustering method, each data record is initially regarded as a cluster, and then two clusters are picked and merged in each round of iteration until some stopping criterion is satisfied. Usually, the two clusters with the shortest distance are merged. Thus, one core problem of the agglomerative clustering method is how to define the distance between two clusters. To coincide with the objective of the SPAC problem, we leverage the complete-linkage distance in our agglomerative clustering algorithm, i.e., the distance between two clusters equals the weighted distance between the two records (one in each cluster) that are farthest away from each other. In fact, after merging two such clusters, the distance between them is the diameter of the new cluster. Formally, the distance between clusters C_x and C_y, denoted as d(C_x, C_y), is calculated by:

$$d(C_x, C_y) = (|C_x| + |C_y|) \cdot \max_{r_x \in C_x,\, r_y \in C_y} dist(r_x, r_y). \qquad (11)$$

Unlike traditional agglomerative clustering algorithms, a cluster in our algorithm is no longer considered for further merging once its size is equal to or larger than k. If no two clusters of size less than k remain, the merging process stops. Accordingly, the maximum size of a cluster after merging is 2k − 2. It is possible that a single cluster of size less than k remains after merging. In that case, we assign each data record of the leftover cluster to the nearest cluster of size less than 2k − 1. In the extreme case that all clusters already have size 2k − 1, we randomly pick a cluster and move certain records from it to the leftover cluster to make the size of the latter k. Note that there are at most k − 1 clusters if this extreme case takes place. Finally, every cluster resulting from the proximity-aware agglomerative clustering algorithm has at least k records, but no more than 2k − 1 records.

Based on the analyses above, Algorithm 5 formally presents the proximity-aware agglomerative clustering algorithm. We leverage a priority queue PQueue to retain the distances between any two clusters, which aims at improving the performance of the agglomerative method. In the while-loop, the two clusters with the shortest distance are merged in step 3.1, and then the PQueue as well as the set of clusters C^(i+1) are adjusted in steps 3.2 and 3.3. In step 4, we cope…
Authorized licensed use limited to: GITAM University. Downloaded on November 08,2022 at 09:51:50 UTC from IEEE Xplore. Restrictions apply.
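The merging rules above can be sketched compactly. The following is a minimal in-memory Python sketch only (the paper's implementation is in Java inside Hadoop Reduce functions and uses the PQueue of Algorithm 5 instead of the quadratic scan here); the dist argument is a hypothetical stand-in for the paper's record-distance measure:

```python
from itertools import combinations

def cluster_distance(cx, cy, dist):
    # Eq. (11): d(Cx, Cy) = (|Cx| + |Cy|) * max pairwise record distance.
    return (len(cx) + len(cy)) * max(dist(a, b) for a in cx for b in cy)

def agglomerate(records, k, dist):
    """Greedy proximity-aware merging. Clusters reaching size >= k are
    finalized, so a merged cluster never exceeds 2k - 2 records; the one
    possible leftover cluster of size < k is absorbed at the end."""
    active = [[r] for r in records]   # clusters still eligible for merging
    final = []                        # finalized clusters (size >= k)
    while len(active) >= 2:
        # pick the pair of active clusters with the smallest distance (11)
        i, j = min(combinations(range(len(active)), 2),
                   key=lambda p: cluster_distance(active[p[0]], active[p[1]], dist))
        merged = active[i] + active[j]
        active = [c for idx, c in enumerate(active) if idx not in (i, j)]
        if len(merged) >= k:
            final.append(merged)      # no longer considered for merging
        else:
            active.append(merged)
    # absorb the leftover under-sized cluster, ignoring the extreme case
    # in which every finalized cluster is already full (size 2k - 1)
    if active:
        for r in active[0]:
            host = min((c for c in final if len(c) <= 2 * k - 2),
                       key=lambda c: cluster_distance([r], c, dist))
            host.append(r)
    return final
```

With k = 3 and absolute difference standing in for the record distance, every resulting QI-group ends up with between k and 2k − 1 records, matching the size bounds derived above.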
Algorithm 5. Proximity-Aware Agglomerative Clustering
Input: Data set C^b; k-anonymity parameter k.
Output: Clusters C = {C_1, ..., C_n}.
1: Initialize each record in C^b as a cluster, C^(0) = {C_1', ..., C_{n0}'}; C ← ∅; i ← 0;
2: ∀ C_x', C_y' ∈ C^(0), x ≠ y, PQueue ← ⟨C_x', C_y', d(C_x', C_y')⟩;
3: While PQueue is not empty
   3.1: ⟨C_x', C_y', d(C_x', C_y')⟩ ← PQueue; C_z' ← C_x' ∪ C_y';
   3.2: C^(i+1) ← C^(i) \ {C_x', C_y'}; delete entries involving C_x' or C_y' in PQueue;
   3.3: If |C_z'| ≥ k, then C ← C ∪ {C_z'};
        Else C^(i+1) ← C^(i+1) ∪ {C_z'}; ∀ C' ∈ C^(i+1), PQueue ← ⟨C', C_z', d(C', C_z')⟩;
4: If |C^(i+1)| = 1 and C'' ∈ C^(i+1), then ∀ r ∈ C'', find a cluster C ∈ C with |C| ≤ 2k − 2 minimizing d({r}, C), and C ← C ∪ {r}.

A MapReduce job named AgglomerativeClustering is designed to wrap Algorithm 5. Specifically, Algorithm 5 is plugged into the Reduce function of the job. After a Reducer collects all data records of a b-cluster, Algorithm 5 is executed to generate the final clusters (QI-groups). The Map function is relatively simple: it just emits a record and its corresponding cluster.

5.4 Execution Process and Performance Analysis

In order to demonstrate the proposed two-phase proximity-aware clustering approach visually, its execution process overview is illustrated in Fig. 2. The bold solid arrow line on the right indicates the timeline of the process.

Fig. 2. Execution process overview of two-phase clustering.

Seen from Fig. 2, the three MapReduce jobs are coordinated together to accomplish the local-recoding anonymization. From the perspective of control flow, our approach is partially parallel, because the first phase is sequential with the iteration of the SeedUpdate job, while the second phase is parallel. However, our approach is fully parallel from the perspective of data flow. The light solid arrow lines in Fig. 2 represent data flows in the canonical MapReduce framework, while the dashed arrow lines stand for the data flows of dispatching seeds to distributed caches and the data flow of updating seeds. An original data set is read by Map functions and its splits are processed in a parallel manner. As such, the two-phase clustering approach can handle large-scale data sets. Note that the amount of seeds (or ancestors) in the SeedUpdate job is relatively small with a proper parameter t, so that they can be delivered to distributed caches efficiently.

To evaluate our approach theoretically, we analyze the performance of each MapReduce job in the two phases from five aspects in terms of [33], namely, time cost, space cost, task workers, network traffic and execution rounds. Table 2 illustrates the complexity results of the five aspects for each job. For conciseness, we use the numbers 1, 2 and 3 to stand for the jobs SeedSelection, SeedUpdate and AgglomerativeClustering, respectively. M and R are short for the Map and Reduce functions, respectively. Most symbols mean the same as aforementioned. Parameter l denotes the number of iteration rounds of the job SeedUpdate, and P denotes the size of a data split fed to a Mapper. The number of all data records is n = |D|.

TABLE 2
Performance Analysis of MapReduce Jobs

Job  Func.  Time                  Space        Workers  Traffic  Rounds
1    M      O(1)                  O(1)         O(n/P)   O(N)     O(1)
1    R      O(t^2 (N - t))        O(N)         O(1)     O(N)     O(1)
2    M      O(t)                  O(t)         O(n/P)   O(n)     O(l)
2    R      O(m_QI (n/t)^2)       O(n/t)       O(n/N)   O(n)     O(l)
3    M      O(1)                  O(1)         O(n/P)   O(n)     O(1)
3    R      O((n/t)^2 log(n/t))   O((n/t)^2)   O(n/N)   O(n)     O(1)

Note that t ≥ n/N according to the analyses in Section 5.1. Thus, the time and space costs of the Map and Reduce functions have a constant upper bound determined by N. The number of task workers varies in a linear manner with respect to the scale of the data set. As such, our approach is scalable with appropriate valuing of t. The first and third jobs are one-pass, while the second one is iterative with O(l) rounds, where l is determined by the stopping criteria discussed in Section 5.2.

6 EXPERIMENT EVALUATION

6.1 Overall Comparison

To evaluate the effectiveness and efficiency of the Proximity-Aware Clustering (PAC) approach, we compare it with the k-member Clustering (kMC) approach proposed in [18], which also represents the approaches for local recoding in [9], [19]. The kMC approach is the state-of-the-art approach for local-recoding anonymization with clustering techniques. As to effectiveness, we consider three factors, namely, the resistibility to proximity breaches, data distortion and scalability.
TABLE 3
Summary of Experimental Parameters

Symbol  Notation
MIN     The minimum intra-cluster distance of a data set.
AVG     The average intra-cluster distance of a data set.
vs      The weight of proximity in the distance measure (9).
k       The k-anonymity parameter.
t       The number of partitions after t-ancestors clustering.
u, τ    Two parameters for the stopping criteria of t-ancestors clustering: u is the maximum number of iteration rounds, and τ is the threshold of the difference of the seeds between two consecutive iteration rounds.

Several notions are defined for convenience. To capture the resistibility to proximity breaches, we define two statistics for a QI-group G, namely, the minimum distance MIN_G and the average distance AVG_G. Let n_G = |G|; then

MIN_G = min_{r_x ≠ r_y ∈ G} d(r_x^S, r_y^S),   (12)

AVG_G = 2 · Σ_{r_x ≠ r_y ∈ G} d(r_x^S, r_y^S) / (n_G (n_G − 1)).   (13)

Actually, the QI-group here satisfies (MIN_G, n_G − 1)^k-dissimilarity. Hence, the resistibility can be captured by MIN_G directly. As to the entire data set D, we measure the distribution of MIN to capture the resistibility intuitively, where MIN is the minimum intra-cluster distance of D, a random variable ranging within [0, max_{G⊆D} MIN_G]. Technically, we leverage the Relative Cumulative Frequency (RCF, similar to the cumulative distribution function) [34] of MIN to describe the distribution for comparison. The average distance of QI-groups in D is defined as

AVG = (1 / |{G | G ⊆ D}|) · Σ_{G⊆D} AVG_G.   (14)

From an overall perspective, the average distance AVG can also reflect the vulnerability to proximity breaches to a certain extent. The metric ILoss [27] is employed to evaluate how much data distortion is incurred by PAC and kMC. The value of ILoss is normalized to facilitate comparisons. For scalability, we check whether both approaches can still work and scale over large-scale data sets. Execution time is measured to evaluate efficiency. Experimental parameters are summarized in Table 3.

We conduct an overall comparison between PAC and kMC in terms of the four factors above. Intuitively, PAC will produce more QI-groups with higher MIN_G than kMC, and the AVG of PAC will be larger than that of kMC. This is because PAC tends to assign dissimilar sensitive values to a cluster, while kMC puts sensitive values into clusters randomly. Normally, the ILoss of PAC will be higher than that of kMC, as there is a tradeoff between data distortion and the capability of defending privacy attacks. Note that this is a common phenomenon for privacy models (e.g., l-diversity and t-closeness) that fight against attribute linkage attacks [7], [8]. As to scalability, PAC can scale over more computation nodes with the data volume increasing, due to its parallelization capability. kMC will suffer from poor scalability over large-scale data sets because it requires too much memory, while PAC can linearly scale over data sets of any size. Correspondingly, the execution time of PAC is often less than that of kMC in big data scenarios. But note that the contrary case may occur, because of the extra overheads engendered by PAC, when the data set or the scale of the MapReduce cluster is small. PAC is equivalent to kMC if the weight vs = 0 and the parameter t = 1. Thus, kMC can be regarded as a special form of PAC.

6.2 Experiment Evaluation

6.2.1 Experiment Settings

Our experiments are conducted in a cloud environment named U-Cloud [4]. The system overview of U-Cloud is depicted in Fig. 3. The Hadoop cluster is built on U-Cloud. For more details about U-Cloud, please refer to [4].

Fig. 3. System overview of U-Cloud.

The data set Census-Income (KDD) [35] is utilized in our experiments. Its subset, the Adult data set, has been commonly used as a de facto benchmark for testing anonymization algorithms [9], [18], [19], [25], [26]. The data set is sanitized by removing records containing missing values and attributes with extremely skewed distributions. We obtain a sanitized data set with 153,926 records, from which the data sets in the following experiments are sampled. Twelve attributes are chosen out of the original 40, including nine quasi-identifier attributes (four numerical and five categorical) and three sensitive ones (two numerical and one categorical).

Both PAC and kMC are implemented in Java. Further, PAC is implemented with the standard Hadoop MapReduce API and executed on a Hadoop cluster built on OpenStack in U-Cloud. The Hadoop cluster consists of 20 virtual machines of type m1.medium, each with two virtual CPUs and 4 GB of memory. kMC is executed on one virtual machine of type m1.medium. The maximum heap size of the Java VM is set to 4 GB when running kMC. Each round of experiments is repeated 10 times, and the mean of the measured results is regarded as the representative.

6.2.2 Experiment Process and Results

We conduct two groups of experiments in this section to evaluate the effectiveness and efficiency of our approach. In the first group, we explore whether proximity-aware clustering leads to larger dissimilarity by comparing PAC with kMC from the perspectives of resistibility to proximity breaches and data distortion. The other group investigates the scalability and efficiency.
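The statistics MIN_G, AVG_G and AVG defined in (12)–(14), together with the RCF, are straightforward to compute. The following is a minimal Python sketch only (the paper's tooling is Java-based); the distance function d is a hypothetical stand-in for the sensitive-value distance:

```python
from itertools import combinations

def min_g(group, d):
    # Eq. (12): smallest pairwise distance between sensitive values in a QI-group
    return min(d(x, y) for x, y in combinations(group, 2))

def avg_g(group, d):
    # Eq. (13): 2 * (sum of pairwise distances) / (n_G * (n_G - 1))
    n = len(group)
    return 2.0 * sum(d(x, y) for x, y in combinations(group, 2)) / (n * (n - 1))

def avg(groups, d):
    # Eq. (14): mean of AVG_G over all QI-groups of the data set
    return sum(avg_g(g, d) for g in groups) / len(groups)

def rcf(values, bins, upper):
    # Relative cumulative frequency of MIN over `bins` equal-width bins on [0, upper]
    edges = [upper * (b + 1) / bins for b in range(bins)]
    return [sum(v <= e for v in values) / len(values) for e in edges]
```

For instance, with absolute difference as d, the group [0.0, 1.0, 3.0] has MIN_G = 1.0 and AVG_G = 2.0; feeding the per-group MIN values to rcf yields the kind of cumulative curves compared in the first group of experiments.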
Fig. 4. Change of RCF of MIN, AVG and normalized ILoss w.r.t. vs.

In the first group, we measure the change of the RCF of MIN, AVG and ILoss with respect to the weight vs. The weight vs varies from 0.0 to 1.0 with step 0.2. PAC is run over two data sets D1 and D2, with sizes |D1| = 1,000 and |D2| = 10,000, respectively. Note that the results of kMC are the same as those of PAC when vs = 0.0. The parameter k = 10 for D1, and k = 50 for D2, respectively. Other parameters are set as follows: t = 10, u = 5, τ = 0.001, and the number of Reducers is set as 10. The selection of some of these specific values is rather random and does not affect our analysis, because what we want to see is whether PAC results in larger dissimilarity than kMC. Interested readers can try other values, and the conclusion will be similar. The results of the first group are depicted in Fig. 4.

Figs. 4a and 4b demonstrate the RCF of MIN for D1 and D2, respectively, with 20 bins. Both of them show the same trend of RCF when the weight vs changes. The trend is that the curve of the RCF of MIN shifts right when vs gets larger, indicating that the percentage of resultant QI-groups with higher MIN increases with the growth of vs. In particular, the leftmost curve is that of kMC (vs = 0.0), which means that kMC produces the highest percentage of QI-groups with low MIN. Moreover, the percentile of an RCF curve goes up when vs becomes larger, which quantitatively reflects the above tendency. For instance, the median of an RCF curve (the x-axis coordinate of the point produced by the intersection of the 50 percent horizontal line and the curve) gets larger with the increase of vs, as depicted in Figs. 4a and 4b. From Fig. 4c, we can see that the average distance between two records within a QI-group also grows over both data sets when vs becomes larger. As such, PAC outperforms kMC in terms of preventing proximity attacks, as it can produce an anonymous data set with higher dissimilarity between two records in most QI-groups.

Meanwhile, it can be seen from Fig. 4d that the normalized value of ILoss rises as well when vs grows, indicating that more data distortion is incurred. In fact, the gain of dissimilarity comes at the cost of data utility, which is a common phenomenon as mentioned in Section 6.1. Fortunately, one can choose a proper weight vs to make a good trade-off between the capability of defending proximity attacks and data utility. For example, vs = 0.6 seems to be a good choice via observing Fig. 4.

Fig. 5. Change of execution time w.r.t. data size and the number of computation nodes.

In the second group of experiments, the execution times of PAC and kMC are gauged to investigate scalability and efficiency. Fig. 5a describes the change of execution time of PAC and kMC with respect to the number of data records, which ranges from 10,000 (10k) to 100,000 (100k). As the scale of data in these experiments is much greater than that in [9], [18], [19], the data sets in our experiments are big enough to evaluate the effectiveness of our approach in terms of data volume. The value of t varies with the data sets, roughly making the size of a b-cluster 1,000. The k-anonymity parameter k is set as 10. Other parameters are set as follows: vs = 0.5, u = 5 and τ = 0.001. Ten computation nodes are employed for PAC.

From Fig. 5a, we can see that the execution times of both PAC and kMC go up when the number of records increases, although some slight fluctuations exist. The fluctuations, mainly incurred by the data distribution of each data set, will not affect the trends of execution time. Notably, the execution time of kMC surges from hundreds of seconds to more than 10,000 s within only four steps, while that of PAC grows linearly and stably. The dramatic increase of kMC's execution time illustrates that the intrinsic time complexity of kMC makes it hard to scale over big data. The difference of execution time between kMC and PAC becomes larger and larger as the data size grows. This trend demonstrates that our approach becomes much more scalable and efficient compared with kMC in big data scenarios.

Fig. 5b exhibits the change of execution time of PAC with respect to the number of computation nodes, ranging from 5 to 15. The number of data records is set as 10,000, and other settings are the same as in Fig. 5a. It can be seen from Fig. 5b that the execution time decreases in a nearly linear manner when the number of computation nodes gets larger. In terms of this tendency, we maintain that PAC is linearly scalable with respect to the number of computation nodes. Hence, PAC is able to handle big data local recoding in a timely fashion in cloud, where computation resources are offered on demand.

Above all, the experimental results demonstrate that our approach, integrating the proximity of sensitive attributes into clustering, significantly improves the capability of defending proximity attacks, the scalability and the efficiency of local-recoding anonymization over existing approaches.
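The median of an RCF curve, used in the first group of experiments above, is simply the point where the curve crosses the 50 percent horizontal line, i.e., the 50th percentile of the underlying per-QI-group MIN values. A minimal Python sketch:

```python
def rcf_median(values):
    # x-coordinate where the RCF curve crosses the 50 percent line:
    # the smallest value v such that at least half of the values are <= v
    ordered = sorted(values)
    return ordered[(len(ordered) + 1) // 2 - 1]
```

A rightward shift of this quantity as vs grows is exactly what Figs. 4a and 4b quantify.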
7 CONCLUSIONS AND FUTURE WORK

In this paper, local-recoding anonymization for big data in cloud has been investigated from the perspectives of the capability of defending proximity privacy breaches, scalability and time-efficiency. We have proposed a proximity privacy model, (ε+, δ)^k-dissimilarity, which allows multiple sensitive attributes and semantic proximity of categorical sensitive values. Since the satisfiability problem of (ε+, δ)^k-dissimilarity is NP-hard, the problem of big data local recoding against proximity privacy breaches has been modeled as a proximity-aware clustering problem. We have proposed a scalable two-phase clustering approach based on MapReduce to address the above problem time-efficiently. A series of innovative MapReduce jobs have been developed and coordinated to conduct data-parallel computation. Extensive experiments on real-world data sets have demonstrated that our approach significantly improves the capability of defending proximity attacks, the scalability and the time-efficiency of local-recoding anonymization over existing approaches.

In cloud environments, privacy preservation for data analysis, sharing and mining is a challenging research issue due to increasingly large volumes of data sets, thereby requiring intensive investigation. Based on the contributions herein, we plan to integrate our approach with Apache Mahout, a MapReduce-based scalable machine learning and data mining library, to achieve highly scalable privacy-preserving big data mining and analytics.

ACKNOWLEDGMENTS

This paper was partly supported by Australian Research Council Linkage Project LP140100816 and by the National Science Foundation of China under Grant 91318301. Pei's research in this paper was supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES

[1] S. Chaudhuri, "What next?: A half-dozen data management research goals for big data and the cloud," in Proc. 31st Symp. Principles Database Syst., 2012, pp. 1–4.
[2] L. Wang, J. Zhan, W. Shi, and Y. Liang, "In cloud, can scientific communities benefit from the economies of scale?" IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 2, pp. 296–303, Feb. 2012.
[3] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data mining with big data," IEEE Trans. Knowl. Data Eng., vol. 26, no. 1, pp. 97–107, Jan. 2014.
[4] X. Zhang, C. Liu, S. Nepal, S. Pandey, and J. Chen, "A privacy leakage upper bound constraint-based approach for cost-effective privacy preserving of intermediate data sets in cloud," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 6, pp. 1192–1202, Jun. 2013.
[5] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Comput. Surveys, vol. 42, no. 4, pp. 1–53, 2010.
[6] L. Sweeney, "K-anonymity: A model for protecting privacy," Int. J. Uncertainty Fuzziness, vol. 10, no. 5, pp. 557–570, 2002.
[7] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "L-diversity: Privacy beyond k-anonymity," ACM Trans. Knowl. Discovery Data, vol. 1, no. 1, 2007.
[8] N. Li, T. Li, and S. Venkatasubramanian, "Closeness: A new privacy measure for data publishing," IEEE Trans. Knowl. Data Eng., vol. 22, no. 7, pp. 943–956, Jul. 2010.
[9] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W. C. Fu, "Utility-based anonymization using local recoding," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 785–790.
[10] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Workload-aware anonymization techniques for large-scale datasets," ACM Trans. Database Syst., vol. 33, no. 3, pp. 1–47, 2008.
[11] G. Aggarwal, R. Panigrahy, T. Feder, D. Thomas, K. Kenthapadi, S. Khuller, and A. Zhu, "Achieving anonymity via clustering," ACM Trans. Algorithms, vol. 6, no. 3, 2010.
[12] T. Wang, S. Meng, B. Bamba, L. Liu, and C. Pu, "A general proximity privacy principle," in Proc. IEEE 25th Int. Conf. Data Eng., 2009, pp. 1279–1282.
[13] T. Iwuchukwu and J. F. Naughton, "K-anonymization as spatial indexing: Toward scalable and incremental anonymization," in Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 746–757.
[14] X. Zhang, L. T. Yang, C. Liu, and J. Chen, "A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud," IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 2, pp. 363–373, Feb. 2014.
[15] X. Zhang, C. Liu, S. Nepal, C. Yang, W. Dou, and J. Chen, "A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud," J. Comput. Syst. Sci., vol. 80, no. 5, pp. 1008–1020, 2014.
[16] Y. Yang, Z. Zhang, G. Miklau, M. Winslett, and X. Xiao, "Differential privacy in data publication and analysis," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2012, pp. 601–606.
[17] J. Lee and C. Clifton, "Differential identifiability," in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 1041–1049.
[18] J. Li, R. C.-W. Wong, A. W.-C. Fu, and J. Pei, "Anonymization by local recoding in data with attribute hierarchical taxonomies," IEEE Trans. Knowl. Data Eng., vol. 20, no. 9, pp. 1181–1194, Sep. 2008.
[19] J.-W. Byun, A. Kamra, E. Bertino, and N. Li, "Efficient k-anonymization using clustering techniques," in Proc. 12th Int. Conf. Database Syst. Adv. Appl., 2007, pp. 188–200.
[20] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu, "Aggregate query answering on anonymized tables," in Proc. IEEE 23rd Int. Conf. Data Eng., 2007, pp. 116–125.
[21] J. Li, Y. Tao, and X. Xiao, "Preservation of proximity privacy in publishing numerical sensitive data," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 473–486.
[22] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.
[23] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang, "(a, k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 754–759.
[24] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Mondrian multidimensional k-anonymity," in Proc. 22nd Int. Conf. Data Eng., 2006, p. 25.
[25] B. C. M. Fung, K. Wang, and P. S. Yu, "Anonymizing classification data for privacy preservation," IEEE Trans. Knowl. Data Eng., vol. 19, no. 5, pp. 711–725, May 2007.
[26] N. Mohammed, B. Fung, P. C. K. Hung, and C. K. Lee, "Centralized and distributed anonymization for high-dimensional healthcare data," ACM Trans. Knowl. Discovery Data, vol. 4, no. 4, 2010.
[27] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 229–240.
[28] J. Dean and S. Ghemawat, "MapReduce: A flexible data processing tool," Commun. ACM, vol. 53, no. 1, pp. 72–77, 2010.
[29] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel data processing with MapReduce: A survey," ACM SIGMOD Record, vol. 40, no. 4, pp. 11–20, 2012.
[30] T. Wang and L. Liu, "XColor: Protecting general proximity privacy," in Proc. IEEE 26th Int. Conf. Data Eng., 2010, pp. 960–963.
[31] R. L. Ferreira Cordeiro, C. Traina, A. J. Machado Traina, J. Lopez, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 690–698.
[32] S. Arora, P. Raghavan, and S. Rao, "Approximation schemes for Euclidean k-medians and related problems," in Proc. 30th Annu. ACM Symp. Theory Comput., 1998, pp. 106–113.
[33] Y. Tao, W. Lin, and X. Xiao, "Minimal MapReduce algorithms," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2013, pp. 529–540.
[34] I. W. Burr, "Cumulative frequency functions," Ann. Math. Statist., vol. 13, no. 2, pp. 215–232, 1942.
[35] UCI Machine Learning Repository, Census-Income (KDD) data set [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Census-Income (KDD)
Xuyun Zhang received the bachelor's and master's degrees in computer science from Nanjing University, China. He is currently working toward the PhD degree at the Faculty of Engineering & IT, University of Technology Sydney, Australia. His research interests include cloud computing, privacy and security, big data, scalable machine learning and data mining, MapReduce and OpenStack. He has published several papers in refereed international journals including IEEE Transactions on Parallel and Distributed Systems (TPDS). He received the Chinese Government Award for Outstanding Self-Financed Students Abroad in 2012.

Wanchun Dou received the PhD degree in mechanical and electronic engineering from the Nanjing University of Science and Technology, China, in 2001. From Apr. 2001 to Dec. 2002, he did postdoctoral research in the Department of Computer Science and Technology, Nanjing University, China. He is currently a full professor at the State Key Laboratory for Novel Software Technology, Nanjing University, China. From Apr. 2005 to Jun. 2005 and from Nov. 2008 to Feb. 2009, he visited the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, as a visiting scholar, respectively. Up to now, he has chaired three NSFC projects and published more than 60 research papers in international journals and conferences. His research interests include workflow, cloud computing, and service computing.

Jian Pei is a professor of computing science at Simon Fraser University and Canada Research Chair in Big Data Science. He is widely regarded as one of the world's top researchers in the area of data mining, and his work has been embraced by industry and government. His research interests include developing effective and efficient ways to analyze, and capitalize on, the vast stores of data housed in applications such as social networks, network security informatics, healthcare informatics, business intelligence, and web search. A prolific and widely cited author, he has received many prestigious awards.

Surya Nepal received the BE degree from the National Institute of Technology, Surat, India, the ME degree from the Asian Institute of Technology, Bangkok, Thailand, and the PhD degree from RMIT University, Australia. He is a principal research scientist at CSIRO Digital Productivity & Services Flagship. His main research interests include the development and implementation of technologies in the area of distributed systems and social networks, with a specific focus on security, privacy, and trust. At CSIRO, he undertook research in the areas of multimedia databases, web services and service-oriented architectures, social networks, security, privacy and trust in collaborative environments, cloud systems, and big data. He has more than 150 publications to his credit. Many of his works are published in top international journals and conferences such as VLDB, ICDE, ICWS, SCC, CoopIS, ICSOC, the International Journal of Web Services Research, IEEE Transactions on Services Computing, ACM Computing Surveys, and ACM Transactions on Internet Technology. He leads a distributed systems team at CSIRO comprising research staff and PhD students. He is a co-inventor of two patents, including the Trust Extension Device (TED), which provides a portable and mobile trusted computing environment on an untrusted host. He has edited two books: "Security, Privacy and Trust in Cloud Systems" (Springer) and "Managing Multimedia Semantics". He has edited many international journal special issues, such as "Trusting Social Web" for the Springer WWW Journal and "Security, Privacy and Trust in Cloud Systems" for the International Journal of Cloud Computing. He serves as program chair and program committee member for many international conferences and workshops. He has also delivered talks, tutorials and keynotes on trusted systems at national and international venues.

Chi Yang received the BS degree from Shandong University at Weihai, China, and the MS (by research) degree in computer science from the Swinburne University of Technology, Melbourne, Australia, in 2007. He is currently working toward the PhD degree at the University of Technology Sydney, Australia. His major research interests include distributed computing, XML data streams, scientific workflow, distributed systems, green computing, big data processing, and cloud computing.

Chang Liu is currently working toward the PhD degree at the University of Technology Sydney, Australia. His research interests include cloud computing, resource management, cryptography, and data security.

Jinjun Chen received the bachelor's degree in applied mathematics from Xidian University, China, in 1996, the master's degree in engineering in 1999, and the PhD degree in computer science and software engineering from Swinburne in 2007. He is currently an associate professor at the Faculty of Engineering and IT, University of Technology Sydney, Australia. His research interests include cloud computing, big data, software services, privacy, and security. He has published more than 130 papers in high-quality journals and conferences, including IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Computers, IEEE Transactions on Services Computing, IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology, and ICSE. He received the IEEE Computer Society Outstanding Leadership Awards (2008-2009, 2010-2011), the Vice Chancellor's Research Award, and many other awards.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.