Abstract—Cloud computing provides a promising, scalable IT infrastructure to support the processing of a variety of big data applications in sectors such as healthcare and business. Data sets in such applications, like electronic health records, often contain privacy-sensitive information, which raises privacy concerns if the information is released or shared with third parties in the cloud. A practical and widely adopted technique for data privacy preservation is to anonymize data via generalization so as to satisfy a given privacy model. However, most existing privacy-preserving approaches are tailored to small-scale data sets and often fall short when encountering big data, due to their insufficiency or poor scalability. In this paper, we investigate the local-recoding problem for big data anonymization against proximity privacy breaches and attempt to identify a scalable solution to this problem. Specifically, we present a proximity privacy model that allows semantic proximity of sensitive values and multiple sensitive attributes, and model the problem of local recoding as a proximity-aware clustering problem. A scalable two-phase clustering approach, consisting of a t-ancestors clustering algorithm (similar to k-means) and a proximity-aware agglomerative clustering algorithm, is proposed to address the above problem. We design the algorithms with MapReduce to gain high scalability by performing data-parallel computation in the cloud. Extensive experiments on real-life data sets demonstrate that our approach significantly improves the capability of defending against proximity privacy breaches, as well as the scalability and time efficiency of local-recoding anonymization, over existing approaches.
Index Terms—Big data, cloud computing, MapReduce, data anonymization, proximity privacy
1 INTRODUCTION
However, besides the drawbacks pointed out in [16], differential privacy also loses correctness guarantees because it produces noisy results to hide the impact of any single individual [17]. Hence, syntactic anonymity privacy models still have practical impact in general data publishing and can be applied in numerous real-world applications.

The local-recoding scheme, also known as cell generalization, groups data sets into a set of cells at the data-record level and anonymizes each cell individually. Existing approaches for local recoding [9], [11], [18], [19] can only withstand record linkage attacks by employing the k-anonymity privacy model [6], thereby falling short of defending against proximity privacy breaches [12], [20], [21]. In fact, combining local recoding and proximity privacy models is interesting and necessary when one wants an anonymous data set with both low data distortion and the ability to combat proximity privacy attacks. However, this combination is challenging because most proximity privacy models have the property of non-monotonicity [21], and local recoding cannot be accomplished in a top-down way, which will be detailed in Section 3.3 (Motivation and Problem Analysis).

In this paper, we model the problem of big data local recoding against proximity privacy breaches as a proximity-aware clustering (PAC) problem, and propose a scalable two-phase clustering approach accordingly. Specifically, we put forth a proximity privacy model based on [12] by reasonably allowing multiple sensitive attributes and semantic proximity of categorical sensitive values. As the satisfiability problem of the proximity privacy model is proved to be NP-hard, it is interesting and practical to model the problem as a clustering problem of minimizing both data distortion and proximity among sensitive values in a cluster, rather than to find a solution satisfying the privacy model rigorously. Technically, a proximity-aware distance is introduced over both quasi-identifier and sensitive attributes to facilitate clustering algorithms. To address the scalability problem, we propose a two-phase clustering approach consisting of the t-ancestors clustering (similar to k-means [22]) and proximity-aware agglomerative clustering algorithms. The first phase splits an original data set into t partitions that contain similar data records in terms of quasi-identifiers. In the second phase, data partitions are locally recoded by the proximity-aware agglomerative clustering algorithm in parallel. We design the algorithms with MapReduce in order to gain high scalability by performing data-parallel computation over multiple computing nodes in the cloud. We evaluate our approach by conducting extensive experiments on real-world data sets. Experimental results demonstrate that our approach can substantially preserve proximity privacy, and can significantly improve the scalability and time efficiency of local-recoding anonymization over existing approaches.

The major contributions of our research are fourfold. First, an extended proximity privacy model is put forth by allowing multiple sensitive attributes and semantic proximity of categorical sensitive values. Second, we model the problem of big data local recoding against proximity privacy breaches as a proximity-aware clustering problem. Third, a scalable and efficient two-phase clustering approach is proposed to parallelize local recoding on multiple data partitions. Fourth, several innovative MapReduce jobs are designed and coordinated to concretely conduct data-parallel computation for scalability.

The remainder of this paper is organized as follows. The next section reviews related work. In Section 3, we briefly introduce some preliminaries and analyze the problems in detail. Section 4 models the proximity-aware clustering problem formally, and Section 5 elaborates the two-phase clustering approach and the MapReduce jobs. We empirically evaluate our approach in Section 6. Finally, we conclude this paper and discuss future work in Section 7.

2 RELATED WORK
Recently, data privacy preservation has been extensively investigated [5]. We briefly review existing approaches for local-recoding anonymization and privacy models that defend against attribute linkage attacks. In addition, research on scalability issues in existing anonymization approaches is briefly surveyed.

Recently, clustering techniques have been leveraged to achieve local-recoding anonymization for privacy preservation. Xu et al. [9] studied the anonymization of the local-recoding scheme from the utility perspective and put forth a bottom-up greedy approach and its top-down counterpart. The former leverages the agglomerative clustering technique while the latter employs the divisive hierarchical clustering technique, both of which pose constraints on the size of a cluster. Byun et al. [19] formally modeled local-recoding anonymization as the k-member clustering problem, which requires that the cluster size not be less than k in order to achieve k-anonymity, and proposed a simple greedy algorithm to address the problem. Li et al. [18] investigated the inconsistency issue of local-recoding anonymization in data with hierarchical attributes and proposed the KACA (K-Anonymization by Clustering in Attribute hierarchies) algorithms. Aggarwal et al. [11] proposed a set of constant-factor approximation algorithms for two clustering-based anonymization problems, i.e., r-GATHER and r-CELLULAR CLUSTERING, where cluster centers are published without generalization or suppression. However, existing clustering approaches for local-recoding anonymization mainly concentrate on record linkage attacks, specifically under the k-anonymity privacy model, without paying any attention to privacy breaches incurred by sensitive attribute linkage. In contrast, our research takes both privacy concerns into account. Wong et al. [23] proposed a top-down partitioning approach based on the Mondrian algorithm in [24] to specialize data sets to achieve (α, k)-anonymity, which is able to defend against certain attribute linkage attacks. However, the data utility of the resultant anonymous data is heavily influenced by the choice of splitting attributes and values, while local recoding does not involve such factors. Our approach leverages clustering to accomplish local recoding because it is a natural and effective way to anonymize data sets at a cell level.

To preserve privacy against attribute linkage attacks, a variety of privacy models have been proposed for both categorical and numerical sensitive attributes. l-diversity and its variants [7] require each QI-group to include at least l well-represented sensitive values. Note that l-diverse data sets
Authorized licensed use limited to: GITAM University. Downloaded on November 08,2022 at 09:51:50 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: PROXIMITY-AWARE LOCAL-RECODING ANONYMIZATION WITH MAPREDUCE FOR SCALABLE BIG DATA PRIVACY... 2295
2296 IEEE TRANSACTIONS ON COMPUTERS, VOL. 64, NO. 8, AUGUST 2015
overlapping multidimensional regions which cover V_1 × ⋯ × V_m, where region overlapping means that multiple regions probably contain records with identical quasi-identifiers. Specifically, a function φ_l : R_l → QID is defined for a region R_l, where l is an arbitrary number indexing the region. In fact, a region corresponds to a QI-group. Therefore, the core sub-problems of local recoding are how to construct multidimensional regions and how to choose the functions {φ_l}. In our approach, we leverage the clustering technique to build multidimensional regions while keeping the proximity privacy of sensitive attributes in mind. For each region, the categorical quasi-identifier attribute values are generalized to their lowest common domain value in the taxonomy tree, and the numerical ones are replaced by an interval that covers them minimally. Unlike local recoding, the sub-tree and full-domain schemes have one function over each attribute, i.e., f_i : V_i → DOM_i, 1 ≤ i ≤ m, and the global multidimensional scheme has a single function over all the attributes, i.e., f : V_1 × ⋯ × V_m → QID.

Preserving privacy is one side of anonymization. The other is retaining aggregate information for data mining or analytics over the anonymous data. Several data metrics have been proposed to capture this [5], e.g., minimal distortion (MD) [6], ILoss [27] and the discernibility metric (DM) [7]. With a data metric, the problem of optimal local recoding is to find the local-recoding solution that makes the metric optimal. However, theoretical analyses demonstrate that the problem under most non-trivial data utility metrics is NP-hard [5]. As a result, most existing approaches [9], [11], [18], [19] instead try to find a minimal local recoding to achieve practical efficiency and a near-optimal solution, where minimal local recoding means that no more partitioning operations are allowed when building multidimensional regions under a certain privacy model. Our proximity-aware two-phase clustering approach herein also follows this line.

So far, only the k-anonymity privacy model has been employed to preserve privacy against record linkage attacks in existing clustering-based anonymization approaches. The k-anonymity privacy model requires that for any qid ∈ QID, the size of QIG(qid) must be zero or at least k, so that a quasi-identifier will not be distinguished from at least k − 1 other ones in the same QI-group [6]. Usually, it is assumed that an adversary already has the knowledge that an individual is definitely in a data set, which occurs in many real-life data sets like tax data sets. After local recoding, the upper-bound size of a QI-group is 2k − 1 under the k-anonymity privacy model. If there were a QI-group of size at least 2k, it should be split into two groups of size at least k to maximize data utility.

3.2 MapReduce Basics
MapReduce [28], a parallel and distributed large-scale data processing paradigm, has been extensively researched and widely adopted for big data applications recently [29]. Integrated with infrastructure resources provisioned by cloud systems, MapReduce becomes much more powerful, elastic and cost-effective due to the salient characteristics of cloud computing. A typical example is the Amazon Elastic MapReduce service.

Basically, a MapReduce job consists of two primitive functions, Map and Reduce, defined over a data structure named key-value pair (key, value). Specifically, the Map function can be formalized as Map : (k1, v1) → (k2, v2), i.e., Map takes a pair (k1, v1) as input and outputs another intermediate key-value pair (k2, v2). These intermediate pairs are consumed by the Reduce function as input. Formally, the Reduce function can be represented as Reduce : (k2, list(v2)) → (k3, v3), i.e., Reduce takes an intermediate key k2 and all its corresponding values list(v2) as input and outputs another pair (k3, v3). Usually, the list of (k3, v3) pairs is the result which MapReduce users attempt to obtain. Both the Map and Reduce functions are specified by users according to their specific applications. An instance running a Map function is called a Mapper, and one running a Reduce function is called a Reducer, respectively.

3.3 Motivation and Problem Analysis
In this section, we analyze the problems of existing approaches for local-recoding anonymization from the perspectives of proximity privacy and scalability. Further, challenges of designing scalable MapReduce algorithms for proximity-aware local recoding are also identified.

Little attention has been paid to the local-recoding anonymization scheme under proximity-aware privacy models. As mentioned in Section 2, most existing local-recoding approaches concentrate on combating record linkage attacks by employing the k-anonymity privacy model. As demonstrated in existing work [7], [8], [12], [21], however, k-anonymity fails to combat attribute linkage attacks like homogeneity attacks, skewness attacks and proximity attacks. For instance, if the sensitive values of the records in a QI-group of size k are identical or quite similar, adversaries can still link an individual with certain sensitive values with high confidence although the QI-group satisfies k-anonymity, resulting in privacy violation. Accordingly, a plethora of privacy models have been proposed to thwart such attacks, as shown in Section 2. But these models have rarely been exploited in the local-recoding scheme except for the work in [23]. This phenomenon mainly results from two reasons, analyzed as follows.

The first is that, unlike global-recoding schemes, k-anonymity based approaches for record linkage attacks cannot simply be extended for attribute linkage attacks. Since global-recoding schemes partition data sets according to domains, they can be fulfilled effectively in a top-down fashion. This property of global-recoding schemes ensures that k-anonymity based approaches can be extended to combat attribute linkage attacks through checking extra privacy satisfiability during each round of the top-down anonymization process [5]. However, the local-recoding scheme fails to share the same merits because it partitions data sets in a clustering fashion where the top-down anonymization property is inapplicable. Although Wong et al. [23] proposed a top-down approach for local recoding, the approach can only achieve partial local recoding because global recoding is exploited to partition data sets as the first step and local recoding is only conducted inside each partition. Consequently, their approach will incur more data distortion compared with the full potential of the local-recoding scheme.

The second reason is that most proximity-aware privacy models have the property of non-monotonicity [21], which makes such models hard to achieve in a top-down way,
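As a concrete illustration of the Map and Reduce signatures above, the dataflow can be simulated in a few lines of Python. This is our own sketch for illustration only, not the paper's implementation; in a real MapReduce deployment the shuffle phase and the Mappers/Reducers run distributed across many workers:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-memory simulation of the MapReduce dataflow."""
    # Map phase: each input pair (k1, v1) yields intermediate (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle/sort phase: bring together all values sharing the same key k2.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: (k2, list(v2)) -> (k3, v3) pairs.
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v for _, v in group]))
    return output

# Example: count records per quasi-identifier value
# (record id plays k1, the qid plays v1).
records = [(1, "qidA"), (2, "qidB"), (3, "qidA")]
counts = run_mapreduce(
    records,
    map_fn=lambda rid, qid: [(qid, 1)],
    reduce_fn=lambda qid, ones: [(qid, sum(ones))],
)
```

Here Map mirrors the signature Map : (k1, v1) → (k2, v2) by emitting one (qid, 1) pair per record, and Reduce mirrors Reduce : (k2, list(v2)) → (k3, v3) by summing the grouped counts per key.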
To capture the dissimilarity between the sensitive values of two records, a distance metric should be defined first. Let sv, sv′ ∈ SV_i denote two sensitive values from two records, where the meaning of SV_i is already described in Table 1. To establish the distance metric over all sensitive attributes, distance is normalized in the subsequent definitions. For a numerical attribute, the normalized distance between sv and sv′, d_N(sv, sv′), is defined as:

d_N(sv, sv′) = |sv − sv′| / R,   (1)

where R = (sv_max − sv_min) denotes the range of the attribute, and sv_max and sv_min are the maximum and minimum values of the attribute, respectively. For a categorical attribute, the normalized distance between sv and sv′, d_C(sv, sv′), is defined as:

d_C(sv, sv′) = L(sv, sv′) / (2 · H(TT_i)),   (2)

where L(sv, sv′) is the shortest path length between sv and sv′, and H(TT_i) is the height of the taxonomy tree. Note that 2 · H(TT_i) is the maximum path length between any two nodes in the tree. Our definition is more flexible than that in [19], as it is unnecessary to require that all the leaf nodes of the taxonomy tree have the same depth, which is quite common in real-life applications. L(sv, sv′) can be computed with ease by finding the Lowest Common Ancestor (LCA) of sv and sv′.

Let r^S = ⟨sv_1, . . . , sv_{m_S}⟩ denote the vector of sensitive values of a record r. For convenience, r^S is named the sensitive vector. With the definitions for single attributes, the distance between sensitive vectors r^S_x = ⟨sv_1, . . . , sv_{m_S}⟩ and r^S_y = ⟨sv′_1, . . . , sv′_{m_S}⟩, d(r^S_x, r^S_y), is defined as:

d(r^S_x, r^S_y) = Σ_{i=1}^{l_S} w^S_i · d_N + Σ_{i=l_S+1}^{m_S} w^S_i · d_C,   (3)

where d_N and d_C are short for d_N(sv_i, sv′_i) and d_C(sv_i, sv′_i), respectively, and w^S_i, 1 ≤ i ≤ m_S, are weights for the sensitive attributes. The weights, satisfying 0 ≤ w^S_i ≤ 1 and Σ_{i=1}^{m_S} w^S_i = 1, indicate the importance of each sensitive attribute and are usually specified by domain experts for flexibility.

Let SV_qid = {r^S_1, . . . , r^S_n} be the set of sensitive vectors in a QI-group QIG(qid) of size n. Given a parameter ε⁺ ≥ 0, the ε⁺-neighborhood of a sensitive vector r^S, denoted as Φ_qid(r^S, ε⁺), is defined as: Φ_qid(r^S, ε⁺) = {r^S_x | r^S_x ∈ SV_qid and d(r^S, r^S_x) ≤ ε⁺}. A sensitive vector is regarded as similar to r^S if it lies in Φ_qid(r^S, ε⁺). Thus, the parameter ε⁺ controls the similarity between two sensitive vectors.

To preserve privacy against proximity attacks, it is expected that the sensitive vectors in a QI-group are as dissimilar to each other as possible. Accordingly, we define the (ε⁺, δ)_k-dissimilarity privacy model based on [12]. The model requires that for any QI-group QIG(qid) in an anonymous data set, its size is at least k, and any sensitive vector in SV_qid must be dissimilar to at least δ · (|QIG(qid)| − 1) other sensitive vectors in QIG(qid), where 0 ≤ δ ≤ 1. Parameter k controls the size of each QI-group to prevent record linkage attacks. Parameter δ specifies constraints on the number of ε⁺-neighbors that each sensitive vector can own, to combat proximity attacks. As (ε⁺, δ)_k-dissimilarity is extended from (ε, δ)_k-dissimilarity, it shares the effectiveness and most characteristics of the latter as described in [12]. But the monotonicity of (ε, δ)_k-dissimilarity has not been discussed in [12]. We formally establish the non-monotonicity property of the two models by the following theorem.

Theorem 1. Neither (ε, δ)_k-dissimilarity nor (ε⁺, δ)_k-dissimilarity is monotonic.

Proof. It suffices to prove this theorem by finding a counter-example for (ε, δ)_k-dissimilarity, where two QI-groups QIG_x and QIG_y each satisfy (ε, δ)_k-dissimilarity, but their union QIG_x ∪ QIG_y does not. A counter-example is as follows. Assume a QI-group QIG_x of size 3 has one-dimensional numerical sensitive values {1.0, 3.0, 5.0}, and another QI-group QIG_y of size 3 has sensitive values {2.0, 4.0, 6.0}. Let ε, δ and k be 1.5, 1 and 3, respectively. Both QI-groups satisfy (1.5, 1)_3-dissimilarity. After merging the two QI-groups, the set of sensitive values is {1.0, 2.0, 3.0, 4.0, 5.0, 6.0}. For the value 3.0, only the values {1.0, 5.0, 6.0} are dissimilar in the sense of ε = 1.5. But the model requires that 1 · (6 − 1) = 5 values be dissimilar to 3.0. So, the theorem is proved by this contradiction. □

4.2 Proximity-Aware Clustering Problem of Local-Recoding Anonymization
Due to the non-monotonicity property, the satisfiability problem of (ε⁺, δ)_k-dissimilarity is hard. The satisfiability problem asks whether there exists a partition that makes the anonymous data set satisfy (ε⁺, δ)_k-dissimilarity. Based on the results in [12], we have the following theorem.

Theorem 2. In general, the (ε⁺, δ)_k-dissimilarity satisfiability problem is NP-hard.

Proof. Since the (ε, δ)_k-dissimilarity satisfiability problem is proved to be NP-hard in [12], the conclusion holds as well for the (ε⁺, δ)_k-dissimilarity satisfiability problem. The reason is that (ε, δ)_k-dissimilarity can be regarded as (ε⁺, δ)_k-dissimilarity in a specific setting where categorical proximity is ignored, and a single sensitive attribute rather than multiple ones is considered. □

In terms of the complexity result in Theorem 2, it is interesting and practical to find an efficient and near-optimal solution that makes the proximity among records in a QI-group as low as possible. Due to the non-monotonicity of (ε⁺, δ)_k-dissimilarity, top-down partitioning at domain levels fails to work for this privacy model. But clustering records with low proximity of sensitive values is still a promising way to make the proximity as low as possible. As clustering is also a natural and effective way to realize the local-recoding scheme, we propose a novel clustering approach for local recoding by integrating proximity among sensitive values. Specifically, our approach attempts to minimize both data distortion and proximity among sensitive values in a QI-group when conducting clustering, unlike all the existing k-member clustering approaches (kMCs) that
consider the former only. The clustering problem we attempt to address is referred to as the Proximity-Aware Clustering problem, which is essentially a two-objective clustering problem.

As minimizing proximity is one objective of the PAC problem, we formally define a proximity index to capture the proximity between two records. According to (1), the proximity index between two numerical sensitive values sv and sv′ can be defined as d̄_N(sv, sv′) = 1 − d_N(sv, sv′). Likewise, the index between two categorical values can be defined as d̄_C(sv, sv′) = 1 − d_C(sv, sv′).

We exploit information loss to measure the amount of data distortion. Recall that the quasi-identifiers of all records in a final cluster will be generalized to a more general one. Specifically, all quasi-identifier values of a categorical attribute are generalized to their lowest common ancestor in the taxonomy tree, and numerical values are generalized to a minimum interval covering all the values. Let R^QI_i be the range of attribute att^QI_i. The information loss of a record r ∈ QIG(qid), denoted as IL(r), is defined by:

IL(r) = Σ_{i=1}^{l_QI} w^QI_i · (v^max_i − v^min_i) / R^QI_i + Σ_{i=l_QI+1}^{m_QI} w^QI_i · L(v, v_LCA) / H(TT_i),   (6)

where v^max_i and v^min_i are the maximum and minimum values of the attribute within QIG(qid), respectively, and L(v, v_LCA) is the length of the path from v to the lowest common ancestor v_LCA. The weights w^QI_i, 1 ≤ i ≤ m_QI, are exploited to indicate the importance of each attribute, and satisfy 0 ≤ w^QI_i ≤ 1 and Σ_{i=1}^{m_QI} w^QI_i = 1. These weights bring flexibility to the definition of information loss for different applications. Usually, they are specified by domain-specific experts. Further, the information loss of QIG(qid) is defined by:

IL(QIG(qid)) = Σ_{r ∈ QIG(qid)} IL(r).   (7)

Similar to the k-member clustering problem defined in [19], the PAC problem is formally defined as follows.

Definition 1 (Proximity-Aware Clustering Problem). Given a data set D, the Proximity-Aware Clustering problem is to find a set of clusters, denoted as C = {C_1, . . . , C_m}, such that each cluster contains at least k (k > 0) data records, and the proximity of sensitive values in each cluster and the overall data distortion are minimized. Formally, the PAC problem is formulated as:

Minimize ( max_{C ∈ C} Prox(C),  Σ_{C ∈ C} IL(C) ),  s.t.:
1) ∪_{i=1,...,m} C_i = D, and C_i ∩ C_j = ∅, 1 ≤ i ≠ j ≤ m;
2) ∀C ∈ C, |C| ≥ k.

Theorem 3. The PAC decision problem is NP-hard.

Proof. It suffices to prove that this problem for a specific setting is NP-hard. We observe that the k-member clustering decision problem articulated in [19] is a special case of the PAC decision problem obtained by setting c1 = +∞, i.e., no constraints are posed on sensitive attributes. Byun et al. [19] proved that the k-member clustering decision problem is NP-hard. It therefore follows that the PAC decision problem is NP-hard as well. □

In terms of Theorem 3, the PAC clustering problem is also NP-hard. To address this problem, we transform it into a single-objective clustering problem. To this aim, a distance measure between two records should be defined that considers the proximity between sensitive values. Unlike the distance metrics employed in existing k-member clustering approaches, ours is required to consider both the similarity of quasi-identifiers and the proximity of sensitive values. Intuitively, we desire that records tend to go into the same cluster if their quasi-identifiers are similar and the proximity of sensitive values among them is low, i.e., the sensitive values are dissimilar. Thus, it is reasonable to integrate proximity into the distance measure, and hence the distance measure herein is defined over both quasi-identifier and sensitive attributes.

As to quasi-identifiers, the distance metric can be utilized directly to capture the similarity among them. Similar to (1) and (2), the normalized distance between two numerical quasi-identifier values v and v′ is defined as d_N(v, v′) = |v − v′| / R, and the distance between two categorical quasi-identifier
values is defined as d_C(v, v′) = L(v, v′) / (2 · H(TT_i)). Then, the distance between two quasi-identifiers r^QI_x and r^QI_y is defined by:

d(r^QI_x, r^QI_y) = Σ_{i=1}^{l_QI} w^QI_i · d_N + Σ_{i=l_QI+1}^{m_QI} w^QI_i · d_C,   (8)

where d_N and d_C are short for d_N(v, v′) and d_C(v, v′), respectively. The weights w^QI_i, 1 ≤ i ≤ m_QI, are the same as in (6).

Combining the distance function (8) and the proximity index function (4), a proximity-aware distance measure between two data records r_x = ⟨r^QI_x, r^S_x⟩ and r_y = ⟨r^QI_y, r^S_y⟩, denoted as dist(r_x, r_y), is defined by:

dist(r_x, r_y) = ω_QI · d(r^QI_x, r^QI_y) + ω_S · prox(r^S_x, r^S_y),   (9)

where the two weights, 0 ≤ ω_QI, ω_S ≤ 1 and ω_QI + ω_S = 1, control how much the quasi-identifier attributes or the sensitive attributes contribute to the distance measure. Note that if ω_QI = 1 and ω_S = 0, the distance measure is the same as that in k-member clustering approaches. However, the distance measure dist(·, ·) is not a distance metric if ω_S ≠ 0, because it fails to satisfy the coincidence axiom. The coincidence axiom, one of the conditions that a distance metric must satisfy, requires dist(r_x, r_y) = 0 if and only if r_x = r_y. However, dist(·, ·) can still capture the degree of similarity between records and guide them into proper clusters when we conduct clustering. In fact, dist(r_x, r_y) = 0 holds if and only if the quasi-identifiers are the same and the proximity between sensitive values is the lowest, while dist(r_x, r_y) attains the maximum value if and only if the distance between quasi-identifiers reaches the maximum and the sensitive values are the same. So, the distance measure dist(r_x, r_y) possesses the desired property of guiding records with similar quasi-identifiers but dissimilar sensitive values into the same clusters. Armed with the distance measure, we model the single-objective proximity-aware clustering (SPAC) problem as follows.

Definition 3 (Single-Objective Proximity-Aware Clustering Problem). Given a data set D, the single-objective proximity-aware clustering problem is to find a set of clusters of size at least k, denoted as C = {C_1, . . . , C_m}, such that the sum of all intra-cluster distances is minimized. Formally, the SPAC problem is formulated as:

Minimize Σ_{l=1,...,m} |C_l| · max_{r_x, r_y ∈ C_l} dist(r_x, r_y),  s.t.:
1) ∪_{i=1,...,m} C_i = D, and C_i ∩ C_j = ∅, 1 ≤ i ≠ j ≤ m;
2) ∀C ∈ C, |C| ≥ k.

In fact, max_{r_x, r_y ∈ C_l} dist(r_x, r_y) in the objective function is the diameter of the cluster C_l. Intuitively, a smaller diameter tends to incur lower data distortion, because it implies that the least common ancestors for categorical attributes lie at lower levels and the minimum covering intervals for numerical attributes are shorter. It also tends to produce lower proximity of sensitive values. The factor |C_l| indicates that smaller clusters are preferred, because less data distortion will be incurred with smaller clusters.

Theorem 4. The SPAC problem is NP-hard.

Proof. It suffices to prove that this problem for a specific setting is NP-hard. The conventional k-member clustering problem [19] is a special case of the SPAC problem obtained by setting ω_QI = 1 and ω_S = 0, i.e., the distance is determined by quasi-identifier attributes only, the same as in the k-member clustering problem. The k-member clustering problem is NP-hard as its decision problem is NP-hard [19]. It thus follows that the SPAC problem is NP-hard as well. □

Given the hardness of the SPAC problem, it is practical and interesting to time-efficiently find a near-optimal solution in big data scenarios, rather than to seek the optimal one. The next section will show how to achieve this.

5 TWO-PHASE PROXIMITY-AWARE CLUSTERING USING MAPREDUCE
Except where otherwise noted, the proximity-aware clustering problem refers to the single-objective proximity-aware clustering problem hereafter. To address the SPAC problem in big data scenarios, we propose a two-phase clustering approach where the agglomerative clustering method and the point-assignment clustering method are employed in the two phases, respectively. We outline a sketch of the two-phase clustering approach in Section 5.1. Then, the algorithmic details of the two phases are elaborated in Sections 5.2 and 5.3, respectively. We illustrate the execution process of our approach and analyze its performance in Section 5.4.

5.1 Sketch of Two-Phase Clustering
In order to choose proper clustering methods for the SPAC problem, some observations about clustering problems for data anonymization should be taken into account. First, the parameter k in the k-anonymity privacy model is relatively small compared with the scale of a data set in big data scenarios. Since the upper bound on the size of a cluster for local-recoding anonymization is 2k − 1, the size of clusters is also relatively small. Accordingly, the number of clusters will be quite large. Second, under the condition that the size of any cluster is not less than k, the smaller a cluster is, the more it is preferred. The reason is that smaller clusters tend to incur less data distortion. Ideally, the size of all clusters is exactly k. Third, the intrinsic clustering structure in a data set is helpful for local-recoding anonymization, but building such a structure is not the final purpose.

Given the observations above, the agglomerative clustering method is suitable for local-recoding anonymization, as the stopping criterion can be set as whether the size of a cluster reaches k. Moreover, the agglomerative clustering method can achieve minimum data distortion in the sense of the defined distance measure. Most existing approaches for k-anonymity mentioned in Section 2 employ greedy agglomerative clustering approaches. But they construct clusters in a greedy manner rather than combining the two clusters that have the minimum distance in each round, which results in more data distortion. However, the optimal agglomerative clustering method suffers from a scalability problem when handling large-scale data sets. Its time complexity is O(n² log n) when utilizing a priority queue. Worse still, the agglomerative
Authorized licensed use limited to: GITAM University. Downloaded on November 08,2022 at 09:51:50 UTC from IEEE Xplore. Restrictions apply.
ZHANG ET AL.: PROXIMITY-AWARE LOCAL-RECODING ANONYMIZATION WITH MAPREDUCE FOR SCALABLE BIG DATA PRIVACY... 2301
method is serial, which makes it difficult to adapt to parallel environments like MapReduce.

From the perspective of scalability, the point-assignment method seems ideal for local-recoding anonymization in MapReduce. The point-assignment process initializes a set of data records to represent the clusters, one for each, and assigns the remaining records to these clusters; the process is repeated until certain conditions are satisfied. Point assignment can be conducted in a scalable and parallel fashion in MapReduce. However, the set of representative records will become quite large according to the observations above: approximately, its size will be 1/k of the original data set. This makes it a challenge to distribute the representative records to the MapReduce workers that conduct point assignment independently. Another problem is that the size of each cluster is uncontrollable in the point-assignment process. Thus, the size of a cluster can exceed the upper bound 2k - 1 or fall below k, especially when a data set is highly skewed, and extra effort is often required to adjust clusters to a proper size.

Given the pros and cons of the two clustering methods for local-recoding anonymization, we propose a two-phase approach that combines both methods based on MapReduce. In the first phase, we leverage the point-assignment clustering method to partition an original data set into t clusters, where t is not necessarily related to k. For convenience, a cluster produced in the first phase is named a b-cluster. In the second phase, the agglomerative clustering method is run on each b-cluster simultaneously as a 'plug-in', which is similar to [31]. In this way, our approach shares the merits of both methods but avoids their drawbacks. Specifically, the two-phase approach can produce quality anonymous data sets with the agglomerative clustering method and gain high scalability with the point-assignment method. In addition, no extra adjustment is required.

Algorithm 1 describes the main steps in the two-phase clustering approach. Similar in spirit to the t-means family [22], [32], we propose the t-ancestors clustering algorithm for the point-assignment method. To avoid confusion with the anonymity parameter k, we employ the term 't-means' rather than 'k-means', which is commonly used in the literature. The details of the t-ancestors algorithm and the agglomerative algorithm will be presented in Sections 5.2 and 5.3, respectively.

Algorithm 1. Sketch of Two-Phase Clustering
Input: Data set D, anonymity parameter k.
Output: Anonymous data set D*.
1: Run the t-ancestors clustering algorithm on D to get a set of b-clusters C^b = {C^b_1, ..., C^b_t};
2: For each b-cluster C^b_i in C^b, 1 <= i <= t: run the agglomerative clustering algorithm on C^b_i to get a set of clusters C_i = {C_{i1}, ..., C_{i m_i}};
3: For each cluster C_j in C, where C = C_1 \cup ... \cup C_t, generalize C_j to C*_j by replacing each attribute value with a general one;
4: Generate D* as the union of all C*_j, j = 1, ..., m, where m = m_1 + ... + m_t.

As t is usually required in advance, we roughly estimate it and demonstrate that the two-phase clustering algorithm is scalable in big data scenarios. Let N be the capacity of a MapReduce task worker, i.e., either a Mapper or a Reducer; the capacity N here means that the worker can accomplish the agglomerative clustering on a data set of size N within an acceptable time. The value of t is estimated as t ≈ |D|/N, so that the expected maximum size of b-clusters, |D|/t, can be kept below the worker capacity N. In fact, the skewness in a data set will affect the maximum size of b-clusters: the higher the degree of skewness, the larger t should be. As k is much smaller than N according to the aforementioned observations, t will be much smaller than |D|/k, which would be the number of representatives if the point-assignment clustering method were exploited directly for anonymization. Hence, the set of t representative records is relatively small and can be distributed to each MapReduce worker efficiently. As such, our approach can handle large-scale data sets in a manner that is linear with respect to the number of MapReduce workers, which can be provisioned with ease in cloud environments due to their scalability.

5.2 t-Ancestors Clustering for Data Partitioning
One core problem in the point-assignment method is how to represent a cluster. Similar to t-medians [32], we propose to leverage the 'ancestor' of the records in a cluster to represent the cluster. More precisely, an ancestor of a cluster refers to a data record whose attribute value for each categorical quasi-identifier is the lowest common ancestor of the original values in the cluster, and whose value for each numerical quasi-identifier is the median of the original values in the cluster. The notion of an ancestor record also attempts to capture the logical centre of a cluster, like t-means/medians, but t-ancestors clustering is more suitable for anonymization due to the categorical attributes in the clustering problem herein.

To facilitate t-ancestors clustering, we take quasi-identifier attributes, but not sensitive ones, into consideration. This will rarely affect the proximity of sensitive values in a final cluster, because the clustering granularity in the first phase is rather coarse. Accordingly, we leverage the distance measure (8) to calculate the distance between a data record and an ancestor. Usually, an ancestor is not a real data record in the data set, but (8) can still be employed to calculate the distance between two vectors of attribute values. Except where otherwise noted, a record r in this section refers to its quasi-identifier part.

Initially, the t ancestors in the first round of point assignment are t records deliberately selected as seeds. In general, the selection of these t records influences the quality of clustering to a certain extent. To obtain a good set of seeds, we pick data records that are as far away from each other as possible. Concretely, we accomplish seed selection via a MapReduce job SeedSelection which outputs a set of seeds S = {r_1, ..., r_t}. The Map and Reduce functions of SeedSelection are described in Algorithm 2. In this job, only one Reducer is utilized for seed selection due to the serial nature of the algorithm. To make it scalable to big data, we sample the original data set by emitting a record to the Reducer with probability N/|D| in the Map function, so that only about N records in total go to the Reducer. In the Reduce function, the first seed is picked at random, and then we repeatedly pick the record whose minimum distance to the existing seeds is the largest, until the number of seeds reaches t.
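The greedy farthest-first rule used in the Reduce function can be sketched in plain Python (a sketch only, not the paper's Java/MapReduce implementation; the `dist` function and the pre-sampled record list are assumed given, and the names are illustrative):

```python
import random

def select_seeds(sample, t, dist, rng=random):
    """Greedy farthest-first seed selection: start from a random record,
    then repeatedly add the record whose minimum distance to the already
    chosen seeds is largest, until t seeds are picked."""
    seeds = [rng.choice(sample)]
    while len(seeds) < t:
        candidate = max(
            (r for r in sample if r not in seeds),
            key=lambda r: min(dist(r, s) for s in seeds),
        )
        seeds.append(candidate)
    return seeds

# Toy 1-D records in three clumps; the seeds end up spread across them.
sample = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
seeds = select_seeds(sample, 3, lambda a, b: abs(a - b), random.Random(0))
```

Whatever the random starting record, the second and third seeds are pushed toward the remaining clumps, which is exactly the spreading behaviour the seed selection aims for.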
Authorized licensed use limited to: GITAM University. Downloaded on November 08,2022 at 09:51:50 UTC from IEEE Xplore. Restrictions apply.
2302 IEEE TRANSACTIONS ON COMPUTERS, VOL. 64, NO. 8, AUGUST 2015
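As a concrete illustration of the ancestor record introduced in Section 5.2 (lowest common ancestor per categorical quasi-identifier, median per numerical one), the following Python sketch builds a cluster representative; the toy taxonomy, attribute layout, and helper names are illustrative assumptions, not the paper's implementation:

```python
from statistics import median

# Hypothetical taxonomy for a categorical attribute: child -> parent,
# with the root mapped to None.
TAXONOMY = {
    "flu": "respiratory", "pneumonia": "respiratory",
    "respiratory": "disease", "gastritis": "digestive",
    "digestive": "disease", "disease": None,
}

def path_to_root(value, parent):
    """Chain of taxonomy nodes from a value up to the root."""
    path = []
    while value is not None:
        path.append(value)
        value = parent[value]
    return path

def lowest_common_ancestor(values, parent):
    """Deepest taxonomy node that is an ancestor of every value."""
    paths = [path_to_root(v, parent) for v in values]
    common = set(paths[0])
    for p in paths[1:]:
        common &= set(p)
    # The first common node on any root path is the deepest one.
    return next(v for v in paths[0] if v in common)

def ancestor_record(cluster, categorical_idx, numerical_idx, parent):
    """Representative record of a cluster: LCA per categorical
    quasi-identifier, median per numerical quasi-identifier."""
    rec = {}
    for i in categorical_idx:
        rec[i] = lowest_common_ancestor([r[i] for r in cluster], parent)
    for i in numerical_idx:
        rec[i] = median(r[i] for r in cluster)
    return rec

cluster = [("flu", 34), ("pneumonia", 40), ("gastritis", 29)]
print(ancestor_record(cluster, [0], [1], TAXONOMY))
# {0: 'disease', 1: 34}
```

Note how the categorical value generalizes only as far as necessary: a cluster holding just "flu" and "pneumonia" would be represented by "respiratory" rather than "disease".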
TABLE 2
Performance Analysis of MapReduce Jobs
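As a minimal illustration of how the Map side of a job such as SeedSelection can be organized (Table 2 analyzes the actual MapReduce jobs), here is a hypothetical Hadoop Streaming-style sketch in Python; the paper's implementation uses the Java MapReduce API, so every name and parameter below is an assumption:

```python
import random

def map_sample(records, n_capacity, data_size, rng=random):
    """Map function of a SeedSelection-style job: forward each record to
    the single Reducer with probability N/|D|, so that about N records
    are emitted in total. A constant key routes them all to one Reducer."""
    p = n_capacity / data_size
    for record in records:
        if rng.random() < p:
            yield ("sample", record.rstrip("\n"))

# In Hadoop Streaming the records would arrive on stdin and the key-value
# pairs would be printed tab-separated; here we run in memory instead.
demo = ["r1\n", "r2\n", "r3\n", "r4\n"]
sampled = list(map_sample(demo, n_capacity=2, data_size=4, rng=random.Random(7)))
```

The single "sample" key is the design choice that funnels all sampled records to the one Reducer performing the serial farthest-first selection.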
TABLE 3
Summary of Experimental Parameters

MIN: The minimum intra-cluster distance of a data set.
AVG: The average intra-cluster distance of a data set.
v_s: The weight of proximity in the distance measure (9).
k: The k-anonymity parameter.
t: The number of partitions after t-ancestors clustering.
u, τ: Two parameters for the stopping criteria of t-ancestors clustering; u is the maximum number of iteration rounds, and τ is the threshold of the difference of the seeds between two consecutive iteration rounds.

Fig. 3. System overview of U-Cloud.

Several notions are defined for convenience. To capture the resistibility to proximity breaches, we define two statistics for a QI-group G, namely, the minimum distance MIN_G and the average distance AVG_G. Let n_G = |G|; then

    MIN_G = \min_{r_x \ne r_y \in G} d(r_x^S, r_y^S),   (12)

    AVG_G = \Big( 2 \sum_{r_x \ne r_y \in G} d(r_x^S, r_y^S) \Big) / \big( n_G (n_G - 1) \big).   (13)

Actually, the QI-group here satisfies (MIN_G, k/(n_G - 1))-dissimilarity. Hence, the resistibility can be captured by MIN_G directly. As to the entire data set D, we measure the distribution of MIN to capture the resistibility intuitively, where MIN is the minimum intra-cluster distance of D, a random variable ranging within [0, \max_{G \subseteq D} MIN_G]. Technically, we leverage the Relative Cumulative Frequency (RCF, similar to the cumulative distribution function) [34] of MIN to describe the distribution for comparison. The average distance of QI-groups in D is defined as

    AVG = \frac{1}{|\{ G \mid G \subseteq D \}|} \sum_{G \subseteq D} AVG_G.   (14)

From an overall perspective, the average distance AVG can also reflect the vulnerability to proximity breaches to a certain extent. The metric ILoss [27] is employed to evaluate how much data distortion is incurred by PAC and kMC. The value of ILoss is normalized to facilitate comparisons. For scalability, we check whether both approaches can still work and scale over large-scale data sets. Execution time is measured to evaluate efficiency. Experimental parameters are summarized in Table 3.

We conduct an overall comparison between PAC and kMC in terms of the four factors above. Intuitively, PAC will produce more QI-groups with higher MIN_G than kMC, and the AVG of PAC will be larger than that of kMC, because PAC tends to assign dissimilar sensitive values to a cluster while kMC puts sensitive values into clusters at random. Normally, the ILoss of PAC will be higher than that of kMC, as there is a tradeoff between data distortion and the capability of defending against privacy attacks. Note that this is a common phenomenon for privacy models (e.g., l-diversity and t-closeness) that fight against attribute linkage attacks [7], [8]. As to scalability, PAC can scale over more computation nodes as the data volume increases, due to its parallelization capability. kMC will suffer from poor scalability over large-scale data sets because it requires too much memory, while PAC can scale linearly over data sets of any size. Correspondingly, the execution time of PAC is often less than that of kMC in big data scenarios; but note that the contrary case may occur because of the extra overheads engendered by PAC when the data set or the scale of the MapReduce cluster is small. PAC is equivalent to kMC if the weight v_s = 0 and the parameter t = 1. Thus, kMC can be regarded as a special form of PAC.

6.2 Experiment Evaluation
6.2.1 Experiment Settings
Our experiments are conducted in a cloud environment named U-Cloud [4]. The system overview of U-Cloud is depicted in Fig. 3. The Hadoop cluster is built on U-Cloud. For more details about U-Cloud, please refer to [4].

The data set Census-Income (KDD) [35] is utilized in our experiments. Its subset, the Adult data set, has been commonly used as a de facto benchmark for testing anonymization algorithms [9], [18], [19], [25], [26]. The data set is sanitized by removing records containing missing values and attributes with extremely skewed distributions. We obtain a sanitized data set with 153,926 records, from which the data sets in the following experiments are sampled. Twelve attributes are chosen out of the original 40, including nine quasi-identifiers (four numerical and five categorical) and three sensitive attributes (two numerical and one categorical).

Both PAC and kMC are implemented in Java. Further, PAC is implemented with the standard Hadoop MapReduce API and executed on a Hadoop cluster built on OpenStack in U-Cloud. The Hadoop cluster consists of 20 virtual machines of type m1.medium, each with two virtual CPUs and 4 GB of memory. kMC is executed on one virtual machine of type m1.medium, with the maximum Java VM heap size set to 4 GB. Each round of experiments is repeated 10 times, and the mean of the measured results is regarded as representative.

6.2.2 Experiment Process and Results
We conduct two groups of experiments in this section to evaluate the effectiveness and efficiency of our approach. In the first, we explore whether proximity-aware clustering leads to larger dissimilarity by comparing PAC with kMC from the perspectives of resistibility to proximity breaches and data distortion. The second investigates scalability and efficiency.
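The group statistics in (12) and (13) and the data-set average in (14) can be computed directly from pairwise sensitive-value distances; the following sketch (with a hypothetical distance `d`; not the paper's evaluation code) mirrors those definitions:

```python
from itertools import combinations

def min_avg_distance(group, d):
    """MIN_G and AVG_G of a QI-group: the minimum and the average pairwise
    distance between sensitive values, per Eqs. (12) and (13)."""
    pairs = [d(x, y) for x, y in combinations(group, 2)]
    n = len(group)
    min_g = min(pairs)
    avg_g = 2 * sum(pairs) / (n * (n - 1))
    return min_g, avg_g

def dataset_avg(groups, d):
    """Average of AVG_G over all QI-groups in the data set, per Eq. (14)."""
    return sum(min_avg_distance(g, d)[1] for g in groups) / len(groups)

# Toy sensitive values with absolute difference as the proximity distance.
d = lambda a, b: abs(a - b)
print(min_avg_distance([1, 4, 9], d))  # MIN_G = 3, AVG_G = 16/3
```

A higher MIN_G means every pair of sensitive values in the group is far apart, which is exactly the resistibility to proximity breaches the comparison above measures.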
Fig. 5. Change of execution time w.r.t. data size and the number of computation nodes.
model, (ε+, δ)^k-dissimilarity, by allowing multiple sensitive attributes and semantic proximity of categorical sensitive values. Since the satisfiability problem of (ε+, δ)^k-dissimilarity is NP-hard, the problem of big data local recoding against proximity privacy breaches has been modeled as a proximity-aware clustering problem. We have proposed a scalable two-phase clustering approach based on MapReduce to address the above problem time-efficiently. A series of innovative MapReduce jobs have been developed and coordinated to conduct data-parallel computation. Extensive experiments on real-world data sets have demonstrated that our approach significantly improves the capability of defending against proximity attacks, the scalability, and the time-efficiency of local-recoding anonymization over existing approaches.

In cloud environments, privacy preservation for data analysis, sharing and mining remains a challenging research issue due to the increasingly large volumes of data sets, and thereby requires intensive investigation. Based on the contributions herein, we plan to integrate our approach with Apache Mahout, a MapReduce-based scalable machine learning and data mining library, to achieve highly scalable privacy-preserving big data mining and analytics.

ACKNOWLEDGMENTS
This paper was partly supported by Australian Research Council Linkage Project LP140100816 and by the National Science Foundation of China under Grant 91318301. Pei's research in this paper was supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES
[1] S. Chaudhuri, "What next?: A half-dozen data management research goals for big data and the cloud," in Proc. 31st Symp. Principles Database Syst., 2012, pp. 1-4.
[2] L. Wang, J. Zhan, W. Shi, and Y. Liang, "In cloud, can scientific communities benefit from the economies of scale?" IEEE Trans. Parallel Distrib. Syst., vol. 23, no. 2, pp. 296-303, Feb. 2012.
[3] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data mining with big data," IEEE Trans. Knowl. Data Eng., vol. 26, no. 1, pp. 97-107, Jan. 2014.
[4] X. Zhang, C. Liu, S. Nepal, S. Pandey, and J. Chen, "A privacy leakage upper bound constraint-based approach for cost-effective privacy preserving of intermediate data sets in cloud," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 6, pp. 1192-1202, Jun. 2013.
[5] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Comput. Surveys, vol. 42, no. 4, pp. 1-53, 2010.
[6] L. Sweeney, "K-anonymity: A model for protecting privacy," Int. J. Uncertainty Fuzziness Knowl.-Based Syst., vol. 10, no. 5, pp. 557-570, 2002.
[7] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "L-diversity: Privacy beyond k-anonymity," ACM Trans. Knowl. Discovery Data, vol. 1, no. 1, 2007.
[8] N. Li, T. Li, and S. Venkatasubramanian, "Closeness: A new privacy measure for data publishing," IEEE Trans. Knowl. Data Eng., vol. 22, no. 7, pp. 943-956, Jul. 2010.
[9] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu, "Utility-based anonymization using local recoding," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 785-790.
[10] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Workload-aware anonymization techniques for large-scale datasets," ACM Trans. Database Syst., vol. 33, no. 3, pp. 1-47, 2008.
[11] G. Aggarwal, R. Panigrahy, T. Feder, D. Thomas, K. Kenthapadi, S. Khuller, and A. Zhu, "Achieving anonymity via clustering," ACM Trans. Algorithms, vol. 6, no. 3, 2010.
[12] T. Wang, S. Meng, B. Bamba, L. Liu, and C. Pu, "A general proximity privacy principle," in Proc. IEEE 25th Int. Conf. Data Eng., 2009, pp. 1279-1282.
[13] T. Iwuchukwu and J. F. Naughton, "K-anonymization as spatial indexing: Toward scalable and incremental anonymization," in Proc. 33rd Int. Conf. Very Large Data Bases, 2007, pp. 746-757.
[14] X. Zhang, L. T. Yang, C. Liu, and J. Chen, "A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud," IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 2, pp. 363-373, Feb. 2014.
[15] X. Zhang, C. Liu, S. Nepal, C. Yang, W. Dou, and J. Chen, "A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud," J. Comput. Syst. Sci., vol. 80, no. 5, pp. 1008-1020, 2014.
[16] Y. Yang, Z. Zhang, G. Miklau, M. Winslett, and X. Xiao, "Differential privacy in data publication and analysis," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2012, pp. 601-606.
[17] J. Lee and C. Clifton, "Differential identifiability," in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012, pp. 1041-1049.
[18] J. Li, R. C.-W. Wong, A. W.-C. Fu, and J. Pei, "Anonymization by local recoding in data with attribute hierarchical taxonomies," IEEE Trans. Knowl. Data Eng., vol. 20, no. 9, pp. 1181-1194, Sep. 2008.
[19] J.-W. Byun, A. Kamra, E. Bertino, and N. Li, "Efficient k-anonymization using clustering techniques," in Proc. 12th Int. Conf. Database Syst. Adv. Appl., 2007, pp. 188-200.
[20] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu, "Aggregate query answering on anonymized tables," in Proc. IEEE 23rd Int. Conf. Data Eng., 2007, pp. 116-125.
[21] J. Li, Y. Tao, and X. Xiao, "Preservation of proximity privacy in publishing numerical sensitive data," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2008, pp. 473-486.
[22] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 129-137, Mar. 1982.
[23] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang, "(a, k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 754-759.
[24] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Mondrian multidimensional k-anonymity," in Proc. 22nd Int. Conf. Data Eng., 2006, p. 25.
[25] B. C. M. Fung, K. Wang, and P. S. Yu, "Anonymizing classification data for privacy preservation," IEEE Trans. Knowl. Data Eng., vol. 19, no. 5, pp. 711-725, May 2007.
[26] N. Mohammed, B. Fung, P. C. K. Hung, and C. K. Lee, "Centralized and distributed anonymization for high-dimensional healthcare data," ACM Trans. Knowl. Discovery Data, vol. 4, no. 4, 2010.
[27] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2006, pp. 229-240.
[28] J. Dean and S. Ghemawat, "MapReduce: A flexible data processing tool," Commun. ACM, vol. 53, no. 1, pp. 72-77, 2010.
[29] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel data processing with MapReduce: A survey," ACM SIGMOD Record, vol. 40, no. 4, pp. 11-20, 2012.
[30] T. Wang and L. Liu, "XColor: Protecting general proximity privacy," in Proc. IEEE 26th Int. Conf. Data Eng., 2010, pp. 960-963.
[31] R. L. Ferreira Cordeiro, C. Traina, A. J. Machado Traina, J. Lopez, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 690-698.
[32] S. Arora, P. Raghavan, and S. Rao, "Approximation schemes for Euclidean k-medians and related problems," in Proc. 30th Annu. ACM Symp. Theory Comput., 1998, pp. 106-113.
[33] Y. Tao, W. Lin, and X. Xiao, "Minimal MapReduce algorithms," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2013, pp. 529-540.
[34] I. W. Burr, "Cumulative frequency functions," Ann. Math. Statist., vol. 13, no. 2, pp. 215-232, 1942.
[35] UCI Machine Learning Repository, Census-Income (KDD) data set [Online]. Available: http://archive.ics.uci.edu/ml/datasets/Census-Income(KDD)
Xuyun Zhang received the bachelor's and master's degrees in computer science from Nanjing University, China. He is currently working toward the PhD degree at the Faculty of Engineering & IT, University of Technology Sydney, Australia. His research interests include cloud computing, privacy and security, big data, scalable machine learning and data mining, MapReduce, and OpenStack. He has published several papers in refereed international journals including IEEE Transactions on Parallel and Distributed Systems (TPDS). He received the Chinese Government Award for Outstanding Self-Financed Students Abroad in 2012.

Wanchun Dou received the PhD degree in mechanical and electronic engineering from the Nanjing University of Science and Technology, China, in 2001. From Apr. 2001 to Dec. 2002, he did postdoctoral research in the Department of Computer Science and Technology, Nanjing University, China. He is currently a full professor at the State Key Laboratory for Novel Software Technology, Nanjing University, China. From Apr. 2005 to Jun. 2005 and from Nov. 2008 to Feb. 2009, he visited the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, as a visiting scholar. Up to now, he has chaired three NSFC projects and published more than 60 research papers in international journals and conferences. His research interests include workflow, cloud computing, and service computing.

Jian Pei is a professor of computing science at Simon Fraser University and Canada Research Chair in Big Data Science. He is widely regarded as one of the world's top researchers in the area of data mining, and his work has been embraced by industry and government. His research interests include developing effective and efficient ways to analyze, and capitalize on, the vast stores of data housed in applications such as social networks, network security informatics, healthcare informatics, business intelligence, and web search. A prolific and widely cited author, he has received many prestigious awards.

Chi Yang received the BS degree from Shandong University at Weihai, China, and the MS (by research) degree in computer science from the Swinburne University of Technology, Melbourne, Australia, in 2007. He is currently working toward the PhD degree at the University of Technology Sydney, Australia. His major research interests include distributed computing, XML data streams, scientific workflow, distributed systems, green computing, big data processing, and cloud computing.

Chang Liu is currently working toward the PhD degree at the University of Technology Sydney, Australia. His research interests include cloud computing, resource management, cryptography, and data security.

Jinjun Chen received the bachelor's degree in applied mathematics from Xidian University, China, in 1996, the master's degree in engineering in 1999, and the PhD degree in computer science and software engineering from Swinburne in 2007. He is currently an associate professor at the Faculty of Engineering and IT, University of Technology Sydney, Australia. His research interests include cloud computing, big data, software services, privacy, and security. He has published more than 130 papers in high-quality journals and conferences, including IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Computers, IEEE Transactions on Services Computing, IEEE Transactions on Software Engineering, ACM Transactions on Software Engineering and Methodology, and ICSE. He received the IEEE Computer Society Outstanding Leadership Award (2008-2009, 2010-2011), the Vice Chancellor Research Award, and many other awards.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.