Abstract—K-anonymity and its successors, such as l-diversity and t-closeness, are the most popular approaches for privacy-preserving data publishing. However, each method has relatively high information loss and computational complexity. To solve this problem, this paper presents a fuzzy set based anonymity algorithm, in which numerical data are transformed to linguistic data and sensitive data are published in conjunction with a fuzzy draft rate. The experimental results show that the fuzzy based algorithm performs better than the k-anonymity method in terms of information loss and execution performance: the information loss of the fuzzy based algorithm is reduced by 40%∼50% and the execution time by 48%∼59%.

Keywords—privacy preservation, fuzzy sets, k-anonymity.

I. INTRODUCTION

With the development of computer networks, distributed computing, data mining and big data, huge amounts of data can be collected and analysed efficiently. But when we mine the potential value of such large amounts of data, we inevitably face privacy leakage. Therefore, in the big data era, how to protect privacy has become one of the hottest IT-related research areas.

In daily life, most organizations often need to publish their micro data, e.g., medical data, census data, etc., for research and other purposes. Typically, such data are stored in databases, and each record (row) corresponds to one individual. Each record has a number of attributes [1], which can be divided into three categories: identifier attributes, quasi-identifier attributes and sensitive attributes. (1) Identifier attributes can uniquely identify an individual, for instance, identity card number, social security number, etc. (2) Quasi-identifier attributes may be known by outsiders, such as zip code, birth date and gender, and can be joined with external information to re-identify individual records. (3) Sensitive attributes are assumed to be unknown to outsiders and need to be protected, such as health condition or salaries.

To preserve privacy, a number of techniques have been proposed for modifying or transforming the original data. These techniques rely on cryptography, data mining and information hiding [2]. But in general, they suffer too much information loss with high execution complexity.

In this paper, we address the data privacy problem by using fuzzy sets, a new perspective on the privacy problem in data publishing. In the fuzzy set based anonymity algorithm, numerical data are fuzzily processed in order to transform them into linguistic data, and sensitive data are also published in conjunction with the fuzzy draft rate. The experimental results show that, compared with the classical k-anonymity method, the new algorithm performs better in information loss and execution performance.

II. RELATED WORK

Currently, numerous methods have been proposed for privacy preservation. Samarati and Sweeney [3][4] introduced k-anonymity with the attractive property that each tuple in the private table being released cannot be distinguished from at least k records. In other words, k-anonymity requires that each equivalence class contain at least k records with respect to the quasi-identifier. The main contribution of paper [5] is p-sensitive k-anonymity, which requires, in addition to k-anonymity, that for each group of tuples with an identical combination of quasi-identifier values, the number of distinct sensitive attribute values be at least p. Although k-anonymity protects against identity disclosure, it is insufficient to prevent attribute disclosure. Nergiz et al. [6] proposed a privacy model called multi-relational k-anonymity to ensure k-anonymity across multiple relational tables. K-anonymity is popular and classical because of its simple definition and easy implementation of the anonymization process. But it faces many attacks when background knowledge is available to outsiders. Such attacks include: (1) the "Homogeneity Attack", which leverages the case where all the values of a sensitive attribute within a set of k records are identical; in such cases, even though the data has been k-anonymized, the sensitive value for the set of k records may be exactly predicted; (2) the "Background Knowledge Attack", which leverages an association between one or more quasi-identifier attributes and the sensitive attribute to reduce the set of possible values of the sensitive attribute.

To address the limitations of k-anonymity, Machanavajjhala et al. [7] introduced a new notion of privacy, called l-diversity, which requires that the distribution of a sensitive attribute in each equivalence class have at least l "well represented" values. The t-closeness model [8] requires that the earth mover's distance (EMD) between the distribution of the sensitive attribute within an anonymized group and the global distribution differ by no more than a threshold t; a table is said to have t-closeness if all equivalence classes have t-closeness. An (l, t)-closeness model [9] was proposed based on a partition of the sensitive levels. This model relaxes the equivalence class constraints of the t-closeness model: the distance between the distribution of sensitive levels in an equivalence class and that of the whole data table must be no more than a threshold t. It depends on the Hellinger distance.
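The k-anonymity condition described above (every combination of quasi-identifier values shared by at least k records) can be checked mechanically. A minimal sketch in Python follows; the table contents and attribute names are invented for illustration:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    occurs in at least k records (the k-anonymity condition)."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical released table: zip code and gender are quasi-identifiers.
table = [
    {"zip": "476**", "gender": "M", "disease": "flu"},
    {"zip": "476**", "gender": "M", "disease": "cancer"},
    {"zip": "479**", "gender": "F", "disease": "flu"},
    {"zip": "479**", "gender": "F", "disease": "cold"},
]

print(is_k_anonymous(table, ["zip", "gender"], 2))  # True: each group has 2 records
print(is_k_anonymous(table, ["zip", "gender"], 3))  # False: no group has 3 records
```

Note that this table also exhibits the weakness discussed above: a 2-anonymous table can still leak a sensitive value if a group happens to be homogeneous in that value.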
Assigning a value µA(x) ∈ [0, 1] to each element x ∈ U, we call µA(x) the membership degree to which x belongs to A.

B. Maximum Membership Principle

Let U be the discourse domain, with n fuzzy subsets {A1, A2, ..., An}; we can identify any given x0 ∈ U and

For example, let U be the universe of discourse with U = {2, 4, 6, 8, 10}. If we want to describe "the number which is closer to 7", it is obvious that 6 and 8 are the closest to 7, and the membership degrees of 6 and 8 are both 0.8. When we use the fuzzy offset, we can calculate that the fuzzy draft rate of 6 is −0.2 and the fuzzy draft rate of 8 is +0.2. This can be interpreted as: 6 is close to 7 from the left side and 8 is close to 7 from the right side.
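The numbers in the example above are reproduced by a triangular membership function. The center 7 and spread 5 below are assumptions chosen so that 6 and 8 both receive degree 0.8, matching the text; they are not parameters given by the paper:

```python
def triangular_membership(x, center, spread):
    """Membership degree of x in the fuzzy set 'close to center',
    falling linearly from 1 at the center to 0 at distance `spread`."""
    return max(0.0, 1.0 - abs(x - center) / spread)

def fuzzy_draft_rate(x, center, spread):
    """Signed offset of x from the set's mid-point: negative means x
    approaches the center from the left, positive from the right."""
    return (x - center) / spread

U = [2, 4, 6, 8, 10]
for x in U:
    mu = triangular_membership(x, center=7, spread=5)
    rate = fuzzy_draft_rate(x, center=7, spread=5)
    print(x, round(mu, 2), round(rate, 2))
# 6 and 8 both have membership 0.8, with draft rates -0.2 and +0.2
```

Publishing the pair (membership, draft rate) preserves the direction and distance of the original value from the set's mid-point without revealing the value itself, which is the property the algorithm later exploits.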
E. Information Loss

The concept of data utility is a standard to measure the degree to which the original data characteristics are preserved after anonymization. The definition of information loss is consistent with data utility: the smaller the information loss, the better the data utility. So the information loss can be used to measure the data utility after anonymization. We propose a method that calculates the homogeneity of data sets to measure the information loss [19], which can be defined as:

IL = SSE/SST    (4)

In Eq. (4), IL, SSE and SST stand for the information loss, the sum of squared errors and the total sum of squares, respectively. SSE represents the homogeneity measurement within classes, which can be defined as:

SSE = Σ_{i=1}^{a} Σ_{j=1}^{n_i} (X_ij − X̄_i)^T (X_ij − X̄_i)    (5)

In this formula, X_ij refers to the j-th tuple in the i-th equivalence class, X̄_i refers to the average of the tuples in equivalence class i, a is the number of equivalence classes, and n_i is the number of tuples in equivalence class i. The smaller the value of SSE, the smaller the information loss of the anonymized data.

Algorithm 1 Fuzzy-anonymity Algorithm
1: Input: Original Table T (with n records)
2: Output: Private Table T′ which satisfies fuzzy-anonymity
3: Begin
4: for i = 1 to n do
5:   Select the required attributes from the table T.
6:   Categorize the type of attributes.
7:   Identifier attributes, for instance, identification card number, are generally replaced with auto-generated numbers.
8:   Choose the number of fuzzy sets.
9:   while quasi-identifier attribute is numerical do
10:    Calculate the fuzzy membership degree by choosing a membership function based on expert experience, as mentioned in Section II. (In this step, the calculated membership degree is not released, in order to resist the linking attack, since the membership function is vulnerable to disclosure.)
11:   end while
12:   while sensitive attribute is numerical do
13:    Transform the actual values to the fuzzy draft rate by Algorithm 2.
14:   end while
15: end for
16: For all the records, the algorithm terminates when all the numerical attributes are transformed and table T′ is generated.
17: End
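Eqs. (4) and (5) can be sketched directly. The tiny two-class partition below is invented for illustration; the records are one-dimensional here, so the vector product (X_ij − X̄_i)^T(X_ij − X̄_i) reduces to a squared difference:

```python
def sse(classes):
    """Within-class sum of squared deviations from each class mean (Eq. 5)."""
    total = 0.0
    for cls in classes:
        mean = sum(cls) / len(cls)
        total += sum((x - mean) ** 2 for x in cls)
    return total

def sst(classes):
    """Total sum of squared deviations from the global mean."""
    values = [x for cls in classes for x in cls]
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def information_loss(classes):
    """IL = SSE / SST (Eq. 4): 0 means no loss; values near 1 mean the
    classes are as heterogeneous as the whole table."""
    return sse(classes) / sst(classes)

# Two hypothetical equivalence classes of ages after anonymization.
classes = [[25, 27, 29], [60, 62, 64]]
print(information_loss(classes))  # small: classes are internally homogeneous
```

Grouping similar values together keeps SSE, and hence IL, small; a partition that mixed young and old ages into the same equivalence class would push IL towards 1.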
IV. FUZZY-ANONYMITY ALGORITHM

In this section, we propose a fuzzy-anonymity algorithm to protect privacy. The main idea is that the numerical data are fuzzily processed and transformed into linguistic sets. In particular, sensitive data are published in conjunction with the fuzzy draft rate to maintain their utility for data mining. During the whole process, the membership function needs to remain confidential.

The central idea of fuzzy mathematics is that the membership degree is indicated by a value in the range [0, 1], where '0' represents the absolute false and '1' the absolute true with respect to a given linguistic term. The linguistic terms are the words that describe the magnitude of the linguistic variable, such as low, medium and high.

A. Algorithm Overview

The pseudo-code of the fuzzy-based algorithm is shown as Algorithm 1. Given mid-points {a_1, a_2, ..., a_n} between the minimum value min and the maximum value max, the ranges of the fuzzy sets are

{min − a_2}, {a_1 − a_3}, ..., {a_{i−1} − a_{i+1}}, ..., {a_{n−1} − max}    (7)

For example, supposing L = {Low, Medium, High}, the number of fuzzy sets is three. The minimum and maximum values of income according to the business organization are min and max respectively, and {a_1, a_2, a_3} are the mid-points of the fuzzy sets.

Since the triangle membership function has the benefit of simple calculation and easy implementation, we use the triangle membership function to transfer the numerical data to linguistic data, which belongs to the 'assignment method' mentioned in Section II. For the fuzzy set with mid-point a_1, the membership function is given as Eq. 8.

A_1(x) = { 0.99,                      x = min
           (a_2 − x)/(a_2 − min),     min < x < a_2       (8)
           0,                         x ≥ a_2

For the fuzzy set with mid-point a_i, 2 ≤ i ≤ n − 1, the membership function is given as Eq. 9.

TABLE I. THE INITIAL MICRO DATA TO BE PUBLISHED
Name | Age | Gender | Income
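The boundary membership function of Eq. 8 can be written out directly. The income range (min = 1000) and the second mid-point (a_2 = 5000) below are made-up parameters for illustration, not values from the paper:

```python
def a1_membership(x, mn, a2):
    """Eq. 8: membership degree in the first fuzzy set (mid-point a1),
    which equals 0.99 at the minimum value, falls linearly towards the
    second mid-point a2, and is 0 at or beyond a2."""
    if x == mn:
        return 0.99
    if mn < x < a2:
        return (a2 - x) / (a2 - mn)
    return 0.0  # x >= a2

# Hypothetical income attribute: min = 1000, second mid-point a2 = 5000.
print(a1_membership(1000, 1000, 5000))  # 0.99 (boundary value)
print(a1_membership(3000, 1000, 5000))  # 0.5 (halfway between min and a2)
print(a1_membership(6000, 1000, 5000))  # 0.0 (outside the Low set)
```

The 0.99 at the boundary keeps the minimum from being trivially recognizable as a full member, while the linear slope matches the triangle shape chosen above.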
REFERENCES
[1] P. Samarati and L. Sweeney, “Generalizing data to provide anonymity
when disclosing information,” in Proceedings of the seventeenth ACM
SIGACT-SIGMOD-SIGART symposium on Principles of database sys-
tems, 1998, p. 188.
[2] B. C. M. Fung, K. Wang, A. W.-C. Fu, and P. S. Yu, Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. CRC Press, 2010.
[3] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557–570, 2002.
[4] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 571–588, 2002.
[5] J. Domingo-Ferrer and V. Torra, "A critique of k-anonymity and some of its enhancements," in Third International Conference on Availability, Reliability and Security (ARES), 2008, pp. 990–993.
[6] M. E. Nergiz, C. Clifton, and A. E. Nergiz, “Multirelational k-
anonymity,” IEEE Transactions on Knowledge & Data Engineering,
vol. 21, no. 8, pp. 1104–1117, 2009.
[7] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam,
“l-diversity: Privacy beyond k-anonymity,” Acm Transactions on Knowl-
edge Discovery from Data, vol. 1, no. 1, p. 24, 2007.
[8] N. Li, T. Li, and S. Venkatasubramanian, "t-closeness: Privacy beyond k-anonymity and l-diversity," in Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE), 2007, pp. 106–115.
[9] J. Yang, B. Zhang, J. Zhang, and J. Xie, “A(l,t)-closeness anonymization
method based on sensitive levels partition,” Journal of Huazhong
University of Science & Technology, 2014.
[10] X. Xiao and Y. Tao, “m-invariance: Towards privacy preserving re-
publication of dynamic datasets,” in Proceedings of the 2007 ACM
SIGMOD international conference on Management of data, 2007, pp.
689–700.