
Research on privacy protection based on K-anonymity

REN Xiangmin 1,2                                YANG Jing 1
1 College of Computer Science and Technology,   1 College of Computer Science and Technology,
Harbin Engineering University, Harbin, China    Harbin Engineering University, Harbin, China
2 School of Software, Harbin University,        yangjing@hrbeu.edu.cn
Harbin, China
min0070@sina.com

Abstract—With the rise of data mining technology and the appearance of data streams, uncertain data and related technologies, individual and enterprise data may be leaked at any moment, so data security has become a main topic of information security today. A common way to protect privacy in data publishing is K-anonymity. This paper comprehensively analyses the current research situation of the K-anonymity model used to prevent privacy leaks in data publishing, introduces the techniques of K-anonymity, generalization and suppression, illustrates K-anonymity evaluation criteria, and assesses many algorithms currently in use. Finally, future directions in this field are discussed.

Keywords- K-anonymity; privacy protection; evaluation criterion; algorithm classification

I. INTRODUCTION

Database technology developed quickly and was applied extensively in the 1960's. With the development of network technology, database security problems have become increasingly prominent; information stealing, tampering and destruction in particular endanger the safety of information systems. With the rise of data mining technology and the appearance of data streams, uncertain data and related technologies, individual and enterprise data may be leaked at any moment, so data security has become a main topic of information security today.

Enterprises, organizations and governments store a great deal of information, such as employees' salaries, medical records, criminal records and credit records, all preserved in various databases. The database servers of some departments hold sensitive financial data, including transaction records, business data and accounts, which should be protected to prevent competitors and other outlaws from obtaining them. In this global information-based environment, people can obtain useful information from mass data by knowledge discovery, which in turn also threatens privacy; as a result, people are required to share data and protect it simultaneously.

Database security is mostly achieved by authorization or encryption technology [1], [2]. These are very important for information and data protection: they prevent external threats and unauthorized access through identity authentication mechanisms and data encoding. Many current international conferences analyse this problem, put forward requirements for stronger encryption technology and further research on information security, establish corresponding protection models, and ensure access-authorization strategies by setting up server-based protective devices, client-based protective models and logic-framed protective models. Such methods can prevent the disclosure of sensitive information to some extent, but they cannot stop a user from applying non-sensitive data, with the help of other external knowledge, to indirectly infer sensitive information. Sensitive data, which cannot be publicized, belong to private and national confidential data, and their leakage may expose the identity of the data subjects. To solve the problem of unconscious disclosure of private information and to protect individual identity, many research institutes have committed a great deal of manpower and material resources to this research, and the problem has been raised at many current international conferences, such as SIGMOD, VLDB, ICDE, and PODS.

K-anonymity [3], a model put forward by Samarati P and Sweeney L in 1998 to avoid privacy leaks, requires the existence of a certain number of indistinguishable individuals in the published data, which makes the attacker unable to identify the concrete individual behind a record and prevents the leak of individual privacy. K-anonymity has attracted universal attention in academic circles, and many scholars have researched and developed the technology on different levels.

Samarati P [4] realizes k-anonymity by adopting generalization and suppression techniques to protect individual private information, and introduces the concept of minimal generalization. Iyengar V [5] explores preserving anonymity through generalizations and suppressions on the potentially identifying portions of the data. In particular, he investigates the privacy transformation in the context of data mining applications like building classification and regression models, combined with a more thorough exploration of the solution space using the genetic algorithm framework. Yao C et al. [6] illustrate the identification of k-anonymity violations in views. Machanavajjhala A et al. [7] extended the k-anonymity concept to the l-diversity concept, which requires l different

978-1-4244-5316-0/10/$26.00 ©2010 IEEE


values on the sensitive attribute for each equivalence block that satisfies k-anonymity [18]. Based on the idea of Classfly, a family of K-anonymization approaches supporting multiple constraints, named Classfly+, is proposed according to the features of multiple constraints by Yang XC et al. [8]; Classfly+ can decrease information loss and improve the efficiency of k-anonymization. Xiaokui Xiao et al. [9] present a new generalization framework based on the concept of personalized anonymity. Their technique performs the minimum generalization needed to satisfy everybody's requirements and thus retains the largest amount of information from the microdata. Aggarwal [10] shows the curse of k-anonymity in the case of high dimensionality. DENG Jing-jing and YE Xiao-jun [11] propose an algorithm for multidimensional K-anonymity using an R tree. Xiaoxun Sun et al. [12] propose a new privacy protection model called (p+, α)-sensitive k-anonymity, where sensitive attributes are first partitioned into categories by their sensitivity, and then the categories that the sensitive attributes belong to are published. In [13], a partitioning-based privacy preserving k-anonymous algorithm, k-APPRP, for re-publication is proposed by WU Yingjie et al.; k-APPRP can securely anonymize a continuously growing dataset in an efficient manner while assuring high data quality. Microaggregation algorithms have recently been proposed to achieve k-anonymity. In [14], the core ideas of microaggregation algorithms, the state of the art and related techniques are surveyed; the existing algorithms are classified and analyzed, and evaluation methods for microaggregation algorithms are investigated. In [15], Gionis Aristides et al. introduce new notions of k-type anonymizations, which they call (1, k)-, (k, k)- and global (1, k)-anonymizations, according to several utility measures, and propose a collection of agglomerative algorithms for finding such anonymizations with high utility. A new privacy protection method beyond K-anonymity, called (L, K)-anonymity, is given in [16]; it is used to protect data after K-anonymization, and an algorithm to eliminate the remaining information disclosure is provided. Charu C. Aggarwal [17] proposes an uncertain version of the k-anonymity model. Jianzhong Li et al. [18], [19] attempt to integrate two fields, data stream management and privacy protection, and propose the novel approaches SWAF and SKY to solve this problem.

II. RELATED CONCEPTS OF K-ANONYMITY

A. K-anonymity

Definition 1: Attributes

Let B(A1,…,An) be a table with a finite number of tuples. The finite set of attributes of B is {A1,…,An}. Given a table B(A1,…,An), {Ai,…,Aj} ⊆ {A1,…,An}, and a tuple t∈B, t[Ai,…,Aj] denotes the sequence of the values, vi,…,vj, of Ai,…,Aj in t, and B[Ai,…,Aj] denotes the projection, maintaining duplicate tuples, of attributes Ai,…,Aj in B [20].

Attributes in a table can be classified into four categories according to the roles they play [14]:

• Explicit identifier: an attribute which can identify an individual explicitly, such as name and social insurance number; it should be deleted or encrypted before the data are published in order to protect individual privacy.

• Quasi-identifier (QI): a group of attributes, existing in both the private table and external tables, that can recognize personal information through linking, such as {Race, Birth, Sex, ZIP}. Unlike explicit identifiers, the definition of a QI depends on the external information owned by attackers, so any attributes can possibly become a QI; generally speaking, a QI is chosen by experts, not users.

• Sensitive attribute: an attribute containing individual sensitive information, for example salary, religion, political party or physical condition.

• Non-sensitive attribute: an attribute containing non-sensitive information. It cannot be ignored when preserving the original data, because any combination of these attributes can possibly become a QI.

The following are the definitions of several terms mentioned in the paper.

Definition 2: Quasi-identifier

Given a population of entities U, an entity-specific table T(A1,…,An), fc: U→T and fg: T→U', where U ⊆ U'. A quasi-identifier of T, written QT, is a set of attributes {Ai,…,Aj} ⊆ {A1,…,An} where: ∃pi∈U such that fg(fc(pi)[QT]) = pi.

Definition 3: k-anonymity

Let RT(A1,...,An) be a table and QIRT be the quasi-identifier associated with it. RT is said to satisfy k-anonymity if and only if each sequence of values in RT[QIRT] appears with at least k occurrences in RT[QIRT].

TABLE I. EXAMPLE OF K-ANONYMITY, WHERE K=2 AND QI={RACE, BIRTH, SEX, ZIP}

Num   Race    Birth   Sex   ZIP      Disease
t1    asian   1974    F     3115**   hypertension
t2    asian   1974    F     3115**   short breath
t3    asian   1974    M     3114**   chest pain
t4    asian   1974    M     3114**   obesity
t5    black   1973    F     3114**   flu
t6    black   1973    F     3114**   short breath
t7    black   1973    F     3114**   obesity
t8    white   1973    F     3114**   chest pain
t9    white   1973    F     3114**   cancer

Table Ⅰ adheres to k-anonymity: the quasi-identifier is QIT={Race, Birth, Sex, ZIP} and k=2. In particular, t1[QIT]= t2[QIT], t3[QIT]= t4[QIT], t5[QIT]= t6[QIT]= t7[QIT], and t8[QIT]= t9[QIT].

Lemma: Let RT(A1,...,An) be a table, let QIRT=(Ai,…,Aj) be the quasi-identifier associated with RT, Ai,…,Aj ⊆ A1,…,An, and let RT satisfy k-anonymity. Then each sequence of values in RT[Ax] appears with at least k occurrences in RT[QIRT] for x=i,…,j.

For example, each value associated with an attribute of the QI in Table Ⅰ appears at least k times.
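As an illustrative aside (not one of the surveyed algorithms), the k-anonymity condition of Definition 3 can be checked directly by counting the occurrences of each QI value sequence; the sketch below hard-codes the records of Table I:

```python
from collections import Counter

# Records of Table I: (Race, Birth, Sex, ZIP, Disease).
TABLE_I = [
    ("asian", 1974, "F", "3115**", "hypertension"),
    ("asian", 1974, "F", "3115**", "short breath"),
    ("asian", 1974, "M", "3114**", "chest pain"),
    ("asian", 1974, "M", "3114**", "obesity"),
    ("black", 1973, "F", "3114**", "flu"),
    ("black", 1973, "F", "3114**", "short breath"),
    ("black", 1973, "F", "3114**", "obesity"),
    ("white", 1973, "F", "3114**", "chest pain"),
    ("white", 1973, "F", "3114**", "cancer"),
]

def satisfies_k_anonymity(rows, qi_indices, k):
    """True iff every QI value sequence occurs at least k times (Definition 3)."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return all(c >= k for c in counts.values())

# QI = {Race, Birth, Sex, ZIP} -> columns 0..3 of each record.
print(satisfies_k_anonymity(TABLE_I, [0, 1, 2, 3], 2))  # True: Table I is 2-anonymous
print(satisfies_k_anonymity(TABLE_I, [0, 1, 2, 3], 3))  # False
```

Raising k to 3 makes the check fail because the smallest equivalence blocks in Table I (e.g., {t1, t2}) have size 2.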
Concretely, |T[Race="asian"]| = 4, |T[Race="black"]| = 3, |T[Race="white"]| = 2, |T[Birth="1974"]| = 4, |T[Birth="1973"]| = 5, |T[Sex="M"]| = 2, |T[Sex="F"]| = 7, |T[ZIP="3115**"]| = 2, and |T[ZIP="3114**"]| = 7.

B. Generalization and Suppression

Anonymization is the process of generalization and suppression of the data that need to be kept secret. By generalizing and suppressing we can obtain a table that can be released without exposing individual information. Simply speaking, suppression deletes a single cell value or all the values in a tuple, while generalization substitutes a value with a more ambiguous one that is more general than the original. Suppression is applied only occasionally, whereas generalization, or the combination of generalization and suppression, is quite common.

1) Suppression

Suppression removes a data value directly from the released table, thus decreasing the released data; in practical applications, however, suppression can sometimes reduce the amount of generalization needed and thereby reduce data loss. For example, 2-anonymity is achieved in Table Ⅱ if tuple t1 is suppressed (k=2).

TABLE II. EXAMPLE OF SUPPRESSION

Num   Race    Birth   Sex   ZIP      Disease
t1    black   1974    F     3114**   flu
t2    black   1973    F     3114**   short breath
t3    black   1973    F     3114**   obesity
t4    white   1973    F     3114**   chest pain
t5    white   1973    F     3114**   cancer

2) Generalization

Given an attribute A, a generalization for the attribute is a function on A; that is, each f: A → B is a generalization. We also say that

A0 --f0--> A1 --f1--> … --f(n-1)--> An

is a generalization sequence, or a functional generalization sequence [21]. For example, in Table Ⅱ the original ZIP codes {311578, 311579} can be generalized to 3115**.

Generalization mainly involves DGHA (the domain generalization hierarchy on A) and VGHA (the value generalization hierarchy on A) [22]. In DGHA, a given set of attribute values is generalized into a set of more general attribute values; for example, the ZIP codes {311578, 311579, 311588, 311589} can be generalized to {31157*, 31158*}, which makes the set indicate a semantically larger range. VGHA can be shown as a tree.

Z3 = {******}
Z2 = {3115**}
Z1 = {31157*, 31158*}
Z0 = {311578, 311579, 311588, 311589}

Figure 1. ZIP domain (DGH Z0) and value (VGH Z0) generalization hierarchies including suppression

Fig. 1 provides an example of domain and value generalization hierarchies extended to include the suppressed maximal element (******).

C. The measurement of k-anonymity

K-anonymity must be built upon a proper criterion of measurement. No single criterion can measure all k-anonymity algorithms in practical applications; we can only say that some algorithms are better than others in some specific application. As a result, we ought to provide users with a set of metric criteria and let users choose the suitable K-anonymity algorithm according to their own needs. The measurement criteria are as follows [23]:

1) Based on Generalization Hierarchy

The data generalization hierarchy is an anonymization hierarchy built upon attribute values that can be divided into different layers; data at different layers contain different amounts of information. Generalizations based on attributes with taller generalization hierarchies typically maintain precision better than generalizations based on attributes with shorter hierarchies. Further, hierarchies with different heights can provide different Prec measures for the same table, so the construction of generalization hierarchies is part of the preference criteria, as indicated in the following formula:

Prec(RT) = 1 − ( Σ_{i=1}^{n} Σ_{j=1}^{m} H(Aij, A'j) / H(Aj) ) / (n × m)    (1)

where H is the height of the returned generalization hierarchy or generalization relationship, and Prec(RT) represents the generalization cost over the n rows and m QI attribute columns.

2) Based on the Amount of Suppression Cells

In [23], the cost measure of anonymity is based on the Hamming distance between rows; for instance, the distance between <1974, M, 311578> and <1973, M, 311579> is 2, because they differ in 2 corresponding attribute values. The essence of the measure is to count the suppressed cells, as indicated in the following formula:

Cost(RT) = 1 − ( Σ_{i=1}^{n} Σ_{j=1}^{m} HM(Aij, A'j) ) / (n × m)    (2)

where HM(Aij, A'j) is the Hamming distance when Aij is generalized to A'j, and Aij is the attribute value of the i-th tuple and j-th column.

3) Based on Partition

This cost measure of anonymity, a classification measure, is the sum of every row's penalty coefficient divided by the number of rows:

CM = ( Σ_{i=1}^{n} Penalty(tup_i) ) / n    (3)

The Penalty(tup_i) function value is 1 if and only if the current row is suppressed or its class is not a majority class; otherwise Penalty(tup_i) returns 0.
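As a sketch of the partition-based classification metric CM in formula (3): the row identifiers, equivalence groups and class labels below are invented for illustration and are not drawn from the surveyed papers.

```python
from collections import Counter

def classification_metric(rows, group_of, class_of, suppressed):
    """CM (formula 3): average penalty over the n rows.

    A row's penalty is 1 iff it is suppressed or its class label is not
    the majority class of its equivalence group; otherwise it is 0.
    """
    # Majority class per group, computed over the rows surviving suppression.
    labels_by_group = {}
    for r in rows:
        if r not in suppressed:
            labels_by_group.setdefault(group_of[r], []).append(class_of[r])
    majority = {g: Counter(labels).most_common(1)[0][0]
                for g, labels in labels_by_group.items()}

    penalties = sum(1 for r in rows
                    if r in suppressed or class_of[r] != majority[group_of[r]])
    return penalties / len(rows)

# Hypothetical 6-row example: r4 is suppressed; g1's majority class is
# "flu" and g2's is "cancer", so only r3 and r4 are penalized -> CM = 2/6.
rows = ["r1", "r2", "r3", "r4", "r5", "r6"]
group_of = {"r1": "g1", "r2": "g1", "r3": "g1", "r4": "g1",
            "r5": "g2", "r6": "g2"}
class_of = {"r1": "flu", "r2": "flu", "r3": "obesity", "r4": "flu",
            "r5": "cancer", "r6": "cancer"}
print(classification_metric(rows, group_of, class_of, {"r4"}))
```

Lower CM means the anonymized partition preserves the class structure of the original data better; a suppressed row always costs its full penalty.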
4) Based on Entropy

Entropy is a classical cost measure, first used in the Datafly system [23]. The anonymity level of each attribute is described by a value from 0 to 1: the value of every attribute in each tuple must be generalized until it equals the corresponding attribute of at least a "bin" number of other tuples. Also, in [23], the information-theoretic notion of entropy is used:

−Σ_{s∈S} ( n(q,s) / Σ_{s'∈S} n(q,s') ) log( n(q,s) / Σ_{s'∈S} n(q,s') ) ≥ log(l)    (4)

where n(q,s) is the number of appearances of element s in block q. The bigger the entropy value, the more difficult it is to infer a certain element.

III. ALGORITHM CLASSIFICATION

K-anonymity can be achieved by many newly proposed algorithms. This paper analyses these algorithms from different angles.

A. From the angle of Generalization

K-anonymity algorithms can be divided into two classes: full-domain generalization algorithms and local-domain generalization algorithms. Since the research on full-domain generalization started earlier, there are more full-domain generalization algorithms than local-domain ones.

The classical full-domain generalization algorithms include μ-Argus, Datafly, MinGen (Minimal Generalization), Incognito and so on.

The classical local-domain generalization algorithms include GA (genetic algorithm), Top-down, multidimensional space partition and so on.

B. From the angle of Microaggregation

The microaggregation algorithm [14] has recently been proposed as an alternative to the generalization/suppression method for k-anonymization. Its goal is to cluster a set of records into groups of size at least k such that the groups are as homogeneous as possible; the records' attribute values in the same group are then replaced by the group's centroid.

C. From the angle of Data Characteristic

1) Incremental Datasets

Most previous work on k-anonymization focused on one-time release of data. However, in reality data are often released continuously to serve various information purposes. The purpose of [13] is to develop an effective solution for the re-publication of incremental datasets. By analyzing several possible generalizations in the anonymization of incremental updates, an important monotonic generalization principle is proposed to prevent privacy disclosure in re-publication. Based on the monotonic generalization principle, a partitioning-based privacy preserving k-anonymous algorithm, k-APPRP, for re-publication is proposed [13].

2) Data Stream

In many applications, transaction data arrive in the form of high-speed data streams [18], [19]. A data stream has salient features:

• Elements arrive on-line.

• The system has no control over the order in which the data elements arrive.

• Once an element has been seen or processed, it cannot be easily retrieved or seen again unless it is explicitly stored in memory.

These data contain a lot of information about customers, not just transactions, and thus have to be carefully managed to protect customers' privacy. Li Jianzhong et al. present a novel method called SKY (Stream K-anonYmity) to continuously facilitate k-anonymity on data streams; this is the first reported work that considers k-anonymity on data streams for privacy protection. In [19], they consider the problem of preserving customers' privacy on the sliding window of transaction data streams. This problem is challenging because the sliding window is updated frequently and rapidly. They propose a novel approach, SWAF (Sliding Window Anonymization Framework), to solve this problem by continuously facilitating k-anonymity on the sliding window. Three advantages make SWAF practical:

• Small processing time for each tuple of the data stream.

• Small memory requirement.

• Both privacy protection and utility of the anonymized sliding window are carefully considered.

3) Uncertain Data

People have gained deeper insight into data uncertainty with the rapid development of data gathering and processing techniques. Data uncertainty exists universally in various fields, including economy, military, logistics, finance and telecommunication [24]. Uncertain data are diversified and can occur in the form of relational data, semi-structured data, streaming data, and moving objects. Since the results of privacy-transformation methods are a natural form of uncertain data, Charu C. Aggarwal [17] proposes an uncertain version of the k-anonymity model which is related to the well-known deterministic model of k-anonymity. The uncertain version has the additional feature of introducing greater uncertainty for the adversary over an equivalent deterministic model.

In addition, k-anonymity algorithms can be classified by homogeneity attack, background knowledge attack, single dimension, multi-dimension and so on.

IV. CONCLUSION

The k-anonymity technology has been studied for a long time, and its various algorithms have their own merits and shortcomings. They will be widely used in various fields and will improve and mature eventually. In the future we will mainly focus our research on the following aspects: microaggregation; data stream; uncertain data-based algorithms; homogeneity
attack and background knowledge attack; multidimensional k-anonymity; and k-anonymity combined with other algorithms such as clustering. Obviously, all these algorithms will become the focus of research.

ACKNOWLEDGMENT

The authors would like to thank Zhang Yanwei for many helpful comments. This work was supported in part by the National Natural Science Foundation of China (Grant No. 60873037), the Natural Science Foundation of Heilongjiang Province of China (Grant No. F200901), and the Science and Technology Research Program of the Education Department of Heilongjiang Province of China, "Research on privacy protection based on K-anonymity".

REFERENCES

[1] R. Agrawal, J. Kiernan, R. Srikant, et al., Order-Preserving Encryption for Numeric Data, Proc. of SIGMOD 2004, Paris, France, 2004, pp. 13-18.
[2] G. Miklau, D. Suciu, Controlling Access to Published Data Using Cryptography, VLDB 2003, Berlin, Germany, 2003, pp. 898-909.
[3] Samarati P, Sweeney L, Generalizing data to provide anonymity when disclosing information, Proc. of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Seattle, WA, USA, 1998, p. 188.
[4] Samarati P, Protecting respondents' identities in microdata release, IEEE TKDE, 2001, pp. 1010-1027.
[5] Iyengar V, Transforming data to satisfy privacy constraints, SIGKDD, 2002, pp. 279-288.
[6] Yao C, Wang X S, Jajodia S, Checking for k-anonymity violation by views, Proc. of the 31st Int'l Conf. on Very Large Data Bases, Trondheim, Norway, 2005, pp. 910-921.
[7] Machanavajjhala A, Gehrke J, Kifer D, l-diversity: Privacy beyond k-anonymity, Proc. of the International Conference on Data Engineering, Atlanta, GA, USA, 2006, p. 24.
[8] Yang XC, Liu XY, Wang B, Yu G, K-Anonymization approaches for supporting multiple constraints, Journal of Software, 2006, vol. 17, No. 5, pp. 1222-1231. (in Chinese)
[9] Xiao Xiaokui, Tao Yufei, Personalized privacy preservation, SIGMOD, Chicago, Illinois, USA, 2006, pp. 229-240.
[10] Aggarwal C, On k-anonymity and the curse of dimensionality, Proc. of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, 2005, pp. 901-909.
[11] DENG Jing-jing, YE Xiao-jun, Algorithm for Multidimensional K-anonymity by R Tree, Computer Engineering, 2008, vol. 34, No. 1, pp. 80-82. (in Chinese)
[12] Sun Xiaoxun, Wang Hua, et al., (p+, α)-sensitive k-anonymity: A new enhanced privacy protection model, Proc. of the 2008 IEEE 8th International Conference on Computer and Information Technology (CIT 2008), 2008, pp. 59-64.
[13] WU Ying-jie, et al., k-APPRP: a Partitioning Based Privacy Preserving k-anonymous Algorithm for Republication of Incremental Datasets, Journal of Chinese Computer Systems, 2009, vol. 30, No. 8, pp. 1581-1587.
[14] HAN Jian-min, et al., Research in Microaggregation Algorithms for k-Anonymization, Acta Electronica Sinica, 2008, vol. 36, No. 10, pp. 2021-2029. (in Chinese)
[15] Aristides Gionis, et al., k-Anonymization revisited, Proc. of the 2008 IEEE 24th International Conference on Data Engineering, 2008, pp. 744-753.
[16] LUO Hong-wei, LIU Guo-hua, (L, K)-anonymity for privacy preserving, Journal of Yanshan University, 2007, vol. 31, No. 1, pp. 82-86.
[17] Aggarwal Charu C., On Unifying Privacy and Uncertain Data Models, Proc. of the 2008 IEEE 24th International Conference on Data Engineering, 2008, pp. 386-395.
[18] Wang Weiping, Li Jianzhong, Privacy protection on sliding window of data streams, Collaborative Computing: Networking, Applications and Worksharing, 2007, 12-15 Nov., pp. 213-221.
[19] Li Jianzhong, Anonymizing Streaming Data for Privacy Protection, Proc. of the 2008 IEEE 24th International Conference on Data Engineering, 2008, pp. 1367-1369.
[20] L. Sweeney, k-anonymity: a model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002, pp. 557-570.
[21] L. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10 (5), 2002, pp. 571-588.
[22] CEN Tingting, et al., Survey of K-anonymity research on privacy preservation, Computer Engineering and Application, 2008, vol. 44, No. 4, pp. 130-134. (in Chinese)
[23] LI Zude, Checking and Preventing Privacy Inference Attacks Based on K-anonymized Microdata, Tsinghua University, thesis, 2006.
[24] ZHOU Aoying, et al., A Survey on the Management of Uncertain Data, Chinese Journal of Computers, 2009, vol. 32, No. 1, pp. 1-15. (in Chinese)
