International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 5, May - 2015. ISSN 2348 4853

Data Security and Anonymization for Collaborative Data Publishing
Sumeet Kalaskar, Madhuri Rathod, Snehal Godse, Shruti Dikonda
Department of Computer Engineering, PVGs College of Engineering & Technology
University of Pune, India.
ABSTRACT
In this paper we consider the collaborative data publishing problem and a new type of insider attack by colluding data providers, who may use their own data records to infer the data records contributed by other data providers. We formally introduce the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers. We then present algorithms that provide security against multiple data providers while guaranteeing high utility and high privacy of the anonymized data. For the experiments we use hospital patient data sets. The baseline algorithms achieve lower utility and efficiency than our slicing approach. Our experiments also demonstrate the difference in computation time compared with the encryption algorithm.
Index Terms: SMC, TTP, Anonymization, Bucketization, Distributed database, security
I. INTRODUCTION
When data is shared and released to the public, privacy-preserving techniques are applied, mainly to reduce the leakage of information about particular individuals. There is an increasing need to share data that contains personal information from distributed databases. For example, in the healthcare domain there is a national agenda to develop the Nationwide Health Information Network (NHIN) to share information among hospitals and other providers, and to support appropriate use of health information beyond direct patient care with privacy protection.
Privacy-preserving data analysis and data publishing have received considerable attention in recent years as promising approaches for sharing data while preserving individual privacy. The main goal is to publish an anonymized view of the integrated data, T, that is immune to attacks (Fig. 1.1). An attack is run by an attacker, i.e., a single external or internal entity, or a group of such entities, that wants to breach the privacy of the data using background knowledge. Collaborative data publishing is carried out with the help of a trusted third party (TTP), which guarantees that information about a particular individual is not disclosed anywhere, i.e., it maintains privacy. Here it is assumed that the data providers are semi-honest. A more desirable approach for collaborative data publishing is to aggregate first and then anonymize.
T1, T2, T3 and T4 are databases whose data is supplied by providers; for example, provider P1 provides the data for database T1. The distributed data coming from the different providers is aggregated by the TTP or by using a secure multi-party computation (SMC) protocol. The aggregated data is then anonymized by any anonymization technique. Here 0 is the authenticated user, and P1 tries to breach the privacy of data provided by other users with the help of background knowledge (BK). We call this type of attack an insider attack. We have to protect our system from such attacks.
II. LITERATURE SURVEY
In this section we present different methods previously used for anonymization and discuss some advantages and limitations of these systems. C. Dwork, in his survey of results on differential privacy [2], evaluates and summarizes different approaches to privacy-preserving data publishing (PPDP), studies the challenges in publishing data in practice, clarifies the differences and requirements that distinguish PPDP from other related problems, and proposes future research directions. The research directions identified for PPDP include privacy-preserving tools for protecting individuals' privacy in emerging technologies, and the incorporation of privacy protection into the engineering process.
24 | 2015, IJAFRC All Rights Reserved

www.ijafrc.org

N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. Lee proposed the LKC-privacy model for high-dimensional relational data in healthcare systems [5]. The LKC model gives better results than the traditional k-anonymization model. However, it considers only relational data, whereas healthcare data is complex and may be a combination of relational, transaction and textual data. W. Jiang and C. Clifton presented DPP2GA, a two-party privacy-preserving protocol. The protocol helps to preserve privacy, but its major disadvantage is that it may not produce precise data when the data are not partitioned. It is only a privacy-preserving protocol, not an SMC protocol, because it introduces certain inference problems.
W. Jiang and C. Clifton also presented a two-party framework, DkA [7]. It helps to retain the benefit of data partitioning while generating integrated k-anonymous data. It is proven to generate a k-anonymous dataset and to satisfy the security definition of SMC. However, DkA is not a multiparty framework.
A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam proposed l-diversity, which provides additional security over k-anonymity. An attacker can attack a k-anonymized system with the help of background knowledge (BK); l-diversity helps to overcome this problem. A newer technique for anonymization is slicing [10]. It is very useful for high-dimensional data, but there could be a loss of data utility.
III. RELATED WORK
Privacy-preserving data analysis and publishing have received considerable attention in recent years. Most work has focused on a single data provider setting and considered the data recipient as the attacker. A large body of literature assumes limited background knowledge of the attacker and defines privacy under that assumption. A few recent works have modeled instance-level background knowledge as corruption and studied perturbation techniques under these syntactic privacy notions. In the distributed setting that we study, since each data holder knows its own records, the corruption of records is an inherent element.
Some works have focused on the anonymization of distributed data, studying distributed anonymization for vertically partitioned data using k-anonymity. Zhong et al. studied classification on data collected from individual data owners (each record is contributed by one data owner) while maintaining k-anonymity. Jurczyk et al. proposed a notion called l-site diversity to ensure anonymity for data providers in addition to privacy of the data subjects. Mironov et al. studied SMC techniques to achieve differential privacy. Mohammed et al. proposed SMC techniques for anonymizing distributed data using the notion of LKC-privacy to address high-dimensional data. Gal et al. proposed a new way of anonymizing multiple sensitive attributes, which could be used to implement m-privacy w.r.t. l-diversity with providers as one of the sensitive attributes.
IV. PROPOSED WORK
The first category considers that a privacy threat occurs when an attacker is able to link a record owner to a record in a published data table, to a sensitive attribute in a published data table, or to the published data table itself. We call these record linkage, attribute linkage, and table linkage, respectively. In all three types of linkage, we assume that the attacker knows the quasi-identifier (QID) of the victim. In record and attribute linkages, we further assume that the attacker knows that the victim's record is in the released table and seeks to identify the victim's record and/or sensitive information from the table. In table linkage, the attacker seeks to determine the presence or absence of the victim's record in the released table. A data table is considered privacy preserving if it can effectively prevent the attacker from successfully performing these linkages. In a record linkage attack, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim's QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group. In this case, the attacker faces only a small number of possibilities for the victim's record.
The second category aims at achieving the uninformative principle: the published table should provide the attacker with little additional information beyond the background knowledge. If the attacker has a large variation between the prior and posterior beliefs, we call it the probabilistic attack. Many privacy models in this family do not explicitly classify attributes in a data table into QID and sensitive attributes, but some of them could also thwart the sensitive linkages in the first category, so the two categories overlap.
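The record linkage risk described above can be illustrated with a small Python sketch. It counts how many released records share each qid value and reports the groups that are smaller than a chosen threshold k. The function names, the threshold k and the sample rows are hypothetical and serve only as an illustration; they are not part of the proposed system.

```python
from collections import Counter

def qid_group_sizes(table, qid_attrs):
    """Count how many released records share each combination of QID values."""
    return Counter(tuple(row[a] for a in qid_attrs) for row in table)

def vulnerable_groups(table, qid_attrs, k):
    """Return the qid values whose group holds fewer than k records.

    A victim whose QID matches one of these values can be linked to only
    a small number of records, which is the record linkage threat.
    """
    sizes = qid_group_sizes(table, qid_attrs)
    return {qid: n for qid, n in sizes.items() if n < k}

# Hypothetical released table (values are made up for the example).
released = [
    {"age": "20-29", "zip": "4110*", "disease": "Flu"},
    {"age": "20-29", "zip": "4110*", "disease": "Asthma"},
    {"age": "20-29", "zip": "4110*", "disease": "Cancer"},
    {"age": "30-39", "zip": "4111*", "disease": "Flu"},
]

risky = vulnerable_groups(released, ["age", "zip"], k=2)
```

Here the group for ("30-39", "4111*") contains a single record, so a victim with that QID would be linked to exactly one record.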
Table I summarizes the attack models addressed by the privacy models. We consider a new type of insider attack by colluding data providers who may use their own data records (a subset of the overall data) to infer the data records contributed by other data providers. The insider attack has not been addressed until now, which is why an innovative solution is needed. One of the primary drawbacks that restricts some companies from getting fully involved in database marketing is its high cost.
The project addresses this new threat and makes several contributions. First, we introduce the notion of m-privacy, which guarantees that the anonymized data satisfies a given privacy constraint against colluding data providers. Second, we present heuristic algorithms exploiting the monotonicity of privacy constraints for efficiently checking m-privacy given a group of records. Third, we present a data provider-aware anonymization algorithm with adaptive m-privacy checking strategies to ensure high utility and m-privacy of the anonymized data with efficiency. Finally, we propose secure multi-party computation protocols for collaborative data publishing with m-privacy. All protocols are extensively analyzed, and their security and efficiency are formally proved. Experiments on real-life datasets suggest that our approach achieves better or comparable utility and efficiency than existing and baseline algorithms while satisfying m-privacy.
Collaborative techniques are used. Collaborative data publishing introduces a new attack that has not been studied so far. Compared to the attack by the external recipient in the second scenario, each provider has additional knowledge of its own records, which can help with the attack. This issue can be further worsened when multiple data providers collude with each other. In the social network or recommendation setting, a user may attempt to infer private information about other users using the anonymized data or recommendations, assisted by some background knowledge and her own account information. Malicious users may collude or even create artificial accounts, as in a shilling attack.
V. BLOCK DIAGRAM

The main aim is to avoid congestion and increase reliability in an event-driven WSN. This involves the creation of a random topology and the generation of an event; the nodes near the event area detect the event, form an atomic decision and broadcast their packets to the mobile data collector node. This node follows a random path to collect the data packets and disseminates them towards the sink node.

VI. ALGORITHMS
A. Anonymization by slicing: Slicing is based on attribute partitioning and tuple partitioning. In attribute partitioning (vertical partitioning) we partition the data as {name}, {age-zip} and {Disease}, and in tuple partitioning (horizontal partitioning) as {t1, t2, t3, t4, t5, t6}. In attribute partitioning, age and zip are grouped together because they are highly correlated; both are quasi-identifiers (QI), which may be known to the attacker. During tuple partitioning, the system should check L-diversity for the sensitive attribute (SA) column. The algorithm runs as follows.
1. Initialize bucket k = n, int i = rowcount, columncount = C, Q = {D} (D = data in the database), Arraylist = a[i];
2. While Q is not empty
      If i <= n
         Check L diversity;
      Else
         i++;
      Return D*;
3. Q = Q - {D* + a[i]};
4. Repeat steps 2 and 3 with the next tuple in Q
5. D* = D* U A[D] (next anonymized view of data D)
First, initialize k (the bucket size limit for data anonymization), the number of rows, the number of columns, the array list, and the database in the queue (step 1). Initially Q is the queue of data. Further processing is done if and only if the queue is not empty, i.e., there is data in the database. Check the data for L-diversity while the row count is within the bucket size k = n (step 2). If the bucket data fulfills k-anonymity and L-diversity, the algorithm returns D*, i.e., the anonymized view of the data. The data from the database that cannot fulfill the privacy requirement is stored in the array list a[i]. The data remaining in the database is then Q = Q - {D* + a[i]} (step 3). Repeat step 2 and step 3. A[D] is the anonymization of the data in the database. Apply the above steps to the remaining data and create a new anonymized view, which is the union of the original view and the new one, i.e., D* = D* U A[D].
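The attribute and tuple partitioning described above can be sketched in Python as follows. This is a minimal illustration of the general slicing idea, not the authors' implementation: the function name slice_table, the fixed-size bucketing, the seeded shuffle and the sample rows are all assumptions, the {name} column is left out for brevity, and the L-diversity check on each bucket is omitted here.

```python
import random

def slice_table(rows, column_groups, bucket_size, seed=0):
    """Slice a table: bucket the tuples, then permute each column group per bucket.

    rows          : list of dicts, one per tuple
    column_groups : vertical partition, e.g. [("age", "zip"), ("disease",)]
    bucket_size   : number of tuples per horizontal bucket (k in the text)
    """
    rng = random.Random(seed)  # seeded only to make the example repeatable
    sliced = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]  # tuple (horizontal) partition
        columns = []
        for group in column_groups:               # attribute (vertical) partition
            values = [tuple(row[a] for a in group) for row in bucket]
            rng.shuffle(values)  # break the link between column groups in the bucket
            columns.append(values)
        # Re-assemble the permuted column groups row by row.
        sliced.append([tuple(col[i] for col in columns) for i in range(len(bucket))])
    return sliced

# Hypothetical records t1..t6 (values are made up for the example).
rows = [
    {"age": 22, "zip": "41101", "disease": "Flu"},
    {"age": 24, "zip": "41102", "disease": "Asthma"},
    {"age": 29, "zip": "41103", "disease": "Cancer"},
    {"age": 33, "zip": "41111", "disease": "Flu"},
    {"age": 35, "zip": "41112", "disease": "Diabetes"},
    {"age": 38, "zip": "41113", "disease": "Malaria"},
]

sliced = slice_table(rows, [("age", "zip"), ("disease",)], bucket_size=3)
```

Within each bucket the (age, zip) pairs and the disease values are the same multisets as in the input; only their pairing is randomized, which is what hides the association between a quasi-identifier and a sensitive value.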
B. L diversity: L-diversity is the concept of maintaining uniqueness within the data. In this system we apply this concept to the SA, i.e., to disease. Our anonymized bucket size is 6 and we maintain L = 4, i.e., of 6 disease records 4 must be unique.
1. Initialize L = m, int i;
2. If i <= n-m+1
      Then a[0]..a[1], insert these values into Q as they are;
      i++;
   Else
      Check the privacy constraint for every incremented value in Q
3. If the data fulfills L = m
      Then
         Fscore = 1
         Insert the value in the row
         i++;
   Else
      Add the element to array list a[i];
4. Exit.
First, initialize L = m and the row count i. With k = n = 6 and L = m = 4 we get n-m+1 = 3, so the data up to the third row does not need to be checked for Fscore; add this data as it comes from Q (steps 1 and 2). For further data from Q, check the privacy constraint. If the data fulfills L, then Fscore = 1. If the data does not achieve Fscore = 1, add the element to the array list a[i] (step 3).
C. Permutation: Permutation means rearrangement of the records of data. In our project we have used permutation to rearrange the quasi-identifiers, i.e., {Zip, Age}.
D. Fscore: Fscore is the privacy fitness score, i.e., the level of fulfillment of the privacy constraint C. If Fscore = 1 then C(D*) = true.
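The L-diversity requirement and the Fscore described above can be sketched in Python. This sketch rests on two assumptions that the text does not state explicitly: L-diversity is read as "at least L distinct sensitive values per bucket", and Fscore is taken as the ratio of distinct values to L, capped at 1. The function names are hypothetical.

```python
def distinct_l_diversity(sensitive_values, l):
    """True if the bucket holds at least l distinct sensitive values (assumed reading)."""
    return len(set(sensitive_values)) >= l

def fitness_score(sensitive_values, l):
    """Assumed Fscore: fraction of the required l distinct values present, capped at 1."""
    return min(1.0, len(set(sensitive_values)) / l)

def satisfies_constraint(sensitive_values, l):
    """The constraint holds exactly when the fitness score reaches 1."""
    return fitness_score(sensitive_values, l) == 1.0

# Bucket of size 6 with L = 4, as in the text (disease values are made up).
good_bucket = ["Flu", "Flu", "Flu", "Asthma", "Cancer", "Diabetes"]   # 4 distinct
bad_bucket = ["Flu", "Flu", "Flu", "Flu", "Asthma", "Cancer"]         # 3 distinct

good_score = fitness_score(good_bucket, 4)   # 1.0
bad_score = fitness_score(bad_bucket, 4)     # 0.75
```

With this definition a score below 1 also indicates how far a bucket is from meeting the constraint, which is one possible reading of "level of fulfillment".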
E. Constraint C: C is a privacy constraint under which D* should fulfill the slicing condition with L-diversity as explained above. Consider an L-diversity value of 4. Fscore should be 1 when the system fulfills the L-diversity condition. Some verification processes are carried out.
1) Verification for L diversity: To verify L-diversity we use the fitness score function. To check L-diversity, generate continuous similar values of the SA, i.e., insert the same disease repeatedly, and check for Fscore = 1. If L = m, return Fscore. If the anonymized view takes the data as inserted, privacy is breached; D* should take only data that fulfills L = m.
1. Generate continuous similar values of the SA
2. Check the privacy constraint and Fscore = 1;
3. If privacy breach
      Then early stop;
   Else
      Return (Fscore);
4. Exit
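The verification above can be tried out with a small Python sketch. It feeds a run of identical disease values into a bucket and admits a value only if the bucket can still reach L distinct values once it is full; rejected values are deferred, playing the role of the array list a[i]. The admission rule, the function name fill_bucket and the sample stream are assumptions made for illustration.

```python
def fill_bucket(stream, k, l):
    """Fill one bucket of size k so that it can still end with l distinct values.

    Returns (bucket, deferred); deferred holds every value that was not admitted.
    """
    bucket, deferred = [], []
    for value in stream:
        if len(bucket) == k:          # bucket already full, keep the rest for later
            deferred.append(value)
            continue
        candidate = bucket + [value]
        distinct = len(set(candidate))
        remaining = k - len(candidate)
        # Admit only if the remaining slots can still supply the missing distinct values.
        if distinct + remaining >= l:
            bucket.append(value)
        else:
            deferred.append(value)
    return bucket, deferred

# Continuous similar SA values followed by other diseases (made-up data).
stream = ["Flu"] * 5 + ["Asthma", "Cancer", "Diabetes"]
bucket, deferred = fill_bucket(stream, k=6, l=4)
```

With k = 6 and L = 4 the first three identical values are admitted without any effect on the outcome, which matches the n-m+1 = 3 rows that need no check; the fourth and fifth repeats are deferred instead of being taken as inserted.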
2) Verification of the strength of the system against the number of providers: To verify against the number of providers, add one more attribute, the provider, to the anonymized output data. This verification will show that our anonymization technique does not depend on the number of providers. The existing system, i.e., the provider-aware anonymization algorithm, depends on the database as well as the providers.
1. Generate values of the SA by providers = 1..n
2. Check the privacy constraint and Fscore = 1 with respect to the number of providers
3. If privacy breach
      Then early stop;
   Else
      Return (Fscore);
4. Exit.
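One way to check a bucket with respect to the providers is sketched below in Python. It assumes the m-privacy definition of [1]: for every coalition of at most m providers, the records that do not belong to the coalition must still satisfy the constraint, which is taken here to be distinct L-diversity. The brute-force enumeration, the function name is_m_private and the sample data are illustrative assumptions, not the verification procedure of this paper.

```python
from itertools import combinations

def is_m_private(bucket, m, l):
    """Brute-force m-privacy check of one bucket w.r.t. distinct l-diversity.

    bucket : list of (provider, sensitive_value) pairs
    m      : maximum number of colluding providers
    l      : required number of distinct sensitive values
    """
    providers = sorted({p for p, _ in bucket})
    for size in range(0, min(m, len(providers)) + 1):
        for coalition in combinations(providers, size):
            # Records the coalition does not already know.
            remaining = [v for p, v in bucket if p not in coalition]
            if len(set(remaining)) < l:
                return False  # early stop: privacy breach for this coalition
    return True

# Hypothetical bucket with the provider added as an extra attribute.
bucket = [
    ("P1", "Flu"), ("P1", "Asthma"),
    ("P2", "Cancer"), ("P2", "Diabetes"),
    ("P3", "Malaria"), ("P3", "Flu"),
]

ok_single = is_m_private(bucket, m=1, l=3)   # any single provider removed
ok_pair = is_m_private(bucket, m=2, l=3)     # any two providers colluding
```

The enumeration grows quickly with the number of providers, so it is only suitable for small examples.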
VII. DATA SET
We conduct extensive workload experiments. Our results confirm that slicing preserves much better data
utility than generalization.
In workloads involving the sensitive attribute, slicing is also more effective than bucketization. In some
classification experiments, slicing shows better performance than using the original data.
VIII. EXPECTED OUTCOME
The output data table shows secure data publishing that provides privacy for the sensitive attribute. The comparison shows the difference in computation time between the encryption algorithm, the provider-aware algorithm and the slicing with data privacy algorithm.
IX. CONCLUSION
We consider a potential attack on collaborative data publishing. We used the slicing algorithm for anonymization and L-diversity, and verified it for security and privacy by using the binary algorithm of data privacy. The slicing algorithm is very useful for high-dimensional data; it divides data in both vertical and horizontal fashion. Encryption can increase security, but the limitation is that there could be a loss of data utility. The system can be used in many applications: hospital management systems; industrial areas where sensitive data such as employee salaries must be protected; pharmaceutical companies, where the sensitive data may be the combination of ingredients of medicines; and the banking sector, where the sensitive data is the customer's account number. It can also be used in the military area, where data is gathered from different sources and needs to be secured from each other to maintain privacy. The proposed system helps to improve data privacy and security when data is gathered from different sources and the output should be collaborative.
X. REFERENCES
[1] S. Goryczka, L. Xiong, and B. C. M. Fung, "m-Privacy for collaborative data publishing," in Proc. of the 7th Intl. Conf. on Collaborative Computing: Networking, Applications and Worksharing, 2011.
[2] C. Dwork, "Differential privacy: a survey of results," in Proc. of the 5th Intl. Conf. on Theory and Applications of Models of Computation, 2008, pp. 1-19.
[3] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-preserving data publishing: A survey of recent developments," ACM Comput. Surv., vol. 42, pp. 14:1-14:53, June 2010.
[4] C. Dwork, "A firm foundation for private data analysis," Commun. ACM, vol. 54, pp. 86-95, January 2011.
[5] N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. Lee, "Centralized and distributed anonymization for high-dimensional healthcare data," ACM Trans. on Knowl. Discovery from Data, vol. 4, no. 4, pp. 18:1-18:33, October 2010.
[6] W. Jiang and C. Clifton, "Privacy-preserving distributed k-anonymity," in DBSec, vol. 3654, 2005, pp. 924-924.
[7] W. Jiang and C. Clifton, "A secure distributed framework for achieving k-anonymity," VLDB J., vol. 15, no. 4, pp. 316-333, 2006.
[8] O. Goldreich, Foundations of Cryptography: Volume 2, 2004.
[9] Y. Lindell and B. Pinkas, "Secure multiparty computation for privacy-preserving data mining," The Journal of Privacy and Confidentiality, vol. 1, no. 1, pp. 59-98, 2009.
[10] P. Samarati, "Protecting respondents' identities in microdata release," IEEE TKDE, vol. 13, no. 6, pp. 1010-1027, 2001.