An Efficient Technique To Secure Data Access For Multiple Domains Using Overlapping Slicing

BY
Prof. Rani V. Ingawale
Dr. Somnath B. Thigale
Abstract
Data mining is the process of analysing data from different perspectives, summarizing
it, and extracting the needed information from a database. Most enterprises collect
and store data in large databases. Protecting database privacy is an important
responsibility of organizations, because clients trust them to protect their sensitive
information. Various anonymization techniques have been proposed to protect the privacy
of sensitive microdata. Generalization loses a considerable amount of information,
especially for high-dimensional data. Bucketization does not prevent membership
disclosure and does not apply to data that lack a clear separation between
quasi-identifying attributes and sensitive attributes. Slicing is a technique that
anonymizes a published dataset by partitioning it vertically and horizontally. The
proposed technique increases the utility and privacy of a sliced dataset by allowing
overlapped slicing while still preventing membership disclosure. It also provides
secure data access for multiple domains. This approach uses overlapped slicing to
preserve data utility and privacy better than traditional slicing.
Keywords: Data anonymization, Privacy preservation, Data publishing, Data security.
Category:
• Data Mining.
⇒ Security and Access Control.

⇒ Secure Access Control.
⇒ Using Overlapping slicing.
Contents
Certificate
Acknowledgement
List of Publications
Abstract
Contents
List of Figures
List of Tables
Abbreviations
1 Introduction
1.1 Privacy-Preserving in Data Mining
1.2 Privacy-Preserving in Data Publishing
2 Literature Review
2.1 Various Anonymization Techniques
2.1.1 Generalization
2.1.2 Bucketization
2.1.3 Slicing
2.1.4 Improved Slicing
2.2 Privacy Threats
2.2.1 Membership Disclosure Protection
2.2.2 Identity Disclosure Protection
2.2.3 Attribute Disclosure Protection
3 Problem Description and Specification
3.1 Problem Statement
3.2 Problem Solution
3.3 Goals and Objectives
3.3.1 Goals
3.3.2 Objective
3.4 Statement of Scope
4 Dissertation Plan
4.1 Timeline of Project
5 Software Requirement Specifications
5.1 Purpose
5.2 Design and Implementation constraints
5.2.1 Overlapped Slicing
5.3 Assumptions and Dependencies
5.4 Usability
5.5 Interfaces
5.6 Feasibility Study
5.6.1 Technical Feasibility
5.6.2 Operational Feasibility
5.6.3 Economic Feasibility
5.7 Risk Analysis and Projection Table
5.7.1 Project Risk
5.7.2 Risk Table
5.8 Effort and Cost Estimation
5.8.1 Efforts
5.8.2 Development Time
5.8.3 Number of People
5.8.4 Cost
5.9 Technical Specification
6 System Design and Implementation
6.1 System Architecture
6.2 System Overview
6.3 UML Diagrams
6.3.1 Data Flow Diagram
6.3.2 Class Diagram
6.3.3 Use Case Diagram
6.3.4 Sequence Diagram
6.3.5 State Machine Diagram
6.4 Mathematical Model
6.5 Algorithmic Strategy
6.6 Time Complexity
6.7 Certainty Analysis
7 Testing
7.1 Introduction
7.2 Objective
7.3 Testing Strategy
7.3.1 Unit Testing
7.3.2 Integration Testing
8 Results and Discussion
8.1 Dataset
8.2 Result
9 Conclusion and Future Work
9.1 Conclusion
9.2 Future Work
Reference
Appendix
cPGCON Attendee Certificate
cPGCON Presentee Certificate
cPGCON Review
cPGCON Paper
IJRITCC Certificate
IJRITCC Review
IJRITCC Journal Paper
Plagiarism Summary
List of Figures
1 Data Collection and Data Publication
2 Simple Model of PPDP
3 The Timeline for Seminar III
4 The Timeline for Project Stage I
5 The Timeline for Project Stage II
6 System Model
7 DFD Level 0
8 DFD Level 1
9 Class Diagram
10 Use Case Diagram
11 Sequence Diagram
12 State Machine Diagram
13 Login Page
14 Load Database
15 Result for Medical Domain
16 Result for Political Domain
17 Result for Education Domain
List of Tables
1 The original table
2 The generalized table
3 The bucketized table
4 The Sliced table
5 Comparison of anonymization techniques
6 Summary of Literature Review
7 The Sliced table
8 The Overlapped Slicing table with Education Domain
9 The Overlapped Slicing table with Medical Domain
10 The Overlapped Slicing table with Crime Domain
11 The Overlapped Slicing table with Political-Opinion Domain
12 Test Cases 1
13 Test Cases 2
Abbreviations
QI : Quasi-Identifiers
SA : Sensitive Attributes
QID : Quasi-Identifier Distance
F : Female
M : Male
PPDP : Privacy Preserving Data Publishing
SRS : Software Requirement Specifications
KDD : Knowledge Discovery in Databases
GUI : Graphical User Interface
UML : Unified Modeling Language
STP : Software Test Plan
RMMM : Risk Mitigation, Monitoring and Management
LOC : Lines of Code
1 Introduction
Data mining is the process of evaluating data from different views, summarizing it into
valuable information, and extracting the needed information from a database. Nowadays,
information collected by various organizations and governments is used for
knowledge-based decision making and for analysis. There is therefore a need to share
the collected information, but the data in its original form contains sensitive
information. People and organizations do not want their sensitive information to be
disclosed, and sharing the data in its original form would violate individual privacy.
To prevent this violation, a technique is needed that publishes the data in such a way
that privacy is preserved and data analysis can still be done effectively. Privacy
preservation in data publication has been studied extensively in recent years.
Published data consists of records, each of which contains information about an
individual entity, such as a person, a household, or an organization. Several
microdata anonymization techniques have been proposed.
1.1 Privacy-Preserving in Data Mining
Generally, when people talk about privacy, they mean that they do not want personal
information to be disclosed. As long as data is not misused, most people do not feel
their privacy has been violated. The problem is that once information is released, it
may be impossible to prevent misuse. Using this distinction, the aim is to ensure that
data mining does not enable misuse of personal information. From a collection of data,
it is possible to learn things that are not revealed by any individual data item. An
individual may not care about someone knowing their birth date, mother’s maiden name,
or social security number, but knowing all of them enables identity theft. This type
of privacy problem arises with large, multi-individual collections as well. A technique
that guarantees that no individual data is revealed may still release information
describing the collection as a whole. Such corporate information is generally the goal
of data mining, but some results may still lead to concerns. The difference between
such corporate privacy issues and individual privacy is not that significant [5].
A person may wish to protect the personal information in their medical records, which
is sensitive at a certain security level. This may be because they are concerned that
it might affect their insurance coverage or employment, or because they would not wish
others to know about medical or psychological conditions or treatments that would be
embarrassing. Revealing medical data could also reveal other details about one’s
personal life. Physicians and psychiatrists in many cultures and countries have
standards for doctor-patient relationships, which include maintaining confidentiality.
In some cases the physician-patient privilege is legally protected. These practices
are in place to protect the dignity of patients and to ensure that patients feel free
to reveal the complete and accurate information required for them to receive the
correct treatment.
A typical scenario of data collection and publishing is described in Figure 1 [5]. In
the data collection phase, the data holder collects data from record owners (e.g., Alice
and Bob). In the data publishing phase, the data holder releases the collected data to
a data miner or data recipient, who then will conduct data mining on the published
data.
Figure 1: Data Collection and Data Publication

1.2 Privacy-Preserving in Data Publishing
Privacy-preserving techniques study different transformation methods associated with
privacy. These techniques include methods such as randomization, k-anonymity, and
l-diversity.
Figure 2: Simple Model of PPDP
The most basic form of Privacy-Preserving Data Publishing (PPDP) is shown in
Figure 2 [5]. The data holder has a table of the form D(Explicit Identifier, Quasi
Identifier, Sensitive Attributes, Non-Sensitive Attributes), where Explicit Identifier
is a set of attributes, such as name and social security number, containing information
that explicitly identifies record owners. Quasi Identifier is a set of attributes that
could potentially identify record owners. Sensitive Attributes consist of
person-specific information such as disease, salary, and disability status.
Non-Sensitive Attributes contains all attributes that do not fall into the previous
three categories. Most works assume that the four sets of attributes are disjoint and
that each record in the table represents a distinct record owner. Anonymization refers
to the PPDP approach that seeks to hide the identity and the sensitive data of record
owners, assuming that sensitive data must be retained for data analysis.
Many organizations collect information that is used for knowledge-based decision making
and analysis, so there is a need to share it. However, the data holds some sensitive
information, and people do not want their sensitive information to be revealed.
Sharing data in its original form would thus violate individual privacy. To prevent
this violation, a technique is needed that publishes the data in such a way that
privacy is preserved and data analysis can still be done effectively. Microdata holds
information about an individual entity, such as a person or an organization. The most
popular anonymization techniques are generalization and bucketization. Here, data
attributes are partitioned into three categories:
1. Identifier attributes: Attributes that can uniquely identify an individual, like
Name or Social Security Number.
2. Quasi-Identifiers (QI): The set of attributes that can be linked with publicly
available datasets to reveal personal identity, e.g., Birth date, Gender, and Zipcode.
3. Sensitive Attributes (SA): Attributes that contain private personal information,
like Disease, political opinion, or crime.
Multiple domains contain multiple sensitive attributes, and slicing has been proposed
as an anonymization technique to protect this sensitive information. The basic idea of
slicing is to break the links across columns while preserving the links within each
column. With multiple sensitive attributes, slicing preserves better utility than
generalization and bucketization and reduces the dimensionality of the data. Overlapped
slicing increases the utility and privacy of a sliced dataset with multiple sensitive
attributes in different domains.
2 Literature Review
2.1 Various Anonymization Techniques
Two widely studied data anonymization techniques are generalization and bucketization.
Table 1 shows the data collected from the record owners.
Table 1: The original table

Age   Gender   Zipcode   Disease
22    M        47906     cancer
22    F        47906     flu
33    F        47905     flu
52    F        47905     hypertension
54    M        47302     flu
60    M        47302     cancer
60    M        47304     cancer
64    F        47304     diabetes
2.1.1 Generalization
Generalization is one of the most common anonymization approaches. It replaces
quasi-identifier values with values that are less specific but semantically consistent.
All quasi-identifier values in a group are then generalized to the entire group extent
in the QID space. If at least two transactions in a group have distinct values in a
certain column, then all information about that item in the current group is lost. The
QID used in this process includes all possible items in the log. Because of the high
dimensionality of the quasi-identifier, with the number of possible items in the order
of thousands, it is likely that any generalization method would incur extremely high
information loss, rendering the data useless.
For generalization to be effective, records in the same bucket must be close to each
other so that generalizing the records does not lose too much information. However, in
high-dimensional data, most data points have similar distances to each other. To
perform data analysis or data mining tasks on the generalized table, the data analyst
has to make the uniform distribution assumption that every value in a generalized
interval or set is equally possible, as no other distribution assumption can be
justified. This significantly reduces the utility of the generalized data. In addition,
because each attribute is generalized separately, correlations between different
attributes are lost. To study attribute correlations on the generalized table, the data
analyst has to assume that every possible combination of attribute values is equally
possible. This is an inherent problem of generalization that prevents effective
analysis of attribute correlations [3].
Table 2: The generalized table

Age       Gender   Zipcode   Disease
[20-52]   *        4790*     cancer
[20-52]   *        4790*     flu
[20-52]   *        4790*     flu
[20-52]   *        4790*     hypertension
[54-64]   *        4730*     flu
[54-64]   *        4730*     cancer
[54-64]   *        4730*     cancer
[54-64]   *        4730*     diabetes
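
The step from Table 1 to Table 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the two buckets are taken as the first and last four rows of Table 1, and the age interval uses the exact minimum and maximum of each bucket rather than rounded bounds.

```python
import os

# Rows follow Table 1: (age, gender, zipcode, disease).
TABLE_1 = [
    (22, "M", "47906", "cancer"),
    (22, "F", "47906", "flu"),
    (33, "F", "47905", "flu"),
    (52, "F", "47905", "hypertension"),
    (54, "M", "47302", "flu"),
    (60, "M", "47302", "cancer"),
    (60, "M", "47304", "cancer"),
    (64, "F", "47304", "diabetes"),
]


def mask_to_common_prefix(values):
    """Keep the prefix shared by all values and pad the rest with '*'."""
    prefix = os.path.commonprefix(values)
    return prefix + "*" * (len(values[0]) - len(prefix))


def generalize_bucket(bucket):
    """Replace each quasi-identifier by a value covering the whole bucket."""
    ages = [row[0] for row in bucket]
    age_range = f"[{min(ages)}-{max(ages)}]"
    genders = {row[1] for row in bucket}
    gender = genders.pop() if len(genders) == 1 else "*"
    zipcode = mask_to_common_prefix([row[2] for row in bucket])
    # The sensitive attribute (disease) is published unchanged.
    return [(age_range, gender, zipcode, row[3]) for row in bucket]


if __name__ == "__main__":
    for bucket in (TABLE_1[:4], TABLE_1[4:]):
        for row in generalize_bucket(bucket):
            print(row)
```

Every record in a bucket ends up with identical quasi-identifier values, which is why an analyst working on the generalized table can only assume a uniform distribution inside each interval.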
2.1.2 Bucketization
In bucketization, the tuples in T are partitioned into buckets, and the sensitive
attribute is then separated from the non-sensitive ones by randomly permuting the
sensitive attribute values within each bucket. The sanitized data then consists of the
buckets with permuted sensitive values, as in Table 3. Bucketization is used here as
the method of constructing the published data from the original table T, although all
results hold for full-domain generalization as well. Within each bucket, an independent
random permutation is applied to the column containing the S-values. The resulting set
of buckets, denoted by B, is then published [3].
Table 3: The bucketized table

Age   Gender   Zipcode   Disease
22    M        47906     flu
22    F        47906     cancer
33    F        47905     hypertension
52    F        47905     flu
54    M        47302     diabetes
60    M        47302     flu
60    M        47304     cancer
64    F        47304     cancer
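
The permutation step behind Table 3 can be sketched as follows. The function name, the fixed bucket size of four, and the seed parameter are assumptions made for illustration; the source does not prescribe how buckets are formed.

```python
import random
from collections import Counter

# Rows follow Table 1: (age, gender, zipcode, disease).
TABLE_1 = [
    (22, "M", "47906", "cancer"),
    (22, "F", "47906", "flu"),
    (33, "F", "47905", "flu"),
    (52, "F", "47905", "hypertension"),
    (54, "M", "47302", "flu"),
    (60, "M", "47302", "cancer"),
    (60, "M", "47304", "cancer"),
    (64, "F", "47304", "diabetes"),
]


def bucketize(records, bucket_size, seed=None):
    """Split records into consecutive buckets and shuffle the sensitive
    value (last field) independently inside each bucket."""
    rng = random.Random(seed)
    published = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]
        sensitive = [row[-1] for row in bucket]
        rng.shuffle(sensitive)
        # Quasi-identifiers stay in place; only the sensitive column moves.
        published.append([row[:-1] + (value,)
                          for row, value in zip(bucket, sensitive)])
    return published


if __name__ == "__main__":
    for bucket in bucketize(TABLE_1, bucket_size=4, seed=7):
        print(Counter(row[-1] for row in bucket))
        for row in bucket:
            print(row)
```

Because the quasi-identifier values are published exactly as collected, anyone who knows a person's quasi-identifiers can tell whether that person is in the table, which is the membership disclosure weakness noted for bucketization.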
2.1.3 Slicing
Slicing is a novel data anonymization technique. Slicing partitions the data set both
vertically and horizontally. Vertical partitioning is done by grouping attributes into
columns based on the correlations among the attributes. Each column contains a subset
of attributes that are highly correlated. Horizontal partitioning is done by grouping
tuples into buckets. Finally, within each bucket, values in each column are randomly
permuted (or sorted) to break the linking between different columns. [2]
Table 4: The Sliced table

(Age,Gender)   (Zipcode,Disease)
(22,M)         (47905,flu)
(22,F)         (47906,cancer)
(33,F)         (47905,hypertension)
(52,F)         (47906,flu)
(54,M)         (47304,diabetes)
(60,M)         (47302,flu)
(60,M)         (47302,cancer)
(64,F)         (47304,cancer)
The basic idea of slicing is to break the associations across columns while preserving
the associations within each column. This reduces the dimensionality of the data and
preserves better utility than generalization and bucketization. Slicing preserves
utility because it groups highly correlated attributes together and preserves the
correlations between such attributes. Slicing achieves privacy because it breaks the
associations between uncorrelated attributes. When the data set contains
quasi-identifiers and one sensitive attribute, bucketization has to break their
correlation. Slicing, in contrast, can group some quasi-identifier attributes with the
sensitive attribute, preserving attribute correlations with the sensitive attribute.
Slicing first partitions attributes into columns, each of which contains a subset of
attributes, and then partitions tuples into buckets, each of which contains a subset of
tuples. Within each bucket, values in each column are randomly permuted to break the
linking between different columns.
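
The two partitioning steps and the final permutation can be sketched as below. The column grouping (Age, Gender) and (Zipcode, Disease) follows Table 4; the function name, the consecutive buckets of four tuples, and the seed parameter are assumptions for illustration, and the correlation-based choice of columns is not shown.

```python
import random

# Rows follow Table 1: (age, gender, zipcode, disease).
TABLE_1 = [
    (22, "M", "47906", "cancer"),
    (22, "F", "47906", "flu"),
    (33, "F", "47905", "flu"),
    (52, "F", "47905", "hypertension"),
    (54, "M", "47302", "flu"),
    (60, "M", "47302", "cancer"),
    (60, "M", "47304", "cancer"),
    (64, "F", "47304", "diabetes"),
]

# Attribute indices per column, as in Table 4.
COLUMNS = [(0, 1), (2, 3)]


def slice_table(records, columns, bucket_size, seed=None):
    """Vertical partition into columns, horizontal partition into buckets,
    then an independent shuffle of every column inside every bucket."""
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]
        permuted_columns = []
        for column in columns:
            values = [tuple(row[i] for i in column) for row in bucket]
            rng.shuffle(values)  # breaks the link to the other columns
            permuted_columns.append(values)
        # Re-assemble rows from the independently shuffled columns.
        sliced.append(list(zip(*permuted_columns)))
    return sliced


if __name__ == "__main__":
    for bucket in slice_table(TABLE_1, COLUMNS, bucket_size=4, seed=3):
        for row in bucket:
            print(row)
        print("---")
```

Values inside a column tuple, such as (Zipcode, Disease), always travel together, so their correlation survives, while the pairing between the two columns is randomized within each bucket.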
2.1.4 Improved Slicing
Improved slicing is a data anonymization model that addresses the limitations of
slicing. The major contributions of this model are the use of an overlapped clustering
technique in the attribute partitioning phase and the use of an alternative tuple
partitioning algorithm. Improved slicing works by first finding the correlations
between each pair of attributes and then clustering these attributes into columns by
overlapped clustering on the basis of their association coefficients. The dataset is
then horizontally partitioned into buckets satisfying l-diversity using a novel tuple
partitioning algorithm. The columns within each bucket are then arbitrarily permuted
with respect to one another to give an enhanced sliced dataset [8].
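
The l-diversity condition on buckets can be checked in several ways. The sketch below assumes the simplest variant, distinct l-diversity (every bucket holds at least l different sensitive values); the text does not say which variant the tuple partitioning algorithm in [8] uses, and the function and parameter names are hypothetical.

```python
# Buckets follow Table 1, with the disease as the last field of each row.
BUCKETS = [
    [(22, "M", "47906", "cancer"), (22, "F", "47906", "flu"),
     (33, "F", "47905", "flu"), (52, "F", "47905", "hypertension")],
    [(54, "M", "47302", "flu"), (60, "M", "47302", "cancer"),
     (60, "M", "47304", "cancer"), (64, "F", "47304", "diabetes")],
]


def is_distinct_l_diverse(buckets, min_distinct, sensitive_index=-1):
    """True if every bucket has at least `min_distinct` different
    sensitive values (distinct l-diversity, an assumed variant)."""
    return all(
        len({row[sensitive_index] for row in bucket}) >= min_distinct
        for bucket in buckets
    )


if __name__ == "__main__":
    for level in (2, 3, 4):
        print(level, is_distinct_l_diverse(BUCKETS, level))
```

Both buckets of the running example contain three different diseases, so the check passes for l up to 3 and fails for l = 4.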
Table 5: Comparison of anonymization techniques

Sr. No. | Generalization | Bucketization | Slicing
1 | Replaces quasi-identifier values with values that are less specific but semantically consistent | Partitions the tuples in T into buckets and then separates the sensitive attribute from the non-sensitive ones | Partitions the data set both vertically and horizontally
2 | Loses a considerable amount of information | Does not prevent membership disclosure | Preserves better data utility than generalization and can be used for membership disclosure protection
3 | Does not handle high-dimensional data | Handles high-dimensional data but requires a clear separation between quasi-identifiers and sensitive attributes to break their correlation | Handles high-dimensional data; groups some QI attributes with the SA, preserving attribute correlations with the sensitive attribute
2.2 Privacy Threats
When publishing microdata, there are three types of information disclosure threats.
2.2.1 Membership Disclosure Protection
The first type is membership disclosure. When the data to be published is selected
from a larger dataset and the selection conditions are sensitive, it is important to
prevent an attacker from learning whether an individual’s record is in the data or not.
2.2.2 Identity Disclosure Protection
The second type is identity disclosure, which occurs when an individual is linked
to a particular record in the released table. In some situations, one wants to protect
against identity disclosure when the attacker is uncertain of membership.
2.2.3 Attribute Disclosure Protection
The third type is attribute disclosure, which occurs when new information about some
individuals is revealed. That is, the released data makes it possible to infer the
attributes of an individual more accurately than would be possible before the release.
As in the case of identity disclosure, one must consider attackers who already know
the membership information. Identity disclosure often leads to attribute disclosure:
once there is identity disclosure, an individual is re-identified and the corresponding
sensitive value is revealed. Attribute disclosure can occur with or without identity
disclosure, for example when the sensitive values of all matching tuples are the same.
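
The last case, attribute disclosure without identity disclosure, can be made concrete with a small sketch. The published rows below are hypothetical and are not data from this work; the function names are likewise assumptions.

```python
# Hypothetical generalized release: age range, zipcode prefix, disease.
PUBLISHED = [
    {"age": (20, 29), "zip_prefix": "4790", "disease": "flu"},
    {"age": (20, 29), "zip_prefix": "4790", "disease": "flu"},
    {"age": (50, 64), "zip_prefix": "4730", "disease": "cancer"},
    {"age": (50, 64), "zip_prefix": "4730", "disease": "diabetes"},
]


def candidate_sensitive_values(published, age, zipcode):
    """Sensitive values of all published rows that match the target's
    quasi-identifiers."""
    matches = [
        row for row in published
        if row["age"][0] <= age <= row["age"][1]
        and zipcode.startswith(row["zip_prefix"])
    ]
    return {row["disease"] for row in matches}


def attribute_disclosed(published, age, zipcode):
    """The attacker learns the sensitive value if all matches agree."""
    return len(candidate_sensitive_values(published, age, zipcode)) == 1


if __name__ == "__main__":
    print(attribute_disclosed(PUBLISHED, 25, "47906"))
    print(attribute_disclosed(PUBLISHED, 54, "47302"))
```

For the 25-year-old target, two rows match and the attacker cannot tell which one belongs to the target, yet both rows carry the same disease, so the sensitive value is learned anyway.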
The following table summarizes the main papers reviewed.
Table 6: Summary of Literature Review

Sr. No. | Title | Authors | Approach | Description
1 | Enhanced Slicing Models For Preserving Privacy In Data Publication | S.Kiruthika, Dr.M.Mohamed Raseen | Privacy preservation | Thus utility is maintained with minimum loss by suppressing only very few values and privacy is maintained by random permutation.
2 | A Survey of Transaction Data Anonymous publication | Lan Sun, Yilei Wang, Yingjie Wu. | Privacy preservation | Summarize and evaluate different anonymous approaches for transactional data publication.
3 | Rating: Privacy Preservation for Multiple Attributes with Different Sensitivity Requirements | Jinfei Liu, Jun Luo and Joshua Zhexue Huang | Privacy preservation | Provides a detailed analysis of various approaches in web personalization.
4 | Slicing: A New Approach to Privacy Preserving Data Publishing | Tiancheng Li, Ninghui Li, Jian Zhang, Ian Molloy | Privacy preservation | Show that slicing preserves better data utility than generalization and can be used for membership disclosure protection.
5 | K-anonymity on Sensitive Transaction Items | Shyue-Liang Wang, Yu-Chuan Tsai, Hung-Yu Kao and Tzung-Pei Hong. | Privacy preservation | Anonymity concept on transactional data with quasi-identifier items and sensitive items (SI).
6 | Relationship Privacy Preservation in Publishing Online Social Networks | Na Li, Nan Zhang, Sajal K. Das | Privacy preservation | Intend to preserve relationship privacy between two users one of whom can even be identified in the released OSN data.
7 | A fast p-sensitive l-diversity Anonymisation algorithm | B.K.Tripathy, A.Maity, B.Ranajit, D.Chowdhuri | Privacy preservation | To develop an l-diversity algorithm to handle multi-sensitive attributes in databases.
8 | Privacy Preservation in Transaction Databases based on Anatomy technique | Yingjie Wu, | Privacy preservation | Heuristics are derived which could be useful in simplifying users navigational needs by providing shortcuts on demand to them and also provide user redirects to popular pages in the website.
9 | Decomposition: Privacy Preservation for Multiple Sensitive Attributes | Yang Ye, Yu Liu, Dapeng Lv, and Jianhua Feng | Privacy preservation | Decomposition, to tackle privacy preservation in the MSA case.
10 | t-Closeness: Privacy Beyond k-Anonymity and l-Diversity | Ninghui Li, Tiancheng Li | Privacy preservation | Discuss the rationale for t-closeness and illustrate its advantages through examples and experiments.
11 | A Review of Privacy Preserving Data Publishing Technique | Amar Paul Singh and Ms. Dhanshri Parihar | Privacy preservation | Focus on effective method that can be used for providing better data utility and can handle high-dimensional data.
12 | Injector: Mining Background Knowledge for Data Anonymization | Tiancheng Li, Ninghui Li | Privacy preservation | Show that Injector reduces privacy risks against background knowledge attacks while improving data utility.
13 | Anonymous Publication of Sensitive Transactional Data | Gabriel Ghinita | Privacy preservation | The data Transformation based on Gray code sorting performs best in terms of both data utility and execution time.
14 | Anonymization of Set-Valued Data via Top-Down, Local Generalization | Yeye He, Jeffrey F. Naughton | Privacy preservation | A top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric.
3 Problem Description and Specification

3.1 Problem Statement
The purpose of an anonymization technique is to secure published data. The problems
with existing techniques can be listed as:
1. Generalization loses information, especially in high-dimensional data.
2. Bucketization does not prevent membership disclosure and does not apply to data that
do not have a clear separation between sensitive attributes and QI attributes.
3.2 Problem Solution
The proposed approach overcomes the above limitations by using slicing. Slicing
with multiple sensitive attributes across multiple columns protects data from
membership disclosure. It improves efficiency and protects data better than other
anonymization techniques. Overlapped slicing provides greater data utility and secure
data access for multiple domains.
3.3 Goals and Objectives

3.3.1 Goals
1. The proposed technique achieves both data privacy and data utility.
2. In the proposed technique, the values of sensitive attributes are duplicated in more than one column.
3. The technique provides secure data access to multiple domains.
3.3.2 Objectives
1. Overlapping slicing shows better performance than traditional techniques such as slicing, bucketization and generalization.
2. Sensitive attributes are partitioned both horizontally and vertically; therefore more attribute correlation is preserved and the utility of the data is increased.
3. Overlapping slicing is a promising technique for handling high-dimensional data. It increases the correlation among the released attributes while privacy is preserved.
3.4 Statement of Scope
Anonymization is a powerful method for preserving the privacy of published data.
A new anonymization method, overlapping slicing, is used here for privacy-preserving
data publishing. Slicing overcomes the limitations of generalization and bucketization
and preserves better utility while protecting against privacy threats. Overlapping
slicing with multiple sensitive attributes shows how slicing can be used to prevent
attribute disclosure.
4 Dissertation Plan
This plan is the basis for the execution and tracking of all the project activities. It
shall be used throughout the life of the project and shall be kept up to date to reflect
the actual accomplishments and plans of the project. The schedule of the project work
is represented here in Gantt chart format. The planning of the project against the given
deadlines is shown in Figures 3, 4 and 5.
4.1 Timeline of Project
In Seminar III, most of the time was used for studying various anonymization
techniques for security in data mining and for their comparative analysis. Similarly,
in Stage I and Stage II the majority of the time was spent on module design and on
discussing the progress with the Guide.
Figure 3: The Timeline for Seminar III

Figure 4: The Timeline for Project Stage I
Figure 5: The Timeline for Project Stage II

5 Software Requirement Specifications

5.1 Purpose
The role of the Software Requirements Specification (SRS) document is to explain the
external behaviour of the system. The requirements specification describes the
operations, performance, interfaces and quality assurance requirements of the system.
The SRS also describes the non-functional requirements, such as the user interfaces,
along with the design constraints that are to be considered while designing the system.
It includes the factors necessary to provide a complete and comprehensive description
of the system requirements.
5.2 Design and Implementation constraints

5.2.1 Overlapped Slicing
The idea of slicing is to release correlated attributes together, which adds to the
utility of the anonymized dataset. Assigning an attribute to more than one column
therefore releases more attribute correlations and improves the utility of the released
dataset. Table 7 shows the overlapped slicing technique. In plain slicing, Age is
grouped with Disease and Occupation is grouped with Gender. Even if Occupation also
has a fairly high correlation with Disease while Gender does not, these attributes
cannot be merged into a larger group, so data utility is lost because the association
between Disease and Occupation is absent. In Table 7, the attributes Occupation and
Disease appear in more than one column, that is, they overlap. This allows highly
associated attributes to be grouped together. It also solves the problem of singular
columns, because an attribute with a low correlation is merged with associated
attributes in a different column instead of being left alone. The idea of Overlapping
Correlation Clustering [7] was suggested by F. Bonchi et al. It can be applied to the
attribute partitioning phase of the slicing algorithm.
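The overlapped column layout of Table 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `overlapped_slice`, the fixed-size bucketing and the sample records are hypothetical and are not taken from the implementation described in this work; only the attribute names follow Table 7. Each column is a group of attributes, an attribute such as Disease may appear in more than one group, and the cells of every column are shuffled independently inside each bucket.

```python
import random

def overlapped_slice(records, columns, bucket_size, seed=0):
    """Return sliced rows; each row holds one cell (a tuple) per column group.

    Assumption: buckets are simply consecutive blocks of `bucket_size` records.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sliced = []
    for start in range(0, len(records), bucket_size):
        bucket = records[start:start + bucket_size]
        shuffled_columns = []
        for attrs in columns:
            # Project the bucket onto this (possibly overlapping) attribute group.
            cells = [tuple(rec[a] for a in attrs) for rec in bucket]
            rng.shuffle(cells)  # break the linkage between different columns
            shuffled_columns.append(cells)
        sliced.extend(zip(*shuffled_columns))
    return sliced

# Column groups as in Table 7: Disease and Occupation each occur twice.
columns = [("Gender", "Occupation"), ("Zip", "Education"),
           ("Age", "Disease"), ("Disease", "Occupation")]

# Hypothetical microdata records, invented for this example.
records = [
    {"Age": 32, "Gender": "M", "Zip": 12460, "Education": "10th",
     "Occupation": "Sales", "Disease": "Hypertension"},
    {"Age": 26, "Gender": "M", "Zip": 12578, "Education": "12th",
     "Occupation": "Army", "Disease": "Cancer"},
    {"Age": 20, "Gender": "F", "Zip": 12093, "Education": "Graduate",
     "Occupation": "Student", "Disease": "Flu"},
    {"Age": 29, "Gender": "M", "Zip": 12216, "Education": "Graduate",
     "Occupation": "Agriculture", "Disease": "Diabetes"},
]

for row in overlapped_slice(records, columns, bucket_size=2):
    print(row)
```

Within a bucket every column keeps exactly the same multiset of cells as the original data, so the correlation inside each attribute group is preserved while the linkage across groups is hidden.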
Table 7: The overlapped sliced table
(Gender, Occupation) | (Zip, Education) | (Age, Disease) | (Disease, Occupation)
(M, Sales) | (12460, 10th) | (32, Hypertension) | (Dyspepsia, Sales)
(M, Army) | (12578, 12th) | (26, Cancer) | (Flu, Student)
(F, Student) | (12093, Graduate) | (20, Flu) | (Hypertension, Army)
(M, Agriculture) | (12216, Graduate) | (29, Diabetes) | (Diabetes, Agriculture)
(F, Student) | (12589, PG) | (23, Flu) | (Cancer, Government)
(M, Government) | (12903, 12th) | (41, Cancer) | (Flu, Student)
5.3 Assumptions and Dependencies
Many data mining algorithms require the data as a data set in a horizontal or vertical
layout. A data set in such a layout, that is, in point-dimension, observation-variable
or instance-feature form, is the standard input required by most data mining
algorithms. In the system, all user entries are created by the admin. Each user belongs
to exactly one domain.
5.4 Usability
Usability is a non-functional requirement that specifies how easy and user-friendly the
system is to use. It describes how the system functionality is perceived by the user and
how efficiently tasks are carried out. The slicing technique improves the utility of
anonymized datasets, and overlapped slicing provides greater data utility than plain
slicing while satisfying l-diversity. Several factors decide the usability of a system,
such as ease of learning, task efficiency, understandability and subjective satisfaction.
The project is implemented using the above methods so that it achieves high ratings in
usability factors such as task efficiency and user satisfaction.
5.5 Interfaces
User Interface
The project is designed in Java so that the user can start data detection, recognition
and classification and obtain the output with a single button click. The user can
therefore interact with the system easily.
Software Interface
The software involved is the basic processing environment. In addition, user-created
library files for loading the database, detection, recognition and classification are used.
5.6 Feasibility Study
Feasibility analysis is important in order to determine whether the system is feasible
to develop or not. The development of a software-based system or product is often
constrained by the available software and hardware resources and by delivery dates.
The three prime areas of feasibility analysis are:
• Technical Feasibility
• Operational Feasibility
• Economic Feasibility
5.6.1 Technical Feasibility
Technical feasibility checks the technical requirements of the system. The new system
can be developed and implemented with small or no changes to the existing setup. The
study also investigates whether the necessary technology exists and whether the system
can be upgraded in future.
5.6.2 Operational Feasibility
Operational feasibility measures how well the proposed system solves the given problem
compared with the existing system. It takes advantage of the opportunities identified
during scope definition and checks that the requirements gathered in the analysis phase
of software development are fulfilled. A well-planned design allows software resources
to be utilized to improve performance. A reliable system can be implemented to provide
security and access control. If any authentication step fails, access is denied.
5.6.3 Economic Feasibility
Economic feasibility checks the economic factors that affect the system and estimates
the positive benefits that the system brings. It is used to assess the amount of
monetary investment required for the research as well as for the development work,
which needs to be justified. The Eclipse IDE and the JDK used here are both available
free of charge.
5.7 Risk Analysis and Projection Table
Risk management is required to anticipate and identify risks, to minimize their impact,
damage and loss, to reduce their probability, to monitor risk areas for early detection
and to ensure management awareness of risks. Risk management is the identification,
assessment and prioritization of risks. In this project, risk arises in case of system
failure or crashes.
5.7.1 Project Risk
The risks can be monitored in the following ways to prevent their occurrence and thus
minimize project failure. Risk projection, also called risk estimation, attempts to rate
each risk in two ways: the likelihood or probability that the risk is real, and the
consequences of the problems associated with the risk, should it occur. Risk projection
activities were performed, such as measuring the overall accuracy of the project and
taking regular backups of the database, so that a system crash or any other problem
can be resolved easily. Along with this, the following risks are addressed:
1. Modules are not completed in the given time
This risk can be eliminated by formulating a project plan, holding regular meetings
with the Guide and continuously checking the progress of the project work.
2. Insufficient Response
Nowadays internet access is a must for daily work in every field, and it is available
in any technical field.
3. Technology does not meet requirements
In the event that the database management technology does not allow all the features
expected of it, or the facilities expected of the programming language are not properly
supported, more time is needed to find an alternative solution to the problem. This
increases the duration of the project. This risk has been prevented by careful analysis
of the system requirements and feasibility before coding starts.
4. Required resources not available
This risk is due to mismanagement. It can be eliminated by properly examining the
requirements and their configurations in advance so that the required resources are
available on time.
5.7.2 Risk Table
RMMM stands for Risk Mitigation, Monitoring and Management. The RMMM plan
documents all the work performed as a part of risk analysis and is used by the project
manager as part of the overall project plan.
1. Mitigation: Multiple backup copies of the software in development and of all
documentation associated with it are kept in multiple locations. As part of mitigation,
regular backups of the database are taken.
2. Monitoring: Any change in the stability of the environment should be recognized
and taken seriously. Network connections are maintained from the first search for a
newly inserted query.
3. Management: If it becomes apparent that the project will not be completed on time,
this is resolved by creating a project schedule with a deadline for each module.
5.8 Effort and Cost Estimation

5.8.1 Efforts
The number of lines required for the implementation of the various modules is called
Lines of Code (LOC). It is expressed in thousands of lines (KLOC). The effort in
person-months is calculated using the following formula:
$$E = 3.2 \times \mathrm{KLOC}^{1.05} \tag{1}$$
$$E = 3.2 \times 3.3^{1.05} = 11.2$$
The effort is 11.2 person-months.

5.8.2 Development Time
The development time in months is calculated by the formula:
$$D = 2.5 \times E^{0.38} \tag{2}$$
$$D = 2.5 \times 11.2^{0.38} = 6.26$$
The development time is approximately 6.26 months, taken as 6.5 months in the
calculation below.
5.8.3 Number of People
The recommended number of people is calculated by the formula:
$$N = \frac{E}{D} \tag{3}$$
$$N = \frac{11.2}{6.5} = 1.72$$
The recommended number of persons for development is therefore approximately 2.
With 2 persons, the development time can be reduced to about 5 months.
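The three estimates above can be reproduced with a short script. This is a minimal sketch: the function name `estimate` and its parameter names are chosen only for this illustration, while the coefficients and the size of 3.3 KLOC are the values used in equations (1) to (3).

```python
import math

def estimate(kloc, a=3.2, b=1.05, c=2.5, d=0.38):
    """Return (effort in person-months, development time in months, people)."""
    effort = a * kloc ** b        # equation (1)
    dev_time = c * effort ** d    # equation (2)
    people = effort / dev_time    # equation (3)
    return effort, dev_time, people

effort, dev_time, people = estimate(3.3)
print(f"E = {effort:.2f} person-months")
print(f"D = {dev_time:.2f} months")
print(f"N = {people:.2f}, rounded up to {math.ceil(people)} persons")
```

Because the script uses the unrounded development time instead of 6.5 months, it reports N of about 1.8 rather than 1.72; both values round up to 2 persons.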
5.8.4 Cost
Considering the average salary of a software engineer/developer to be in the range of
Rs. 25,000 - 30,000 per month, the cost of the project is estimated at approximately
Rs. 80,000.
5.9 Technical Specification
Software Requirement
Basic software specifications are:
Operating System : Windows XP/7/8
Open Source Software : Eclipse
Technology : Java (JDK 1.8)
Front End : JSP and Servlets
Database Connectivity : PostgreSQL
Hardware Requirement
Basic hardware specifications:
Processor : At least Pentium i5 processor
RAM : 4 GB
Hard Disk : 2 GB
6 System Design and Implementation

System design is the process of defining the architecture, components, modules,
interfaces and data for a system to satisfy specified requirements.
6.1 System Architecture
Figure 6 represents the system architecture, which consists of the following modules:
• Extract the data set from the database.
• Perform the anonymization technique on the different domains.
• Compute the overlapped sliced table with multiple sensitive attributes of the
different domains.
• Combine the attributes and display the overlapped sliced data.
Figure 6: System Model

6.2 System Overview
Overlapped slicing is applied with multiple sensitive attributes spread over multiple
columns to protect the data from membership disclosure. It improves working efficiency
and protection compared with other anonymization techniques. Attributes that are
highly correlated are placed in the same column, which preserves the relationships
between such attributes. The relations between uncorrelated attributes are broken,
which provides better privacy because the associations between such attributes are less
frequent and potentially identifying. The system can be tested with high-dimensional
data to show that it works efficiently and gives better results than traditional
systems. The system has multiple domains with multiple sensitive attributes and
provides secure access to each of these domains. Table 8 shows the overlapped slicing
for the Education domain. Here the user has full access to the Education domain, while
the attributes of the other domains are released as overlapped columns.
Table 8: The Overlapped Slicing table with Education Domain
Age | Gender | Zipcode | Occupation | Education | (Political-Opinion, Disease) | (Disease, Crime)
20 | F | 12578 | Student | 12th | (Congress, Cancer) | (Cancer, Theft)
41 | M | 12589 | Govt. | PG | (BGP, Flu) | (Flu, Robbery)
26 | M | 12460 | Sales | 10th | (Aap, Cancer) | (Cancer, Null)
23 | F | 12216 | Student | Graduate | (Shivsena, Diabetes) | (HyperT, Null)
29 | M | 12903 | Agri. | 12th | (Congress, HyperT) | (Flu, Null)
32 | M | 12093 | Army | Graduate | (BGP, Flu) | (Diabetes, Theft)
Table 9 shows the overlapped slicing for the Medical domain. Here the user has full
access to the Medical domain, while the attributes of the other domains are released
as overlapped columns.
Table 9: The Overlapped Slicing table with Medical Domain
Age | Gender | Zipcode | Disease | (PoliOpin, Education) | (Occupation, Crime) | (Occupation, Education) | (PoliOpin, Crime)
20 | F | 12578 | Flu | (Congress, 10th) | (Sales, Theft) | (Govt., PG) | (Aap, Null)
41 | M | 12589 | Cancer | (BGP, 12th) | (Student, Robbery) | (Sales, 10th) | (Congress, Theft)
26 | M | 12460 | Cancer | (Aap, PG) | (Govt., Null) | (Student, 12th) | (BGP, Robbery)
23 | F | 12216 | Flu | (Shivsena, 12th) | (Army, Null) | (Agri., 12th) | (Shivsena, Theft)
29 | M | 12903 | Diabetes | (Congress, Graduate) | (Student, Null) | (Army, Graduate) | (Congress, Null)
32 | M | 12093 | HyperT | (BGP, Graduate) | (Agri., Theft) | (Student, Graduate) | (BGP, Null)
Table 10 shows the overlapped slicing for the Crime domain. Here the user has full
access to the Crime domain, while the attributes of the other domains are released as
overlapped columns.
Table 10: The Overlapped Slicing table with Crime Domain
Age | Gender | Zipcode | Crime | (PoliOpin, Education) | (Occupation, Disease) | (Occupation, Education) | (PoliOpin, Disease)
20 | F | 12578 | Robbery | (Congress, 10th) | (Sales, Cancer) | (Govt., PG) | (Aap, Cancer)
41 | M | 12589 | Null | (BGP, 12th) | (Student, Flu) | (Sales, 10th) | (Congress, Cancer)
26 | M | 12460 | Theft | (Aap, PG) | (Govt., Cancer) | (Student, 12th) | (BGP, Flu)
23 | F | 12216 | Null | (Congress, Graduate) | (Army, HyperT) | (Agri., 12th) | (Shivsena, Diabetes)
29 | M | 12903 | Theft | (Shivsena, 12th) | (Student, Flu) | (Army, Graduate) | (Congress, HyperT)
32 | M | 12093 | Null | (BGP, Graduate) | (Agri., Diabetes) | (Student, Graduate) | (BGP, Flu)
Table 11 shows the overlapped slicing for the Political domain. Here the user has full
access to the Political domain, while the attributes of the other domains are released
as overlapped columns.
Table 11: The Overlapped Slicing table with Political-Opinion Domain
Age | Gender | Zipcode | Political-Opinion | (Crime, Education) | (Occupation, Disease) | (Occupation, Education) | (Disease, Crime)
20 | F | 12578 | BGP | (Theft, 10th) | (Sales, Cancer) | (Govt., PG) | (Cancer, Theft)
41 | M | 12589 | Aap | (Robbery, 12th) | (Student, Flu) | (Sales, 10th) | (Flu, Robbery)
26 | M | 12460 | Congress | (Null, PG) | (Govt., Cancer) | (Student, 12th) | (Cancer, Null)
23 | F | 12216 | BGP | (Null, Graduate) | (Army, HyperT) | (Agri., 12th) | (HyperT, Null)
29 | M | 12903 | Shivsena | (Theft, 12th) | (Student, Flu) | (Army, Graduate) | (Flu, Null)
32 | M | 12093 | Congress | (Null, Graduate) | (Agri., Diabetes) | (Student, Graduate) | (Diabetes, Theft)
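The per-domain views of Tables 8 to 11 can be summarized as a lookup from a user's domain to a column layout. The sketch below is an assumption about how such a lookup could be organized, not the implementation of this work: the names `DOMAIN_LAYOUTS`, `layout_for` and `is_consistent` are hypothetical, and only the column groups are transcribed from the table headers (the order of attributes inside a pair is not significant here).

```python
# Columns released in plain form to every domain (Tables 8 to 11).
COMMON = ["Age", "Gender", "Zipcode"]

# Hypothetical layout table: "own" is the attribute the domain sees in full,
# "overlapped" lists the attribute pairs released as sliced columns.
DOMAIN_LAYOUTS = {
    "Education": {
        "own": "Education",
        "plain": COMMON + ["Occupation", "Education"],
        "overlapped": [("Political-Opinion", "Disease"), ("Disease", "Crime")],
    },
    "Medical": {
        "own": "Disease",
        "plain": COMMON + ["Disease"],
        "overlapped": [("Political-Opinion", "Education"), ("Occupation", "Crime"),
                       ("Occupation", "Education"), ("Political-Opinion", "Crime")],
    },
    "Crime": {
        "own": "Crime",
        "plain": COMMON + ["Crime"],
        "overlapped": [("Political-Opinion", "Education"), ("Occupation", "Disease"),
                       ("Occupation", "Education"), ("Political-Opinion", "Disease")],
    },
    "Political": {
        "own": "Political-Opinion",
        "plain": COMMON + ["Political-Opinion"],
        "overlapped": [("Crime", "Education"), ("Occupation", "Disease"),
                       ("Occupation", "Education"), ("Disease", "Crime")],
    },
}

def layout_for(user_domain):
    """Each user belongs to exactly one domain; unknown domains get no view."""
    if user_domain not in DOMAIN_LAYOUTS:
        raise PermissionError(f"no data view defined for domain {user_domain!r}")
    return DOMAIN_LAYOUTS[user_domain]

def is_consistent(layout):
    """The domain's own attribute is plain and never hidden inside a pair."""
    own = layout["own"]
    return own in layout["plain"] and all(own not in pair
                                          for pair in layout["overlapped"])

print(layout_for("Medical")["plain"])
print(all(is_consistent(layout) for layout in DOMAIN_LAYOUTS.values()))
```

Keeping the layouts in one table makes the access rule easy to audit: a domain user sees its own sensitive attribute in full, and every other sensitive attribute only inside overlapped columns.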
6.3 UML Diagrams
The Unified Modeling Language (UML) includes a set of graphic notation techniques
to create visual models of object-oriented software-intensive systems. Unified Modeling
Language is used to specify, visualize, modify, construct and document the artifacts of
an object-oriented software-intensive system under development.
6.3.1 Data Flow Diagram
A Data Flow Diagram (DFD) is used in software engineering as a graphical
representation of the information flow of a system, that is, how data flows from input
to output. This graphical representation shows the different functions carried out by
the system. The data flow diagram explains the flow of the system execution and also
states the different modules of the system.
Figure 7: DFD Level 0
Figure 8: DFD Level 1
Figure 7 is the level 0 Data Flow Diagram. It gives the flow of the system in the
initial phase and defines the system flow in a short graphical representation.
6.3.2 Class Diagram
A class diagram describes the structure of a system by showing the system's classes,
their attributes, and the relationships among the classes.
Purpose: The purpose of the class diagram is to model the static view of an application.
Class diagrams are the only diagrams that can be directly mapped to object-oriented
languages and are thus widely used at the time of construction.
Figure 9: Class Diagram

6.3.3 Use Case Diagram
To model a system, the most important aspect is to capture its dynamic behaviour,
that is, the behaviour of the system when it is running. Static behaviour alone is not
sufficient to model a system; dynamic behaviour is more important. The actor of the
system is the admin. The user loads the database and the system generates secure data
using the slicing anonymization technique. The use cases are load database, extract
data, apply slicing and suggest result. The use case diagram for the current system
is shown in Figure 10.
Figure 10: Use Case Diagram

6.3.4 Sequence Diagram
A sequence diagram is a kind of interaction diagram that shows how processes operate
with one another and in what order. It is a construct of a Message Sequence Chart. A
sequence diagram shows object interactions arranged in time sequence. It depicts the
objects and classes involved in the scenario and the sequence of messages exchanged
between the objects needed to carry out the functionality of the scenario. Sequence
diagrams are typically associated with use case realizations in the Logical View of the
system under development. Sequence diagrams are sometimes called event diagrams,
event scenarios, and timing diagrams. A sequence diagram shows, as parallel vertical
lines (lifelines), different processes or objects that live simultaneously, and, as horizontal
arrows, the messages exchanged between them, in the order in which they occur. This
allows the specification of simple runtime scenarios in a graphical manner. Figure 11
is the sequence diagram for the system.
Figure 11: Sequence Diagram

6.3.5 State Machine Diagram
A UML state machine diagram describes the states and state transitions of the system.
There are many different states through which the system passes. A state machine
diagram is a behaviour diagram that shows the discrete behaviour of a part of the
designed system through finite state transitions. Figure 12 is the state machine
diagram for the system.
Figure 12: State Machine Diagram
The above subsections describe the various UML diagrams. UML diagrams help to
construct and document the system under development.
6.4 Mathematical Model
This section specifies the data handled by the users of the system when uploading and
downloading data. The inputs and their respective outcomes are described below in
set-theoretic form.
1. T = microdata table
2. Attributes: $A = \{A_1, A_2, \ldots, A_d\}$
3. Attribute domains: $D = \{D[A_1], D[A_2], \ldots, D[A_d]\}$
4. Tuple: $t = (t[A_1], t[A_2], \ldots, t[A_d])$
5. s = sensitive value
6. B = sliced bucket
$$p(t, s) = \sum_{B} p(t, B)\, p(s \mid t, B) \tag{4}$$
where
$p(t, s)$ = probability that t takes sensitive value s,
$p(t, B)$ = probability that t is in bucket B,
$p(s \mid t, B)$ = probability that t takes sensitive value s given that t is in bucket B.
The column values of t are $t[C_1], t[C_2], \ldots, t[C_c]$ and the column values of B
are $B[C_1], B[C_2], \ldots, B[C_c]$.
$f_i(t, B)$ = fraction of occurrences of $t[C_i]$ in $B[C_i]$,
$f_c(t, B)$ = fraction of occurrences of $t[C_c - \{s\}]$ in $B[C_c - \{s\}]$.
The probability that t is in bucket B is
$$p(t, B) = \frac{f(t, B)}{f(t)} \tag{5}$$
where $f(t, B) = \prod_{i=1}^{c} f_i(t, B)$ is the matching degree of t in B as defined
for slicing [1], $f(t) = \sum_{B} f(t, B)$, and $p(s \mid t, B) = D(t, B)[s]$.
$D(t, B)$ = distribution of the candidate sensitive values of t in B,
$D(t, B)[s]$ = probability of sensitive value s in that distribution.
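Equations (4) and (5) can be checked on a toy sliced table. The following Python sketch is an illustration only: the function names and the two buckets are hypothetical, each bucket is stored as a list of columns, and the last column is assumed to hold (non-sensitive part, sensitive value) pairs. The matching degree f(t, B) is taken as the product of the per-column fractions, as in the slicing model of [1].

```python
from collections import Counter
from math import prod

def matching_degree(t_cols, bucket):
    """f(t, B): product over columns of the fraction of matching cells."""
    size = len(bucket[0])
    fractions = [col.count(value) / size
                 for value, col in zip(t_cols[:-1], bucket[:-1])]
    # Last column: compare only its non-sensitive part, t[C_c - {s}].
    fractions.append(sum(1 for qi, _ in bucket[-1] if qi == t_cols[-1]) / size)
    return prod(fractions)

def sensitive_distribution(t_cols, bucket):
    """D(t, B): distribution of the candidate sensitive values of t in B."""
    candidates = [s for qi, s in bucket[-1] if qi == t_cols[-1]]
    return {s: n / len(candidates) for s, n in Counter(candidates).items()}

def disclosure_probability(t_cols, buckets, s):
    """p(t, s) = sum over B of p(t, B) * p(s | t, B), equations (4) and (5)."""
    f = [matching_degree(t_cols, b) for b in buckets]
    f_t = sum(f)
    if f_t == 0:
        return 0.0  # t matches no bucket at all
    return sum(fb / f_t * sensitive_distribution(t_cols, b).get(s, 0.0)
               for fb, b in zip(f, buckets) if fb > 0)

# Hypothetical sliced table with columns (Age, Gender) and (Zip, Disease).
bucket1 = [[(22, "M"), (22, "F"), (33, "F"), (52, "F")],
           [(47906, "Dyspepsia"), (47906, "Flu"),
            (47905, "Flu"), (47905, "Bronchitis")]]
bucket2 = [[(54, "M"), (60, "M"), (60, "M"), (64, "F")],
           [(47302, "Flu"), (47302, "Dyspepsia"),
            (47304, "Dyspepsia"), (47304, "Gastritis")]]

t = [(22, "M"), 47906]  # column values of t without its sensitive value
print(disclosure_probability(t, [bucket1, bucket2], "Flu"))  # 0.5
```

In this example t matches only the first bucket (f = 1/4 × 2/4), and two candidate sensitive values remain there, so the disclosure probability for Flu is 0.5, which satisfies 2-diversity.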
6.5 Algorithmic Strategy
Tuple Partitioning
In the tuple partitioning phase, tuples are partitioned into buckets. The Mondrian
algorithm is modified for tuple partitioning. Unlike Mondrian k-anonymity, no
generalization is applied to the tuples; Mondrian is used only for the purpose of
partitioning tuples into buckets. The algorithm maintains two data structures: (1) a
queue of buckets Q and (2) a set of sliced buckets SB. Initially, Q contains only one
bucket, which includes all tuples, and SB is empty. In each iteration, the algorithm
removes a bucket from Q and splits it into two buckets. If the sliced table after the
split satisfies l-diversity, the algorithm puts the two buckets at the end of the queue
Q; otherwise, the bucket cannot be split any more and the algorithm puts it into SB.
When Q becomes empty, the sliced table has been computed and the set of sliced
buckets is SB. The main part of the tuple-partition algorithm is to check whether a
sliced table satisfies l-diversity.
Algorithm tuple-partition(T, l)
1. Q = {T}; SB = ∅
2. while Q is not empty
3.     remove the first bucket B from Q; Q = Q − {B}
4.     split B into two buckets B1 and B2
5.     if diversity-check(T, Q ∪ {B1, B2} ∪ SB, l)
6.         Q = Q ∪ {B1, B2}
7.     else SB = SB ∪ {B}
8. return SB
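The queue-based structure of tuple-partition can be sketched in Python as follows. Two simplifications are assumptions of this sketch and not part of the algorithm above: the split is a median split on the quasi-identifier with the most distinct values, and the diversity check is replaced by the simpler test that each bucket contains at least l distinct sensitive values. All function names and the sample rows are hypothetical.

```python
from collections import deque

def split_bucket(bucket, qi_indexes):
    """Median split on the quasi-identifier with the most distinct values."""
    dim = max(qi_indexes, key=lambda i: len({row[i] for row in bucket}))
    ordered = sorted(bucket, key=lambda row: row[dim])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def is_diverse(bucket, sensitive_index, l):
    """Simplified stand-in for diversity-check: at least l distinct values."""
    return len({row[sensitive_index] for row in bucket}) >= l

def tuple_partition(table, qi_indexes, sensitive_index, l):
    queue = deque([list(table)])  # Q = {T}
    sliced_buckets = []           # SB = empty set
    while queue:
        bucket = queue.popleft()
        if len(bucket) >= 2:
            left, right = split_bucket(bucket, qi_indexes)
            if (is_diverse(left, sensitive_index, l)
                    and is_diverse(right, sensitive_index, l)):
                queue.extend([left, right])  # keep splitting both halves
                continue
        sliced_buckets.append(bucket)        # bucket cannot be split further
    return sliced_buckets

# Hypothetical rows: (Age, Zipcode, Disease).
rows = [(20, 12578, "Flu"), (23, 12216, "Cancer"), (26, 12460, "Flu"),
        (29, 12903, "Diabetes"), (32, 12093, "Cancer"), (41, 12589, "Flu"),
        (45, 12600, "Diabetes"), (50, 12700, "Cancer")]

for bucket in tuple_partition(rows, qi_indexes=[0, 1], sensitive_index=2, l=2):
    print(bucket)
```

Because the other buckets are unchanged by a split, checking only the two new buckets is enough in this simplified setting; the full diversity check of the slicing algorithm would instead evaluate p(t, s) for the tuples of the sliced table.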
Clustering
Clustering is the process of grouping a set of objects into classes or clusters so that
objects within a cluster are similar to one another but dissimilar to objects in other
clusters. K-means clustering and Partitioning Around Medoids (PAM) are well-known
techniques for performing non-hierarchical clustering. K-means clustering finds the
centroids, where the coordinate of each centroid is the mean of the coordinates of the
objects in the cluster, and assigns every object to the nearest centroid.
K-medoids algorithm
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm
and the medoidshift algorithm. K-medoids algorithms partition the dataset into groups
and attempt to minimize the distance between the points assigned to a cluster and the
point designated as the center of that cluster. In contrast to the k-means algorithm,
k-medoids chooses data points as centers and works with an arbitrary matrix of
distances between data points.
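A small Python sketch illustrates how k-medoids works directly on a distance matrix. It uses a simple alternating scheme (assign every point to its nearest medoid, then re-pick the medoid of each cluster) rather than the full PAM swap search, and it starts from the first k points; the function name and both of these choices are assumptions made for this illustration.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster points 0..n-1 given a symmetric distance matrix `dist`.

    Returns a dict mapping each medoid index to the list of its members.
    """
    n = len(dist)
    medoids = list(range(k))  # assumption: deterministic start from first k points
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: every point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for p in range(n):
            nearest = min(medoids, key=lambda m: dist[p][m])
            clusters[nearest].append(p)
        # Update step: the new medoid minimizes the total distance in its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][q] for q in members))
            if members else m
            for m, members in clusters.items()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids are stable, so the clustering has converged
        medoids = new_medoids
    return clusters

# Six points on a line; the distance matrix holds absolute differences.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(k_medoids(dist, k=2))
```

Since only the distance matrix is used, the same routine can cluster objects for which no mean can be computed, which is the property of k-medoids stressed above.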
6.6 Time Complexity
The time complexity of Mondrian is O(n log n), whereas the alternate tuple partitioning
algorithm presented here takes only O(n) time. The diversity check algorithm is the
same as in slicing, except that the computation of p(t, B) and D(t, B) requires the
system to calculate the total number of possible tuples generated in each bucket.
6.7 Certainty Analysis
Certainty analysis is an ongoing process. By doing certainty analysis, researchers get
an idea of the uniqueness of their research work. Certainty analysis leads to risk
analysis. Overlapping slicing with multiple sensitive attributes is a new anonymization
technique for privacy-preserving data publishing. It handles multiple sensitive
attributes and provides access for multiple domains.
7 Testing
7.1 Introduction
Testing helps to check the performance of the system for various scenarios and
contributes accordingly towards periodic updates. It is often referred to as
verification and validation, a set of investigative activities that can be planned in
advance and conducted systematically to assure the stakeholders that the system fulfils
all the requirements gathered during the requirement gathering phase. Verification
refers to the set of activities that ensure that the software correctly implements the
specified functionality. Validation refers to a set of activities which ensure that the
functionality implemented by the system is traceable to customer requirements.
• White box testing - White box testing is testing in which the software tester has
knowledge of the inner workings, structure and language of the software, or at least
its purpose.
• Black box testing - Black box testing is testing the software without any knowledge
of the inner workings, structure or language of the module being tested. The tester is
not able to see the internal working.
• GUI testing - GUI testing is the process of testing a product's graphical user
interface to ensure that it meets its written specifications. This is normally done
through the use of a variety of test cases. To generate a set of test cases, test
designers must be certain that their suite covers all the functionality of the system.
7.2 Objective
The Software Test Plan (STP) is designed to test the modules for performance
degradation under stress, to uncover bugs in the system, to set right any flaws in
logic that may be present, and to check the logical flow from one module to another
within the system.
7.3 Testing Strategy

7.3.1 Unit Testing
Unit testing is a method by which individual units of source code, that is, sets of one
or more computer program modules together with associated control data, usage
procedures and operating procedures, are tested to determine whether they are fit for
use. The goal of unit testing is to isolate each part of the program and show that the
individual parts are correct. A unit test provides a strict, written contract that the
piece of code must satisfy. Units in the proposed system include the result display
form, the command buttons, etc.
Table 12: Test Cases 1
Sr. No. | Test case name | Test prerequisite | Input | Expected output | Actual output | Result
1 | Welcome page | A page with administrator login and agent login | Source code | Data should be in proper format | Welcome page in proper format | Pass
2 | New user registration | All details should be filled | User details | All details should be filled | Registration | Pass
3 | Administrator login | Valid username and password | Username and password | All details should be filled correctly | Administrator login | Pass
4 | Return to welcome page | Logout from the administrator | Logout | Return to welcome page | Welcome page | Pass
5 | User login | Valid username and password | Username and password | All details should be filled correctly | User login | Pass
6 | Browse button | Browse dataset | Dataset with different attribute | Fail to extract dataset | Fail to extract dataset | Pass
7.3.2 Integration Testing
Integration testing takes as its input modules that have been unit tested, groups them
in larger aggregates, applies tests defined in an integration test plan to those
aggregates and delivers as its output the integrated system ready for system testing.
The test cases listed in Table 13 were conducted to check proper functioning.
Table 13: Test Cases 2
Sr. No. | Test Case Objective | Expected Result | Actual Result
1 | Front page occurs properly | Application starts properly | Application started properly
2 | Is admin login on browser successful | Indicate Success/Fail login if requested | Login Admin
3 | Is register new user successful | Indicate Success/Fail register if requested | Register new user
4 | Is user login on browser successful | Indicate Success/Fail login if requested | Login User
5 | Is Education Domain user login | It should show the Education Domain data | It displays Overlapped Slicing Data
6 | Is Medical Domain user login | It should show the Medical Domain data | It displays Overlapped Slicing Data
7 | Is Political Domain user login | It should show the Political Domain data | It displays Overlapped Slicing Data
8 | Is Crime Domain user login | It should show the Crime Domain data | It displays Overlapped Slicing Data

8 Results and Discussion

8.1 Dataset
A data set is a collection of data stored in a relational database whose schema is
highly normalized. Many data mining algorithms require the data as a data set in a
horizontal layout. A data set in horizontal layout, that is, in point-dimension,
observation-variable or instance-feature form, is the standard input required by most
data mining algorithms.
8.2 Result
The following figures show the main page of the system and the loading of a dataset
that contains various attributes.
Figure 13: Login Page

Figure 14: Load Database
Figure 15: Result for Medical Domain

Figure 16: Result for Political Domain
Figure 17: Result for Education Domain

9 Conclusion and Future Work

9.1 Conclusion
Database privacy is important to protect sensitive information. Generalization and
bucketization techniques are used to protect sensitive information from attacks.
Overlapped slicing is a new technique for increasing the utility of anonymized datasets
by improving slicing. It can duplicate attributes in more than one column, and this
leads to greater data utility because it releases more attribute correlations.
Traditional slicing provides safeguards against attribute disclosure and membership
disclosure, and the proposed overlapped slicing also satisfies these safeguards.
Overlapped slicing provides greater data utility than plain slicing while satisfying
l-diversity.
9.2 Future Work
Future research work in this area can include the extension of the notion of improved
slicing to datasets satisfying stricter privacy requirements such as t-closeness.
Further analysis of the effect of the number of released columns on data privacy and
utility should also be considered.
References
[1] Tiancheng Li, Ninghui Li, Jia Zhang, and Ian Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing", IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 3, March 2012.
[2] Neha V. Mogre, Girish Agarwal, Pragati Patil, "A Review On Data Anonymization Technique For Data Publishing", International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 10, December 2012, ISSN: 2278-0181.
[3] S. Kiruthika, Dr. M. Mohamed Raseen, "Enhanced Slicing Models For Preserving Privacy In Data Publication", ICCETET, pages 406-409, in Proc. of IEEE, 2013.
[4] D. Mohanapriya, Dr. T. Meyyappan, "Slicing : A Efficient Method For Privacy Preservation In Data Publishing", International Journal of Engineering Research and Applications, Vol. 3, Issue 4, August 2013, ISSN: 2248-9622.
[5] Amar Paul Singh, Ms. Dhanshri Parihar, "A Review of Privacy Preserving Data Publishing Technique", International Journal of Engineering Research Management Technology, Vol. 2, Issue 6, June 2013, ISSN: 2278-9359.
[6] Na Li, Nan Zhang, Sajal K. Das, "Relationship Privacy Preservation in Publishing Online Social Networks", IEEE International Conference on Privacy, Security, Risk, Trust, and IEEE International Conference on Social Computing, 2011.
[7] F. Bonchi, A. Gionis, and A. Ukkonen, "Overlapping Correlation Clustering", in Proceedings of the IEEE 11th International Conference on Data Mining, 2011, pp. 51-60.
[8] Ajinkya A. Dhaigude, Preetham Kumar, "Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Publishing", International Journal of Data Engineering (IJDE), Volume 5, Issue 2, 2014.
[9] Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu, "Privacy Preserving Data Publishing Concepts and Techniques", Data Mining and Knowledge Discovery Series, 2010.
[10] Gabriel Ghinita, Panos Kalnis, Yufei Tao, "Anonymous Publication of Sensitive Transactional Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 2, pp. 161-174, February 2011.
[11] Lan Sun, Yilei Wang, Yingjie Wu, "A Survey of Transaction Data Anonymous Publication", IEEE Symposium on Robotics and Applications, 2012.
[12] Jinfei Liu, Jun Luo and Joshua Zhexue Huang, "Rating: Privacy Preservation for Multiple Attributes with Different Sensitivity Requirements", IEEE International Conference on Data Mining Workshops, 2011.
[13] Shyue-Liang Wang, "K-anonymity on Sensitive Transaction Items", IEEE International Conference on Granular Computing, 2011.
[14] B.K.Tripathy, A.Maity, B.Ranajit, D.howdhuri, "A fast p-sensitive l-diversity Anonymisation algorithm", IEEE, 2011.
[15] Yingjie Wu, "Privacy Preservation in Transaction Databases based on Anatomy technique", IEEE International Conference on Computer Science and Education, 2010.
PS II An Efficient Technique to Secure Data Access for Multiple Domains using Overlapping Slicing
[16] Yeye He, Jerey F. Naughton,”Anonymization of Set Valued Data via TopDown,
Local Generalization” in ACM, VLDB,August 2009.
[17]G. Ghinita, Y. Tao, and P. Kalnis, ”On the anonymization of sparse high-dimensional
data” In ICDE, pages 715724, 2008.
[18] T. Li and N. Li, ”On the trade-off between privacy and utility in data publishing”
In KDD, pages 517 526, 2009.
[19] Y. Xu, K. Wang, A. W.-C. Fu, and P. S. Yu, ”Anonymizing transaction databases
for publication” In KDD, pages 767775, 2008.
[20] M. Terrovitis, N. Mamoulis, and P. Kalnis, ”Privacy-preserving anonymization of

set-valued data” In VLDB, pages 115125, 2008.