Abstract
Data mining is the process of analysing data from different perspectives, summarizing
it, and extracting the needed information from the database. Most enterprises collect
and store data in large databases. Database privacy is an important responsibility of
organizations, which must protect clients' sensitive information because clients trust them
to do so. Various anonymization techniques have been proposed to protect the privacy of
sensitive microdata. Generalization loses a considerable amount of information, especially
for high-dimensional data. Bucketization does not prevent membership disclosure and
does not apply to data that lack a clear separation between quasi-identifying
attributes and sensitive attributes. Slicing is a technique for anonymizing a published
dataset by partitioning it vertically and horizontally. The proposed technique
increases the utility and privacy of a sliced dataset by allowing overlapped slicing while
still preventing membership disclosure. It also provides secure data access
for multiple domains. This approach uses overlapped slicing to preserve data utility
and privacy better than traditional slicing.
Category:
• Data Mining.
Contents
Certificate
Acknowledgement
Abstract
Contents
List of Tables
Abbreviations
1 Introduction
1.1 Privacy-Preserving in Data Mining
1.2 Privacy-Preserving in Data Publishing
2 Literature Review
2.1 Various Anonymization Techniques
2.1.1 Generalization
2.1.2 Bucketization
2.1.3 Slicing
2.1.4 Improved Slicing
2.2 Privacy Threats
2.2.1 Membership Disclosure Protection
2.2.2 Identity Disclosure Protection
2.2.3 Attribute Disclosure Protection
3.3.2 Objective
3.4 Statement of Scope
4 Dissertation Plan
4.1 Timeline of Project
6.3.3 Use Case Diagram
6.3.4 Sequence Diagram
6.3.5 State Machine Diagram
6.4 Mathematical Model
6.5 Algorithmic Strategy
6.6 Time Complexity
6.7 Certainty Analysis
7 Testing
7.1 Introduction
7.2 Objective
7.3 Testing Strategy
7.3.1 Unit Testing
7.3.2 Integration Testing
Reference
Appendix
cPGCON Attendee Certificate
cPGCON Presentee Certificate
cPGCON Review
cPGCON Paper
IJRITCC Certificate
IJRITCC Review
IJRITCC Journal Paper
Plagiarism Summary
List of Figures
1 Data Collection and Data Publication
2 Simple Model of PPDP
3 The Timeline for Seminar III
4 The Timeline for Project Stage I
5 The Timeline for Project Stage II
6 System Model
7 DFD Level 0
List of Tables
1 The original table
2 The generalized table
3 The bucketized table
4 The Sliced table
5 Comparison of anonymization techniques
6 Summary of Literature Review
7 The Sliced table
8 The Overlapped Slicing table with Education Domain
9 The Overlapped Slicing table with Medical Domain
10 The Overlapped Slicing table with Crime Domain
11 The Overlapped Slicing table with Political-Opinion Domain
12 Test Cases 1
13 Test Cases 2
Abbreviations
QI : Quasi-Identifiers
SA : Sensitive Attributes
F : Female
M : Male
An Efficient Technique to Secure Data Access for Multiple Domains using Overlapping Slicing
1 Introduction
Data mining is the process of evaluating data from different views, summarizing it into
valuable information, and extracting the needed information from the database. Nowadays,
information collected by various organizations and governments is used for knowledge-based
decision making as well as for analysis. There is therefore a need to share the
collected information, but the data in its original form contains sensitive information.
People and organizations do not want their sensitive information to be disclosed, and
sharing the data in its original form would violate individual privacy.
To prevent this violation, the data must be published in such a way that privacy is
preserved while data analysis can still be done effectively. Privacy preservation in data
publication has been studied extensively in recent years. Published data contains records,
each of which holds information about an individual entity, such as a person, a household,
or an organization. Several microdata anonymization techniques have been proposed.
Generally, when people talk of privacy, they mean that they do not want personal
information to be disclosed. As long as data is not misused, most people do not feel their
privacy has been violated. The problem is that once information is released, it may be
impossible to prevent misuse. Using this distinction, the goal is to ensure that data
mining does not enable misuse of personal information. From a collection of data, it is
possible to learn things that are not revealed by any individual data item. An individual
may not care about someone knowing their birth date, mother's maiden name, or social
security number, but knowing all of them enables identity theft. This type of privacy
problem arises with large, multi-individual collections as well. A technique that
guarantees no individual data is revealed may still release information describing the
collection as a whole. Such corporate information is generally the goal of data mining,
but some results may still lead to concerns. The difference between such corporate privacy
issues and individual privacy is not that significant [5].
holds some sensitive information, and people do not want their sensitive information
to be revealed. Sharing data in its original form thus violates individual privacy.
To prevent this violation, the data must be published in such a way that privacy is
preserved while data analysis can still be done effectively. Microdata holds information
about an individual entity, such as a person or an organization. The most popular
anonymization techniques are generalization and bucketization. Here, data
attributes are partitioned into three categories:
1. Identifier attributes: Attributes that can uniquely identify an individual,
such as Name or Social Security Number.
2. Quasi-Identifiers (QI): The set of attributes that can be linked with publicly available
datasets to reveal personal identity, e.g., Birth date, Gender, and Zipcode.
3. Sensitive Attributes (SA): Attributes that contain private personal information, such as
Disease, Political Opinion, or Crime.
Multiple domains contain multiple sensitive attributes, and the slicing anonymization
technique has been proposed to protect this sensitive information. The basic idea of
slicing is to break the links across columns while preserving the links within each
column. With multiple sensitive attributes, slicing preserves more utility than
generalization and bucketization and reduces the dimensionality of the data. Overlapped
slicing increases the utility and privacy of a sliced dataset with multiple sensitive
attributes in different domains.
2 Literature Review
2.1 Various Anonymization Techniques
Two widely studied data anonymization techniques are generalization and bucketization.
Table 1 shows the data collected from the record owners.
2.1.1 Generalization
attributes are lost. In order to study attribute correlations on the generalized table,
the data analyst has to assume that every possible combination of attribute values is
equally possible. This is an inherent problem of generalization that prevents effective
analysis of attribute correlations[3].
2.1.2 Bucketization
2.1.3 Slicing
Slicing is a novel data anonymization technique. Slicing partitions the data set both
vertically and horizontally. Vertical partitioning is done by grouping attributes into
columns based on the correlations among the attributes. Each column contains a subset
of attributes that are highly correlated. Horizontal partitioning is done by grouping
tuples into buckets. Finally, within each bucket, values in each column are randomly
permuted (or sorted) to break the linking between different columns. [2]
The basic idea of slicing is to break the associations across columns while preserving
the associations within each column. This reduces the dimensionality of the data and
preserves better utility than generalization and bucketization. Slicing preserves utility
because it groups highly correlated attributes together and preserves the correlations
between such attributes. Slicing achieves privacy because it breaks the associations
between uncorrelated attributes. When the data set contains quasi-identifiers and one
sensitive attribute, bucketization has to break their correlation. Slicing can group
some quasi-identifier attributes with the sensitive attribute, preserving attribute
correlations with the sensitive attribute. Slicing first partitions the attributes into
columns, each containing a subset of the attributes, and then partitions the tuples into
buckets, each containing a subset of the tuples. Within each bucket, the values in each
column are randomly permuted to break the linking between different columns.
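The three steps described above (attribute grouping, bucketing, and per-bucket permutation) can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name `slice_table`, the fixed-size consecutive bucketing, the hand-picked attribute groups, and the sample records are all assumptions made for illustration. A real implementation would choose columns from attribute correlations and buckets from a privacy check.

```python
import random

def slice_table(rows, columns, bucket_size, seed=0):
    """Return a list of buckets; each bucket is a list of sliced rows.

    rows: list of dicts (one per tuple); columns: list of attribute groups.
    """
    rng = random.Random(seed)
    buckets = []
    # Horizontal partitioning (simplified): consecutive chunks of tuples.
    for start in range(0, len(rows), bucket_size):
        chunk = rows[start:start + bucket_size]
        # Vertical partitioning: project the chunk onto each attribute group.
        pieces = [[tuple(r[a] for a in group) for r in chunk] for group in columns]
        # Permute each column independently to break cross-column links.
        for piece in pieces:
            rng.shuffle(piece)
        buckets.append([tuple(piece[i] for piece in pieces)
                        for i in range(len(chunk))])
    return buckets

# Hypothetical microdata, not taken from the tables in this report.
rows = [
    {"Age": 22, "Gender": "M", "Zipcode": "47906", "Disease": "Flu"},
    {"Age": 22, "Gender": "F", "Zipcode": "47906", "Disease": "Cancer"},
    {"Age": 52, "Gender": "F", "Zipcode": "47905", "Disease": "Diabetes"},
    {"Age": 54, "Gender": "M", "Zipcode": "47302", "Disease": "Flu"},
]
columns = [("Age", "Gender"), ("Zipcode", "Disease")]
sliced = slice_table(rows, columns, bucket_size=2)
```

Values that share a column, such as (Age, Gender), always stay together, so their correlation survives, while the pairing between one column and another is randomized inside each bucket.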
2.1.4 Improved Slicing
A novel data anonymization model has been proposed that addresses the limitations of
slicing. The major contributions of this model are the use of an overlapped clustering
technique in the attribute partitioning phase and the use of an alternative tuple
partitioning algorithm. Improved slicing works by first finding the correlations between
each pair of attributes and then clustering these attributes into columns by overlapped
clustering on the basis of their association coefficients. The dataset is then
horizontally partitioned into buckets satisfying l-diversity using a novel tuple
partitioning algorithm. The columns within each bucket are then randomly permuted with
respect to one another to give an enhanced sliced dataset [8].
2.2 Privacy Threats
When publishing microdata, there are three types of information disclosure threats.
The first type is membership disclosure. When the data to be published is selected
from a larger dataset and the selection conditions are sensitive, it is important to
prevent an attacker from learning whether an individual's record is in the data or not.
The second type is identity disclosure, which occurs when an individual is linked
to a particular record in the released table. In some situations, one wants to protect
against identity disclosure when the attacker is uncertain of membership.
The third type is attribute disclosure, which occurs when new information about some
individuals is revealed. That is, the released data makes it possible to infer the
attributes of an individual more accurately than would have been possible before the
release. As in the case of identity disclosure, it is necessary to consider attackers who
already know the membership information. Identity disclosure usually leads to attribute
disclosure: once there is identity disclosure, an individual is re-identified and the
corresponding sensitive value is revealed. Attribute disclosure can happen with or without
identity disclosure, for example when the sensitive values of all matching tuples are the
same.
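The last case, attribute disclosure without identity disclosure, can be shown with a small Python sketch. The helper name `inferred_sensitive_value` and the generalized records are hypothetical and serve only to illustrate the point: the attacker cannot tell which record belongs to the target, yet learns the sensitive value because every matching record carries the same one.

```python
def inferred_sensitive_value(table, quasi_ids, sensitive, known):
    """Return the sensitive value an attacker can infer for a target whose
    quasi-identifier values are `known`, or None if the matches disagree."""
    matches = {row[sensitive] for row in table
               if all(row[q] == known[q] for q in quasi_ids)}
    return next(iter(matches)) if len(matches) == 1 else None

# Hypothetical released (generalized) table.
released = [
    {"Age": "20-29", "Zipcode": "125**", "Disease": "Flu"},
    {"Age": "20-29", "Zipcode": "125**", "Disease": "Flu"},
    {"Age": "30-49", "Zipcode": "120**", "Disease": "Cancer"},
    {"Age": "30-49", "Zipcode": "120**", "Disease": "Diabetes"},
]

young = {"Age": "20-29", "Zipcode": "125**"}
older = {"Age": "30-49", "Zipcode": "120**"}
leak = inferred_sensitive_value(released, ["Age", "Zipcode"], "Disease", young)
safe = inferred_sensitive_value(released, ["Age", "Zipcode"], "Disease", older)
```

Here `leak` is `"Flu"` even though two records match the target, while `safe` is `None` because the matching records carry different diseases.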
A summary of the main reviewed papers is given in the following table.
The purpose of anonymization techniques is to provide security. The problems can be listed as:
2. Bucketization does not prevent membership disclosure and does not apply to data that
do not have a clear separation between sensitive attributes and QI attributes.
The proposed approach overcomes the above limitations by using slicing. Slicing
with multiple sensitive attributes across multiple columns protects data from membership
disclosure. It improves efficiency and protects data better than other anonymization
techniques. Overlapped slicing demonstrates greater data utility and provides secure
data access to multiple domains.
3.3.2 Objective
1. Overlapping slicing shows better performance than traditional techniques
such as slicing, bucketization, and generalization.
2. Sensitive attributes are partitioned both horizontally and vertically; therefore,
more attribute correlation is preserved and the utility of the data is increased.
4 Dissertation Plan
This plan is the basis for the execution and tracking of all the project activities. It
shall be used throughout the life of the project and shall be kept up to date to reflect
the actual accomplishments and plans of the project. The schedule of the project work is
represented here in Gantt chart format. The project plan, as per the given deadlines,
is shown in Figures 3, 4, and 5.
In Seminar III, most of the time was spent studying various anonymization
techniques for security in data mining and their comparative analysis. Similarly,
in Stage I and Stage II, the majority of the time was devoted to module design and
to discussing progress with the Guide.
The idea of slicing is to release correlated attributes together, which adds to
the utility of the anonymized dataset. Thus, assigning an attribute to more than
one column releases more attribute correlations and improves the utility of
the released dataset. Table 7 shows the overlapped slicing technique. Age is grouped
with Disease, and Occupation is grouped with Gender. Even if Occupation also had a
fairly high correlation with Disease but Gender did not, they could not be joined into
a larger group, and data utility is reduced because the association between Disease and
Occupation is lost. In Table 7, the attributes Occupation and Disease appear
in more than one column, meaning that they overlap. This allows highly associated
attributes to be grouped together. It also solves the problem of singular columns by
merging associated attributes into a different column instead of just leaving out an
attribute with a low correlation. The idea of Overlapping Correlation Clustering [8]
was suggested by F. Bonchi et al. It can be applied to the attribute partitioning
phase of the slicing algorithm.
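A minimal Python sketch of how an attribute can end up in more than one column is given below. It is not the Overlapping Correlation Clustering algorithm of Bonchi et al. A plain threshold rule is swapped in for illustration: every attribute pair whose association score reaches the threshold becomes a column, and any attribute left uncovered is paired with its strongest partner so that no singular column remains. The function name `overlapped_columns`, the scores, and the threshold are assumptions, and at least two attributes are assumed.

```python
from itertools import combinations

def overlapped_columns(attributes, score, threshold):
    """Group attributes into two-attribute columns that may overlap."""
    def assoc(a, b):
        # Scores are symmetric; accept either key order.
        return score.get((a, b), score.get((b, a), 0.0))

    # Every strongly associated pair becomes a column.
    columns = [pair for pair in combinations(attributes, 2)
               if assoc(*pair) >= threshold]
    covered = {a for col in columns for a in col}
    # Avoid singular columns: attach leftovers to their best partner.
    for a in attributes:
        if a not in covered:
            partner = max((b for b in attributes if b != a),
                          key=lambda b: assoc(a, b))
            columns.append((a, partner))
    return columns

# Hypothetical association coefficients for illustration only.
attributes = ["Age", "Gender", "Occupation", "Disease"]
score = {
    ("Age", "Disease"): 0.8,
    ("Occupation", "Gender"): 0.7,
    ("Occupation", "Disease"): 0.6,
}
columns = overlapped_columns(attributes, score, threshold=0.5)
```

With these assumed scores, Occupation and Disease each appear in two columns, which mirrors the overlap described above and keeps the Occupation and Disease association that non-overlapping slicing would lose.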
Data in the form of data set or in horizontal and vertical layout is required in many
data mining algorithms. Data set in the form of horizontal and vertical layout that is
5.4 Usability
Usability is a non-functional requirement of the system that specifies how easy the
system is to use, or how user-friendly it is. It specifies how the system functionality
is perceived by the user and how efficiently it carries out its tasks. The slicing
technique improves the utility of anonymized datasets. Overlapped slicing demonstrates
the greater data utility provided by improved slicing while satisfying l-diversity.
Several factors determine the usability of the system, such as ease of learning, task
efficiency, understandability, and subjective satisfaction. Hence, the project is
demonstrated using the above methods so that it achieves high ratings in usability
factors such as task efficiency and user satisfaction.
5.5 Interfaces
User Interface
The project is designed in such a way that the user can easily start data detection,
recognition, and classification and get the output with a single button click; the
interface is designed in Java. It is therefore very easy for the user to interact with
the system.
Software Interface
The software involved is the basic processing environment. In addition, user-created
library files for loading the database, detection, recognition, and classification are used.
Economic feasibility is assessed to check the economic factors that affect the system
and to calculate the positive economic benefits to the system. It is used to estimate
the monetary investment involved in the research as well as in the development work.
The investment needs to be justified. The software used, Eclipse and the JDK, is
freeware.
Risk management is required to anticipate and identify risks in order to minimize
their impact, damage, and loss, to reduce their probability, to monitor risk areas for
early detection, and to ensure management awareness of risks. Risk management is the
identification, assessment, and prioritization of risk. Risk in our project arises in case
of system failure or crashes.
The risks can be monitored in the following ways to prevent their occurrence and
thus minimize project failure. Risk projection, also called risk estimation, attempts to
rate each risk in two ways: the likelihood or probability that the risk is real, and the
consequences of the problems associated with the risk, should it occur. Risk projection
activities were performed, such as measuring the overall accuracy of the project and
backing up the database regularly. If a system crash or any other problem occurs, it can
then be resolved easily. Along with this, the following activities are carried out to
guard against risk:
2. Insufficient Response
Nowadays, internet access is a must for employees' daily work in every field. In any
technical field, internet access is available.
RMMM stands for Risk Mitigation, Monitoring and Management. The RMMM plan
documents all the work performed as a part of risk analysis and is used by the project
manager as part of the overall project plan.
The number of lines required to implement the various modules is called Lines of
Code (LOC). It is expressed in thousands, as KLOC. Effort in person-months is
calculated using the following formula:
$$N = \frac{E}{D} \tag{3}$$
$$N = \frac{11.2}{6.5} \approx 1.72$$
5.8.4 Cost
Figure 6 represents the system architecture, which consists of the following modules:
• Computes the overlapped sliced table with multiple sensitive attributes of different
domains.
Overlapped slicing with multiple sensitive attributes across multiple columns protects
data from membership disclosure. It improves working efficiency and protection compared
with other anonymization techniques. Attributes that are highly correlated are placed
in the same column, which preserves the relationships between such attributes. The
associations between uncorrelated attributes are broken, which provides better privacy
because the associations between such attributes are less frequent and potentially
identifying. The system can be tested with high-dimensional data to show that it works
efficiently and gives better results than traditional systems. The system has multiple
domains with multiple sensitive attributes and provides secure access to these domains.
Table 8 shows the overlapped slicing for the Education domain. Here there is full access
to the Education domain, with overlapping between the other domains.
Table 9 shows the overlapped slicing for the Medical domain. Here there is full access to
the Medical domain, with overlapping within the other domains.
Table 9: The Overlapped Slicing table with Medical Domain
| Age | Gender | Zipcode | Disease | (PoliOpin, Education) | (Occupation, Crime) | (Occupation, Education) | (PoliOpin, Crime) |
|---|---|---|---|---|---|---|---|
| 20 | F | 12578 | Flu | (Congress, 10th) | (Sale, Theft) | (Gov, PG) | (Aap, Null) |
| 41 | M | 12589 | Cancer | (BGP, 12th) | (Student, Robbery) | (Sale, 10th) | (Congress, Theft) |
| 26 | M | 12460 | Cancer | (Aap, PG) | (Gov, Null) | (Student, 12th) | (BGP, Robbery) |
| 23 | F | 12216 | Flu | (Shivsena, 12th) | (Army, Null) | (Agri, 12th) | (Shivsena, Theft) |
| 29 | M | 12903 | Diabetes | (Congress, Graduate) | (Student, Null) | (Army, Graduate) | (Congress, Null) |
| 32 | M | 12093 | HyperT | (BGP, Graduate) | (Agri, Theft) | (Student, Graduate) | (BGP, Null) |
Table 10 shows the overlapped slicing for the Crime domain. Here there is full access to
the Crime domain, with overlapping within the other domains.
Table 10: The Overlapped Slicing table with Crime Domain
| Age | Gender | Zipcode | Crime | (PoliOpin, Education) | (Occupation, Disease) | (Occupation, Education) | (PoliOpin, Disease) |
|---|---|---|---|---|---|---|---|
| 20 | F | 12578 | Robbery | (Congress, 10th) | (Sale, Cancer) | (Gov, PG) | (Aap, Cancer) |
| 41 | M | 12589 | Null | (BGP, 12th) | (Student, Flu) | (Sale, 10th) | (Congress, Cancer) |
| 26 | M | 12460 | Theft | (Aap, PG) | (Gov, Cancer) | (Student, 12th) | (BGP, Flu) |
| 23 | F | 12216 | Null | (Congress, Graduate) | (Army, HyperT) | (Agri, 12th) | (Shivsena, Diabetes) |
| 29 | M | 12903 | Theft | (Shivsena, 12th) | (Student, Flu) | (Army, Graduate) | (Congress, HyperT) |
| 32 | M | 12093 | Null | (BGP, Graduate) | (Agri, Diabetes) | (Student, Graduate) | (BGP, Flu) |
Table 11 shows the overlapped slicing for the Political-Opinion domain. Here there is full
access to the Political-Opinion domain, with overlapping within the other domains.
Table 11: The Overlapped Slicing table with Political-Opinion Domain
| Age | Gender | Zipcode | Political Opinion | (Crime, Education) | (Occupation, Disease) | (Occupation, Education) | (Disease, Crime) |
|---|---|---|---|---|---|---|---|
| 20 | F | 12578 | BGP | (Theft, 10th) | (Sale, Cancer) | (Gov, PG) | (Cancer, Theft) |
| 41 | M | 12589 | Aap | (Robbery, 12th) | (Student, Flu) | (Sale, 10th) | (Flu, Robbery) |
| 26 | M | 12460 | Congress | (Null, PG) | (Gov, Cancer) | (Student, 12th) | (Cancer, Null) |
| 23 | F | 12216 | BGP | (Null, Graduate) | (Army, HyperT) | (Agri, 12th) | (HyperT, Null) |
| 29 | M | 12903 | Shivsena | (Theft, 12th) | (Student, Flu) | (Army, Graduate) | (Flu, Null) |
| 32 | M | 12093 | Congress | (Null, Graduate) | (Agri, Diabetes) | (Student, Graduate) | (Diabetes, Theft) |
The Unified Modeling Language (UML) includes a set of graphic notation techniques
to create visual models of object-oriented software-intensive systems. Unified Modeling
Language is used to specify, visualize, modify, construct and document the artifacts of
an object-oriented software-intensive system under development.
Figure 7 is the level 0 Data Flow Diagram, which gives a short graphical representation
of the flow of the system in its initial phase.
A class diagram describes the structure of a system by showing the system's classes,
their attributes, and the relationships among the classes.
Purpose: The purpose of the class diagram is to model the static view of an application.
Class diagrams are the only diagrams that can be directly mapped to object-oriented
languages and are thus widely used at the time of construction.
To model a system, the most important aspect is to capture its dynamic behaviour.
In more detail, dynamic behaviour means the behaviour of the system
when it is running. Static behaviour alone is not sufficient to model
a system; dynamic behaviour is more important than static behaviour. The
system has a single actor, the admin. The user loads the database, and the system
generates secure data using the slicing anonymization technique. The use cases are
loading the database, extracting data, applying slicing, and suggesting the result. The
use case diagram for the current system is shown in Figure 10.
A sequence diagram is a kind of interaction diagram that shows how processes operate
with one another and in what order. It is a construct of a Message Sequence Chart. A
sequence diagram shows object interactions arranged in time sequence. It depicts the
objects and classes involved in the scenario and the sequence of messages exchanged
between the objects needed to carry out the functionality of the scenario. Sequence
diagrams are typically associated with use case realizations in the Logical View of the
system under development. Sequence diagrams are sometimes called event diagrams,
event scenarios, and timing diagrams. A sequence diagram shows, as parallel vertical
lines (lifelines), different processes or objects that live simultaneously, and, as horizontal
arrows, the messages exchanged between them, in the order in which they occur. This
allows the specification of simple runtime scenarios in a graphical manner. Figure 11
is the sequence diagram for the system.
A UML state machine diagram describes the states and state transitions of the system.
The system passes through many different states. A state machine diagram
is a behaviour diagram that shows the discrete behaviour of a part of the designed system
through finite state transitions. Figure 12 is the state machine diagram for the system.
The above subsections describe various UML diagrams. UML diagrams help to
construct and document the system under development.
6.4 Mathematical Model
In this section we specify the data users present in the system for uploading and
downloading data. The inputs and their respective outcomes are described below in
the form of set theory.
1. T = Microdata Table
5. s = Sensitive value
6. B = Sliced Bucket
$$p(t, s) = \sum_{B} p(t, B)\, p(s \mid t, B) \tag{4}$$
$$p(t, B) = \frac{f(t, B)}{f(t)} \tag{5}$$
where $f(t) = \sum_{B} f(t, B)$ and $p(s \mid t, B) = D(t, B)[s]$,
D(t, B) = the distribution of the candidate sensitive values in B,
D(t, B)[s] = the probability of the sensitive value s in the distribution.
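As a worked illustration of equations (4) and (5), the short Python sketch below combines the per-bucket matching values f(t, B) with the per-bucket distributions D(t, B) to obtain p(t, s). The function name `sensitive_probability`, the bucket labels, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sensitive_probability(f_tb, dist, s):
    """p(t, s) = sum over B of p(t, B) * D(t, B)[s], with p(t, B) = f(t, B) / f(t).

    f_tb: {bucket: f(t, B)}; dist: {bucket: {sensitive value: probability}}.
    """
    f_t = sum(f_tb.values())  # f(t) = sum over B of f(t, B)
    if f_t == 0:
        return 0.0  # t matches no bucket
    return sum((f / f_t) * dist[b].get(s, 0.0) for b, f in f_tb.items())

# Hypothetical values for one tuple t and two buckets.
f_tb = {"B1": 0.5, "B2": 0.25}
dist = {
    "B1": {"Flu": 0.5, "Cancer": 0.5},
    "B2": {"Flu": 1.0},
}
p_flu = sensitive_probability(f_tb, dist, "Flu")
p_cancer = sensitive_probability(f_tb, dist, "Cancer")
```

With these numbers, f(t) = 0.75, so p(t, B1) = 2/3 and p(t, B2) = 1/3. This gives p(t, Flu) = 2/3 · 0.5 + 1/3 · 1.0 = 2/3 and p(t, Cancer) = 1/3.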
Tuple Partitioning
In the tuple partitioning phase, tuples are partitioned into buckets. Here the
Mondrian algorithm is modified for tuple partitioning. Unlike Mondrian k-anonymity, no
generalization is applied to the tuples; Mondrian is used only for the purpose of
partitioning tuples into buckets. The algorithm maintains two data structures: (1) a
queue of buckets Q and (2) a set of sliced buckets SB. Initially, Q contains only one
bucket, which includes all tuples, and SB is empty. In each iteration, the algorithm
removes a bucket from Q and splits it into two buckets. If the sliced table after the
split satisfies l-diversity, the algorithm puts the two buckets at the end of the queue
Q; otherwise, the bucket cannot be split any further and the algorithm puts it into SB.
When Q becomes empty, the sliced table has been computed. The set of sliced buckets is
SB. The main part of the tuple-partition algorithm is to check whether a sliced table
satisfies l-diversity.
Algorithm tuple-partition(T, l)
1. Q = {T}; SB = ∅
2. while Q is not empty
3. remove the first bucket B from Q; Q = Q − {B}
4. split B into two buckets B1 and B2
5. if diversity-check(T, Q ∪ {B1, B2} ∪ SB, l)
6. Q = Q ∪ {B1, B2}
7. else SB = SB ∪ {B}
8. return SB
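The pseudocode above can be sketched in Python as follows. Two parts are simplifying assumptions rather than the method of this work: `split_by_median` splits a bucket at the median of a single attribute (here "Age"), and `distinct_l_diversity` only requires at least l distinct sensitive values per bucket instead of the probabilistic check based on p(t, s). All function names and the sample records are hypothetical.

```python
from collections import deque

def split_by_median(bucket, key="Age"):
    """Split a bucket into two non-empty halves, or return None if too small."""
    if len(bucket) < 2:
        return None
    ordered = sorted(bucket, key=lambda r: r[key])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def distinct_l_diversity(buckets, l, sensitive="Disease"):
    """Simplified check: every bucket holds at least l distinct sensitive values."""
    return all(len({r[sensitive] for r in b}) >= l for b in buckets)

def tuple_partition(table, l, split=split_by_median, check=distinct_l_diversity):
    queue = deque([list(table)])  # Q: buckets that may still be split
    sliced = []                   # SB: finished buckets
    while queue:
        bucket = queue.popleft()
        halves = split(bucket)
        # Keep the split only if the whole candidate table stays l-diverse.
        if halves and check(list(queue) + list(halves) + sliced, l):
            queue.extend(halves)
        else:
            sliced.append(bucket)
    return sliced

# Hypothetical records for illustration.
table = [
    {"Age": 20, "Disease": "Flu"}, {"Age": 41, "Disease": "Cancer"},
    {"Age": 26, "Disease": "Cancer"}, {"Age": 23, "Disease": "Flu"},
    {"Age": 29, "Disease": "Diabetes"}, {"Age": 32, "Disease": "HyperT"},
]
buckets = tuple_partition(table, l=2)
```

For this sample and l = 2, the first split into two buckets of three tuples is accepted, and any further split would leave a bucket with a single disease, so both buckets are moved to the result.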
Clustering
Clustering is the process of grouping a set of objects into classes or clusters so that
objects within a cluster are similar to one another but dissimilar to objects in other
clusters. K-means clustering and Partitioning Around Medoids (PAM) are well-known
techniques for performing non-hierarchical clustering. K-means clustering finds the
centroids, where the coordinate of each centroid is the mean of the coordinates of the
objects in the cluster, and assigns every object to the nearest centroid.
K-medoids algorithm
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm
and the medoid shift algorithm. K-medoids algorithms partition the dataset into
groups and attempt to minimize the distance between the points labeled as belonging to a
cluster and the point designated as the center of that cluster. In contrast to the
k-means algorithm, k-medoids chooses data points as centers and works with an arbitrary
matrix of distances between data points.
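A minimal Python sketch of k-medoids on a precomputed distance matrix is shown below. It uses the simple alternating variant (assign each point to its nearest medoid, then re-pick each medoid inside its cluster) rather than the full PAM swap search, and it initialises the medoids naively with the first k points. Both choices, the function name `k_medoids`, and the one-dimensional sample points are assumptions made for illustration.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster n points given an n x n distance matrix; return (medoids, labels)."""
    n = len(dist)
    medoids = list(range(k))  # naive initialisation (assumption)
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels

# Hypothetical one-dimensional points; any distance matrix would work.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(dist, k=2)
```

Because only pairwise distances are needed, the same routine could be applied to attributes instead of points, for example with a distance derived from their association coefficients.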
6.6 Time Complexity
The time complexity of Mondrian is O(n log n), whereas the alternate tuple partitioning
algorithm presented here takes only O(n) time. The diversity check algorithm is the
same as in slicing, except that the computation of p(t, B) and D(t, B) requires the system
to calculate the total number of possible tuples generated in each bucket.
Certainty analysis is an ongoing process. By doing certainty analysis, researchers get an
idea of the uniqueness of their research work. Certainty analysis leads to risk analysis.
Overlapping slicing with multiple sensitive attributes is a new anonymization technique
for privacy-preserving data publishing. It handles multiple sensitive attributes and gives
access to multiple domains.
7 Testing
7.1 Introduction
Testing checks the performance of the system under various scenarios and helps guide
periodic updates. It is often referred to as verification and validation: a set of
investigative activities that can be planned in advance and conducted systematically to
assure the stakeholders that the system fulfills all the requirements gathered during the
requirement-gathering phase. Verification refers to the set of activities that ensure that
the software correctly implements the specified functionality. Validation refers to a set of
activities, built around a matrix, which ensure that the functionality implemented by the
system is traceable to customer requirements.
• White box testing - White box testing is testing in which the software tester
has knowledge of the inner workings, structure and language of the software, or at least
its purpose.
• Black box testing - Black box testing is testing the software without any knowledge
of the inner workings, structure or language of the module being tested. The tester
cannot see the internal workings.
• GUI testing - GUI testing is the process of testing a product's graphical user interface
to ensure that it meets its written specifications. This is normally done through a
variety of test cases. When generating a set of test cases, test designers must be certain
that their suite covers all the functionality of the system.
7.2 Objective
The Software Test Plan (STP) is designed to test the module for performance degradation
under stress, to uncover bugs in the system so that any flaws in logic can be set right,
and to check the logical flow from one module to another within the system.
Unit testing is a method by which individual units of source code, that is, sets of one or
more computer program modules together with associated control data, usage procedures
and operating procedures, are tested to determine whether they are fit for use. The goal
of unit testing is to isolate each part of the program and show that the individual parts
are correct. A unit test provides a strict, written contract that the piece of code must
satisfy. Units in the proposed system include the display-result form, command buttons, etc.
Sr.No. | Test case name | Test prerequisite | Input | Expected output | Actual output | Result
-------------------------------------------------------------------------------------------------
1 | Welcome page | A page with administrator login and agent login | Source code | Data should be in proper format | Welcome page in proper format | Pass
2 | New user registration | All details should be filled | User details | All details should be filled | Registration | Pass
3 | Administrator login | Valid username and password | Username and password | All details should be filled correctly | Administrator login | Pass
4 | Return to welcome page | Logout from the administrator | Logout | Return to welcome page | Welcome page | Pass
5 | User login | Valid username and password | Username and password | All details should be filled correctly | User login | Pass
6 | Browse button | Browse dataset | Dataset with different attribute | Fail to extract dataset | Fail to extract dataset | Pass
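The login test cases in the table above (cases 3 and 5) could be automated along the following lines. This is only a sketch under stated assumptions: validate_login is a hypothetical stand-in for the system's login check, and the account data is invented for illustration, since the actual units are forms and buttons whose code is not shown here.

```python
import unittest


def validate_login(username, password, accounts):
    # Hypothetical unit under test: accept only a non-empty username and
    # password that match a stored account.
    return bool(username) and bool(password) and accounts.get(username) == password


class LoginTests(unittest.TestCase):
    # Invented account store used only by these tests.
    accounts = {"admin": "s3cret"}

    def test_valid_credentials_are_accepted(self):
        self.assertTrue(validate_login("admin", "s3cret", self.accounts))

    def test_wrong_password_is_rejected(self):
        self.assertFalse(validate_login("admin", "wrong", self.accounts))

    def test_missing_details_are_rejected(self):
        self.assertFalse(validate_login("", "", self.accounts))
```

Saved as a module, the cases can be run with python -m unittest, and each test method plays the role of one row of the table: a prerequisite, an input and an expected output.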
It takes as its input modules that have been unit tested, groups them in larger aggre-
gates, applies tests defined in an integration test plan to those aggregates and delivers
as its output the integrated system ready for system testing. The following table 13 of
A data set is a collection of data stored in a relational database whose schema is
highly normalized. Many data mining algorithms require the data as a single data set in
horizontal layout. The horizontal layout, that is, the point-dimension,
observation-variable, or instance-feature form, is the standard form required by most
data mining algorithms.
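As an illustration of the horizontal layout, the sketch below pivots normalized records into one row per instance. It assumes, purely for the example, that the normalized data arrives as (instance id, attribute, value) triples; the function name to_horizontal and the sample values are hypothetical.

```python
def to_horizontal(triples):
    # Collect all attribute values of an instance into a single row (dict).
    rows = {}
    for instance_id, attribute, value in triples:
        rows.setdefault(instance_id, {"id": instance_id})[attribute] = value
    return list(rows.values())


# Invented sample of normalized records.
triples = [
    (1, "age", 23), (1, "zipcode", "47906"),
    (2, "age", 31), (2, "zipcode", "47302"),
]
wide = to_horizontal(triples)
```

Each resulting row is one observation with one column per variable, which is the form a clustering or slicing routine would consume.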
8.2 Result
The following figure shows the main page of the system and the loading of a dataset
that contains various attributes.
Future research in this area can extend the notion of improved slicing to datasets
that must satisfy stricter privacy requirements such as t-closeness. The effect of the
number of released columns on data privacy and utility should also be analysed further.
References
[1] Tiancheng Li, Ninghui Li, Jia Zhang, and Ian Molloy, "Slicing: A New Approach
for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data
Engineering, vol. 24, no. 3, March 2012.
[2] Neha V. Mogre, Girish Agarwal, and Pragati Patil, "A Review On Data Anonymization
Technique For Data Publishing," International Journal of Engineering Research &
Technology (IJERT), vol. 1, issue 10, December 2012, ISSN: 2278-0181.
[5] Amar Paul Singh and Dhanshri Parihar, "A Review of Privacy Preserving Data
Publishing Technique," International Journal of Engineering Research Management
Technology, vol. 2, issue 6, June 2013, ISSN: 2278-9359.
[9] Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu, "Privacy
Preserving Data Publishing Concepts and Techniques," Data Mining and Knowledge
Discovery Series, 2010.
[10] Gabriel Ghinita, Panos Kalnis, and Yufei Tao, "Anonymous Publication of Sensitive
Transactional Data," IEEE Transactions on Knowledge and Data Engineering, vol. 23,
no. 2, pp. 161-174, February 2011.
[11] Lan Sun, Yilei Wang, and Yingjie Wu, "A Survey of Transaction Data Anonymous
Publication," IEEE Symposium on Robotics and Applications, 2012.
[12] Jinfei Liu, Jun Luo, and Joshua Zhexue Huang, "Rating: Privacy Preservation for
Multiple Attributes with Different Sensitivity Requirements," IEEE International
Conference on Data Mining Workshops, 2011.
[16] Yeye He and Jeffrey F. Naughton, "Anonymization of Set-Valued Data via Top-Down,
Local Generalization," ACM, VLDB, August 2009.
[17] G. Ghinita, Y. Tao, and P. Kalnis, "On the anonymization of sparse high-dimensional
data," ICDE, pp. 715-724, 2008.
[18] T. Li and N. Li, "On the trade-off between privacy and utility in data publishing,"
KDD, pp. 517-526, 2009.
[19] Y. Xu, K. Wang, A. W.-C. Fu, and P. S. Yu, "Anonymizing transaction databases
for publication," KDD, pp. 767-775, 2008.