ANONYMITY MODEL
PROTECTING
PRIVACY
Project Submitted in Partial Fulfillment of the Requirement for the
Award of Degree of
Bachelor of Technology in Computer Science & Engineering
Submitted By
Anubhav Aggarwal (87/CSE/2K7)
Saurav Suman (46/CSE/2K7)
Ravi Rajak (07/CSE/2K7)
Ashutosh Kr. Jha (41/CSE/2K7)
Under Supervision of:
Prof. Binod Kumar
Department of Computer Science and Engineering
Cambridge Institute of Technology, Ranchi
(2011)
ACKNOWLEDGEMENTS
We are pleased to acknowledge Prof. Binod Kumar for his invaluable guidance during the
course of this project work.
We extend our sincere thanks to Mr. Arshad Usmani, who helped us continuously
throughout the project; without his guidance, this project would have been an uphill
task.
We are also grateful to other members of the CIT team who cooperated with us
regarding some issues.
We would also like to thank the UCI Machine Learning Repository for publishing very
useful datasets from different organizations under the Open Source banner, which
greatly helped us in building the database part.
Last but not least, we thank our friends, who also cooperated with us nicely for the
smooth development of this project.
July 13th, 2011
Anubhav, Saurav, Ravi, Ashutosh
TABLE OF CONTENTS
Abstract 5
1. Introduction 6
1.1. General k-anonymity model 7
2. Motivation 8
2.1. Statistical databases 8
2.2. Multilevel databases 8
2.3. Computer security is not privacy protection 10
2.4. Multiple queries can leak inference 10
3. Preliminaries 11
3.1. Basic Concepts 11
3.2. Existing Techniques 15
4. Anonymization and Clustering 16
4.1. Categorization of major clustering methods 16
5. Definitions (frequently used terms in clustering) 20
5.1. Cluster 20
5.2. Distance between Two Clusters 20
5.3. Similarity 20
5.4. Average Similarity 21
5.5. Threshold 21
5.6. Similarity Matrix 21
5.7. Dissimilarity Coefficient 21
5.8. Cluster Seed 21
6. Real-world Applications of Clustering 22
6.1. Similarity searching in Medical Image Database 22
6.2. Data Mining 23
6.3. Windows NT 24
6.4. Other applications 25
7. Existing Theorem used in this project 26
7.1. Distance and Cost function 27
8. Proposed Anonymization Algorithm 33
9. Java Code for implementing proposed algorithm 37
10. Project Output Screen 64
11. Experimental Results 65
11.1. Experimental Setup 65
12. Conclusions 66
12.1. Clustering-Based Approaches 67
13. References
Abstract
k-anonymity is a model that addresses the question, "How can a data holder
release a version of its private data with scientific guarantees that the
individuals who are the subjects of the data cannot be re-identified while the data
remains practically useful?" [13] For instance, a medical institution may want to
release a table of medical records. Even though the names of the individuals
can be replaced with dummy identifiers, some set of attributes, called the quasi-
identifier, can still leak confidential information. For instance, the birth date,
zip code and gender attributes in the disclosed table can uniquely determine an
individual. By joining such a table with some other publicly available information source,
like a voter's list table, which consists of records containing the attributes that make up
the quasi-identifier as well as the identities of individuals, the medical information can
easily be linked to individuals. k-anonymity prevents such a privacy breach by
ensuring that each individual record can only be released if there are at least k − 1 other
(distinct) individuals whose associated records are indistinguishable from it in
terms of their quasi-identifier values.
k-anonymization techniques have been the focus of intense research in the last few
years. An important requirement for such techniques is to ensure anonymization of
data while at the same time minimizing the information loss resulting from data
modification.
In this paper we propose an approach that uses the idea of clustering to minimize
information loss and thus ensure good data quality. The key observation here is that
data records that are naturally similar to each other should be part of the same
equivalence class. We thus formulate a specific clustering problem, referred to as the
k-member clustering problem. We prove that this problem is NP-hard and present a
greedy heuristic, the complexity of which is O(n²).
1. Introduction
A recent approach addressing data privacy relies on the notion of k-anonymity. In this
approach, data privacy is guaranteed by ensuring that any record in the released data is
indistinguishable from at least (k − 1) other records with respect to a set of attributes
called the quasi-identifier. Although the idea of k-anonymity is conceptually
straightforward, the computational complexity of finding an optimal solution for the
k-anonymity problem has been shown to be NP-hard, even when one considers only cell
suppression. The k-anonymity problem has recently drawn considerable interest from the
research community, and a number of algorithms have been proposed. Current
solutions, however, suffer from high information loss, mainly due to reliance on predefined
generalization hierarchies or a total order imposed on each attribute domain.
The main goal of our work is to develop a new k-anonymization approach that
addresses such limitations. The key idea underlying our approach is that the
k-anonymization problem can be viewed as a clustering problem. Intuitively, the k-anonymity
requirement can be naturally transformed into a clustering problem where we want to
find a set of clusters (i.e., equivalence classes), each of which contains at least k
records. In order to maximize data quality, we also want the records in a cluster to be
as similar to each other as possible. This ensures that less distortion is required when the
records in a cluster are modified to have the same quasi-identifier value. We thus
formulate a specific clustering problem, which we call the k-member clustering problem. We
prove that this problem is NP-hard and present a greedy algorithm which runs in
O(n²) time. Although our approach does not rely on generalization hierarchies, if there
exist some natural relations among the values in a domain, our algorithm can
incorporate such information to find more desirable solutions. We note that while
many quality metrics have been proposed for hierarchy-based generalization, a
metric that precisely measures the information loss introduced by hierarchy-free
generalization has not yet been introduced. For this reason, we define a data quality
metric for hierarchy-free generalization, which we call the information loss metric. We
also show that with a small modification, our algorithm is able to reduce classification
errors effectively.
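The greedy O(n²) strategy outlined above can be illustrated with a minimal sketch. This is our own simplification, not the project's Section 9 code: it assumes numeric quasi-identifiers, uses plain Euclidean distance to a cluster's seed record rather than a full information-loss metric, and the class and method names are illustrative.

```java
import java.util.*;

// Sketch of the greedy k-member clustering idea: repeatedly pick a seed
// record, then greedily add its (k-1) nearest remaining records to form
// one equivalence class of size k.
public class GreedyKMember {
    // Euclidean distance over numeric quasi-identifier values.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Partition records into clusters of size >= k (leftovers join the last cluster).
    static List<List<double[]>> cluster(List<double[]> records, int k) {
        List<double[]> pool = new ArrayList<>(records);
        List<List<double[]>> clusters = new ArrayList<>();
        while (pool.size() >= k) {
            List<double[]> c = new ArrayList<>();
            c.add(pool.remove(0));            // seed record of this cluster
            while (c.size() < k) {            // add the nearest remaining record
                double[] seed = c.get(0);
                int best = 0;
                for (int i = 1; i < pool.size(); i++)
                    if (dist(pool.get(i), seed) < dist(pool.get(best), seed)) best = i;
                c.add(pool.remove(best));
            }
            clusters.add(c);
        }
        if (!pool.isEmpty()) clusters.get(clusters.size() - 1).addAll(pool);
        return clusters;
    }

    public static void main(String[] args) {
        // Four records with quasi-identifiers (age, zip): two natural pairs.
        List<double[]> recs = Arrays.asList(
            new double[]{25, 83100}, new double[]{26, 83100},
            new double[]{12, 82530}, new double[]{16, 82530});
        System.out.println(cluster(new ArrayList<>(recs), 2).size()); // 2
    }
}
```

The nested nearest-neighbor search over the pool is what gives the quadratic running time mentioned above.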
The remainder of this paper is organized as follows. We review the basic concepts
of the k-anonymity model and survey existing techniques. We formally define the
problem of k-anonymization as a clustering problem and introduce our approach. Then
we evaluate our approach based on the experimental results.
1.1. General k-anonymity model
k-anonymity is a model that addresses the question, "How can a data holder
release a version of its private data with scientific guarantees that the
individuals who are the subjects of the data cannot be re-identified while the data
remains practically useful?". For instance, a medical institution may want to
release a table of medical records. Even though the names of the individuals
can be replaced with dummy identifiers, some set of attributes, called the quasi-
identifier, can still leak confidential information. For instance, the birth date,
zip code and gender attributes in the disclosed table can uniquely determine an
individual. By joining such a table with some other publicly available information source,
like a voter's list table, which consists of records containing the attributes that make up
the quasi-identifier as well as the identities of individuals, the medical information can
easily be linked to individuals. k-anonymity prevents such a privacy breach by
ensuring that each individual record can only be released if there are at least k − 1 other
(distinct) individuals whose associated records are indistinguishable from it in
terms of their quasi-identifier values.
2. Motivation
The problem of releasing a version of privately held data so that the individuals who
are the subjects of the data cannot be identified is not a new problem. There are
existing works to consider in the statistics community on statistical databases and in the
computer security community on multilevel databases. However, none of these
works provide solutions to the broader problems experienced in today's data-rich
setting.
2.1. Statistical databases
Federal and state statistics offices around the world have traditionally been concerned with
the release of statistical information about all aspects of the populace. But like other
data holders, statistics offices are also facing tremendous demand for person-specific
data for applications such as data mining, cost analysis, fraud detection and
retrospective research. But many of the established statistical database techniques,
which involve various ways of adding noise to the data while still maintaining some
statistical invariant, often destroy the integrity of records, or tuples; so, for many
new uses of data, these established techniques are not appropriate. Willenborg and De
Waal provide more extensive coverage of traditional statistical techniques.
2.2. Multilevel databases
Another related area is aggregation and inference in multilevel databases, which
concerns restricting the release of lower classified information such that higher
classified information cannot be derived. Denning and Lunt described a multilevel
relational database system (MDB) as having data stored at different security
classifications and users having different security clearances. Su and Ozsoyoglu
formally investigated inference in MDB. They showed that eliminating precise
inference compromise due to functional dependencies and multivalued dependencies
is NP-complete. By extension of this work, the precise elimination of all inferences with
respect to the identities of the individuals whose information is included in person-
specific data is typically impossible to guarantee. Intuitively this makes sense, because
the data holder cannot consider a priori every possible attack. In trying to produce
anonymous data, the work that is the subject of this paper seeks primarily to protect
against known attacks. The biggest problems result from inferences that can be
drawn after linking the released data to other knowledge, so in this work, it is the
ability to link the result to foreseeable data sources that must be controlled.
Many aggregation inference problems can be solved by database design, but this
solution is not practical in today's data-rich setting. In today's environment, information
is often divided and partially replicated among multiple data holders, and the data
holders usually operate autonomously in making decisions about how data will be
released. Such decisions are typically made locally, with incomplete knowledge of how
sensitive other holders of the information might consider replicated data. For example,
when somewhat aged information on joint projects is declassified differently by the
Department of Defense than by the Department of Energy, the overall declassification
effort suffers; using the two partial releases, the original may be reconstructed in its
entirety. In general, systems that attempt to produce anonymous data must operate
without the degree of omniscience and level of control typically available in the
traditional aggregation problem.
In both aggregation and MDB, the primary technique used to control the flow of
sensitive information is suppression, where sensitive information, and all information
that allows the inference of sensitive information, is simply not released.
Suppression can drastically reduce the quality of the data, and in the case of
statistical use, overall statistics can be altered, rendering the data practically
useless.
When protecting national interests, not releasing the information at all may be possible,
but the greatest demand for person-specific data is in situations where the data holder
must provide adequate protections while keeping the data useful, such as sharing
person-specific medical data for research and survey purposes.
2.3. Computer security is not privacy protection
An area that might appear to share a common ancestry with the subject of this paper is
access control and authentication, which are traditional areas associated with
computer security. Work in this area ensures that the recipient of information has the
authority to receive that information. While access control and authentication
protections can safeguard against direct disclosures, they do not address disclosures
based on inferences that can be drawn from released data. The more insidious
problem in the work that is the subject of this paper is not so much whether the
recipient can get access to the information, but what values will constitute the
information the recipient receives. A general doctrine of the work presented herein is
to release all the information, but to do so in such a way that the identities of
the people who are the subjects of the data (or other sensitive properties found in the
data) are protected. Therefore, the goal of the work presented in this paper lies
outside of traditional work on access control and authentication.
2.4. Multiple queries can leak inference
Denning and others were among the first to explore inferences realized from
multiple queries to a database. For example, consider a table containing only
(physician, patient, medication). A query listing the patients seen by each
physician, i.e., a relation R(physician, patient), may not be sensitive. Likewise, a query
itemizing medications prescribed by each physician may also not be sensitive. But
the query associating patients with their prescribed medications may be
sensitive, because medications typically correlate with diseases. One common
solution, called query restriction, prohibits queries that can reveal sensitive
information. This is effectively realized by suppressing all inferences to sensitive
data. In contrast, this work poses a real-time solution to this problem by advocating
that the data first be rendered sufficiently anonymous, and the resulting data then
used as the basis on which queries are processed. Doing so typically retains far
more usefulness in the data because the resulting release is often less distorted.
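The physician/patient/medication example can be made concrete with a small sketch. This is our own illustration (the class, names and data are hypothetical): two individually harmless query results, R1(physician, patient) and R2(physician, medication), are joined on the shared physician attribute to suggest patient-medication links.

```java
import java.util.*;

// Illustration of the inference in Section 2.4: joining two separately
// released query results on their common attribute (physician) can
// suggest sensitive patient-medication associations.
public class QueryInference {
    // Nested-loop join of R1(physician, patient) with R2(physician, medication),
    // producing inferred (patient, medication) pairs.
    static List<String[]> joinOnPhysician(List<String[]> r1, List<String[]> r2) {
        List<String[]> out = new ArrayList<>();
        for (String[] a : r1)
            for (String[] b : r2)
                if (a[0].equals(b[0])) out.add(new String[]{a[1], b[1]});
        return out;
    }

    public static void main(String[] args) {
        List<String[]> seen = new ArrayList<>();
        seen.add(new String[]{"Dr. A", "Patient X"});   // from R1
        List<String[]> meds = new ArrayList<>();
        meds.add(new String[]{"Dr. A", "Insulin"});     // from R2
        for (String[] link : joinOnPhysician(seen, meds))
            System.out.println(link[0] + " -> " + link[1]); // Patient X -> Insulin
    }
}
```

Query restriction would block one of the two queries outright; the approach advocated above instead anonymizes the data before any query is answered.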
In summary, the dramatic increase in the availability of person-specific data from
autonomous data holders has expanded the scope and nature of inference control
problems and strained established operating practices. The goal of this work is to
provide a model for understanding, evaluating and constructing computational systems
that control inferences in this setting.
3. Preliminaries
3.1. Basic Concepts
The k-anonymity model assumes that person-specific data are stored in a table (or a
relation) of columns (or attributes) and rows (or records). The process of anonymizing
such a table starts with removing all the explicit identifiers, such as name and SSN, from
the table. However, even though a table is free of explicit identifiers, some of the
remaining attributes in combination could be specific enough to identify individuals if the
values are already known to the public. For example, as shown by Sweeney [1, 2, 3],
most individuals in the United States can be uniquely identified by a set of attributes
such as {ZIP, gender, date of birth}. Thus, even if each attribute alone is not specific
enough to identify individuals, a group of certain attributes together may identify a
particular individual. The set of such attributes is called the quasi-identifier.
The main objective of the k-anonymity model is thus to transform a table so that no
one can make high-probability associations between records in the table and the
corresponding entities. In order to achieve this goal, the k-anonymity model requires
that any record in a table be indistinguishable from at least (k − 1) other records with
respect to the predetermined quasi-identifier. A group of records that are
indistinguishable from each other is often referred to as an equivalence class. By
enforcing the k-anonymity requirement, it is guaranteed that even though an
adversary knows that a k-anonymous table contains the record of a particular
individual and also knows some of the quasi-identifier
ZIP Gender Age Disease
831001 Male 25 Flu
825303 Female 12 Obesity
834009 Male 34 Cancer
831001 Male 26 HIV+
825303 Male 16 Cancer
834009 Male 32 Diabetes
825303 Female 26 Obesity
831001 Male 27 Flu
834009 Female 31 Flu
Fig. 1: Patient Table
ZIP Gender Age Diagnosis
83100* Person [25-30] Flu
82530* Person [10-15] Obesity
83400* Person [30-35] Cancer
83100* Person [25-30] HIV+
82530* Person [15-20] Cancer
83400* Person [30-35] Diabetes
82530* Person [25-30] Obesity
83100* Person [25-30] Flu
83400* Person [30-35] Flu
Fig. 2: Anonymous Patient Table
attribute values of the individual, he/she cannot determine which record in the table
corresponds to the individual with a probability greater than 1/k. For example, a
3-anonymous version of the table in Fig. 1 is shown in Fig. 2.
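The 1/k guarantee above can be checked mechanically. Below is a minimal sketch, our own illustrative helper rather than part of the project's Section 9 code, that groups rows by their quasi-identifier values and verifies that every group contains at least k rows.

```java
import java.util.*;

// Check whether a table satisfies k-anonymity: every combination of
// quasi-identifier values must occur in at least k rows.
public class KAnonCheck {
    static boolean isKAnonymous(String[][] rows, int[] quasiCols, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {
            StringBuilder key = new StringBuilder();
            for (int c : quasiCols) key.append(row[c]).append('|'); // group key
            counts.merge(key.toString(), 1, Integer::sum);
        }
        for (int n : counts.values()) if (n < k) return false;     // group too small
        return true;
    }

    public static void main(String[] args) {
        // Rows in the style of Fig. 2: (ZIP, Gender, Age range, Diagnosis).
        String[][] table = {
            {"83100*", "Person", "[25-30]", "Flu"},
            {"83100*", "Person", "[25-30]", "HIV+"},
            {"83100*", "Person", "[25-30]", "Flu"},
            {"82530*", "Person", "[10-15]", "Obesity"},
        };
        // The first three rows form one quasi-identifier group of size 3,
        // but the last row is alone, so this table is not 3-anonymous.
        System.out.println(isKAnonymous(table, new int[]{0, 1, 2}, 3)); // false
    }
}
```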
The k-anonymity model is an approach to protect data from individual
identification. It works by ensuring that each record of a table is identical to at
least k − 1 other records with respect to a set of privacy-related attributes,
called quasi-identifiers, that could potentially be used to identify individuals
by linking these attributes to external data sets. For example, consider the
hospital data in Table 1, where the attributes Zip Code, Gender and Age are
regarded as quasi-identifiers. Table 2 gives a 3-anonymized version of the
table in Table 1, where anonymization is achieved via generalization at the
attribute level, i.e., if two records contain the same value at a quasi-identifier,
they will be generalized to the same value at the quasi-identifier as well. Table 3
gives another 3-anonymized version of the table in Table 1, where
anonymization is achieved via generalization at the cell level, i.e., two cells with the
same value could be generalized to different values (e.g., value 75275 in the Zip
Code column and value Male in the Gender column).
Table 1: Patient records of a hospital
Zip Code Gender Age Disease Expense
75275 Male 22 Flu 100
75277 Male 23 Cancer 3000
75278 Male 24 HIV+ 5000
75275 Male 33 Diabetes 2500
75275 Female 38 Diabetes 2800
75275 Female 36 Diabetes 2600
Table 2: Anonymization at attribute level
Zip Code Gender Age Disease Expense
7527* Person [21-30] Flu 100
7527* Person [21-30] Cancer 3000
7527* Person [21-30] HIV+ 5000
7527* Person [31-40] Diabetes 2500
7527* Person [31-40] Diabetes 2800
7527* Person [31-40] Diabetes 2600

Table 3: Anonymization at cell level
Zip Code Gender Age Disease Expense
7527* Male [21-25] Flu 100
7527* Male [21-25] Cancer 3000
7527* Male [21-25] HIV+ 5000
75275 Person [31-40] Diabetes 2500
75275 Person [31-40] Diabetes 2800
75275 Person [31-40] Diabetes 2600

Because anonymization via generalization at the cell level generates data
that contains different generalization levels within a column, utilizing such data
becomes more complicated than utilizing the data generated via
generalization at the attribute level. However, generalization at the cell level
causes less information loss than generalization at the attribute level. Hence,
as far as data quality is concerned, generalization at the cell level seems to
generate better data than generalization at the attribute level.
Anonymization via generalization at the cell level can proceed in two steps.
First, all records are partitioned into several groups such that each group
contains at least k records. Then, the records in each group are generalized
such that their values at each quasi-identifier are identical. To minimize the
information loss incurred by the second step, the first step should place similar
records (with respect to the quasi-identifiers) in the same group. In the
context of data mining, clustering is a useful technique that partitions records into
clusters such that records within a cluster are similar to each other, while
records in different clusters are as distinct as possible from one another. Hence,
clustering could be used for k-anonymization.
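The two-step process described above can be sketched for a single numeric quasi-identifier. This is our own simplification: sorting stands in for clustering as the grouping step, the record count is assumed to be a multiple of k so no group falls below size k, and the names are illustrative.

```java
import java.util.*;

// Two-step cell-level anonymization sketch: (1) sort by the numeric
// quasi-identifier so similar records end up adjacent and cut into groups
// of k; (2) generalize each group's values to one covering interval.
public class TwoStepAnon {
    static List<String> anonymizeAges(int[] ages, int k) {
        int[] sorted = ages.clone();
        Arrays.sort(sorted);                         // step 1: group similar records
        List<String> out = new ArrayList<>();
        for (int i = 0; i < sorted.length; i += k) {
            int end = Math.min(i + k, sorted.length) - 1;
            String interval = "[" + sorted[i] + "-" + sorted[end] + "]";
            for (int j = i; j <= end; j++) out.add(interval); // step 2: same value per group
        }
        return out;
    }

    public static void main(String[] args) {
        // Ages from Table 1; with k = 3, similar ages share one interval.
        System.out.println(anonymizeAges(new int[]{22, 23, 24, 33, 38, 36}, 3));
        // → [[22-24], [22-24], [22-24], [33-38], [33-38], [33-38]]
    }
}
```

Grouping similar values first is what keeps the generalized intervals narrow, which is exactly the information-loss argument made above.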
3.2. Existing Techniques
The k-anonymity requirement is typically enforced through generalization, where real
values are replaced with "less specific but semantically consistent values". Given a
domain, there are various ways to generalize the values in the domain. Typically,
numeric values are generalized into intervals, and categorical values are generalized into
a set of distinct values (e.g., {USA, Canada}) or a single value that represents such a
set (e.g., North America).
Various generalization strategies have been proposed. A non-overlapping
generalization hierarchy is first defined for each attribute of the quasi-identifier. Then an
algorithm tries to find an optimal (or good) solution which is allowed by such
generalization hierarchies. Note that in these schemes, if a lower-level domain needs to
be generalized to a higher-level domain, all the values in the lower domain are
generalized to the higher domain. This restriction could be a significant drawback in that it
may lead to relatively high data distortion due to unnecessary generalization. On the
other hand, possible generalizations are still limited by the imposed generalization
hierarchies.
Recently, some schemes that do not rely on generalization hierarchies have been
proposed. For instance, LeFevre et al. transform the k-anonymity problem into a partitioning
problem. Specifically, their approach consists of the following two steps. The first step is
to find a partitioning of the d-dimensional space, where d is the number of attributes in
the quasi-identifier, such that each partition contains at least k records. Then the
records in each partition are generalized so that they all share the same quasi-identifier
value. Although shown to be efficient, these approaches have the disadvantage that they
require a total order for each attribute domain. This makes them impractical in most cases
involving categorical data, which have no meaningful order.
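The two typical generalizations mentioned above, numeric values to covering intervals and categorical values to sets of distinct values, can be sketched as follows. The class and method names are our own, not from any cited scheme.

```java
import java.util.*;

// Generalize one equivalence class: a numeric attribute becomes the
// interval covering the whole group; a categorical attribute collapses to
// the shared value, or to the set of distinct values when they differ.
public class Generalize {
    // Numeric attribute -> "[min-max]" interval covering the group.
    static String numericInterval(int[] values) {
        int lo = values[0], hi = values[0];
        for (int v : values) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
        return lo == hi ? String.valueOf(lo) : "[" + lo + "-" + hi + "]";
    }

    // Categorical attribute -> the common value, or the set of distinct values.
    static String categorical(String[] values) {
        Set<String> distinct = new LinkedHashSet<>(Arrays.asList(values));
        return distinct.size() == 1 ? values[0] : distinct.toString();
    }

    public static void main(String[] args) {
        System.out.println(numericInterval(new int[]{22, 23, 24}));     // [22-24]
        System.out.println(categorical(new String[]{"USA", "Canada"})); // [USA, Canada]
    }
}
```

Note how the wider the spread of values within a group, the wider the resulting interval or set, which is why placing similar records together matters for data quality.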
4. Anonymization and Clustering
The key idea underlying our approach is that the k-anonymization problem can be
viewed as a clustering problem. Clustering is the problem of partitioning a set of objects
into groups such that objects in the same group are more similar to each other than to
objects in other groups, with respect to some defined similarity criteria. Intuitively, an
optimal solution of the k-anonymization problem is indeed a set of equivalence
classes such that records in the same equivalence class are very similar to each
other, thus requiring minimum generalization.
4.1. Categorization of major clustering methods
There exist a large number of clustering algorithms in the literature. The choice of
clustering algorithm depends on both the particular purpose and the application. If cluster
analysis is used as a descriptive or exploratory tool, it is possible to try several
algorithms on the same data to see what the data may disclose.
In general, major clustering methods can be classified into the following categories.
1. Partitioning methods.
2. Hierarchical methods.
3. Densitybased methods.
4. Gridbased methods.
5. Modelbased methods.
1. Partitioning methods
Given a database of n objects or data tuples, a partitioning method constructs k
partitions of the data, where each partition represents a cluster, and k ≤ n. That is, it
classifies the data into k groups, which together satisfy the following requirements:
(1) Each group must contain at least one object.
(2) Each object must belong to exactly one group.
Notice that the second requirement is relaxed in some fuzzy partitioning techniques.
Given k, the number of partitions to construct, a partitioning method creates an initial
partitioning. It then uses an iterative relocation technique which attempts to improve the
partitioning by moving objects from one group to another. The general criterion of a
good partitioning is that objects in the same cluster are "close" or related to each other,
whereas objects of different clusters are "far apart" or very different. There are various
other kinds of criteria for judging the quality of partitions.
To achieve global optimality in partitioning-based clustering would require the
exhaustive enumeration of all of the possible partitions. Instead, most applications adopt
one of two popular heuristic methods:
(1) The k-means algorithm, where each cluster is represented by the mean value
of the objects in the cluster.
(2) The k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.
These heuristic clustering methods work well for finding spherical-shaped clusters in
small to medium sized databases. For finding clusters with complex shapes and for
clustering very large data sets, partitioning-based methods need to be extended.
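As a concrete illustration of the k-means idea described above, here is a minimal one-dimensional sketch of our own. Real implementations work in multiple dimensions and test for convergence; this version simply runs a fixed number of assign-and-recompute iterations.

```java
import java.util.*;

// One-dimensional k-means sketch: assign each point to its nearest mean,
// then recompute each mean as the average of its assigned points, repeat.
public class KMeans1D {
    static double[] run(double[] points, double[] means, int iters) {
        double[] m = means.clone();
        for (int it = 0; it < iters; it++) {
            double[] sum = new double[m.length];
            int[] cnt = new int[m.length];
            for (double p : points) {
                int best = 0;                      // index of the nearest mean
                for (int j = 1; j < m.length; j++)
                    if (Math.abs(p - m[j]) < Math.abs(p - m[best])) best = j;
                sum[best] += p; cnt[best]++;
            }
            for (int j = 0; j < m.length; j++)
                if (cnt[j] > 0) m[j] = sum[j] / cnt[j]; // recompute mean of cluster j
        }
        return m;
    }

    public static void main(String[] args) {
        double[] pts = {1, 2, 3, 10, 11, 12};       // two natural clusters
        System.out.println(Arrays.toString(run(pts, new double[]{0, 5}, 10)));
        // → [2.0, 11.0]
    }
}
```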
2. Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data
objects. A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed. The agglomerative
approach, also called the "bottom-up" approach, starts with each object forming a
separate group. It successively merges the objects or groups close to one another, until all
of the groups are merged into one, or until a termination condition holds. The divisive
approach, also called the "top-down" approach, starts with all the objects in the same
cluster. In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step is done, it can never be
undone. This rigidity of the hierarchical method is both the key to its success, because it
leads to smaller computation costs without worrying about a combinatorial number of
different choices, and the key to its main problem, because it cannot correct
erroneous decisions.
It can be advantageous to combine iterative relocation and hierarchical agglomeration
by first using a hierarchical agglomerative algorithm and then refining the result using
iterative relocation. Some scalable clustering algorithms, such as BIRCH and CURE,
have been developed based on such an integrated approach.
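The agglomerative ("bottom-up") approach above can be sketched minimally. This is our own illustration over one-dimensional points with single linkage (closest pair of members); it merges until a target number of clusters remains rather than until a similarity-based termination condition holds.

```java
import java.util.*;

// Agglomerative clustering sketch: start with singleton clusters and
// repeatedly merge the two closest clusters (single linkage, 1-D points).
public class Agglomerative {
    // Single linkage: smallest pairwise distance between the two clusters.
    static double linkage(List<Double> a, List<Double> b) {
        double best = Double.MAX_VALUE;
        for (double x : a) for (double y : b) best = Math.min(best, Math.abs(x - y));
        return best;
    }

    static List<List<Double>> run(double[] points, int targetClusters) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) clusters.add(new ArrayList<>(List.of(p)));
        while (clusters.size() > targetClusters) {
            int bi = 0, bj = 1;                       // indices of closest pair so far
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++)
                    if (linkage(clusters.get(i), clusters.get(j))
                            < linkage(clusters.get(bi), clusters.get(bj))) { bi = i; bj = j; }
            clusters.get(bi).addAll(clusters.remove(bj)); // merge the closest pair
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(run(new double[]{1, 2, 10, 11}, 2));
        // → [[1.0, 2.0], [10.0, 11.0]]
    }
}
```

Notice that once {1, 2} are merged they can never be separated again, which is exactly the rigidity discussed above.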
3. Density-based methods
Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes. Other clustering methods have been developed based on the
notion of density. The general idea is to continue growing the given cluster so long as the
density (number of objects or points) in the "neighborhood" exceeds some threshold,
i.e., for each data point within a given cluster, the neighborhood of a given radius has to
contain at least a minimum number of points. Such a method can be used to filter out
noise (outliers) and discover clusters of arbitrary shape.
DBSCAN is a typical density-based method which grows clusters according to a density
threshold. OPTICS is a density-based method which computes an augmented
clustering ordering for automatic and interactive cluster analysis.
4. Grid-based methods
A grid-based method quantizes the object space into a finite number of cells which form
a grid structure. It then performs all of the clustering operations on the grid structure. The
main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects, and dependent only on the number of cells
in each dimension of the quantized space.
STING is a typical example of a grid-based method. CLIQUE and WaveCluster are two
clustering algorithms which are both grid-based and density-based.
5. Model-based methods
A model-based method hypothesizes a model for each of the clusters, and finds the best
fit of the data to that model. A model-based algorithm may locate clusters by
constructing a density function that reflects the spatial distribution of data points. It can also
lead to a way of automatically determining the number of clusters based on standard
statistics, taking "noise" or outliers into account and thus yielding robust clustering
methods.
Data clustering is a method in which we make clusters of objects that are somehow
similar in characteristics. The criterion for checking the similarity is implementation-
dependent.
Clustering is often confused with classification, but there is a difference between the
two. In classification the objects are assigned to predefined classes, whereas in
clustering the classes are yet to be defined.
Precisely, data clustering is a technique in which information that is logically similar
is physically stored together. In order to increase the efficiency of database systems,
the number of disk accesses must be minimized. In clustering, objects with similar
properties are placed in one class of objects, and a single access to the disk makes the
entire class available.
5. Definitions (frequently used terms in clustering)
In this section some frequently used terms are defined.
5.1. Cluster
A cluster is an ordered list of objects which have some common characteristics. The
objects belong to an interval [a, b], in our case [0, 1].
5.2. Distance between Two Clusters
The distance between two clusters involves some or all elements of the two clusters.
The clustering method determines how the distance should be computed.
5.3. Similarity
A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between
two documents. Typical similarity measures generate values of 0 for documents exhibiting no
agreement among the assigned indexed terms, and 1 when perfect agreement is
detected. Intermediate values are obtained for cases of partial agreement.
5.4. Average Similarity
If the similarity measure is computed for all pairs of documents (Di, Dj) except when
i = j, an average value, the Average Similarity, is obtainable. Specifically,
Average Similarity = CONSTANT × Σ Similar(Di, Dj), where i = 1, 2, ..., n; j = 1, 2, ..., n; and i ≠ j.
5.5. Threshold
The lowest possible input value of similarity required to join two objects in one cluster.
5.6. Similarity Matrix
Similarity between objects calculated by the function SIMILAR(Di, Dj), represented in
the form of a matrix, is called a similarity matrix.
5.7. Dissimilarity Coefficient
The dissimilarity coefficient of two clusters is defined to be the distance between them.
The smaller the value of dissimilarity coefficient, the more similar two clusters are.
5.8. Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e., every
incoming object's similarity is compared with the initiator. The initiator is called the
cluster seed.
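The terms defined above (similarity, threshold, cluster seed) fit together in a simple seed-based clustering loop. This sketch is our own illustration: the similarity function is a toy measure over values in [0, 1], and each incoming object joins the first cluster whose seed it is sufficiently similar to, otherwise it starts a new cluster.

```java
import java.util.*;

// Seed-based (leader) clustering: each incoming object is compared with
// every cluster's seed (its first element); it joins the first cluster
// whose similarity meets the threshold, else it seeds a new cluster.
public class SeedClustering {
    // Toy similarity in [0, 1] for values in [0, 1]: 1 means identical.
    static double similar(double a, double b) { return 1.0 - Math.abs(a - b); }

    static List<List<Double>> run(double[] objects, double threshold) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double obj : objects) {
            List<Double> home = null;
            for (List<Double> c : clusters)
                if (similar(obj, c.get(0)) >= threshold) { home = c; break; } // c.get(0) is the seed
            if (home == null) { home = new ArrayList<>(); clusters.add(home); }
            home.add(obj);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // 0.1 and 0.15 cluster around seed 0.1; 0.9 and 0.85 around seed 0.9.
        System.out.println(run(new double[]{0.1, 0.15, 0.9, 0.85}, 0.8).size()); // 2
    }
}
```

Raising the threshold produces more, tighter clusters; lowering it merges more objects into fewer clusters.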
6. Real-world Applications of Clustering
Data clustering has an immense number of applications in every field of life. One has to
cluster a lot of things on the basis of similarity, either consciously or unconsciously. So
the history of data clustering is as old as the history of mankind.
In the computer field also, the use of data clustering has its own value. Especially in the field of
information retrieval, data clustering plays an important role. Some of the applications
are listed below.
6.1. Similarity searching in Medical Image Database
This is a major application of the clustering technique. In order to detect many diseases,
like tumors, the scanned pictures or X-rays are compared with the existing ones
and the dissimilarities are recognized.
We have clusters of images of different parts of the body. For example, the images of
CT scans of the brain are kept in one cluster. To further arrange things, the images in
which the right side of the brain is damaged are kept in one cluster. Hierarchical
clustering is used. The stored images have already been analyzed, and a record is
associated with each image. In this form a large database of images is maintained using
hierarchical clustering.
Now when a new query image comes, it is first recognized to which particular cluster
this image belongs, and then, by similarity matching with a healthy image of that specific
cluster, the main damaged or diseased portion is recognized. Then the image
is sent to that specific cluster and matched with all the images in that particular cluster.
Now the image with which the query image has the most similarities is retrieved, and
the record associated with that image is also associated with the query image. This means
that the disease of the query image has now been detected.
2
Using this technique and some really precise methods for the pattern matching,
diseases like really fine tumor can also be detected.
So by using clustering an enormous amount of time in finding the exact match from the
database is reduced.
6.2 Data Mining
Another important application of clustering is in the field of data mining, which is defined as follows.
Definition 1: "Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data, using pattern-recognition technologies as well as statistical and mathematical techniques."
Definition 2: Data mining is a "knowledge discovery process of extracting previously unknown, actionable information from very large databases."
Use of Clustering in Data Mining: Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships. This technique supports the development of population segmentation models, such as demographic-based customer segmentation. Additional analyses using standard analytical and other data mining techniques can determine the characteristics of these segments with respect to some desired outcome. For example, the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
As another example, a company that sells a variety of products may need to know about the sales of all of its products in order to determine which products sell extensively and which are lagging. This is done with data mining techniques. But if the system clusters the products that sell poorly, then only that cluster of products has to be checked, rather than comparing the sales figures of all the products. This facilitates the mining process.
6.3 Windows NT
Another major application of clustering is in newer versions of Windows NT. Windows NT uses clustering to determine the nodes that are using the same kind of resources and accumulate them into one cluster; this new cluster can then be controlled as one node.
Other applications
Social network analysis
In the study of social networks, clustering may be used to recognize communities within large groups of people.
Software evolution
Clustering is useful in software evolution, as it helps to reduce legacy properties in code by regrouping functionality that has become dispersed. It is a form of restructuring and hence a way of performing direct preventive maintenance.
Image segmentation
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Data mining
Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres.
Search result grouping
In the process of intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web-based clustering tools, such as Clusty.
Slippy map optimization
Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map, which makes the map faster and reduces visual clutter.
IMRT segmentation
Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based radiation therapy.
Grouping of Shopping Items
Clustering can be used to group all the shopping items available on the web into
a set of unique products. For example, all the items on eBay can be grouped into
unique products.
Recommender systems
Recommender systems are designed to recommend new items based on a
user's tastes. They sometimes use clustering algorithms to predict a user's
preferences based on the preferences of other users in the user's cluster.
Mathematical chemistry
To find structural similarity, etc., for example, 3000 chemical compounds were
clustered in the space of 90 topological indices.
Climatology
To find weather regimes or preferred sea level pressure atmospheric patterns.
Petroleum Geology
Cluster Analysis is used to reconstruct missing bottom hole core data or missing
log curves in order to evaluate reservoir properties.
Physical Geography
The clustering of chemical properties in different sample locations.
Crime Analysis
Cluster analysis can be used to identify areas where there are greater incidences
of particular types of crime. By identifying these distinct areas or "hot spots"
where a similar crime has happened over a period of time, it is possible to
manage law enforcement resources more effectively.
7. Existing Theorem Used in This Project
Typical clustering problems require that a specific number of clusters be found in a solution. The k-anonymity problem, however, does not constrain the number of clusters; instead, it requires that each cluster contain at least k records. Thus, we pose the k-anonymity problem as a clustering problem, referred to as the k-member clustering problem.
Definition 1 (k-member clustering problem): The k-member clustering problem is to find a set of clusters from a given set of n records such that each cluster contains at least k (k ≤ n) data points and the sum of all intra-cluster distances is minimized. Formally, let S be a set of n records and k the specified anonymization parameter. Then the optimal solution of the k-member clustering problem is a set of clusters E = {e_1, ..., e_m} such that:
1. for all i ≠ j in {1, ..., m}, e_i ∩ e_j = ∅,
2. ∪_{i=1,...,m} e_i = S,
3. for every e_i ∈ E, |e_i| ≥ k, and
4. Σ_{l=1,...,m} |e_l| · MAX_{i,j=1,...,|e_l|} Δ(p_(l,i), p_(l,j)) is minimal.
Here |e| is the size of cluster e, p_(l,i) represents the i-th data point in cluster e_l, and Δ(x, y) is the distance between two data points x and y.
Note that in Definition 1 we consider the sum of all intra-cluster distances, where the intra-cluster distance of a cluster is defined as the maximum distance between any two data points in the cluster (i.e., the diameter of the cluster). As we describe in the following section, this sum captures the total information loss, which is the amount of data distortion that generalization introduces into the entire table.
7.1 Distance and Cost Function
At the heart of every clustering problem are the distance functions that measure the dissimilarities among data points and the cost function which the clustering problem tries to minimize. The distance functions are usually determined by the type of data (i.e., numeric or categorical) being clustered, while the cost function is defined by the specific objective of the clustering problem. In this section, we describe our distance and cost functions, which have been specifically tailored to the k-anonymization problem.
As previously discussed, a distance function in a clustering problem measures how dissimilar two data points are. As the data we consider in the k-anonymity problem are person-specific records that typically consist of both numeric and categorical attributes, we need a distance function that can handle both types of data at the same time.
For a numeric attribute, the difference between two values (i.e., |x − y|) naturally describes the dissimilarity (i.e., distance) of the values. This measure is also suitable for the k-anonymization problem. To see this, recall that when records in the same equivalence class are generalized, the generalized quasi-identifier must subsume all the attribute values in the equivalence class. That is, the generalization of two values x and y (with x ≤ y) in a numeric attribute is typically represented as a range [x, y]. Thus, the difference captures the amount of distortion caused by the generalization process to the respective attribute (i.e., the length of the range).
Definition 2 (Distance between two numeric values): Let D be a finite numeric domain. Then the normalized distance between two values v_1, v_2 ∈ D is defined as:
δ_N(v_1, v_2) = |v_1 − v_2| / |D|,
where |D| is the domain size, measured as the difference between the maximum and minimum values in D.
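Definition 2 can be sketched in a few lines; the class and parameter names below are illustrative, not from the report.

```java
// Sketch of Definition 2: normalized distance between two numeric values,
// |v1 - v2| / |D|, where |D| = max(D) - min(D).
public class NumericDistance {
    public static double delta(double v1, double v2, double domainMin, double domainMax) {
        return Math.abs(v1 - v2) / (domainMax - domainMin);
    }
}
```

For example, with an Age domain spanning [20, 60], the values 30 and 40 are at normalized distance 10/40 = 0.25, which is exactly the fraction of the domain that the generalized range [30, 40] would cover.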
For categorical attributes, however, the difference is no longer applicable, as most categorical domains cannot be enumerated in any specific order. The most straightforward solution is to assume that every value in such a domain is equally different from every other; e.g., the distance between two values is 0 if they are the same, and 1 if they are different. However, some domains may have semantic relationships among their values. In such domains, it is desirable to define the distance functions based on the existing relationships. Such relationships can be easily captured in a taxonomy tree. We assume that the taxonomy tree of a domain is a balanced tree whose leaf nodes represent all the distinct values in the domain. For example, Fig. 3 illustrates a natural taxonomy tree for the Country attribute. However, for some attributes such as Occupation, there may not exist any semantic relationship that helps in classifying the domain values. For such domains, all the values are classified under a common value, as in Fig. 4. We now define the distance function for categorical values as follows:
Definition 3 (Distance between two categorical values): Let D be a categorical domain and T_D a taxonomy tree defined for D. The normalized distance between two values v_1, v_2 ∈ D is defined as [3, 5]:
δ_C(v_1, v_2) = H(Λ(v_1, v_2)) / H(T_D)
A taxonomy tree can be considered similar to the generalization hierarchy introduced in the literature. However, we treat the taxonomy tree not as a restriction, but as a user's preference.
Fig. 3 - Taxonomy tree of Country (Country branches into America and Asia; America into North and South; Asia into West and East; with leaves USA, Canada, Brazil, Mexico, Iran, Egypt, India, Pakistan)
Fig. 4 - Taxonomy tree of Occupation (a flat tree: Armed-Forces, Teacher, Doctor, Salesman, Tech-Support all directly under Occupation)
Here Λ(x, y) is the subtree rooted at the lowest common ancestor of x and y, and H(T) represents the height of tree T.
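The taxonomy distance of Definition 3 can be sketched as follows. The tree is hard-coded from Fig. 3, and the class and method names are ours; the code exploits the balanced-tree assumption, under which the height of the subtree rooted at the lowest common ancestor equals the depth of a leaf below that ancestor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of Definition 3 on the Country taxonomy of Fig. 3:
// distance = H(subtree rooted at the LCA of v1 and v2) / H(T).
public class CategoricalDistance {
    // child -> parent edges, hard-coded to mirror part of Fig. 3
    static final Map<String, String> PARENT = Map.of(
        "America", "Country", "Asia", "Country",
        "North", "America", "South", "America",
        "West", "Asia", "East", "Asia",
        "USA", "North", "India", "East", "Iran", "West");

    static List<String> pathToRoot(String v) {
        List<String> path = new ArrayList<>();
        for (String cur = v; cur != null; cur = PARENT.get(cur)) path.add(cur);
        return path; // e.g. USA -> North -> America -> Country
    }

    public static double delta(String v1, String v2, int treeHeight) {
        List<String> p1 = pathToRoot(v1), p2 = pathToRoot(v2);
        // For a balanced tree, H(subtree at LCA) = number of steps from v1 up to the LCA.
        for (int i = 0; i < p1.size(); i++)
            if (p2.contains(p1.get(i))) return (double) i / treeHeight;
        return 1.0; // no common ancestor found (should not happen in one tree)
    }
}
```

With this tree, the values reproduce Example 1 below: India and USA meet only at the root, giving 3/3 = 1, while India and Iran meet at Asia, giving 2/3.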
Example 1: Consider attribute Country and its taxonomy tree in Fig. 3. The distance between India and USA is 3/3 = 1, while the distance between India and Iran is 2/3 ≈ 0.66. On the other hand, for attribute Occupation and its taxonomy tree in Fig. 4, which goes up only one level, the distance between any two values is always 1. Combining the distance functions for both numeric and categorical domains, we define the distance between two records as follows:
Definition 4 (Distance between two records): Let Q_T = {N_1, ..., N_m, C_1, ..., C_n} be the quasi-identifier of table T, where N_i (i = 1, ..., m) is an attribute with a numeric domain and C_j (j = 1, ..., n) is an attribute with a categorical domain. The distance between two records r_1, r_2 ∈ T is defined as:
Δ(r_1, r_2) = Σ_{i=1,...,m} δ_N(r_1[N_i], r_2[N_i]) + Σ_{j=1,...,n} δ_C(r_1[C_j], r_2[C_j]),
where r_i[A] represents the value of attribute A in r_i, and δ_N and δ_C are the distance functions defined in Definitions 2 and 3, respectively.
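As a minimal sketch of Definition 4, consider records with one numeric attribute (Age, with an assumed domain size of 100) and one categorical attribute (Country, treated here with the simple 0/1 distance of a flat taxonomy). The attribute choices and names are illustrative, not from the report.

```java
// Sketch of Definition 4: record distance = sum of normalized numeric
// distances plus sum of categorical distances over the quasi-identifier.
public class RecordDistance {
    static double deltaN(double v1, double v2, double domainSize) {
        return Math.abs(v1 - v2) / domainSize;      // Definition 2
    }
    static double deltaC(String v1, String v2) {
        return v1.equals(v2) ? 0.0 : 1.0;           // Definition 3, flat taxonomy
    }
    public static double delta(double age1, String country1,
                               double age2, String country2) {
        return deltaN(age1, age2, 100.0) + deltaC(country1, country2);
    }
}
```

Two records agreeing on Country but ten years apart in Age are at distance 0.1, while an Age match with differing Country costs the full categorical unit of 1.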
Now we discuss the cost function which the k-member clustering problem tries to minimize. As the ultimate goal of our clustering problem is the k-anonymization of data, we formulate the cost function to represent the amount of distortion (i.e., information loss) caused by the generalization process. Recall that records in each cluster are generalized to share the same quasi-identifier value, which must represent every original quasi-identifier value in the cluster. We assume that numeric values are generalized into a range [min, max] and categorical values into a set that is the union of all distinct values in the cluster. With these assumptions, we define a metric, referred to as the Information Loss metric (IL), that measures the amount of distortion introduced by the generalization process to a cluster.
Definition 5 (Information loss): Let e = {r_1, ..., r_k} be a cluster (i.e., an equivalence class) whose quasi-identifier consists of numeric attributes N_1, ..., N_m and categorical attributes C_1, ..., C_n. Let T_Ci be the taxonomy tree defined for the domain of categorical attribute C_i. Let MIN_Ni and MAX_Ni be the min and max values in e with respect to attribute N_i, and let ∪_Ci be the union set of values in e with respect to attribute C_i. Then the amount of information loss incurred by generalizing e, denoted IL(e), is defined as:
IL(e) = |e| · ( Σ_{i=1,...,m} (MAX_Ni − MIN_Ni)/|N_i| + Σ_{j=1,...,n} H(Λ(∪_Cj))/H(T_Cj) ),
where |e| is the number of records in e, |N| represents the size of numeric domain N, Λ(∪_Cj) is the subtree rooted at the lowest common ancestor of every value in ∪_Cj, and H(T) is the height of taxonomy tree T [3, 4, 6].
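Definition 5 can be sketched for the simplest case: one numeric attribute (Age, with a caller-supplied domain size) and one categorical attribute whose taxonomy is flat with height 1, so that H(Λ(∪_C))/H(T_C) is 0 when all values agree and 1 otherwise. The class and attribute choices are ours, for illustration only.

```java
// Sketch of Definition 5 for a cluster with one numeric attribute and one
// categorical attribute over a flat taxonomy tree of height 1.
public class InfoLoss {
    public static double il(double[] ages, String[] countries, double ageDomainSize) {
        double min = ages[0], max = ages[0];
        for (double a : ages) { min = Math.min(min, a); max = Math.max(max, a); }
        boolean allSame = true;
        for (String c : countries) allSame &= c.equals(countries[0]);
        double numericTerm = (max - min) / ageDomainSize;     // (MAX - MIN) / |N|
        double categoricalTerm = allSame ? 0.0 : 1.0;         // H(subtree)/H(T) for a flat tree
        return ages.length * (numericTerm + categoricalTerm); // |e| * (sum of terms)
    }
}
```

For instance, a 3-record cluster with ages {30, 40, 50} over a domain of size 100 and identical countries loses 3 · (20/100 + 0) = 0.6; introducing a second country adds a full categorical unit per record.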
Using the definition above, the total information loss of the anonymized table is defined as follows:
Definition 6 (Total information loss): Let E be the set of all equivalence classes in the anonymized table AT. Then the amount of total information loss of AT is defined as:
Total-IL(AT) = Σ_{e ∈ E} IL(e).
Recall that the cost function of the k-member clustering problem is the sum of all intra-cluster distances, where the intra-cluster distance of a cluster is defined as the maximum distance between any two data points in the cluster. Now, if we consider how records in each cluster are generalized, minimizing the total information loss of the anonymized table intuitively minimizes the cost function of the k-member clustering problem as well. Therefore, the cost function that we want to minimize in the clustering process is Total-IL.
Theorem: The k-member clustering decision problem is NP-complete.
Proof: That the k-member clustering decision problem is in NP follows from the observation that, if such a clustering scheme is given, verifying that it satisfies the two conditions in Definition 7 can be done in polynomial time.
It has been proved that optimal k-anonymity by suppression is NP-hard, using a reduction from the Edge Partition into Triangles problem. In the reduction, the table to be k-anonymized consists of n records; each record has m attributes, and each attribute takes a value from {0, 1, 2}. The k-anonymization technique used is to suppress some cells in the table. It was shown that determining whether there exists a 3-anonymization of a table by suppressing a certain number of cells is NP-hard.
We observe that the problem in [1] is a special case of the k-member clustering problem in which each attribute is categorical and has a flat taxonomy tree. It thus follows that the k-member clustering problem is also NP-hard. When each attribute has a flat taxonomy tree, the only way to generalize a cell is to the root of the flat taxonomy tree, and this is equivalent to suppressing the cell. Given such a database, the information loss of each record in any generalization is the number of cells in the record that differ from some other record in the equivalence class, which equals the number of cells to be suppressed. Therefore, there exists a k-anonymization with total information loss no more than t if and only if there exists a k-anonymization that suppresses at most t cells.
Faced with the hardness of the problem, we propose a simple and efficient algorithm that finds a solution in a greedy manner. The idea is as follows. Given a set of n records, we first randomly pick a record r_i and make it a cluster e_1. Then we choose a record r_j that makes IL(e_1 ∪ {r_j}) minimal. We repeat this until |e_1| = k. When |e_1| reaches k, we choose a record that is furthest from r_i and repeat the clustering process until fewer than k records are left. We then iterate over these leftover records and insert each record into the cluster for which the increment of information loss is minimal. We provide the core of our greedy k-member clustering algorithm, leaving out some trivial functions, in Figure 5.
8. Proposed Anonymization Algorithm
Armed with the distance and cost functions, we are now ready to discuss the k-member clustering algorithm. As in most clustering problems, an exhaustive search for an optimal solution of the k-member clustering problem is potentially exponential. In order to precisely characterize the computational complexity of the problem, we define the k-member clustering problem as a decision problem as follows.
Definition 7 (k-member clustering decision problem): Given n records, is there a clustering scheme E = {e_1, ..., e_m} such that:
1. |e_i| ≥ k for every i, where 1 ≤ k ≤ n: the size of each cluster is greater than or equal to a positive integer k, and
2. Σ_{i=1,...,m} IL(e_i) < c, c > 0: the Total-IL of the clustering scheme is less than a positive constant c.
Theorem: Let n be the total number of input records and k the specified anonymity parameter. Every cluster that the greedy k-member clustering algorithm finds has at least k records, but no more than 2k − 1 records.
Proof: Let S be the set of input records. As the algorithm finds a cluster with exactly k records as long as the number of remaining records is equal to or greater than k, every cluster contains at least k records. If fewer than k records remain, these leftover records are distributed among the clusters that have already been found. In the worst case, all k − 1 remaining records are added to a single cluster which already contains k records. Therefore, the maximum size of a cluster is 2k − 1.
Greedy k-member clustering algorithm
Function greedy_k_member_clustering (S, k)
Input: a set of records S and a threshold value k.
Output: a set of clusters, each of which contains at least k records.
1. If (|S| <= k)
2.   Return {S};
3. End if;
4. result = Ø; r = a randomly picked record from S;
5. While (|S| >= k)
6.   r = the furthest record from r;
7.   S = S - {r};
8.   c = {r};
9.   While (|c| < k)
10.    r = find_best_record(S, c);
11.    S = S - {r};
12.    c = c U {r};
13.  End while;
14.  result = result U {c};
15. End while;
16. While (|S| != 0)
17.   r = a randomly picked record from S;
18.   S = S - {r};
19.   c = find_best_cluster(result, r);
20.   c = c U {r};
21. End while;
22. Return result;
End;
Function find_best_record (S, c)
Input: a set of records S and a cluster c.
Output: a record r in S such that IL(c U {r}) is minimal.
1. n = |S|; min = infinity; best = null;
2. For (i = 1..n)
3.   r = i-th record in S;
4.   diff = IL(c U {r}) - IL(c);
5.   If (diff < min)
6.     min = diff;
7.     best = r;
8.   End if;
9. End for;
10. Return best;
End;
Function find_best_cluster (C, r)
Input: a set of clusters C and a record r.
Output: a cluster c in C such that IL(c U {r}) is minimal.
1. n = |C|; min = infinity; best = null;
2. For (i = 1..n)
3.   c = i-th cluster in C;
4.   diff = IL(c U {r}) - IL(c);
5.   If (diff < min)
6.     min = diff;
7.     best = c;
8.   End if;
9. End for;
10. Return best;
End;
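The pseudocode can be exercised end to end. The following is a runnable sketch for one-dimensional numeric records, where IL(e) reduces to |e| · (max − min)/|domain|; this reduction, and all class and method names, are ours, chosen only to keep the control flow of the pseudocode visible in a short program.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Runnable sketch of the greedy k-member clustering pseudocode for
// one-dimensional numeric records.
public class GreedyKMember {
    // IL(e) = |e| * (max - min) / |domain| for a single numeric attribute
    static double il(List<Double> c, double domainSize) {
        return c.size() * (Collections.max(c) - Collections.min(c)) / domainSize;
    }

    public static List<List<Double>> cluster(List<Double> records, int k, double domainSize) {
        List<Double> s = new ArrayList<>(records);
        List<List<Double>> result = new ArrayList<>();
        if (s.size() <= k) { result.add(s); return result; }   // lines 1-3
        double r = s.get(0);                                    // line 4 (deterministic pick here)
        while (s.size() >= k) {                                 // lines 5-15
            final double ref = r;
            r = Collections.max(s, Comparator.comparingDouble(x -> Math.abs(x - ref))); // line 6
            s.remove(r);
            List<Double> c = new ArrayList<>();
            c.add(r);
            while (c.size() < k) {                              // lines 9-13: grow to k members
                double best = findBestRecord(s, c, domainSize);
                s.remove(best);
                c.add(best);
            }
            result.add(c);
        }
        while (!s.isEmpty()) {                                  // lines 16-21: distribute leftovers
            double rec = s.remove(s.size() - 1);
            findBestCluster(result, rec, domainSize).add(rec);
        }
        return result;
    }

    // Record in s minimizing IL(c U {rec}) - IL(c)
    static double findBestRecord(List<Double> s, List<Double> c, double domainSize) {
        double best = s.get(0), min = Double.POSITIVE_INFINITY;
        for (double rec : s) {
            List<Double> grown = new ArrayList<>(c);
            grown.add(rec);
            double diff = il(grown, domainSize) - il(c, domainSize);
            if (diff < min) { min = diff; best = rec; }
        }
        return best;
    }

    // Cluster minimizing the information-loss increment for rec
    static List<Double> findBestCluster(List<List<Double>> clusters, double rec, double domainSize) {
        List<Double> best = clusters.get(0);
        double min = Double.POSITIVE_INFINITY;
        for (List<Double> c : clusters) {
            List<Double> grown = new ArrayList<>(c);
            grown.add(rec);
            double diff = il(grown, domainSize) - il(c, domainSize);
            if (diff < min) { min = diff; best = c; }
        }
        return best;
    }
}
```

On the records {1, 2, 3, 10, 11, 12} with k = 3, the sketch separates the two natural groups, and every resulting cluster has between k and 2k − 1 members, as the theorem above guarantees.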
Theorem: Let n be the total number of input records and k the specified anonymity parameter. The time complexity of the greedy k-member clustering algorithm is in O(n^2).
Proof: Observe that the algorithm spends most of its time selecting records from the input set S, one at a time, until fewer than k records remain (Line 9). As the size of the input set decreases by one at every iteration, the total execution time T is estimated as:
T = (n − 1) + (n − 2) + (n − 3) + ... + k ≤ n(n − 1)/2.
Therefore, T is in O(n^2).
9. Code for Implementing the Proposed Algorithm
//CODE FOR GUI
import mypack.Cluster;
import mypack.DataBase;
import java.awt.*;
import javax.swing.*;
import java.awt.event.*;
import java.sql.*;
import java.net.*;
import java.util.StringTokenizer;
public class Project implements ActionListener,ItemListener,KeyListener
{ //for generate
JLabel generate,g1,kam,pmi;
Choice choice;
//for find
JLabel f_l1,f_record,f_display;
JTextField f_t1,temp;
JButton search;
JLabel l1,l2,l3,l4,l5,l6,l7,l8,l9,l10,msginsert,msgdelete,msgupdate;
JTextField t1,t2,t3,t4,t5,t6;
JTextField t7,t8,t9,t10;
JLabel common;
JFrame f;
JPanel p;
List list1,list2;
Connection con;
Statement st;
ResultSet rs;
JLabel banner1,banner2;
JButton insert,update,refresh,delete,find;
JButton b1,b2,b3;
JLabel l_total_record,l_total_cluster;
public Project(String name)
{
f = new JFrame(name);
p = new JPanel();
temp = new JTextField(20);//for key event
list1 = new List(20);
list2 = new List(20);
choice = new Choice();
b1 = new JButton("New Entry");
b2 = new JButton("Update");
b3 = new JButton("Delete");
insert = new JButton("Insert Confirm");
update = new JButton("Update Confirm");
refresh = new JButton("refresh");
delete = new JButton("Delete Confirm");
find = new JButton("Find");
search = new JButton("search");
l_total_record = new JLabel("Total Number of Records ");
l_total_cluster = new JLabel("Total Number of Cluster ");
kam=new JLabel("kAnonymity Model");
pmi=new JLabel("Protecting Medical Information");
banner1 = new JLabel("Original Patient Record");
banner2 = new JLabel("Annomized Patient Record");
banner1.setFont(new Font("Sanserrif",Font.BOLD,20));
banner2.setFont(new Font("Sanserrif",Font.BOLD,20));
kam.setFont(new Font("Times New Roman",Font.BOLD,20));
pmi.setFont(new Font("Tahoma",Font.ITALIC,18));
pmi.setForeground(Color.blue);
msginsert = new JLabel("Press Insert Confirm Button for saving");
msgupdate = new JLabel("Press Update Confirm Button for saving");
msgdelete = new JLabel("Press Delete Confirm Button for saving");
msginsert.setFont(new Font("Sanserrif",Font.BOLD,12));
msgupdate.setFont(new Font("Sanserrif",Font.BOLD,12));
msgdelete.setFont(new Font("Sanserrif",Font.BOLD,12));
generate = new JLabel("Generate");
generate.setFont(new Font("Sanserrif",Font.BOLD,12));
g1 = new JLabel("Annomized Table");
g1.setFont(new Font("Sanserrif",Font.BOLD,12));
choice.add("3");
choice.add("4");
choice.add("5");
choice.add("6");
choice.add("7");
choice.add("8");
choice.add("9");
choice.add("10");
msginsert.setForeground(Color.red);
msgupdate.setForeground(Color.red);
msgdelete.setForeground(Color.red);
l_total_record.setForeground(Color.blue);
l_total_cluster.setForeground(Color.blue);
l_total_record.setFont(new Font("Times New Roman",Font.BOLD,14));
l_total_cluster.setFont(new Font("Times New Roman",Font.BOLD,14));
common = new JLabel("");
common.setFont(new Font("Sanserrif",Font.BOLD,12));
common.setForeground(Color.red);
f_l1=new JLabel("Enter PID");
f_record=new JLabel("Press search button for display record");
f_t1 = new JTextField(10);
f_l1.setFont(new Font("Sanserrif",Font.BOLD,12));
f_record.setFont(new Font("Times New Roman",Font.BOLD,14));
f_l1.setForeground(Color.blue);
f_record.setForeground(Color.blue);
f_display=new JLabel("");
f_display.setForeground(Color.red);
l1 = new JLabel("PID");
l2 = new JLabel("NAME");
l3 = new JLabel("PH NO");
l4 = new JLabel("CÌTY");
l5 = new JLabel("COMPANY");
l6 = new JLabel("ZIPCODE");
l7 = new JLabel("GENDER");
l8 = new JLabel("AGE");
l9 = new JLabel("DISEASE");
l10 = new JLabel("EXPENSES");
t1 = new JTextField(10);
t2 = new JTextField(10);
t3 = new JTextField(10);
t4 = new JTextField(10);
t5 = new JTextField(10);
t6 = new JTextField(10);
t7 = new JTextField(10);
t8 = new JTextField(10);
t9 = new JTextField(10);
t10 = new JTextField(10);
t1.addKeyListener(this);
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con=DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st=con.createStatement();
String sql ="select * from patient_record order by pid";
rs = st.executeQuery(sql);
list1.add("PID ZIPCODE GENDER AGE DISEASE EXPENSES");
list1.add("");
while(rs.next())
{
list1.add(rs.getString(1)+" "+rs.getString(2)+" "+rs.getString(3)+" "+rs.getString(4)+" "+rs.getString(5)+" "+rs.getString(6));
list1.add("");
}
}catch(Exception e){System.out.print("\n"+e.getMessage());}
p.setLayout(null);
p.add(kam);
p.add(pmi);
p.add(list1);
p.add(list2);
p.add(banner1);
p.add(banner2);
p.add(insert);
p.add(update);
p.add(delete);
p.add(refresh);
p.add(find);
p.add(search);
p.add(f_l1);
p.add(f_t1);
p.add(f_display);
p.add(f_record);
p.add(l1);
p.add(l2);
p.add(l3);
p.add(l4);
p.add(l5);
p.add(l6);
p.add(l7);
p.add(l8);
p.add(l9);
p.add(l10);
p.add(t1);
p.add(t2);
p.add(t3);
p.add(t4);
p.add(t5);
p.add(t6);
p.add(t7);
p.add(t8);
p.add(t9);
p.add(t10);
p.add(b1);
p.add(b2);
p.add(b3);
p.add(generate);
p.add(g1);
p.add(choice);
p.add(msgupdate);
p.add(msgdelete);
p.add(msginsert);
p.add(common);
p.add(l_total_record);
p.add(l_total_cluster);
f.add(p);
setEntryAdd(false);
choice.addItemListener(this);
insert.addActionListener(this);
delete.addActionListener(this);
update.addActionListener(this);
refresh.addActionListener(this);
find.addActionListener(this);
search.addActionListener(this);
b1.addActionListener(this);
b2.addActionListener(this);
b3.addActionListener(this);
p.setBackground(Color.pink);
kam.setBounds(530,1,250,40);
pmi.setBounds(490,31,250,40);
b1.setBounds(30,90,100,25);
b2.setBounds(135,90,100,25);
b3.setBounds(240,90,100,25);
l1.setBounds(90,130,100,25);
t1.setBounds(210,130,120,25);
l2.setBounds(90,160,100,25);
t2.setBounds(210,160,120,25);
l3.setBounds(90,190,100,25);
t3.setBounds(210,190,120,25);
l4.setBounds(90,220,100,25);
t4.setBounds(210,220,120,25);
l5.setBounds(90,250,100,25);
t5.setBounds(210,250,120,25);
l6.setBounds(90,280,110,25);
t6.setBounds(210,280,120,25);
l7.setBounds(90,310,100,25);
t7.setBounds(210,310,120,25);
l8.setBounds(90,340,100,25);
t8.setBounds(210,340,120,25);
l9.setBounds(90,370,130,25);
t9.setBounds(210,370,120,25);
l10.setBounds(90,400,130,25);
t10.setBounds(210,400,120,25);
common.setBounds(88,430,300,25);
msginsert.setBounds(88,430,300,25);
msginsert.setVisible(false);
insert.setBounds(120,450,150,25);
insert.setVisible(false);
msgupdate.setBounds(88,430,300,25);
msgupdate.setVisible(false);
update.setBounds(120,450,150,25);
update.setVisible(false);
msgdelete.setBounds(88,430,300,25);
msgdelete.setVisible(false);
delete.setBounds(120,450,150,25);
delete.setVisible(false);
find.setBounds(150,490,80,20);
f_l1.setBounds(70,515,100,25);
f_t1.setBounds(150,515,120,20);
f_record.setBounds(50,535,240,25);
f_display.setBounds(50,590,400,30);
search.setBounds(150,560,80,20);
banner2.setBounds(700,80,300,40);
generate.setBounds(720,120,180,25);
choice.setBounds(900,120,70,25);
list2.setBounds(500,140,710,220);
l_total_record.setBounds(600,367,200,20);
l_total_record.setVisible(false);
l_total_cluster.setBounds(930,367,200,20);
l_total_cluster.setVisible(false);
banner1.setBounds(720,410,300,40);
refresh.setBounds(790,450,90,20);
list1.setBounds(500,480,710,220);
addFind(false);
setEntryAdd(false);
f.setSize(1280,960);
f.setVisible(true);
f.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
}
void addFind(boolean t)
{
if(t==true)
{
f_l1.setVisible(true);
f_record.setVisible(true);
f_t1.setVisible(true);
search.setVisible(true);
f_t1.requestFocus();
}
else
{
f_l1.setVisible(false);
f_record.setVisible(false);
f_t1.setVisible(false);
search.setVisible(false);
}
}
void setEntryAdd(boolean t)
{
t1.setText("");
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
t6.setText("");
t7.setText("");
t8.setText("");
t9.setText("");
t10.setText("");
t1.requestFocus();
if(t==true)
{
t1.setEditable(true);
t2.setEditable(true);
t3.setEditable(true);
t4.setEditable(true);
t5.setEditable(true);
t6.setEditable(true);
t7.setEditable(true);
t8.setEditable(true);
t9.setEditable(true);
t10.setEditable(true);
}
else
{
t1.setEditable(false);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);
t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
}
void setEntryDelete(boolean t)
{
t1.setText("");
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
t6.setText("");
t7.setText("");
t8.setText("");
t9.setText("");
t10.setText("");
t1.requestFocus();
if(t==true)
{
t1.setEditable(true);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);
t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
else
{
t1.setEditable(false);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);
t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
}
void setEntryUpdate(boolean t)
{
t1.setText("");
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
t6.setText("");
t7.setText("");
t8.setText("");
t9.setText("");
t10.setText("");
t1.requestFocus();
if(t==true)
{
t1.setEditable(true);
t2.setEditable(true);
t3.setEditable(true);
t4.setEditable(true);
t5.setEditable(true);
t6.setEditable(true);
t7.setEditable(true);
t8.setEditable(true);
t9.setEditable(true);
t10.setEditable(true);
}
else
{
t1.setEditable(false);
t2.setEditable(false);
t3.setEditable(false);
t4.setEditable(false);
t5.setEditable(false);
t6.setEditable(false);
t7.setEditable(false);
t8.setEditable(false);
t9.setEditable(false);
t10.setEditable(false);
}
}
public void keyPressed(KeyEvent ke)
{
}
public void keyReleased(KeyEvent ke)
{
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
String sql1="select * from patient_information where pid='"+t1.getText()+"'";
con = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st = con.createStatement();
ResultSet rs = st.executeQuery(sql1);
if(rs.next())
{
t2.setText(rs.getString(2));
t3.setText(rs.getString(3));
t4.setText(rs.getString(4));
t5.setText(rs.getString(5));
}
else
{
t2.setText("");
t3.setText("");
t4.setText("");
t5.setText("");
}
con.close();
st.close();
}catch(Exception EK1){System.out.print("\n"+EK1.getMessage());}
}
public void keyTyped(KeyEvent ke)
{
String t=t1.getText();
}
public void itemStateChanged(ItemEvent ie)
{
int n=Integer.parseInt(choice.getSelectedItem());
DataBase obj1 = new DataBase();
Cluster obj2 = new Cluster();
obj1.CreateTable();
int r = obj2.totalRow();
int c = obj2.totalRow()/n;
Choice nCluster[] = obj2.k_Member_Cluster(n);
list2.removeAll();
list2.add("PID ZIPCODE GENDER AGE DISEASE EXPENSES");
for(int i=0; i<nCluster.length; i++)
{
list2.add("");
for(int j=0; j<nCluster[i].getItemCount(); j++)
{
list2.add(nCluster[i].getItem(j));
}
}
list2.add("");
l_total_record.setText("Total Number of Records ");
String t1=l_total_record.getText();
t1+=" : "+String.valueOf(r);
l_total_record.setText(t1);
l_total_record.setVisible(true);
l_total_cluster.setText("Total Number of Cluster ");
String t2 = l_total_cluster.getText();
t2+=" : "+String.valueOf(c);
l_total_cluster.setText(t2);
l_total_cluster.setVisible(true);
obj1.DropTable();
}
public void actionPerformed(ActionEvent e)
{
Object obj=e.getSource();
if(obj==insert)
{
msginsert.setVisible(false);
common.setVisible(false);
Connection con1,con2;
Statement st1,st2;
try
{
Connection con;
Statement st;
int n=0;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
String sql1="select * from patient_information where pid='"+t1.getText()+"'";
con = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
ResultSet rs = st.executeQuery(sql1);
int flag=0;
while(rs.next())
{
flag=1;
}
con.close();
st.close();
if(flag==1)
{
con1 = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st1 = con1.createStatement();
String sql ="insert into patient_information values ('"+t1.getText()+"','"+t2.getText()+"','"+t3.getText()+"','"+t4.getText()+"','"+t5.getText()+"')";
int r = st1.executeUpdate(sql);
if(r==1)
{
System.out.print("\n"+r+" Record is inserted in patient_information table");
common.setText("Inserted successfully");
common.setVisible(true);
}
con1.commit();
con1.close();
st1.close();
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
sql ="insert into patient_record values ('"+t1.getText()+"','"+t6.getText()+"','"+t7.getText()+"','"+t8.getText()+"','"+t9.getText()+"','"+t10.getText()+"')";
r = st2.executeUpdate(sql);
if(r==1)
System.out.print("\n"+r+" Record is inserted in patient_record table");
con2.commit();
con2.close();
st2.close();
}
else
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
String sql ="insert into patient_record values ('"+t1.getText()+"','"+t6.getText()+"','"+t7.getText()+"','"+t8.getText()+"','"+t9.getText()+"','"+t10.getText()+"')";
int r = st2.executeUpdate(sql);
if(r==1)
System.out.print("\n"+r+" Record is inserted in patient_record table");
con2.commit();
con2.close();
st2.close();
}
}
catch(Exception e2){System.out.print("\n"+e2.getMessage());common.setText("Record can't be inserted");common.setVisible(true);}
setEntryAdd(false);
}
if(obj==delete)
{
Connection con1,con2;
common.setVisible(false);
msgdelete.setVisible(false);
Statement st1,st2;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
String sql2 ="delete from patient_record where pid='"+t1.getText()+"'";
int r2 = st2.executeUpdate(sql2);
con1 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st1 = con1.createStatement();
String sql1 ="delete from patient_information where pid='"+t1.getText()+"'";
int r1 = st1.executeUpdate(sql1);
if(r1>=1 && r2>=1)
{
common.setText("One record deleted");
common.setVisible(true);
}
else
{
con1.rollback();
con2.rollback();
common.setText("Pid did not match");
common.setVisible(true);
}
con1.close();
con2.close();
st1.close();
st2.close();
}
catch(Exception e3){ System.out.print("\n"+e3.getMessage());common.setText("Record can't be deleted");common.setVisible(true);}
setEntryDelete(false);
}
if(obj==update)
{
msgupdate.setVisible(false);
common.setVisible(false);
Connection con1,con2;
Statement st1,st2;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con1 = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st1 = con1.createStatement();
String sql="update patient_information set pid='"+t1.getText()+"', name='"+t2.getText()+"',phno='"+t3.getText()+"',city='"+t4.getText()+"',company='"+t5.getText()+"' where pid='"+t1.getText()+"'";
int r = st1.executeUpdate(sql);
if(r==1)
{
System.out.print("\n"+r+" Record is updated in patient_information table");
common.setText("One record has been updated");
common.setVisible(true);
}
else
{
common.setText("Record not found");
common.setVisible(true);
}
con1.commit();
con1.close();
st1.close();
}
catch(Exception e4)
{ System.out.print("\n"+e4.getMessage()); common.setText("Updation failed in 1st table");common.setVisible(true);}
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con2 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st2 = con2.createStatement();
//String sql="insert into library values ('"+t1.getText()+"','"+t2.getText()+"','"+t3.getText()+"','"+t4.getText()+"')";
String sql="update patient_record set pid='"+t1.getText()+"', zipcode='"+t6.getText()+"', gender='"+t7.getText()+"', age='"+t8.getText()+"',disease='"+t9.getText()+"',expences='"+t10.getText()+"' where pid='"+t1.getText()+"'";
int r = st2.executeUpdate(sql);
if(r==1)
{
System.out.print("\n"+r+" Record is updated in patient_record table");
common.setText("One record is updated");
common.setVisible(true);
}
con2.commit();
con2.close();
st2.close();
}
catch(Exception e5){System.out.print("\n"+e5.getMessage());common.setText("Updation failed");common.setVisible(true);}
setEntryDelete(false);
}
if(obj==b1)
{
setEntryAdd(true);
msginsert.setVisible(true);
msgdelete.setVisible(false);
msgupdate.setVisible(false);
common.setVisible(false);
insert.setVisible(true);
delete.setVisible(false);
update.setVisible(false);
}
if(obj==b2)
{
setEntryUpdate(true);
msgupdate.setVisible(true);
msgdelete.setVisible(false);
msginsert.setVisible(false);
common.setVisible(false);
update.setVisible(true);
delete.setVisible(false);
insert.setVisible(false);
}
if(obj==b3)
{
setEntryDelete(true);
msgdelete.setVisible(true);
msgupdate.setVisible(false);
msginsert.setVisible(false);
delete.setVisible(true);
update.setVisible(false);
insert.setVisible(false);
common.setVisible(false);
}
if(obj==refresh)
{
setEntryAdd(false);
addFind(false);
msgdelete.setVisible(false);
msgupdate.setVisible(false);
msginsert.setVisible(false);
insert.setVisible(false);
delete.setVisible(false);
update.setVisible(false);
common.setVisible(false);
Connection con;
Statement st;
ResultSet rs;
list1.removeAll();
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con=DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st=con.createStatement();
String sql ="select * from patient_record order by pid";
rs = st.executeQuery(sql);
list1.add("PID ZIPCODE GENDER AGE DISEASE EXPENCES");
list1.add("");
while(rs.next())
{
list1.add(rs.getString(1)+" "+rs.getString(2)+" "+rs.getString(3)+" "+rs.getString(4)+" "+rs.getString(5)+" "+rs.getString(6));
list1.add("");
}
}catch(Exception e2){System.out.print("\n"+e2.getMessage());}
}
if(obj==find)
{
f_t1.setText("");
f_t1.requestFocus();
f_record.setText("Press Search button to display record");
addFind(true);
}
if(obj==search)
{
String t;
Connection con1;
Statement st1;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con1 = DriverManager.getConnection("jdbc:odbc:patient_information","system","123");
st1 = con1.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql = "select * from patient_information where pid='"+f_t1.getText()+"'";
ResultSet r = st1.executeQuery(sql);
if(r.first())
{
t=r.getString(2)+" "+r.getString(3)+" "+r.getString(4)+" "+r.getString(5);
f_display.setText(t);
}
else
{
f_display.setText("Record could not be found");
}
con1.close();
st1.close();
}
catch(Exception e6){System.out.print("\n"+e6.getMessage());f_display.setText("Record did not match");}
}
}
public static void main(String args[])
{
new Project("k-Anonymity Model (Protecting Privacy)");
}
}
// End of GUI code
//CODE FOR Cluster Package
package mypack;
import java.util.StringTokenizer;
import java.sql.*;
import java.awt.*;
//Beginning of Cluster
public class Cluster
{
public Choice[] k_Member_Cluster(int k)
{
int r=totalRow()/k;
Choice nCluster1[] = new Choice[r];
Choice nCluster2[] = new Choice[r];
for(int i=0; i<r; i++)
{
nCluster1[i] = new Choice();
nCluster2[i] = new Choice();
}
//*
System.out.print("\n Total cluster :"+r);
try
{
int i=0;
String cluster_age[]=new String[r];
while(totalRow()>= k)
{
int mid = totalRow()/2;
if(mid>0)
{
String t = annomize_getRecord(mid);
nCluster1[i].add(t);
String tt = getRecord(mid);
nCluster2[i].add(tt);
cluster_age[i] = get_ResultSet_Age(mid);
deleteRow(mid);
while(nCluster2[i].getItemCount()<k)
{
int index = find_Best_Record(cluster_age[i]);
String t2 = getRecord(index);
nCluster2[i].add(t2);
String t1= annomize_getRecord(index);
nCluster1[i].add(t1);
deleteRow(index);
}
if(i<r)
{
System.out.print("\n "+i);
i++;
}
else
break;
}
}
for(int p=0; p<nCluster2.length; p++)
{
if(nCluster2[p].getItemCount()==0)
{
int index = totalRow()/2;
String tp = annomize_getRecord(index);
nCluster1[p].add(tp);
String tp1=getRecord(index);
nCluster2[p].add(tp1);
deleteRow(index);
}
}
int n=totalRow();
System.out.print("\n At Last :"+n);
for(int idx=1; idx<=n; idx++)
{
String rs_age=get_ResultSet_Age(1);
int c = find_Best_Cluster(nCluster2,cluster_age,rs_age);
String record1=annomize_getRecord(1);
nCluster1[c].add(record1);
String record2=getRecord(1);
nCluster2[c].add(record2);
deleteRow(1);
}
}//end try
catch(Exception e7)
{
System.out.print("\nError2: K_member_Cluster "+e7.getMessage());
}
//*
return(nCluster1);
}
public int find_Best_Record(String cluster_age)
{
int index=1;
try
{
int n = totalRow();
if(n>=2)
{
int min = difference(cluster_age.trim(),get_ResultSet_Age(1).trim());
index=1;
for(int i=2; i<=n; i++)
{
int diff=difference(cluster_age,get_ResultSet_Age(i));
if(diff<min)
{
min=diff;
index=i;
}
}
}
else
index=1;
}catch(Exception e9){System.out.print("\nError9 Find best record"+e9.getMessage());}
return(index);
}
public String get_ResultSet_Age(int index)
{
String age="0";
try{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
rs.absolute(index);
age =rs.getString("AGE");
con.close();
st.close();
}
catch(Exception e8){System.out.print("\nError8 get_ResultSet_age "+e8.getMessage());}
return(age);
}
public int find_Best_Cluster(Choice nCluster[],String cluster_age[],String rs_age)
{
int min=difference(cluster_age[0],rs_age);
int index=0;
int diff=0;
for(int i=1; i<nCluster.length; i++)
{
diff=difference(rs_age,cluster_age[i]);
if(diff<min)
{
index=i;
min=diff;
}
}
return(index);
}
public String get_Cluster_Age(Choice nCluster[],int index)
{
String age="0";
StringTokenizer token = new StringTokenizer(nCluster[index].getItem(0)," ");
while(token.hasMoreTokens())
{
String a1=token.nextToken();
String a2=token.nextToken();
String a3=token.nextToken();
age=token.nextToken();
String a4=token.nextToken();
String a5=token.nextToken();
}
return(age);
}
public int difference(String age1,String age2)
{
int r=Integer.parseInt(age1.trim())-Integer.parseInt(age2.trim());
if(r<0)
r=r*-1;
return(r);
}
public void showTable()
{
try
{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
while(rs.next())
{
System.out.print("\n"+rs.getString(1)+"\t"+rs.getString(2)+"\t"+rs.getString(3)+"\t"+rs.getString(4)+"\t"+rs.getString(5)+"\t"+rs.getString(6));
}
if(rs.last()){}
System.out.print("\n total "+rs.getRow());
con.close();
st.close();
}
catch(Exception e1){System.out.print("\nError showTable "+e1.getMessage());}
}
public void deleteRow(int index)
{
try
{ if(index>0)
{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
rs.absolute(index);
rs.deleteRow();
con.close();
st.close();
}
}
catch(Exception e2){System.out.print("\nError deleteRow "+e2.getMessage());}
}
public int totalRow()
{
int n=0;
ResultSet rs;
try
{
Connection con;
Statement st;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
if(rs.last())
{
n=rs.getRow();
con.close();
st.close();
return(n);
}
con.close();
st.close();
}
catch(Exception e5){System.out.print("\nError totalRow "+e5.getMessage());}
return(0);
}
public String annomize(String t)
{
StringBuffer sb1 = new StringBuffer(t);
for(int i=sb1.length()-1; i>=sb1.length()-3; i--)
{
sb1.setCharAt(i,'*');
}
return(String.valueOf(sb1));
}
public String annomize_Age(String age)
{
String t="[";
int n=Integer.parseInt(age);
if(n>=1 && n<=10)
t+="1-10";
if(n>=11 && n<=20)
t+="11-20";
if(n>=21 && n<=30)
t+="21-30";
if(n>=31 && n<=40)
t+="31-40";
if(n>=41 && n<=50)
t+="41-50";
if(n>=51 && n<=60)
t+="51-60";
if(n>=61 && n<=70)
t+="61-70";
if(n>=71 && n<=80)
t+="71-80";
if(n>=81 && n<=90)
t+="81-90";
if(n>=91 && n<=100)
t+="91-100";
t+=" ]";
return(t);
}
public String annomize_getRecord(int index)
{
String t="";
if(index>0)
{
try
{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
rs.absolute(index);
String t1=rs.getString("GENDER");
t1=t1.trim();
if(t1.equals("male") || t1.equals("MALE"))
{
rs.absolute(index);
t+=annomize(rs.getString(1))+" "+annomize(rs.getString(2))+" "+"[ person ]"+" "+annomize_Age(rs.getString(4))+" "+format(rs.getString(5))+" "+rs.getString(6);
}
else
{
rs.absolute(index);
t+=annomize(rs.getString(1))+" "+annomize(rs.getString(2))+" "+"[ person ]"+" "+annomize_Age(rs.getString(4))+" "+format(rs.getString(5))+" "+rs.getString(6);
}
rs.close();
con.close();
st.close();
return(t);
}
catch(Exception e6){System.out.print("\nError6 getRecord "+e6.getMessage());}
}
return(t);
}
public String format(String t)
{
String t1=t;
System.out.print("\n length of "+t+" is "+t.length());
for(int i=t.length(); i<=20; i++)
t1+=" ";
return(t1);
}
public String getRecord(int index)
{
String t="";
if(index>0)
{
try
{
Connection con;
Statement st;
ResultSet rs;
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql ="select * from patient_record2 order by pid";
rs =st.executeQuery(sql);
rs.absolute(index);
t+=annomize(rs.getString(1))+" "+rs.getString(2)+" "+rs.getString(3)+" "+(rs.getString(4))+" "+rs.getString(5)+" "+rs.getString(6);
rs.close();
con.close();
st.close();
return(t);
}
catch(Exception e6){System.out.print("\nError6 getRecord "+e6.getMessage());}
}
return(t);
}
}
// End of Cluster class
//CODE FOR DataBase Package
package mypack;
import java.sql.*;
public class DataBase
{
Connection con1;
Statement st1;
Connection con2;
Statement st2;
ResultSet rs1;
int n=0;
public void CreateTable()
{
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con1 = DriverManager.getConnection("jdbc:odbc:patient_record","system","123");
st1 = con1.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
rs1 = st1.executeQuery("select * from patient_record");
if(rs1.last())
{
n = rs1.getRow();
}
ResultSetMetaData rsmd = rs1.getMetaData();
int nCols = rsmd.getColumnCount();
con2 = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st2 = con2.createStatement();
String row[] = new String[nCols+1];
for(int i=1; i<=nCols; i++)
{
row[i]=""+rsmd.getColumnName(i)+" "+rsmd.getColumnTypeName(i)+"("+rsmd.getColumnDisplaySize(i)+")";
}
String addrow="";
for(int i=1; i<=nCols; i++)
{
if(i==1)
{
addrow+=row[i];
}
else
{
addrow+=","+row[i];
}
}
String sql="create table patient_record2"+
"("+addrow+")";
st2.executeUpdate(sql);
//for insert all row in a patient_record2 table
if(rs1.first()){}
for(int i=1; i<=n; i++)
{
String sql1="insert into patient_record2 values (";
for(int j=1; j<=nCols; j++)
{
if(j==1)
sql1+="'"+rs1.getString (j)+"'";
else
sql1+=","+"'"+rs1.getString(j)+"'";
}
sql1+=")";
//System.out.print("\n"+sql1);
if(rs1.next()){}
st2.executeUpdate(sql1);
}
con2.commit();
con2.close();
st2.close();
con1.close();
st1.close();
}
catch(Exception e1)
{
System.out.print("\nDataBase Error1 "+e1.getMessage());
}
}//close createTable
public void DropTable()
{
Connection con3;
Statement st3;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con3 = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st3 = con3.createStatement();
String sql ="drop table patient_record2";
int r= st3.executeUpdate(sql);
System.out.print("\n One Table Dropped ");
}
catch(Exception e2)
{
System.out.print("\n DataBase Error2 "+e2.getMessage ());
}
}//close DropTable
public ResultSet getResultSet()
{
Connection con4;
Statement st4;
ResultSet rs4;
try
{
Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
con4 =DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
st4 = con4.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,ResultSet.CONCUR_UPDATABLE);
String sql="select * from patient_record2";
rs4 = st4.executeQuery(sql);
// Note: closing the connection here would invalidate the returned ResultSet,
// so the caller is responsible for closing it when done.
return(rs4);
}
catch(Exception e4)
{
rs4=null;
System.out.print("\nDataBase Error3"+e4.getMessage());
return (rs4);
}
}//close getResultSet
public Connection getConnection()
{
/*try
{
con2 = DriverManager.getConnection("jdbc:odbc:patient_record2","system","123");
}
catch(Exception e5){System.out.print("\nDataBase Error5 "+e5.getMessage());} */
return(con2);
}
public void CloseConnection()
{
try
{
con2.close();
st2.close();
con1.close();
st1.close();
}
catch(Exception e6){System.out.print("\n DataBase Error6"+e6.getMessage());}
}
}
10. Project Output Screen
Figure: Project output screen.
The whole screen is divided into three parts: part one provides the controls for
interfacing with the database, part two displays the records of the table in
anonymized form, and part three displays the original records of the table.
The Search button is used to look up a particular record by its PID. A new
record can be added with the New Entry button, while the Update and Delete
buttons are used to update and delete records in the database.
A Choice object is used to select the value of k = 1, 2, ..., n as the
parameter to the clustering algorithm.
The value of k determines the total number of clusters and the maximum number
of records that a cluster can accommodate.
In the output screen above we selected k = 3, which means that each cluster
contains at least 3 records and at most 2k-1 = 5 records.
The output screen also shows the total number of clusters and the total number
of records for the selected value of k.
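The size arithmetic described above (the number of clusters derived from k,
and the k to 2k-1 capacity bound per cluster) can be sketched as follows. The
class and method names here are illustrative, not part of the project code:

```java
// Sketch of the k-member size arithmetic: for a table of n records and a
// chosen k, the implementation builds roughly n/k clusters, each holding
// at least k and at most 2k-1 records.
public class ClusterSizing {

    // Number of clusters derived in the GUI above as totalRow()/k.
    static int clusterCount(int totalRows, int k) {
        return totalRows / k;
    }

    // Minimum records per cluster: the k-anonymity requirement itself.
    static int minClusterSize(int k) {
        return k;
    }

    // Maximum records per cluster: merging the fewer-than-k leftover
    // records into a cluster of size k grows it to at most 2k-1 records.
    static int maxClusterSize(int k) {
        return 2 * k - 1;
    }

    public static void main(String[] args) {
        int k = 3, rows = 10;
        System.out.println("clusters=" + clusterCount(rows, k)
            + " min=" + minClusterSize(k)
            + " max=" + maxClusterSize(k));
    }
}
```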
11. Experimental Results
The main goal of the experiment was to investigate the performance of our
approach in terms of data quality, efficiency, and scalability. To evaluate our
approach accurately, we also compared our implementation with another
algorithm, namely the greedy k-member algorithm.
11.1. Experimental Setup
We worked on a machine with a 1.60 GHz Intel(R) Pentium M processor and 512 MB
of RAM. The operating system was Microsoft Windows XP Professional version
2002, Service Pack 2, and the implementation was built and run on the Java 2
Platform, Standard Edition 5.0.
For our experiments, we used the Adult dataset from the UC Irvine Machine
Learning Repository, which is considered a de facto benchmark for evaluating
the performance of k-anonymity algorithms. Before the experiments, the Adult
dataset was prepared by removing records with missing values and retaining only
nine of the original attributes. For k-anonymization, we considered the
database attributes {age, zipcode, gender, disease, expenses, patient name,
address}, of which {age, zipcode, gender} form the quasi-identifier. Among
these, age and zipcode were treated as numerical attributes, gender was treated
as a categorical attribute, and disease was treated as the sensitive attribute.
We created two tables for keeping patient information. Their names are given
below:
1. patient_information table
2. patient_record table
The patient_information table is the primary table containing information about
the patient, with attributes PID, NAME, ADDRESS, MOBILENO, and OCCUPATION. The
patient_record table is the secondary table, with fields PID, ZIPCODE, GENDER,
AGE, DISEASE, and EXPENCES.
To design the database schema we used the Oracle 10g Database Management
Server.
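A minimal sketch of the generalization applied to the quasi-identifier
attributes, in the same spirit as the annomize and annomize_Age methods of the
Cluster class: trailing zipcode digits are masked with '*' and an exact age is
widened to its 10-year interval. The helper names below are illustrative:

```java
public class Generalize {

    // Replace the last n characters of a value with '*' (zipcode masking).
    static String mask(String value, int n) {
        StringBuilder sb = new StringBuilder(value);
        for (int i = sb.length() - 1; i >= sb.length() - n && i >= 0; i--) {
            sb.setCharAt(i, '*');
        }
        return sb.toString();
    }

    // Generalize an exact age to its 10-year interval, e.g. 27 -> "[21-30]".
    static String ageRange(int age) {
        int low = ((age - 1) / 10) * 10 + 1;   // 27 -> 21, 30 -> 21, 31 -> 31
        return "[" + low + "-" + (low + 9) + "]";
    }

    public static void main(String[] args) {
        System.out.println(mask("835217", 3));  // 835***
        System.out.println(ageRange(27));       // [21-30]
    }
}
```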
12. Conclusions
In this thesis we proposed an efficient k-anonymization algorithm by
transforming the k-anonymity problem into the k-member clustering problem. We
also proposed two important elements of clustering, namely distance and cost
functions, which are specifically tailored to the k-anonymization problem. We
emphasize that our distance and cost functions naturally capture the data
distortion introduced by the generalization process and are general enough to
be used as a data quality measure for any k-anonymized dataset.
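The kind of distance function referred to here can be sketched for the
quasi-identifiers used in this project: numeric attributes contribute a
difference normalised by the attribute's domain range, categorical attributes
contribute 0 or 1. The normalisation by an assumed age domain of [1, 100] and
the method names are illustrative, not the thesis implementation:

```java
public class InfoLoss {

    // Numeric distance: absolute difference normalised by the attribute's
    // domain range, so differently scaled attributes contribute comparably.
    static double numericDistance(double a, double b, double domainRange) {
        return Math.abs(a - b) / domainRange;
    }

    // Categorical distance without a taxonomy: 0 if equal, 1 otherwise.
    static double categoricalDistance(String a, String b) {
        return a.equalsIgnoreCase(b) ? 0.0 : 1.0;
    }

    // Record distance over a numeric (age) and a categorical (gender)
    // quasi-identifier, assuming an age domain of [1, 100].
    static double recordDistance(int age1, int age2,
                                 String gender1, String gender2) {
        double ageDomainRange = 100.0;
        return numericDistance(age1, age2, ageDomainRange)
             + categoricalDistance(gender1, gender2);
    }

    public static void main(String[] args) {
        // 0.1 (normalised age gap) + 1.0 (different gender)
        System.out.println(recordDistance(25, 35, "male", "female"));
    }
}
```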
12.1. Clustering-Based Approaches
Byun proposed the greedy k-member clustering algorithm (k-member algorithm for
short) for k-anonymization. This algorithm works by first randomly selecting a
record r as the seed to start building a cluster, and subsequently selecting
and adding more records to the cluster such that the added records incur the
least information loss within the cluster. Once the number of records in this
cluster reaches k, the algorithm selects a new record that is the furthest from
r, and repeats the same process to build the next cluster.
Eventually, when there are fewer than k records not yet assigned to any
cluster, the algorithm individually assigns these records to their closest
clusters. This algorithm has two drawbacks.
• First, it is slow. The time complexity of this algorithm is O(n²).
• Second, if a cluster contains outliers, the information loss increases.
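The greedy loop described above can be sketched over a single numeric
attribute (age), with records chosen by minimum distance to the cluster seed.
The seed choice is simplified and the names are illustrative; this is not the
thesis implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class GreedyKMember {

    // Partition ages into clusters of at least k records, greedily adding
    // the remaining record closest (by absolute difference) to the seed.
    static List<List<Integer>> cluster(List<Integer> ages, int k) {
        List<Integer> pool = new ArrayList<>(ages);
        List<List<Integer>> clusters = new ArrayList<>();
        while (pool.size() >= k) {
            List<Integer> c = new ArrayList<>();
            int seed = pool.remove(0);          // simplified seed choice
            c.add(seed);
            while (c.size() < k) {
                int best = 0;
                for (int i = 1; i < pool.size(); i++) {
                    if (Math.abs(pool.get(i) - seed)
                            < Math.abs(pool.get(best) - seed)) {
                        best = i;
                    }
                }
                c.add(pool.remove(best));
            }
            clusters.add(c);
        }
        // Leftover records (fewer than k) join the last cluster, so each
        // cluster ends up with between k and 2k-1 records.
        if (!pool.isEmpty() && !clusters.isEmpty()) {
            clusters.get(clusters.size() - 1).addAll(pool);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Integer> ages = List.of(23, 25, 67, 24, 70, 68, 30);
        for (List<Integer> c : cluster(ages, 3)) {
            System.out.println(c);   // [23, 24, 25] then [67, 68, 70, 30]
        }
    }
}
```

The inner scan over the pool for every added record is what gives the
quadratic behaviour criticised above.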
This thesis proposed a greedy algorithm for k-anonymization. Similar to the
k-member algorithm, this algorithm chooses the seed (i.e., the first selected
record) of each cluster randomly. Also, when building a cluster, this algorithm
keeps selecting and adding records to the cluster until the diversity (similar
to information loss) of the cluster exceeds a user-defined threshold.
Subsequently, if the number of records in this cluster is less than k, the
entire cluster is deleted.
With the help of the user-defined threshold, this algorithm is less sensitive
to outliers. However, this algorithm also has two drawbacks.
• First, it is difficult to decide a proper value for the user-defined
threshold.
• Second, this algorithm might delete many records, which in turn causes a
significant information loss.
The time complexity of this algorithm is O(n² log(n)/c), where c is the average
number of records in each cluster.
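The threshold-based variant described above can be sketched in the same
simplified setting, with the cluster's "diversity" reduced to the age gap from
the seed; names and the diversity measure are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

public class ThresholdGreedy {

    // Grow each cluster until the closest remaining record lies further
    // from the seed than the user-defined threshold; clusters that end up
    // with fewer than k records are deleted, as described above.
    static List<List<Integer>> cluster(List<Integer> ages, int k, int threshold) {
        List<Integer> pool = new ArrayList<>(ages);
        List<List<Integer>> kept = new ArrayList<>();
        while (!pool.isEmpty()) {
            List<Integer> c = new ArrayList<>();
            int seed = pool.remove(0);          // simplified seed choice
            c.add(seed);
            while (!pool.isEmpty()) {
                int best = 0;                   // closest record to the seed
                for (int i = 1; i < pool.size(); i++) {
                    if (Math.abs(pool.get(i) - seed)
                            < Math.abs(pool.get(best) - seed)) {
                        best = i;
                    }
                }
                if (Math.abs(pool.get(best) - seed) > threshold) {
                    break;                      // diversity bound exceeded
                }
                c.add(pool.remove(best));
            }
            if (c.size() >= k) {
                kept.add(c);                    // undersized clusters dropped
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The two outliers 60 and 90 form undersized clusters and are
        // deleted, which is exactly the record-loss drawback noted above.
        System.out.println(cluster(List.of(20, 21, 22, 60, 90), 3, 5));
    }
}
```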
13. References
1. L. Sweeney. k-anonymity: A model for protecting privacy. International
Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, pp. 557-570,
2002.
2. G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas,
and A. Zhu. Anonymizing tables. In International Conference on Database
Theory, pages 246-256, 2005.
3. C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving
data mining. In International Conference on Extending Database Technology,
2002.
4. R. J. Bayardo and R. Agrawal. Data privacy through optimal
k-anonymization. In International Conference on Data Engineering, 2005.
5. B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for
information and privacy preservation. In International Conference on Data
Engineering, 2005.
6. Z. Huang. Extensions to the k-means algorithm for clustering large data
sets with categorical values. Data Mining and Knowledge Discovery, 1998.
7. V. S. Iyengar. Transforming data to satisfy privacy constraints. In ACM
Conference on Knowledge Discovery and Data Mining, 2002.
8. K. LeFevre, D. DeWitt, and R. Ramakrishnan. Incognito: Efficient
full-domain k-anonymity. In ACM International Conference on Management of
Data, 2005.
9. K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional
k-anonymity. In International Conference on Data Engineering, 2006.
10. A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In
ACM Symposium on Principles of Database Systems, 2004.
11. P. Samarati. Protecting respondents' privacy in microdata release. IEEE
Transactions on Knowledge and Data Engineering, 13, 2001.
12. L. Sweeney. Information explosion. In Confidentiality, Disclosure, and
Data Access: Theory and Practical Application for Statistical Agencies,
L. Zayatz, P. Doyle, J. Theeuwes and J. Lane (eds.), Urban Institute,
Washington, DC, 2001.
13. L. Sweeney. Uniqueness of simple demographics in the U.S. population,
LIDAP-WP4. Carnegie Mellon University, Laboratory for International Data
Privacy, Pittsburgh, PA, 2000. Forthcoming book entitled The Identifiability
of Data.
14. L. Sweeney. k-anonymity: a model for protecting privacy. International
Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(7), 2002.
15. T. Dalenius. Finding a needle in a haystack or identifying anonymous
census records. Journal of Official Statistics, 2(3): 329-336, 1986.
16. L. Sweeney. Guaranteeing anonymity when sharing medical data, the Datafly
system. In Proceedings, Journal of the American Medical Informatics
Association. Washington, DC: Hanley & Belfus, Inc., 1997.
17. A. Hundepool and L. Willenborg. Mu- and Tau-Argus: software for
statistical disclosure control. Third International Seminar on Statistical
Confidentiality. Bled, 1996.
18. J. Ullman. Principles of Database and Knowledge-Base Systems. Computer
Science Press, Rockville, MD, 1988.
19. L. Sweeney. Computational Disclosure Control: A Primer on Data Privacy
Protection. Ph.D. Thesis, Massachusetts Institute of Technology, 2001.
20. N. R. Adam and J. C. Wortmann. Security-control methods for statistical
databases. ACM Computing Surveys, 1989.
21. F. Y. Chin and G. Ozsoyoglu. Auditing and inference control in
statistical databases. IEEE Transactions on Software Engineering, 1982.
22. Computer Science and Telecommunications Board. IT Roadmap to a Geospatial
Future. The National Academies Press, November 2003.
23. D. E. Denning. Secure statistical databases with random sample queries.
ACM Transactions on Database Systems, 1980.
24. D. Dobkin, A. K. Jones, and R. J. Lipton. Secure databases: protection
against user influence. ACM Transactions on Database Systems, 1979.
25. A. D. Friedman and L. J. Hoffman. Towards a fail-safe approach to secure
databases. In IEEE Symposium on Security and Privacy, 1980.
26. Global Mapper. http://www.globalmapper.com/, November 2003.
27. M. Gruteser and D. Grunwald. Anonymous usage of location-based services
through spatial and temporal cloaking. In ACM/USENIX MobiSys, 2003.
28. C. K. Liew, W. J. Choi, and C. J. Liew. A data distortion by probability
distribution. ACM Transactions on Database Systems, 10(3), 1985.
29. L. Sweeney. k-anonymity: A model for protecting privacy. IJUFKS, 10(5),
2002.
30. L. Sweeney. k-anonymity privacy protection using generalization and
suppression. IJUFKS, 10(5), 2002.
Thank You