
Abstract

The partitioning of data into clusters is an important problem with many applications.
Typically, partitions are located using an iterative fuzzy c-means algorithm or a Fast Algorithm
of one form or another. In data mining, clustering techniques group objects showing similar
characteristics into the same cluster, while objects demonstrating different characteristics are
grouped into different clusters. Clustering approaches can be classified into two categories,
namely hard clustering and soft clustering. Our proposed WLI index partially allows the
existence of closely allocated centroids in the clustering results by considering not only the
minimum but also the median distance between a pair of centroids, and therefore possesses
better stability. The performance of WLI and some existing cluster validity indexes is evaluated
and compared by running the fuzzy c-means algorithm for clustering various types of data sets,
including artificial data sets, UCI data sets, and images. Experimental results show that WLI
has a more accurate and satisfactory performance than the other indexes. The FCM algorithm is
also tested with cluster validity indices such as the partition coefficient and partition entropy.
Validity functions typically suggest finding a trade-off between intra-cluster and inter-cluster
variability, which is of course a reasonable principle. The latter process uses a region-based
similarity representation of the image regions to decide whether regions can be merged. The
results show that the LHS and RHS distance measure reports the maximum partition coefficient
and the minimum partition entropy among the distance measures considered.
Keywords: clustering analysis; cluster validity index; partition clustering algorithm; fuzzy c-means clustering algorithm.

Introduction
Microarray technology has made available an incredible amount of gene expression data,
driving research in several areas including the molecular basis of disease, drug discovery,
neurobiology, and others. Usually, microarray data is collected with the goal of either
discovering genes associated with some event, predicting outcomes based on gene expression, or
discovering sub-classes of diseases. While clustering has been used for decades in image
processing and pattern recognition, in recent years it has become a popular technique in genomic
studies for extracting this kind of valuable information from massive sets of gene expression
data.
Clustering applied to genes from microarray data groups together those whose expression
levels exhibit similar behavior through the samples. In this context, similarity is taken to indicate
possible co-regulation between the genes, but may also reveal other processes that relate their
expression. In other words, the application of clustering to our first goal listed above is founded
on the concept of guilt by association, where genes with similar expression across samples
are assumed to share some underlying mechanism.
Objects belonging to the same cluster are similar to each other, i.e., each cluster is
homogeneous. Each cluster should be different from the other clusters, such that objects belonging
to one cluster differ from the objects present in other clusters, i.e., different clusters are non-homogeneous.

Clustering provides many advantages, but the two most important benefits of clustering can be
outlined as follows [3]:

1. Detection and handling of noisy data and outliers is relatively easy.
2. It provides the ability to deal with data having different types of variables,
such as continuous variables (which require standardized data), binary variables,
nominal variables, ordinal variables, and mixed variables.
Data mining is the process of finding patterns or relationships in large data sets.
Such a pattern or relationship can be regarded as knowledge, so the term Knowledge Discovery
from Data (KDD) is used interchangeably. The process uses automated data analysis
techniques to uncover relationships between data items, and these techniques
involve many different algorithms. The overall goal of the data mining process is to extract
knowledge from an existing data set and transform it into a human-understandable structure for
further use.
The KDD process consists of several steps: data cleaning, data integration, data
selection, data transformation, data mining, pattern evaluation, and knowledge representation.
The first four steps are different forms of data preprocessing, in which data is prepared for data
mining. The data mining step is the essential step, where data analysis techniques are applied to
extract patterns or knowledge. The extracted patterns or knowledge are then evaluated in the
pattern evaluation step, and the evaluated knowledge is presented to the user in the knowledge
representation step. Basic data mining tasks are classification, regression, time-series analysis,
prediction, clustering, summarization, association rules, and sequence discovery.
Clustering
Clustering is an unsupervised data mining technique that partitions or groups a given set
of patterns into disjoint clusters without advance knowledge of the groups or clusters. This is done
such that patterns belonging to the same cluster are alike and patterns belonging to two different
clusters differ. The clustering process can be divided into two parts: cluster formation and
cluster validation.
Mathematical model of clustering
In the context of pattern recognition theory, each object is represented by a vector of
features, called a pattern. Clustering can be defined as the process of partitioning a collection of
vectors into subgroups whose members are similar relative to some distance measure.
A clustering algorithm receives a set of vectors, and groups them based on a cost criterion or
some other optimization rule.
The related field of pattern classification, which involves simply assigning individual
vectors to classes, has developed a theory based on defining error criteria, designing optimal
classifiers, and learning. In comparison, clustering has historically been approached heuristically;
there has been almost no consideration of learning or optimization, and error estimation has been
handled indirectly via validation indices. Only recently has a rigorous clustering theory been
developed in the context of random sets. Although we will not go over the mathematical details,
in this section we summarize some essential points regarding clustering error, error
estimation, and inference.
Fuzzy C-Means
In the K-means algorithm, each vector is classified as belonging to a single cluster (hard
clustering), and the centroids are updated based on the classified samples. In a variation of this
approach known as fuzzy c-means, all vectors have a degree of membership for each cluster, and
the respective centroids are calculated based on these membership degrees.
Whereas the K-means algorithm computes the average of the vectors in a cluster as the center,
fuzzy c-means finds the center as a weighted average of all points, using the membership
probabilities for each point as weights. Vectors with a high probability of belonging to the class
have larger weights, and more influence on the centroid. As with K-means clustering, the process
of assigning vectors to centroids and updating the centroids is repeated until convergence is
reached.
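The membership-weighted centroid update described above can be sketched as follows. This is a minimal NumPy illustration; the fuzzifier m and the toy data are assumptions, not taken from the text:

```python
import numpy as np

def fuzzy_centroids(X, U, m=2.0):
    """Compute cluster centers as membership-weighted averages.

    X : (n_samples, n_features) data vectors
    U : (n_samples, n_clusters) membership degrees in [0, 1]
    m : fuzzifier (> 1); m = 2 is a common default (an assumption here)
    """
    W = U ** m                                   # weight each point by u_ij^m
    return (W.T @ X) / W.sum(axis=0)[:, None]    # weighted average per cluster

# Toy data: two loose groups, with soft memberships for each point.
X = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 9.0]])
U = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(fuzzy_centroids(X, U))  # centers pulled toward high-membership points
```

Points with memberships near 1 dominate their cluster's center, which is exactly the weighting behavior the paragraph describes.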
Hierarchical
Hierarchical clustering creates a hierarchical tree of similarities between the vectors,
called a dendrogram. The usual implementation is based on agglomerative clustering, which
initializes the algorithm by assigning each vector to its own separate cluster and defining the
distances between each cluster based on either a distance metric (e.g., Euclidean) or similarity
(e.g., correlation). Next, the algorithm merges the two nearest clusters and updates all the
distances to the newly formed cluster via some linkage method, and this is repeated until there is
only one cluster left that contains all the vectors. Three of the most common ways to update the
distances are with single, complete or average linkages.
This process does not define a single partition of the system, but a sequence of nested partitions, where
each partition contains one less cluster than the previous partition. To obtain a partition
with K clusters, the process must be stopped K − 1 steps before the end.
Different linkages lead to different partitions, so the type of linkage used must be selected
according to the type of data to be clustered. For instance, complete and average linkages tend to

build compact clusters, while single linkage is capable of building clusters with more complex
shapes but is more likely to be affected by spurious data.
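The agglomerative merge loop described above can be sketched in a few lines. This is a naive pure-Python illustration of single linkage; the toy data and the brute-force distance search are assumptions for clarity, not an efficient implementation:

```python
import numpy as np

def single_linkage_clusters(X, k):
    """Naive agglomerative clustering with single linkage.

    Starts with every vector in its own cluster and repeatedly merges
    the two nearest clusters until k clusters remain.
    """
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)           # merge the two nearest clusters
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(single_linkage_clusters(X, 3))  # three groups of point indices
```

Replacing the `min` with `max` or a mean over member pairs gives complete or average linkage, respectively, which is the only difference between the three linkage methods mentioned above.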
Literature Review
Data clustering is the process of dividing data elements into groups or clusters
such that items in the same class are similar and items belonging to different classes are
dissimilar. Different measures of similarity, such as distance, connectivity, and intensity,
may be used to place different items into clusters. The similarity measure controls how the
clusters are formed and depends on the nature of the data and the purpose of clustering the data.
Clustering techniques can be hard or soft. Clustering techniques can also be classified into
supervised clustering, which demands human interaction to decide the clustering criteria, and
unsupervised clustering, which decides the clustering criteria itself. The two types of classic
clustering techniques are defined as follows:

 Hierarchical Clustering Techniques
 Partition Clustering Techniques

Algorithm
Fuzzy C-Means Algorithm
Fuzzy C-Means (FCM) is a method of clustering which allows one piece of data to
belong to more than one cluster. In other words, each data point is a member of every cluster, but
with a certain degree of membership. Since every sample or element has some membership value,
a sample is partially attached to other clusters as well, so no cluster will be empty and no class
will be without any data points. The output of such an algorithm is a fuzzy clustering rather than
a crisp partition. It is based on the minimization of an objective function.

1. Initialize the membership matrix U = [u_ij] as U(0).

2. At step k, calculate the center vectors C(k) = [c_j] using U(k).

3. Update U(k) to U(k+1).

4. If ||U(k+1) − U(k)|| < ε, where ε is a small termination threshold, then stop; otherwise return to step 2.
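The four steps above can be sketched as a short alternating iteration. This is a minimal NumPy sketch; the fuzzifier m = 2, the tolerance, and the toy data are assumptions:

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means: alternate center and membership updates
    until the membership matrix U stops changing (step 4)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)               # step 1: random fuzzy partition
    for _ in range(max_iter):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]      # step 2: centers from U(k)
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (D ** (2 / (m - 1)))          # step 3: update memberships
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.linalg.norm(U_new - U) < eps:         # step 4: stopping test
            break
        U = U_new
    return C, U_new

# Toy data: two tight blobs near (0, 0) and (5, 5).
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (10, 2)),
               np.random.default_rng(2).normal(5, 0.1, (10, 2))])
C, U = fcm(X, c=2)
print(np.round(C, 2))  # one center near (0, 0), one near (5, 5)
```

Each row of U sums to 1, so every point distributes its full membership across the clusters, which is what keeps every cluster non-empty.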

The advantages of using fuzzy c-means are:

1. No class or cluster will be empty; every cluster has at least partial membership from some elements.
2. It is more efficient than k-means.
3. It has better convergence properties.

Fast Clustering Algorithm


Framework and Definitions. Irrelevant features, along with redundant features, severely affect
the accuracy of learning machines. Thus, feature subset selection should be able to identify
and remove as much of the irrelevant and redundant information as possible. Moreover, good
feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated
with (not predictive of) each other.
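The relevance/redundancy criterion can be illustrated with a simple correlation-based filter. This is only a sketch of the idea, not the Fast algorithm itself; the threshold values and synthetic data are assumptions:

```python
import numpy as np

def correlation_filter(X, y, rel_thresh=0.5, red_thresh=0.9):
    """Keep features highly correlated with the class (relevant) and
    drop features highly correlated with an already-kept feature
    (redundant)."""
    selected = []
    # Rank features by absolute correlation with the class label.
    relevance = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    for j in np.argsort(relevance)[::-1]:
        if relevance[j] < rel_thresh:
            break  # remaining features are irrelevant
        # Keep j only if it is not too correlated with any kept feature.
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < red_thresh
               for k in selected):
            selected.append(int(j))
    return selected

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200).astype(float)
f0 = y + rng.normal(0, 0.1, 200)     # relevant feature
f1 = f0 + rng.normal(0, 0.01, 200)   # redundant near-copy of f0
f2 = rng.normal(0, 1, 200)           # irrelevant noise
X = np.column_stack([f0, f1, f2])
print(correlation_filter(X, y))  # keeps one of f0/f1, drops the rest
```

The filter keeps exactly one of the two correlated features and discards the noise feature, matching the "correlated with the class, uncorrelated with each other" goal stated above.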

Example for the Algorithm

Patient data (anonymized):

Iteration # 1 (full table): Partition on dim = Zip code with splitVal = 53711, producing an LHS and an RHS.

Iteration # 2 (LHS from iteration # 1): Partition on dim = Age with splitVal = 26, producing an LHS and an RHS.

Iteration # 3 (LHS from iteration # 2): No allowable cut. Summary: Age = [25-26], Zip = [53711].

Iteration # 4 (RHS from iteration # 2): No allowable cut. Summary: Age = [27-28], Zip = [53710-53711].

Iteration # 5 (RHS from iteration # 1): No allowable cut. Summary: Age = [25-27], Zip = [53712].
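The iterations above follow a Mondrian-style recursive median partition, which can be sketched as follows. The k value, the patient records, and the choice to try dimensions in a fixed order (the full algorithm picks the widest dimension) are illustrative assumptions:

```python
def partition(records, dims, k=2):
    """Recursively split records on the (lower) median of a dimension;
    a cut is allowable only when both sides keep at least k records.
    When no allowable cut exists, summarize each dimension as a range,
    as in the iterations above."""
    for dim in dims:  # assumption: fixed order instead of widest-range choice
        vals = sorted(r[dim] for r in records)
        split = vals[(len(vals) - 1) // 2]       # splitVal = median
        lhs = [r for r in records if r[dim] <= split]
        rhs = [r for r in records if r[dim] > split]
        if len(lhs) >= k and len(rhs) >= k:      # allowable cut
            return partition(lhs, dims, k) + partition(rhs, dims, k)
    # No allowable cut: generalize each dimension to its min-max range.
    return [{d: (min(r[d] for r in records), max(r[d] for r in records))
             for d in dims}]

# Hypothetical patient table consistent with the summaries above.
patients = [{'Age': 25, 'Zip': 53711}, {'Age': 26, 'Zip': 53711},
            {'Age': 27, 'Zip': 53710}, {'Age': 28, 'Zip': 53711},
            {'Age': 25, 'Zip': 53712}, {'Age': 27, 'Zip': 53712}]
for group in partition(patients, ['Zip', 'Age'], k=2):
    print(group)  # three generalized groups, e.g. Zip (53711, 53711), Age (25, 26)
```

With these toy records the recursion reproduces the three summaries of iterations 3-5: each leaf group holds at least k = 2 records, so the result is 2-anonymous.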

Advantages

 The protection k-anonymity provides is simple and easy to understand.
 If a table satisfies k-anonymity for some value k, then anyone who knows only the quasi-identifier values of one individual cannot identify the record corresponding to that individual with confidence greater than 1/k.

Evaluation Result
Bacteria Image. The bacteria image considered for the experimentation measures
120 x 142 x 3 pixels. The goal of the algorithm is to separate the bacteria from the
background efficiently. Both algorithms are implemented on two kinds of images:
noiseless and corrupted with noise. Gaussian noise is introduced with 3% intensity,
and the image consists of two clusters. Various types of noise and levels of noise
percentage have been experimented with on the images to show the performance of all the
clustering algorithms. Fig. 1(a) presents the original noiseless bacteria image, and the
clustering outcomes are shown in Fig. 1(b)-(d) for the clustering methods FCM and the
Fast Algorithm.

Fig. 1: (a) Bacteria original image, (b) FCM cluster, (c) Fast Clustering.


Noiseless image results depict the best performance for the FCM and Fast
Clustering algorithms. FCM produces an image with blurred boundaries and an augmented
bacteria size. In the case of the noise-corrupted image, Fast Clustering tops the results,
removing the noise as well as retaining the boundaries of the bacteria. FCM also removes
noise, but at the cost of increased bacteria size. A performance comparison of FCM and
Fast Clustering is shown in TABLE I in terms of these four cluster validity functions.
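Two of the validity functions mentioned earlier, the partition coefficient and partition entropy, can be computed directly from the membership matrix. This is a minimal sketch; the toy membership matrices are assumptions:

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/N) * sum of squared memberships; equals 1 for a crisp
    partition and 1/c for a maximally fuzzy one. Higher is better."""
    return float((U ** 2).sum() / len(U))

def partition_entropy(U):
    """PE = -(1/N) * sum of u * log(u); equals 0 for a crisp partition.
    Lower is better."""
    return float(-(U * np.log(U + 1e-12)).sum() / len(U))

# A crisp partition vs. a maximally fuzzy one, each with 2 points, 2 clusters.
crisp = np.array([[1.0, 0.0], [0.0, 1.0]])
fuzzy = np.array([[0.5, 0.5], [0.5, 0.5]])
print(partition_coefficient(crisp), partition_coefficient(fuzzy))  # 1.0 0.5
print(partition_entropy(crisp) < partition_entropy(fuzzy))          # True
```

A result with a higher partition coefficient and lower partition entropy, as reported for the best distance measure above, therefore corresponds to a crisper, less ambiguous clustering.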

Execution time. TABLE II shows the outcome for FCM and Fast Clustering in terms of the
convergence rate and the execution time. The FCM technique has the least execution time
compared to the other image segmentation techniques. It can be seen that the Fast method takes
much more time to execute, but has the best convergence rate, as its number of iterations is the
least in both images.