Clustering techniques adopt the notion of a high-dimensional space, that is, a space whose orthogonal dimensions are the attributes of the table of analysed data. The value of each attribute of an example represents the distance of the example from the origin along that attribute's axis. To use this geometry effectively, the values in the data set must all be numeric and should be normalized, so that overall distances in the multi-attribute space are computed fairly.

The K-means algorithm is a simple, iterative procedure in which the crucial concept is that of the centroid. A centroid is an artificial point in the space of records that represents the average location of a particular cluster. The coordinates of this point are the averages of the attribute values of all examples that belong to the cluster. The steps of the K-means algorithm are given in Figure 2.1.
1. Randomly select k points (they can also be examples) to be the seeds for the centroids of k clusters.
2. Assign each example to the centroid closest to it, forming in this way k exclusive clusters of examples.
3. Calculate new centroids of the clusters. For that purpose, average all attribute values of the examples belonging to the same cluster (centroid).
4. Check whether the cluster centroids have changed their "coordinates". If yes, start again from step 2. If not, cluster detection is finished and all examples have their cluster memberships defined.

Fig. 2 – K-means algorithm

Usually this iterative procedure of redefining centroids and reassigning the examples to clusters needs only a few iterations to converge.
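The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the use of squared Euclidean distance, the `max_iter` safety limit and the handling of empty clusters (the previous centroid is kept) are all assumptions that the text does not prescribe. The examples are assumed to be numeric and already normalized.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Attribute-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def k_means(examples, k, max_iter=100, seed=0):
    """Return (centroids, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k examples at random as the initial centroid seeds.
    centroids = rng.sample(examples, k)
    assignment = []
    for _ in range(max_iter):  # max_iter is only a safety limit (assumption)
        # Step 2: assign each example to its closest centroid.
        assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in examples
        ]
        # Step 3: recompute each centroid as the average of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(examples, assignment) if a == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


# Hypothetical two-attribute data set with two visibly separate groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

The result depends on the random seeds chosen in step 1, so different runs may label the clusters differently or, on less well-separated data, converge to different partitions.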
2.3 Two Step Cluster
The Two Step cluster analysis [10] can be used to cluster the data set into distinct groups in case these groups are initially unknown. Similar to the K-Means algorithm, Two Step Cluster models do not use a target field. Instead of trying to predict an outcome, Two Step Cluster tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other and dissimilar to records in other groups.

Two Step Cluster is a two-step clustering method. The first step makes a single pass through the data, during which it compresses the raw input data into a manageable set of subclusters. The second step uses a hierarchical clustering method to progressively merge the subclusters into larger and larger clusters, without requiring another pass through the data. Hierarchical clustering has the advantage of not requiring the number of clusters to be selected ahead of time. Many hierarchical clustering methods start with individual records as starting clusters and merge them recursively to produce ever larger clusters. Though such approaches often break down with large amounts of data, Two Step's initial pre-clustering makes hierarchical clustering fast even for large data sets.
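The two-phase idea can be illustrated with a deliberately simplified Python sketch. It is not the procedure of [10]: the distance threshold used for pre-clustering, the centroid-distance merge criterion, the `max_merge_distance` stopping rule and all function names are assumptions made only for illustration. Each subcluster is summarized by a count and a vector of attribute sums, so the second phase never revisits the raw records.

```python
import math


def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(sub):
    """Centroid of a subcluster stored as [count, attribute_sums]."""
    count, sums = sub
    return [s / count for s in sums]


def pre_cluster(records, threshold):
    """Phase 1: one pass over the data, compressing it into subclusters."""
    subclusters = []
    for record in records:
        best, best_d = None, None
        for sub in subclusters:
            d = distance(record, centroid(sub))
            if best_d is None or d < best_d:
                best, best_d = sub, d
        if best is not None and best_d <= threshold:
            # Absorb the record into the nearest subcluster summary.
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], record)]
        else:
            # Too far from every summary: start a new subcluster.
            subclusters.append([1, list(record)])
    return subclusters


def merge_subclusters(subclusters, max_merge_distance):
    """Phase 2: repeatedly merge the two closest subclusters (agglomerative)."""
    clusters = [[count, list(sums)] for count, sums in subclusters]
    while len(clusters) > 1:
        best_pair, best_d = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best_d is None or d < best_d:
                    best_pair, best_d = (i, j), d
        if best_d > max_merge_distance:  # assumed stopping rule
            break
        i, j = best_pair
        clusters[i] = [clusters[i][0] + clusters[j][0],
                       [a + b for a, b in zip(clusters[i][1], clusters[j][1])]]
        del clusters[j]
    return [(count, tuple(centroid([count, sums]))) for count, sums in clusters]


# Hypothetical records; only phase 1 touches them directly.
records = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.5, 0.0),
           (20.0, 0.0), (20.5, 0.0), (23.0, 0.0)]
subs = pre_cluster(records, threshold=1.0)
print(merge_subclusters(subs, max_merge_distance=5.0))
```

Because the pairwise comparison in the second phase runs over subcluster summaries rather than individual records, its cost depends on the number of subclusters, which is what keeps the hierarchical step tractable on large data sets.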
2.4 Association Rules
A rule consists of a left-hand side proposition (antecedent) and a right-hand side (consequent) [2]. Both sides consist of Boolean statements. The rule states that if the left-hand side is true, then the right-hand side is also true. A probabilistic rule modifies this definition so that the right-hand side is true with probability p, given that the left-hand side is true.

A formal definition of an association rule [6, 13] is given below.
Definition. An association rule is a rule of the form X ⇒ Y, where X and Y are predicates or sets of items.
As the number of produced associations might be huge, and not all the discovered associations are meaningful, two probability measures, called support and confidence, are introduced to discard the less frequent associations in the database. The support is the joint probability of finding X and Y in the same group; the confidence is the conditional probability of finding Y in a group, having found X. Formal definitions of support and confidence [6] are given below.
Definition. Given an itemset pattern X, its frequency fr(X) is the number of cases in the data that satisfy X. Support is the frequency fr(X ∧ Y). Confidence is the fraction of rows that satisfy Y among those rows that satisfy X:

$$c(X \Rightarrow Y) = \frac{fr(X \wedge Y)}{fr(X)}$$
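As a small illustration of these definitions, the following Python sketch computes frequency, support and confidence over a handful of transactions. The basket data and the function names are hypothetical, and each row is assumed to be a set of items, so that a row "satisfies" an itemset when it contains all of its items.

```python
def frequency(rows, items):
    """fr(X): number of rows that contain every item of the itemset."""
    items = set(items)
    return sum(1 for row in rows if items <= set(row))


def support(rows, x, y):
    """Support of the rule X => Y, taken as the frequency fr(X and Y)."""
    return frequency(rows, set(x) | set(y))


def confidence(rows, x, y):
    """c(X => Y) = fr(X and Y) / fr(X)."""
    fr_x = frequency(rows, x)
    if fr_x == 0:
        raise ValueError("confidence is undefined when no row satisfies X")
    return support(rows, x, y) / fr_x


# Hypothetical transactions, each a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread"}, {"milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Here three baskets contain bread and two of those also contain milk, so the rule bread ⇒ milk has support 2 and confidence 2/3.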
In terms of conditional probability notation, the empirical accuracy of an association rule can be viewed as a maximum likelihood (frequency-based)