Clustering Algorithms.
Douglas Turnbull
CSE 202 – Project
November 2002
1. Introduction
The problem can be generalized as follows: given a set N of n data points in d-dimensional
space, we must determine how to choose a set K of k points from N, called centers, so as to
optimize some criterion. In most cases it is natural to assume that n is much greater than k
and that d is relatively small. This formulation is an example of unsupervised learning: the
system creates groupings based only on the criterion and the information contained in the n
data points.
K-means and K-harmonic means are two center-based algorithms that have been developed
to solve this problem. K-means (KM) is a popular algorithm that was first presented over
three decades ago [1]. The criterion it uses minimizes the total mean-squared distance from
each point in N to that point's closest center in K. K-harmonic means (KHM) is a more
recent algorithm presented by Zhang in 2000 [2]. This algorithm minimizes the harmonic
average of the distances from all points in N to all centers in K.
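In symbols, writing x_i for the data points and c_j for the centers, the two criteria take roughly the following standard forms (p is a power applied to the distances that KHM exposes as a parameter, introduced in section 3; p = 2 corresponds to squared Euclidean distance):

\mathrm{Perf}_{KM}(N, K) = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^{2}

\mathrm{Perf}_{KHM}(N, K) = \sum_{i=1}^{n} \frac{k}{\sum_{j=1}^{k} 1 / \lVert x_i - c_j \rVert^{p}}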
This paper will give a description of the general KM algorithm in section 2 and a
description of KHM in section 3. Section 4 will compare KM and KHM by first looking at
a test case where KHM outperforms KM and then discussing some of the fundamental
differences between the two algorithms. Section 5 will present one current application,
involving the classification of music, that can use unsupervised clustering. Lastly,
section 6 will look at some potential research ideas for KHM.
2. K-Means
Initialization
1. Create an initial K:
Choose any k points from N
2. Initialize pointers in N:
Assign pointers to any element in K
Main Loop
3. Fill Matrix M:
Calculate distances from all points in N to all centers in K
4. Update Array N:
Change the pointer N[i] to point to a new center if that new center is closer than the previous center
5. If no pointer to a center is updated in step 4 then stop
The algorithm has converged
6. Otherwise Recompute and Update K:
For each center c in K, compute the average value of all points in N that point to c and replace c with the new point
c' from the data set that is closest to that average value. This can all be done with a linear pass through N.
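To make the steps concrete, here is a minimal Python sketch of the loop above (my own rendering with NumPy, not code from [1]; note that, as in step 6, each new center is snapped back to the data point closest to the cluster average rather than set to the average itself):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Sketch of the KM loop above. X is an (n, d) array of data points."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = X[rng.choice(n, size=k, replace=False)]    # step 1: choose any k points from N
    assign = np.zeros(n, dtype=int)                      # step 2: point everything at center 0 to start
    for _ in range(max_iter):
        # step 3: fill matrix M with distances from every point in N to every center in K
        M = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = M.argmin(axis=1)                    # step 4: point to the closest center
        if np.array_equal(new_assign, assign):           # step 5: no pointer changed -> converged
            break
        assign = new_assign
        for j in range(k):                               # step 6: recompute and update K
            members = X[assign == j]
            if len(members) == 0:
                continue                                 # keep the old center if its cluster is empty
            mean = members.mean(axis=0)
            # replace c_j with the data point closest to the cluster average (one pass over N)
            centers[j] = X[np.linalg.norm(X - mean, axis=1).argmin()]
    return centers, assign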
Because the objective function only uses the minimum distances, each point x_i is implicitly
assigned to exactly one center c_j. The property of being assigned to one center is referred to
as hard membership. (We will return to this notion of membership in section 4.)
The space required by the data structures is N + K + M = n(d+1) + kd + nk, which is O(nk)
if the dimension d is treated as a small constant. The runtime of a single iteration is
dominated by filling M in step 3, which computes nk distances in d dimensions, giving
O(nkd) time per iteration.
The convergence rate is not just a function of n, k, and d but depends on the nature of the
data set. It is therefore difficult to give an overall runtime for the algorithm. For this reason,
center-based clustering algorithms are usually compared by the runtime of a single
iteration.
2.2.4 Correctness
It has been shown in [5] that the problem of optimally grouping n data points in d-dimensional
space into k groups is NP-hard. K-means uses a greedy heuristic that seeks to find a local
minimum given an initial condition.
In each iteration we find a better local solution: new centers are only accepted when they
reduce the value of the objective function. The algorithm converges when no center can be
exchanged in a way that reduces the total cost. Although this gives us a local minimum, we
would have to rerun the algorithm from every initial condition in order to guarantee a global
minimum. Since there are an exponential number of possible initializations, rerunning with
all of them would take exponential time.
3. K-Harmonic Means
KHM is similar to KM in that both are center-based iterative algorithms. The main
difference is how the centers are updated in each iteration. The formula for updating the
centers comes from a derivation found in [2]. In his paper, Zhang introduces a class of KHM
algorithms with a parameter p that is the power applied to the distance calculation. In the
standard KM algorithm p would be 2, because the distance calculation is the squared
distance ||x_i - c_j||^2. It was found that KHM works better with values of p > 2. This will
be discussed more in section 4.
The harmonic average of k values a_1, ..., a_k is HA(a_1, ..., a_k) = k / (1/a_1 + ... + 1/a_k).
This function has the property that if any one element of a_1, ..., a_k is small, the harmonic
average will also be small; if there are no small values, the harmonic average will be large.
It behaves like a minimum function but also gives some weight to all the other values. For
example, HA(1, 100) = 2 / (1 + 1/100) ≈ 1.98, which is close to the minimum of the two values.
Initialization
1. Create an initial K:
Choose any k points from N
Main Loop
2. Compute Harmonic Averages and Update K: See Appendix A
3. If no center is updated in step 2 then stop
The algorithm has converged
(Note that a more complete version of the pseudocode is given in Appendix A.)
The KHM objective function is the sum over all points in N of HA(||x_i - c_1||^p, ..., ||x_i - c_k||^p),
where HA() is the harmonic average defined above. Unlike KM, this algorithm uses
information from all of the centers in K to calculate the harmonic average for each point in
N. This means that no center completely owns a point; rather, every center partially
influences the harmonic average of each point. This condition is referred to as soft membership.
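For completeness, the center update that falls out of this derivation (written here in the standard form reported in [2] and [4], with d_{ij} = \lVert x_i - c_j \rVert; this is my transcription, not a quotation of the original equations) is:

c_j \leftarrow \frac{\sum_{i=1}^{n} \dfrac{d_{ij}^{-p-2}}{\big(\sum_{l=1}^{k} d_{il}^{-p}\big)^{2}} \, x_i}{\sum_{i=1}^{n} \dfrac{d_{ij}^{-p-2}}{\big(\sum_{l=1}^{k} d_{il}^{-p}\big)^{2}}}

Every point contributes to every new center, with nearby points contributing the most; this is exactly the soft membership described above.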
Like KM, KHM finds a locally optimal solution based on the initial conditions using a greedy
heuristic. Each iteration improves the objective function until convergence.
4. Comparing KM and KHM
A recent study by Hamerly and Elkan [4] found that KHM significantly outperforms KM in
a number of experiments. They also found that KHM was much less sensitive to initial
conditions. This section will give a specific case in which KHM finds a better grouping. We
will then discuss two properties of KHM that make it superior to KM.
[Figure: the test case consists of eight data points x1 through x8 placed on a number line from 1 to 10, with three initial centers c1, c2, and c3.]
However, a better local minimum would have been found given an initialization in which
the centers were assigned to (x2, x5, x9). This would yield a local objective of 8.
4.1.2 KHM on Test Case
Given the same initial setup, the KHM algorithm (with p = 3) given in Appendix A will
converge after 3 iterations. The weight of each data point on the objective function is
calculated in each iteration. Once calculated, these weights are normalized and used to
compute the new positions of the centers.
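The weight formula itself is not restated here; one natural candidate, following the decomposition used in [4] (this is an assumption on my part, not a quotation), is

w(x_i) = \frac{\sum_{j=1}^{k} d_{ij}^{-p-2}}{\big(\sum_{j=1}^{k} d_{ij}^{-p}\big)^{2}},

which assigns large weights to points that are far from every center. This is consistent with the large values (524, 54, 36) that appear in the iterations below.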
[Figures: the positions of the data points x1 through x8 and of the centers c1, c2, c3 on the number line after each of the three iterations.]
The weights computed for the data points in each iteration are:
Iteration 1: x1 = 0, x2 = .79, x3 = .78, x4 = 0, x5 = 34, x6 = 36, x7 = 0, x8 = .99
Iteration 2: x1 = .91, x2 = 0, x3 = .24, x4 = 0, x5 = .76, x6 = 0, x7 = 54, x8 = 524
Iteration 3: x1 = .99, x2 = 0, x3 = .92, x4 = 15, x5 = 0, x6 = .77, x7 = 0, x8 = .96
Note that the final set of centers does find a good local (and in this case, global) optimum
with respect to both the KHM objective and the KM objective function. Some of the
weights are relatively high (524, 54, 36) for points far from a center and low for points close
to a center (very small values, shown as 0, along with .24 and .78).
5. Application: Music Classification
One application that involves unsupervised clustering is the automatic classification of
music for information retrieval (IR). The problem involves large libraries of digitally
recorded sound files. In [6], Tzanetakis and Cook use signal processing techniques and
statistical pattern recognition algorithms to extract features from recorded sound files.
In this case, the sound files represent the data points and the extracted features determine
the dimensionality of the data. Much energy has been put into supervised clustering based
on nearest-neighbor algorithms with a predefined audio classification hierarchy and
associated training sets. However, in this method, notions of human cognition and
psychoacoustics play a large role in determining the success of a classification system.
Using a fully automatic clustering algorithm such as KHM could limit the subjective
nature of musical classification. Although this new classification may not make sense to a
human, it could be used to automatically store and retrieve sound information with a much
higher degree of accuracy.
6. Future Work
Over the past 30 years, many papers have presented ways to improve KM.
Some focus on creating good initialization methods while others look at finding optimal
values for k. My thought was to apply some of these ideas to KHM.
In [3], Elkan describes ways to improve KM using geometric insights from the triangle
inequality. He presents two lemmas, one giving an upper bound and one a lower bound, that
are used to reduce the number of distance calculations needed. Although the improvements
do not reduce the theoretical computational complexity, they do greatly reduce the
number of distance calculations and achieve significant speedups in practice.
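To give the flavor of these bounds (this is my paraphrase of the lower-bound lemma as it is usually stated, not a quotation from [3]): if d(c, c') >= 2 d(x, c), then the triangle inequality gives d(x, c') >= d(x, c). In that case x cannot be closer to c' than to its current center c, so the distance d(x, c') never needs to be computed.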
My original goal in writing this paper was to take the same principles involving the
triangle inequality and apply them to the KHM algorithm. This turned out not to be
possible: the number of distance calculations cannot be reduced, because a soft membership
algorithm needs the distance between every center and every point.
KHM is less sensitive to initialization than KM, but its solutions still vary with the initial
centers. One idea to further reduce this variance would be to use the centers found after one
pass of KHM as the initialization for a second pass through KHM, artificially moving
centers from the denser clusters to locations between sparse clusters:
1. Run KHM
2. Run KM to determine hard membership
3. For each center, calculate variance of cluster.
4. Transplant centers from the m densest clusters to the m locations between sparse
clusters.
5. Rerun KHM
This algorithm could be evaluated by testing to see if the variance of the solution set is
less than that of one pass through KHM given a series of different initializations.
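A minimal sketch of steps 2 and 3 above (hard assignment followed by per-cluster variance), using the same NumPy conventions as the earlier sketches, is given below. The helper name and the equation of "densest" with "lowest variance" are my own assumptions, and how the transplant locations in step 4 are chosen is deliberately left open, as it is in the outline above.

import numpy as np

def cluster_variances(X, centers):
    """Steps 2-3: hard-assign each point to its nearest center (as KM would),
    then compute the variance of each resulting cluster."""
    # distances from every point in N to every center in K
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = D.argmin(axis=1)                        # step 2: hard membership
    variances = np.full(len(centers), np.inf)        # empty clusters treated as maximally sparse
    for j in range(len(centers)):
        members = X[assign == j]
        if len(members) > 0:
            # mean squared distance from the cluster's points to its center
            variances[j] = np.mean(np.sum((members - centers[j]) ** 2, axis=1))
    return assign, variances

# Step 4 would then transplant the centers of the m lowest-variance (densest) clusters
# toward locations between the sparse (high-variance) clusters before rerunning KHM.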
Appendix A: Complete Pseudo Code for KHM
Data Structures:
N: n by d+1 array - contains static information about data set
Nmin: n element array which holds the minimum distance to any center
K: k by d array that holds information about centers
M: n by k array that holds the distances from all points in N to all centers in K
Initialization
1. Create an initial K:
Choose any k points from N
Main Loop
2. Fill Matrix M:
Calculate distances from all points in N to all centers in K
3. Compute Nmin:
Find the minimum distance to any center for each point in N
4. Recompute Harmonic Averages and Update K:
The algorithm above is given in a simplified form so as to show all of the temporary values
that are calculated when taking the partial derivatives of the harmonic average function. Each
nested loop in step 4 represents one of the five decomposed equations from Equation 6 of [3].
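Because the detailed nested loops are not reproduced above, the following is a rough Python sketch of the KHM main loop as I understand it from [2] and [4]; the function names, the NumPy usage, and the small epsilon guard against division by zero are my own assumptions rather than part of the original pseudocode.

import numpy as np

def khm_update(X, centers, p=3.0, eps=1e-8):
    """One KHM iteration: returns the updated centers and the KHM objective value."""
    # step 2: fill M with distances from all points in N to all centers in K
    M = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    # KHM objective: harmonic average of the p-th powers of the distances, summed over N
    objective = np.sum(len(centers) / np.sum(M ** (-p), axis=1))
    # step 4: per-point, per-center coefficients d_ij^(-p-2) / (sum_l d_il^(-p))^2,
    # which come from the partial derivatives of the harmonic average function
    q = M ** (-p - 2) / (np.sum(M ** (-p), axis=1, keepdims=True) ** 2)
    # each new center is a weighted average of all points (soft membership)
    new_centers = (q.T @ X) / np.sum(q, axis=0)[:, None]
    return new_centers, objective

def khm(X, k, p=3.0, iters=50, seed=0):
    """Step 1: choose any k points from N as the initial centers, then iterate."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        centers, _ = khm_update(X, centers, p)
    return centers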
References
[3] C. Elkan. K-means: fast, faster, fastest. UCSD AI Seminar Talk, October 2002.
[4] G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find better
clusterings. To appear in Proceedings of the Eleventh International Conference on
Information and Knowledge Management (CIKM'02), November 2002.
[5] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs
and matrices. ACM-SIAM Symposium on Discrete Algorithms, 1999.
[6] G. Tzanetakis and P. Cook. Musical Genre Classification of Audio Signals. IEEE
Transactions on Speech and Audio Processing, 2002.