
K-Means & K-Harmonic Means: A Comparison of Two Unsupervised

Clustering Algorithms.

Douglas Turnbull
CSE 202 – Project
November 2002

1. Introduction

Gaining a relational understanding of information is important to biology, human cognition, artificial intelligence, and many other data-intensive fields of research. In many instances, relationships are not obvious by inspection because there are too many data points or because the data has high dimensionality. In these cases, computation can be used to divide the information into groups called clusters. To divide the information effectively, we must first define a criterion for forming groups and second find an optimal grouping based on that criterion.

The problem can be generalized as follows: given a set N of n data points in d-dimensional space, we must choose a set K of k points from N, called centers, so as to optimize some criterion. In most cases it is natural to assume that n is much greater than k and that d is relatively small. This formulation is an example of unsupervised learning: the system creates groupings based only on the criterion and the information contained in the n data points.

K-means and K-harmonic means are two center-based algorithms that have been developed to solve this problem. K-means (KM) is a popular algorithm that was first presented over three decades ago [1]. Its criterion minimizes the total squared distance from each point in N to that point's closest center in K. K-harmonic means (KHM) is a more recent algorithm presented by Zhang in 2000 [2]. This algorithm minimizes the harmonic average of the distances from each point in N to all centers in K.

This paper will describe the general KM algorithm in section 2 and KHM in section 3. Section 4 will compare KM and KHM, first looking at a test case where KHM outperforms KM and then discussing some of the fundamental differences between the two algorithms. Section 5 will present one current application, the classification of music, and how it can use unsupervised clustering. Lastly, section 6 will look at some potential research ideas for KHM.

2. K-Means

2.1 The General KM Algorithm


An outline for this algorithm is based on the algorithm presented by Elkan [3].
Data Structures:
N: n by (d+1) array - contains static information about the data set
and a dynamic pointer to the closest center
K: k by d array that holds information about the centers
M: k by n array that holds the distance from each point in N to each center in K

Initialization
1. Create an initial K:
Choose any k points from N
2. Initialize pointers in N:
Assign each pointer to any element in K
Main Loop
3. Fill Matrix M:
Calculate the distances from all points in N to all centers in K
4. Update Array N:
Change the pointer N[i] to point to a new center if that new center is closer than the previous center
5. If no pointer to a center is updated in step 4 then stop:
The algorithm has converged
6. Otherwise Recompute and Update K:
For each center c in K, compute the average value of all points in N that point to c and replace c with the new point
c' from the data set that is closest to that average value. This can all be done with a linear pass through N.
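
The following is a minimal sketch of the outline above (not the author's implementation); the names kmeans, max_iter, and seed are illustrative, and it follows the variant in step 6 where each center is replaced by the data point closest to its cluster's mean.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Minimal sketch of the KM outline above; centers are always data points."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]   # step 1: pick k points from N
    assign = np.zeros(n, dtype=int)                      # step 2: pointers into K
    for it in range(max_iter):
        # step 3: fill M with distances from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)                # step 4: point to closest center
        if it > 0 and np.array_equal(new_assign, assign):
            break                                        # step 5: no pointer changed
        assign = new_assign
        for j in range(k):                               # step 6: recompute each center
            members = X[assign == j]
            if len(members):
                mean = members.mean(axis=0)
                centers[j] = X[np.linalg.norm(X - mean, axis=1).argmin()]
    return centers, assign
```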

2.2 K-means Analysis


2.2.1 Objective Function and Membership Constraint
This algorithm minimizes the total squared distance between each point $x_i$ and its closest center $c_j$:

$$\min \; \sum_{i=1}^{n} \; \min_{c_j \in K} \lVert x_i - c_j \rVert^2$$

Because the objective function only uses the minimum distances, each point $x_i$ is implicitly assigned to exactly one center $c_j$. The property of being assigned to one center is referred to as hard membership. (We will return to this notion of membership in section 4.)
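
For concreteness, the objective value can be computed directly from this definition; a small sketch (the name km_objective is illustrative):

```python
import numpy as np

def km_objective(X, centers):
    """Sum over all points of the squared distance to the closest center."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # n x k
    return float((dists.min(axis=1) ** 2).sum())
```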

2.2.2 Storage Requirements


Storage requirements, assuming that n >> k >> d:

N + K + M = n(d+1) + kd + nk = O(nk)

The size of the matrix M dominates the storage.

2.2.3 Running Time


The running time of each iteration of the algorithm is dominated by the distance calculations in step 3. If we assume that a d-dimensional distance calculation takes O(d) time, then the running time per iteration is:

O(nkd)

The convergence rate is not just a function of n, k, and d; it depends on the nature of the data set. Therefore, it is difficult to state a total runtime for the algorithm. For this reason, center-based clustering algorithms are usually compared by the runtime of a single iteration.

2.2.4 Correctness
It has been shown in [5] that the problem of finding an optimal grouping of n data points in d dimensions into k groups is NP-hard. K-means uses a greedy heuristic that seeks a local minimum given an initial condition.

In each iteration we find a better local solution: we only accept new centers that reduce the value of the objective function. The algorithm converges when no center can be exchanged in a way that reduces the total cost. Although this gives us a local minimum, we would have to rerun the algorithm with every initial condition to guarantee a global minimum. Since there are an exponential number of possible initializations, rerunning with all of them would take exponential time.

3. K-Harmonic Mean

KHM is similar to KM in that both are center-based iterative algorithms. The main difference is how the centers are updated in each iteration. The formula for updating the centers comes from a derivation found in [2]. In his paper, Zhang introduces a class of KHM algorithms with a parameter p that is the power associated with the distance calculation. In the standard KM algorithm p would be 2, because the distance calculation is the squared distance $\lVert x_i - c_j \rVert^2$. It was found that KHM works better with values of p > 2. This will be discussed further in section 4.

3.1 Harmonic Average Function


The harmonic average is defined as

$$\mathrm{HA}(\{a_1,\dots,a_K\}) = \frac{K}{\sum_{k=1}^{K} 1/a_k}$$

This function has the property that if any one element of $a_1,\dots,a_K$ is small, the harmonic average will also be small. If there are no small values, the harmonic average will be large. It behaves like a minimum function but also gives some weight to all of the other values.
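
A small example of this behavior (the values are illustrative):

```python
def harmonic_average(values):
    """HA({a_1, ..., a_K}) = K / sum(1/a_k)."""
    return len(values) / sum(1.0 / a for a in values)

# One small element pulls the harmonic average down toward it,
# but the remaining values still contribute:
print(harmonic_average([100, 100, 100]))  # 100.0
print(harmonic_average([1, 100, 100]))    # ~2.94: close to the minimum, but not equal to it
print(min([1, 100, 100]))                 # 1
```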

3.2 The KHM Algorithm


Data Structures:
N: n by (d+1) array - contains static information about the data set
K: k by d array that holds information about the centers

Initialization
1. Create an initial K:
Choose any k points from N
Main Loop
2. Compute Harmonic Averages and Update K: see Appendix A
3. If no center is updated in step 2 then stop:
The algorithm has converged

(A more complete version of the pseudocode is given in Appendix A.)

3.3 K-harmonic Means Analysis


3.3.1 Objective Function and Membership Constraint
The objective function of KHM is given by:

$$\min \; \sum_{i=1}^{n} \mathrm{HA}\big(\{\lVert x_i - c_j \rVert^2 : c_j \in K\}\big)$$

where HA() is the harmonic average of the distances from the point to every center. Unlike KM, this algorithm uses information from all of the centers in K to calculate the harmonic average for each point in N. This means that no center completely owns a point; rather, every center partially influences the harmonic average of each point. This condition is referred to as soft membership.

3.3.2 Space Requirements, Running Time, and Correctness


The space requirement, O(nd), and the running time per iteration, O(nkd), for KHM are given in [2]. The space requirement for KHM is somewhat less than that of KM because the algorithm never stores the large matrix M of distances from all centers to all points. All of these distance calculations must still be made, however; they are simply stored temporarily.

Like KM, KHM finds a locally optimal solution based on initial conditions using a greedy heuristic. Each iteration improves the value of the objective function until convergence.
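
For concreteness, the following is a sketch of one KHM center update in its direct, unfactored form, which is consistent with the weight function used in section 4.1.2 (Appendix A factors out the minimum distance instead). The names khm_update, p, and eps are illustrative; eps guards against points that coincide with a center.

```python
import numpy as np

def khm_update(X, centers, p=3, eps=1e-9):
    """One KHM iteration: every center becomes a weighted mean of all points (soft membership)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps   # n x k distances
    # influence of point j on center i: d_ij^-(p+2) scaled by the per-point weight 1/(sum_l d_lj^-p)^2
    q = d ** (-(p + 2)) / (d ** (-p)).sum(axis=1, keepdims=True) ** 2
    return (q[:, :, None] * X[:, None, :]).sum(axis=0) / q.sum(axis=0)[:, None]
```

Iterating khm_update until the centers stop moving gives the main loop of section 3.2.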

4. Comparison of K-Means with K-Harmonic means

A recent study by Hamerly and Elkan [4] found that KHM significantly outperforms KM in a number of experiments. They also found that KHM is much less sensitive to initial conditions. This section will give a specific case in which KHM finds a better grouping. We will then discuss two properties of KHM that make it superior to KM.

4.1 Test Case


Given a one-dimensional set of 8 data points, we wish to find an optimal clustering with three centers. In this case d = 1, n = 8, and k = 3. The points are located at (1, 2, 3, 4, 6, 7, 9, 10) and the initial centers are located at (1, 4, 9).

position: 1   2   3   4   5   6   7   8   9   10
points:   x1  x2  x3  x4      x5  x6      x7  x8
centers:  c1          c2                  c3

4.1.1 KM on Test Case


Running KM, we find that after the first iteration (x1, x2) is assigned to c1, (x3, x4, x5) is assigned to c2, and (x6, x7, x8) is assigned to c3. When recomputing the averages for each center in step 6 of the KM algorithm, we find that no centers need to be changed, and the algorithm converges with an objective function value of 11. (We assume that an average value equidistant from two data points is assigned to the point with the lesser value. This is an unimportant detail but a needed policy for the reassignment of c1.)

However, a better local minimum would have been found given an initialization in which the centers were assigned to (x2, x5, x7), the points at 2, 6, and 9. This would yield a local objective value of 8.
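
Both objective values can be checked directly against the definition in section 2.2.1; km_objective_1d below is an illustrative helper for this one-dimensional case.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 6, 7, 9, 10], dtype=float)

def km_objective_1d(points, centers):
    # squared distance from each point to its closest center, summed over all points
    d = np.abs(points[:, None] - np.asarray(centers, dtype=float)[None, :])
    return float((d.min(axis=1) ** 2).sum())

print(km_objective_1d(X, [1, 4, 9]))  # 11.0 -- the clustering KM converges to
print(km_objective_1d(X, [2, 6, 9]))  # 8.0  -- the better clustering described above
```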
4.1.2 KHM on Test Case
Given the same initial setup, the KHM algorithm given in Appendix A, with p = 3, converges after 3 iterations. In each iteration, the weight each data point contributes to the objective function is calculated.

The weight function is given as:

$$\mathrm{weight}_i = \frac{1}{\Big(\sum_{l=1}^{K} 1 / \lVert x_i - c_l \rVert^{p}\Big)^{2}}$$

Once calculated, these weights are normalized and used to compute the new center locations.

The state after each iteration is as follows:

Iteration 1 weights (x1 through x8): 0, .79, .78, 0, 34, 36, 0, .99
Iteration 2 weights (x1 through x8): .91, 0, .24, 0, .76, 0, 54, 524
Iteration 3 weights (x1 through x8): .99, 0, .92, 15, 0, .77, 0, .96

(The original figures also show the positions of c1, c2, and c3 on the number line after each iteration.)

Note that the final set of centers does find a good local (and in this case, global) optimum with respect to both the KHM and the KM objective functions. The weights are relatively high (524, 54, 36) for points far from any center and low for points close to a center (0, .24, .78, where 0 represents a value on the order of epsilon).
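
The weight formula with p = 3 can be checked on the initial configuration; a small eps stands in for the epsilon distances at points that coincide with a center. It reproduces the iteration 1 weights.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 6, 7, 9, 10], dtype=float)
centers = np.array([1, 4, 9], dtype=float)
p, eps = 3, 1e-9

d = np.abs(X[:, None] - centers[None, :]) + eps        # distance from each point to each center
weights = 1.0 / (1.0 / d ** p).sum(axis=1) ** 2
print(np.round(weights, 2))  # [ 0.    0.79  0.78  0.   34.59 36.    0.    0.99]
```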

4.2 Hard vs. Soft Membership


As we saw in section 2.2.1, KM imposes hard membership on its data points: each data point is assigned to exactly one center. This means that each data point only influences the center to which it is assigned. In an area with a high local density of data points and centers, a center may be unable to move away from a data point even though there is a second center nearby. Moving the center might yield a slightly worse local solution, but globally, repositioning one of the two centers would benefit the clustering. However, this kind of swap cannot happen using KM.

KHM differs in that, for each data point, the objective function uses the distances to all centers. The harmonic average is sensitive to the fact that two or more centers are close to a data point. The algorithm will naturally shift one or more of these centers away to areas where there are data points with no close center, lowering the value of the objective function.

4.3 Static vs. Dynamic Weights


In each iteration of KM, the objective function gives equal weight to all of the data points. KHM, on the other hand, assigns dynamic weights to each data point based on the harmonic average. As described in section 3.1, the harmonic average assigns a large weight to a data point that is not close to any center and a small weight to a data point that is close to one or more centers. This principle is important because we want to avoid creating densely packed areas containing multiple centers. By increasing the weight of data points that are not close to any center, the algorithm can attract centers away from dense areas without increasing the weight of the data points in those dense areas. This principle is central to KHM being less sensitive to initialization than KM: in KM, centers tend to get trapped in dense areas, yielding poor clustering results.

5 Application: Music Genre Classification

One application that involves unsupervised clustering is the automatic classification of music for information retrieval (IR). The problem involves large libraries of digitally recorded sound files. In [6], Tzanetakis and Cook use signal processing techniques and statistical pattern recognition algorithms to extract features from recorded sound files.

In this case, the sound files represent the data points and the features determine the dimensionality of the data. Much energy has been put into supervised clustering based on nearest neighbor algorithms with a predefined audio classification hierarchy and associated training sets. However, in this method, notions of human cognition and psychoacoustics play a large role in determining the success of a classification system.

Using a fully automatic clustering algorithm such as KHM could limit the subjective nature of musical classification. Although the resulting classification may not make sense to a human, it could be used to automatically store and retrieve sound information with a much higher degree of accuracy.

6 Future Work

6.1 Original Goal

Over the past 30 years, many papers have presented ways to improve KM. Some focus on creating good initialization methods while others look at finding optimal values for k. My thought was to apply some of these ideas to KHM.

In [3], Elkan describes ways to improve KM using geometric insights from the triangle inequality. He presents two lemmas, one for an upper bound and one for a lower bound, that are used to reduce the number of distance calculations needed. Although the improvements do not reduce the theoretical computational complexity, they do greatly reduce the number of distance calculations and achieve significant speedups in practice.

My original goal in writing this paper was to apply the same triangle-inequality principles to the KHM algorithm. This turned out not to be possible because the number of calculations cannot be reduced: a soft membership algorithm needs every distance between every center and every point.

6.2 Improving on KHM


Despite Zhang's claim in [2] that KHM is "essentially insensitive to initialization," both [2] and [4] show that there still remains some variance in the local optimum found under different initializations.

One idea to further reduce this variance would be to use the centers found after one pass of KHM as the initialization for a second pass of KHM. We could artificially move centers from denser clusters to locations between sparse clusters.

The algorithm is as follows:

1. Run KHM
2. Run KM to determine hard membership
3. For each center, calculate variance of cluster.
4. Transplant centers from the m densest clusters to the m locations between sparse
clusters.
5. Rerun KHM

This algorithm could be evaluated by testing whether the variance of the solution set, over a series of different initializations, is less than that of a single pass of KHM.
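
A minimal sketch of how steps 1 through 5 might be realized is given below. It reuses the khm_update sketch from section 3; the function name two_pass_khm, the use of cluster variance as an inverse measure of density, and the midpoint transplant heuristic are illustrative assumptions, not a specified part of the proposal.

```python
import numpy as np

def khm_update(X, centers, p=3, eps=1e-9):
    # same KHM center update as the sketch in section 3
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    q = d ** (-(p + 2)) / (d ** (-p)).sum(axis=1, keepdims=True) ** 2
    return (q[:, :, None] * X[:, None, :]).sum(axis=0) / q.sum(axis=0)[:, None]

def two_pass_khm(X, k, m=1, p=3, iters=50, seed=0):
    """Hypothetical sketch of steps 1-5 above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):                                   # 1. run KHM
        centers = khm_update(X, centers, p=p)
    # 2. hard membership via a single KM-style assignment
    assign = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    # 3. per-cluster variance (mean squared distance to the cluster's center)
    var = np.array([((X[assign == j] - centers[j]) ** 2).sum(axis=1).mean()
                    if (assign == j).any() else np.inf for j in range(k)])
    # 4. move the m lowest-variance (densest) centers to points midway between the
    #    m highest-variance (sparsest) clusters and their nearest neighboring centers
    dense, sparse = np.argsort(var)[:m], np.argsort(var)[-m:]
    for cd, cs in zip(dense, sparse):
        others = np.delete(np.arange(k), cs)
        nearest = others[np.linalg.norm(centers[others] - centers[cs], axis=1).argmin()]
        centers[cd] = (centers[cs] + centers[nearest]) / 2
    for _ in range(iters):                                   # 5. rerun KHM
        centers = khm_update(X, centers, p=p)
    return centers
```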
Appendix A: Complete Pseudocode for KHM
Data Structures:
N: n by (d+1) array - contains static information about the data set
Nmin: n-element array which holds each point's minimum distance to any center
K: k by d array that holds information about the centers
M: n by k array that holds the distance from each point in N to each center in K

Temporary Arrays (could be reduced, but shown for simplicity)

U: n-element array
Q: n by k temporary array
QQ: k-element array
R: n by k temporary array
RR: k-element array
T: n by k temporary array
p: the KHM parameter discussed at the beginning of section 3

Initialization
1. Create an initial K:
Choose any k points from N

Main Loop
2. Fill Matrix M:
Calculate the distances from all points in N to all centers in K
3. Compute Nmin:
Find the minimum distance to any center for each point in N
4. Recompute Harmonic Averages and Update K:
(U, QQ, and the new K are initialized to 0 before their accumulation loops.)

For each point (j = 1 to n)
For each center (i = 1 to k)
U[j] = U[j] + (Nmin[j] / M[j,i])^p
U[j] = U[j] - 1

For each center (i = 1 to k)
For each point (j = 1 to n)
Q[j,i] = [ Nmin[j]^(p-2) * (Nmin[j] / M[j,i])^(p+2) ] / [ (1 + U[j])^2 ]

For each center (i = 1 to k)
For each point (j = 1 to n)
QQ[i] = QQ[i] + Q[j,i]

For each center (i = 1 to k)
For each point (j = 1 to n)
R[j,i] = Q[j,i] / QQ[i]

For each center (i = 1 to k)
For each point (j = 1 to n)
K[i] = K[i] + R[j,i] * N[j]

5. If no center is updated in step 4 then stop:

The algorithm has converged

The algorithm above is given in a simplified form so as to show all of the temporary values that arise from taking partial derivatives of the harmonic average function. Each nested loop in step 4 corresponds to one of the five decomposed equations from Equation 6 of [2].
References

[1] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967.

[2] B. Zhang. Generalized k-harmonic means - boosting in unsupervised learning. Technical Report HPL-2000-137, Hewlett-Packard Labs, 2000.

[3] C. Elkan. K-means: fast, faster, fastest. UCSD AI Seminar Talk, October 2002.

[4] G. Hamerly and C. Elkan. Alternatives to the k-means algorithm that find better clusterings. To appear in Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM'02), November 2002.

[5] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in large graphs and matrices. ACM-SIAM Symposium on Discrete Algorithms, 1999.

[6] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 2002.
