
A Cluster Estimation Method

with Extension to Fuzzy Model Identification


Stephen L. Chiu
Rockwell International Science Center
1049 Camino Dos Rios
Thousand Oaks, California 91360, USA

Abstract

We present an efficient method for estimating cluster centers of numerical data. This method can be used to determine the number of clusters and their initial values for initializing iterative optimization-based clustering algorithms such as Fuzzy C-Means. Here we combine this cluster estimation method with a least squares estimation algorithm to provide a fast and robust method for identifying fuzzy models from input/output data. A benchmark problem involving the prediction of a chaotic time series shows this method compares favorably with other more computationally intensive methods.

1. Introduction

Clustering of numerical data forms the basis of many classification and system modeling algorithms. The purpose of clustering is to distill natural groupings of data from a large data set, producing a concise representation of a system's behavior. In particular, the Fuzzy C-Means (FCM) clustering algorithm [1,2,3] has been widely studied and applied. The FCM algorithm is an iterative optimization algorithm that minimizes the cost function

J = \sum_{k=1}^{n} \sum_{i=1}^{c} (\mu_{ik})^m \|x_k - v_i\|^2

where n is the number of data points, c is the number of clusters, x_k is the k'th data point, v_i is the i'th cluster center, \mu_{ik} is the degree of membership of the k'th data point in the i'th cluster, and m is a constant (typically m = 2). The degree of membership \mu_{ik} is defined by

\mu_{ik} = 1 \Big/ \sum_{j=1}^{c} \left( \|x_k - v_i\| / \|x_k - v_j\| \right)^{2/(m-1)}    (1)

Starting with a desired number of clusters c and an initial guess for each cluster center v_i, i = 1, 2, ..., c, FCM will converge to a solution for v_i that represents either a local minimum or a saddle point of the cost function [3]. The quality of the FCM solution, like that of most nonlinear optimization problems, depends strongly on the choice of initial values (i.e., the number c and the initial cluster centers).

Yager and Filev [4] proposed a simple and effective algorithm for estimating the number and initial location of cluster centers. Their method is based on gridding the data space and computing a potential value for each grid point based on its distances to the actual data points; a grid point with many data points nearby will have a high potential value. The grid point with the highest potential value is chosen as the first cluster center. The key idea in their method is that once the first cluster center is chosen, the potential of all grid points is reduced according to their distance from the cluster center. Grid points near the first cluster center will have greatly reduced potential. The next cluster center is then placed at the grid point with the highest remaining potential value. This procedure of acquiring a new cluster center and reducing the potential of surrounding grid points repeats until the potential of all grid points falls below a threshold. Although this method is simple and effective, the computation grows exponentially with the dimension of the problem. For example, a clustering problem with 4 variables and each dimension having a resolution of 10 grid lines would result in 10^4 grid points that must be evaluated.

We present a modified form of Yager and Filev's cluster estimation method. We consider each data point, not a grid point, as a potential cluster center. Using this method, the number of effective "grid points" to be evaluated is simply equal to the number of data points, independent of the dimension of the problem. Another advantage of this method is that it eliminates the need to specify a grid resolution, in which trade-offs between accuracy and computational complexity must be considered. We also improve the computational efficiency and robustness of the original method via other modifications.

Although clustering is generally associated with classification problems, here we use the cluster estimation method as the basis of a fuzzy model identification algorithm. The key to the speed and robustness of this model identification algorithm is that

0-7803-1896-X/94 $4.00 ©1994 IEEE

it does not involve any iterative nonlinear optimization; in addition, the computation grows only linearly with the dimension of the problem. We use a benchmark problem in chaotic time series prediction to compare the performance of this algorithm with the published results of other algorithms.

2. Cluster Estimation

Consider a collection of n data points {x_1, x_2, ..., x_n} in an M dimensional space. Without loss of generality, we assume that the data points have been normalized in each dimension so that their coordinate ranges in each dimension are equal; i.e., the data points are bounded by a hypercube. We consider each data point as a potential cluster center and define a measure of the potential of data point x_i as

P_i = \sum_{j=1}^{n} e^{-\alpha \|x_i - x_j\|^2}    (2)

where \alpha = 4 / r_a^2 and r_a is a positive constant. Thus, the measure of potential for a data point is a function of its distances to all other data points. A data point with many neighboring data points will have a high potential value. The constant r_a is effectively the radius defining a neighborhood; data points outside this radius have little influence on the potential. This measure of potential differs from that proposed by Yager and Filev in two ways: (1) the potential is associated with an actual data point instead of a grid point; (2) the influence of a neighboring data point decays exponentially with the square of the distance instead of the distance itself. Using the square of the distance eliminates the square root operation that otherwise would be needed for determining the distance itself.

After the potential of every data point has been computed, we select the data point with the highest potential as the first cluster center. Let x_1^* be the location of the first cluster center and P_1^* be its potential value. We then revise the potential of each data point x_i by the formula

P_i \leftarrow P_i - P_1^* \, e^{-\beta \|x_i - x_1^*\|^2}    (3)

where \beta = 4 / r_b^2 and r_b is a positive constant. Thus, we subtract an amount of potential from each data point as a function of its distance from the first cluster center. The data points near the first cluster center will have greatly reduced potential, and therefore will be unlikely to be selected as the next cluster center. The constant r_b is effectively the radius defining the neighborhood which will have measurable reductions in potential. To avoid obtaining closely spaced cluster centers, we set r_b to be somewhat greater than r_a; a good choice is r_b = 1.5 r_a.

When the potential of all data points has been revised according to Eq. (3), we select the data point with the highest remaining potential as the second cluster center. We then further reduce the potential of each data point according to its distance to the second cluster center. In general, after the k'th cluster center has been obtained, we revise the potential of each data point by the formula

P_i \leftarrow P_i - P_k^* \, e^{-\beta \|x_i - x_k^*\|^2}

where x_k^* is the location of the k'th cluster center and P_k^* is its potential value.

In Yager and Filev's procedure, the process of acquiring a new cluster center and revising potentials repeats until

P_k^* < \varepsilon P_1^*

where \varepsilon is a small fraction. The choice of \varepsilon is an important factor affecting the results; if \varepsilon is too large, too few data points will be accepted as cluster centers; if \varepsilon is too small, too many cluster centers will be generated. We have found it difficult to establish a single value for \varepsilon that works well for all data patterns, and have therefore developed additional criteria for accepting/rejecting cluster centers. We use the following criteria:

if P_k^* > \bar{\varepsilon} P_1^*
    Accept x_k^* as a cluster center and continue.
else if P_k^* < \varepsilon P_1^*
    Reject x_k^* and end the clustering process.
else
    Let d_min = shortest of the distances between x_k^* and all previously found cluster centers.
    if d_min / r_a + P_k^* / P_1^* >= 1
        Accept x_k^* as a cluster center and continue.
    else
        Reject x_k^* and set the potential at x_k^* to 0.
        Select the data point with the next highest potential as the new x_k^* and re-test.
    end if
end if
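The procedure above — computing the potentials of Eq. (2), selecting the highest-potential point, revising potentials via Eq. (3), and applying the accept/reject criteria — fits in a few dozen lines. The following is an illustrative NumPy sketch, not the author's original code; the function and parameter names are our own, and the threshold defaults follow the values quoted in the text (\bar{\varepsilon} = 0.5, \varepsilon = 0.15).

```python
import numpy as np

def subtractive_clustering(X, ra=0.25, rb=None, eps_low=0.15, eps_up=0.5):
    """Illustrative sketch of the cluster estimation method.

    X is an (n, M) array of data points already normalized into a
    hypercube; ra, rb, eps_low, eps_up correspond to r_a, r_b,
    epsilon, and epsilon-bar in the text (names are ours).
    """
    if rb is None:
        rb = 1.5 * ra                     # the choice suggested in the text
    alpha = 4.0 / ra ** 2                 # constant of Eq. (2)
    beta = 4.0 / rb ** 2                  # constant of Eq. (3)

    # Potential of each point: sum of Gaussian contributions, Eq. (2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    P = np.exp(-alpha * sq).sum(axis=1)

    centers, P1 = [], None
    while True:
        k = int(np.argmax(P))             # candidate: highest remaining potential
        Pk = P[k]
        if P1 is None:
            P1 = Pk                       # potential of the first cluster center
        if Pk < eps_low * P1:
            break                         # definitely reject; end clustering
        if Pk <= eps_up * P1:
            # "gray region": trade off potential vs. distance to found centers
            dmin = min(np.linalg.norm(X[k] - c) for c in centers)
            if dmin / ra + Pk / P1 < 1.0:
                P[k] = 0.0                # reject and re-test the next candidate
                continue
        centers.append(X[k].copy())
        # Revise potentials around the accepted center, Eq. (3).
        P = P - Pk * np.exp(-beta * ((X - X[k]) ** 2).sum(axis=1))
    return np.array(centers)
```

For two well-separated groups of points this returns one center per group. The pairwise potential computation is O(n^2) in the number of data points but, as noted in the text, only linear in the dimension.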


Here \bar{\varepsilon} specifies a threshold for the potential above which we will definitely accept the data point as a cluster center; \varepsilon specifies a threshold below which we will definitely reject the data point. We use \bar{\varepsilon} = 0.5 and \varepsilon = 0.15. If the potential falls in the gray region, we check if the data point provides a good trade-off between having a reasonable potential and being sufficiently far from existing cluster centers.

3. Model Identification

When we apply the cluster estimation method to a collection of input/output data, each cluster center is in essence a prototypical data point that exemplifies a characteristic behavior of the system. Hence, each cluster center can be used as the basis of a rule that describes the system behavior.

Consider a set of c cluster centers {x_1^*, x_2^*, ..., x_c^*} in an M dimensional space. Let the first N dimensions correspond to input variables and the last M - N dimensions correspond to output variables. We decompose each vector x_i^* into two component vectors y_i^* and z_i^*, where y_i^* contains the first N elements of x_i^* (i.e., the coordinates of the cluster center in input space) and z_i^* contains the last M - N elements (i.e., the coordinates of the cluster center in output space).

We consider each cluster center x_i^* as a fuzzy rule that describes the system behavior. Given an input vector y, the degree to which rule i is fulfilled is defined as

\mu_i = e^{-\alpha \|y - y_i^*\|^2}    (4)

where \alpha is the constant defined by Eq. (2). We compute the output vector z via

z = \sum_{i=1}^{c} \mu_i z_i^* \Big/ \sum_{i=1}^{c} \mu_i    (5)

We can view this computational scheme in terms of an inference system employing traditional fuzzy if-then rules. Each rule has the following form:

if y_1 is A_1 & y_2 is A_2 & ... then z_1 is B_1 & z_2 is B_2 & ...

where y_j is the j'th input variable and z_j is the j'th output variable; A_j is an exponential membership function and B_j is a singleton. For the rule that is represented by cluster center x_i^*, A_j and B_j are given by

A_j(y_j) = e^{-\alpha (y_j - y_{ij}^*)^2}, \quad B_j = z_{ij}^*   (for the i'th rule)

where y_{ij}^* is the j'th element of y_i^* and z_{ij}^* is the j'th element of z_i^*. Our computational scheme is equivalent to an inference method that uses multiplication as the AND operator, weights the output of each rule by the rule's firing strength, and computes the output value as a weighted average of each rule's output.

Equations (4) and (5) provide a simple and direct way to translate a set of cluster centers into a fuzzy inference system. However, given the same number of cluster centers, the modeling accuracy of the system can improve significantly if we allow z_i^* in Eq. (5) to be an optimized linear function of the input variables, instead of a simple constant. That is, we let

z_i^* = G_i y + h_i    (6)

where G_i is a (M - N) x N constant matrix, and h_i is a constant column vector with M - N elements. The equivalent if-then rules then become the Takagi-Sugeno type [5], where the consequent of each rule is a linear equation in the input variables. We will refer to fuzzy models that set z_i^* as a constant as "0th order models", and those that employ Eq. (6) as "1st order models".

Expressing z_i^* as a linear function of the input allows a significant degree of rule optimization to be performed without adding much computational complexity. As pointed out by Takagi and Sugeno [5], given a set of rules with fixed premises, optimizing the parameters in the consequent equations with respect to training data reduces to a linear least-squares estimation problem. Such problems can be solved easily and the solution is always globally optimal. We refer to [5] for details on how to convert the equation parameter optimization problem into the least squares estimation problem. It suffices to say that we can quickly compute the optimal G_i and h_i that minimize the error between the output of the inference system and the output of the training data.

4. Results

We will first apply the model identification method to a simple 2-dimensional function-approximation problem to illustrate its operation. Next we will consider a benchmark problem in prediction of chaotic time series and compare the performance of this method with the published results of other methods.
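The inference computation of Eqs. (4) and (5) and the least-squares fit of the 1st order consequents of Eq. (6) can be made concrete with a short sketch. This is our own illustrative NumPy code for the scalar-output case (M - N = 1), not the paper's MATLAB implementation; the regression-matrix construction follows the Takagi-Sugeno formulation referenced in the text.

```python
import numpy as np

def fulfillment(Y, Ystar, alpha):
    """Eq. (4): mu_i(y) = exp(-alpha * ||y - y_i*||^2) for each rule i."""
    d2 = ((Y[:, None, :] - Ystar[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-alpha * d2)                     # shape (n_points, n_rules)

def zero_order_output(Y, Ystar, zstar, alpha):
    """Eq. (5): weighted average of the constant consequents z_i*."""
    mu = fulfillment(Y, Ystar, alpha)
    return (mu * zstar).sum(axis=1) / mu.sum(axis=1)

def fit_first_order(Y, z, Ystar, alpha):
    """Least-squares fit of 1st order consequents z_i* = g_i . y + h_i (Eq. 6).

    With fixed premises the model output is linear in (g_i, h_i), so the
    optimal parameters come from one global least-squares solve.
    """
    n, N = Y.shape
    c = Ystar.shape[0]
    mu = fulfillment(Y, Ystar, alpha)
    w = mu / mu.sum(axis=1, keepdims=True)         # normalized firing strengths
    # Design matrix: for each rule i, the columns [w_i*y_1 ... w_i*y_N, w_i].
    A = np.hstack([np.hstack([w[:, [i]] * Y, w[:, [i]]]) for i in range(c)])
    theta, *_ = np.linalg.lstsq(A, z, rcond=None)
    return theta.reshape(c, N + 1)                 # row i holds [g_i, h_i]

def first_order_output(Y, Ystar, params, alpha):
    """Weighted average of the linear consequents, Eqs. (5) and (6)."""
    mu = fulfillment(Y, Ystar, alpha)
    w = mu / mu.sum(axis=1, keepdims=True)
    rule_out = Y @ params[:, :-1].T + params[:, -1]   # (n, c): G_i y + h_i
    return (w * rule_out).sum(axis=1)
```

Because the normalized firing strengths sum to 1, a globally linear target is representable exactly by a 1st order model, which makes a convenient sanity check of the least-squares step.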


4.1 Function Approximation

For illustrative purposes, we consider the simple problem of modeling the nonlinear function

z = \sin(y) / y

For the range [-10, 10], we used equally spaced y values to generate 100 training data points. Because the training data are normalized before clustering so that they are bounded by a hypercube, we find it convenient to express r_a as a fraction of the width of the hypercube; in this example we chose r_a = 0.25. Applying the cluster estimation method to the training data, 7 cluster centers were found. Fig. 1 shows the training data and the location of the cluster centers.

[Fig. 1. Comparison of training data with unoptimized 0th order model output.]

Fig. 1 also shows the output of a 0th order fuzzy model that uses the constant z_i^* as given by the z coordinate of each cluster center. We see that the modeling error is quite large. Because the clusters are closely spaced with respect to the input dimension, there is significant overlap between the premise conditions of the rules. The degree of fulfillment \mu of each rule as a function of y (viz. Eq. 4) is shown in Fig. 2, where it is evident that several rules can simultaneously have high firing strength even when the input precisely matches the premise condition of one rule. Therefore, the model output is typically a neutral point interpolated from the z coordinates of strongly competing cluster centers.

[Fig. 2. Degree of fulfillment of each rule as a function of input y.]

One way to minimize competition among closely spaced cluster centers is to use Fuzzy C-Means' definition of \mu, viz. Eq. (1). When the input precisely matches the location of a cluster center, the FCM definition ensures that the input will have zero degree of membership in all other clusters. Using the FCM definition, the degree of fulfillment \mu as a function of y is shown in Fig. 3 for a typical rule. We see that \mu is 1 when y is at the cluster center associated with the rule and drops sharply to zero as y approaches a neighboring cluster center. The output of the 0th order fuzzy model when the FCM definition of \mu is adopted is shown in Fig. 4. It is evident that the modeling accuracy has improved significantly; in particular, the model output trajectory is now compelled to pass through the cluster centers.

[Fig. 3. Degree of fulfillment of a typical rule as a function of input y, based on the Fuzzy C-Means definition of \mu.]

[Fig. 4. Comparison of training data with unoptimized 0th order model output, for the case where inference computation is based on the Fuzzy C-Means definition of \mu.]
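The sharp 0/1 behavior of the FCM membership can be checked directly from Eq. (1) with m = 2. The following small sketch is our own illustrative code (names are ours); it uses the usual limiting convention that an input coinciding with a center receives membership 1 in that cluster and 0 in all others.

```python
import numpy as np

def fcm_fulfillment(y, centers, m=2.0):
    """Degree of fulfillment based on the FCM membership of Eq. (1).

    y is an input vector; centers is a (c, N) array of cluster-center
    input coordinates. Memberships across the c clusters sum to 1.
    """
    d = np.linalg.norm(centers - y, axis=1)
    if np.any(d == 0.0):                  # input sits exactly on a center
        mu = np.zeros(len(centers))
        mu[int(np.argmin(d))] = 1.0       # limiting case of Eq. (1)
        return mu
    inv = d ** (-2.0 / (m - 1.0))         # 1 / d^(2/(m-1)) per cluster
    return inv / inv.sum()                # normalize so memberships sum to 1
```

At a cluster center this reproduces the behavior seen in Fig. 3; between centers the memberships vary smoothly and always sum to 1.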

Although using the FCM definition of \mu can improve the accuracy of unoptimized 0th order models, we note that the \mu function and the resultant model output trajectory have highly nonlinear "kinks" compared with those obtained with the exponential definition of \mu. These kinks tend to limit the ultimately attainable accuracy when optimization is applied to the model.

We now apply least squares optimization to z_i^*. For illustrative purposes, we consider the two cases: (1) z_i^* = h_i, and (2) z_i^* = G_i y + h_i. In the first case, we retain the assumption of a 0th order model, but now optimize the constant value assigned to z_i^*. In the second case, we assume a 1st order model and optimize both G_i and h_i. Table 1 shows the root-mean-square (RMS) modeling error resulting from the different extents of optimization. Table 1 also shows the effects of using the exponential definition of \mu versus the FCM definition.

Optimization             RMS error (EXP \mu)   RMS error (FCM \mu)
z_i^* unoptimized        0.0174                0.01126
z_i^* = h_i              0.00542               0.00946
z_i^* = G_i y + h_i      0.00117               0.00142

Table 1. Comparison of RMS modeling error for different extents of optimization and for different definitions of \mu.

Using the exponential definition of \mu generally results in more accurate optimized models. In what follows, we will not draw any further comparisons between using the exponential definition versus using the FCM definition, but present only the results obtained from using the exponential definition. The output of the optimized 0th order model is shown in Fig. 5 and the output of the optimized 1st order model is shown in Fig. 6. The consequent function z_i^* = G_i y + h_i of each rule is shown in Fig. 7.

[Fig. 5. Comparison of training data with optimized 0th order model output.]

[Fig. 6. Comparison of training data with optimized 1st order model output.]

[Fig. 7. Consequent functions for the 1st order model.]

4.2 Chaotic Time Series Prediction

We now consider a benchmark problem in model identification - that of predicting the time series generated by the chaotic Mackey-Glass differential delay equation [6] defined by:

\frac{dx(t)}{dt} = \frac{0.2 \, x(t - \tau)}{1 + x^{10}(t - \tau)} - 0.1 \, x(t)

The task is to use past values of x up to time t to predict the value of x at some time t + \Delta T in the future. The standard input for this type of prediction is N points in the time series spaced \tau apart, i.e., the input vector is y = {x(t - (N-1)\tau), ..., x(t - 2\tau), x(t - \tau), x(t)}. To allow comparison with other methods, we use N = 4, \tau = 6, \Delta T = 6. Therefore, each data point in the training set consists of

x = { x(t-18), x(t-12), x(t-6), x(t), x(t+6) }

where the first 4 elements correspond to the input variables and the last element corresponds to the output variable. We will compare the performance of the model identification method with the Adaptive-Network-Based Fuzzy Inference System (ANFIS) proposed by Jang [7] as well as other neural network-based and polynomial-fitting methods reported in [8]. The ANFIS


algorithm also generates fuzzy models consisting of Takagi-Sugeno type rules. After specifying the number of membership functions for each input variable, the ANFIS algorithm iteratively learns the parameters of the premise membership functions via a neural network and optimizes the parameters of the consequent equations via linear least squares estimation. ANFIS has the advantage of being significantly faster and more accurate than many pure neural-network-based methods.

For the Mackey-Glass time series problem, we chose r_a = 0.5 (i.e., the radius of influence of a cluster center is half the width of the data hypercube). This is done to approximate the settings used by Jang [7], where the number of membership functions for each input was chosen to be 2, thus dividing each input dimension into two partitions. For the 4-input Mackey-Glass problem, this leads to 2^4 = 16 fuzzy partitions in the input space, and thus 16 rules in ANFIS.

We used the same data set as that used by Jang [7], which consisted of 1000 data points extracted from t = 118 to t = 1117. The first 500 data points were used for training, and the last 500 data points were used for checking the identified model. The cluster estimation method identified 9 cluster centers; therefore, only 9 rules were generated. The consequent equations for a 1st order model were then optimized. The modeling error with respect to the checking data is listed in Table 2, along with the results from the other methods as reported in [7] and [8]. The error index shown in Table 2 is defined as the root mean square error divided by the standard deviation of the actual time series [7,8].

[Table 2. Comparison of results from different methods of model identification. (Rows 2 and 3 are from [7], and the last 4 rows are from [8].)]

Fig. 8 shows the model output evaluated with respect to the checking data. We see that the model output is indistinguishable from the checking data. Comparison between our cluster estimation-based method and ANFIS shows that they both provide approximately the same degree of accuracy. However, identifying the Mackey-Glass model took the ANFIS algorithm (coded in C) 2.5 hours on an Apollo 700 series workstation [7], while the cluster estimation-based algorithm (coded in MATLAB) was able to identify the model in 11 minutes on a Macintosh Centris 660AV (68040 processor running at 25 MHz). Noting that MATLAB is an interpreted language, we expect a further increase in speed when the algorithm is coded in C.

[Fig. 8. Comparison of checking data with optimized 1st order model output.]

5. Summary

We presented an efficient method for estimating cluster centers from numerical data. When combined with linear least-squares estimation, it provides an extremely fast and accurate method for identifying fuzzy models. The author thanks Roger Jang for providing the data set for the Mackey-Glass time series example.

References

[1] J. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact, well separated clusters," J. Cybernetics, vol. 3, no. 3, pp. 32-57, 1974.
[2] J. Bezdek, "Cluster validity with fuzzy sets," J. Cybernetics, vol. 3, no. 3, pp. 58-71, 1974.
[3] J. Bezdek, R. Hathaway, M. Sabin, and W. Tucker, "Convergence theory for fuzzy c-means: counterexamples and repairs," The Analysis of Fuzzy Information, J. Bezdek, ed., vol. 3, chap. 8, CRC Press, 1987.
[4] R. Yager and D. Filev, "Approximate clustering via the mountain method," Iona College Tech. Report #MII-1305, 1992. Also to appear in IEEE Trans. on Systems, Man & Cybernetics.
[5] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its application to modeling and control," IEEE Trans. on Systems, Man & Cybernetics, vol. 15, pp. 116-132, 1985.
[6] M. Mackey and L. Glass, "Oscillation and chaos in physiological control systems," Science, vol. 197, pp. 287-289, July 1977.
[7] R. Jang, "ANFIS: Adaptive-network-based fuzzy inference system," IEEE Trans. on Systems, Man & Cybernetics, May 1993.
[8] R.S. Crowder, "Predicting the Mackey-Glass time series with cascade-correlation learning," Proc. 1990 Connectionist Models Summer School, pp. 117-123, Carnegie Mellon University, 1990.

