
Approximate Clustering in Very Large Object Data

Rick Hathaway

Jim Bezdek

geFFCM

geFEM
The geFFCM Algorithm

Input
    XN : a VL object data set

Pick
    IA : active feature index set
    SM : selection method in {A, FA, CA, SA}
    {bk} : number of EC histogram intervals, k ∈ IA
    {αk} : termination thresholds, k ∈ IA
    p : initial sample size, as a % of N
    ∆p : percentage increment

Compute
    n = ⌈pN/100⌉ : initial number of samples
    n/b : initial number of samples per bin
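As a concrete (and purely illustrative) reading of these inputs, here is a minimal Python sketch; the class and function names are ours, not part of geFFCM:

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class GeFFCMInputs:
    """Hypothetical container for the quantities picked above."""
    active_features: list            # IA, the active feature index set
    selection_method: str = "SA"     # SM, one of {"A", "FA", "CA", "SA"}
    bins: int = 100                  # bk, # of EC histogram intervals per active feature
    alpha: float = 0.95              # alpha_k, termination threshold per active feature
    p: float = 1.0                   # initial sample size, as a % of N
    delta_p: float = 1.0             # sample-size increment, as a % of N

def initial_sample_size(N: int, p: float) -> int:
    """n = ceil(p*N/100); roughly n/b sample values then land in each EC bin."""
    return ceil(p * N / 100.0)
```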
PS1 Randomly choose (without replacement)* n vectors Xn ⊂ XN

PS2 Choose a selection strategy for the active features (define IA)

    S1 = A = Single Accept : a specified feature passes
    S2 = FA = First Accept : any single feature passes
    S3 = CA = Cumulative Accept : each feature in a specified set either did or does pass
    S4 = SA = Simultaneous Accept : all features in a specified set pass simultaneously
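The four strategies differ only in how the per-feature test results are combined into an accept/reject decision. A hedged sketch of that logic (the `passes`/`history` bookkeeping and the function name are our assumptions):

```python
def accept_sample(passes: dict, history: dict, method: str, targets: list) -> bool:
    """Combine per-feature divergence-test results into an accept/reject decision.

    passes  : {feature k: True/False} -- did k pass the test on THIS round?
    history : {feature k: True/False} -- did k pass on ANY earlier round?
    targets : the specified feature(s) the strategy refers to
    """
    if method == "A":     # S1, Single Accept: the one specified feature passes
        return passes[targets[0]]
    if method == "FA":    # S2, First Accept: any single feature passes
        return any(passes.values())
    if method == "CA":    # S3, Cumulative Accept: each target feature did or does pass
        return all(history[k] or passes[k] for k in targets)
    if method == "SA":    # S4, Simultaneous Accept: all target features pass at once
        return all(passes[k] for k in targets)
    raise ValueError(f"unknown selection method: {method}")
```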
The single accept (S1=A) selection strategy for active feature k

[Flow diagram: draw the initial sample Xn from the VL data XN; build the EC bins for feature k from {x_ik} and test {x_ik}. While the test fails, get an increment ∆X and test {x_ik + ∆x_ik}; once the test passes, run LFCM on all p features of the accepted sample.]
The first accept (S2=FA) selection strategy for any single feature

[Flow diagram: draw the initial sample Xn from the VL data XN; for each feature k, build the EC bins from {x_ik} and test {x_ik}. If all k tests fail, reject the current sample, get an increment ∆X, and test {x_ik + ∆x_ik} for each k; as soon as any one feature passes (feature 2 in the figure), accept the current sample and run LFCM on all p features of the accepted sample.]
The cumulative accept (S3=CA) selection strategy for features (j,k)

[Flow diagram: draw the initial sample Xn from the VL data XN; build the EC bins for features j and k and test {x_ij} and {x_ik}. After each increment ∆X, test {x_ij + ∆x_ij} and {x_ik + ∆x_ik}; on one round j may fail while k passes, and on a later round j may pass while k fails. Once each feature either did pass or does pass, accept the current sample and run LFCM on all p features of the accepted sample.]
The simultaneous accept (S4=SA) selection strategy for features (j,k)

[Flow diagram: as above, draw the initial sample Xn from the VL data XN, build the EC bins for features j and k, and test {x_ij} and {x_ik} after each increment ∆X. Rounds in which j fails while k passes, or j passes while k fails, do not end the sampling; only when both features pass at the same time is the current sample accepted, and LFCM is then run on all p features of the accepted sample.]
PS3 For each active feature k

    Sort the initial sample {x_1^k, …, x_n^k}

    Construct b EC bins from the order statistics:

        (−∞, x_(n/b)^k] , (x_(n/b)^k, x_(2n/b)^k] , … , (x_((b−1)n/b)^k, ∞)

    Many narrow bins in dense data areas
    Fewer (wider) bins in sparse data areas
    Endpoints vary from feature to feature
    Endpoints depend only on the original n-sample
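A minimal numpy sketch of this bin construction, assuming the initial sample of one active feature is held in a 1-D array (the function name is ours):

```python
import numpy as np

def ec_bin_edges(sample_k: np.ndarray, b: int) -> np.ndarray:
    """Equal-content (EC) bin edges for one active feature, from order statistics.

    Returns the b+1 edges  -inf, x_(n/b), x_(2n/b), ..., x_((b-1)n/b), +inf,
    so each bin holds roughly n/b of the initial sample values.
    """
    x = np.sort(sample_k)                 # order statistics x_(1) <= ... <= x_(n)
    n = x.size
    idx = (np.arange(1, b) * n) // b      # interior order-statistic indices (1-based)
    return np.concatenate(([-np.inf], x[idx - 1], [np.inf]))
```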
What do (EC) histogram bins look like?

n = 500 observations (e.g., active feature values of 500 columns of XN)

b = 5 bins ⇒ each bin contains 100 values = 20% of the sample

Bin endpoints, determined by order statistics:

    −∞   x_(101)   x_(201)   x_(301)   x_(401)   +∞
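Continuing the sketch above for this n = 500, b = 5 example (`ec_bin_edges` is the hypothetical helper defined earlier):

```python
rng = np.random.default_rng(0)
edges = ec_bin_edges(rng.normal(size=500), b=5)   # 6 edges; about 100 values per bin
```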
PS4 For each active feature k ∈ IA
    For each bin i = 1 to bk

        Get the full-data count for bin i : N_i^k
        Get the sample count for bin i : n_i^k

    Compute the divergence*

        div_k = n · Σ_{i=1}^{bk} ( N_i^k/N − n_i^k/n ) · log( n·N_i^k / (N·n_i^k) )

* Sampling without replacement ⇒ div is not approximately χ2; here div measures
  goodness of fit (i.e., it is not a hypothesis test statistic)
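A direct translation of this divergence into numpy, assuming the full-data and sample bin counts are already available (the function and argument names are ours):

```python
import numpy as np

def divergence(full_counts: np.ndarray, sample_counts: np.ndarray) -> float:
    """div_k = n * sum_i (N_i/N - n_i/n) * log( n*N_i / (N*n_i) ) over the EC bins.

    full_counts   : N_i^k, counts of all of XN in the EC bins of feature k
    sample_counts : n_i^k, counts of the current sample Xn in the same bins
    """
    N = full_counts.sum()
    n = sample_counts.sum()
    p_full = full_counts / N
    p_samp = sample_counts / n
    # guard empty bins; the slide's formula implicitly assumes N_i, n_i > 0
    ok = (full_counts > 0) & (sample_counts > 0)
    terms = (p_full[ok] - p_samp[ok]) * np.log(p_full[ok] / p_samp[ok])
    return float(n * terms.sum())
```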
PS5 WHILE ( ∃ k ∈ IA : div_k > F⁻¹(1 − αk) )        (F = cdf of χ2 with bk − 1 degrees of freedom)

    ∆n = min{ N − n , ⌈∆p·N/100⌉ }
    Get ∆X_∆n from XN − Xn
    Xn = Xn ∪ ∆X_∆n
    Return to PS4

PS6 Run LFCM on Xn to get (Un, Vn)

PS7 Extend the fuzzy partition : Un → [Un | UN−n]
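Putting PS1-PS7 together, here is a minimal sketch of the progressive-sampling loop. It assumes a literal FCM routine `lfcm(X, m)` returning `(U, V)` is available, reuses the hypothetical `ec_bin_edges` and `divergence` helpers from the earlier sketches, and applies the PS5 condition exactly as written (augment until every active feature passes); it is not the authors' implementation.

```python
import numpy as np
from scipy.stats import chi2

def geffcm_sketch(X, active, b, alpha, p, delta_p, lfcm, m=2.0, rng=None):
    """Progressive sampling (PS1-PS5), LFCM on the accepted sample (PS6), extension (PS7)."""
    rng = rng or np.random.default_rng()
    N = X.shape[0]
    used = int(np.ceil(p * N / 100.0))             # initial n
    dn = int(np.ceil(delta_p * N / 100.0))         # increment Delta_n
    perm = rng.permutation(N)                      # PS1: sampling without replacement
    S = perm[:used]
    edges = {k: ec_bin_edges(X[S, k], b) for k in active}                  # PS3
    full = {k: np.histogram(X[:, k], bins=edges[k])[0] for k in active}
    thresh = chi2.ppf(1.0 - alpha, b - 1)
    while used < N:                                                        # PS5
        divs = [divergence(full[k], np.histogram(X[S, k], bins=edges[k])[0])
                for k in active]                                           # PS4
        if not any(d > thresh for d in divs):      # every active feature passes
            break
        used = min(N, used + dn)                   # get Delta_X from XN - Xn
        S = perm[:used]
    U_n, V = lfcm(X[S], m=m)                                               # PS6
    # PS7: extend memberships to all of XN (i.e., build [Un | UN-n]) using
    # the standard FCM membership formula with the terminal prototypes V
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    d = np.fmax(d, np.finfo(float).eps)
    U_N = 1.0 / (d ** (2.0 / (m - 1.0)) * (d ** (-2.0 / (m - 1.0))).sum(axis=1, keepdims=True))
    return U_N, V
```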
Data XL (loadable) : N = 100,000 draws from a mixture of c = 2 bivariate normals

    priors :        p1 = p2 = 0.5
    means :         µ1 = (0, 0)ᵀ ,  µ2 = (10, 0)ᵀ
    covariances :   Σ1 = Σ2 = I2

                      LFCM & geFFCM            geFFCM only
    Algorithm         c = m = 2                n = 1000 samples
    parameters        ε = 0.00001              ∆n = 1000 samples
                      MaxIt = 1000             bk = b = 100 bins
                      2-norm for Jm            αk = α ∈ {0.90, 0.95}

    Termination       ||U^(k+1) − U^(k)|| = max_{i,j} | U_ij^(k+1) − U_ij^(k) | < ε
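For reference, the loadable 2D test set described above can be drawn with a few lines of numpy (a sketch with our own variable names, not the authors' generator):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
labels = rng.integers(0, 2, size=N)                  # equal priors p1 = p2 = 0.5
means = np.array([[0.0, 0.0], [10.0, 0.0]])          # mu1, mu2
X_L = means[labels] + rng.standard_normal((N, 2))    # Sigma1 = Sigma2 = I
```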
Divergence vs χ2 : why we can use either one

[Plot: the test statistic for feature 1 (sampling without replacement) vs sample size |Xn| as a % of |XN|, with the thresholds F⁻¹(0.10) (accept at α = 0.90) and F⁻¹(0.05) (accept at α = 0.95) marked.]

Because the divergence and χ2 curves are basically identical!
Acceptance Strategies

[Plot: div(Xn) for features F1 and F2 (sampling without replacement) vs sample size |Xn| as a % of |XN|, with the thresholds F⁻¹(0.10) and F⁻¹(0.05) and the FA, CA, and SA acceptance points marked.]

    F1 first signals at |Xn| = 3%
    F2 first signals at |Xn| = 6%
    Both first signal at |Xn| = 20%

    FA ≤ CA ≤ SA (always!)
Terminal Prototypes

[Plot: ||V_XN − V_Xn|| vs sample size |Xn| as a % of |XN|, where LFCM(XN) ⇒ V_XN and LFCM(Xn) ⇒ V_Xn, for sampling with and without replacement.]

    Beyond ~30% of |XN| the difference is negligible ("don't care")
    Sampling without replacement is usually better!
Data XL (loadable) : N = 100,000 draws from a mixture of c = 4 five-dimensional normals

    priors :        pk = 0.25 , k = 1, …, 4
    means :         µ1 = (0, 1, 3, 0, 0)ᵀ ,  µ2 = (1, 1, 1, 0, 0)ᵀ ,
                    µ3 = (2, 3, 1, 0, 0)ᵀ ,  µ4 = (3, 3, 3, 0, 0)ᵀ
    covariances :   Σk = σ2·I5 , k = 1, …, 4 , with σ2 ∈ {0.1, 0.5, 1}

                      LFCM & geFFCM              geFFCM only
    Algorithm         Same as the 2D case,       Same as the 2D case, except the
    parameters        except c = 4               study parameter bk = b ∈ {25, 50, 100, 200} bins

    Termination       Same as the 2D case
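The c = 4, 5D mixture can be drawn the same way (continuing the earlier sketch; `rng` and `N` come from that snippet, and σ2 is one of the three study values):

```python
means5 = np.array([[0, 1, 3, 0, 0],
                   [1, 1, 1, 0, 0],
                   [2, 3, 1, 0, 0],
                   [3, 3, 3, 0, 0]], dtype=float)    # mu1 ... mu4
sigma2 = 0.1                                         # also 0.5 and 1.0 in the study
labels = rng.integers(0, 4, size=N)                  # equal priors p_k = 0.25
X_L5 = means5[labels] + np.sqrt(sigma2) * rng.standard_normal((N, 5))
```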
"good" separation
b= 25 50 100 200
α= .90 .95 .90 .95 .90 .95 .90 .95 σ2=0.1
x1 |XS1| 19 21 15 19 12 14 7 10 Typical Result
x2 |XS1| 17 27 14 19 9 14 8 12
25 trials ave.
x3 |XS1| 17 26 16 21 13 15 9 13
x4 |XS1| 13 22 12 17 13 19 10 12 |Xn| of %|XN|
x5 |XS1| 19 27 18 23 13 18 11 13
FA Trend Studies
3 6 4 6 3 5 3 4
|XS2|
time, secs .14 .15 .16 .17 .18 .19 .20 .21 %|XN| vs b
CA
35 47 30 35 24 29 17 22 %|XN| vs SS
|XS3|
time, secs .17 .19 .20 .21 .22 .25 .26 .29 %|XN| vs α
SA
44 54 35 40 27 33 21 26 %|XN| vs cpu time
|XS4|
time, secs .21 .23 .22 .24 .25 .26 .29 .31
SS vs σ2
3/15/06 geFFCM 17
Average Trends

    Sample size vs b : as b goes from 25 to 200, the average %|XN| falls by more than half
        b :        25     50     100    200
        %|XN| :    25.5   19.8   16.4   12.3

    Sample size vs α : raising α from .90 to .95 increases %|XN| by about 5%

    Sample size vs cpu time : cpu time grows almost linearly with b, but not with %|XN|
Average Trends

    Sample size (%|XN|) vs [SS vs σ2]

        Strategy     σ2 = 0.1    σ2 = 0.5    σ2 = 1.0
        A              15.7        15.7        15.5
        FA              4.3         3.8         3.4
        CA             29.9        30.6        30.8
        SA             35.0        36.0        36.0
        col. ave.      21.4        21.5        21.2

    Separation (σ2) has a negligible effect on %|XN| for each SS
    Separation (σ2) has a negligible effect on %|XN| over all SS
Approximation and Acceleration Measures (decreasing separation, left to right)

                        σ2 = 0.1             σ2 = 0.5             σ2 = 1.0
                    geFFCM    LFCM       geFFCM    LFCM       geFFCM    LFCM
  SS2     tacc        5.21     -          12.44     -          28.41     -
  FA=1    %Etr        0.03     0.03        8.90     8.85       22.40    22.18
          %Eapp       0.00     -           0.95     -           3.15     -
          EV          0.00     -           0.01     -           0.03     -
  SS3     tacc        2.61     -           3.59     -           4.34     -
  CA=5    %Etr        0.03     0.03        8.86     8.85       22.19    22.18
          %Eapp       0.00     -           0.23     -           0.78     -
          EV          0.00     -           0.00     -           0.00     -
  SS4     tacc        2.30     -           2.97     -           3.41     -
  SA=5    %Etr        0.03     0.03        8.85     8.85       22.19    22.18
          %Eapp       0.00     -           0.20     -           0.69     -
          EV          0.00     -           0.00     -           0.00     -
Average Trends in Acceleration

    Acceleration vs σ2 : as separation decreases (increasing σ2), Tacc increases
        FA : Tacc up 545% from σ2 = 0.10 to 1.00
        SA : Tacc up 48% from σ2 = 0.10 to 1.00

    Acceleration vs SS : Tacc decreases as (FA → CA → SA)
        -- a difference of 833% at σ2 = 1 !!!
Average Trends in Approximation

    Approximation error vs [SS and σ2]
        Eapp decreases as (FA → CA → SA)
        Eapp increases as separation decreases

    Training error vs [SS and σ2]
        Etr increases as separation decreases

    Prototype error EV vs [SS and σ2]
        EV is ~ 0 in all cases!
Probabilistic Clustering with geFEM : typical result (10-trial averages) for "good" separation (σ2 = 0.1)

                                     geFFCM    LFCM    geFEM    LEM
  SS2     Acceleration tacc            5.09      1      8.71      1
  FA=1    % Training error Etr         0.03     0.03    0.03     0.03
          % Approx. error Eapp         0.00      0      0.01      0
          Prototype error EV           0.01      0      0.01      0
  SS3     Acceleration tacc            2.56      1      2.98      1
  CA=5    % Training error Etr         0.03     0.03    0.03     0.03
          % Approx. error Eapp         0.00      0      0.00      0
          Prototype error EV           0.00      0      0.00      0
  SS4     Acceleration tacc            2.29      1      2.61      1
  SA=5    % Training error Etr         0.03     0.03    0.03     0.03
          % Approx. error Eapp         0.20      0      0.20      0
          Prototype error EV           0.00      0      0.00      0

    Both accelerate their literal counterparts very well
    Both estimate the true labels with high accuracy
    Both approximate their literal counterparts very well
Does |XS| as a % of |X| grow with |X|? NO -- it shrinks!

|Xn| = 100,000    |XN| = 1,600,000    25-trial averages    α = 0.95    σ2 = 0.50

                       b = 25       b = 50       b = 100      b = 200
              X =     Xn    XN     Xn    XN     Xn    XN     Xn    XN
  A   x1 |XS1|%        27    15     25    12     20     8     13     9
      x2 |XS1|%        24    12     17    10     14     9     11     5
      x3 |XS1|%        25    17     18    11     16     9     12     6
      x4 |XS1|%        25    15     21    13     16     8      9     8
      x5 |XS1|%        27    13     20    14     15    11     11     9
  FA  |XS2|%            5     1      5     1      5     1      3     1
      time, secs      .15   1.9    .18   2.2    .19   2.4    .21   2.7
  CA  |XS3|%           48    35     37    31     29    21     21    17
      time, secs      .19   2.4    .27   2.8    .25   3.1    .28   3.5
  SA  |XS4|%           54    52     42    40     33    29     24    23
      time, secs      .22   3.3    .26   3.5    .26   3.6    .28   4.1

    %|XS| is smaller for the BIG data set XN in ALL cells
    Average cpu time grows with |X| : 0.23 secs for Xn vs 2.96 secs for XN
Elastic control of n = |Xn| to reduce sample size

Recall that for each active feature k we compute

    Dk = n · [ Σ_{i=1}^{bk} ( N_i^k/N − n_i^k/n ) · log( n·N_i^k / (N·n_i^k) ) ]

Choose a target sample size n* and define

    D*k = min{n, n*} · [ Σ_{i=1}^{bk} ( N_i^k/N − n_i^k/n ) · log( n·N_i^k / (N·n_i^k) ) ]

2 cases :

    min{n, n*} = n* ⇒ termination at n* < n : D* prevents oversampling of XN
    min{n, n*} = n  ⇒ termination at n < n* : D is satisfied by a sample smaller than n*
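In code, D* only rescales the bracketed sum, so it can be sketched on top of the earlier hypothetical `divergence` helper:

```python
def elastic_divergence(full_counts, sample_counts, n_star):
    """D*_k : the same per-bin sum as D_k, but scaled by min(n, n*) instead of n."""
    n = sample_counts.sum()
    bracket = divergence(full_counts, sample_counts) / n   # the bracketed sum alone
    return min(n, n_star) * bracket
```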
|XN| = 1,600,000    25-trial averages    α = 0.95    σ2 = 0.50    n* = 20,000 = 1.25% of |XN|

                       b = 25        b = 50        b = 100       b = 200
                       D     D*      D     D*      D     D*      D     D*
  A   x1 |XS1|%        15      2     12      2      8      1      9      1
      x2 |XS1|%        12      2     10      1      9      1      5      1
      x3 |XS1|%        17      2     11      1      9      1      6      1
      x4 |XS1|%        15      2     13      1      8      1      8      1
      x5 |XS1|%        13      2     14      1     11      1      9      1
  FA  |XS2|%            1      1      1      0      1      1      1      1
      time, secs     1.95   1.94   2.19   2.18   2.42   2.40   2.67   2.63
  CA  |XS3|%           35      3     31      2     21      2     17      2
      time, secs     2.44   1.97   2.79   2.22   3.10   2.45   3.55   2.72
  SA  |XS4|%           52      3     40      2     29      2     23      2
      time, secs     3.28   1.99   3.53   2.23   3.63   2.44   4.07   2.65

    With D, SA at b = 25 needs 52% of N = 832,000 samples; with D*, only 3% of N = 48,000 samples
    The control is elastic : the D* sample can exceed the 1.25% target when needed
832,000 samples (D) vs 48,000 samples (D*) : 5 times faster, same accuracy (same error rates)

  LFCM vs approximations          b = 25 bins        b = 200 bins          LFCM
                                    D      D*          D      D*          D      D*
  SS2     Acceleration tacc       13.09   12.55      10.00    9.03        1      1
  FA=1    % Training error Etr     8.89    8.89       8.87    8.87       8.83   8.83
          % Approx. error Eapp     0.83    0.85       0.68    0.70        0      0
          Prototype error EV       0.01    0.01       0.01    0.01        0      0
  SS3     Acceleration tacc        2.80    9.70       3.89    8.24        1      1
  CA=5    % Training error Etr     8.83    8.83       8.83    8.83       8.83   8.83
          % Approx. error Eapp     0.04    0.21       0.07    0.28        0      0
          Prototype error EV       0.00    0.00       0.00    0.00        0      0
  SS4     Acceleration tacc        1.84    9.51       3.29    8.10        1      1
  SA=5    % Training error Etr     8.83    8.83       8.83    8.83       8.83   8.83
          % Approx. error Eapp     0.03    0.21       0.06    0.28        0      0
          Prototype error EV       0.00    0.00       0.00    0.00        0      0
b = 100 bins    25-trial averages    α = 0.95    SS3 = CA    n* = 20,000    σ2 = 0.50

  |XN|                      100,000         200,000         400,000         800,000
                            D      D*       D      D*       D      D*       D       D*
  |Xn| as % of |XN|         28     23       26     13       24      7       21       4
  geFFCM time, secs       3.28   2.81     6.55   3.79    12.75   5.19    23.29    7.93
  LFCM time, secs        10.87  10.84    21.58  21.54    42.93  42.72    85.91   85.98
  Acceleration tacc       3.40   3.89     3.58   5.69     3.62   8.24     4.02   10.84

    tacc with D  : 3.40 < 3.58 < 3.62 < 4.02
    tacc with D* : 3.89 < 5.69 < 8.24 < 10.84

    ∴ the advantage of using D* grows as |XN| grows
Empirical Conclusions

    tacc : Acceleration is highest when SS is minimal (FA), and it grows as separation decreases

    Etr : Training errors are comparable to LFCM for all 4 SS's

    Eapp : Approximation errors are smallest for the stringent SS's (CA and SA)

    PS : Progressive sampling is a very general scheme, easily adaptable for
         extending many algorithms to VL data
Yet to do : geFFCM

    Process VL data : geFFCM was designed for VL data, but so far no real
    tests have been made -- it should work OK!

    Try simpler termination measures (e.g., Euclidean ||·||)

    For real VL data, compare to simple random sampling
Thanks mates !

G’Day
