
Approximate Clustering in Very Large Object Data

Rick Hathaway

Jim Bezdek

geFFCM

geFEM
The geFFCM Algorithm

Input
    XN : a VL object data set

Pick
    IA : active feature index set
    SM : selection method in {A, FA, CA, SA}
    {bk} : number of EC histogram intervals, k ∈ IA
    {αk} : termination thresholds, k ∈ IA
    p : initial sample size, as a % of N
    ∆p : percentage increment

Compute
    n = ⌈pN/100⌉ : initial number of samples
    n/b : initial number of samples per bin
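As a concrete (and purely illustrative) reading of these inputs, here is a minimal Python sketch; the class and function names are ours, not part of geFFCM:

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class GeFFCMInputs:
    """Hypothetical container for the quantities picked above."""
    active_features: list            # IA, the active feature index set
    selection_method: str = "SA"     # SM, one of {"A", "FA", "CA", "SA"}
    bins: int = 100                  # bk, # of EC histogram intervals per active feature
    alpha: float = 0.95              # alpha_k, termination threshold per active feature
    p: float = 1.0                   # initial sample size, as a % of N
    delta_p: float = 1.0             # sample-size increment, as a % of N

def initial_sample_size(N: int, p: float) -> int:
    """n = ceil(p*N/100); roughly n/b sample values then land in each EC bin."""
    return ceil(p * N / 100.0)
```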
PS1 Randomly choose (without replacement)* n vectors Xn ⊂ XN

PS2 Choose a selection strategy for the active features (define IA)

    S1 = A = Single Accept : a specified feature passes
    S2 = FA = First Accept : any single feature passes
    S3 = CA = Cumulative Accept : each feature in a specified set either did or does pass
    S4 = SA = Simultaneous Accept : all features in a specified set pass simultaneously
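The four strategies differ only in how the per-feature test results are combined into an accept/reject decision. A hedged sketch of that logic (the `passes`/`history` bookkeeping and the function name are our assumptions):

```python
def accept_sample(passes: dict, history: dict, method: str, targets: list) -> bool:
    """Combine per-feature divergence-test results into an accept/reject decision.

    passes  : {feature k: True/False} -- did k pass the test on THIS round?
    history : {feature k: True/False} -- did k pass on ANY earlier round?
    targets : the specified feature(s) the strategy refers to
    """
    if method == "A":     # S1, Single Accept: the one specified feature passes
        return passes[targets[0]]
    if method == "FA":    # S2, First Accept: any single feature passes
        return any(passes.values())
    if method == "CA":    # S3, Cumulative Accept: each target feature did or does pass
        return all(history[k] or passes[k] for k in targets)
    if method == "SA":    # S4, Simultaneous Accept: all target features pass at once
        return all(passes[k] for k in targets)
    raise ValueError(f"unknown selection method: {method}")
```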
The single accept (S1=A) selection strategy for active feature k

[Flow diagram: draw the initial sample Xn from the VL data XN; build the EC bins for feature k from {x_ik} and test {x_ik}. While the test fails, get an increment ∆X and test {x_ik + ∆x_ik}; once the test passes, run LFCM on all p features of the accepted sample.]
The first accept (S2=FA) selection strategy for any single feature

[Flow diagram: draw the initial sample Xn from the VL data XN; for each feature k, build the EC bins from {x_ik} and test {x_ik}. If all k tests fail, reject the current sample, get an increment ∆X, and test {x_ik + ∆x_ik} for each k; as soon as any one feature passes (feature 2 in the figure), accept the current sample and run LFCM on all p features of the accepted sample.]
The cumulative accept (S3=CA) selection strategy for features (j,k)

[Flow diagram: draw the initial sample Xn from the VL data XN; build the EC bins for features j and k and test {x_ij} and {x_ik}. After each increment ∆X, test {x_ij + ∆x_ij} and {x_ik + ∆x_ik}; on one round j may fail while k passes, and on a later round j may pass while k fails. Once each feature either did pass or does pass, accept the current sample and run LFCM on all p features of the accepted sample.]
The simultaneous accept (S4=SA) selection strategy for features (j,k)

[Flow diagram: as above, draw the initial sample Xn from the VL data XN, build the EC bins for features j and k, and test {x_ij} and {x_ik} after each increment ∆X. Rounds in which j fails while k passes, or j passes while k fails, do not end the sampling; only when both features pass at the same time is the current sample accepted, and LFCM is then run on all p features of the accepted sample.]
PS3 For each active feature k

    Sort the initial sample {x_1^k, …, x_n^k}

    Construct b EC bins from the order statistics:

        (−∞, x_(n/b)^k] , (x_(n/b)^k, x_(2n/b)^k] , … , (x_((b−1)n/b)^k, ∞)

    Many narrow bins in dense data areas
    Fewer (wider) bins in sparse data areas
    Endpoints vary from feature to feature
    Endpoints depend only on the original n-sample
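A minimal numpy sketch of this bin construction, assuming the initial sample of one active feature is held in a 1-D array (the function name is ours):

```python
import numpy as np

def ec_bin_edges(sample_k: np.ndarray, b: int) -> np.ndarray:
    """Equal-content (EC) bin edges for one active feature, from order statistics.

    Returns the b+1 edges  -inf, x_(n/b), x_(2n/b), ..., x_((b-1)n/b), +inf,
    so each bin holds roughly n/b of the initial sample values.
    """
    x = np.sort(sample_k)                 # order statistics x_(1) <= ... <= x_(n)
    n = x.size
    idx = (np.arange(1, b) * n) // b      # interior order-statistic indices (1-based)
    return np.concatenate(([-np.inf], x[idx - 1], [np.inf]))
```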
What do (EC) histogram bins look like?

n = 500 observations (e.g., active feature values of 500 columns of XN)

b = 5 bins ⇒ each bin contains 100 values = 20% of the sample

Bin endpoints, determined by order statistics:

    −∞   x_(101)   x_(201)   x_(301)   x_(401)   +∞
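Continuing the sketch above for this n = 500, b = 5 example (`ec_bin_edges` is the hypothetical helper defined earlier):

```python
rng = np.random.default_rng(0)
edges = ec_bin_edges(rng.normal(size=500), b=5)   # 6 edges; about 100 values per bin
```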
PS4 For each active feature k ∈ IA
    For each bin i = 1 to bk

        Get the full-data count for bin i : N_i^k
        Get the sample count for bin i : n_i^k

    Compute the divergence*

        div_k = n · Σ_{i=1}^{bk} ( N_i^k/N − n_i^k/n ) · log( n·N_i^k / (N·n_i^k) )

* Sampling without replacement ⇒ div is not approximately χ2; here div measures
  goodness of fit (i.e., it is not a hypothesis test statistic)
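A direct translation of this divergence into numpy, assuming the full-data and sample bin counts are already available (the function and argument names are ours):

```python
import numpy as np

def divergence(full_counts: np.ndarray, sample_counts: np.ndarray) -> float:
    """div_k = n * sum_i (N_i/N - n_i/n) * log( n*N_i / (N*n_i) ) over the EC bins.

    full_counts   : N_i^k, counts of all of XN in the EC bins of feature k
    sample_counts : n_i^k, counts of the current sample Xn in the same bins
    """
    N = full_counts.sum()
    n = sample_counts.sum()
    p_full = full_counts / N
    p_samp = sample_counts / n
    # guard empty bins; the slide's formula implicitly assumes N_i, n_i > 0
    ok = (full_counts > 0) & (sample_counts > 0)
    terms = (p_full[ok] - p_samp[ok]) * np.log(p_full[ok] / p_samp[ok])
    return float(n * terms.sum())
```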
PS5 WHILE ( ∃ k ∈ IA : div_k > F⁻¹(1 − αk) )        (F = cdf of χ2 with bk − 1 degrees of freedom)

    ∆n = min{ N − n , ⌈∆p·N/100⌉ }
    Get ∆X_∆n from XN − Xn
    Xn = Xn ∪ ∆X_∆n
    Return to PS4

PS6 Run LFCM on Xn to get (Un, Vn)

PS7 Extend the fuzzy partition : Un → [Un | UN−n]
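Putting PS1-PS7 together, here is a minimal sketch of the progressive-sampling loop. It assumes a literal FCM routine `lfcm(X, m)` returning `(U, V)` is available, reuses the hypothetical `ec_bin_edges` and `divergence` helpers from the earlier sketches, and applies the PS5 condition exactly as written (augment until every active feature passes); it is not the authors' implementation.

```python
import numpy as np
from scipy.stats import chi2

def geffcm_sketch(X, active, b, alpha, p, delta_p, lfcm, m=2.0, rng=None):
    """Progressive sampling (PS1-PS5), LFCM on the accepted sample (PS6), extension (PS7)."""
    rng = rng or np.random.default_rng()
    N = X.shape[0]
    used = int(np.ceil(p * N / 100.0))             # initial n
    dn = int(np.ceil(delta_p * N / 100.0))         # increment Delta_n
    perm = rng.permutation(N)                      # PS1: sampling without replacement
    S = perm[:used]
    edges = {k: ec_bin_edges(X[S, k], b) for k in active}                  # PS3
    full = {k: np.histogram(X[:, k], bins=edges[k])[0] for k in active}
    thresh = chi2.ppf(1.0 - alpha, b - 1)
    while used < N:                                                        # PS5
        divs = [divergence(full[k], np.histogram(X[S, k], bins=edges[k])[0])
                for k in active]                                           # PS4
        if not any(d > thresh for d in divs):      # every active feature passes
            break
        used = min(N, used + dn)                   # get Delta_X from XN - Xn
        S = perm[:used]
    U_n, V = lfcm(X[S], m=m)                                               # PS6
    # PS7: extend memberships to all of XN (i.e., build [Un | UN-n]) using
    # the standard FCM membership formula with the terminal prototypes V
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
    d = np.fmax(d, np.finfo(float).eps)
    U_N = 1.0 / (d ** (2.0 / (m - 1.0)) * (d ** (-2.0 / (m - 1.0))).sum(axis=1, keepdims=True))
    return U_N, V
```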
Data XL (loadable) : N = 100,000 draws from a mixture of c = 2 bivariate normals

    priors :        p1 = p2 = 0.5
    means :         µ1 = (0, 0)ᵀ ,  µ2 = (10, 0)ᵀ
    covariances :   Σ1 = Σ2 = I2

                      LFCM & geFFCM            geFFCM only
    Algorithm         c = m = 2                n = 1000 samples
    parameters        ε = 0.00001              ∆n = 1000 samples
                      MaxIt = 1000             bk = b = 100 bins
                      2-norm for Jm            αk = α ∈ {0.90, 0.95}

    Termination       ||U^(k+1) − U^(k)|| = max_{i,j} | U_ij^(k+1) − U_ij^(k) | < ε
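For reference, the loadable 2D test set described above can be drawn with a few lines of numpy (a sketch with our own variable names, not the authors' generator):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
labels = rng.integers(0, 2, size=N)                  # equal priors p1 = p2 = 0.5
means = np.array([[0.0, 0.0], [10.0, 0.0]])          # mu1, mu2
X_L = means[labels] + rng.standard_normal((N, 2))    # Sigma1 = Sigma2 = I
```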
Divergence vs χ2 : why we can use either one

[Plot: the test statistic for feature 1 (sampling without replacement) vs sample size |Xn| as a % of |XN|, with the thresholds F⁻¹(0.10) (accept at α = 0.90) and F⁻¹(0.05) (accept at α = 0.95) marked.]

Because the divergence and χ2 curves are basically identical!
Acceptance Strategies

[Plot: div(Xn) for features F1 and F2 (sampling without replacement) vs sample size |Xn| as a % of |XN|, with the thresholds F⁻¹(0.10) and F⁻¹(0.05) and the FA, CA, and SA acceptance points marked.]

    F1 first signals at |Xn| = 3%
    F2 first signals at |Xn| = 6%
    Both first signal at |Xn| = 20%

    FA ≤ CA ≤ SA (always!)
Terminal Prototypes

[Plot: ||V_XN − V_Xn|| vs sample size |Xn| as a % of |XN|, where LFCM(XN) ⇒ V_XN and LFCM(Xn) ⇒ V_Xn, for sampling with and without replacement.]

    Beyond ~30% of |XN| the difference is negligible ("don't care")
    Sampling without replacement is usually better!
Data XL (loadable) : N = 100,000 draws from a mixture of c = 4 five-dimensional normals

    priors :        pk = 0.25 , k = 1, …, 4
    means :         µ1 = (0, 1, 3, 0, 0)ᵀ ,  µ2 = (1, 1, 1, 0, 0)ᵀ ,
                    µ3 = (2, 3, 1, 0, 0)ᵀ ,  µ4 = (3, 3, 3, 0, 0)ᵀ
    covariances :   Σk = σ2·I5 , k = 1, …, 4 , with σ2 ∈ {0.1, 0.5, 1}

                      LFCM & geFFCM              geFFCM only
    Algorithm         Same as the 2D case,       Same as the 2D case, except the
    parameters        except c = 4               study parameter bk = b ∈ {25, 50, 100, 200} bins

    Termination       Same as the 2D case
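The c = 4, 5D mixture can be drawn the same way (continuing the earlier sketch; `rng` and `N` come from that snippet, and σ2 is one of the three study values):

```python
means5 = np.array([[0, 1, 3, 0, 0],
                   [1, 1, 1, 0, 0],
                   [2, 3, 1, 0, 0],
                   [3, 3, 3, 0, 0]], dtype=float)    # mu1 ... mu4
sigma2 = 0.1                                         # also 0.5 and 1.0 in the study
labels = rng.integers(0, 4, size=N)                  # equal priors p_k = 0.25
X_L5 = means5[labels] + np.sqrt(sigma2) * rng.standard_normal((N, 5))
```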
"good" separation
b= 25 50 100 200
α= .90 .95 .90 .95 .90 .95 .90 .95 σ2=0.1
x1 |XS1| 19 21 15 19 12 14 7 10 Typical Result
x2 |XS1| 17 27 14 19 9 14 8 12
25 trials ave.
x3 |XS1| 17 26 16 21 13 15 9 13
x4 |XS1| 13 22 12 17 13 19 10 12 |Xn| of %|XN|
x5 |XS1| 19 27 18 23 13 18 11 13
FA Trend Studies
3 6 4 6 3 5 3 4
|XS2|
time, secs .14 .15 .16 .17 .18 .19 .20 .21 %|XN| vs b
CA
35 47 30 35 24 29 17 22 %|XN| vs SS
|XS3|
time, secs .17 .19 .20 .21 .22 .25 .26 .29 %|XN| vs α
SA
44 54 35 40 27 33 21 26 %|XN| vs cpu time
|XS4|
time, secs .21 .23 .22 .24 .25 .26 .29 .31
SS vs σ2
3/15/06 geFFCM 17
Average Trends

    Sample size vs b : as b goes from 25 to 200, the average %|XN| falls by more than half
        b :        25     50     100    200
        %|XN| :    25.5   19.8   16.4   12.3

    Sample size vs α : raising α from .90 to .95 increases %|XN| by about 5%

    Sample size vs cpu time : cpu time grows almost linearly with b, but not with %|XN|
Average Trends

    Sample size (%|XN|) vs [SS vs σ2]

        Strategy     σ2 = 0.1    σ2 = 0.5    σ2 = 1.0
        A              15.7        15.7        15.5
        FA              4.3         3.8         3.4
        CA             29.9        30.6        30.8
        SA             35.0        36.0        36.0
        col. ave.      21.4        21.5        21.2

    Separation (σ2) has a negligible effect on %|XN| for each SS
    Separation (σ2) has a negligible effect on %|XN| over all SS
Approximation and Acceleration Measures (decreasing separation, left to right)

                        σ2 = 0.1             σ2 = 0.5             σ2 = 1.0
                    geFFCM    LFCM       geFFCM    LFCM       geFFCM    LFCM
  SS2     tacc        5.21     -          12.44     -          28.41     -
  FA=1    %Etr        0.03     0.03        8.90     8.85       22.40    22.18
          %Eapp       0.00     -           0.95     -           3.15     -
          EV          0.00     -           0.01     -           0.03     -
  SS3     tacc        2.61     -           3.59     -           4.34     -
  CA=5    %Etr        0.03     0.03        8.86     8.85       22.19    22.18
          %Eapp       0.00     -           0.23     -           0.78     -
          EV          0.00     -           0.00     -           0.00     -
  SS4     tacc        2.30     -           2.97     -           3.41     -
  SA=5    %Etr        0.03     0.03        8.85     8.85       22.19    22.18
          %Eapp       0.00     -           0.20     -           0.69     -
          EV          0.00     -           0.00     -           0.00     -
Average Trends in Acceleration

    Acceleration vs σ2 : as separation decreases (increasing σ2), Tacc increases
        FA : Tacc up 545% from σ2 = 0.10 to 1.00
        SA : Tacc up 48% from σ2 = 0.10 to 1.00

    Acceleration vs SS : Tacc decreases as (FA → CA → SA)
        -- a difference of 833% at σ2 = 1 !!!
Average Trends in Approximation

    Approximation error vs [SS and σ2]
        Eapp decreases as (FA → CA → SA)
        Eapp increases as separation decreases

    Training error vs [SS and σ2]
        Etr increases as separation decreases

    Prototype error EV vs [SS and σ2]
        EV is ~ 0 in all cases!
Probabilistic Clustering with geFEM : typical result (10-trial averages) for "good" separation (σ2 = 0.1)

                                     geFFCM    LFCM    geFEM    LEM
  SS2     Acceleration tacc            5.09      1      8.71      1
  FA=1    % Training error Etr         0.03     0.03    0.03     0.03
          % Approx. error Eapp         0.00      0      0.01      0
          Prototype error EV           0.01      0      0.01      0
  SS3     Acceleration tacc            2.56      1      2.98      1
  CA=5    % Training error Etr         0.03     0.03    0.03     0.03
          % Approx. error Eapp         0.00      0      0.00      0
          Prototype error EV           0.00      0      0.00      0
  SS4     Acceleration tacc            2.29      1      2.61      1
  SA=5    % Training error Etr         0.03     0.03    0.03     0.03
          % Approx. error Eapp         0.20      0      0.20      0
          Prototype error EV           0.00      0      0.00      0

    Both accelerate their literal counterparts very well
    Both estimate the true labels with high accuracy
    Both approximate their literal counterparts very well
Does |XS| as a % of |X| grow with |X|? NO -- it shrinks!

|Xn| = 100,000    |XN| = 1,600,000    25-trial averages    α = 0.95    σ2 = 0.50

                       b = 25       b = 50       b = 100      b = 200
              X =     Xn    XN     Xn    XN     Xn    XN     Xn    XN
  A   x1 |XS1|%        27    15     25    12     20     8     13     9
      x2 |XS1|%        24    12     17    10     14     9     11     5
      x3 |XS1|%        25    17     18    11     16     9     12     6
      x4 |XS1|%        25    15     21    13     16     8      9     8
      x5 |XS1|%        27    13     20    14     15    11     11     9
  FA  |XS2|%            5     1      5     1      5     1      3     1
      time, secs      .15   1.9    .18   2.2    .19   2.4    .21   2.7
  CA  |XS3|%           48    35     37    31     29    21     21    17
      time, secs      .19   2.4    .27   2.8    .25   3.1    .28   3.5
  SA  |XS4|%           54    52     42    40     33    29     24    23
      time, secs      .22   3.3    .26   3.5    .26   3.6    .28   4.1

    %|XS| is smaller for the BIG data set XN in ALL cells
    Average cpu time grows with |X| : 0.23 secs for Xn vs 2.96 secs for XN
Elastic control of n = |Xn| to reduce sample size

Recall that for each active feature k we compute

    Dk = n · [ Σ_{i=1}^{bk} ( N_i^k/N − n_i^k/n ) · log( n·N_i^k / (N·n_i^k) ) ]

Choose a target sample size n* and define

    D*k = min{n, n*} · [ Σ_{i=1}^{bk} ( N_i^k/N − n_i^k/n ) · log( n·N_i^k / (N·n_i^k) ) ]

2 cases :

    min{n, n*} = n* ⇒ termination at n* < n : D* prevents oversampling of XN
    min{n, n*} = n  ⇒ termination at n < n* : D is satisfied by a sample smaller than n*
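In code, D* only rescales the bracketed sum, so it can be sketched on top of the earlier hypothetical `divergence` helper:

```python
def elastic_divergence(full_counts, sample_counts, n_star):
    """D*_k : the same per-bin sum as D_k, but scaled by min(n, n*) instead of n."""
    n = sample_counts.sum()
    bracket = divergence(full_counts, sample_counts) / n   # the bracketed sum alone
    return min(n, n_star) * bracket
```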
|XN| = 1,600,000    25-trial averages    α = 0.95    σ2 = 0.50    n* = 20,000 = 1.25% of |XN|

                       b = 25        b = 50        b = 100       b = 200
                       D     D*      D     D*      D     D*      D     D*
  A   x1 |XS1|%        15      2     12      2      8      1      9      1
      x2 |XS1|%        12      2     10      1      9      1      5      1
      x3 |XS1|%        17      2     11      1      9      1      6      1
      x4 |XS1|%        15      2     13      1      8      1      8      1
      x5 |XS1|%        13      2     14      1     11      1      9      1
  FA  |XS2|%            1      1      1      0      1      1      1      1
      time, secs     1.95   1.94   2.19   2.18   2.42   2.40   2.67   2.63
  CA  |XS3|%           35      3     31      2     21      2     17      2
      time, secs     2.44   1.97   2.79   2.22   3.10   2.45   3.55   2.72
  SA  |XS4|%           52      3     40      2     29      2     23      2
      time, secs     3.28   1.99   3.53   2.23   3.63   2.44   4.07   2.65

    With D, SA at b = 25 needs 52% of N = 832,000 samples; with D*, only 3% of N = 48,000 samples
    The control is elastic : the D* sample can exceed the 1.25% target when needed
832,000 samples (D) vs 48,000 samples (D*) : 5 times faster, same accuracy (same error rates)

  LFCM vs approximations          b = 25 bins        b = 200 bins          LFCM
                                    D      D*          D      D*          D      D*
  SS2     Acceleration tacc       13.09   12.55      10.00    9.03        1      1
  FA=1    % Training error Etr     8.89    8.89       8.87    8.87       8.83   8.83
          % Approx. error Eapp     0.83    0.85       0.68    0.70        0      0
          Prototype error EV       0.01    0.01       0.01    0.01        0      0
  SS3     Acceleration tacc        2.80    9.70       3.89    8.24        1      1
  CA=5    % Training error Etr     8.83    8.83       8.83    8.83       8.83   8.83
          % Approx. error Eapp     0.04    0.21       0.07    0.28        0      0
          Prototype error EV       0.00    0.00       0.00    0.00        0      0
  SS4     Acceleration tacc        1.84    9.51       3.29    8.10        1      1
  SA=5    % Training error Etr     8.83    8.83       8.83    8.83       8.83   8.83
          % Approx. error Eapp     0.03    0.21       0.06    0.28        0      0
          Prototype error EV       0.00    0.00       0.00    0.00        0      0
b = 100 bins    25-trial averages    α = 0.95    SS3 = CA    n* = 20,000    σ2 = 0.50

  |XN|                      100,000         200,000         400,000         800,000
                            D      D*       D      D*       D      D*       D       D*
  |Xn| as % of |XN|         28     23       26     13       24      7       21       4
  geFFCM time, secs       3.28   2.81     6.55   3.79    12.75   5.19    23.29    7.93
  LFCM time, secs        10.87  10.84    21.58  21.54    42.93  42.72    85.91   85.98
  Acceleration tacc       3.40   3.89     3.58   5.69     3.62   8.24     4.02   10.84

    tacc with D  : 3.40 < 3.58 < 3.62 < 4.02
    tacc with D* : 3.89 < 5.69 < 8.24 < 10.84

    ∴ the advantage of using D* grows as |XN| grows
Empirical Conclusions

    tacc : Acceleration is highest when SS is minimal (FA), and it grows as separation decreases

    Etr : Training errors are comparable to LFCM for all 4 SS's

    Eapp : Approximation errors are smallest for the stringent SS's (CA and SA)

    PS : Progressive sampling is a very general scheme, easily adaptable for
         extending many algorithms to VL data
Yet to do : geFFCM

    Process VL data : geFFCM was designed for VL data, but so far no real
    tests have been made -- it should work OK!

    Try simpler termination measures (e.g., Euclidean ||·||)

    For real VL data, compare to simple random sampling
Thanks mates !

G’Day
