
Information Sciences 181 (2011) 2964–2988


Learning latent variable models from distributed and abstracted data


Xiaofeng Zhang a,*, William K. Cheung b, C.H. Li b
a Harbin Institute of Technology, School of Computer Science and Technology, Shenzhen Graduate School
b Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
* Corresponding author. E-mail addresses: xfzhang@comp.hkbu.edu.hk (X. Zhang), william@comp.hkbu.edu.hk (W.K. Cheung), chli@comp.hkbu.edu.hk (C.H. Li).

Article history:
Received 14 August 2009
Received in revised form 22 November 2010
Accepted 12 February 2011
Available online 8 March 2011

Keywords:
Distributed data mining
Data abstraction
Model-based methods
Gaussian mixture model
Generative topographic mapping

Abstract

Discovering global knowledge from distributed data sources is challenging, where the important issues include the ever-increasing data volume at the highly distributed sources and the general concern on data privacy. Properly abstracting the distributed data with a compact representation which can retain sufficient local details for global knowledge discovery can, in principle, address both the scalability and the data privacy challenges. This calls for the need to develop formal methodologies to support knowledge discovery on abstracted data. In this paper, we propose to abstract distributed data as Gaussian mixture models and learn a family of generative models from the abstracted data using a modified EM algorithm. To demonstrate the effectiveness of the proposed approach, we applied it to learn (a) data cluster models and (b) data manifold models, and evaluated their performance using both synthetic and benchmark data sets, with promising results in terms of both effectiveness and scalability. Also, we have demonstrated that the proposed approach is robust against heterogeneous data distributions over the distributed sources.

© 2011 Published by Elsevier Inc.

0020-0255/$ - see front matter © 2011 Published by Elsevier Inc. doi:10.1016/j.ins.2011.02.007

1. Introduction

While most of the existing effort in the data mining and machine learning community is on developing methodologies for discovering various forms of knowledge from data, the rapid development of distributed and mobile computing has resulted in the trend that many important analysis tasks (e.g., clustering [41,5], topic discovery [6,40]) now have to be performed on data sets which are by design widely distributed across remote data centers or mobile devices. Concerns including bandwidth limitation and data privacy [39,28,19,38] prevent the distributed data from being pooled together for analysis, which makes global knowledge discovery from distributed data challenging. This paper describes an attempt to address the underlying issues using a data abstraction approach. The intuitive idea comes from the observation that in many applications we are only interested in discovering some holistic data patterns (e.g., overall customer segments as data clusters) for which the full details of the distributed data points are not needed. Instead, abstracted versions of the local data sources could be good enough for the global analysis. The compactness resulting from the abstraction leads to a substantial cut in the bandwidth required for data transmission as well as in the complexity of the global data analysis.

1.1. Related work on distributed data analysis

The problem being addressed in this paper is related to a field called distributed data mining (DDM). Most of the existing
DDM methodologies involve two steps – (1) performing local data analysis at the local data sources and (2) combining the
local results to form a global one. For example, a meta-learning process was proposed in [33] for combining a set of locally

learned classifiers (decision trees in particular) to achieve a global classifier. Kargupta et al. [25] proposed a collective data mining approach which assumes distributed data sources to possess non-overlapping sets of features (also known as a vertical data partition [44]), forming an orthogonal basis to be combined to obtain the global analysis result. This method was later applied to learning Bayesian networks for Web log analysis [11]. Some other examples can also be found in [24,7]. Most of these DDM methodologies suffer from the fact that uncontrolled local analysis could result in losing information that is salient to the subsequent global analysis [26]. While one can enable sharing and fusion of intermediate local analysis results to enhance the global analysis accuracy [51], its wide applicability is still limited as some model-specific fusion strategies are typically required.

1.2. Related work on analyzing abstracted data

One simple approach to computing a data abstraction is binning. Data with numerical (or nominal) attributes can have the possible values of the attributes ''binned'' as ranges (or sets) of values. A compact representation can result from independently partitioning the data along each attribute axis according to the boundaries of the ''bins''.1 By considering also the inter-attribute dependency, one could ''group'' nearby data points instead of performing per-attribute binning, and represent each group with only its first and second order statistics. This approach is in fact equivalent to the use of a Gaussian mixture model (GMM) [22] for density estimation, where the GMM parameters form the compact data abstraction. In general, a finer binning scheme can retain more details of the original data. Similarly, using a GMM with more Gaussian components for data abstraction should give a more accurate representation.
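As a toy illustration of these two abstraction styles (a sketch using NumPy and scikit-learn; the data matrix X and all parameter choices are ours, not from the paper), per-attribute binning keeps only bin counts per axis, while a GMM keeps per-component means, covariances and mixing proportions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(1000, 2))     # toy local data set

# (1) per-attribute binning: histogram counts along each axis independently
bin_abstraction = [np.histogram(X[:, j], bins=10) for j in range(X.shape[1])]

# (2) GMM abstraction: component means, covariances and mixing proportions
gmm = GaussianMixture(n_components=5, covariance_type='full').fit(X)
gmm_abstraction = (gmm.weights_, gmm.means_, gmm.covariances_)
```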
Related work on performing analysis on abstracted data is limited. The BIRCH algorithm, proposed by Zhang et al. [48], is probably one of the earliest attempts. It computes a type of data abstraction called a cluster feature tree to achieve an efficient implementation of agglomerative data clustering. In [31,4], GMMs were used for the abstraction. Virtual data were then re-sampled from the GMM so as to make existing data analysis techniques still applicable. However, the computational complexity due to the re-sampling step (say, using Markov chain Monte Carlo) and the subsequent analysis step could be an issue. Also, some researchers, e.g., [21], simply used representative data points as the abstraction in an ad hoc manner.

In the context of privacy preserving data mining, there also exist some recent studies on data sanitization techniques like binning [8] and anonymization [52,27,18,20].

1.3. Our goal and paper organization

In this paper, we take the probabilistic approach and explain how a certain group of latent variable models (LVMs) can be learned from abstracted data in a disciplined manner. In particular, we adopt GMM for the data abstraction, mainly due to its representation flexibility. For the data analysis algorithms, we focus on LVMs, which are powerful by themselves and have been commonly adopted in many applications. While most LVMs can be learned using the Expectation–Maximization (EM) algorithm, the conventional EM algorithm only works on data points. We propose a modified EM algorithm which can learn LVMs from abstracted data without the need for data re-sampling. We applied the new EM algorithm to learn a Gaussian mixture model (for data clustering) and a generative topographic mapping (for manifold discovery) from abstracted data. Given the initial success we reported earlier in [49,50], we provide in this paper first a unified view of the proposed learning-from-abstraction approach together with the theoretical derivation of the modified EM algorithm. In addition to theoretically comparing the computational complexities and communication overheads, we also conducted a series of experiments to evaluate the efficacy of the proposed approach using a number of synthesized and benchmark data sets and under different data distribution scenarios. Using the proposed approach, we demonstrated that the clustering and manifold discovery results obtained are comparable to those obtained without data abstraction. Also, its performance is robust regardless of how the data are distributed over different data sources.

The rest of the paper is organized as follows. Section 2 provides an overview of the approach we propose for learning LVMs from abstracted data. The GMM-based local data abstraction is discussed in Section 2.1. The EM algorithms for learning global GMMs and GTMs directly from the aggregated GMM-based local data abstraction are presented in Section 3. Details about the experiment design and the results demonstrating the efficacy of the proposed approach can be found in Section 4. Section 5 concludes the paper with pointers to future work.

2. Problem formulation

Assume that a complete set of data is distributed over L local data sources. Let $t \in \mathbb{R}^d$ denote a data item, where $d$ is the dimension of $t$ and $\mathbb{R}$ is the set of real numbers, let $|D_l|$ denote the number of data items at the $l$th source, and let $p_{local}(t|\theta_l)$ denote a local probabilistic abstraction of the data subset at the $l$th source, where $\theta_l$ is the corresponding compact set of model parameters. The problem being considered here is: given only the local abstractions collected from the local data sources, how can one learn a global probabilistic model $p_{global}(t|\Phi)$ which characterizes the complete set of data, where $\Phi$ refers to the model parameters of the global model.

1 Other than achieving the compactness objective, related techniques are also used commonly for hiding data details [3,8].

Fig. 1. An illustration of learning a global cluster model given only abstractions of local data sources: (a)–(c) three distributed local data sources; (d)–(f) the local models learned from sources 1–3; (g) the global model (dotted lines) learned based on the local models.

Fig. 1 shows a pictorial illustration of the problem. In this figure, there are three distributed local data sources (Fig. 1(a)–(c)) and the objective is to identify data clusters in the ''global'' sense. As shown in Fig. 1(d)–(f), GMM abstractions [22] are first derived from the local data. By aggregating only the local abstractions, a global cluster model is derived with three final clusters identified. The same analysis result cannot be obtained based on any one of the local data sources alone.

While the global probabilistic model, in principle, can be any type of generative model, we here focus on latent variable models (LVMs) [23] which are known to possess powerful modeling capabilities for data analysis. Given $Z = \{z_1, \ldots, z_M\}$ to be i.i.d. latent variables, the global model can be expressed as

$$p_{global}(t|\Phi) = \sum_{k=1}^{M} p(t|z_k, \phi_k)\, p(z_k) \qquad (1)$$

where $\phi_k$ is the set of model parameters corresponding to $z_k$ and $\Phi = \{\phi_1, \phi_2, \ldots, \phi_M\}$. Gaussian mixture model and generative topographic mapping are two models that satisfy this i.i.d. assumption. If $\{z_1, \ldots, z_M\}$ are not i.i.d. but their sequential occurrence follows the Markov assumption, the global model will be a temporal one, given as

$$p_{global}(\{t_1, \ldots, t_T\}|\Phi) = \sum_{k_1=1}^{M} \cdots \sum_{k_T=1}^{M} p(z_{k_1}) \left( \prod_{p=1}^{T-1} p(t_p|z_{k_p}, \phi_{k_p})\, p(z_{k_{p+1}}|z_{k_p}) \right) p(t_T|z_{k_T}, \phi_{k_T}). \qquad (2)$$

This is essentially equivalent to a discrete hidden Markov model [35]. Further changes in the structural relationship of the latent variables will result in different LVMs.

2.1. Local data abstraction

Assume that the data set at each local source is abstracted as a Gaussian mixture model (GMM). Let $\theta_{lj}$ denote the parameters corresponding to the $j$th GMM component at the $l$th data source (containing the component's mean $\mu_{lj}$ and covariance matrix $\Sigma_{lj}$), and let $\alpha_{jl}$ denote the mixing proportion of the $j$th component in the $l$th local model. The probability density function of the $l$th local model $p_{local}(t|\theta_l)$ with $|C_l|$ components is given as

$$p_{local}(t_i|\theta_l) = \sum_{j=1}^{|C_l|} \alpha_{jl}\, p_j(t_i|\theta_{lj}), \qquad \sum_{j=1}^{|C_l|} \alpha_{jl} = 1 \qquad (3)$$

$$p_j(t_i|\theta_{lj}) = (2\pi)^{-\frac{d}{2}}\, |\Sigma_{lj}|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(t_i - \mu_{lj})^T \Sigma_{lj}^{-1} (t_i - \mu_{lj}) \right\} \qquad (4)$$

A GMM with different numbers of components can be used to represent data at different levels of granularity. In the extreme case, a GMM with only a single component provides the coarsest information about the data set and the level of granularity is the lowest. The information content of the GMM increases as the number of its Gaussian components increases. To compute the hierarchical data abstraction, hierarchical clustering algorithms can first be applied. Then, at each level of the cluster hierarchy, a GMM-based abstraction can be estimated with each data cluster represented by one Gaussian component. The mean vector and the covariance matrix of the data corresponding to each cluster are computed as the parameter estimates of the corresponding Gaussian component. Note that the GMM-based abstraction obtained is only an approximation of the maximum likelihood (ML) estimate. In case better abstraction accuracy is needed, ML estimation can be performed as described in [17]. Fig. 2 illustrates how the hierarchy of local data abstractions is computed based on the agglomerative hierarchical clustering (AGH) algorithm with single-linkage, as shown in Algorithm 1.

Algorithm 1. A single-linkage agglomerative hierarchical clustering

1: Input: Data: n_i, i = 1, . . . , |D|; Clusters: c_i; Cluster tree: clusTree
2: c_i = {n_i} for i = 1 to |D|, clusTree{1} = {c_i, i = 1 to |D|}, nLoop = 1
3: while nLoop < |D| do
4:   mini = nLoop, minj = nLoop + 1
5:   minDis = distance(c_mini, c_minj)
6:   for i = nLoop to |D| do
7:     for j = i + 1 to |D| do
8:       D_{i,j} = distance(c_i, c_j)   {/* single-linkage distance between clusters i and j */}
9:       if D_{i,j} < minDis then
10:        minDis = D_{i,j}
11:        mini = i
12:        minj = j
13:      end if
14:    end for
15:  end for
16:  merge(c_mini, c_minj)   {/* merge the two closest clusters */}
17:  merge(d_mini, d_minj)   {/* merge the corresponding data subsets */}
18:  clusTree{nLoop + 1} = the updated set of clusters
19:  nLoop = nLoop + 1
20: end while
21: Output: The agglomerative hierarchical tree: clusTree

Depending on the data sets, different hierarchical clustering algorithms could be adopted for computing more optimal
abstractions. See Section 4.2.4 for more discussion on this point.
Fig. 2. A hierarchy of data abstractions, D1, . . . , D5, where the ith level of abstraction is acquired by merging the two nearest data subgroups at the (i − 1)th level with finer data details. For instance, the data can be represented by D2 as four sets of mean vectors and covariance matrices to be shared for learning the global LVM model.

Given the hierarchical clustering results obtained based on AGH, one can compute the mean vectors $\mu$ and covariance matrices $\Sigma$ for creating the GMM-based local abstractions at a certain level in an iterative manner. Given that the $i$th and $j$th clusters are to be merged in an AGH iteration, the mean and covariance matrix of the new cluster, indexed by $p$, can be computed as

$$N_p = N_i + N_j$$
$$\mu_p = \frac{N_i \mu_i + N_j \mu_j}{N_p}$$
$$E(t_p t_p^T) = \frac{N_i\, E(t_i t_i^T) + N_j\, E(t_j t_j^T)}{N_p}$$
$$\Sigma_p = E(t_p t_p^T) - \mu_p \mu_p^T$$

where $\{t_i\}$ and $\{t_j\}$ correspond to the data subsets under the $i$th and $j$th clusters, and $N_i$ and $N_j$ correspond to the number of data items in the two subsets respectively.
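As an illustration, the merge update above can be realized in a few lines; the sketch below (Python/NumPy, function and variable names are ours) pools the means and second-order moments of the two clusters being merged and recovers the covariance of the merged cluster:

```python
import numpy as np

def merge_cluster_stats(mu_i, cov_i, n_i, mu_j, cov_j, n_j):
    """Merge the sufficient statistics of two clusters (pooled mean,
    second moment and covariance), following the update rules above."""
    n_p = n_i + n_j
    mu_p = (n_i * mu_i + n_j * mu_j) / n_p
    # second-order moments E[t t^T] of each cluster
    m_i = cov_i + np.outer(mu_i, mu_i)
    m_j = cov_j + np.outer(mu_j, mu_j)
    m_p = (n_i * m_i + n_j * m_j) / n_p
    cov_p = m_p - np.outer(mu_p, mu_p)
    return mu_p, cov_p, n_p
```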
Given a particular GMM-based abstraction, there is also a need to quantify how well the local data are represented. One natural way is to compute the normalized log likelihood of the local data set $D_l$ [31], given as

$$q_L(D_l) = \frac{1}{|D_l|} \sum_{i=1}^{|D_l|} \log p_{local}(t_i|\theta_l).$$

The higher the value of $q_L(D_l)$, the higher the probability that $D_l$ is generated by $p_{local}(t_i|\theta_l)$. The measure is normalized so that data sets of different sizes can effectively be compared. We adopt this measure to quantify the quality of the local abstraction in our experiments to see how well the proposed approach works under different levels of abstraction.
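With a GMM fitted by scikit-learn, for instance, this quality measure is simply the per-sample average log density; a minimal sketch (function name ours, assuming a data matrix X_local holding one local source) is:

```python
from sklearn.mixture import GaussianMixture

def local_quality(X_local, n_components):
    """Fit a GMM abstraction to one local source and return q_L(D_l),
    the average log likelihood per data item."""
    gmm = GaussianMixture(n_components=n_components).fit(X_local)
    q_L = gmm.score(X_local)   # mean per-sample log likelihood
    return gmm, q_L
```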
While the main focus of this paper is not to study the privacy protection capability provided via the local abstraction, it is worth pointing out that abstracting clustered data items as Gaussian components could provide a notion of data privacy protection similar to that of anonymization [42,37], in that one does not have access to individual data items but only to the local statistics of some clustered items. So, even if one succeeds in guessing the Gaussian component that a data item belongs to, what he or she knows will only be the statistical properties of its associated component. It should also be noted that we assume all the local sources to be trusted. The consideration would be very different if untrusted parties existed.

2.2. An EM-based learning algorithm for abstracted data

The Expectation–Maximization (EM) algorithm [13] is typically used for computing the ML parameter estimates of different LVMs given incomplete data. It involves two steps, namely the E-step and the M-step. The E-step estimates the posterior probability of each data item being generated by each individual latent variable. The M-step maximizes the expected likelihood function given the posterior estimates of those unknown latent variables computed in the E-step. The E-step and M-step alternate iteratively until convergence.
The conventional EM algorithm works on data items only, and many LVMs assume that the data follow a Gaussian distribution given the latent variable. In the E-step, the posterior probability that a data item is generated by a Gaussian component of the global LVM is determined by the Mahalanobis distance between the data item and the corresponding component, as well as the component's mixing proportion [22]. For the EM algorithm to work on GMM-based data abstractions (instead of data items), we compute not the Mahalanobis distance but the Kullback–Leibler (KL) divergence of a local Gaussian component from a global Gaussian component. Thus, the E-step becomes

$$R_{lk} = p(D_{lk} = 1|\phi_k, \alpha_k, \theta_l) = \frac{\alpha_k\, p(\theta_l|\phi_k)}{\sum_{k=1}^{M} \alpha_k\, p(\theta_l|\phi_k)} \qquad (5)$$

and

$$p(\theta_l|\phi_k) \propto \exp\{-\zeta\, D(p(t|\theta_l)\,\|\,p(t|\phi_k))\} \qquad (6)$$

where $D_{lk} = 1$ indicates that the $l$th local component is generated by the $k$th global component, $\theta_l$ and $\phi_k$ denote the sets of parameters of the local and global components respectively, $\alpha_k$ denotes the mixing proportion of the $k$th global component, $D(p(x)\|p(y))$ is the KL divergence between $p(x)$ and $p(y)$, and $\zeta$ is a constant. If $p(t|\phi_k)$ takes the form of a multivariate Gaussian function (which holds for LVMs like GMM, GTM and HMM), a closed-form solution for computing $D(p(t|\theta_l)\|p(t|\phi_k))$ exists. Fig. 3 presents a flow chart illustrating the proposed approach and the detailed algorithm is shown in Algorithm 2.

Fig. 3. A flow chart of the proposed approach.



Algorithm 2. Learning from abstraction

1: Input: Aggregated local abstraction: mean vectors {μ_l}; covariance matrices {Σ_l}; mixing proportions {α_l} for l = 1, . . . , |C|
2: Initialize {μ_k}, {Σ_k}, {α_k}; {/* to initialize the global GMM – see Section 3.1.3 */}
3: Initialize (W, β); {/* to initialize the global GTM – see Section 3.2.3 */}
4: {θ_l} := {(μ_l, Σ_l)}
5: repeat
6:   E-Step:
7:   {φ_k^old} := {(μ_k, Σ_k)}, {α_k^old} := {α_k} {/* for global GMM */}
8:   φ^old := (W, β) {/* for global GTM */}
9:   for l = 1 to |C| do
10:    for k = 1 to M do
11:      R_lk = α_k^old exp{−ζ D(p(t|θ_l) || p(t|φ_k^old))} / Σ_{k=1}^{M} α_k^old exp{−ζ D(p(t|θ_l) || p(t|φ_k^old))} {/* for global GMM */}
12:      R_lk = exp{−ζ D(p(t|θ_l) || p(t|z_k, φ^old))} / Σ_{k=1}^{M} exp{−ζ D(p(t|θ_l) || p(t|z_k, φ^old))} {/* for global GTM */}
13:    end for
14:  end for
15:  M-Step:
16:  ({μ_k}, {Σ_k}, {α_k}) = gmmLFA_m_step({R_lk}, {φ_k^old}, {α_k^old}, {θ_l}, {α_l}) {/* for global GMM – see Section 3.1.2 */}
17:  (W, β) = gtmLFA_m_step({R_lk}, φ^old, {θ_l}, {α_l}) {/* for global GTM – see Section 3.2.2 */}
18: until convergence
19: Output:
20: ({μ_k}, {Σ_k}, {α_k}) {/* for global GMM */}
21: (W, β) {/* for global GTM */}

In the sequel, we describe how the proposed approach can be applied to learning GMM and GTM as the global models.

3. Learning global models from abstracted data

Before the global model learning takes place, the abstracted data collected from the local sources have to be aggregated first. Given that the local abstractions take the GMM parametric form, the aggregated data abstraction can simply be computed by adding the probability density functions of the local GMMs together and then recomputing the mixing proportions of the Gaussian components according to the sizes of the local data sets. Such an aggregated GMM will normally contain a large number of components of different sizes, many of which overlap with each other to different extents. The remaining question is how to learn a global model from a large pool of local Gaussian components.

3.1. Learning GMM for clustering

Gaussian mixture model (GMM) [22] has been commonly used for clustering in the literature. Compared with other clustering algorithms like k-means [30] and DBSCAN [14], GMM is more flexible and interpolative regarding its data representation power [9]. In the following sections, we first provide the key formulation for learning GMM from raw data.2 Then, the detailed derivation of the proposed EM algorithm for learning GMM from abstracted data is described.

3.1.1. Learning GMM from raw data

Let the probability density function of the global GMM model with $M$ components be defined as

$$p_{global}(t_i|\Phi) = \sum_{k=1}^{M} \alpha_k\, p_k(t_i|\phi_k)$$

where $\Phi = \{\phi_k\}$, $\phi_k = \{\mu_k, \Sigma_k\}$ denotes the mean vector and the covariance matrix of the $k$th component, and $p_k(t_i|\phi_k)$ is the $k$th Gaussian component defined based on $\phi_k$. Given a set of data $\{t_i\}$, the maximum likelihood estimates of the global model parameters are computed by maximizing the log likelihood function, given as

$$\ln \prod_{i=1}^{|D|} p_{global}(t_i|\Phi) = \ln \prod_{i=1}^{|D|} \left( \sum_{k=1}^{M} \alpha_k\, p_k(t_i|\phi_k) \right).$$

2 Readers are referred to [22] for more details.

Instead of maximizing the log likelihood function directly, the EM algorithm maximizes the expected log likelihood function [22], given as

$$\ln \prod_{i=1}^{|D|} \prod_{k=1}^{M} \left[ \alpha_k\, p_k(t_i|\phi_k) \right]^{R_{ik}} = \sum_{i=1}^{|D|} \sum_{k=1}^{M} R_{ik} \left( \ln \alpha_k + \ln\left[ (2\pi)^{-\frac{d}{2}} |\Sigma_k|^{-\frac{1}{2}} \right] - \frac{1}{2}(t_i - \mu_k)^T \Sigma_k^{-1} (t_i - \mu_k) \right) \qquad (7)$$

where $R_{ik}$ is the estimated posterior probability of the $i$th data item being generated by the $k$th component of the global model and is computed in the E-step as

$$R_{ik} = p(D_{ik} = 1|\phi_k, t_i) = \frac{\alpha_k\, p_k(t_i|\phi_k)}{\sum_{k=1}^{M} \alpha_k\, p_k(t_i|\phi_k)}. \qquad (8)$$

The M-step, which maximizes Eq. (7), is summarized as follows:

$$\mu_k = \frac{\sum_{i=1}^{|D|} R_{ik}\, t_i}{\sum_{i=1}^{|D|} R_{ik}} \qquad (9)$$

$$\alpha_k = \frac{1}{|D|} \sum_{i=1}^{|D|} R_{ik} \qquad (10)$$

$$\Sigma_k = \frac{\sum_{i=1}^{|D|} R_{ik}\, (t_i - \mu_k)(t_i - \mu_k)^T}{\sum_{i=1}^{|D|} R_{ik}}. \qquad (11)$$

3.1.2. Learning GMM from abstracted data

In the following, we use $l$ as the index to the components of the aggregated data abstraction and $k$ as the index to those of the global model. Given that the data are abstracted according to Eq. (3) and the global model is a GMM, the corresponding likelihood function is here defined as

$$\prod_{l=1}^{|C|} \left[ \sum_{k=1}^{M} \alpha_k\, p(\mu_l, \Sigma_l | \mu_k, \Sigma_k) \right]^{\alpha_l} \qquad (12)$$

where $|C|$ is the total number of Gaussian components in the aggregated data abstraction and $M$ is the total number of Gaussian components in the global GMM. By assuming that $\mu_l$ and $\Sigma_l$ are independently distributed given $\mu_k$ and $\Sigma_k$, and that $\Sigma_l$ and $\mu_k$ are independent given $\Sigma_k$, $p(\mu_l, \Sigma_l|\mu_k, \Sigma_k)$ becomes

$$p(\mu_l, \Sigma_l | \mu_k, \Sigma_k) = p(\mu_l | \mu_k, \Sigma_k)\, p(\Sigma_l | \Sigma_k). \qquad (13)$$

We model $p(\mu_l|\mu_k, \Sigma_k)$ and $p(\Sigma_l|\Sigma_k)$ using a multivariate Gaussian distribution and a central Wishart distribution3, given as

$$p(\mu_l | \mu_k, \Sigma_k) = (2\pi)^{-\frac{d}{2}} |\Sigma_k|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(\mu_l - \mu_k)^T \Sigma_k^{-1} (\mu_l - \mu_k) \right\}$$

$$p(\Sigma_l | \Sigma_k) = \frac{|\Sigma_l|^{-\frac{d+1}{2}}}{\pi^{\frac{d(d-1)}{4}}} \exp\left\{ -\frac{1}{2}\, \mathrm{tr}\left( \Sigma_k^{-1} \Sigma_l \right) \right\}$$

where $\mathrm{tr}(X)$ denotes the trace of the matrix $X$. The expected log likelihood function then becomes

$$\sum_{k=1}^{M} \sum_{l=1}^{|C|} \alpha_l R_{lk} \ln\left( \frac{\alpha_k}{(2\pi)^{\frac{d}{2}}\, \pi^{\frac{d(d-1)}{4}}\, |\Sigma_k|^{\frac{1}{2}}\, |\Sigma_l|^{\frac{d+1}{2}}} \right) - \sum_{k=1}^{M} \sum_{l=1}^{|C|} \frac{\alpha_l R_{lk}}{2} \left( \mathrm{tr}(\Sigma_k^{-1}\Sigma_l) + (\mu_l - \mu_k)^T \Sigma_k^{-1} (\mu_l - \mu_k) \right).$$

For the new E-step, $R_{lk}$ is to be computed according to Eqs. (5) and (6). It involves a KL divergence term $D(p_{local}(t_i|\theta_l)\,\|\,p_{global}(t_i|\phi_k))$ which, for this case, can be shown to be (see Appendix A):

$$\ln\frac{|\Sigma_k|^{\frac{1}{2}}}{|\Sigma_l|^{\frac{1}{2}}} + \frac{1}{2}\left( \mathrm{tr}(\Sigma_k^{-1}\Sigma_l) - d \right) + \frac{1}{2}(\mu_l - \mu_k)^T \Sigma_k^{-1} (\mu_l - \mu_k) \qquad (14)$$

The sum of the first two terms in Eq. (14) essentially measures the difference between the covariance matrices of the two components and becomes zero if the two matrices are identical. The third term is essentially the Mahalanobis distance between the mean vectors of the two components.
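As a concrete illustration, the closed-form divergence in Eq. (14) can be computed directly from the parameters of the two components; a minimal NumPy sketch (function name ours) is:

```python
import numpy as np

def gaussian_kl(mu_l, cov_l, mu_k, cov_k):
    """KL divergence D(N(mu_l, cov_l) || N(mu_k, cov_k)) as in Eq. (14)."""
    d = mu_l.shape[0]
    cov_k_inv = np.linalg.inv(cov_k)
    diff = mu_l - mu_k
    log_det_term = 0.5 * (np.linalg.slogdet(cov_k)[1] - np.linalg.slogdet(cov_l)[1])
    trace_term = 0.5 * (np.trace(cov_k_inv @ cov_l) - d)
    maha_term = 0.5 * diff @ cov_k_inv @ diff
    return log_det_term + trace_term + maha_term
```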

3 The Wishart distribution is known to be the distribution of the sample covariance matrix for a sample obtained from a multivariate normal distribution, and thus is used here [46].

Fig. 4. GTM maps a low-dimensional latent space to a high-dimensional data space via a non-linear transformation function.

The new M-step can then be derived by maximizing the expected log likelihood function with respect to $\{\mu_k, \Sigma_k, \alpha_k\}$, given as

$$\mu_k = \frac{\sum_{l=1}^{|C|} \alpha_l R_{lk}\, \mu_l}{\sum_{l=1}^{|C|} \alpha_l R_{lk}} \qquad (15)$$

$$\alpha_k = \sum_{l=1}^{|C|} \alpha_l R_{lk} \qquad (16)$$

$$\Sigma_k = \frac{\sum_{l=1}^{|C|} \alpha_l R_{lk}\, (\Sigma_l + \mu_l \mu_l^T)}{\sum_{l=1}^{|C|} \alpha_l R_{lk}} - \mu_k \mu_k^T \qquad (17)$$

The detailed derivation can be found in Appendix A.
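To make the procedure concrete, the sketch below (Python/NumPy, our own function and variable names, random initialization for brevity, reusing the gaussian_kl helper from the earlier sketch) strings together the modified E-step (Eqs. (5), (6) and (14)) and M-step (Eqs. (15)–(17)); it is a minimal illustration under these assumptions, not the authors' implementation:

```python
import numpy as np

def gmm_lfa_em(mus, covs, alphas, M, zeta=1.0, n_iter=100, seed=0):
    """Learn a global GMM with M components from an aggregated abstraction
    given as |C| local components (means mus, covariances covs, weights alphas)."""
    rng = np.random.default_rng(seed)
    C, d = mus.shape
    # random initialization of the global components (see Section 3.1.3)
    idx = rng.choice(C, size=M, replace=False)
    mu_k, cov_k = mus[idx].copy(), covs[idx].copy()
    alpha_k = np.full(M, 1.0 / M)

    for _ in range(n_iter):
        # E-step: responsibilities from KL divergences (Eqs. (5), (6), (14));
        # gaussian_kl is the closed-form divergence sketched after Eq. (14)
        logits = np.empty((C, M))
        for l in range(C):
            for k in range(M):
                logits[l, k] = np.log(alpha_k[k]) - zeta * gaussian_kl(
                    mus[l], covs[l], mu_k[k], cov_k[k])
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)

        # M-step (Eqs. (15)-(17))
        w = alphas[:, None] * R                           # alpha_l * R_lk
        alpha_k = w.sum(axis=0)
        mu_k = (w.T @ mus) / alpha_k[:, None]
        for k in range(M):
            second_moment = np.einsum('l,lij->ij', w[:, k],
                                      covs + np.einsum('li,lj->lij', mus, mus))
            cov_k[k] = second_moment / alpha_k[k] - np.outer(mu_k[k], mu_k[k])
    return mu_k, cov_k, alpha_k
```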

3.1.3. GMM initialization

The performance of the proposed EM algorithm, inherited from the conventional EM algorithm, depends on the model initialization. Many methods have been proposed for initializing EM algorithms. For instance, the k-means algorithm and random initialization are often used to initialize EM when the number of clusters is known, and their performance varies with the data sets and the optimization criteria used [32]. Greedy algorithms have also been proposed for improving the initialization quality [45], which however only work for some data sets. There also exist unsupervised clustering methods for determining the number of clusters automatically without user input [16].

All the aforementioned model initialization methods are restricted to modeling data items. While it is possible to modify such methods for abstracted data, both their formulation and implementation would need to be carefully studied, modified, compared and evaluated. For the experiments reported in this paper, we learned the global GMM by selecting the best initialization from several random initializations, as the main focus of this paper is on the learning framework. Further research effort in studying the model initialization issue for abstracted data is worth pursuing.

3.2. Learning GTM for manifold discovery

Generative topographic mapping (GTM) [10] is a probabilistic non-linear latent variable model which can be used to explore the intrinsic manifold of a set of high-dimensional data. In the literature, other nonlinear manifold discovery methods include locally linear embedding (LLE) [36], isometric feature mapping (ISOMAP) [43], etc. GTM was chosen in this paper mainly because of its generative nature. While it would be interesting to see whether models related to LLE and ISOMAP can be given generative interpretations so that the proposed learning approach could be applied to them, this is not the main focus of this paper.

GTM assumes that the data are generated by a set of latent variables which are ordered as a lattice in a low-dimensional, usually two-dimensional (2D), latent space. Via a non-linear mapping, the lattice in the latent space is mapped to the observed data in the data space so that the original data topology can be preserved in the latent space, as illustrated in Fig. 4. The maximum likelihood estimate of the mapping can be learned from the observed data using the EM algorithm. By projecting the original high-dimensional data back to the latent space, GTM can ''unfold'' the high-dimensional data manifold and project it onto a 2D map for ease of visualization. Such an unfolding result, in many cases, can help understand the underlying structure and organization of the high-dimensional data.
In the following, we first provide the key formulation for learning GTM from raw data.4 Then, the detailed derivation of the
proposed EM algorithm for learning GTM from abstracted data is described.

4
Readers are referred to [10] for more related details.

3.2.1. Learning GTM from raw data

Let $z_k \in \mathbb{R}^L$ denote the $k$th lattice point (altogether $M$ of them) defined in the latent space and $y(z;W) := W\Psi(z)$ denote a generalized linear regression model, with basis functions $\Psi$ weighted by $W$, which maps a point $z$ in the latent space onto a corresponding point $y$ in the data space in a non-linear fashion. By computing the best estimate of $W$ and another parameter $\beta$, which is the reciprocal of the Gaussian noise variance around $y(z;W)$, the low-dimensional manifold of the observed data can be captured. The conditional probability distribution of the data $t_i$ given $z_k$ is formulated as

$$p(t_i|z_k, W, \beta) = (2\pi)^{-\frac{d}{2}} \beta^{\frac{d}{2}} \exp\left\{ -\frac{\beta}{2} \| t_i - y(z_k; W) \|^2 \right\} \qquad (18)$$

The corresponding log likelihood function is then given as

$$\sum_{i=1}^{|D|} \ln \left( \frac{1}{M} \sum_{k=1}^{M} p(t_i|z_k, W, \beta) \right) \qquad (19)$$

To estimate the GTM's parameters $\{W, \beta\}$ using the conventional EM algorithm, the E-step is given as

$$R_{ik}(W_{old}, \beta_{old}) = P(z_k|t_i, W_{old}, \beta_{old}) = \frac{p(t_i|z_k, W_{old}, \beta_{old})}{\sum_{j=1}^{M} p(t_i|z_j, W_{old}, \beta_{old})}$$

where $R_{ik}$ is the posterior probability of the $i$th data item having originated from the $k$th latent point of the global GTM via the underlying non-linear mapping, and $W_{old}$ and $\beta_{old}$ are the current estimates of the GTM's parameters. The M-step is then given as

$$\sum_{i=1}^{|D|} \sum_{k=1}^{M} R_{ik}(W_{old}, \beta_{old})\, \left( W_{new}\Psi(z_k) - t_i \right) \Psi(z_k)^T = 0 \qquad (20)$$

$$\frac{1}{\beta_{new}} = \frac{1}{|D|d} \sum_{i=1}^{|D|} \sum_{k=1}^{M} R_{ik}(W_{old}, \beta_{old})\, \| W_{new}\Psi(z_k) - t_i \|^2 \qquad (21)$$

where $\Psi(z_k)$ is the neighboring function defined with fixed spherical Gaussian basis functions.
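For reference, the M-step in Eqs. (20) and (21) amounts to a weighted least-squares problem for W followed by a variance update; a minimal NumPy sketch (variable names ours: Phi is the M x K matrix of basis activations Ψ(z_k), R the |D| x M responsibility matrix, T the |D| x d data matrix) might look as follows:

```python
import numpy as np

def gtm_m_step(Phi, R, T):
    """One GTM M-step: solve Eq. (20) for W_new and update beta via Eq. (21)."""
    N, d = T.shape
    G = np.diag(R.sum(axis=0))                     # M x M, column sums of R
    # (Phi^T G Phi) W^T = Phi^T R^T T  =>  W_new is d x K
    W_new = np.linalg.solve(Phi.T @ G @ Phi, Phi.T @ R.T @ T).T
    Y = Phi @ W_new.T                              # M x d, images of the lattice points
    sq_dist = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # |D| x M
    beta_new = N * d / (R * sq_dist).sum()
    return W_new, beta_new
```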

3.2.2. Learning GTM from abstracted data

Based on Eq. (18) and adopting a formulation similar to that of Eq. (14), the expected log likelihood function for the GTM model can easily be shown to be

$$\sum_{k=1}^{M} \sum_{l=1}^{|C|} \alpha_l R_{lk} \ln \frac{\beta^{\frac{d}{2}}}{(2\pi)^{\frac{d}{2}}} - \frac{1}{2} \sum_{k=1}^{M} \sum_{l=1}^{|C|} \alpha_l R_{lk} \left( \beta\, \mathrm{tr}(\Sigma_l) - d + \beta\, (\mu_l - y(z_k; W))^T (\mu_l - y(z_k; W)) \right)$$

For the new E-step, the posterior probability that the $l$th local GMM component is generated by the $k$th lattice point of the global GTM, $R_{lk}$, can be formulated as

$$R_{lk} = \frac{\exp\{-\zeta\, D(p_{local}(t|\theta_l)\,\|\,p_{gtm}(t|z_k, W, \beta))\}}{\sum_{j=1}^{M} \exp\{-\zeta\, D(p_{local}(t|\theta_l)\,\|\,p_{gtm}(t|z_j, W, \beta))\}} \qquad (22)$$

where

$$D(p_{local}\,\|\,p_{gtm}) = \ln \frac{\beta^{-\frac{d}{2}}}{|\Sigma_l|^{\frac{1}{2}}} + \frac{1}{2}\left( \beta\, \mathrm{tr}(\Sigma_l) - d \right) + \frac{\beta}{2}\, (y(z_k; W) - \mu_l)^T (y(z_k; W) - \mu_l) \qquad (23)$$

It is to be noted that when the local abstraction is detailed to the level that one component represents an individual data item, the first two terms in Eq. (23) become constant and Eq. (22) degenerates back to the E-step of the original GTM.

The new M-step can be derived by maximizing the expected log likelihood function with respect to $\{W, \beta\}$, given as

$$\sum_{l=1}^{|C|} \sum_{k=1}^{M} \alpha_l R_{lk}(W_{old}, \beta_{old})\, \left( W_{new}\Psi(z_k) - \mu_l \right) \Psi(z_k)^T = 0 \qquad (24)$$

$$\frac{1}{\beta_{new}} = \frac{1}{|C|d} \sum_{k=1}^{M} \sum_{l=1}^{|C|} \alpha_l R_{lk}(W_{old}, \beta_{old})\, \mathrm{tr}\left( \Sigma_l + \mu_l \mu_l^T \right) - \frac{1}{|C|d} \sum_{k=1}^{M} \sum_{l=1}^{|C|} \alpha_l R_{lk}\, \| W_{new}\Psi(z_k) \|^2 \qquad (25)$$

The detailed derivation can be found in Appendix A.2.
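Compared with the raw-data M-step sketched earlier, only the weights and targets change: each local component contributes with weight α_l R_lk and target μ_l, and the β update additionally accounts for the local covariances. A hedged sketch (our own names; mus and covs hold the |C| local means and covariances, alphas their mixing proportions; the β update uses the equivalent expected-squared-error form and normalizes by the weight total rather than |C|) is:

```python
import numpy as np

def gtm_lfa_m_step(Phi, R, mus, covs, alphas):
    """M-step of the modified GTM EM on abstracted data (cf. Eqs. (24)-(25))."""
    C, d = mus.shape
    wts = alphas[:, None] * R                      # |C| x M, weights alpha_l * R_lk
    G = np.diag(wts.sum(axis=0))                   # M x M
    W_new = np.linalg.solve(Phi.T @ G @ Phi, Phi.T @ wts.T @ mus).T
    Y = Phi @ W_new.T                              # M x d, images of the lattice points
    # beta update: expected squared error, with tr(Sigma_l) accounting for the
    # spread of each local component
    sq_dist = ((mus[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # |C| x M
    tr_covs = np.trace(covs, axis1=1, axis2=2)                       # |C|
    inv_beta = (wts * (tr_covs[:, None] + sq_dist)).sum() / (wts.sum() * d)
    return W_new, 1.0 / inv_beta
```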

3.2.3. GTM initialization

The conventional GTM uses principal component analysis (PCA) for initializing W and β, where the covariance matrix of the observed data is needed. Given only the aggregated local abstraction, the initialization of the global GTM becomes not straightforward. Fortunately, it can be shown that the global mean vector and covariance matrix can analytically be derived based on the local data's statistics, given as

$$\mu_{global} = \frac{\sum_{l=1}^{|C|} |D_l|\, \mu_l}{|D|}$$

$$\Sigma_{global} = \frac{\sum_{l=1}^{|C|} |D_l|\, (\Sigma_l + \mu_l \mu_l^T)}{|D|} - \mu_{global}\, \mu_{global}^T$$

Thus, the GTM initialization based on the local abstraction can be done equivalently to that of the conventional GTM.
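A small sketch of this initialization step (names ours; sizes holds the number of data items associated with each local component):

```python
import numpy as np

def global_moments(mus, covs, sizes):
    """Global mean and covariance recovered from the local abstractions only."""
    N = sizes.sum()
    mu_g = (sizes[:, None] * mus).sum(axis=0) / N
    second = (sizes[:, None, None] *
              (covs + np.einsum('li,lj->lij', mus, mus))).sum(axis=0) / N
    cov_g = second - np.outer(mu_g, mu_g)
    return mu_g, cov_g
```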

3.3. Computational gain and communication overhead

Assuming that the hierarchical clustering of the local data can be performed off-line and reused when needed, the modified EM algorithm based on data abstraction can run much faster than the conventional one based on raw data. To compare the computational complexity of the conventional and revised EM algorithms, we make reference to the number of product operations involved in the corresponding E-step and M-step. To recall, $|D|$ denotes the number of data items, $d$ denotes the dimension of the data space, $|C|$ denotes the number of components in the aggregated local abstraction and $M$ denotes the number of components in the global model. Also, we use the subscript ''lfa'' to stand for ''learning from abstracted data'' and ''lfd'' to stand for ''learning from raw data''.

For learning GMM from abstracted data, $R_{lk}$ is computed in the E-step for each pair of local and global components, and there are altogether $|C| \times M$ such pairs. According to Eqs. (5) and (6), each evaluation of $R_{lk}$ needs $M$ KL divergence computations. The complexity of the divergence computation is dominated by the second and third terms of Eq. (14), and thus gives $O(d^2)$. Therefore, the overall complexity of the E-step becomes $C^E_{lfa} = O(|C| \cdot M^2 \cdot d^2)$. For the M-step (Eq. (17)), the complexity is dominated by the computation of the new global covariance matrices. It can easily be shown that the overall complexity of the M-step is $C^M_{lfa} = O(|C| \cdot M \cdot d^2)$. The overall complexity per iteration for the revised EM algorithm is thus

$$C_{lfa} = O(|C| \cdot M^2 \cdot d^2) \qquad (26)$$

Similarly, it can be shown that the complexity of the conventional EM algorithm is again dominated by the E-step, given as

$$C_{lfd} = O(|D| \cdot M^2 \cdot d^2) \qquad (27)$$

By comparing Eqs. (26) and (27), the speedup factor is $|D|/|C|$. In addition, according to our experiments, the modified EM algorithm generally converges much faster than the conventional one: the number of iterations dropped dramatically when the number of local components decreased.

With a similar argument, one can show that the complexity of the revised EM algorithm for GTM is also dominated by the E-step, with the overall complexity per iteration being $O(|C| \cdot M^2 \cdot d^2)$, where $M$ is the number of lattice points in GTM. The complexity per iteration for the conventional one can also be shown to be $O(|D| \cdot M^2 \cdot d^2)$. The speedup factor is again $|D|/|C|$.

In terms of communication overhead, the learning-from-raw-data approach requires the transmission of all the raw data ($O(|D|)$ items) from the local sources to the centralized server for global learning. The learning-from-abstracted-data approach instead requires the transmission of the set of mean vectors ($O(d)$ each) and covariance matrices ($O(d^2)$ each) corresponding to the $|C|$ local components, so the communication overhead is $O(|C| \cdot d^2)$. For problems where the data situated at each local source are not randomly scattered but clustered, they can be effectively represented as GMMs with a small number of mixture components; thus $|C|$ is much smaller than $|D|$. In addition, it is also common for the dimension $d$ of the data to be much smaller than $|D|$. Thus, transmitting the abstracted data is in general more bandwidth-efficient than transmitting the raw data.
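As a hedged back-of-the-envelope illustration (the numbers below are ours, not from the paper's experiments), the snippet compares the per-iteration E-step cost and the communication payload for a hypothetical setting:

```python
# Hypothetical setting: |D| data items, |C| local components, M global
# components, dimension d (values chosen for illustration only).
D, C, M, d = 100_000, 300, 10, 12

e_step_raw = D * M**2 * d**2          # O(|D| M^2 d^2) product operations
e_step_lfa = C * M**2 * d**2          # O(|C| M^2 d^2) product operations
print("E-step speedup factor ~", e_step_raw / e_step_lfa)      # = |D|/|C|

comm_raw = D * d                      # raw data: |D| vectors of dimension d
comm_lfa = C * (d + d * d)            # abstraction: |C| means + covariances
print("communication ratio   ~", comm_raw / comm_lfa)
```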

4. Experiments

We have applied the learning-from-abstraction approach as described in Section 2.2 to learn GMM and GTM for data clus-
tering and data manifold discovery respectively. A series of experiments have been performed to evaluate their effectiveness.
Special attention has been paid to the model accuracy, the scalability of the learning algorithms and their robustness against
imbalanced data distributions over multiple sources.

4.1. Performance on clustering abstracted data using GMM

The following describes how our experiments were designed. For ease of visualizing the experimental results, we created four reference Gaussian mixture models for generating four artificial 2-dimensional data sets. We labeled the four models as g4, g5, g6 and g7, where the ''x'' in gx refers to the number of Gaussian components of the reference model. The means and the covariance matrices of the reference models were as follows:

g4 := [μ1 = (1, 1), Σ1 = (.1, −.05, −.05, .2); μ2 = (1, 5), Σ2 = (.1, 0, 0, .1); μ3 = (5, 5), Σ3 = (.1, .05, .05, .1); μ4 = (5, 1), Σ4 = (.1, −.01, −.01, .1)]

g5 := [μ1 = (4, 2), Σ1 = (1, .05, .05, .8); μ2 = (3, 4), Σ2 = (.5, 0, 0, .5); μ3 = (2, 6), Σ3 = (.4, −.2, −.15, .3); μ4 = (7, 5), Σ4 = (.9, −.5, −.2, .8); μ5 = (8, 7), Σ5 = (.2, .01, .01, .2)]

g6 := [μ1 = (1, 6), Σ1 = (.3, −.3, −.1, .7); μ2 = (2.5, 2.5), Σ2 = (.15, .05, 0, .1); μ3 = (5, 3), Σ3 = (.2, .02, .02, .1); μ4 = (3, 6), Σ4 = (.15, 0, 0, .05); μ5 = (6.5, 4.5), Σ5 = (.1, 0, 0, .3); μ6 = (11.5, 5), Σ6 = (.4, 0, 0, .4)]

g7 := [μ1 = (1, 6), Σ1 = (.3, −.3, −.1, .7); μ2 = (2.5, 2.5), Σ2 = (.15, .05, 0, .1); μ3 = (5, 3), Σ3 = (.2, .02, .02, .1); μ4 = (3, 6), Σ4 = (.15, 0, 0, .05); μ5 = (6.5, 4.5), Σ5 = (.1, 0, 0, .3); μ6 = (11.5, 5), Σ6 = (.4, 0, 0, .4); μ7 = (14, 2.5), Σ7 = (.2, 0, 0, .2)]

For each model, 400 data items were generated per component and partitioned into multiple data subsets to mimic the process of distributing the data to local sources. For each data subset, we applied the AGH algorithm as described in Algorithm 1 and computed a corresponding GMM-based abstraction as described in Section 2.1. The abstraction quality was chosen based on the user's expectation on the local model quality $q_L(D_l)$ defined in Section 2.1. We set the number of local components to 10 for g4 and up to 30 for g7. We then aggregated the abstractions computed based on those subsets to mimic the process of collecting abstracted data from distributed sources. Global cluster models were learned afterwards using the EM algorithm proposed in Section 3.1.2. The performance of the global cluster models learned from the abstracted data was then compared to that of models learned from the raw data directly. The results are tabulated in Table 1. The effect of the local model quality on the quality of the global model being learned is also evaluated.

4.1.1. Accuracy and speedup

According to the experimental results we obtained, we found that reasonably accurate global GMM models can be learned directly from abstracted data based on the proposed framework. As shown in Table 1, with the abstraction level properly chosen, the global models learned from the compact data abstraction and from the raw data are close to each other in terms of KL divergence. As indicated in the third and fourth columns of Table 1, the KL divergence (log likelihood) value between the models learned from abstracted data and raw data reaches a minimum (maximum) when the number of global components comes close to that of the reference model for the cases from g4 to g7. Some of the clustering results are shown in Fig. 5. One can observe that the two sets of clustering results, learned from the abstracted data and from the raw data, agree with each other.

Regarding the learning efficiency, as indicated in the fifth and sixth columns of Table 1, the speedup achieved ranges from 4.1 to 58.2 when the abstracted data were used instead of the raw data. One primary reason is the computational gain in each EM step, as explained in Section 3.3. Also, we found that the conventional EM algorithm converged slowly for data sets with highly overlapping clusters. The convergence rate of the revised EM algorithm for abstracted data was found to be relatively less sensitive to those cases, possibly due to the data abstraction being adopted.

Table 1
Performance comparison between the global GMMs learned from abstracted data (LFA) and those learned from the raw data (LFD). gx denotes the reference data model used for creating the test data set.

Reference model | No. of global components | LFA log likelihood | KL D(P_LFA || P_LFD) | Training time t_LFA (s) | Speed-up t_LFD/t_LFA
g4 | 2 | −2495.3 | 1.16 | 0.17 | 8.8
g4 | 3 | −1989.3 | 0.53 | 0.25 | 12.6
g4 | 4 | −1625.3 | 0.09 | 0.31 | 2.8
g5 | 2 | −3688.4 | 0.45 | 0.20 | 39.8
g5 | 3 | −3624.0 | 0.03 | 0.28 | 52.1
g5 | 4 | −3617.7 | 0.02 | 0.37 | 58.2
g5 | 5 | −3556.4 | 0.03 | 0.44 | 52.6
g6 | 2 | −4871.5 | 1.29 | 0.16 | 9.4
g6 | 3 | −4305.9 | 0.84 | 0.22 | 9.0
g6 | 4 | −4104.1 | 0.77 | 0.27 | 4.1
g6 | 5 | −3839.5 | 0.54 | 0.33 | 8.0
g6 | 6 | −3770.6 | 0.26 | 0.50 | 9.6
g7 | 2 | −5978.7 | 1.3 | 0.20 | 8.5
g7 | 3 | −5405.6 | 0.99 | 0.28 | 12.8
g7 | 4 | −5225.3 | 0.74 | 0.35 | 13.8
g7 | 5 | −4964.0 | 0.66 | 0.43 | 10.3
g7 | 6 | −4667.3 | 0.59 | 0.51 | 9.1
g7 | 7 | −4349.8 | 0.42 | 0.60 | 11.9

Fig. 5. Visualization of global GMMs learned from abstracted data (dotted lines) and those learned from raw data (solid). q denotes the local model quality.

In addition, a small decrease in the local model quality in general only results in a graceful degradation of the clustering performance. As an illustration, Fig. 6 shows several global GMM models with three components learned from abstracted data with different local model quality (normalized log likelihood) values, ranging from 0.49 to 1.679. We can observe that the clustering performance improves from Fig. 6(a) to Fig. 6(c) as the local model quality increases. Fig. 7 shows that, in general, the discrepancy between the reference model and the learned model, measured in terms of the KL divergence value, decreases gracefully as the local model quality increases.
Fig. 6. A global GMM with three components learned given different local model quality values q.

Fig. 7. The change of the KL divergence value between GMMs learned from raw and abstracted data.

Fig. 8. Performance of learning a global GMM with 10 components based on the aggregated local abstractions of g100 and g1000.

To demonstrate the efficacy of the proposed approach on more sizeable data sets, we generated two more data sets, one using a GMM containing 100 components (g100) and another with 1000 components (g1000). Fifty data items per component were created for both cases. Data items were uniformly partitioned into three sources at random and each source was
abstracted by a local GMM with 100 components. For g100, a global GMM was learned from aggregated abstraction with the
number of global components ranging from 1 to 100. Fig. 9 shows the change of the KL divergence of the converged global
model with different numbers of global components. It is noted that the KL divergence value decreases rapidly from 111.1 to
45.9 as the number of components of the global model increases from 1 to 10. After this rapid reduction, the curve gradually
converges to zero. Fig. 8 shows some clustering results on g100 and g1000. The computing time for learning from g1000 is
85.2 s based on the abstracted data while the corresponding time needed using the raw data set is 53.0 min. The speedup
factor is around 40.
We have also partitioned the data into different numbers of equal-sized data subsets for testing. As revealed in Table 2,
the achieved speedup remains more or less the same given the same local model quality.

Fig. 9. The change of the accuracy of a global GMM with different number of components.

4.1.2. Robustness against local sources with heterogeneous data distributions

In the previous section, all the data sets generated for testing were randomly partitioned to form the distributed local sources, such that the data distributions of the local sources are essentially the same. However, in many cases, how the data items are distributed cannot be controlled and the local sources can have very different data distributions among them (see Fig. 1).

To demonstrate that the performance of the proposed approach is robust against the way the overall set of data items is distributed over the local sources, we deliberately partitioned a particular data set in a non-uniform manner so that the partitions have different sizes and follow different data distributions. Again, at each local source, a GMM abstraction was created and the number of local components used in the experiments was around 12. The global GMMs with three components learned under the different data partition settings are close to each other, as shown in Fig. 10.

Table 2
Efficiency comparison between learning global GMMs from abstracted (LFA) and raw data (LFD). For all the cases, the number of global components is set to be 5.

No. of sources | Local quality | Training time t_LFA (s) | Speed-up t_LFD/t_LFA
3 | 2.31 | 2.46 | 9.43
3 | 3.31 | 8.98 | 2.58
3 | 4.31 | 38.48 | 0.6
5 | 2.31 | 2.99 | 7.76
5 | 3.31 | 11.37 | 2.04
5 | 4.31 | 41.6 | 0.56
7 | 2.31 | 3.42 | 6.78
7 | 3.31 | 15.9 | 1.46
7 | 4.31 | 55.7 | 0.42
9 | 2.31 | 5.4 | 4.29
9 | 3.31 | 22.68 | 1.02
9 | 4.31 | 68.14 | 0.34

Fig. 10. Global GMMs learned with different data partition ratios among three sources. Data from the three sources are labeled as triangles, crosses and pluses, respectively.

4.1.3. Performance on WebKB and Corel5k datasets

To further evaluate the applicability of the proposed approach, we applied the proposed algorithm to two benchmark datasets. The first one is a subset of the WebKB dataset [2], which contains over 8000 web pages collected from four universities and grouped under seven categories. We selected a subset which contains 546 pages from three categories and represented the web pages using a vector space model with 551 terms. The second one is Corel5k, from the Corel stock photograph collection, which consists of 5000 pictures ranging from nature scenes to people portraits and sports photographs. Each image is labeled with 1–5 words and there are altogether 371 distinct words in the vocabulary. Each image is represented using a 128-dimensional colored pattern appearance model (CPAM), which was proposed to capture both color and texture information of small patches in natural color images and has been successfully applied to image coding, indexing and retrieval [34].
Two commonly used clustering evaluation criteria, normalized mutual information (NMI) [12] and purity [29], were computed. The two criteria are defined as

$$\mathrm{Purity}(W, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|$$

where $W$ denotes the learned cluster labels, $C$ denotes the true cluster labels and $N$ denotes the size of the data set, and

$$\mathrm{NMI}(W, C) = \frac{I(W, C)}{(H(W) + H(C))/2}$$

where $I(W, C)$ is the mutual information between $W$ and $C$, and $H(W)$ and $H(C)$ are their corresponding entropy values. For perfect clustering, the NMI and Purity values should both be one.
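A small sketch of how these two criteria can be computed (assuming integer label arrays; scikit-learn's normalized_mutual_info_score uses the arithmetic mean of the entropies by default, matching the definition above):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Purity: each learned cluster is credited with its best-matching true class."""
    cm = contingency_matrix(labels_true, labels_pred)   # rows: true, cols: predicted
    return cm.max(axis=0).sum() / cm.sum()

labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([0, 0, 1, 2, 2, 2])
print(purity(labels_true, labels_pred))
print(normalized_mutual_info_score(labels_true, labels_pred))
```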
To test the proposed learning framework on WebKB and Corel5k, we randomly partitioned each dataset into three equal-sized data subsets, with each subset abstracted based on a common local model quality threshold. The global GMM learned from the abstracted data (LFA) is compared with a GMM learned from the raw data (LFD), as shown in Figs. 11 and 12. Consistent results regarding the effectiveness of the proposed approach were observed: more abstracted local data (with fewer local components) result only in a graceful degradation of the clustering quality measured in terms of purity and NMI.
Fig. 11. Performance comparison on the WebKB dataset.

Fig. 12. Performance comparison on the Corel5k dataset.



4.2. Performance on discovering data manifold using GTM

To demonstrate the effectiveness of the proposed learning framework for learning GTM, two benchmark data sets, namely
the ‘‘oil flow’’ data set and the ‘‘S-curve’’ data set, were used. The oil flow data set was originally used in [10] for mimicking
the measurements of oil flows mixed with gas and water along multi-phase pipelines. The data set consists of 1000 data
items of dimension 12 evenly distributed among three different geometrical configurations – ‘‘stratified’’, ‘‘annular’’ and
‘‘homogeneous’’. The S-curve data set is an ‘S’-shaped 2-D manifold embedded in a 3-D data space and is also commonly used
for evaluating the performance of nonlinear manifold learning algorithms. In our experiment, we sampled 2000 data items
from the S-shaped manifold to form the S-curve data set.
To carry out the experiments, we first partitioned the two data sets randomly into 3, 5, 7 and 9 equal-sized data subsets
for again mimicking the distributed data sources, abstracted the subsets based on a given local model quality threshold for
each source, and then applied the modified GTM learning algorithm. In our experiment, 400 latent lattice points and 81 basis
functions were chosen as the GTM setting which was experimentally found to work well for the two data sets we used. We
have also tried varying the local model quality threshold and the number of sources. The resulting manifolds discovered for
the oil flow data and the S-curve data obtained under different settings are shown in Figs. 15 and 16, respectively.
To facilitate the evaluation of the manifold discovery results by visual inspection (which is commonly adopted in related
work), we labeled the data items of different groups using different markers. That is, the oil flow data of different categories
(‘‘stratified’’, ‘‘annular’’ and ‘‘homogeneous’’) were marked differently. For the S-curve data set, the data items were marked
so that the six consecutive parts along the 2-D manifold share the same labels (see Fig. 13). Good discovery results should
end up with a smaller degree of overlapping among different markers on the projected map.

4.2.1. Accuracy and speedup


Fig. 13. Visualization of the oil flow (○ = annular, △ = homogeneous, * = stratified) and S-curve data sets discovered using the conventional GTM with 400 latent variables and 81 basis functions.

Fig. 14. Manifold discovery results with heterogeneous local model quality requirements among the sources. S1–3 denote the three data sources and the number next to each of them refers to the corresponding local model quality value.

Fig. 15. Results of manifold discovery on the oil flow data set using GTMs learned from abstractions under different numbers of sources (|C|) and different levels of local model quality (q). L is the number of components of the aggregated data abstraction.

According to the experimental results, we observed that GTMs can be accurately learned from reasonably abstracted data. By examining Figs. 15 and 16 column-wise, we observed that the performance of GTM improves as the local model quality increases. In particular, for the oil flow data, when the local model quality increases from 1.98 to 2.98, most of the global models learned (Fig. 15(f)–(h)) are much improved, except for the case with three partitions (Fig. 15(e)). When the local quality further increases to 3.98, the cases with seven sources (Fig. 15(k)) and nine sources (Fig. 15(l)) give discovery results almost identical to the case without using abstraction (Fig. 13), and the data of different categories are well separated from each other. We observed the same pattern when the S-curve data set was tested.
To see the effect of having different quality requirements imposed on the local sources, we again partitioned the data sets uniformly into three subsets. Each subset was associated with a different local model quality requirement. We tried different sets of heterogeneous local quality requirements for the oil flow data set and the S-curve data set, as shown in Fig. 14.

For the oil flow data set (Fig. 14(a)–(c)), the results corresponding to two of the combinations of local quality levels, shown in Fig. 14(a) and (b), were found to be a bit inferior to the one shown in Fig. 14(c). Referring to the caption of Fig. 14(c), one can deduce that details from the third source were more important than those of the other two in giving more accurate global analysis results (even though the data subsets were uniformly sampled from one data set). For the S-curve data (Fig. 14(d)–(f)), similar results were observed and the second source turned out to be the more important one. If the subsets were sampled in a more biased manner, the effect would be even more obvious. This implies that there exists a need for dynamically setting the local inaccuracy levels to obtain global analysis results in a more cost-effective manner. See Section 5 for further elaboration on this point.
Significant speedup is again achieved when learning GTM from abstracted data, as reflected in our experimental results. In particular, we compared the computational time needed for learning the oil flow data manifold from the raw data and from the abstracted data. According to Table 3, we observed that the speedup gained ranges from 0.44 to 4.1. As anticipated, the speedup factor increases when the number of local sources continues to increase. The speedup factor achieved is, however, only about half of the value derived in Section 3.3. This was partially due to the fact that our implementation was based on Matlab and some non-matrix operations have not been optimized equally well.

4.2.2. Robustness against different numbers of local sources


We have also tested the effect of having different numbers of sources. Given a common local model quality threshold for all sources, our experimental results show that the cases with the data distributed into different numbers of sources behave differently. Referring to Fig. 15(a)–(d), when the common local model quality threshold is 1.98 and the number of data sources increases from 3 to 7, the data corresponding to two particular oil flow configurations (triangles and circles) twisted together near the upper part of the sub-figures (Fig. 15(a)–(c)). When the number of sources increases to 9 (Fig. 15(d)), the data of the two configurations are more separated from each other, except for those near the upper right corner. A similar situation can be observed in the row corresponding to the local model quality threshold equal to 3.98, although the situation is not that obvious in the row corresponding to the threshold equal to 2.98. Also, changing the number of sources does not affect the results on the S-curve data much (Fig. 16). As a data set is uniformly partitioned into subsets, applying a given GMM-based abstraction to each of the subsets will end up with a different likelihood value. Thus, if we apply the same local model quality requirement to both the original data and its subsets, more details, or local components, will be revealed in some subsets and fewer in others. This accounts for the variation in GTM's accuracy when the number of data partitions varies.

Fig. 16. Results of manifold discovery on the S-curve data set using GTMs learned from abstractions under different numbers of sources (|C|) and different levels of local model quality (q). L is the number of components of the aggregated data abstraction.

Table 3
Efficiency comparison between global GTMs learned from abstracted (LFA) and raw data (LFD).

No. of sources | Local quality | No. of local components | Training time t_LFA (s) | Speed-up t_LFD/t_LFA
3 | 1.98 | 76 | 2.19 | 4.71
3 | 2.98 | 185 | 4.71 | 2.19
3 | 3.98 | 515 | 11.39 | 0.9
5 | 1.98 | 108 | 2.91 | 3.56
5 | 2.98 | 313 | 7.25 | 1.43
5 | 3.98 | 770 | 17.3 | 0.6
7 | 1.98 | 151 | 3.89 | 2.66
7 | 2.98 | 426 | 10 | 1.04
7 | 3.98 | 990 | 23.17 | 0.45
9 | 1.98 | 194 | 4.57 | 2.26
9 | 2.98 | 543 | 12.21 | 0.85
9 | 3.98 | 991 | 23.43 | 0.44

Fig. 17. Manifold discovery results on the oil flow data set given different local model quality requirements q and three different settings of non-uniform
data partitioning for seven sources. The three data partitioning settings are P1 = {0.09, 0.09, 0.21, 0.20, 0.20, 0.12, 0.09};
P2 = {0.23, 0.19, 0.18, 0.03, 0.04, 0.16, 0.17}; P3 = {0.01, 0.03, 0.09, 0.25, 0.13, 0.23, 0.25} respectively.

4.2.3. Robustness against local sources with heterogeneous data distributions


To test the robustness of the proposed approach given local sources with heterogeneous data distributions, we partitioned the data sets into seven portions of different sizes. Given a particular partition setting, four different thresholds for the local model quality requirement were examined. The results obtained for the oil flow data set are shown in Fig. 17. Examining the sub-figures in a row-wise manner, where each row corresponds to one partition setting, we found no significant difference in the overall manifold unfolding quality caused by the biased distributions. Similar results were observed for the S-curve data set in Fig. 18.
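As a minor aid to reproducing this protocol, a minimal sketch of such a non-uniform partitioning is given below (Python/NumPy; the function name is hypothetical and the rounding of portion sizes is a simplification).

import numpy as np

def partition_by_proportions(X, proportions, seed=0):
    # Split the rows of X into len(proportions) disjoint portions whose sizes
    # follow the given proportions, e.g. P1 = [0.09, 0.09, 0.21, 0.20, 0.20, 0.12, 0.09].
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    sizes = np.round(np.array(proportions) * X.shape[0]).astype(int)
    bounds = np.cumsum(sizes)[:-1]           # the last portion absorbs rounding error
    return [X[part] for part in np.split(idx, bounds)]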

4.2.4. On manifold preserving data abstraction


As revealed in Section 4.2.1, the need for finer details to achieve better manifold unfolding was found to be more important for the oil flow data than for the S-curve data. A good local abstraction should be able to preserve the local topological relationships of the data well. To validate this point, we used a graph-based clustering algorithm called minimum cut [47] instead of AGH for abstracting the local data sources. The minimum cut algorithm models the data as an undirected weighted graph, where the data items form the vertices of the graph and the similarity value between any two data items forms the weight of the edge connecting the two corresponding vertices.
We applied the graph-based approach for the local abstraction of the oil flow data set using the same experimental settings as explained before. The minimum cut algorithm implemented in the clustering toolkit CLUTO [1] was used. The similarity between data items was chosen as the inverse of their Euclidean distance and the number of local neighbors was set to 40. We used 400 latent lattice points for the GTM, and tested the scenarios with 10, 20 and 50 local components in each local data abstraction.
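The snippet below is a rough sketch of the graph construction just described: a k-nearest-neighbour similarity graph with inverse Euclidean distances as edge weights, followed by the summarization of a given graph partition into weighted Gaussian components for the local abstraction. It is illustrative only; the actual experiments used the minimum cut implementation in CLUTO [1], and the function names here are hypothetical.

import numpy as np
from scipy.spatial.distance import cdist

def knn_similarity_graph(X, k=40, eps=1e-12):
    # Dense (n x n) similarity matrix; each point keeps nonzero weights
    # (inverse Euclidean distance) only to its k nearest neighbours.
    dist = cdist(X, X)
    np.fill_diagonal(dist, np.inf)          # no self-edges
    sim = np.zeros_like(dist)
    for i in range(X.shape[0]):
        nbrs = np.argsort(dist[i])[:k]
        sim[i, nbrs] = 1.0 / (dist[i, nbrs] + eps)
    return np.maximum(sim, sim.T)           # symmetrize the graph

def clusters_to_components(X, labels):
    # Turn a partition of the local data (e.g., produced by a min-cut
    # partitioner) into Gaussian components (weight, mean, covariance).
    comps = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        alpha = Xc.shape[0] / X.shape[0]
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        comps.append((alpha, mu, Sigma))
    return comps

Degenerate clusters (e.g., singletons) would need additional covariance regularization in practice.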
The manifolds of the oil flow data set discovered using the two local abstraction approaches are shown in Fig. 19. The first column corresponds to the results of the global GTMs learned from the AGH-based abstraction and the second column corresponds to those learned from the abstraction derived using minimum cut clustering. For the global GTMs learned from the AGH-based abstraction, it can be observed that data with different labels are still mingled together in the upper part of the visualized GTM even when the number of local components at each local source increases to 50 (see Fig. 19(e)). For those learned with the minimum cut abstraction, however, the circles and the triangles are well separated when only 10 local components are used at each source, as shown in Fig. 19(b). The quality of local manifold unfolding is further improved

Fig. 18. Manifold discovery results on the S-curve data set given different local model quality requirements q and different settings of non-uniform data
partitioning for seven sources. The three settings are P1 = {0.33, 0.01, 0.06, 0.12, 0.09, 0.15, 0.24}, P2 = {0.07, 0.15, 0.25, 0.28, 0.17, 0.07, 0.005} and
P3 = {0.17, 0.13, 0.21, 0.12, 0.26, 0.09, 0.01} respectively.

Fig. 19. Visualization of the oil flow data using GTMs with different approaches to local abstraction, namely the AGH-based and the graph-based (minimum cut) approach. The posterior means of the projected data of the three different configurations, namely homogeneous, annular and stratified, are labeled as circles, triangles and asterisks, respectively.

when the number of local components increases to 20, as shown in Fig. 19(d). When the number of local components per source increases to 50, the performance of the GTMs obtained becomes essentially equivalent to that of the original GTM.

5. Conclusion and future work

In this paper, a unified framework for learning generative models from abstracted data was proposed and shown to be effective for distributed data mining tasks such as clustering and manifold discovery. With the local data abstracted as Gaussian mixture models, EM algorithms that learn directly from the distributed and abstracted data were derived for global GMMs and GTMs respectively. The proposed framework is shown to be effective and able to achieve significant speed-up. Also, the effects of factors such as the number of local data sources, the uniformity of the local abstractions and the quality level of the local abstractions were carefully studied and compared. The similar mathematical formulations obtained for learning the two models hint at the potential of extending the proposed approach to other LVMs. However, to apply the learning-from-abstraction idea to other types of data mining models (e.g., k-means), the distance measure between the model components and each data item should be replaced by a new distance measure which can reflect the "distance" between the model components and each abstracted data component. To what extent the accuracy will be affected and the speed-up can be achieved with the abstracted input remains to be examined both theoretically and empirically.
Another limitation of the proposed approach is that it requires all the participating local sources to possess additional computational power, storage and communication capability for obtaining, storing and transmitting the local abstraction information. Some ubiquitous computing environments may find this requirement hard to satisfy.
In addition, while the proposed approach provides a mechanism to control the level of local data abstraction for the data mining tasks to be performed, there is still a subsequent need to assist the user in managing such an abstraction-based data mining environment. Further research effort will be needed to better quantify the quality of local data abstraction so that it can be interpreted and used more intuitively. Also, it will be interesting to see how the proposed approach can be made autonomous, with the global broker and the local data sources actively negotiating the individual local model quality requirements in a need-to-know manner.

Acknowledgement

The authors thank the anonymous reviewers for their valuable comments, which helped improve the paper. This work was partially supported by a Competitive Earmarked Research Grant from the Research Grants Council of Hong Kong (Project No. HKBU 210206).

Appendix A. Detailed derivation of the modified EM algorithm

A.1. GMM as the global model

The expected log likelihood for learning a global GMM model is

L = \sum_{k=1}^{M}\sum_{l=1}^{|C|} \alpha_l R_{lk}\,\ln\frac{a_k}{2^{\frac{d+1}{2}}\,|\Sigma_k|^{\frac{1}{2}}\,|\Sigma_l|^{\frac{1}{2}}} \;-\; \frac{1}{2}\sum_{k=1}^{M}\sum_{l=1}^{|C|} \alpha_l R_{lk}\left(\mathrm{tr}\!\left(\Sigma_k^{-1}\Sigma_l\right) + (\mu_l-\mu_k)^{T}\Sigma_k^{-1}(\mu_l-\mu_k)\right) \qquad (A1)
where the constant terms are removed. Incorporating the constraint $\sum_{k=1}^{M} a_k = 1$ with a Lagrange multiplier $\lambda$, we modify $L$ as
L_k = \sum_{k=1}^{M}\sum_{l=1}^{|C|} \alpha_l R_{lk}\left(\ln\frac{a_k}{2^{\frac{d+1}{2}}\,|\Sigma_k|^{\frac{1}{2}}\,|\Sigma_l|^{\frac{1}{2}}} - \frac{1}{2}\left(\mathrm{tr}(\Sigma_k^{-1}\Sigma_l) + (\mu_l-\mu_k)^{T}\Sigma_k^{-1}(\mu_l-\mu_k)\right)\right) - \lambda\left(\sum_{k=1}^{M} a_k - 1\right)

E-step: We first compute the KL divergence [12] between the lth local Gaussian component pl and the kth global Gaussian
component pg which is defined as
D(p_l(t|\theta_l)\,\|\,p_g(t|\phi_k)) = \int_t p_l(t|\theta_l)\,\ln\frac{p_l(t|\theta_l)}{p_g(t|\phi_k)}\,dt = \int_t p_l(t|\theta_l)\ln p_l(t|\theta_l)\,dt - \int_t p_l(t|\theta_l)\ln p_g(t|\phi_k)\,dt \qquad (A2)

The second term of Eq. (A2) can be computed as


-\int_t (2\pi)^{-\frac{d}{2}}|\Sigma_l|^{-\frac{1}{2}}\exp\!\left(-\frac{1}{2}(t-\mu_l)^{T}\Sigma_l^{-1}(t-\mu_l)\right)\left(\ln\!\left((2\pi)^{-\frac{d}{2}}|\Sigma_k|^{-\frac{1}{2}}\right) - \frac{1}{2}(t-\mu_k)^{T}\Sigma_k^{-1}(t-\mu_k)\right)dt

= \ln\!\left((2\pi)^{\frac{d}{2}}|\Sigma_k|^{\frac{1}{2}}\right) + \frac{1}{2}\int_t (t-\mu_k)^{T}\Sigma_k^{-1}(t-\mu_k)\,(2\pi)^{-\frac{d}{2}}|\Sigma_l|^{-\frac{1}{2}}\exp\!\left(-\frac{1}{2}(t-\mu_l)^{T}\Sigma_l^{-1}(t-\mu_l)\right)dt.
According to [15], a closed-form solution exists, given as

= \ln\!\left((2\pi)^{\frac{d}{2}}|\Sigma_k|^{\frac{1}{2}}\right) + \frac{1}{2}\left(\mathrm{tr}(\Sigma_k^{-1}\Sigma_l) + (\mu_l-\mu_k)^{T}\Sigma_k^{-1}(\mu_l-\mu_k)\right).
Similarly, the first term of Eq. (A2) can be obtained as
-\ln\!\left((2\pi)^{\frac{d}{2}}|\Sigma_l|^{\frac{1}{2}}\right) - \frac{d}{2}.
Thus, Eq. (A2) (i.e., the KL divergence) can be rewritten as

-\ln\!\left((2\pi)^{\frac{d}{2}}|\Sigma_l|^{\frac{1}{2}}\right) - \frac{d}{2} + \ln\!\left((2\pi)^{\frac{d}{2}}|\Sigma_k|^{\frac{1}{2}}\right) + \frac{1}{2}\left(\mathrm{tr}(\Sigma_k^{-1}\Sigma_l) + (\mu_l-\mu_k)^{T}\Sigma_k^{-1}(\mu_l-\mu_k)\right)

= \ln\frac{|\Sigma_k|^{\frac{1}{2}}}{|\Sigma_l|^{\frac{1}{2}}} + \frac{1}{2}\left(\mathrm{tr}(\Sigma_k^{-1}\Sigma_l) - d\right) + \frac{1}{2}(\mu_l-\mu_k)^{T}\Sigma_k^{-1}(\mu_l-\mu_k) \qquad (A3)

and Rlk can then be estimated accordingly.
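As a concrete illustration of this E-step, the sketch below (Python/NumPy; our implementation was in Matlab) evaluates the closed-form KL divergence of Eq. (A3) and turns it into soft assignments R_lk. The normalized-exponential form R_lk proportional to a_k exp(-D(p_l || p_k)) used here is an assumption made for illustration; the exact E-step definition is given in the main text.

import numpy as np

def kl_gauss(mu_l, Sig_l, mu_k, Sig_k):
    # KL( N(mu_l, Sig_l) || N(mu_k, Sig_k) ), the closed form of Eq. (A3).
    d = mu_l.shape[0]
    Sig_k_inv = np.linalg.inv(Sig_k)
    diff = mu_l - mu_k
    log_det = 0.5 * (np.linalg.slogdet(Sig_k)[1] - np.linalg.slogdet(Sig_l)[1])
    return (log_det
            + 0.5 * (np.trace(Sig_k_inv @ Sig_l) - d)
            + 0.5 * diff @ Sig_k_inv @ diff)

def responsibilities(local_comps, global_comps):
    # R[l, k]: soft assignment of local component l to global component k,
    # assumed here to be proportional to a_k * exp(-KL).
    R = np.zeros((len(local_comps), len(global_comps)))
    for l, (_, mu_l, Sig_l) in enumerate(local_comps):
        for k, (a_k, mu_k, Sig_k) in enumerate(global_comps):
            R[l, k] = np.log(a_k) - kl_gauss(mu_l, Sig_l, mu_k, Sig_k)
        R[l] = np.exp(R[l] - R[l].max())    # numerically stable normalization
        R[l] /= R[l].sum()
    return R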


M-step: We first take the derivative of $L_k$ w.r.t. $\mu_k$ and set it to zero, which gives

\frac{\partial L_k}{\partial \mu_k} = \sum_{l=1}^{|C|}\alpha_l R_{lk}\,(\mu_l-\mu_k)^{T}\Sigma_k^{-1} = 0
\quad\Longrightarrow\quad
\mu_k = \frac{\sum_{l=1}^{|C|}\alpha_l R_{lk}\,\mu_l}{\sum_{l=1}^{|C|}\alpha_l R_{lk}} \qquad (A4)

In a similar manner, we take the derivative of $L_k$ w.r.t. $\Sigma_k$ and set it to zero, resulting in

\frac{\partial L_k}{\partial \Sigma_k} = \sum_{l=1}^{|C|}\alpha_l R_{lk}\left(-\frac{1}{2}\Sigma_k^{-T} + \frac{1}{2}\Sigma_k^{-T}\Sigma_l^{T}\Sigma_k^{-T} + \frac{1}{2}\Sigma_k^{-T}(\mu_l-\mu_k)(\mu_l-\mu_k)^{T}\Sigma_k^{-T}\right) = 0

\Sigma_k\sum_{l=1}^{|C|}\alpha_l R_{lk} = \sum_{l=1}^{|C|}\alpha_l R_{lk}\left((\mu_l-\mu_k)(\mu_l-\mu_k)^{T} + \Sigma_l\right) = \sum_{l=1}^{|C|}\alpha_l R_{lk}\left(\mu_l\mu_l^{T} - \mu_k\mu_l^{T} - \mu_l\mu_k^{T} + \mu_k\mu_k^{T} + \Sigma_l\right) \qquad (A5)

By substituting Eq. (A4) back into Eq. (A5), we have


\Sigma_k = \frac{\sum_{l=1}^{|C|}\alpha_l R_{lk}\left(\Sigma_l + \mu_l\mu_l^{T}\right)}{\sum_{l=1}^{|C|}\alpha_l R_{lk}} - \mu_k\mu_k^{T} \qquad (A6)

For $a_k$, we start with

\frac{\partial L_k}{\partial a_k} = \sum_{l=1}^{|C|}\alpha_l R_{lk}\,\frac{1}{a_k} - \lambda = 0

a_k = \frac{1}{\lambda}\sum_{l=1}^{|C|}\alpha_l R_{lk}
and incorporate the constraint $\sum_{k=1}^{M} a_k = 1$ so that we get $\lambda = \sum_{l=1}^{|C|}\sum_{k=1}^{M}\alpha_l R_{lk} = 1$, and thus

a_k = \sum_{l=1}^{|C|}\alpha_l R_{lk}
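Putting Eqs. (A4), (A6) and the $a_k$ update together, one M-step of the modified EM for the global GMM can be sketched as follows (Python/NumPy, building on the responsibilities sketch above; a sketch only, with local components represented as (alpha_l, mu_l, Sigma_l) tuples and the function name hypothetical):

import numpy as np

def gmm_m_step(local_comps, R):
    # M-step of the modified EM: Eqs. (A4), (A6) and a_k = sum_l alpha_l R_lk.
    alphas = np.array([a for a, _, _ in local_comps])          # (L,)
    mus = np.array([mu for _, mu, _ in local_comps])           # (L, d)
    Sigs = np.array([S for _, _, S in local_comps])            # (L, d, d)
    w = alphas[:, None] * R                                    # alpha_l * R_lk, (L, K)
    new_comps = []
    for k in range(R.shape[1]):
        wk = w[:, k]
        denom = wk.sum()
        mu_k = (wk[:, None] * mus).sum(axis=0) / denom                          # Eq. (A4)
        second = (wk[:, None, None]
                  * (Sigs + mus[:, :, None] * mus[:, None, :])).sum(axis=0) / denom
        Sig_k = second - np.outer(mu_k, mu_k)                                   # Eq. (A6)
        new_comps.append((denom, mu_k, Sig_k))                                  # a_k = sum_l alpha_l R_lk
    return new_comps

Alternating responsibilities(...) and gmm_m_step(...) until the expected log likelihood stabilizes gives the learning-from-abstraction procedure for the global GMM.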

A.2. GTM as the global model

The expected log likelihood function for a global GTM is

L = \sum_{k=1}^{M}\sum_{l=1}^{|C|}\alpha_l R_{lk}\,\ln\frac{\beta^{\frac{d}{2}}}{(2\pi)^{\frac{d}{2}}} \;-\; \frac{1}{2}\sum_{k=1}^{M}\sum_{l=1}^{|C|}\alpha_l R_{lk}\left(\beta\,\mathrm{tr}(\Sigma_l) - d + \beta\,(\mu_l - y(z_k;W))^{T}(\mu_l - y(z_k;W))\right)

E-step: Again, we need to compute the KL divergence between the lth local Gaussian component and the global Gaussian
component corresponding to each latent variable. According to Eq. (18), a univariate Gaussian distribution is assumed in
GTM for all the global Gaussian components and the data variance is captured by $1/\beta$ instead of a set of covariance matrices $\{\Sigma_k\}$. Other than that, there is no major difference from the case of learning a global GMM. Thus, by referring
to Eq. (A3), it can easily be shown that the KL divergence can be computed as

D(p_{local}\,\|\,p_{gtm}) = \ln\frac{\beta^{-\frac{d}{2}}}{|\Sigma_l|^{\frac{1}{2}}} + \frac{1}{2}\left(\beta\,\mathrm{tr}(\Sigma_l) - d\right) + \frac{1}{2}\beta\,(y(z_k;W) - \mu_l)^{T}(y(z_k;W) - \mu_l) \qquad (A7)

M-step: The estimates of $W$ and $\beta$ can be obtained by setting the corresponding derivatives of $L$ to zero, resulting in

\frac{\partial L}{\partial W} = \beta\sum_{k=1}^{M}\sum_{l=1}^{|C|}\alpha_l R_{lk}\,(\mu_l - W\phi(z_k))\,\phi(z_k)^{T} = 0
\quad\Longrightarrow\quad
\sum_{k=1}^{M}\sum_{l=1}^{|C|}\alpha_l R_{lk}\,(W\phi(z_k) - \mu_l)\,\phi(z_k)^{T} = 0 \qquad (A8)

and
\frac{1}{\beta} = \frac{1}{d}\cdot\frac{\sum_{k=1}^{M}\sum_{l=1}^{|C|}\alpha_l R_{lk}\left(\mathrm{tr}\!\left(\Sigma_l + \mu_l\mu_l^{T}\right) - (W\phi(z_k))^{T}(W\phi(z_k))\right)}{\sum_{k=1}^{M}\sum_{l=1}^{|C|} R_{lk}} = \frac{1}{|C|\,d}\sum_{k=1}^{M}\sum_{l=1}^{|C|}\alpha_l R_{lk}\left(\mathrm{tr}\!\left(\Sigma_l + \mu_l\mu_l^{T}\right) - (W\phi(z_k))^{T}(W\phi(z_k))\right)
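For completeness, a sketch of the corresponding GTM M-step is given below (Python/NumPy). Phi denotes the (K x n_basis) matrix of basis-function values phi(z_k) on the latent grid, and local_comps and R are as in the GMM sketches above. Solving Eq. (A8) as a weighted least-squares problem and normalizing the 1/beta update by the alpha-weighted responsibility mass are implementation choices made here for illustration, not details taken from our code.

import numpy as np

def gtm_m_step(local_comps, R, Phi, reg=1e-6):
    alphas = np.array([a for a, _, _ in local_comps])           # (L,)
    mus = np.array([mu for _, mu, _ in local_comps])            # (L, d)
    tr_sig = np.array([np.trace(S) for _, _, S in local_comps]) # tr(Sigma_l)
    d = mus.shape[1]
    w = alphas[:, None] * R                                     # alpha_l * R_lk, (L, K)
    g = w.sum(axis=0)                                           # (K,)
    # Normal equations implied by Eq. (A8): W (Phi^T G Phi) = M^T Phi,
    # where G = diag(g) and row k of M is sum_l w_lk * mu_l.
    A = Phi.T @ (g[:, None] * Phi) + reg * np.eye(Phi.shape[1])
    B = (w.T @ mus).T @ Phi                                     # (d, n_basis)
    W = np.linalg.solve(A.T, B.T).T                             # W = B A^{-1}
    Y = Phi @ W.T                                               # y(z_k; W), (K, d)
    sq_err = ((mus[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    inv_beta = (w * (tr_sig[:, None] + sq_err)).sum() / (d * w.sum())
    return W, 1.0 / inv_beta

At the updated W, the residual form tr(Sigma_l) + ||mu_l - W phi(z_k)||^2 used here agrees with the expanded form tr(Sigma_l + mu_l mu_l^T) - ||W phi(z_k)||^2 appearing above, up to the choice of normalization noted in the lead-in.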

References

[1] CLUTO: a clustering toolkit. Available from: <http://www.cs.umn.edu/karypis>.


[2] WebKB. Available from: <http://www.cs.cmu.edu/WebKB>.
[3] N.R. Adam, J.C. Wortmann, Security-control methods for statistical databases: a comparative study, ACM Computing Survey 21 (4) (1989) 515–556.
[4] C.C. Aggarwal, P.S. Yu, A condensation approach to privacy preserving data mining. in: Proceedings of the 9th International Conference on Extending
Database Technology, Heraklion-Crete, Greece, March 2004, pp. 183–199.
[5] R.M. Aliguliyev, Performance evaluation of density-based clustering methods, Information Sciences 179 (20) (2009) 3583–3602.
[6] J. Allan, Introduction to topic detection and tracking, in: Topic Detection and Tracking: Event-based Information Organization, Kluwer Academic
Publishers, Norwell, MA, USA, 2002, pp. 1–16.
[7] B. Gilburd, A. Schuster, R. Wolff, k-TTP: a new privacy model for large-scale distributed environments, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, August 2004, pp. 563–568.
[8] E. Bertino, B.C. Ooi, Y. Yang, R.H. Deng, Privacy and ownership preserving of outsourced medical data, in: Proceedings of the Twenty-first International
Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, 2005, pp. 521–532.
[9] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[10] C.M. Bishop, M. Svensén, C.K.I. Williams, GTM: The generative topographic mapping, Neural Computation 10 (1) (1998) 215–235.
[11] R. Chen, S. Krishnamoorthy, A new algorithm for learning parameters of a Bayesian network from distributed data, in: Proceedings of the 2002 IEEE
International Conference on Data Mining, Maebashi City, Japan, December 2002, pp. 585–588.
[12] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[13] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B
(Methodological) 39 (1) (1977) 1–38.
[14] M. Ester, H. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the
Second International Conference on Knowledge Discovery and Data Mining, Portland, 1996, pp. 226–231.
[15] K.-T. Fang, Y.-T. Zhang, Generalized Multivariate Analysis, Springer-Verlag, Berlin, 1990.
[16] M. Figueiredo, A. Jain, Unsupervised learning of finite mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (3) (2002)
381–396.
[17] C. Fraley, A. Raftery, Model-based clustering discriminant analysis and density estimation, Journal of the American Statistical Association 97 (458)
(2002) 611–631.
[18] B.C.M. Fung, K. Wang, P.S. Yu, Anonymizing classification data for privacy preservation, IEEE Transactions on Knowledge and Data Engineering 19 (5) (2007) 711–725.
[19] S. Han, W.K. Ng, L. Wan, V.C.S. Lee, Privacy-preserving gradient-descent methods, IEEE Transactions on Knowledge and Data Engineering 22 (6) (2010)
884–899.
[20] C.-L. Hsu, Y.-H. Chuang, A novel user identification scheme with key distribution preserving user anonymity for distributed computer networks,
Information Sciences 179 (4) (2009) 422–429.
[21] E. Januzaj, H.-P. Kriegel, M. Pfeifle, Scalable density-based distributed clustering, in: Proceedings of the Eighth European Conference on Principles and
Practice of Knowledge Discovery in Databases, New York, NY, USA, 2004, pp. 231–244.
[22] G.J. McLachlan, K.E. Basford, Mixture Models – Inference and Applications to Clustering, Marcel Dekker, New York, 1988.
[23] M. Jordan, Learning in Graphical Models, MIT Press, 1998.
[24] M. Kantarcioglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, IEEE Transactions on
Knowledge and Data Engineering 16 (9) (2004) 1026–1037.
[25] H. Kargupta, B. Park, D. Hershberger, E. Johnson, Collective data mining: a new perspective towards distributed data mining, in: H. Kargupta, P. Chan
(Eds.), Advances in Distributed and Parallel Knowledge Discovery, MIT/AAAI Press, 2000, pp. 133–184.
[26] S.-W. Kim, S. Park, J.-I. Won, S.-W. Kim, Privacy preserving data mining of sequential patterns for network traffic data, Information Sciences 178 (3)
(2008) 694–713.
[27] L.V.S. Lakshmanan, R.T. Ng, G. Ramesh, To do or not to do: the dilemma of disclosing anonymized data, in: Proceedings of the 2005 ACM SIGMOD
International Conference on Management of Data, ACM Press, New York, NY, USA, 2005, pp. 61–72.
[28] N. Li, T. Li, S. Venkatasubramanian, Closeness: a new privacy measure for data publishing, IEEE Transactions on Knowledge and Data Engineering 22 (7)
(2010) 943–956.
[29] B. Liu, W.S. Lee, P.S. Yu, X. Li, Partially supervised classification of text documents, in: Proceedings of the Nineteenth International Conference on
Machine Learning, San Francisco, CA, USA, 2002, pp. 387–394.
[30] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of Fifth Berkeley Symposium on
Mathematical Statistics and Probability, University of California Press, Berkeley, 1967, pp. 281–297.
[31] S. Merugu, J. Ghosh, Privacy-preserving distributed clustering using generative models, in: Proceedings of the Third IEEE International Conference on
Data Mining, Melbourne, FL, November 2003, pp. 211–218.
[32] F. Pernkopf, D. Bouchaffra, Genetic-based EM algorithm for learning Gaussian mixture models, IEEE Transactions on Pattern Analysis and Machine
Intelligence 27 (8) (2005) 1344–1348.
[33] A. Prodromidis, P. Chan, Meta-learning in distributed data mining systems: issues and approaches, in: H. Kargupta, P. Chan (Eds.), Advances of
Distributed Data Mining, MIT/AAAI Press, 2000.
[34] G. Qiu, Indexing chromatic and achromatic patterns for content-based colour image retrieval, Pattern Recognition 35 (8) (2002) 1675–1686.
[35] L.R. Rabiner, B.H. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine, 1986, pp. 4–15.
[36] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.

[37] D. Sacharidis, K. Mouratidis, D. Papadias, k-anonymity in the presence of external databases, IEEE Transactions on Knowledge and Data Engineering 22
(3) (2010) 392–403.
[38] Y. Sang, H. Shen, H. Tian, Privacy-preserving tuple matching in distributed databases, IEEE Transactions on Knowledge and Data Engineering 21 (11)
(2009) 1767–1782.
[39] D. Shah, S. Zhong, Two methods for privacy preserving data mining with malicious participants, Information Sciences 177 (23) (2007) 5468–5483.
[40] M.-L. Shyu, C. Haruechaiyasak, S.-C. Chen, Category cluster discovery from distributed directories, Information Sciences 155 (3–4) (2003) 181–197.
[41] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, in: Proceedings of KDD-2000 Workshop on Text Mining, Boston,
MA, USA, 2000, pp. 109–111.
[42] L. Sweeney, k-anonymity: a model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10 (5) (2002) 557–570.
[43] J. Tenenbaum, V. de Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[44] J. Vaidya, C. Clifton, Privacy preserving k-means clustering over vertically partitioned data, in: The Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, August 2003, pp. 206–215.
[45] J. Verbeek, N. Vlassis, B. Kröse, Efficient greedy learning of Gaussian mixture models, Neural Computation 15 (2) (2003) 469–485.
[46] J. Wishart, The generalised product moment distribution in samples from a normal multivariate population, Biometrika 20A (1–2) (1928) 32–52.
[47] Z. Wu, R. Leahy, An optimal graph theoretic approach to data clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (11) (1993)
1101–1113.
[48] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: H.V. Jagadish, I.S. Mumick (Eds.),
Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4–6, 1996, ACM Press, 1996,
pp. 103–114.
[49] X. Zhang, W.K. Cheung, Learning global models based on distributed data abstractions, in: Proceedings of International Joint Conference on Artificial
Intelligence, Edinburgh, August 2005, pp. 1645–1646.
[50] X. Zhang, W.K. Cheung, Visualizing global manifold based on distributed local data abstractions, in: Proceedings of the Fifth IEEE International
Conference on Data Mining, IEEE Computer Society, Washington, DC, USA, 2005, pp. 821–824.
[51] X. Zhang, C. Lam, W.K. Cheung, Mining local data sources for learning global cluster models via local model exchange, The IEEE Intelligent Informatics
Bulletin 4 (2) (2004) 16–22.
[52] S. Zhong, Z. Yang, T. Chen, k-anonymous data collection, Information Sciences 179 (17) (2009) 2948–2963.
