Tensor Decompositions for Learning Latent Variable Models
By A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky. Journal of Machine Learning Research 15 (2014)
What’s a Latent Variable Model?
● The problem we want to solve:
○ We never directly observe outcomes of the hidden (latent) variable h
○ The distribution of the data we observe depends on h
○ We want to estimate the model parameters that generated the data
Previous approaches
● Exact MLE:
○ Maximize the likelihood of the observed data
○ Usually intractable because the latent space H is huge (mainly in sequence problems); NP-hard for latent trees or topic models
● Expectation Maximization:
○ Procedure:
- Initialize all the parameters \theta_0
- Expectation step: under the current parameters and the observed data, compute the expected likelihood
- Maximization step: find new parameters that make this expected likelihood higher
- Iterate until convergence
○ May converge slowly and has no global optimality guarantee (heuristic)
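The EM procedure above can be made concrete. As an illustration only (a 1-D mixture of two Gaussians, not one of the paper's models), a minimal EM loop in NumPy; the data, seed, and initialization are all assumptions of this sketch:

```python
import numpy as np

# Illustrative EM for a 1-D mixture of two Gaussians (synthetic data).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

# Initialize all the parameters theta_0: weights, means, variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior responsibilities under the current parameters.
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: parameters that increase the expected likelihood.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

Note that the result depends on the initialization of mu: EM only climbs to a local optimum, which is exactly the weakness the slide points out.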
Method of Moments
● General framework to estimate model parameters:
○ Compute some statistics of the data (empirical)
○ Impose that the model produces the same population quantities
○ Simple example: if the model is Gaussian, set the model mean equal to the empirical mean and the model variance equal to the empirical variance
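A minimal sketch of the Gaussian example above: the empirical first and second moments directly give the model parameters (synthetic data and seed are illustrative):

```python
import numpy as np

# Method of moments for a Gaussian: match model moments to empirical moments.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_hat = x.mean()                      # model mean := empirical mean
var_hat = ((x - mu_hat) ** 2).mean()   # model variance := empirical variance
```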
What’s a Tensor?
● A pth order tensor is T \in \mathbb{R}^{d_1 \times \cdots \times d_p} (usually d_1 = \cdots = d_p = d)
● The pth tensor power of a vector v is v^{\otimes p}, with entries (v^{\otimes p})_{i_1 \cdots i_p} = v_{i_1} \cdots v_{i_p}
● Maybe you don’t realize it, but you have already done this: vectors are 1st order tensors, matrices are 2nd order, and v^{\otimes 2} = v v^T
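In NumPy terms (an illustrative sketch), the pth tensor power is just an iterated outer product:

```python
import numpy as np

# The pth tensor power of a vector via iterated outer products.
v = np.array([1.0, 2.0, 3.0])

v2 = np.multiply.outer(v, v)    # v ⊗ v == v v^T, shape (3, 3)
v3 = np.multiply.outer(v2, v)   # v ⊗ v ⊗ v, a 3rd order tensor, shape (3, 3, 3)

# Each entry is the product of the corresponding vector coordinates.
assert v3[0, 1, 2] == v[0] * v[1] * v[2]
```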
Generalize definitions from Matrices
● Symmetric tensor: invariant to any permutation of its indices
● Rank of a tensor: smallest k such that T = \sum_{i=1}^k u_i \otimes v_i \otimes w_i (for p = 3)
It can exceed the dimension!!
● Symmetric rank: smallest k such that T = \sum_{i=1}^k \lambda_i v_i^{\otimes p}
It might not be finite for a general tensor! (needs to be defined over \mathbb{C})
Removing the best rank-1 approximation might increase the rank of the residual!
● Orthogonally decomposable: T = \sum_{i=1}^k \lambda_i v_i^{\otimes p} with v_1, \dots, v_k orthonormal
Such a decomposition always exists for symmetric matrices (p = 2) but maybe not for symmetric tensors!
With matrices, the power method efficiently finds the term with the largest \lambda_i; deflation then finds the others. How to do this for general tensors?
● Under mild regularity, if a tensor is orthogonally decomposable, its decomposition is UNIQUE! (up to sign changes if p is even)
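To make the definition concrete, a small sketch that builds an orthogonally decomposable 3rd order tensor from an orthonormal basis and checks that it is symmetric; the dimensions, seed, and eigenvalues are illustrative choices:

```python
import numpy as np

# Build an orthogonally decomposable 3rd order tensor T = sum_i lambda_i v_i^{⊗3}.
d = 3
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # orthonormal columns v_1, ..., v_d
lam = np.array([3.0, 2.0, 1.0])

T = np.zeros((d, d, d))
for i in range(d):
    v = Q[:, i]
    T += lam[i] * np.einsum("a,b,c->abc", v, v, v)

# T is symmetric: invariant under any permutation of its indices.
assert np.allclose(T, T.transpose(1, 0, 2))
assert np.allclose(T, T.transpose(2, 1, 0))
```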
Tensors as multilinear maps
● We can take the product of p matrices V_1, \dots, V_p with a pth order tensor T:
[T(V_1, \dots, V_p)]_{i_1 \cdots i_p} = \sum_{j_1, \dots, j_p} T_{j_1 \cdots j_p} (V_1)_{j_1 i_1} \cdots (V_p)_{j_p i_p}
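A sketch of this multilinear map with `np.einsum` (shapes and names are illustrative):

```python
import numpy as np

# The multilinear map T(V1, V2, V3) for a 3rd order tensor, via einsum.
rng = np.random.default_rng(3)
T = rng.normal(size=(4, 4, 4))
V1, V2, V3 = (rng.normal(size=(4, 2)) for _ in range(3))

out = np.einsum("abc,ai,bj,ck->ijk", T, V1, V2, V3)  # shape (2, 2, 2)

# Special case: contracting with vectors gives a scalar, T(u, u, u).
u = rng.normal(size=4)
s = np.einsum("abc,a,b,c->", T, u, u, u)
```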
Tensor Eigenvectors/Eigenvalues
● For a 3rd order tensor, u is an eigenvector with eigenvalue \lambda if T(I, u, u) = \lambda u. This definition brings properties different from those of matrices:
A linear combination of eigenvectors might not be an eigenvector!
Even if all eigenvalues are different, the eigenvectors might not be unique!
The power iteration can converge to ANY eigenvector (not only the one with the largest eigenvalue)
● Robust eigenvector u: repeated power iteration started anywhere in a small neighborhood of u converges to u
Orthogonal decomposition and eigenvectors
● If T = \sum_{i=1}^k \lambda_i v_i^{\otimes 3} is orthogonally decomposable with \lambda_i > 0, the robust eigenvectors of T are exactly v_1, \dots, v_k, with eigenvalues \lambda_1, \dots, \lambda_k
Now remember: we will express our Latent Variable Model parameter estimation as a tensor decomposition, which in turn will be solvable by this generalized power method!
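The generalized power method can be sketched as follows: on a synthetic orthogonally decomposable tensor, the iteration u → T(I, u, u)/‖T(I, u, u)‖ converges to one of the robust eigenvectors v_i (the construction and seed are illustrative):

```python
import numpy as np

# Tensor power iteration on an orthogonally decomposable 3rd order tensor.
rng = np.random.default_rng(4)
d = 3
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # true eigenvectors v_i
lam = np.array([3.0, 2.0, 1.0])
T = sum(lam[i] * np.einsum("a,b,c->abc", Q[:, i], Q[:, i], Q[:, i])
        for i in range(d))

u = rng.normal(size=d)
u /= np.linalg.norm(u)
for _ in range(50):
    u = np.einsum("abc,b,c->a", T, u, u)  # T(I, u, u)
    u /= np.linalg.norm(u)

# u now matches one of the columns of Q; WHICH one depends on the start point.
overlaps = np.abs(Q.T @ u)
assert np.isclose(overlaps.max(), 1.0, atol=1e-6)
```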
Back to the bag of words
● Represent each word x_t as a d-dimensional one-hot vector: x_t = e_j if the t-th word is word j of the vocabulary
● In the single-topic model the moments are then highly structured: M_2 = E[x_1 \otimes x_2] = \sum_{i=1}^k w_i \, \mu_i \otimes \mu_i and M_3 = E[x_1 \otimes x_2 \otimes x_3] = \sum_{i=1}^k w_i \, \mu_i^{\otimes 3}, where w_i is the probability of topic i and \mu_i its word distribution
How to retrieve \mu_i, w_i?
Problem: the \mu_i are not an orthogonal basis, which M_2 and M_3 might not even admit!
How to find them with our previous theorem? Linear transformation! Take a whitening matrix W with W^T M_2 W = I; then \tilde{M}_3 = M_3(W, W, W) is orthogonally decomposable!
Algorithm proposed
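The algorithm figure did not survive extraction. What follows is a sketch of the recipe described on the surrounding slides — whiten with M2, run tensor power iterations with deflation on the whitened M3, then map back — not the paper's exact Algorithm 1; the function name, shapes, and details are assumptions of this sketch:

```python
import numpy as np

def decompose(M2, M3, k, iters=100, seed=0):
    """Sketch: whiten M2, then tensor power method with deflation on M3."""
    rng = np.random.default_rng(seed)
    # Whitening: W such that W.T @ M2 @ W = I_k, via the top-k eigenpairs.
    vals, vecs = np.linalg.eigh(M2)
    vals, vecs = vals[-k:], vecs[:, -k:]
    W = vecs / np.sqrt(vals)
    B = vecs * np.sqrt(vals)  # un-whitening map (pseudo-inverse of W^T)
    Mt3 = np.einsum("abc,ai,bj,ck->ijk", M3, W, W, W)  # whitened 3rd moment

    lams, mus = [], []
    for _ in range(k):
        u = rng.normal(size=k)
        u /= np.linalg.norm(u)
        for _ in range(iters):                       # power iteration
            u = np.einsum("ijk,j,k->i", Mt3, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum("ijk,i,j,k->", Mt3, u, u, u)  # eigenvalue
        lams.append(lam)
        mus.append(lam * (B @ u))                     # map back to mu_i
        Mt3 = Mt3 - lam * np.einsum("i,j,k->ijk", u, u, u)  # deflate
    return np.array(lams), np.array(mus).T
```

On exact synthetic moments M_2 = \sum_i w_i \mu_i \mu_i^T and M_3 = \sum_i w_i \mu_i^{\otimes 3}, this sketch recovers the \mu_i up to ordering (and w_i = 1/\lambda_i^2).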
Tensor Power Method convergence
● Quadratic rate of convergence to ONE of the robust eigenvectors (which one is determined by the initialization)
Implementation and Complexity
● We can compute the contribution of a document to the second/third order moments efficiently by aggregating its words into a count vector c_i, rather than summing over all ordered word pairs/triples
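The contribution formulas were lost in extraction. As an illustration of the count-vector trick (the helper name is ours): averaging x_s ⊗ x_t over ordered pairs of distinct positions in a document of length ℓ gives (c c^T − diag(c)) / (ℓ(ℓ−1)), since the x_t are one-hot:

```python
import numpy as np

def second_order_contribution(c):
    """Per-document estimate of E[x1 ⊗ x2] from a word-count vector c.

    sum_{s != t} x_s x_t^T = c c^T - diag(c)  (one-hot words),
    and there are ell * (ell - 1) ordered pairs of distinct positions.
    """
    ell = c.sum()
    return (np.outer(c, c) - np.diag(c)) / (ell * (ell - 1))

# Usage: a 4-word vocabulary, document with words [0, 0, 2, 3].
c = np.array([2.0, 0.0, 1.0, 1.0])
M2_doc = second_order_contribution(c)
```

The third-order contribution admits an analogous correction for repeated positions; either way, the cost depends on the number of distinct words, not on all word triples.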
More interesting models I
More interesting models II
More interesting models III
More interesting models IV
More interesting models ...
● Many more in this paper:
○ Multi-view models
○ Mixtures of axis-aligned Gaussians
○ More mixtures of Gaussians
○ Hidden Markov Models
○ Yours?
Wrap-up
● Many latent variable models have highly structured low-order moments:
○ Express a moment as a weighted sum of pth tensor powers of its parameters.
○ If we find this decomposition, we can estimate the parameters.
○ Usually this decomposition is not orthogonal, so we apply a linear transformation (whitening) to use the power-method theory.
● The method has good guarantees:
○ Converges quadratically to the global optimum
○ Can cope with repeated eigenvalues
○ With enough random restarts we can bound the error due to perturbations of the tensor
● The method still looks very expensive, but:
○ There is no need to construct the tensors explicitly
○ The data is very sparse and we can use randomized linear algebra
○ And it can be parallelized!