
Tensor Decompositions for Learning Latent Variable Models

By A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade and M. Telgarsky. Journal of Machine Learning Research 15 (2014)

1
What’s a Latent Variable Model?
● The problem we want to solve:
○ We never directly observe outcomes of the latent variable $h$
○ The distribution of the data we observe depends on $h$: $x \sim p(x \mid h)$
○ We want to estimate the model parameters that generated the data

● Example: Exchangeable Single Topic Models.

Generative model for each text document:
○ First sample a topic $h$ out of the $[k]$ possible, each with probability $w_h$
○ Define the probability of the $t$-th word being word $i \in [d]$ as $\Pr[x_t = e_i \mid h] = (\mu_h)_i$
○ Then sample $\ell$ words i.i.d. to form the ‘bag of words’

We observe only the documents → want to estimate the $w_i$ and the $\mu_i$ (a sampling sketch follows below)
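A minimal sketch of this generative process in Python/numpy; the parameter names `w` (topic probabilities) and `mu` (columns are per-topic word distributions) are mine, chosen to match the notation above.

```python
import numpy as np

def sample_document(w, mu, ell, rng=None):
    """Sample one 'bag of words' document from the exchangeable single topic model.

    w   : (k,) topic probabilities
    mu  : (d, k) matrix whose column j is the word distribution of topic j
    ell : number of words in the document
    """
    rng = np.random.default_rng() if rng is None else rng
    h = rng.choice(len(w), p=w)                             # sample the latent topic h
    words = rng.choice(mu.shape[0], size=ell, p=mu[:, h])   # ell words drawn i.i.d. given h
    return words, h
```

From many such documents we only ever see `words`, never `h`; the estimation problem is to recover `w` and `mu`.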

2
Previous approaches
● Exact MLE: (NP-hard for latent trees or topic models)
○ Maximize the likelihood of the observed data
○ Usually intractable because the sum over the hidden variable $h$ is huge (mainly in sequence problems)
● Expectation Maximization:
○ Procedure:
- Initialize all the parameters $\theta_0$
- Expectation step: under these parameters and the observed data, compute the expected
likelihood
- Maximization step: find new parameters that make this expected likelihood higher
- Iterate until convergence
○ May converge slowly and has no global guarantees (heuristic). (A minimal EM sketch follows below.)
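As an illustration, a minimal EM sketch for the single topic (mixture of multinomials) model of the previous slide; `C` is a hypothetical documents-by-words count matrix, and the updates are the standard mixture-of-multinomials ones, not code from the paper.

```python
import numpy as np

def em_single_topic(C, k, n_iters=100, seed=0, eps=1e-12):
    """EM for the single topic (mixture of multinomials) model.

    C : (n_docs, d) matrix of word counts per document
    Returns topic probabilities w (k,) and word distributions mu (d, k)."""
    rng = np.random.default_rng(seed)
    n, d = C.shape
    w = np.full(k, 1.0 / k)                          # initialize the parameters theta_0
    mu = rng.dirichlet(np.ones(d), size=k).T         # (d, k) random word distributions
    for _ in range(n_iters):
        # E-step: posterior probability of each topic for each document
        log_r = np.log(w + eps) + C @ np.log(mu + eps)        # (n, k)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: parameters that increase the expected complete-data likelihood
        w = r.mean(axis=0)
        mu = C.T @ r                                  # expected word counts per topic
        mu /= mu.sum(axis=0, keepdims=True)
    return w, mu
```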

3
Method of Moments
● General framework to estimate model parameters:
○ Compute some (empirical) statistics of the data
○ Impose that the model produces the same population quantities
(Simple example: if the model is Gaussian, say that its mean is equal to the empirical mean and that its variance is equal to the empirical variance; a tiny sketch follows below.)

● Exploit structure of lower order moments:


○ In the general latent variable case this could be just as hopeless as exact MLE
○ In many models (like the Exchangeable Single Topic model), there is considerable structure in the
lower order moments! We will describe efficient methods to solve it:

Extracting orthogonal decomposition of Tensors
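Making the side example above concrete, a tiny moment-matching sketch for a univariate Gaussian (an assumption chosen for illustration only, not the paper's setting): the model moments are set equal to their empirical counterparts.

```python
import numpy as np

def gaussian_method_of_moments(x):
    """Method of moments for N(mu, sigma^2): match E[x] and E[x^2] to empirical values."""
    mu_hat = x.mean()                                # E[x] = mu
    sigma2_hat = (x ** 2).mean() - mu_hat ** 2       # E[x^2] - E[x]^2 = sigma^2
    return mu_hat, sigma2_hat
```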

4
What’s a Tensor?
● A $p$-th order tensor is an array $T \in \mathbb{R}^{n_1 \times \cdots \times n_p}$ (usually $n_1 = \cdots = n_p = d$)
● The $p$-th power of a vector $v$ is $v^{\otimes p}$, with $(v^{\otimes p})_{i_1 \cdots i_p} = v_{i_1} \cdots v_{i_p}$
● Maybe you don’t realize it, but you have already worked with tensors: a matrix is a 2nd order tensor, and $vv^{\top} = v^{\otimes 2}$!

5
Generalize definitions from Matrices
● Symmetric tensor: invariant to index permutations
● Rank of a tensor: smallest $k$ such that $T = \sum_{i=1}^{k} u_i^{(1)} \otimes \cdots \otimes u_i^{(p)}$
It can exceed the dimension!!
● Symmetric rank: smallest $k$ such that $T = \sum_{i=1}^{k} \lambda_i\, v_i^{\otimes p}$
It might not be finite for a general tensor! (it needs to be defined on symmetric tensors)
Removal of the best rank-1 approximation might increase the rank of the residual!
● Orthogonally decomposable: $T = \sum_{i=1}^{k} \lambda_i\, v_i^{\otimes p}$ with $\{v_i\}$ orthonormal
Such a decomposition exists for every symmetric matrix ($p=2$) but maybe not for symmetric tensors!
With matrices, we obtain the term with the largest $|\lambda_i|$ efficiently by the power method, then
deflate to find the others (see the sketch below). How can we do this for general tensors?
● Under mild regularity conditions, if a tensor is orthogonally decomposable, its
decomposition is UNIQUE! (up to sign changes if $p$ is even)
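To make the matrix case concrete before generalizing it, a small numpy sketch of power iteration plus deflation for a symmetric matrix (the $p=2$ case the slide refers to); the function name and defaults are mine.

```python
import numpy as np

def matrix_power_deflation(M, k, n_iters=200, seed=0):
    """Recover the top-k terms of M = sum_i lam_i v_i v_i^T by power iteration + deflation.
    Assumes the lam_i we want are positive and well separated."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    lams, vecs = [], []
    for _ in range(k):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        for _ in range(n_iters):                 # u <- M u / ||M u||
            u = M @ u
            u /= np.linalg.norm(u)
        lam = u @ M @ u                          # Rayleigh quotient
        lams.append(lam)
        vecs.append(u)
        M = M - lam * np.outer(u, u)             # deflate the recovered rank-1 term
    return np.array(lams), np.column_stack(vecs)
```

The next slides generalize exactly this recipe, a power-type iteration plus deflation, to third order tensors.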

6
Tensors as multilinear maps
● We can take the product of $p$ matrices $V_1, \dots, V_p$ with a $p$-th order tensor:
$T(V_1, \dots, V_p)_{i_1 \cdots i_p} = \sum_{j_1, \dots, j_p} T_{j_1 \cdots j_p} (V_1)_{j_1 i_1} \cdots (V_p)_{j_p i_p}$

● Some simple examples:


○ For matrices ($p = 2$): $M(A, B) = A^{\top} M B$
○ Retrieving one entry: $T(e_{i_1}, \dots, e_{i_p}) = T_{i_1 \cdots i_p}$
● Useful for this study: $p = 3$. If $T$ is orthogonally decomposable, $T = \sum_i \lambda_i\, v_i^{\otimes 3}$, then

$T(I, I, u) = \sum_i \lambda_i (v_i^{\top} u)\, v_i v_i^{\top}$ ;  $T(I, u, u) = \sum_i \lambda_i (v_i^{\top} u)^2\, v_i$

This generalizes the definition of EIGENVECTOR: $u$ is an eigenvector of $T$ with eigenvalue $\lambda$ if $T(I, u, u) = \lambda u$ (a numpy sketch follows below)
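A numpy sketch of these two multilinear maps using `einsum` (function names are mine), together with a check of the closed forms stated above for an orthogonally decomposable tensor.

```python
import numpy as np

def T_I_I_u(T, u):
    """T(I, I, u): contract the last mode of a 3rd-order tensor with u (gives a matrix)."""
    return np.einsum('ijk,k->ij', T, u)

def T_I_u_u(T, u):
    """T(I, u, u): contract the last two modes with u (gives a vector)."""
    return np.einsum('ijk,j,k->i', T, u, u)

# Check against the closed forms for T = sum_i lam_i v_i^{(x)3} with orthonormal v_i
rng = np.random.default_rng(0)
d, k = 5, 3
lam = rng.uniform(1, 2, size=k)
V = np.linalg.qr(rng.standard_normal((d, k)))[0]                 # orthonormal columns v_i
T = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)
u = rng.standard_normal(d)
print(np.allclose(T_I_u_u(T, u), V @ (lam * (V.T @ u) ** 2)))    # sum_i lam_i <v_i,u>^2 v_i
print(np.allclose(T_I_I_u(T, u), (V * (lam * (V.T @ u))) @ V.T)) # sum_i lam_i <v_i,u> v_i v_i^T
```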

7
Tensor Eigenvectors/Eigenvalues
● The previous definition brings properties different from the matrix case:
A linear combination of eigenvectors might not be an eigenvector!
Even if all eigenvalues are different, the eigenvectors might not be unique! Ex: for $T = \sum_i \lambda_i v_i^{\otimes 3}$ as above, $u \propto v_1/\lambda_1 + v_2/\lambda_2$ is also an eigenvector
The power iteration (ascent on the generalized Rayleigh quotient) can converge to ANY eigenvector (not only the one with the largest eigenvalue)

● Robust eigenvector $u$: there is an $\epsilon > 0$ such that repeated iteration of the map $\theta \mapsto T(I,\theta,\theta)\,/\,\lVert T(I,\theta,\theta)\rVert$, started from any $\theta$ in the $\epsilon$-ball around $u$, converges to $u$

Time for the theorems...

8
Orthogonal decomposition and eigenvectors

[Theorems (4) and (6).]

Now remember: we will express our Latent Variable Model parameter estimation as a
tensor decomposition, which in turn will be solvable by this generalized power method!
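For reference, an informal summary of what these theorems give us (stated here from the paper, in the notation of the previous slides):

```latex
T = \sum_{i=1}^{k} \lambda_i\, v_i^{\otimes 3},\qquad
\langle v_i, v_j\rangle=\delta_{ij},\ \lambda_i>0
\;\Longrightarrow\;
\begin{cases}
\text{the robust eigenvectors of } T \text{ are exactly } v_1,\dots,v_k \text{ (with eigenvalues } \lambda_i\text{)},\\[2pt]
u \mapsto T(I,u,u)\,/\,\lVert T(I,u,u)\rVert \text{ converges to some } v_i \text{ from almost every start,}\\[2pt]
\text{and deflating } T \leftarrow T - \lambda_i\, v_i^{\otimes 3} \text{ recovers the remaining terms one by one.}
\end{cases}
```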

9
Back to the bag of words
● Represent each word $x_t$ as a $d$-dimensional one-hot vector: $x_t = e_i$ if the $t$-th word is word $i$

● The second and third moments are the tensors (the first is a matrix):
$M_2 := \mathbb{E}[x_1 \otimes x_2]$, $\quad M_3 := \mathbb{E}[x_1 \otimes x_2 \otimes x_3]$

● The conditional expectations are: $\mathbb{E}[x_t \mid h] = \mu_h$

● Hence we can prove that: $M_2 = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i$, $\quad M_3 = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i \otimes \mu_i$ (see the derivation below)
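A one-line derivation of the claim, using that the words are conditionally i.i.d. given the topic $h$ and that $\mathbb{E}[x_t \mid h] = \mu_h$:

```latex
M_2 = \mathbb{E}[x_1 \otimes x_2]
    = \mathbb{E}\big[\,\mathbb{E}[x_1 \mid h] \otimes \mathbb{E}[x_2 \mid h]\,\big]
    = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i,
\qquad
M_3 = \mathbb{E}[x_1 \otimes x_2 \otimes x_3]
    = \sum_{i=1}^{k} w_i\, \mu_i \otimes \mu_i \otimes \mu_i .
```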

10
How to retrieve the $\mu_i$ and the $w_i$?
Problem: the $\mu_i$ are not an orthogonal basis, and $M_2$, $M_3$ might not even admit an orthogonal decomposition!
How to find them with our previous theorem? A linear transformation!

$M_2$ is a symmetric PSD matrix: it diagonalizes in an orthonormal basis, $M_2 = U D U^{\top}$. Take $W = U D^{-1/2}$, so that $W^{\top} M_2 W = I$.

Define: $\tilde{\mu}_i := \sqrt{w_i}\, W^{\top} \mu_i$. They are orthonormal: $\sum_i \tilde{\mu}_i \tilde{\mu}_i^{\top} = W^{\top} M_2 W = I$

Define: $\tilde{M}_3 := M_3(W, W, W) = \sum_i w_i\, (W^{\top}\mu_i)^{\otimes 3} = \sum_i \tfrac{1}{\sqrt{w_i}}\, \tilde{\mu}_i^{\otimes 3}$ ⇒ Orthogonally decomposable! (a numerical check follows below)
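A small numpy check of the two claims above on a synthetic model; the specific `w` and `mu` are placeholders drawn at random.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 3
w = rng.dirichlet(np.ones(k))                    # topic probabilities
mu = rng.dirichlet(np.ones(d), size=k).T         # columns: word distributions mu_i

M2 = np.einsum('i,ai,bi->ab', w, mu, mu)         # sum_i w_i mu_i mu_i^T
M3 = np.einsum('i,ai,bi,ci->abc', w, mu, mu, mu) # sum_i w_i mu_i^{(x)3}

eigvals, U = np.linalg.eigh(M2)
top = np.argsort(eigvals)[::-1][:k]
W = U[:, top] / np.sqrt(eigvals[top])            # W = U_k D_k^{-1/2}

mu_t = np.sqrt(w) * (W.T @ mu)                   # mu~_i = sqrt(w_i) W^T mu_i
print(np.allclose(W.T @ M2 @ W, np.eye(k)))      # True: whitened M2 is the identity
print(np.allclose(mu_t.T @ mu_t, np.eye(k)))     # True: the mu~_i are orthonormal
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)  # whitened third moment M3(W, W, W)
T_check = np.einsum('i,ai,bi,ci->abc', 1 / np.sqrt(w), mu_t, mu_t, mu_t)
print(np.allclose(T, T_check))                   # True: T = sum_i w_i^{-1/2} mu~_i^{(x)3}
```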

11
Algorithm proposed
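A minimal numpy sketch of the pipeline the deck describes (whiten $M_2$, run the tensor power method with deflation on the whitened $M_3$, then undo the whitening); it omits the robustness refinements of the paper's Algorithm 1, and all names and defaults are mine.

```python
import numpy as np

def whiten(M2, k):
    """W such that W.T @ M2 @ W = I_k (M2 assumed close to PSD with rank >= k)."""
    eigvals, U = np.linalg.eigh(M2)
    top = np.argsort(eigvals)[::-1][:k]
    return U[:, top] / np.sqrt(eigvals[top])         # W = U_k D_k^{-1/2}

def tensor_power_method(T, n_restarts=20, n_iters=100, seed=0):
    """Recover {(lam_i, v_i)} from T ~ sum_i lam_i v_i^{(x)3} by power iteration + deflation."""
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    lams, vecs = [], []
    for _ in range(k):
        best_u, best_lam = None, -np.inf
        for _ in range(n_restarts):                  # random restarts
            u = rng.standard_normal(k)
            u /= np.linalg.norm(u)
            for _ in range(n_iters):                 # u <- T(I, u, u) / ||T(I, u, u)||
                u = np.einsum('abc,b,c->a', T, u, u)
                u /= np.linalg.norm(u)
            lam = np.einsum('abc,a,b,c->', T, u, u, u)
            if lam > best_lam:
                best_u, best_lam = u, lam
        lams.append(best_lam)
        vecs.append(best_u)
        T = T - best_lam * np.einsum('a,b,c->abc', best_u, best_u, best_u)   # deflation
    return np.array(lams), np.column_stack(vecs)

def single_topic_from_moments(M2, M3, k):
    """Estimate (w, mu) from M2 = sum_i w_i mu_i mu_i^T and M3 = sum_i w_i mu_i^{(x)3}."""
    W = whiten(M2, k)                                # whitening matrix
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)  # whitened, orthogonally decomposable tensor
    lam, V = tensor_power_method(T)                  # eigenpairs, with lam_i = 1/sqrt(w_i)
    w = 1.0 / lam ** 2
    mu = np.linalg.pinv(W.T) @ V * lam               # mu_i = lam_i * (W^T)^+ v_i
    return w, mu
```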

12
Tensor Power Method convergence
● Quadratic rate of convergence to ONE of the robust eigenvectors (determined
by the initialization $u_0$; a small numerical illustration follows below):

● We actually don’t have $T$ but $T' = T + E$, where $E$ is a perturbation due to the
estimation noise and to the error propagated by the whitening $W$. Does the method still yield an
approximate solution? For their algorithm we can prove a theorem similar to
Wedin’s perturbation theorem: if we initialize at random $L$ times, run enough iterations, and $\lVert E \rVert$ is small enough, then
with probability at least $1 - \eta$ the returned eigenpairs are within $O(\lVert E \rVert)$ of the true ones
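A quick numerical illustration of the quadratic rate on a noiseless orthogonally decomposable tensor (a sketch of my own, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 4
lam = rng.uniform(1.0, 2.0, size=k)
V = np.linalg.qr(rng.standard_normal((d, k)))[0]        # orthonormal v_1..v_k
T = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)          # T = sum_i lam_i v_i^{(x)3}

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
for t in range(8):
    u = np.einsum('abc,b,c->a', T, u, u)                # u <- T(I, u, u)
    u /= np.linalg.norm(u)
    err = 1.0 - np.max(np.abs(V.T @ u))                 # distance to the nearest v_i
    print(f"iter {t}: 1 - |<v_i, u>| = {err:.3e}")      # roughly squares each step
```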

13
Implementation and Complexity
● We can compute the contribution of a document to the second/third order
moments efficiently by aggregating the data in a word-count vector $c$ (document length $\ell$):
Second order contribution: $\dfrac{c \otimes c - \mathrm{diag}(c)}{\ell(\ell-1)}$   Third order contribution: $\dfrac{c^{\otimes 3} - \sum_i c_i\,(e_i \otimes e_i \otimes c + e_i \otimes c \otimes e_i + c \otimes e_i \otimes e_i) + 2 \sum_i c_i\, e_i^{\otimes 3}}{\ell(\ell-1)(\ell-2)}$ (see the sketch below)
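A direct transcription of those two formulas in numpy, for illustration only: it materializes the $d \times d \times d$ array, which the bullets below explain one should avoid in practice.

```python
import numpy as np

def document_moment_contributions(c):
    """Unbiased per-document estimates of E[x1 (x) x2] and E[x1 (x) x2 (x) x3]
    from a word-count vector c, without materializing the one-hot word vectors.
    Assumes the document has at least 3 words."""
    c = np.asarray(c, dtype=float)
    l = c.sum()
    # pairs of distinct positions: c c^T minus the double-counted diagonal
    m2 = (np.outer(c, c) - np.diag(c)) / (l * (l - 1))
    # triples of distinct positions, by inclusion-exclusion over coinciding positions
    ccc = np.einsum('i,j,k->ijk', c, c, c)
    eec = np.einsum('ij,k->ijk', np.diag(c), c)      # positions 1 = 2
    ece = np.einsum('ik,j->ijk', np.diag(c), c)      # positions 1 = 3
    cee = np.einsum('jk,i->ijk', np.diag(c), c)      # positions 2 = 3
    eee = np.zeros((len(c),) * 3)
    idx = np.arange(len(c))
    eee[idx, idx, idx] = c                           # positions 1 = 2 = 3
    m3 = (ccc - eec - ece - cee + 2 * eee) / (l * (l - 1) * (l - 2))
    return m2, m3
```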

● Naively the operations are $O(d^3)$: even storing the third moment is prohibitive!

○ The expressions above allow us not to explicitly construct the moments!
○ The data are usually sparse: the method can run almost linearly in $d$ and in the amount of training data.
○ The matrix $W$ can also be obtained efficiently with randomized linear algebra algorithms.
● Complexity of the full algorithm:
○ Flattening the tensor into a matrix and performing an SVD is $O(k^4)$, but less stable (mainly with repeated eigenvalues)
○ The extra “1+” factor in the exponent comes from the random initializations: it can be alleviated by a better initialization.

14
More interesting models I

15
More interesting models II

16
More interesting models III

17
More interesting models IV

18
More interesting models ...
● Many more in this paper:
○ Multi-view models
○ Mixtures of Axis aligned gaussians
○ More mixtures of gaussians
○ Hidden Markov Models
○ yours?

19
Wrap-up
● Many latent variable models have highly structured lower order moments:
○ Express a moment as a weighted sum of p^th tensor powers of its parameters.
○ If we find this decomposition, we can estimate the parameters.
○ Usually this decomposition is not orthogonal so we apply a linear transformation to use the
theory about the power method.
● The method has good guarantees:
○ Converges quadratically to the global optimum
○ Can cope with repeated eigenvalues
○ With enough random reinitialization we can bound the error due to perturbations of the tensor
● The method still looks very expensive, but:
○ There is no need to explicitly construct the tensors
○ The data are typically very sparse and we can use randomized linear algebra
○ And it can be parallelized!

20
