Abstract
We describe GAMBLR, a Gibbs annealing scheme for multilinear models. Our scheme combines Gibbs sampling with an annealing procedure similar to that of simulated annealing. Under a Gaussian likelihood and an inverse-gamma prior, we show that this Gibbs annealing scheme can be performed via a simple adaptive noise injection procedure, where noise and prior regularization are added to the data and then classical Ordinary Least Squares regression is carried out. This results in an extremely simple procedure with a full Bayesian interpretation.
\[ X \approx f(V_1, \ldots, V_K). \]
We consider here nonlinear models that are multilinear, meaning that if we freeze all sets of variables but one, then the model is linear in the remaining set of variables. This property must hold for each of the sets of variables. If we define $g_{\bar{k}}(\tilde{V}_k) = f(V_1, \ldots, V_{k-1}, \tilde{V}_k, V_{k+1}, \ldots, V_K)$, then $g_{\bar{k}}(\tilde{V}_k)$ needs to be a linear function of $\tilde{V}_k$, for all $k$. A prototypical example of this kind of modeling is low-rank matrix approximation, where we write
\[ X \approx V W^T, \]
2 Alternating least squares
A typical approach to such multilinear problems is the technique of alternating least squares (REF). Given that each problem is linear when the other variables are held fixed, we can repeatedly solve the sequence of linear problems by least-squares linear regression. Given some initial condition $V_1^{(0)}, \ldots, V_K^{(0)}$, we iteratively and sequentially solve the least-squares problems:
\[ x = V w + \sigma \varepsilon, \]
\[ \hat{w} = (V^T V)^{-1} V^T x = V^\dagger x, \]
standard treatment, we select as likelihood a multivariate normal distribution
(which is naturally associated to our quadratic error model):
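For the prior, we need a distribution $p(w, \sigma^2)$, which we decompose via the product rule as $p(w, \sigma^2) = p(w \mid \sigma^2)\, p(\sigma^2)$, with $p(w \mid \sigma^2) = \mathcal{N}(\mu_0, \sigma^2 \Lambda_0^{-1})$ normally distributed given $\sigma^2$ and parametrized by the prior mean $\mu_0$ and the prior precision matrix $\Lambda_0$, and with $p(\sigma^2) = \mathcal{J}(a_0, b_0)$, an inverse-gamma distribution:
\[
\begin{aligned}
p(w, \sigma^2) &= p(w \mid \sigma^2)\, p(\sigma^2) \\
&= \mathcal{N}(\mu_0, \sigma^2 \Lambda_0^{-1})\, \mathcal{J}(a_0, b_0).
\end{aligned}
\]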
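Given this choice of conjugate likelihood and prior, the posterior distribution is of the same form:
\[
\begin{aligned}
p(w, \sigma^2 \mid x, V) &= p(w \mid \sigma^2, x, V)\, p(\sigma^2 \mid x, V) \\
&= \mathcal{N}(\mu_n, \sigma^2 \Lambda_n^{-1})\, \mathcal{J}(a_n, b_n),
\end{aligned}
\]
with
\[
\begin{aligned}
\mu_n &= (V^T V + \Lambda_0)^{-1} (V^T x + \Lambda_0 \mu_0), \\
\Lambda_n &= V^T V + \Lambda_0, \\
a_n &= a_0 + n/2, \\
b_n &= b_0 + \tfrac{1}{2} (x^T x + \mu_0^T \Lambda_0 \mu_0 - \mu_n^T \Lambda_n \mu_n).
\end{aligned}
\]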
Since $\Lambda_0$ is a precision matrix, it admits a decomposition of the form $\Lambda_0 = L_0^T L_0$ with $L_0$ square (for example, a Cholesky decomposition). If we define $\bar{V} = [V; L_0]$ as the vertical concatenation of the two matrices, we have that $\Lambda_n = \bar{V}^T \bar{V}$. If we define
\[ \bar{x} = \begin{pmatrix} x \\ L_0 \mu_0 \end{pmatrix}, \]
we have that
\[ \mu_n = (\bar{V}^T \bar{V})^{-1} \bar{V}^T \bar{x} = \bar{V}^\dagger \bar{x}. \]
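The equivalence between the regularized posterior mean and ordinary least squares on augmented data can be checked numerically. The following is a minimal sketch in Python with NumPy; the toy dimensions, the prior values, and all variable names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: n observations, d regressors.
n, d = 50, 3
V = rng.normal(size=(n, d))
x = V @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Prior mean and a symmetric positive-definite prior precision (assumed values).
mu0 = np.array([0.5, 0.0, -0.5])
Lambda0 = 2.0 * np.eye(d) + 0.3 * np.ones((d, d))

# Direct posterior formulas.
Lambda_n = V.T @ V + Lambda0
mu_n = np.linalg.solve(Lambda_n, V.T @ x + Lambda0 @ mu0)

# NumPy returns the lower factor C with Lambda0 = C C^T, so L0 = C^T
# satisfies Lambda0 = L0^T L0.
L0 = np.linalg.cholesky(Lambda0).T

# Augmented design matrix and augmented data vector.
V_bar = np.vstack([V, L0])
x_bar = np.concatenate([x, L0 @ mu0])

# Ordinary least squares on the augmented problem.
mu_aug, *_ = np.linalg.lstsq(V_bar, x_bar, rcond=None)

print(np.allclose(V_bar.T @ V_bar, Lambda_n))
print(np.allclose(mu_aug, mu_n))
```

Both checks compare the augmented quantities against the direct formulas, so the prior acts exactly like a few extra pseudo-observations appended to the data.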
\[
\begin{aligned}
&= x^T x - x^T V V^\dagger x \\
&= x^T x - x^T (V^\dagger)^T V^T V V^\dagger x \\
&= x^T x - \mu^T \Lambda \mu,
\end{aligned}
\]
where $\mu$ and $\Lambda$ denote the versions of $\mu_n$ and $\Lambda_n$ without regularization ($\Lambda_0 = 0$). Hence, $b_n$ behaves as a regularized version of the sum-of-squares error. The mode of the inverse-gamma distribution $\mathcal{J}(a_n, b_n)$ is given by $b_n/(a_n + 1)$, which closely resembles the maximum likelihood estimate of the variance in Ordinary Least Squares: $\hat{\sigma}^2 = \hat{\varepsilon}^T \hat{\varepsilon} / n$. Thus, the sampling of $p(\sigma^2 \mid x, V) = \mathcal{J}(a_n, b_n)$ is essentially centered around the residual error of the model.
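The relation between the inverse-gamma mode and the Ordinary Least Squares variance estimate can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses the standard conjugate update with a factor 1/2 in front of the quadratic terms of $b_n$, a deliberately weak prior, and made-up toy data; none of the names below come from the original text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy regression with known noise level.
n, d = 200, 3
sigma_true = 0.5
V = rng.normal(size=(n, d))
x = V @ np.array([1.0, -2.0, 0.5]) + sigma_true * rng.normal(size=n)

# Weak prior (assumed values), so that the data dominate.
mu0 = np.zeros(d)
Lambda0 = 1e-6 * np.eye(d)
a0, b0 = 1e-3, 1e-3

# Conjugate posterior parameters (b_n with the standard factor 1/2).
Lambda_n = V.T @ V + Lambda0
mu_n = np.linalg.solve(Lambda_n, V.T @ x + Lambda0 @ mu0)
a_n = a0 + n / 2
b_n = b0 + 0.5 * (x @ x + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)

# Mode of the inverse-gamma posterior of the variance.
mode = b_n / (a_n + 1)

# Maximum likelihood variance estimate from Ordinary Least Squares.
w_hat, *_ = np.linalg.lstsq(V, x, rcond=None)
resid = x - V @ w_hat
sigma2_ml = resid @ resid / n

# If g ~ Gamma(shape=a, scale=1/b), then 1/g ~ InverseGamma(a, b).
sigma2_draws = 1.0 / rng.gamma(shape=a_n, scale=1.0 / b_n, size=10_000)

print(mode, sigma2_ml, sigma2_draws.mean())
```

With a weak prior the mode is roughly the residual sum of squares divided by $n + 2$, which is why it tracks the maximum likelihood estimate closely for moderate $n$.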
Note also that, in some Bayesian linear regressions, the response variables come in matrix rather than vector form:
\[ X = V W + \sigma E, \]
where we draw $V_k^{(i)}$ from its conditional distribution, conditioned on the current values of the other blocks of variables. Provided that the scheme forms an irreducible and aperiodic Markov chain (REF), it will converge towards drawing dependent samples from the joint distribution $p(V_1, \ldots, V_K)$. The early part of the sequence, where samples are not yet considered to be drawn from the joint distribution, is called the burn-in. Depending on whether the length of the burn-in is acceptable or not, the Markov chain is deemed well or poorly mixing.
If we have a multilinear model $X \approx f(V_1, \ldots, V_K)$, we can then set up a block Gibbs sampler for the posterior distribution $p(V_1, \sigma_1, \ldots, V_K, \sigma_K \mid X)$ by chaining Bayesian linear regressions over the different blocks (where $\bar{k}$ denotes the estimates for the linear model resulting from freezing all the other blocks of variables and the "$\cdot$" notation denotes the fact that when $V$ is a matrix we perform one regression per column):
\[
\begin{aligned}
V_{\cdot k}^{(i)} &\sim \mathcal{N}\big(V;\ \mu_{\cdot \bar{k}}^{(i)},\ (\sigma^2)_{\cdot \bar{k}}^{(i)} \, (\Lambda_{\cdot \bar{k}}^{(i)})^{-1}\big), \\
(\sigma^2)_{\bar{k}}^{(i)} &\sim \mathcal{J}\big(a_{\bar{k}}^{(i)}, b_{\bar{k}}^{(i)}\big).
\end{aligned}
\]
The precise setup of the regressions will depend on the specific model at hand.
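As one possible instance of such a chained scheme, the sketch below sets up a block Gibbs sampler for the low-rank model $X \approx V W^T$. It is an illustration under several assumptions that are not taken from the text: a zero prior mean, an isotropic prior precision `lam0 * I`, one variance per column, the variance drawn before the coefficients within each regression, and the standard factor 1/2 in $b_n$. The function names and default values are hypothetical.

```python
import numpy as np

def sample_regression(V, X, lam0, a0, b0, rng):
    """Draw coefficients (d x m) and variances (m,) for X ~ V @ coef + noise.

    One conjugate normal-inverse-gamma regression is performed per column
    of X, with prior mean zero and prior precision lam0 * I (assumptions).
    """
    n, d = V.shape
    Lambda_n = V.T @ V + lam0 * np.eye(d)
    Mu_n = np.linalg.solve(Lambda_n, V.T @ X)  # posterior means, one per column
    a_n = a0 + n / 2
    # Per-column b_n = b0 + (x^T x - mu_n^T Lambda_n mu_n) / 2 (prior mean zero).
    quad = np.sum(Mu_n * (Lambda_n @ Mu_n), axis=0)
    b_n = b0 + 0.5 * (np.sum(X * X, axis=0) - quad)
    # Inverse-gamma draws: reciprocal of Gamma(shape=a_n, scale=1/b_n).
    sigma2 = 1.0 / rng.gamma(shape=a_n, scale=1.0 / b_n)
    # Colour standard normal draws so each column has covariance sigma2_j * Lambda_n^{-1}.
    C = np.linalg.cholesky(np.linalg.inv(Lambda_n))
    coef = Mu_n + (C @ rng.normal(size=Mu_n.shape)) * np.sqrt(sigma2)
    return coef, sigma2

def gibbs_matrix_factorization(X, rank, n_iter=200, lam0=1.0,
                               a0=1e-2, b0=1e-2, seed=0):
    """Alternate Bayesian regressions over the blocks W and V of X ~ V W^T."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    V = rng.normal(size=(n, rank))
    samples = []
    for _ in range(n_iter):
        # Block W: X = V W^T, one regression per column of X.
        Wt, _ = sample_regression(V, X, lam0, a0, b0, rng)
        W = Wt.T
        # Block V: X^T = W V^T, one regression per column of X^T.
        Vt, _ = sample_regression(W, X.T, lam0, a0, b0, rng)
        V = Vt.T
        samples.append((V.copy(), W.copy()))
    return samples

# Hypothetical rank-2 test matrix with additive Gaussian noise.
rng = np.random.default_rng(42)
V_true = rng.normal(size=(30, 2))
W_true = rng.normal(size=(20, 2))
X = V_true @ W_true.T + 0.1 * rng.normal(size=(30, 20))

samples = gibbs_matrix_factorization(X, rank=2, n_iter=200)
V_last, W_last = samples[-1]
rel_err = np.linalg.norm(X - V_last @ W_last.T) / np.linalg.norm(X)
print(rel_err)
```

Because the two factors are only identified up to an invertible transformation, the reconstruction $V W^T$ is a more meaningful quantity to monitor than the individual factors.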
If this sampling scheme is irreducible and aperiodic, then it is guaranteed to converge in distribution to some equilibrium distribution. Given that the normal distribution has a non-zero probability of sampling any point of the space, and that the inverse-gamma distribution has a non-zero probability of sampling any positive value, all possible paths have a non-zero probability, thereby guaranteeing irreducibility and aperiodicity. In fact, this property corresponds to that of an ergodic Markov chain. Note, however, that this does not provide any guarantee that the sampling scheme will have good mixing properties. Nevertheless, given that we sample from a limited number of large blocks, we can expect the sampling scheme to mix fairly well. If so, after the initial period, the successive samples from this scheme will be dependent samples from the desired posterior distribution $p(V_1, \sigma_1, \ldots, V_K, \sigma_K \mid X)$.
Although all the conditional distributions considered (if we ignore the vari-
ances) are normally distributed, the resulting joint distribution cannot be nor-
mally distributed because the corresponding estimation model is nonlinear. It is
also unclear what the corresponding noise model is (even though the conditional
models correspond to isotropic Gaussian noise).
Thus, lowering the temperature corresponds to reducing the variance linearly
with the temperature parameter.
Going back to the Bayesian linear regression problem, once we have com-
puted the posterior distribution, we now need to sample from it. Taking the
notation from the previous section, we want at each step to sample from some
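multivariate normal distribution $\mathcal{N}(\mu_n, \sigma^2 T \Lambda_n^{-1})$ with
\[
\begin{aligned}
\mu_n &= (\bar{V}^T \bar{V})^{-1} \bar{V}^T \bar{w} = \bar{V}^\dagger \bar{w}, \\
\Lambda_n &= \bar{V}^T \bar{V}.
\end{aligned}
\]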
\[ z = \bar{V}^\dagger (\bar{w} + \sigma \sqrt{T}\, \varepsilon). \]
This means that each step of the Gibbs sampler consists of solving the original Ordinary Least Squares problem from the Alternating Least Squares strategy, simply injecting the appropriate level of noise directly on the data! We still need to sample $\sigma^2$ at the end of each iteration, but this is only a one-dimensional parameter that can easily be sampled from the appropriate inverse-gamma distribution.
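The claim that least squares on noise-perturbed data yields draws from the tempered normal distribution can be checked empirically. The sketch below assumes standard normal noise scaled by $\sigma \sqrt{T}$, which is the scaling that gives a covariance of $\sigma^2 T \Lambda_n^{-1}$; the random matrices stand in for the augmented design matrix and data vector, and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the augmented design matrix and augmented data vector.
n, d = 40, 2
V_bar = rng.normal(size=(n, d))
w_bar = rng.normal(size=n)
sigma, T = 0.7, 0.5  # assumed noise level and temperature

V_pinv = np.linalg.pinv(V_bar)
mu_n = V_pinv @ w_bar
Lambda_n = V_bar.T @ V_bar
target_cov = sigma**2 * T * np.linalg.inv(Lambda_n)

# Noise injection: perturb the data, then solve ordinary least squares.
n_draws = 20_000
eps = rng.normal(size=(n, n_draws))
Z = V_pinv @ (w_bar[:, None] + sigma * np.sqrt(T) * eps)  # one draw per column

emp_mean = Z.mean(axis=1)
emp_cov = np.cov(Z)

print(np.abs(emp_mean - mu_n).max())
print(np.abs(emp_cov - target_cov).max())
```

The key identity is $\bar{V}^\dagger (\bar{V}^\dagger)^T = (\bar{V}^T \bar{V})^{-1}$ for a full-column-rank $\bar{V}$, so isotropic noise on the data maps to the posterior covariance of the coefficients.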
Solving the least-squares problem can be classically done through the use of
the pseudoinverse. Alternatively, we can also consider the following approach
that replaces the computation of the pseudoinverse, by the solution of the normal
system of linear equations:
which is equivalent to
\[ \bar{V} z = \bar{w} + \sigma \sqrt{T}\, \varepsilon \]
solve the linear system of equations for $z$, which then gives the next sample from the Gibbs sampler. We then repeat this for each of the multilinear components until the Gibbs sampling scheme converges.
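A small sketch of this alternative follows: the noisy data are formed once, and the sample is obtained by solving the normal equations rather than by forming the pseudoinverse. The helper name `sample_by_normal_equations`, the $\sigma \sqrt{T}$ noise scaling, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def sample_by_normal_equations(V_bar, w_noisy):
    """Solve (V_bar^T V_bar) z = V_bar^T w_noisy for z."""
    return np.linalg.solve(V_bar.T @ V_bar, V_bar.T @ w_noisy)

rng = np.random.default_rng(3)

# Stand-ins for the augmented design matrix and augmented data vector.
V_bar = rng.normal(size=(25, 4))
w_bar = rng.normal(size=25)
sigma, T = 0.3, 1.0  # assumed noise level and temperature

# Inject noise on the data once, then solve the same problem three ways.
w_noisy = w_bar + sigma * np.sqrt(T) * rng.normal(size=w_bar.shape)

z_normal = sample_by_normal_equations(V_bar, w_noisy)
z_pinv = np.linalg.pinv(V_bar) @ w_noisy
z_lstsq, *_ = np.linalg.lstsq(V_bar, w_noisy, rcond=None)

print(np.allclose(z_normal, z_pinv), np.allclose(z_normal, z_lstsq))
```

For a well-conditioned design matrix the three routes agree to numerical precision. Forming $\bar{V}^T \bar{V}$ squares the condition number, so a least-squares solver may be the safer choice when the design matrix is ill-conditioned.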
For the variance parameter, we propose to replace sampling by an iterated
conditional mode procedure (REF), where the mode of the inverse gamma dis-
tribution J (a, b) is given by b/(a + 1). Given that the noise parameter is scalar
while the normal distribution is multidimensional, we do not expect the greedi-
ness of the noise estimation to significantly affect the procedure. We also expect
that the iterated conditional mode estimation of the noise parameter will allow
us to monitor its convergence. Given that we sample from a
limited number of blocks, we expect that the convergence of the noise parameter
can be used to assess convergence of the Markov chain. Once this convergence
has been attained, we switch from burn-in to the stationary Gibbs sampling
regime. We can then collect dependent samples from the posterior distribution
to model the uncertainty of our parameter estimation. After enough samples
have been collected, we can then switch to an annealing regime, for example
$T(s) = \alpha^s$ with $\alpha$ smaller than one but close to one and $s$ indexing the
full rounds of the sampling scheme. As $s \to \infty$, $T \to 0$, and if the cooling
is sufficiently slow, the sampling scheme will now converge to the maximum a
posteriori solution of the probabilistic multilinear model.
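The effect of the geometric cooling schedule can be seen on a single linear block. The sketch below is a toy illustration only: it holds the noise level fixed, uses a hypothetical cooling rate `alpha = 0.95`, assumes the $\sigma \sqrt{T}$ noise scaling, and tracks how far each noise-injected sample lies from the noise-free least-squares solution, which plays the role of the conditional mode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the augmented design matrix and augmented data vector.
V_bar = rng.normal(size=(30, 3))
w_bar = rng.normal(size=30)
sigma = 1.0   # noise level, held fixed in this toy example
alpha = 0.95  # hypothetical cooling rate, smaller than one

V_pinv = np.linalg.pinv(V_bar)
z_mode = V_pinv @ w_bar  # noise-free least-squares solution

distances = []
for s in range(300):
    T = alpha**s  # geometric cooling T(s) = alpha^s
    noise = sigma * np.sqrt(T) * rng.normal(size=w_bar.shape)
    z = V_pinv @ (w_bar + noise)
    distances.append(np.linalg.norm(z - z_mode))

print(np.mean(distances[:10]), np.mean(distances[-10:]))
```

In the full multilinear setting the blocks interact, so the cooling would typically have to be much slower than in this single-block example for the scheme to settle near a good mode.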
7 Examples
Work out the specific example of probabilistic matrix factorization and canonical
polyadic decomposition?
8 Noise injection
The proposed technique of annealed noise injection is specific to our multilin-
ear setting. However, it provides an intriguing general idea. When learning a
model, if a model estimate has a given noise level, then the uncertainty on the
parameters of the model is directly dependent on the level of the noise. This is
also the idea of model-based bootstrapping, where bootstrap copies of the data
are generated by resampling residuals from the model estimate (REF) and then
reestimating the parameters on those bootstrap copies to estimate the variabil-
ity of the parameter estimates. Given the model estimate, all bootstrap copies
are equally likely. Here, a similar idea would be that if a model estimate has
a certain noise level, it should not be able to distinguish between copies of the
data where a similar level of noise has been injected. Note that we cannot use
model-based bootstrap for learning because the best model that can be learned
from the bootstrap copies is the underlying model that was used to generate
them, which means that no learning would take place. However, by injecting
the noise directly on the original data, learning is possible since the model will
still try to fit the data. As we go along the learning process, the error of the
model will start to decrease despite the noise injection. As the error decreases,
the noise level will decrease accordingly, and with it the injected noise. We can
hope that this will converge to some noise level that corresponds to the best
achievable error by the model. Then, we can start annealing the noise injection
further to converge to an optimal estimate. Note that if the model is properly
regularized, the annealing of the noise will not lead to overfitting. Conversely,
if the model is not regularized and prone to overfitting, the non-annealed ver-
sion should provide a form of regularization. It would however not provide an
optimal solution, although taking the average of the solutions after convergence
of the noise might provide a good point estimate.
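The feedback loop between model error and injected noise can be illustrated on a plain linear regression. This is a toy sketch and not the multilinear procedure itself: the model starts from a deliberately poor estimate, the noise level is re-estimated from the residuals on the original data at each round, and noise at that level is injected before refitting. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear model with true noise level 0.3.
n, d = 100, 3
V = rng.normal(size=(n, d))
w_true = np.array([1.5, -1.0, 0.5])
x = V @ w_true + 0.3 * rng.normal(size=n)

w = np.zeros(d)  # poor initial model estimate
noise_levels = []
for _ in range(50):
    # Estimate the current noise level from residuals on the original data.
    resid = x - V @ w
    sigma_hat = np.sqrt(resid @ resid / n)
    noise_levels.append(sigma_hat)
    # Inject noise at that level on the original data and refit.
    x_noisy = x + sigma_hat * rng.normal(size=n)
    w, *_ = np.linalg.lstsq(V, x_noisy, rcond=None)

print(noise_levels[0], noise_levels[-1])
```

In this example the estimated noise level starts near the scale of the data and settles close to the true noise level, after which the injected noise could be annealed further.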
Although we cannot expect that this procedure will in general correspond to
a true Bayesian scheme, we can nevertheless hope that it provides a simple, yet
interesting procedure to search for a global optimum of the learning problem.