
GAMBLR: Gibbs Annealing for Multiway Bayesian Linear Regressions


Yves Moreau, Adam Arany, Pooya Zakeri, Jaak Simm
November 2014

Abstract
We describe GAMBLR, a Gibbs annealing scheme for multilinear models. Our scheme combines a Gibbs sampling scheme with an annealing procedure similar to that of simulated annealing. Under a Gaussian likelihood and an inverse-gamma prior, we show that this Gibbs annealing scheme can be performed via a simple adaptive noise injection procedure, where noise and prior regularization are added to the data and classical Ordinary Least Squares regression is then carried out. This results in an extremely simple procedure with a full Bayesian interpretation.

1 Multiway linear regressions


We consider a data object $X$ that is a vector, matrix, or tensor of real values. We want to model this data in terms of some sets of variables $V_1, \ldots, V_K$, which are themselves organized as vectors or matrices:

$$X \approx f(V_1, \ldots, V_K).$$

We consider here the case of nonlinear models that are multilinear: if we freeze all sets of variables but one, then the model is linear in the remaining set of variables. This property must hold for each of the sets of variables. If we define $g_{\bar{k}}(\tilde{V}_k) = f(V_1, \ldots, V_{k-1}, \tilde{V}_k, V_{k+1}, \ldots, V_K)$, then $g_{\bar{k}}(\tilde{V}_k)$ needs to be a linear function of $\tilde{V}_k$, for all $k$. A prototypical example of this kind of modeling is low-rank matrix approximation, where we write

$$X \approx V W^T,$$

with $X \in \mathbb{R}^{n \times p}$, $V \in \mathbb{R}^{n \times r}$, $W \in \mathbb{R}^{p \times r}$, $r < \min(n, p)$, and $\operatorname{rank}(V) = \operatorname{rank}(W) = r$. (CHECK: do we need to impose a rank condition on V and W? Because if we do, we would need to show that our algorithms enforce this condition. /CHECK) (CHECK: does the formulation of regression below implicitly assume that p < n? /CHECK)
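To make the multilinearity concrete, here is a minimal NumPy sketch (ours, not from the text): once $W$ is frozen, recovering $V$ in $X \approx V W^T$ is an ordinary linear least-squares problem, and symmetrically for $W$ once $V$ is frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 50, 20, 3

# Noise-free low-rank data X = V W^T, purely for illustration.
V_true = rng.normal(size=(n, r))
W_true = rng.normal(size=(p, r))
X = V_true @ W_true.T

# Freeze W: X ~ V W^T is linear in V (one regression per row of X).
V_hat = np.linalg.lstsq(W_true, X.T, rcond=None)[0].T

print(np.allclose(V_hat @ W_true.T, X))  # True: the frozen-block problem is linear
```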

2 Alternating least squares
A typical approach to such multilinear problems is to use the technique of alternating least squares (REF). Given that each problem is linear when the other variables are held fixed, we can attempt to repeatedly solve the sequence of linear problems by least-squares linear regression. Given some initial condition $V_1^{(0)}, \ldots, V_K^{(0)}$, we iteratively and sequentially solve the least-squares problems

$$V_k^{(i)} = \operatorname{argmin}_{\tilde{V}_k} \left\| X - f\!\left(V_1^{(i)}, \ldots, V_{k-1}^{(i)}, \tilde{V}_k, V_{k+1}^{(i-1)}, \ldots, V_K^{(i-1)}\right) \right\|_F,$$

for $k = 1, \ldots, K$ and $i = 1, \ldots$, where the matrix norm is the sum-of-squares Frobenius norm, and using some appropriate stopping criteria to obtain a final solution. A limitation of the alternating least squares approach is that, because of the nonlinearity of the problem, there is no guarantee that the procedure will converge to the global solution of the problem:

$$(V_1^{*}, \ldots, V_K^{*}) = \operatorname{argmin}_{\tilde{V}_1, \ldots, \tilde{V}_K} \left\| X - f(\tilde{V}_1, \ldots, \tilde{V}_K) \right\|_F.$$

Also, it does not provide an estimate of the uncertainty of the solution.
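For concreteness, a minimal alternating least squares sketch for the low-rank example $X \approx V W^T$ (an illustration under our own naming, using NumPy's lstsq for each linear subproblem):

```python
import numpy as np

def als_low_rank(X, r, n_iter=100, seed=0):
    """Alternating least squares for X ~ V @ W.T (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = rng.normal(size=(n, r))
    W = rng.normal(size=(p, r))
    for _ in range(n_iter):
        # Freeze V: solve V @ W.T ~ X for W (linear in W).
        W = np.linalg.lstsq(V, X, rcond=None)[0].T
        # Freeze W: solve W @ V.T ~ X.T for V (linear in V).
        V = np.linalg.lstsq(W, X.T, rcond=None)[0].T
    return V, W

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 40)) + 0.01 * rng.normal(size=(100, 40))
V, W = als_low_rank(X, r=5)
print(np.linalg.norm(X - V @ W.T) / np.linalg.norm(X))  # small relative Frobenius error
```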

3 Bayesian linear regression


Because we want to approach the problem from the perspective of Markov chain Monte Carlo methods, we now consider the problem of linear regression from a Bayesian standpoint. Instead of searching for the least-squares estimate of the parameters, we consider their posterior distribution given the available data. We write the Bayesian linear regression in a notation slightly different from the standard one, to emphasize the basic idea of alternating linear least squares in the case of matrix factorization. Consider the following regression model:

$$x = V w + \sigma \varepsilon,$$

with $x$ an $n$-dimensional vector, $V$ an $n \times r$ matrix, $w$ an $r$-dimensional vector, and $\varepsilon \sim \mathcal{N}(0, I_n)$ an $n$-dimensional isotropic (spherical) normal noise vector, with scalar noise level $\sigma$.
The Ordinary Least Squares estimate for this regression model is given by

$$\hat{w} = (V^T V)^{-1} V^T x = V^{\dagger} x,$$

where $V^{\dagger}$ is the Moore-Penrose pseudoinverse.


For Bayesian inference, we need to set up a likelihood distribution and a prior distribution. We consider $x$ and $V$ given, and we want to estimate the posterior distribution of $w$ and $\sigma^2$. (For convenience, we use $\sigma^2$ as a variable rather than $\sigma$.) We will select a conjugate pair of distributions for the likelihood and the prior, so that the posterior can be computed analytically. Following the

standard treatment, we select as likelihood a multivariate normal distribution (which is naturally associated with our quadratic error model):

$$p(x \mid V, w, \sigma) = \mathcal{N}(x;\, V w,\, \sigma^2 I_n).$$

For the prior, we need a distribution $p(w, \sigma^2)$, which we will decompose via the product rule as $p(w, \sigma^2) = p(w \mid \sigma^2)\, p(\sigma^2)$, with $p(w \mid \sigma^2) = \mathcal{N}(\mu_0, \sigma^2 \Lambda_0^{-1})$ normally distributed given $\sigma^2$ and parametrized by the prior mean $\mu_0$ and the prior precision matrix $\Lambda_0$, and with $p(\sigma^2) = \mathcal{J}(a_0, b_0)$, an inverse-gamma distribution:

$$p(w, \sigma^2) = p(w \mid \sigma^2)\, p(\sigma^2) = \mathcal{N}(\mu_0, \sigma^2 \Lambda_0^{-1})\, \mathcal{J}(a_0, b_0).$$

Given this choice of conjugate likelihood and prior, the posterior distribution is of the same form:

$$p(w, \sigma^2 \mid x, V) = p(w \mid \sigma^2, x, V)\, p(\sigma^2 \mid x, V) = \mathcal{N}(\mu_n, \sigma^2 \Lambda_n^{-1})\, \mathcal{J}(a_n, b_n),$$

with

$$\begin{aligned}
\mu_n &= (V^T V + \Lambda_0)^{-1} (V^T x + \Lambda_0 \mu_0) \\
\Lambda_n &= V^T V + \Lambda_0 \\
a_n &= a_0 + n/2 \\
b_n &= b_0 + \tfrac{1}{2}\left(x^T x + \mu_0^T \Lambda_0 \mu_0 - \mu_n^T \Lambda_n \mu_n\right).
\end{aligned}$$
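For reference, a small NumPy sketch of these conjugate updates (our illustration of the formulas above, not code from the paper):

```python
import numpy as np

def posterior_update(x, V, mu0, Lambda0, a0, b0):
    """Conjugate normal / inverse-gamma update for Bayesian linear regression."""
    n = x.shape[0]
    Lambda_n = V.T @ V + Lambda0
    mu_n = np.linalg.solve(Lambda_n, V.T @ x + Lambda0 @ mu0)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (x @ x + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n
```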
Since $\Lambda_0$ is a precision matrix, it admits a decomposition of the form $\Lambda_0 = L_0^T L_0$ with $L_0$ square (for example, a Cholesky decomposition). If we define $\bar{V} = [V; L_0]$ as the vertical concatenation of the two matrices, we have that $\Lambda_n = \bar{V}^T \bar{V}$. If we define

$$\bar{x} = \begin{pmatrix} x \\ L_0 \mu_0 \end{pmatrix},$$

we have that

$$\mu_n = (\bar{V}^T \bar{V})^{-1} \bar{V}^T \bar{x} = \bar{V}^{\dagger} \bar{x}.$$
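A quick numerical check of this augmented-data identity (again an illustrative sketch): the ordinary least-squares solution on $(\bar{V}, \bar{x})$ coincides with the conjugate posterior mean $\mu_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 4
V = rng.normal(size=(n, r))
x = V @ rng.normal(size=r) + 0.1 * rng.normal(size=n)

mu0 = np.zeros(r)
Lambda0 = 2.0 * np.eye(r)              # prior precision
L0 = np.linalg.cholesky(Lambda0).T     # Lambda0 = L0.T @ L0

mu_n = np.linalg.solve(V.T @ V + Lambda0, V.T @ x + Lambda0 @ mu0)

V_bar = np.vstack([V, L0])             # augmented design
x_bar = np.concatenate([x, L0 @ mu0])  # augmented response
mu_bar = np.linalg.lstsq(V_bar, x_bar, rcond=None)[0]

print(np.allclose(mu_n, mu_bar))  # True
```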

Moreover, the estimates of the inverse-gamma distribution can be expressed in terms of the sum of squares of the residuals. In the case where no regularization by the prior is involved ($\Lambda_0 = 0$), the residuals from the model are of the form $\hat{\varepsilon} = x - V \hat{w} = x - V V^{\dagger} x$. The sum-of-squares error is then

$$\begin{aligned}
\hat{\varepsilon}^T \hat{\varepsilon} &= (x - V V^{\dagger} x)^T (x - V V^{\dagger} x) \\
&= x^T x - x^T V V^{\dagger} x - x^T (V^{\dagger})^T V^T x + x^T (V^{\dagger})^T V^T V V^{\dagger} x \\
&= x^T x - x^T V V^{\dagger} x - x^T V V^{\dagger} x + x^T V V^{\dagger} x \\
&= x^T x - x^T V V^{\dagger} x \\
&= x^T x - x^T (V^{\dagger})^T V^T V V^{\dagger} x \\
&= x^T x - \mu^T \Lambda \mu,
\end{aligned}$$

where $\mu$ and $\Lambda$ denote the versions of $\mu_n$ and $\Lambda_n$ without regularization ($\Lambda_0 = 0$). Hence, $b_n$ behaves as a regularized version of the sum-of-squares error. The mode of the inverse-gamma distribution $\mathcal{J}(a_n, b_n)$ is given by $b_n/(a_n + 1)$, which then closely resembles the maximum likelihood estimate of the variance in Ordinary Least Squares: $\hat{\sigma}^2 = \hat{\varepsilon}^T \hat{\varepsilon} / n$. Thus, the sampling of $p(\sigma^2 \mid x, V) = \mathcal{J}(a_n, b_n)$ is essentially centered around the residual error of the model.
One more thing to note is that the Bayesian linear regression will also sometimes be needed with response variables in matrix rather than vector form:

$$X = V W + \sigma E,$$

where $X$ is $n \times m$, $V$ is $n \times r$, $W$ is $r \times m$, and $E$ is an $n \times m$ matrix whose columns are $\varepsilon_j \sim \mathcal{N}(0, I_n)$, $j = 1, \ldots, m$. In this case, the regression problem can be split into $m$ independent regression problems:

$$X_{\cdot j} = V W_{\cdot j} + \sigma \varepsilon_j.$$
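As a small illustration (ours), NumPy's lstsq already solves the $m$ column regressions in one batched call, and the result matches solving them one column at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 40, 3, 6
V = rng.normal(size=(n, r))
X = V @ rng.normal(size=(r, m)) + 0.1 * rng.normal(size=(n, m))

# All m regressions at once: lstsq treats each column of X independently.
W_batch = np.linalg.lstsq(V, X, rcond=None)[0]

# Equivalent: one regression per column.
W_cols = np.column_stack([np.linalg.lstsq(V, X[:, j], rcond=None)[0] for j in range(m)])

print(np.allclose(W_batch, W_cols))  # True
```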

4 Gibbs sampling for multiway Bayesian linear regression(s)
The cyclical update of the alternating least squares is reminiscent of the Gauss-Seidel method (REF), but also of the Gibbs sampler (REF). In the (blocked) Gibbs sampler, we want to sample from some joint probability distribution $p(V_1, \ldots, V_K)$ and set up the following scheme:

$$V_k^{(i)} \sim p\!\left(\tilde{V}_k \,\middle|\, V_1^{(i)}, \ldots, V_{k-1}^{(i)}, V_{k+1}^{(i-1)}, \ldots, V_K^{(i-1)}\right), \quad \text{for } k = 1, \ldots, K \text{ and } i = 1, \ldots,$$

where we draw $V_k^{(i)}$ from its conditional distribution, conditioned on the current values of the other blocks of variables. Under the condition that the scheme forms an irreducible and aperiodic Markov chain (REF), this scheme will converge towards drawing dependent samples from the joint distribution $p(V_1, \ldots, V_K)$. The early part of the sequence, where samples are not yet considered to be drawn from the joint distribution, is called the burn-in. Depending on whether the length of the burn-in is acceptable or not, the Markov chain is deemed well or poorly mixing.
If we have a multilinear model $X \approx f(V_1, \ldots, V_K)$, we can then set up a block Gibbs sampler for the posterior distribution $p(V_1, \sigma_1, \ldots, V_K, \sigma_K \mid X)$ by chaining Bayesian linear regressions over the different blocks (where $\bar{k}$ denotes the estimates for the linear model resulting from freezing all the other blocks of variables, and the “$\cdot$” notation denotes the fact that when $V_k$ is a matrix we perform one regression per column):

$$V_{\cdot k}^{(i)} \sim \mathcal{N}\!\left(V;\, \mu_{\cdot \bar{k}}^{(i)},\, (\sigma^2)_{\cdot \bar{k}}^{(i)}\, (\Lambda_{\cdot \bar{k}}^{(i)})^{-1}\right), \qquad (\sigma^2)_{\bar{k}}^{(i)} \sim \mathcal{J}\!\left(a_{\bar{k}}^{(i)},\, b_{\bar{k}}^{(i)}\right).$$

The precise setup of the regressions will depend on the specific model at hand.
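As an illustrative sketch (our assumptions: the low-rank model $X \approx V W^T$, a zero-mean isotropic prior on each block, a single shared noise level, and a simplified residual-based noise update), one full sweep of such a block Gibbs sampler could look as follows:

```python
import numpy as np

def gibbs_sweep(X, V, W, sigma2, Lambda0, a0, b0, rng):
    """One block-Gibbs sweep for X ~ V @ W.T (illustrative sketch).

    Prior per column of each block: N(0, sigma2 * inv(Lambda0)); the noise
    variance is inverse-gamma, here updated from the global residual
    (a simplified version of the per-block updates in the text)."""
    # Block 1: sample W row-wise (one regression per column of X).
    Lam = V.T @ V + Lambda0
    L = np.linalg.cholesky(Lam)
    Mu = np.linalg.solve(Lam, V.T @ X)            # r x p posterior means
    W = (Mu + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(Mu.shape))).T

    # Block 2: sample V row-wise (one regression per row of X).
    Lam = W.T @ W + Lambda0
    L = np.linalg.cholesky(Lam)
    Mu = np.linalg.solve(Lam, W.T @ X.T)          # r x n posterior means
    V = (Mu + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(Mu.shape))).T

    # Noise: inverse-gamma draw centred on the current residual error.
    resid = X - V @ W.T
    a_n = a0 + X.size / 2.0
    b_n = b0 + 0.5 * np.sum(resid ** 2)
    sigma2 = 1.0 / rng.gamma(shape=a_n, scale=1.0 / b_n)
    return V, W, sigma2
```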
If this sampling scheme is irreducible and aperiodic, then it is guaranteed to converge in distribution to some equilibrium distribution. Given that the normal distribution has a non-zero probability of sampling any point of the space, and that the inverse gamma has a non-zero probability of sampling any positive value, all possible paths have non-zero probability, thereby guaranteeing irreducibility and aperiodicity. In fact, this property corresponds to that of an ergodic Markov chain. Note, however, that this does not provide any guarantee that the sampling scheme will have good mixing properties. Nevertheless, given that we sample from a limited number of large blocks, we can expect that the sampling scheme should have fairly good mixing properties. If so, after the initial period, the successive samples from this scheme will be dependent samples from the desired posterior distribution $p(V_1, \sigma_1, \ldots, V_K, \sigma_K \mid X)$.

Although all the conditional distributions considered (if we ignore the variances) are normally distributed, the resulting joint distribution cannot be normally distributed because the corresponding estimation model is nonlinear. It is also unclear what the corresponding noise model is (even though the conditional models correspond to isotropic Gaussian noise).

5 GAMBLR: Gibbs Annealing for Multiway Bayesian Linear Regression(s)
The Gibbs sampling scheme above is designed to converge in distribution to the posterior distribution of the parameters of the model given the data. However, we may want to obtain an optimal point estimate rather than collecting a sample from the posterior. A classical setting contrasting those two goals is Metropolis-Hastings sampling vs. simulated annealing. In the Metropolis-Hastings algorithm, we aim at sampling from a given distribution, while in simulated annealing we introduce a temperature parameter. As the sampling scheme is cooled, the algorithm converges to the maximum of the probability distribution. Essentially, the effect of the temperature parameter is to turn the Metropolis-Hastings scheme into one that samples from an increasing power of the original distribution. As the power approaches infinity, the distribution approaches a (set of) Dirac distribution(s) at the position(s) where the original probability distribution is maximal.
The Gibbs sampling scheme can be seen as a particular case of the Metropolis-Hastings algorithm. Considering the sampling from the multivariate normal distribution

$$p(w) = \mathcal{N}(w; \mu, \sigma^2 \Lambda^{-1}) \propto \exp\!\left(-\frac{1}{2\sigma^2}(w - \mu)^T \Lambda (w - \mu)\right),$$

if we now want to sample from $p^{1/T}(w)$ instead of $p(w)$, then we observe that

$$p^{1/T}(w) \propto \left(\exp\!\left(-\frac{1}{2\sigma^2}(w - \mu)^T \Lambda (w - \mu)\right)\right)^{1/T} = \exp\!\left(-\frac{1}{2\sigma^2 T}(w - \mu)^T \Lambda (w - \mu)\right) = \mathcal{N}(w; \mu, \sigma^2 T \Lambda^{-1}).$$

Thus, lowering the temperature corresponds to reducing the variance linearly
with the temperature parameter.
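A quick numerical check of this identity (an illustrative sketch using SciPy): the log-density of $\mathcal{N}(\mu, T\Sigma)$ differs from $(1/T)\log \mathcal{N}(w; \mu, \Sigma)$ only by a constant in $w$, so tempering by $1/T$ is exactly a rescaling of the covariance by $T$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
A = rng.normal(size=(2, 2))
Sigma = A @ A.T + np.eye(2)      # plays the role of sigma^2 * inv(Lambda)
T = 3.0

W = rng.normal(size=(100, 2))    # arbitrary test points
diff = (multivariate_normal(mu, T * Sigma).logpdf(W)
        - multivariate_normal(mu, Sigma).logpdf(W) / T)

# Constant across all test points: p^(1/T) is N(mu, T * Sigma) up to normalization.
print(np.allclose(diff, diff[0]))  # True
```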
Going back to the Bayesian linear regression problem, once we have computed the posterior distribution, we need to sample from it. Taking the notation from the previous sections, we want at each step to sample from some multivariate normal distribution $\mathcal{N}(\mu_n, \sigma^2 T \Lambda_n^{-1})$ with

$$\mu_n = (\bar{V}^T \bar{V})^{-1} \bar{V}^T \bar{x} = \bar{V}^{\dagger} \bar{x}, \qquad \Lambda_n = \bar{V}^T \bar{V}.$$

To sample from this normal distribution, we can set $z = \mu_n + \bar{V}^{\dagger} \sigma \sqrt{T}\, \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. Then $\mathrm{E}(z) = \mu_n$ and $\operatorname{cov}(z) = \bar{V}^{\dagger}\, \sigma^2 T I\, (\bar{V}^{\dagger})^T = \sigma^2 T (\bar{V}^T \bar{V})^{-1} = \sigma^2 T \Lambda_n^{-1}$ by the properties of the pseudoinverse. Moreover, $\mu_n = \bar{V}^{\dagger} \bar{x}$. Hence,

$$z = \bar{V}^{\dagger} (\bar{x} + \sigma \sqrt{T}\, \varepsilon).$$

This means that each step of the Gibbs sampler consists in solving the original Ordinary Least Squares problem from the Alternating Least Squares strategy, simply injecting the appropriate level of noise directly into the data! We still need to sample $\sigma$ at the end of each iteration, but this is only a one-dimensional parameter that can easily be sampled from the appropriate inverse-gamma distribution.
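A minimal sketch of this noise-injection step (our naming; $\bar{V}$ and $\bar{x}$ are the augmented design and response from Section 3): one posterior draw is an ordinary least-squares solve on noise-perturbed data.

```python
import numpy as np

def noise_injection_draw(V_bar, x_bar, sigma, T, rng):
    """Draw z ~ N(mu_n, sigma^2 * T * inv(V_bar.T @ V_bar)) by perturbing the
    (augmented) data and solving an ordinary least-squares problem."""
    eps = rng.standard_normal(x_bar.shape)
    return np.linalg.lstsq(V_bar, x_bar + sigma * np.sqrt(T) * eps, rcond=None)[0]

# Sanity check: the empirical covariance of repeated draws approaches
# sigma^2 * T * inv(V_bar.T @ V_bar).
rng = np.random.default_rng(0)
V_bar = rng.normal(size=(50, 3))
x_bar = rng.normal(size=50)
draws = np.array([noise_injection_draw(V_bar, x_bar, 0.5, 2.0, rng) for _ in range(20000)])
print(np.round(np.cov(draws, rowvar=False), 3))
print(np.round(0.5 ** 2 * 2.0 * np.linalg.inv(V_bar.T @ V_bar), 3))
```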
Solving the least-squares problem can classically be done through the use of the pseudoinverse. Alternatively, we can also consider the following approach, which replaces the computation of the pseudoinverse by the solution of the normal system of linear equations:

$$z = \bar{V}^{\dagger} (\bar{x} + \sigma \sqrt{T}\, \varepsilon) = (\bar{V}^T \bar{V})^{-1} \bar{V}^T (\bar{x} + \sigma \sqrt{T}\, \varepsilon),$$

which is equivalent to

$$(\bar{V}^T \bar{V}) z = \Lambda_n z = \bar{V}^T (\bar{x} + \sigma \sqrt{T}\, \varepsilon),$$

and is also equivalent to solving

$$\bar{V} z = \bar{x} + \sigma \sqrt{T}\, \varepsilon$$

in the least-squares sense.


Solving this problem does not necessarily require computing the inverse of $\Lambda_n$ or the pseudoinverse of $\bar{V}$; it can instead be tackled using Krylov subspace methods, such as Conjugate Gradients (REF), SYMMLQ (REF), GMRES (REF), or BiCGStab (REF). In particular, these methods can be advantageous when $\bar{V}$ or $\bar{V}^T \bar{V}$ is large and sparse. However, when we need to perform regressions for all the columns of a matrix, explicit computation of the (pseudo)inverse might be more favorable.
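For example (an illustrative sketch, not prescribed by the text), the normal equations $\Lambda_n z = \bar{V}^T(\bar{x} + \sigma\sqrt{T}\varepsilon)$ are symmetric positive definite, so a matrix-free conjugate gradient solve applies directly:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n_bar, r = 2000, 50
V_bar = rng.normal(size=(n_bar, r))   # augmented design (could be sparse in practice)
x_bar = rng.normal(size=n_bar)
sigma, T = 0.3, 1.0

y = x_bar + sigma * np.sqrt(T) * rng.standard_normal(n_bar)   # noise-injected data

# Matrix-free operator for Lambda_n = V_bar.T @ V_bar (never formed explicitly).
Lambda_n = LinearOperator((r, r), matvec=lambda v: V_bar.T @ (V_bar @ v))

z, info = cg(Lambda_n, V_bar.T @ y)
z_ls = np.linalg.lstsq(V_bar, y, rcond=None)[0]
print(info, np.linalg.norm(z - z_ls) / np.linalg.norm(z_ls))  # 0, small relative error
```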
Given the data $x$ and the prior $\mu_0, \Lambda_0$, we can compute $\bar{V}$ and $\bar{x}$. Given the current noise level $\sigma$, we can draw a random unit normal sample $\varepsilon$. We then solve the linear system of equations for $z$, which gives the next sample from the Gibbs sampler. We repeat this for each of the multilinear components until convergence of the Gibbs sampling scheme.
For the variance parameter, we propose to replace sampling by an iterated conditional mode procedure (REF), where the mode of the inverse-gamma distribution $\mathcal{J}(a, b)$ is given by $b/(a + 1)$. Given that the noise parameter is scalar while the normal distribution is multidimensional, we do not expect the greediness of the noise estimation to significantly affect the procedure. We also expect that the iterated conditional mode estimation of the noise parameter will allow us to monitor the convergence of the noise parameter. Given that we sample from a limited number of blocks, we expect that the convergence of the noise parameter can be used to assess convergence of the Markov chain. Once this convergence has been attained, we switch from burn-in to the stationary Gibbs sampling regime. We can then collect dependent samples from the posterior distribution to model the uncertainty of our parameter estimation. After enough samples have been collected, we can then switch to an annealing regime, for example $T(s) = \alpha^s$ with $\alpha$ smaller than one but close to one and $s$ indicating the iteration of a full round of the sampling scheme. As $s \to \infty$, $T \to 0$, and if the cooling is sufficiently slow, the sampling scheme will converge to the maximum a posteriori solution of the probabilistic multilinear model.
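Putting the pieces together, a skeleton of the full GAMBLR loop for the low-rank example might look as follows (an illustrative sketch under our own naming and simplifications: zero-mean isotropic priors, a single shared noise level, and the block update implemented via the noise-injection draw described above):

```python
import numpy as np

def gamblr_low_rank(X, r, n_burn=200, n_samples=100, n_anneal=300,
                    alpha=0.99, lam0=1.0, a0=1.0, b0=1.0, seed=0):
    """GAMBLR-style sketch for X ~ V @ W.T: Gibbs sampling by noise injection,
    iterated-conditional-mode noise updates, then geometric annealing."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = rng.normal(size=(n, r))
    W = rng.normal(size=(p, r))
    sigma2, T = 1.0, 1.0
    L0 = np.sqrt(lam0) * np.eye(r)          # Lambda0 = lam0 * I = L0.T @ L0
    samples = []

    def block_draw(A, B):
        """Noise-injected regression of B's columns on A (prior mean zero)."""
        A_bar = np.vstack([A, L0])
        B_bar = np.vstack([B, np.zeros((r, B.shape[1]))])
        E = np.sqrt(sigma2 * T) * rng.standard_normal(B_bar.shape)
        return np.linalg.lstsq(A_bar, B_bar + E, rcond=None)[0]

    for s in range(n_burn + n_samples + n_anneal):
        W = block_draw(V, X).T               # freeze V, regress columns of X
        V = block_draw(W, X.T).T             # freeze W, regress rows of X
        # Iterated conditional mode for the noise: mode of J(a_n, b_n).
        resid = X - V @ W.T
        a_n = a0 + X.size / 2.0
        b_n = b0 + 0.5 * np.sum(resid ** 2)
        sigma2 = b_n / (a_n + 1.0)
        if n_burn <= s < n_burn + n_samples:
            samples.append((V.copy(), W.copy()))   # posterior samples at T = 1
        elif s >= n_burn + n_samples:
            T *= alpha                              # geometric cooling of the temperature
    return V, W, sigma2, samples
```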

6 Collapsed Gibbs sampler


It should be possible to write a collapsed Gibbs sampler in which the noise is marginalized out. This probably leads to sampling from a non-standardized Student's t-distribution instead of a normal distribution. No sampling of $\sigma^2$ is then required; whenever $\sigma^2$ is needed, its distribution can be derived conditioned on the other parameters.

7 Examples
Work out the specific example of probabilistic matrix factorization and canonical
polyadic decomposition?

8 Noise injection
The proposed technique of annealed noise injection is specific to our multilin-
ear setting. However, it provides an intriguing general idea. When learning a
model, if a model estimate has a given noise level, then the uncertainty on the
parameters of the model is directly dependent on the level of the noise. This is
also the idea of model-based bootstrapping, where bootstrap copies of the data
are generated by resampling residuals from the model estimate (REF) and then
reestimating the parameters on those bootstrap copies to estimate the variabil-
ity of the parameter estimates. Given the model estimate, all bootstrap copies
are equally likely. Here, a similar idea would be that if a model estimate has
a certain noise level, it should not be able to distinguish between copies of the
data where a similar level of noise has been injected. Note that we cannot use
model-based bootstrap for learning because the best model that can be learned
from the bootstrap copies is the underlying model that was used to generate
them, which means that no learning would take place. However, by injecting the noise directly into the original data, learning is possible since the model will still try to fit the data. As the learning process proceeds, the error of the model will start to decrease despite the noise injection. As the error decreases, the noise level, and thus the amount of injected noise, will decrease accordingly. We can hope that this will converge to some noise level that corresponds to the best error achievable by the model. Then, we can start annealing the noise injection
further to converge to an optimal estimate. Note that if the model is properly
regularized, the annealing of the noise will not lead to overfitting. Conversely,
if the model is not regularized and prone to overfitting, the non-annealed ver-
sion should provide a form of regularization. It would however not provide an
optimal solution, although taking the average of the solutions after convergence
of the noise might provide a good point estimate.
Although we cannot expect that this procedure will in general correspond to
a true Bayesian scheme, we can nevertheless hope that it provides a simple, yet
interesting procedure to search for a global optimum of the learning problem.
