Bayesian Methods for Big Data
Conchi Ausín and Mike Wiper, Masters in Big Data
Outline and learning objectives
[Figure: Prestige against Years of Education]
R packages used
rm(list=ls())
if(!require(car)) install.packages("car")
if(!require(MCMCglmm)) install.packages("MCMCglmm")
if(!require(BASS)) install.packages("BASS")
if(!require(brnn)) install.packages("brnn")
if(!require(tgp)) install.packages("tgp")
Non-linearly related data
The Prestige data set contains the prestige rating, average income and
educational level for a number of different professions.
library(car)
help(”Prestige”)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
A general regression model
y = g(x) + ε

where g(·) represents an unknown functional form and the error, ε, is typically assumed to be normally distributed.
with(Prestige, lines(lowess(income, prestige, f=0.5, iter=0), lwd=2))
The lowess command gives a classical, kernel-based local regression fit.
Polynomial regression
One parametric way to fit the data is via polynomial regression, setting

g(x) = Σ_{i=0}^{k} βᵢ xⁱ.

However:
For large k, the matrix XᵀX often becomes ill conditioned ...
... which means that both classical and Bayesian regression with flat priors are complicated.
To avoid this, we can use orthogonal polynomials.
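As a quick illustration of the conditioning problem, this minimal sketch (assuming the Prestige data are loaded, with an arbitrary degree of 5) compares the condition numbers of the raw and orthogonal polynomial design matrices:
x = Prestige$income
X.raw = outer(x, 0:5, "^")     # raw basis: 1, x, x^2, ..., x^5
X.orth = cbind(1, poly(x, 5))  # orthogonal polynomial basis
kappa(X.raw)                   # huge: severely ill conditioned
kappa(X.orth)                  # small: well conditioned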
Orthogonal polynomials
Instead of the standard basis, we can write g(x) = Σ_{i=0}^{k} βᵢ Pᵢ(x), where the polynomials Pᵢ(·) satisfy

Σ_{r=1}^{n} Pᵢ(xᵣ) Pⱼ(xᵣ) = 0 for all j ≠ i.

The orthogonal polynomials are easily calculated via e.g. Gram-Schmidt orthogonalization:
poly(Prestige$income,2)
Orthogonal polynomials
The classical MLE and the Bayesian posterior mean given uniform priors for the βᵢ coincide:

β̂ᵢ = Σ_{r=1}^{n} Pᵢ(xᵣ) yᵣ / Σ_{r=1}^{n} Pᵢ(xᵣ)².
The order of the polynomial can be selected using an information
criterion (e.g. AIC or BIC in the classical case or DIC in the Bayesian
case).
Bayesian inference can be done in a conjugate way or via MCMC.
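As a minimal check of this formula, we can compute the β̂ᵢ directly and compare with a least squares fit (the columns returned by poly() play the role of the Pᵢ):
P = poly(Prestige$income, 2)
y = Prestige$prestige
beta.hat = colSums(P * y) / colSums(P^2)  # sum_r P_i(x_r) y_r / sum_r P_i(x_r)^2
beta.hat
coef(lm(y ~ P))                           # matches, up to the intercept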
Model selection
For the prestige data, we can implement polynomial regressions of various
orders and compare via DIC.
dic = rep(0,5)
library(MCMCglmm)
c = MCMCglmm(prestige ~ poly(income,1), data=Prestige)
dic[1] = c$DIC
c = MCMCglmm(prestige ~ poly(income,2), data=Prestige)
dic[2] = c$DIC
c = MCMCglmm(prestige ~ poly(income,3), data=Prestige)
dic[3] = c$DIC
c = MCMCglmm(prestige ~ poly(income,4), data=Prestige)
dic[4] = c$DIC
c = MCMCglmm(prestige ~ poly(income,5), data=Prestige)
dic[5] = c$DIC
dic
Checking convergence
We can check convergence of the MCMC output for a fitted model, e.g. via trace, running mean, ACF and posterior density plots of β₀:
plot(c(1:1000), c$Sol[,1], type='l', xlab='index',
ylab=expression(beta[0]), col='red')
plot(c(1:1000), cumsum(c$Sol[,1])/c(1:1000), type='l', xlab='index',
ylab=expression(paste('E[',beta[0],']')), col='red')
plot(acf(c$Sol[,1]), col='red', lwd=2,
main=expression(paste('ACF plot of the generated values of ',beta[0])))
plot(density(c$Sol[,1]), xlab=expression(beta[0]), ylab='f',
main=expression(paste('Posterior density of ',beta[0])), lwd=2, col='red')
Prediction
c = MCMCglmm(prestige ~ poly(income,2), data=Prestige)
yhatpoly = predict(c)
d = sort(Prestige$income, index.return=TRUE)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, yhatpoly[d$ix], type='l', col='red', lwd=2)
Predictive uncertainty
q = matrix(rep(0,204), nrow=102)  # 95% interval limits for each of the 102 observations
vv = poly(Prestige$income, 2)
for (j in 1:102){
  # posterior draws of the regression mean at observation j ...
  m = c$Sol[,1] + c$Sol[,2]*vv[j,1] + c$Sol[,3]*vv[j,2]
  # ... and of the corresponding posterior predictive distribution
  z = rnorm(1000, m, sqrt(c$VCV))
  q[j,] = quantile(z, probs=c(0.025,0.975))
}
lines(d$x, q[,1][d$ix], type='l', lwd=2, col='red', lty=2)
lines(d$x, q[,2][d$ix], type='l', lwd=2, col='red', lty=2)
Regression splines
Often, high order polynomial models are required to provide an adequate fit, which makes polynomial regression very inefficient. However, lower order polynomials might provide good local fits to different sections of the data.
Spline regression divides the range of x into regions and fits each region via a polynomial regression, with the constraint that the fitted curves are connected at the region boundaries.
[Figure: Prestige against Years of Education]
Regression spline structure
g(x) = β₀ + Σ_{m=1}^{M} βₘ Bₘ(x)

Bₘ(x) = Π_{k=1}^{Kₘ} g_{km} [s_{km}(x_{v_{km}} − t_{km})]₊^α

g_{km} = [(s_{km} + 1)/2 − s_{km} t_{km}]^{−α}

where s_{km} ∈ {−1, 1} is a sign, t_{km} ∈ [0, 1] is a knot, v_{km} selects a variable, Kₘ is the degree of interaction, α determines the degree of the polynomial fit and g_{km} normalizes the basis function. The function [·]₊ is defined as max(0, ·).
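As an illustration, here is a sketch of a single one-dimensional basis function (Kₘ = 1), assuming x has been rescaled to [0, 1] and taking α = 1 with an arbitrary sign and knot:
hinge = function(x, s, t, alpha = 1) {
  g = ((s + 1)/2 - s*t)^(-alpha)  # normalizing constant g_km
  g * pmax(0, s*(x - t))^alpha    # g_km [s (x - t)]_+^alpha
}
curve(hinge(x, s = 1, t = 0.4), from = 0, to = 1, lwd = 2)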
Bayesian spline regression
Prediction for the prestige data
library(BASS)
mod = bass(Prestige$income, Prestige$prestige)
yfit = predict(mod, Prestige$income, verbose=T)
yhat.splines = colMeans(yfit)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, yhat.splines[d$ix], type='l', col='green', lwd=2)
quants = apply(yfit, 2, quantile, probs=c(0.025,0.975))
lines(d$x, quants[1,][d$ix], type='l', lty=3, col='green', lwd=2)
lines(d$x, quants[2,][d$ix], type='l', lty=3, col='green', lwd=2)
Neural networks
A difficulty with both polynomial and spline regression is that they are
hard to extend efficiently to the case of highly multivariate x. An
alternative is to use feed forward neural networks (deep learning).
The basic model assumes that, for inputs x = (x₁, ..., xₚ)ᵀ,

g(x) = β₀ + Σ_{k=1}^{S} wₖ ψ(aₖ + Σ_{j=1}^{p} b_{kj} xⱼ)

where ψ(·) is a fixed, non-linear activation function.
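A minimal sketch of evaluating this function at a single input, using hypothetical weights and the tanh activation (the one used by the brnn package below):
nn.g = function(x, beta0, w, a, B) {
  # B is an S x p matrix of input weights b_kj
  beta0 + sum(w * tanh(a + B %*% x))
}
nn.g(x = c(0.5, -1), beta0 = 0, w = c(1, -1), a = c(0, 0),
     B = matrix(c(1, -1, 2, 0.5), nrow = 2))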
Bayesian inference for neural nets
Prediction for the prestige data
library(brnn)
out = brnn(Prestige$prestige ~ Prestige$income, neurons=2)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, predict(out)[d$ix], type='l', lwd=2, col='magenta')
Gaussian processes
The most general approach is to directly define a prior distribution for the function g(·). The usual choice is the Gaussian process.
Assume we have an input x with associated function value g(x). Then g(·) is said to follow a Gaussian process if, for any n and any set of inputs (x₁, ..., xₙ), the joint distribution of g(x₁), ..., g(xₙ) is multivariate normal.
The process is determined by its mean function (usually set to 0 a priori for all x) and its covariance function.
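For intuition, the following sketch simulates a single draw of g over a grid of inputs from a zero-mean GP prior, assuming a squared exponential covariance k(x, x′) = exp(−(x − x′)²/2) purely for illustration:
xg = seq(0, 5, length.out = 100)
K = outer(xg, xg, function(a, b) exp(-(a - b)^2/2))  # covariance matrix
g = t(chol(K + 1e-8*diag(100))) %*% rnorm(100)       # one draw from N(0, K)
plot(xg, g, type='l', lwd=2)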
Covariance function
Inference
Assume we have data (x₁, y₁), ..., (xₙ, yₙ). Then, the predictive distribution of y at a new input xₜ is normal, with mean

K_tn (K_n + σ²I)⁻¹ y

and variance

K_tt − K_tn (K_n + σ²I)⁻¹ K_nt + σ²,

where K_n is the n × n covariance matrix of the observed inputs, K_tn (= K_ntᵀ) contains the covariances between xₜ and the observed inputs, and K_tt is the prior variance at xₜ.
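A minimal sketch of these predictive equations, again assuming a squared exponential covariance with hypothetical fixed values for the length scale l and noise variance σ² (in practice these must be estimated, as discussed next):
gp.predict = function(x, y, xt, l = 1, sigma2 = 0.1) {
  k = function(a, b) exp(-outer(a, b, "-")^2/(2*l^2))
  A = solve(k(x, x) + sigma2*diag(length(x)))  # (K_n + sigma^2 I)^{-1}
  Ktn = k(xt, x)
  list(mean = Ktn %*% A %*% y,
       var = 1 - rowSums((Ktn %*% A) * Ktn) + sigma2)  # K_tt = 1 here
}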
Inference
Prediction is easy given the model parameters. The only difficulty, therefore, is the estimation of σ² and the parameters of the covariance function.
This can be done via MCMC, but that is very slow, so approximations via e.g. likelihood maximization or variational Bayes are usually preferred.
Gaussian processes and alternative models
Prediction for the prestige data
library(tgp)
xmin = min(Prestige$income)
xmax = max(Prestige$income)
xx = xmin + c(0:100)*(xmax - xmin)/100
yfit = bgp(X=Prestige$income, Z=Prestige$prestige, XX=xx, verb=0)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige, ylim=c(0,100))
with(Prestige, lines(lowess(income, prestige, f=0.5, iter=0), lwd=2))
lines(xx, yfit$ZZ.mean, type='l', lwd=2, col='red')
lines(xx, yfit$ZZ.q1, type='l', lty=2, lwd=2, col='red')
lines(xx, yfit$ZZ.q2, type='l', lty=2, lwd=2, col='red')
Bayesian classification
We have test and training data divided into different groups.
[Figure: training data by class, plotted against X1 and X2]
Test data
[Figure: test data, plotted against X1 and X2]
Linear probit regression
Suppose data come from two classes (C = 0 or 1).
The (linear) probit model assumes that

P(C = 1|x) = Φ(xᵀθ).

Equivalently, in terms of an underlying latent variable

Y = xᵀθ + ε, where ε ∼ N(0, 1),

we set C = 1 if Y > 0 and C = 0 otherwise.
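A minimal sketch of this latent variable construction, simulating hypothetical two-class data and recovering θ with a classical probit fit as a baseline:
set.seed(1)
n = 200; theta = c(2, -2)
x = matrix(runif(2*n, -1, 1), ncol = 2)             # two hypothetical covariates
Y = x %*% theta + rnorm(n)                          # latent variable
C = as.numeric(Y > 0)                               # observed class
glm(C ~ x - 1, family = binomial(link = "probit"))  # recovers theta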
Nonparametric, multinomial probit regression
The simple probit model only allows for two classes and assumes an
underlying linear structure.
A parametric multinomial probit regression model is Cᵢ ∼ categorical(pᵢ₁, ..., pᵢₖ) where pᵢⱼ = P(Cᵢ = j|xᵢ).
Underlying latent variables are Yᵢⱼ = βⱼᵀxᵢ + εᵢⱼ for j = 1, ..., k. We set Cᵢ = j if Yᵢⱼ > Yᵢᵣ for all r ≠ j.
We can generalize to a non-linear case by setting Yᵢⱼ = gⱼ(xᵢ) + εᵢⱼ and modeling each function gⱼ with a GP prior.
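A minimal sketch of this latent variable construction in the linear case, with k = 3 classes, a single covariate and hypothetical coefficients βⱼ:
set.seed(1)
k = 3; n = 100
x = runif(n, -1, 1)
beta = c(-1, 0, 1)                               # hypothetical beta_j
Y = outer(x, beta) + matrix(rnorm(n*k), n, k)    # Y_ij = beta_j x_i + eps_ij
C = max.col(Y)                                   # C_i = j if Y_ij is largest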
Model fitting
[Figure: two panels of the fitted classification model, plotted against X1 and X2]
Classification errors
[Figure: test data plotted against X1 and X2, with incorrectly classified points highlighted]
Probabilities for misclassified points
Conclusions and extensions