
Chapter 4: Non-linear and non-parametric regression and classification

Conchi Ausı́n and Mike Wiper


Department of Statistics
Universidad Carlos III de Madrid

Bayesian Methods for Big Data (Masters in Big Data)
Outline and learning objectives
Introduce the ideas of Bayesian nonparametric regression modeling and classification.

[Figure: Prestige against Years of Education.]

R packages used

rm(list=ls())
if(!require(car)) install.packages("car")
if(!require(MCMCglmm)) install.packages("MCMCglmm")
if(!require(BASS)) install.packages("BASS")
if(!require(brnn)) install.packages("brnn")
if(!require(tgp)) install.packages("tgp")

Other packages are available: blm or BayesSummaryStatLM for linear models, cpr for spline regression, GPfit for Gaussian process models.

Non-linearly related data

The Prestige data set contains the prestige rating, average income and educational level for various professions.

library(car)
help("Prestige")
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)

We can observe a clear non-linear relationship between prestige and income.

A general regression model

The basic nonparametric regression model is of the form

y = g(x) + ε

where g(·) represents an unknown functional form and, typically, the error ε is assumed to be normally distributed, ε ∼ N(0, σ²).
with(Prestige, lines(lowess(income, prestige, f=0.5, iter=0), lwd=2))

The lowess command gives a classical, kernel based, local regression fit.

Polynomial regression

One parametric way to fit the data could be via polynomial regression, setting

g(x) = Σ_{i=0}^{k} β_i x^i.

However:
For large k, the matrix X^T X often becomes ill-conditioned ...
... which complicates both classical regression and Bayesian regression with flat priors.
To avoid this, we can use orthogonal polynomials.
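
As a quick illustration (not from the original slides), we can compare the condition number of X^T X for a raw polynomial basis with that of an orthogonal polynomial basis; the degree (8) is an arbitrary example value.

library(car)                         # for the Prestige data
x = Prestige$income
X.raw = outer(x, 0:8, "^")           # raw polynomial basis of degree 8
X.orth = cbind(1, poly(x, 8))        # orthogonal polynomial basis
kappa(crossprod(X.raw))              # astronomically large: ill-conditioned
kappa(crossprod(X.orth))             # close to 1: well-conditioned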

Orthogonal polynomials

Suppose the data are (x_1, y_1), ..., (x_n, y_n).

Assume that g(x) = Σ_{i=0}^{k} β_i P_i(x), where P_i(x) is an orthogonal polynomial of order i such that

Σ_{r=1}^{n} P_i(x_r) P_j(x_r) = 0

for all j ≠ i.
The orthogonal polynomials are easily calculated via e.g. Gram-Schmidt orthogonalization:

poly(Prestige$income,2)
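
A quick check (an illustrative sketch, not from the slides) confirms that the columns returned by poly() are orthogonal, and in fact orthonormal:

P = poly(Prestige$income, 2)         # Prestige comes from the car package, loaded above
round(crossprod(P), 10)              # approximately the identity matrix
round(colSums(P), 10)                # each column also sums to approximately zero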

Orthogonal polynomials

The classical MLE and the Bayesian posterior mean given uniform priors for the β_i is

β̂_i = ( Σ_{r=1}^{n} P_i(x_r) y_r ) / ( Σ_{r=1}^{n} P_i(x_r)² ).

The order of the polynomial can be selected using an information criterion (e.g. AIC or BIC in the classical case or DIC in the Bayesian case).
Bayesian inference can be done in a conjugate way or via MCMC.
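
As an illustrative sketch (not from the slides), the estimator above can be computed directly and compared with the fit from lm() on the orthogonal basis:

y = Prestige$prestige
P = cbind(1, poly(Prestige$income, 2))                 # P_0 = 1 plus the orthogonal terms
beta.hat = colSums(P * y) / colSums(P^2)               # beta_i = sum_r P_i(x_r) y_r / sum_r P_i(x_r)^2
beta.hat
coef(lm(prestige ~ poly(income, 2), data=Prestige))    # should agree up to numerical error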

Model selection
For the prestige data, we can implement polynomial regressions of various
orders and compare via DIC.

dic = rep(0,5)
library(MCMCglmm)
c = MCMCglmm(prestige ~ poly(income,1), data=Prestige)
dic[1] = c$DIC
c = MCMCglmm(prestige ~ poly(income,2), data=Prestige)
dic[2] = c$DIC
c = MCMCglmm(prestige ~ poly(income,3), data=Prestige)
dic[3] = c$DIC
c = MCMCglmm(prestige ~ poly(income,4), data=Prestige)
dic[4] = c$DIC
c = MCMCglmm(prestige ~ poly(income,5), data=Prestige)
dic[5] = c$DIC
dic

A second-order polynomial is selected.


Convergence checking

As we are using MCMC, it is important to perform various checks on convergence.

plot(c(1:1000), c$Sol[,1], type='l', xlab='index',
     ylab=expression(beta[0]), col='red')
plot(c(1:1000), cumsum(c$Sol[,1])/c(1:1000), type='l', xlab='index',
     ylab=expression(paste('E[', beta[0], ']')), col='red')
plot(acf(c$Sol[,1]), main=expression(paste('ACF plot of the generated values of ', beta[0])),
     col='red', lwd=2)
plot(density(c$Sol[,1]), xlab=expression(beta[0]), ylab='f',
     main=expression(paste('Posterior density of ', beta[0])), lwd=2, col='red')

Prediction

In order to carry out prediction, for each set of sampled values β_0^[r], β_1^[r], β_2^[r], σ^[r] and a given value of x, we simply generate y^[r] from a normal distribution with mean g^[r](x) = Σ_{i=0}^{2} β_i^[r] P_i(x) and standard deviation σ^[r], and take an average.

c = MCMCglmm(prestige ~ poly(income,2), data=Prestige)
yhatpoly = predict(c)
d = sort(Prestige$income, index.return=TRUE)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, yhatpoly[d$ix], type='l', col='red', lwd=2)

Predictive uncertainty

Prediction uncertainty can be incorporated by taking a 95% interval of the set of MCMC sampled values y^[r].

q = matrix(rep(0,204), nrow=102)
vv = poly(Prestige$income, 2)
for (j in 1:102){
  m = c$Sol[,1] + c$Sol[,2]*vv[j,1] + c$Sol[,3]*vv[j,2]
  z = rnorm(1000, m, sqrt(c$VCV))
  q[j,] = quantile(z, probs=c(0.025,0.975))
}
lines(d$x, q[,1][d$ix], type='l', lwd=2, col='red', lty=2)
lines(d$x, q[,2][d$ix], type='l', lwd=2, col='red', lty=2)

Regression splines
Often, high-order polynomial models are required to provide an adequate fit, so polynomial regression can be very inefficient. However, lower-order polynomials might provide good local fits to different sections of the data.
Spline regression divides the range of x into regions and fits each region via a polynomial regression, with the constraint that the curves are joined at the region boundaries.

[Figure: Prestige against Years of Education.]

Regression spline structure

g(x) = β_0 + Σ_{m=1}^{M} B_m(x)

B_m(x) = Π_{k=1}^{K_m} g_{km} [s_{km}(x_{v_{km}} − t_{km})]_+^α

g_{km} = [(s_{km} + 1)/2 − s_{km} t_{km}]^α

where s_{km} ∈ {−1, 1} is a sign, t_{km} ∈ [0, 1] is a knot, v_{km} selects a variable, K_m is the degree of interaction, α determines the degree of the polynomial fit and g_{km} normalizes the basis function. The function [·]_+ is defined as max(0, ·).
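
To make the notation concrete, here is a small sketch (not from the slides) that evaluates a single univariate basis function of this form; the sign, knot and degree used are arbitrary example values.

# One univariate basis function: B(x) = g * [s*(x - t)]_+^alpha
basis1d = function(x, s=1, t=0.5, alpha=1){
  g = ((s + 1)/2 - s*t)^alpha                 # normalizing constant, as defined above
  g * pmax(0, s*(x - t))^alpha                # truncated power term
}
x = seq(0, 1, length.out=101)
plot(x, basis1d(x, s=1, t=0.4), type='l', ylab='B(x)')
lines(x, basis1d(x, s=-1, t=0.6), lty=2)      # opposite sign and a different knot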

Bayesian spline regression

When fitting regression splines, lower-order polynomials are typically considered. Then, the fit is mainly determined by the number and position of the knots.
In a classical context, knots are often chosen by eye.
In a Bayesian context, we can place a prior distribution on these and then carry out inference via transdimensional MCMC.
Prediction can be carried out in a similar way as for polynomial regression.

Prediction for the prestige data

library(BASS)
mod = bass(Prestige$income, Prestige$prestige)
yfit = predict(mod, Prestige$income, verbose=T)
yhat.splines = colMeans(yfit)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, yhat.splines[d$ix], type='l', col='green', lwd=2)
quants = apply(yfit, 2, quantile, probs=c(0.025,0.975))
lines(d$x, quants[1,][d$ix], type='l', lty=3, col='green', lwd=2)
lines(d$x, quants[2,][d$ix], type='l', lty=3, col='green', lwd=2)

The plotted intervals are for the mean curve.

Prediction for the prestige data

To calculate predictive intervals, we also need to consider the residual variance σ².
q = matrix(rep(0,204), nrow=102)
for (j in 1:102){
  z = rnorm(1000, mod$yhat[,j], sqrt(mod$s2))
  q[j,] = quantile(z, probs=c(0.025,0.975))
}
lines(d$x, q[,1][d$ix], type='l', lwd=2, col='green', lty=2)
lines(d$x, q[,2][d$ix], type='l', lwd=2, col='green', lty=2)

Neural networks

A difficulty with both polynomial and spline regression is that they are hard to extend efficiently to the case of highly multivariate x. An alternative is to use feed-forward neural networks (deep learning).
The basic model assumes that, when we have inputs x = (x_1, ..., x_p)^T, then

g(x) = β_0 + Σ_{k=1}^{S} w_k ψ(a_k + Σ_{j=1}^{p} b_{kj} x_j)

where ψ(·) is a suitable basis (activation) function, such as the logistic function

ψ(x) = 1/(1 + exp(−x)).
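
As an illustrative sketch (not from the slides), the single-hidden-layer model above can be written as a small R function; the weights in the example call are arbitrary.

logistic = function(z) 1/(1 + exp(-z))
nn.forward = function(x, beta0, w, a, B){
  # x: input vector of length p; w, a: vectors of length S; B: S x p matrix of weights b_kj
  h = logistic(a + B %*% x)                   # hidden-layer outputs, length S
  as.numeric(beta0 + sum(w * h))              # g(x) = beta0 + sum_k w_k h_k
}
set.seed(1)
nn.forward(x=c(0.5, -1), beta0=0.1, w=rnorm(3), a=rnorm(3), B=matrix(rnorm(6), nrow=3))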

Bayesian inference for neural nets

Classically, we need to use penalized regression to avoid fitting too many neurons (keep S small). Using Bayesian priors we can incorporate suitable penalization.
Exact Bayesian inference can be carried out via MCMC, but this is typically very slow.
More recent approaches use approximations (variational Bayes or empirical Bayes) to find a fast, tractable estimate of the posterior distributions.

Prediction for the prestige data

library(brnn)
out = brnn(Prestige$prestige ~ Prestige$income, neurons=2)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, predict(out)[d$ix], type='l', lwd=2, col='magenta')

Gaussian processes

The most general approach is to directly define a prior distribution for the function g(·). The usual approach is via the Gaussian process.
Assume we have an input x. Then the associated function value is g(x).
g(·) is said to follow a Gaussian process if, for any n and any set of inputs (x_1, ..., x_n), the joint distribution of g(x_1), ..., g(x_n) is multivariate normal.
The process is determined by its mean function (usually set to 0 a priori for all x) and its covariance function.

Covariance function

We want K (x, x0 ) → 0 as x0 is more distant from x. A simple covariance


function is the squared exponential
 !
1 ||x − x0 || 2

0 2
K (x, x ) = σ0 exp −
2 λ

Many other possibilities can be considered.
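
As an illustrative sketch (not from the slides), the squared exponential covariance can be coded directly; the values of σ_0 and λ below are arbitrary.

sqexp = function(x, xp, sigma0=1, lambda=1){
  sigma0^2 * exp(-0.5 * (sqrt(sum((x - xp)^2)) / lambda)^2)
}
xgrid = seq(0, 1, length.out=5)
Kn = outer(xgrid, xgrid, Vectorize(function(a, b) sqexp(a, b, sigma0=1, lambda=0.3)))
round(Kn, 3)                                  # covariance matrix for the grid of inputs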

Inference
Assume we have data (x_1, y_1), ..., (x_n, y_n). Then,

y_n = (y_1, ..., y_n)^T ∼ N(0, K_n + σ²I)

where (K_n)_{ij} = K(x_i, x_j) and I is the identity matrix.

To make predictions for test data (x_t, y_t), observe that the joint distribution of (y_n, y_t) is normal, with covariance matrix

[ K_n    K_nt ]
[ K_tn   K_tt ]

A posteriori, y_t | y_n is then also normal, with mean

K_tn (K_n + σ²I)⁻¹ y_n

and variance

K_tt − K_tn (K_n + σ²I)⁻¹ K_nt + σ².
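
The formulas above translate almost directly into code. The following is a minimal sketch (not from the slides), using the squared exponential covariance with scalar inputs and assuming the hyperparameters σ_0, λ and σ are known.

gp.predict = function(x, y, xt, sigma0=1, lambda=1, sigma=0.1){
  k = function(a, b) sigma0^2 * exp(-0.5 * ((a - b)/lambda)^2)    # squared exponential kernel
  Kn  = outer(x,  x,  k)
  Ktn = outer(xt, x,  k)
  Ktt = outer(xt, xt, k)
  A   = solve(Kn + sigma^2 * diag(length(x)))                     # (K_n + sigma^2 I)^{-1}
  mu  = Ktn %*% A %*% y                                           # posterior predictive mean
  V   = Ktt - Ktn %*% A %*% t(Ktn) + sigma^2 * diag(length(xt))   # posterior predictive covariance
  list(mean=as.numeric(mu), sd=sqrt(diag(V)))
}
# Toy example with simulated data
set.seed(1)
x = runif(20); y = sin(2*pi*x) + rnorm(20, sd=0.1)
fit = gp.predict(x, y, xt=seq(0, 1, by=0.05), lambda=0.2)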
Inference

Prediction is easy given the model parameters. The only difficulty, therefore, is estimation of σ² and the parameters of the covariance function.
This can be done via MCMC, but that is very slow; usually, approximations via e.g. likelihood maximization or variational Bayes are preferred.

Gaussian processes and alternative models

Gaussian process regression can often be thought of as a limiting case of other models:
In the case of neural nets, if we assume that the intercept β_0 and the weights w_k are independent and that the remaining terms are i.i.d., then, by the central limit theorem, as the number of neurons S → ∞ the neural network model approaches a GP.
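
A minimal sketch (not from the slides) of this idea: draw many random single-hidden-layer networks with i.i.d. weights and examine the distribution of g(x) at a fixed input. The priors used, including the 1/√S scaling of the weights, are illustrative assumptions.

g.random.net = function(S, x0=0.3){
  a = rnorm(S); b = rnorm(S)                  # i.i.d. hidden-layer parameters
  w = rnorm(S, sd=1/sqrt(S))                  # weights scaled so the variance stays finite
  sum(w / (1 + exp(-(a + b*x0))))             # g(x0), omitting the intercept
}
samples = replicate(2000, g.random.net(S=200))
qqnorm(samples); qqline(samples)              # close to a straight line: approximately normal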

Prediction for the prestige data

library(tgp)
xmin = min(Prestige$income)
xmax = max(Prestige$income)
xx = xmin + c(0:100)*(xmax - xmin)/100
yfit = bgp(X=Prestige$income, Z=Prestige$prestige, XX=xx, verb=0)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige, ylim=c(0,100))
with(Prestige, lines(lowess(income, prestige, f=0.5, iter=0), lwd=2))
lines(xx, yfit$ZZ.mean, type='l', lwd=2, col='red')
lines(xx, yfit$ZZ.q1, type='l', lty=2, lwd=2, col='red')
lines(xx, yfit$ZZ.q2, type='l', lty=2, lwd=2, col='red')

Bayesian classification
We have test and training data divided into different groups.

There are 100 training data points, each with 6 x values. Data are classified into 3 classes according to the size of the sum of squares of the first two components.

[Figure: training data, X2 against X1.]
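
The slides do not show the data-generating code. As a hedged sketch, data of the kind described could be generated as follows; the uniform design on [−1, 1] and the class thresholds (0.4 and 0.8) are assumptions for illustration only.

set.seed(1)
n = 100; p = 6
X = matrix(runif(n*p, -1, 1), ncol=p)                      # assumed uniform inputs
ss = X[,1]^2 + X[,2]^2                                     # sum of squares of the first two components
C = cut(ss, breaks=c(-Inf, 0.4, 0.8, Inf), labels=1:3)     # hypothetical class thresholds
plot(X[,1], X[,2], col=as.integer(C), pch=19, xlab="X1", ylab="X2")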

Test data
There are 500 test data points. The objective is to fit a model to the training data and use it to predict the test data.

[Figure: test data, X2 against X1.]

Linear probit regression
Suppose data come from two classes (C = 0 or 1).
The (linear) probit model assumes that
 
P(C = 1|x) = Φ(x^T θ)

where Φ(·) is the standard normal c.d.f.


We can treat this as a latent variable model by defining

Y = x^T θ + ε

where ε ∼ N(0, 1), and then setting

C = 1 if Y > 0, and C = 0 otherwise.
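
As an illustrative sketch (not from the slides), the latent-variable representation can be checked by simulation; θ and the input below are arbitrary example values.

set.seed(1)
theta = c(-0.5, 1.2)                          # example coefficients
x = c(1, 0.8)                                 # intercept term plus one covariate value
Y = sum(x*theta) + rnorm(10000)               # latent Y = x^T theta + eps, eps ~ N(0,1)
mean(Y > 0)                                   # empirical P(C = 1 | x)
pnorm(sum(x*theta))                           # matches Phi(x^T theta)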

Nonparametric, multinomial probit regression

The simple probit model only allows for two classes and assumes an
underlying linear structure.
A parametric multinomial probit regression model is C_i ∼ categorical(p_{i1}, ..., p_{ik}), where p_{ij} = P(C_i = j | x_i).
The underlying latent variables are Y_{ij} = β_j x_i + ε_{ij} for j = 1, ..., k. We set C_i = j if Y_{ij} > Y_{ir} for all r ≠ j.
We can generalize to a non-linear case by setting Y_{ij} = g_j(x_i) + ε_{ij} and modeling the functions g_j with GP priors.
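
A minimal sketch (not from the slides) of the latent-utility mechanism, using hypothetical class-specific functions g_j and a single scalar input:

set.seed(1)
g = list(function(x) 1 - x^2,                 # hypothetical g_1
         function(x) 0.5*x,                   # hypothetical g_2
         function(x) x^2 - 0.5)               # hypothetical g_3
xi = 0.7
Y = sapply(g, function(gj) gj(xi)) + rnorm(3) # Y_ij = g_j(x_i) + eps_ij
which.max(Y)                                  # C_i = class with the largest latent utility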

Model fitting
[Figures: test data (left) and model predictions (right), X2 against X1.]

Classification errors
Less than 3% of the data are incorrectly classified.

[Figure: X2 against X1.]

Probabilities for misclassified points

True class   Predicted class   P(C = 1)   P(C = 2)   P(C = 3)
1            3                 0.39       0.16       0.45
1            2                 0.44       0.45       0.11
1            3                 0.39       0.20       0.41
1            2                 0.43       0.12       0.45
...          ...               ...        ...        ...
3            1                 0.48       0.09       0.43

When misclassification occurs, it is by a small probability difference.

Conclusions and extensions

We have illustrated various methods for fitting non-parametric regressions and shown how classification can be implemented via multinomial probit with Gaussian processes.

In the next session, we examine a problem related to classification: clustering.
