
Chapter 4: Non-linear and non-parametric regression and classification

Conchi Ausı́n and Mike Wiper


Department of Statistics
Universidad Carlos III de Madrid

Bayesian Methods for Big Data (Masters in Big Data)
Outline and learning objectives
Introduce the ideas of Bayesian nonparametric regression modeling and classification.

[Figure: Prestige against Years of Education.]

R packages used

rm(list=ls())
if(!require(car)) install.packages("car")
if(!require(MCMCglmm)) install.packages("MCMCglmm")
if(!require(BASS)) install.packages("BASS")
if(!require(brnn)) install.packages("brnn")
if(!require(tgp)) install.packages("tgp")

Other packages are available: blm or BayesSummaryStatLM for linear models, cpr for spline regression, GPfit for Gaussian process models.

Non-linearly related data

The Prestige data set contains the prestige rating, average income and educational level for various professions.

library(car)
help("Prestige")
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)

We can observe a clear non-linear relationship between prestige and income.

A general regression model

The basic nonparametric regression model is of the form

y = g(x) + ε

where g(·) represents an unknown functional form and, typically, the error ε is assumed to be normally distributed, ε ∼ N(0, σ²).
with(Prestige, lines(lowess(income, prestige, f=0.5, iter=0), lwd=2))

The lowess command gives a classical, kernel based, local regression fit.

Polynomial regression

One parametric way to fit the data could be via polynomial regression, setting

g(x) = Σ_{i=0}^{k} β_i x^i.

However:
For large k, the matrix X^T X often becomes ill-conditioned ...
... which complicates both classical regression and Bayesian regression with flat priors.
To avoid this, we can use orthogonal polynomials.
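
As a quick illustration (not from the original slides), we can compare the condition number of X^T X for a raw polynomial basis with that of an orthogonal polynomial basis; the degree (8) is an arbitrary example value.

library(car)                         # for the Prestige data
x = Prestige$income
X.raw = outer(x, 0:8, "^")           # raw polynomial basis of degree 8
X.orth = cbind(1, poly(x, 8))        # orthogonal polynomial basis
kappa(crossprod(X.raw))              # astronomically large: ill-conditioned
kappa(crossprod(X.orth))             # close to 1: well-conditioned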

Orthogonal polynomials

Suppose the data are (x_1, y_1), ..., (x_n, y_n).

Assume that g(x) = Σ_{i=0}^{k} β_i P_i(x), where P_i(x) is an orthogonal polynomial of order i such that

Σ_{r=1}^{n} P_i(x_r) P_j(x_r) = 0

for all j ≠ i.
The orthogonal polynomials are easily calculated via e.g. Gram-Schmidt orthogonalization:

poly(Prestige$income,2)
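
A quick check (an illustrative sketch, not from the slides) confirms that the columns returned by poly() are orthogonal, and in fact orthonormal:

P = poly(Prestige$income, 2)         # Prestige comes from the car package, loaded above
round(crossprod(P), 10)              # approximately the identity matrix
round(colSums(P), 10)                # each column also sums to approximately zero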

Orthogonal polynomials

The classical MLE and the Bayesian posterior mean given uniform priors for the β_i is

β̂_i = ( Σ_{r=1}^{n} P_i(x_r) y_r ) / ( Σ_{r=1}^{n} P_i(x_r)² ).

The order of the polynomial can be selected using an information criterion (e.g. AIC or BIC in the classical case or DIC in the Bayesian case).
Bayesian inference can be done in a conjugate way or via MCMC.
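
As an illustrative sketch (not from the slides), the estimator above can be computed directly and compared with the fit from lm() on the orthogonal basis:

y = Prestige$prestige
P = cbind(1, poly(Prestige$income, 2))                 # P_0 = 1 plus the orthogonal terms
beta.hat = colSums(P * y) / colSums(P^2)               # beta_i = sum_r P_i(x_r) y_r / sum_r P_i(x_r)^2
beta.hat
coef(lm(prestige ~ poly(income, 2), data=Prestige))    # should agree up to numerical error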

Model selection
For the prestige data, we can implement polynomial regressions of various
orders and compare via DIC.

dic = rep(0,5)
library(MCMCglmm)
c = MCMCglmm(prestige ~ poly(income,1), data=Prestige)
dic[1] = c$DIC
c = MCMCglmm(prestige ~ poly(income,2), data=Prestige)
dic[2] = c$DIC
c = MCMCglmm(prestige ~ poly(income,3), data=Prestige)
dic[3] = c$DIC
c = MCMCglmm(prestige ~ poly(income,4), data=Prestige)
dic[4] = c$DIC
c = MCMCglmm(prestige ~ poly(income,5), data=Prestige)
dic[5] = c$DIC
dic

A second-order polynomial is selected.


Convergence checking

As we are using MCMC, it is important to perform various checks on convergence.

plot(c(1:1000), c$Sol[,1], type='l', xlab='index',
     ylab=expression(beta[0]), col='red')
plot(c(1:1000), cumsum(c$Sol[,1])/c(1:1000), type='l', xlab='index',
     ylab=expression(paste('E[', beta[0], ']')), col='red')
plot(acf(c$Sol[,1]), main=expression(paste('ACF plot of the generated values of ', beta[0])),
     col='red', lwd=2)
plot(density(c$Sol[,1]), xlab=expression(beta[0]), ylab='f',
     main=expression(paste('Posterior density of ', beta[0])), lwd=2, col='red')

Prediction

In order to carry out prediction, for each set of sampled values β_0^[r], β_1^[r], β_2^[r], σ^[r] and a given value of x, we simply generate y^[r] from a normal distribution with mean g^[r](x) = Σ_{i=0}^{2} β_i^[r] P_i(x) and standard deviation σ^[r], and take an average.

c = MCMCglmm(prestige ~ poly(income,2), data=Prestige)
yhatpoly = predict(c)
d = sort(Prestige$income, index.return=TRUE)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, yhatpoly[d$ix], type='l', col='red', lwd=2)

Predictive uncertainty

Prediction uncertainty can be incorporated by taking a 95% interval of the set of MCMC sampled values y^[r].

q = matrix(rep(0,204), nrow=102)
vv = poly(Prestige$income, 2)
for (j in 1:102){
  m = c$Sol[,1] + c$Sol[,2]*vv[j,1] + c$Sol[,3]*vv[j,2]
  z = rnorm(1000, m, sqrt(c$VCV))
  q[j,] = quantile(z, probs=c(0.025,0.975))
}
lines(d$x, q[,1][d$ix], type='l', lwd=2, col='red', lty=2)
lines(d$x, q[,2][d$ix], type='l', lwd=2, col='red', lty=2)

Regression splines
Often, high-order polynomial models are required to provide an adequate fit, so polynomial regression can be very inefficient. However, lower-order polynomials might provide good local fits to different sections of the data.
Spline regression divides the range of x into regions and fits each region via a polynomial regression, with the constraint that the curves are joined at the region boundaries.

[Figure: Prestige against Years of Education.]

Regression spline structure

g(x) = β_0 + Σ_{m=1}^{M} B_m(x)

B_m(x) = Π_{k=1}^{K_m} g_{km} [s_{km}(x_{v_{km}} − t_{km})]_+^α

g_{km} = [(s_{km} + 1)/2 − s_{km} t_{km}]^α

where s_{km} ∈ {−1, 1} is a sign, t_{km} ∈ [0, 1] is a knot, v_{km} selects a variable, K_m is the degree of interaction, α determines the degree of the polynomial fit and g_{km} normalizes the basis function. The function [·]_+ is defined as max(0, ·).
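
To make the notation concrete, here is a small sketch (not from the slides) that evaluates a single univariate basis function of this form; the sign, knot and degree used are arbitrary example values.

# One univariate basis function: B(x) = g * [s*(x - t)]_+^alpha
basis1d = function(x, s=1, t=0.5, alpha=1){
  g = ((s + 1)/2 - s*t)^alpha                 # normalizing constant, as defined above
  g * pmax(0, s*(x - t))^alpha                # truncated power term
}
x = seq(0, 1, length.out=101)
plot(x, basis1d(x, s=1, t=0.4), type='l', ylab='B(x)')
lines(x, basis1d(x, s=-1, t=0.6), lty=2)      # opposite sign and a different knot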

Bayesian spline regression

When fitting regression splines, lower-order polynomials are typically considered. Then, the fit is mainly determined by the number and position of the knots.
In a classical context, knots are often chosen by eye.
In a Bayesian context, we can place a prior distribution on these and then carry out inference via transdimensional MCMC.
Prediction can be carried out in a similar way as for polynomial regression.

Prediction for the prestige data

library(BASS)
mod = bass(Prestige$income, Prestige$prestige)
yfit = predict(mod, Prestige$income, verbose=T)
yhat.splines = colMeans(yfit)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, yhat.splines[d$ix], type='l', col='green', lwd=2)
quants = apply(yfit, 2, quantile, probs=c(0.025,0.975))
lines(d$x, quants[1,][d$ix], type='l', lty=3, col='green', lwd=2)
lines(d$x, quants[2,][d$ix], type='l', lty=3, col='green', lwd=2)

The plotted intervals are for the mean curve.

Prediction for the prestige data

To calculate predictive intervals, we also need to consider the residual variance σ².
q = matrix(rep(0,204), nrow=102)
for (j in 1:102){
  z = rnorm(1000, mod$yhat[,j], sqrt(mod$s2))
  q[j,] = quantile(z, probs=c(0.025,0.975))
}
lines(d$x, q[,1][d$ix], type='l', lwd=2, col='green', lty=2)
lines(d$x, q[,2][d$ix], type='l', lwd=2, col='green', lty=2)

Neural networks

A difficulty with both polynomial and spline regression is that they are hard to extend efficiently to the case of highly multivariate x. An alternative is to use feed-forward neural networks (deep learning).
The basic model assumes that, when we have inputs x = (x_1, ..., x_p)^T, then

g(x) = β_0 + Σ_{k=1}^{S} w_k ψ(a_k + Σ_{j=1}^{p} b_{kj} x_j)

where ψ(·) is a suitable basis (activation) function, such as the logistic function

ψ(x) = 1/(1 + exp(−x)).
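
As an illustrative sketch (not from the slides), the single-hidden-layer model above can be written as a small R function; the weights in the example call are arbitrary.

logistic = function(z) 1/(1 + exp(-z))
nn.forward = function(x, beta0, w, a, B){
  # x: input vector of length p; w, a: vectors of length S; B: S x p matrix of weights b_kj
  h = logistic(a + B %*% x)                   # hidden-layer outputs, length S
  as.numeric(beta0 + sum(w * h))              # g(x) = beta0 + sum_k w_k h_k
}
set.seed(1)
nn.forward(x=c(0.5, -1), beta0=0.1, w=rnorm(3), a=rnorm(3), B=matrix(rnorm(6), nrow=3))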

Bayesian inference for neural nets

Classically, we need to use penalized regression to avoid fitting too many neurons (keep S small). Using Bayesian priors we can incorporate suitable penalization.
Exact Bayesian inference can be carried out via MCMC, but this is typically very slow.
More recent approaches use approximations (variational Bayes or empirical Bayes) to find a fast, tractable estimate of the posterior distributions.

Prediction for the prestige data

library(brnn)
out = brnn(Prestige$prestige ~ Prestige$income, neurons=2)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige)
lines(d$x, predict(out)[d$ix], type='l', lwd=2, col='magenta')

Gaussian processes

The most general approach is to directly define a prior distribution for the function g(·). The usual approach is via the Gaussian process.
Assume we have an input x. Then the associated function value is g(x).
g(·) is said to follow a Gaussian process if, for any n and any set of inputs (x_1, ..., x_n), the joint distribution of g(x_1), ..., g(x_n) is multivariate normal.
The process is determined by its mean function (usually set to 0 a priori for all x) and its covariance function.

Covariance function

We want K (x, x0 ) → 0 as x0 is more distant from x. A simple covariance


function is the squared exponential
 !
1 ||x − x0 || 2

0 2
K (x, x ) = σ0 exp −
2 λ

Many other possibilities can be considered.
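
As an illustrative sketch (not from the slides), the squared exponential covariance can be coded directly; the values of σ_0 and λ below are arbitrary.

sqexp = function(x, xp, sigma0=1, lambda=1){
  sigma0^2 * exp(-0.5 * (sqrt(sum((x - xp)^2)) / lambda)^2)
}
xgrid = seq(0, 1, length.out=5)
Kn = outer(xgrid, xgrid, Vectorize(function(a, b) sqexp(a, b, sigma0=1, lambda=0.3)))
round(Kn, 3)                                  # covariance matrix for the grid of inputs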

Inference
Assume we have data (x_1, y_1), ..., (x_n, y_n). Then,

y_n = (y_1, ..., y_n)^T ∼ N(0, K_n + σ²I)

where (K_n)_{ij} = K(x_i, x_j) and I is the identity matrix.

To make predictions for test data (x_t, y_t), observe that the joint distribution of (y_n, y_t) is normal, with covariance matrix

[ K_n    K_nt ]
[ K_tn   K_tt ]

A posteriori, y_t | y_n is then also normal, with mean

K_tn (K_n + σ²I)⁻¹ y_n

and variance

K_tt − K_tn (K_n + σ²I)⁻¹ K_nt + σ².
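
The formulas above translate almost directly into code. The following is a minimal sketch (not from the slides), using the squared exponential covariance with scalar inputs and assuming the hyperparameters σ_0, λ and σ are known.

gp.predict = function(x, y, xt, sigma0=1, lambda=1, sigma=0.1){
  k = function(a, b) sigma0^2 * exp(-0.5 * ((a - b)/lambda)^2)    # squared exponential kernel
  Kn  = outer(x,  x,  k)
  Ktn = outer(xt, x,  k)
  Ktt = outer(xt, xt, k)
  A   = solve(Kn + sigma^2 * diag(length(x)))                     # (K_n + sigma^2 I)^{-1}
  mu  = Ktn %*% A %*% y                                           # posterior predictive mean
  V   = Ktt - Ktn %*% A %*% t(Ktn) + sigma^2 * diag(length(xt))   # posterior predictive covariance
  list(mean=as.numeric(mu), sd=sqrt(diag(V)))
}
# Toy example with simulated data
set.seed(1)
x = runif(20); y = sin(2*pi*x) + rnorm(20, sd=0.1)
fit = gp.predict(x, y, xt=seq(0, 1, by=0.05), lambda=0.2)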
Inference

Prediction is easy given the model parameters. The only difficulty, therefore, is estimation of σ² and the parameters of the covariance function.
This can be done via MCMC, but that is very slow; usually, approximations via e.g. likelihood maximization or variational Bayes are preferred.

Gaussian processes and alternative models

Gaussian process regression can often be thought of as a limiting case of other models:
In the case of neural nets, if we assume that the intercept β_0 and the weights w_k are independent and that the remaining terms are i.i.d., then, by the central limit theorem, as the number of neurons S → ∞ the neural network model approaches a GP.
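
A minimal sketch (not from the slides) of this idea: draw many random single-hidden-layer networks with i.i.d. weights and examine the distribution of g(x) at a fixed input. The priors used, including the 1/√S scaling of the weights, are illustrative assumptions.

g.random.net = function(S, x0=0.3){
  a = rnorm(S); b = rnorm(S)                  # i.i.d. hidden-layer parameters
  w = rnorm(S, sd=1/sqrt(S))                  # weights scaled so the variance stays finite
  sum(w / (1 + exp(-(a + b*x0))))             # g(x0), omitting the intercept
}
samples = replicate(2000, g.random.net(S=200))
qqnorm(samples); qqline(samples)              # close to a straight line: approximately normal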

Prediction for the prestige data

library(tgp)
xmin = min(Prestige$income)
xmax = max(Prestige$income)
xx = xmin + c(0:100)*(xmax - xmin)/100
yfit = bgp(X=Prestige$income, Z=Prestige$prestige, XX=xx, verb=0)
plot(prestige ~ income, xlab="Average Income", ylab="Prestige", data=Prestige, ylim=c(0,100))
with(Prestige, lines(lowess(income, prestige, f=0.5, iter=0), lwd=2))
lines(xx, yfit$ZZ.mean, type='l', lwd=2, col='red')
lines(xx, yfit$ZZ.q1, type='l', lty=2, lwd=2, col='red')
lines(xx, yfit$ZZ.q2, type='l', lty=2, lwd=2, col='red')

Bayesian classification
We have test and training data divided into different groups.

There are 100 training data points, each with 6 x values. Data are classified into 3 classes according to the size of the sum of squares of the first two components.

[Figure: training data, X2 against X1.]
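
The slides do not show the data-generating code. As a hedged sketch, data of the kind described could be generated as follows; the uniform design on [−1, 1] and the class thresholds (0.4 and 0.8) are assumptions for illustration only.

set.seed(1)
n = 100; p = 6
X = matrix(runif(n*p, -1, 1), ncol=p)                      # assumed uniform inputs
ss = X[,1]^2 + X[,2]^2                                     # sum of squares of the first two components
C = cut(ss, breaks=c(-Inf, 0.4, 0.8, Inf), labels=1:3)     # hypothetical class thresholds
plot(X[,1], X[,2], col=as.integer(C), pch=19, xlab="X1", ylab="X2")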

Test data
There are 500 test data points. The objective is to fit a model to the training data and use it to predict the test data.

[Figure: test data, X2 against X1.]

Linear probit regression
Suppose data come from two classes (C = 0 or 1).
The (linear) probit model assumes that
 
P(C = 1|x) = Φ(x^T θ)

where Φ(·) is the standard normal c.d.f.


We can treat this as a latent variable model by defining

Y = x^T θ + ε

where ε ∼ N(0, 1), and then setting

C = 1 if Y > 0, and C = 0 otherwise.
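
As an illustrative sketch (not from the slides), the latent-variable representation can be checked by simulation; θ and the input below are arbitrary example values.

set.seed(1)
theta = c(-0.5, 1.2)                          # example coefficients
x = c(1, 0.8)                                 # intercept term plus one covariate value
Y = sum(x*theta) + rnorm(10000)               # latent Y = x^T theta + eps, eps ~ N(0,1)
mean(Y > 0)                                   # empirical P(C = 1 | x)
pnorm(sum(x*theta))                           # matches Phi(x^T theta)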

Nonparametric, multinomial probit regression

The simple probit model only allows for two classes and assumes an
underlying linear structure.
A parametric multinomial probit regression model is C_i ∼ categorical(p_{i1}, ..., p_{ik}), where p_{ij} = P(C_i = j | x_i).
The underlying latent variables are Y_{ij} = β_j x_i + ε_{ij} for j = 1, ..., k. We set C_i = j if Y_{ij} > Y_{ir} for all r ≠ j.
We can generalize to a non-linear case by setting Y_{ij} = g_j(x_i) + ε_{ij} and modeling the functions g_j with GP priors.
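
A minimal sketch (not from the slides) of the latent-utility mechanism, using hypothetical class-specific functions g_j and a single scalar input:

set.seed(1)
g = list(function(x) 1 - x^2,                 # hypothetical g_1
         function(x) 0.5*x,                   # hypothetical g_2
         function(x) x^2 - 0.5)               # hypothetical g_3
xi = 0.7
Y = sapply(g, function(gj) gj(xi)) + rnorm(3) # Y_ij = g_j(x_i) + eps_ij
which.max(Y)                                  # C_i = class with the largest latent utility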

Model fitting
[Figures: test data (left) and model predictions (right), X2 against X1.]

Classification errors
Less than 3% of the data are incorrectly classified.

[Figure: X2 against X1.]

Probabilities for misclassified points

True class   Predicted class   P(C = 1)   P(C = 2)   P(C = 3)
1            3                 0.39       0.16       0.45
1            2                 0.44       0.45       0.11
1            3                 0.39       0.20       0.41
1            2                 0.43       0.12       0.45
...          ...               ...        ...        ...
3            1                 0.48       0.09       0.43

When misclassification occurs, it is by a small probability difference.

Conclusions and extensions

We have illustrated various methods for fitting non-parametric regressions and shown how classification can be implemented via multinomial probit with Gaussian processes.

In the next session, we examine a problem related to classification: clustering.
