Bayesian analysis of radial basis function networks

By S. KONISHI, T. ANDO and S. IMOTO

Biometrika (2004), pp. 27–43
© 2004 Biometrika Trust

SUMMARY
By extending Schwarz’s (1978) basic idea we derive a Bayesian information criterion
which enables us to evaluate models estimated by the maximum penalised likelihood
method or the method of regularisation. The proposed criterion is applied to the choice
of smoothing parameters and the number of basis functions in radial basis function network models. Monte Carlo experiments were conducted to examine the performance of
the nonlinear modelling strategy of estimating the weight parameters by regularisation
and then determining the adjusted parameters by the Bayesian information criterion. The
simulation results show that our modelling procedure performs well in various situations.
Some key words: Bayes approach; Maximum penalised likelihood; Model selection; Neural network; Nonlinear
regression.
1. INTRODUCTION
Recent years have seen the development of various types of nonlinear model such as
neural networks, kernel methods and splines. Nonlinear models are generally characterised
by including a large number of parameters. Since maximum likelihood methods yield unstable parameter estimates, the adopted model is usually estimated by the maximum penalised likelihood method (Good & Gaskins, 1971; Green & Silverman, 1994), the method of regularisation, or the Bayes approach.
Crucial issues with nonlinear modelling are the choice of a smoothing parameter, the
number of basis functions in splines and the number of hidden units in neural networks.
Choosing these parameters in the modelling process can be viewed as a model selection
and evaluation problem. Schwarz (1978) proposed the Bayesian information criterion, BIC. However, theoretically, the BIC covers only models estimated by the maximum likelihood method. It still remains to construct a criterion for evaluating nonlinear models estimated by the maximum penalised likelihood method.
This paper has two aims. First, the BIC is extended to cover the evaluation of models estimated by the maximum penalised likelihood method. Secondly, the criterion is used
to construct radial basis function network nonlinear regression models. In §2, by extending
Schwarz’s basic ideas, we present various types of Bayesian information criterion. Section 3
describes nonlinear regression modelling based on radial basis function networks. In §4
we investigate the performance of the nonlinear modelling techniques, using Monte Carlo
simulations. Section 5 provides discussion, including possibilities for future work.
2. BAYESIAN INFORMATION CRITERIA

The posterior probability of model $M_k$, with prior density $\pi_k(\theta_k\mid\lambda_k)$ for its parameter vector $\theta_k$, is proportional to
$$\mathrm{pr}(M_k)\int f_k(y\mid\theta_k)\,\pi_k(\theta_k\mid\lambda_k)\,d\theta_k=\mathrm{pr}(M_k)\,f_k(y\mid\lambda_k). \qquad(1)$$
The quantity $f_k(y\mid\lambda_k)$, obtained by integrating over the parameter space $\Theta_k$, is the marginal probability of the data $y$ under model $M_k$, and it can be rewritten as
$$f_k(y\mid\lambda_k)=\int\exp\{n\,q_k(\theta_k\mid y,\lambda_k)\}\,d\theta_k, \qquad(2)$$
where
$$q_k(\theta_k\mid y,\lambda_k)=n^{-1}\{\log f_k(y\mid\theta_k)+\log\pi_k(\theta_k\mid\lambda_k)\}. \qquad(3)$$
We first consider the case where $\log\pi_k(\theta_k\mid\lambda_k)=O(n)$. Let $\hat\theta_k$ be the mode of $q_k(\theta_k\mid y,\lambda_k)$. Then, using the Laplace method for integrals in the Bayesian framework developed by Tierney & Kadane (1986), Tierney et al. (1989) and Kass et al. (1990), we have under certain regularity conditions the Laplace approximation to the marginal distribution (2) in the form
$$f_k(y\mid\lambda_k)=\int\exp\{n\,q_k(\theta_k\mid y,\lambda_k)\}\,d\theta_k=\frac{(2\pi)^{p_k/2}}{n^{p_k/2}\,|Q_{\lambda_k}(\hat\theta_k)|^{1/2}}\exp\{n\,q_k(\hat\theta_k\mid y,\lambda_k)\}\{1+O_p(n^{-1})\}, \qquad(4)$$
where
$$Q_{\lambda_k}(\hat\theta_k)=-\left.\frac{\partial^2 q_k(\theta_k\mid y,\lambda_k)}{\partial\theta_k\,\partial\theta_k^{\mathrm T}}\right|_{\theta_k=\hat\theta_k}.$$
Substituting the Laplace approximation in equation (1) and taking the logarithm of the resulting formula, we have
$$-2\log\{\mathrm{pr}(M_k)f_k(y\mid\lambda_k)\}=-2\log f_k(y\mid\hat\theta_k)-2\log\pi_k(\hat\theta_k\mid\lambda_k)+p_k\log n+\log|Q_{\lambda_k}(\hat\theta_k)|-2\log\mathrm{pr}(M_k)-p_k\log 2\pi+O_p(n^{-1}). \qquad(5)$$
Choosing the model with the largest posterior probability among a set of candidate models for given values of $\lambda_k$ is equivalent to choosing the model that minimises the criterion (5).
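As a quick numerical illustration of the Laplace approximation (4), the following Python sketch computes the approximate log marginal likelihood for a toy one-parameter Gaussian model with a Gaussian prior. The model, the function names and the finite-difference Hessian are our own illustrative assumptions, not the paper's construction.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=200)      # toy data
n = y.size

def q(theta, lam=1.0):
    # n^{-1}{log f(y|theta) + log pi(theta|lam)}, cf. (3); prior unnormalised
    loglik = -0.5 * np.sum((y - theta) ** 2) - 0.5 * n * np.log(2 * np.pi)
    return (loglik - 0.5 * lam * theta ** 2) / n

res = minimize(lambda t: -q(t[0]), x0=[0.0])      # mode theta_hat of q
theta_hat = res.x[0]
h = 1e-4                                          # finite-difference step
Q = -(q(theta_hat + h) - 2 * q(theta_hat) + q(theta_hat - h)) / h ** 2

# log f(y|lam) ~ n q(theta_hat) + (p/2) log(2 pi/n) - (1/2) log|Q|, here p = 1
log_marginal = n * q(theta_hat) + 0.5 * np.log(2 * np.pi / n) - 0.5 * np.log(Q)
print(theta_hat, log_marginal)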
We next consider the case where $\log\pi_k(\theta_k\mid\lambda_k)=O(1)$. Then the mode $\hat\theta_k$ of $q_k(\theta_k\mid y,\lambda_k)$ in (3) can be expanded as
$$\hat\theta_k=\hat\theta_k^{(\mathrm{ML})}+\frac1n J_k^{-1}(\hat\theta_k^{(\mathrm{ML})})\left.\frac{\partial}{\partial\theta_k}\log\pi_k(\theta_k\mid\lambda_k)\right|_{\theta_k=\hat\theta_k^{(\mathrm{ML})}}+O_p(n^{-2}), \qquad(6)$$
where $\hat\theta_k^{(\mathrm{ML})}$ is the maximum likelihood estimate of $\theta_k$ in the model $f_k(y\mid\theta_k)$ and
$$J_k(\theta_k)=-\frac1n\frac{\partial^2\log f_k(y\mid\theta_k)}{\partial\theta_k\,\partial\theta_k^{\mathrm T}}.$$
In this case the criterion (5) reduces, up to terms that do not affect the comparison of models, to Schwarz's (1978) original criterion
$$\mathrm{BIC}=-2\log f_k(y\mid\hat\theta_k^{(\mathrm{ML})})+p_k\log n. \qquad(7)$$
We now turn to models fitted by maximising the penalised loglikelihood function
$$\ell_\lambda(\theta)=\log f(y\mid\theta)-\frac{n\lambda}{2}\theta^{\mathrm T}D\theta. \qquad(8)$$
In the Bayesian framework this corresponds to the singular Gaussian prior density
$$\pi(\theta\mid\lambda)=(2\pi)^{-(p-d)/2}(n\lambda)^{(p-d)/2}|D|_+^{1/2}\exp\left(-\frac{n\lambda}{2}\theta^{\mathrm T}D\theta\right), \qquad(9)$$
in which $|D|_+$ is the product of the $(p-d)$ nonzero eigenvalues of $D$, $d$ being the dimension of its null space. For the Bayesian justification of the maximum penalised likelihood approach, we refer to Silverman (1985) and Wahba (1990, Ch. 1).
It follows from equation (4) that minus twice the log marginal likelihood, $-2\log\{f(y\mid\lambda)\}$, can be approximated by
$$-2\log\{f(y\mid\lambda)\}\approx-2\log f(y\mid\hat\theta)-2\log\pi(\hat\theta\mid\lambda)+p\log n+\log|Q_\lambda(\hat\theta)|-p\log 2\pi. \qquad(10)$$
By substituting the prior density (9) into this equation, we have
$$\mathrm{BIC}_P(\lambda)=-2\log f(y\mid\hat\theta)+n\lambda\hat\theta^{\mathrm T}D\hat\theta-d\log(2\pi/n)+\log|Q_\lambda(\hat\theta)|-\log|D|_+-(p-d)\log\lambda, \qquad(11)$$
where $\hat\theta$ is the mode of $q(\theta\mid y,\lambda)$, given as the solution of
$$\frac{\partial q(\theta\mid y,\lambda)}{\partial\theta}=\frac{\partial}{\partial\theta}\left\{\frac1n\log f(y\mid\theta)-\frac\lambda2\theta^{\mathrm T}D\theta+\frac{p-d}{2n}\log\frac{n\lambda}{2\pi}+\frac1{2n}\log|D|_+\right\}=0. \qquad(12)$$
This implies that $\hat\theta$ is the maximiser of the penalised loglikelihood function (8).
We choose the smoothing parameter $\lambda$ to minimise $\mathrm{BIC}_P(\lambda)$. In the context of model selection, we minimise $\mathrm{BIC}_P(\lambda)$ over $\lambda$ for each model, and then choose the model for which $\mathrm{BIC}_P(\lambda)$ is minimised over a set of competing models, which might consist of various types of nonparametric regression model estimated by the maximum penalised likelihood method.
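Schematically, this two-level rule can be written as the following Python sketch, in which criterion stands in for $\mathrm{BIC}_P(\lambda)$; the grid of $\lambda$ values and the toy criterion at the end are purely illustrative assumptions.

import numpy as np

def select(models, criterion, lambdas=np.logspace(-6, 1, 30)):
    # Minimise criterion over lambda within each candidate model,
    # then compare models at their minimising lambda.
    best_model, best_lam, best_val = None, None, np.inf
    for model in models:
        vals = np.array([criterion(model, lam) for lam in lambdas])
        i = int(vals.argmin())
        if vals[i] < best_val:
            best_model, best_lam, best_val = model, lambdas[i], vals[i]
    return best_model, best_lam, best_val

# Toy usage with a made-up criterion, purely for illustration:
models = [5, 10, 20]   # e.g. numbers of basis functions p
toy = lambda p, lam: (np.log10(lam) + p / 10.0) ** 2 + 0.1 * p
print(select(models, toy))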
3. RADIAL BASIS FUNCTION NETWORKS

We use Gaussian basis functions of the form
$$\phi_j(x)=\phi_j(x;\mu_j,\sigma_j)=\exp\left(-\frac{\|x-\mu_j\|^2}{2\nu\sigma_j^2}\right)\quad(j=1,2,\ldots,p),$$
where $\mu_j$ is the $q$-dimensional vector determining the location of the basis function, $\sigma_j^2$ determines the width, $\nu$ is a hyperparameter and $\|\cdot\|$ is the Euclidean norm. The hyperparameter $\nu$ adjusts the dispersion of the basis functions.
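In code, the basis above yields a design matrix $B$ with rows $b(x_a)=(1,\phi_1(x_a),\ldots,\phi_p(x_a))^{\mathrm T}$. The sketch below assumes, for illustration only, that the centres $\mu_j$ and widths $\sigma_j$ are fixed in advance; the paper's own method for choosing them is not shown here.

import numpy as np

def rbf_design_matrix(X, centres, widths, nu=1.0):
    # Return B with rows b(x_a) = (1, phi_1(x_a), ..., phi_p(x_a)).
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-sq_dist / (2.0 * nu * widths[None, :] ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])  # prepend intercept

# Example: p = 10 equally spaced centres on [0, 1], common width.
X = np.linspace(0, 1, 100)[:, None]
centres = np.linspace(0, 1, 10)[:, None]
widths = np.full(10, 0.1)
B = rbf_design_matrix(X, centres, widths, nu=10.0)
print(B.shape)  # (100, 11)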
The matrix $Q_\lambda(\hat w,\hat\psi)$ appearing in the criterion is given by
$$Q_\lambda(\hat w,\hat\psi)=\frac1{n\hat\psi}\begin{pmatrix}B^{\mathrm T}CB+n\hat\psi\lambda D_2^{\mathrm T}D_2 & B^{\mathrm T}e/\hat\psi\\ e^{\mathrm T}B/\hat\psi & -\hat\psi\sum_{a=1}^n q_a\end{pmatrix}.$$
Here $C$ is an $n\times n$ diagonal matrix, $e$ is an $n$-dimensional vector and $\sum_a q_a$ is the second derivative of $\ell_\lambda(w,\psi)/n$ with respect to $\psi$, with
$$C_{aa}=\frac{(y_a-\hat\mu_a)\{u'''(\hat\xi_a)h'(\hat\mu_a)+u''(\hat\xi_a)^2h''(\hat\mu_a)\}}{\{u''(\hat\xi_a)h'(\hat\mu_a)\}^3}+\frac1{u''(\hat\xi_a)h'(\hat\mu_a)^2},$$
$$e_a=(y_a-\hat\mu_a)/\{u''(\hat\xi_a)h'(\hat\mu_a)\},$$
$$q_a=2[y_a r\{\hat w^{\mathrm T}b(x_a)\}-s\{\hat w^{\mathrm T}b(x_a)\}]/\hat\psi^3+\left.\partial^2 v(y_a,\psi)/\partial\psi^2\right|_{\psi=\hat\psi},$$
for $a=1,\ldots,n$.
Canonical link functions relate the parameter $\xi_a$ in the exponential family (13) directly to the predictor $\eta_a=\sum_{j=1}^p w_j\phi_j(x_a)+w_0$ in (14), and lead to
$$f(y_a\mid x_a;\hat w,\hat\psi)=\exp\bigl([y_a\hat w^{\mathrm T}b(x_a)-u\{\hat w^{\mathrm T}b(x_a)\}]/\hat\psi+v(y_a,\hat\psi)\bigr). \qquad(17)$$
Then we have the following theorem.

THEOREM 2. Let $h(\cdot)$ be the canonical link function, so that $h(\cdot)=u'^{-1}(\cdot)$. Then a Bayesian information criterion for evaluating the statistical model $f(y_a\mid x_a;\hat w,\hat\psi)$ given by equation (17) is
$$\mathrm{BIC}_P^{(C)}=-2\sum_{a=1}^n\bigl([y_a\hat w^{\mathrm T}b(x_a)-u\{\hat w^{\mathrm T}b(x_a)\}]/\hat\psi+v(y_a,\hat\psi)\bigr)+n\lambda\hat w^{\mathrm T}D_2^{\mathrm T}D_2\hat w-3\log(2\pi/n)+\log|Q_\lambda^{(C)}(\hat w,\hat\psi)|-\log|D_2^{\mathrm T}D_2|_+-(p-1)\log\lambda,$$
where $Q_\lambda^{(C)}(\hat w,\hat\psi)$ is obtained by replacing $C$, $q_a$ and $e_a$ in $Q_\lambda(\hat w,\hat\psi)$ by their canonical-link counterparts.

Example 1: Radial basis function network Gaussian regression model. Maximum penalised likelihood estimation gives the model
$$f_N(y_a\mid x_a;\hat w,\hat\sigma^2)=\frac1{\sqrt{2\pi\hat\sigma^2}}\exp\left[-\frac1{2\hat\sigma^2}\{y_a-\hat w^{\mathrm T}b(x_a)\}^2\right]\quad(a=1,\ldots,n), \qquad(18)$$
where
$$\hat w=(B^{\mathrm T}B+n\beta D_2^{\mathrm T}D_2)^{-1}B^{\mathrm T}y,\qquad\hat\sigma^2=\frac1n\sum_{a=1}^n\{y_a-\hat w^{\mathrm T}b(x_a)\}^2,$$
with $\beta=\lambda\sigma^2$.
If in the Bayesian framework we specify a Gaussian prior density $\pi(w\mid\lambda)$ as in (9), the posterior distribution for $w$ is normal with mean vector $\hat w$ and covariance matrix $(B^{\mathrm T}B/\sigma^2+n\lambda D)^{-1}$; see Silverman (1985) and Spiegelhalter et al. (2002).
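The estimator $\hat w$ above is a generalised ridge regression and can be computed directly. The sketch below assumes that $D_2$ is the usual second-order difference matrix acting on the coefficient vector, which matches the $D_2^{\mathrm T}D_2$ penalty used throughout; it is our transcription, not the authors' code. A linear solve is used rather than an explicit matrix inverse, which is both faster and numerically safer.

import numpy as np

def second_difference_matrix(m):
    # (m-2) x m matrix with rows (..., 1, -2, 1, ...).
    D2 = np.zeros((m - 2, m))
    for i in range(m - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D2

def fit_gaussian_rbf(B, y, beta):
    # w_hat = (B'B + n beta D2'D2)^{-1} B'y, as in the estimator above.
    n, m = B.shape
    D2 = second_difference_matrix(m)
    w_hat = np.linalg.solve(B.T @ B + n * beta * D2.T @ D2, B.T @ y)
    sigma2_hat = np.mean((y - B @ w_hat) ** 2)
    return w_hat, sigma2_hat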
Taking $u(\hat\xi_a)=\hat\xi_a^2/2$, $\psi=\hat\sigma^2$, $v(y_a,\psi)=-y_a^2/(2\hat\sigma^2)-\log\{\hat\sigma\sqrt{2\pi}\}$ and $h(\hat\mu_a)=\hat\mu_a$ in Theorem 2, we obtain the following Bayesian information criterion for evaluating the statistical model $f_N(y_a\mid x_a;\hat w,\hat\sigma^2)$ in (18):
$$\mathrm{BIC}_P^{(N)}(\beta,\nu,p)=(n+p-1)\log\hat\sigma^2+n\beta\hat w^{\mathrm T}D_2^{\mathrm T}D_2\hat w/\hat\sigma^2+n+(n-3)\log(2\pi)+3\log n+\log|Q_\beta^{(G)}(\hat w,\hat\sigma^2)|-\log|D_2^{\mathrm T}D_2|_+-(p-1)\log\beta, \qquad(19)$$
where
$$Q_\beta^{(G)}(\hat w,\hat\sigma^2)=\frac1{n\hat\sigma^2}\begin{pmatrix}B^{\mathrm T}B+n\beta D_2^{\mathrm T}D_2 & B^{\mathrm T}e/\hat\sigma^2\\ e^{\mathrm T}B/\hat\sigma^2 & n/(2\hat\sigma^2)\end{pmatrix}.$$
For illustration, data $\{(y_a,x_a);a=1,\ldots,100\}$ were generated from the true model
$$y_a=0\cdot2\sin(3\pi x_a^2)+\exp\{-(x_a-0\cdot5)^2\}+\exp\{-16(x_a-0\cdot6)^2\}+\varepsilon_a,$$
with Gaussian noise $N(0,0\cdot16)$, where the design points are uniformly distributed in $[0,1]$.

[Fig. 1: Example 1. Comparison of the true curve (dashed lines) and the smoothed curves (solid lines).]

Figures 1(a) and (b) give the fitted curves and the radial basis functions with $\nu=1$ and $\nu=10$. The fitted curve in Fig. 1(a) is obviously undersmoothed, while the one in Fig. 1(b) gives a good representation of the underlying function over the region $[0,1]$.
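For concreteness, the data-generating step of this illustration can be reproduced as follows, reusing rbf_design_matrix and fit_gaussian_rbf from the earlier sketches; the centres, widths and value of $\beta$ below are our own choices, not the paper's.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 100)                      # uniform design points
f_true = (0.2 * np.sin(3 * np.pi * x**2)
          + np.exp(-(x - 0.5)**2)
          + np.exp(-16 * (x - 0.6)**2))
y = f_true + rng.normal(0.0, np.sqrt(0.16), 100)    # N(0, 0.16) noise

centres = np.linspace(0, 1, 10)[:, None]            # p = 10 basis functions
widths = np.full(10, 0.1)
B = rbf_design_matrix(x[:, None], centres, widths, nu=10.0)
w_hat, sigma2_hat = fit_gaussian_rbf(B, y, beta=1e-3)
print(sigma2_hat)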
Example 2: Radial basis function network logistic regression model. Let $y_1,\ldots,y_n$ be independent binary random variables with
$$\mathrm{pr}(Y_a=1\mid x_a)=\pi(x_a),\qquad\mathrm{pr}(Y_a=0\mid x_a)=1-\pi(x_a),$$
where the $x_a$ are $q$-dimensional explanatory variables. We model $\pi(x_a)$ by
$$\log\frac{\pi(x_a)}{1-\pi(x_a)}=\sum_{j=1}^p w_j\phi_j(x_a)+w_0.$$
Estimating the $(p+1)$-dimensional parameter vector $w$ by maximum penalised likelihood gives the model
$$f_L(y_a\mid x_a;\hat w)=\hat\pi(x_a)^{y_a}\{1-\hat\pi(x_a)\}^{1-y_a}\quad(a=1,\ldots,n), \qquad(20)$$
where $\hat\pi(x_a)=\exp\{\hat w^{\mathrm T}b(x_a)\}/[1+\exp\{\hat w^{\mathrm T}b(x_a)\}]$ is the estimated conditional probability.
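Maximum penalised likelihood for this model has no closed form; a standard penalised Newton (iteratively reweighted least squares) scheme is sketched below, with the same $n\lambda D_2^{\mathrm T}D_2$ penalty that appears in the criterion which follows. This is our own schematic, not the authors' algorithm, and it reuses second_difference_matrix from an earlier sketch.

import numpy as np

def fit_logistic_rbf(B, y, lam, n_iter=50, tol=1e-8):
    n, m = B.shape
    P = n * lam * second_difference_matrix(m).T @ second_difference_matrix(m)
    w = np.zeros(m)
    for _ in range(n_iter):
        eta = B @ w
        pi = 1.0 / (1.0 + np.exp(-eta))       # estimated pr(Y=1|x)
        W = pi * (1.0 - pi)                   # IRLS weights
        grad = B.T @ (y - pi) - P @ w         # penalised score
        hess = B.T @ (W[:, None] * B) + P     # penalised information
        step = np.linalg.solve(hess, grad)
        w += step
        if np.max(np.abs(step)) < tol:
            break
    return w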
By taking
$$u(\hat\xi_a)=\log\{1+\exp(\hat\xi_a)\},\qquad v(y_a,\psi)=0,\qquad h(\hat\mu_a)=\log\frac{\hat\mu_a}{1-\hat\mu_a},\qquad\psi=1,$$
in Theorem 2, we obtain
$$\mathrm{BIC}_P^{(L)}(\lambda,\nu,p)=2\sum_{a=1}^n\bigl(\log[1+\exp\{\hat w^{\mathrm T}b(x_a)\}]-y_a\hat w^{\mathrm T}b(x_a)\bigr)+n\lambda\hat w^{\mathrm T}D_2^{\mathrm T}D_2\hat w-2\log(2\pi/n)+\log|Q_\lambda^{(L)}(\hat w)|-\log|D_2^{\mathrm T}D_2|_+-(p-1)\log\lambda, \qquad(21)$$
where $Q_\lambda^{(L)}(\hat w)$ is the corresponding matrix given by Theorem 2.

For illustration, binary data were generated with design points uniformly distributed in $[0,1]$. We fitted the radial basis function network logistic model (20) to these data. Figure 2 shows the true and estimated conditional probability functions; the circles indicate the data.

[Fig. 2: Example 2. Fitting radial basis function network logistic regression models for the true models.]
Example 3: Radial basis function network Poisson regression model. Suppose that we have $n$ independent observations $y_a$, each from a Poisson distribution with conditional expectation $E(Y_a\mid x_a)=c(x_a)$, where $x_a$ consists of $q$ covariates. It is assumed that the conditional expectation is of the form
$$\log\{c(x_a)\}=\sum_{j=1}^p w_j\phi_j(x_a)+w_0=w^{\mathrm T}b(x_a)\quad(a=1,2,\ldots,n).$$
We estimate the unknown parameter vector $w$ by maximising the penalised loglikelihood function, and then have the model
$$f_P(y_a\mid x_a;\hat w)=\exp\{-\hat c(x_a)\}\hat c(x_a)^{y_a}/y_a!, \qquad(22)$$
where $\hat c(x_a)$ is the estimated conditional expectation.
By taking $u(\hat\xi_a)=\exp(\hat\xi_a)$, $\psi=1$, $v(y_a,\psi)=-\log(y_a!)$ and $h(\hat\mu_a)=\log(\hat\mu_a)$ in Theorem 2, we obtain the model-evaluation criterion
$$\mathrm{BIC}_P^{(P)}(\lambda,\nu,p)=2\sum_{a=1}^n[\exp\{\hat w^{\mathrm T}b(x_a)\}-y_a\hat w^{\mathrm T}b(x_a)+\log(y_a!)]+n\lambda\hat w^{\mathrm T}D_2^{\mathrm T}D_2\hat w-2\log(2\pi/n)+\log|Q_\lambda^{(P)}(\hat w)|-\log|D_2^{\mathrm T}D_2|_+-(p-1)\log\lambda,$$
where $Q_\lambda^{(P)}(\hat w)=B^{\mathrm T}C^{(P)}B/n+\lambda D_2^{\mathrm T}D_2$, with $C^{(P)}_{aa}=\exp\{\hat w^{\mathrm T}b(x_a)\}$ as the $a$th diagonal element of $C^{(P)}$. The adjusted parameters $\lambda$, $\nu$ and $p$ are determined as the minimisers of $\mathrm{BIC}_P^{(P)}(\lambda,\nu,p)$.
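The matrix $Q_\lambda^{(P)}(\hat w)$ can be formed directly from its definition; a short sketch, again reusing second_difference_matrix from an earlier block:

import numpy as np

def q_lambda_poisson(B, w_hat, lam):
    # Q^(P)_lambda(w_hat) = B'C^(P)B/n + lambda D2'D2,
    # with C^(P)_aa = exp{w_hat' b(x_a)}.
    n, m = B.shape
    c = np.exp(B @ w_hat)                 # ath diagonal element of C^(P)
    D2 = second_difference_matrix(m)
    return B.T @ (c[:, None] * B) / n + lam * D2.T @ D2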
When we set $\lambda=0$ in (16), the solution becomes the ordinary maximum likelihood estimate.
4. NUMERICAL RESULTS
Monte Carlo experiments were conducted to compare the effectiveness of the criteria $\mathrm{BIC}_P^{(N)}$ in (19) and $\mathrm{BIC}_P^{(L)}$ in (21) with AIC-type criteria.
Akaike's (1973) information criterion, AIC, was derived as an estimator of the Kullback & Leibler (1951) information from the predictive point of view and is given by
$$\mathrm{AIC}=-2\ell(\hat\theta_{\mathrm{ML}})+2\,(\text{number of parameters}),$$
where $\ell(\hat\theta_{\mathrm{ML}})$ is the loglikelihood of a model estimated by the maximum likelihood method.
The number of parameters is a measure of the complexity of the model. However, in
nonlinear models, especially models estimated by regularisation, the number of parameters
is not a suitable measure of model complexity, since the complexity may depend on both
the regularisation term and the observed data. The concept of number of parameters was
extended to the effective number of parameters by Hastie & Tibshirani (1990, Ch. 3),
Moody (1992), Spiegelhalter et al. (2002) and others.
In Example 1, the fitted value $\hat y$ is expressed as $\hat y=S_\beta y$ for given $\beta$, where $S_\beta$ is the smoother matrix given by $S_\beta=B(B^{\mathrm T}B+n\beta D_2^{\mathrm T}D_2)^{-1}B^{\mathrm T}$. Hastie & Tibshirani (1990) used the trace of the smoother matrix as an approximation to the effective number of parameters. By replacing the number of parameters in AIC and in the BIC (7) by $\mathrm{tr}\,S_\beta$, we formally obtain information criteria for the radial basis function network Gaussian regression model (18) in the form
$$\mathrm{AIC}_M=n\log(2\pi\hat\sigma^2)+n+2\,\mathrm{tr}\,S_\beta, \qquad(23)$$
$$\mathrm{BIC}_M=n\log(2\pi\hat\sigma^2)+n+(\mathrm{tr}\,S_\beta)\log n, \qquad(24)$$
where $\hat\sigma^2=\|y-S_\beta y\|^2/n$.
Hurvich et al. (1998) gave an improved version of AIC for choosing a smoothing parameter in various types of nonparametric regression model; see also Sugiura (1978) and Hurvich & Tsai (1989) for Gaussian linear regression and autoregressive time series models. The version for the radial basis function network Gaussian regression model is formally given by
$$\mathrm{AIC}_C=n\log(2\pi\hat\sigma^2)+n+2n(\mathrm{tr}\,S_\beta+2)/(n-\mathrm{tr}\,S_\beta-1). \qquad(25)$$
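Criteria (23)-(25) involve only the smoother matrix $S_\beta$ and can be computed as in the following sketch, reusing second_difference_matrix from an earlier block; the $\mathrm{AIC}_C$ term is transcribed as printed in (25).

import numpy as np

def smoother_criteria(B, y, beta):
    n, m = B.shape
    D2 = second_difference_matrix(m)
    S = B @ np.linalg.solve(B.T @ B + n * beta * D2.T @ D2, B.T)
    sigma2 = np.sum((y - S @ y) ** 2) / n        # sigma^2 = ||y - S y||^2 / n
    tr_S = np.trace(S)                           # effective no. of parameters
    base = n * np.log(2 * np.pi * sigma2) + n
    aic_m = base + 2 * tr_S                                # (23)
    bic_m = base + tr_S * np.log(n)                        # (24)
    aic_c = base + 2 * n * (tr_S + 2) / (n - tr_S - 1)     # (25), as printed
    return aic_m, bic_m, aic_c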
An advantage of the criteria $\mathrm{AIC}_M$, $\mathrm{BIC}_M$ and $\mathrm{AIC}_C$ is that they can be applied in an automatic way in each practical situation where there is a smoother matrix. There is, however, no theoretical justification, since AIC and BIC are criteria for evaluating models estimated by the maximum likelihood method.
Recently, Spiegelhalter et al. (2002) proposed a measure of the effective number of parameters in a model, defined as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest, using an information-theoretic argument, and provided a nice review of criteria of this type. Our criteria are presented here for radial basis function network estimators of regression functions, but may be applied to other nonlinear models such as multilayer perceptron neural networks. We conclude from the numerical studies that $\mathrm{BIC}_P^{(N)}$ and $\mathrm{BIC}_P^{(L)}$ perform well in practical situations.
5. DISCUSSION
The criteria AIC and BIC have been widely used for variable selection, mainly in linear
models such as autoregressive time series models. In this paper we concentrate on selection
of smoothing parameters in nonlinear models. The proposed criterion may be used for
selecting an optimal subset of variables in nonlinear modelling; optimal values of smoothing
parameters are obtained as the minimisers of the criterion for each model, and then we
choose a statistical model for which the value of the criterion is minimised over a set of
competing models. Further work remains to be done towards constructing nonlinear
modelling strategies of this nature in the context of areas such as neural network models,
which are characterised by a large number of parameters.
ACKNOWLEDGEMENT
The authors would like to thank the editor and anonymous reviewers for constructive
and helpful comments that improved the quality of the paper considerably. We are also
grateful to them for pointing out Andrieu et al. (2001), Spiegelhalter et al. (2002) and
several other references.
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd Int. Symp. on Information Theory, Ed. B. N. Petrov and F. Csaki, pp. 267–81. Budapest: Akademiai Kiado.
Ando, T., Imoto, S. & Konishi, S. (2001). Estimating nonlinear regression models based on radial basis function networks (in Japanese). Jap. J. Appl. Statist. 30, 19–35.
Andrieu, C., de Freitas, N. & Doucet, A. (2001). Robust full Bayesian learning for radial basis networks. Neural Comp. 13, 2359–407.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Broomhead, D. S. & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321–35.
Clarke, B. S. & Barron, A. R. (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. J. Statist. Plan. Infer. 41, 37–60.
Davison, A. C. (1986). Approximate predictive likelihood. Biometrika 73, 323–32.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Am. Statist. Assoc. 81, 461–70.
Good, I. J. & Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities. Biometrika 58, 255–77.
Green, P. J. & Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall.
Green, P. J. & Yandell, B. (1985). Semi-parametric generalized linear models. In Generalized Linear Models, Ed. R. Gilchrist, B. J. Francis and J. Whittaker, Lecture Notes in Statistics 32, pp. 44–55. Berlin: Springer.
Hastie, T. J. & Tibshirani, R. J. (1990). Generalized Additive Models. London: Chapman and Hall.
Holmes, C. C. & Mallick, B. K. (1998). Bayesian radial basis functions of variable dimension. Neural Comp. 10, 1217–33.
Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76, 297–307.