Radial Basis Function Networks

CSCI 5521: Paul Schrater

Introduction
In this lecture we look at RBF networks, in which the activation of each hidden unit is based on the distance between the input vector and a prototype vector.

Radial Basis Functions have a number of interesting properties
• Strong connections to other disciplines
– function approximation, regularization theory, density estimation and interpolation in the presence of noise [Bishop, 1995]

• RBFs allow for a straightforward interpretation of the internal representation produced by the hidden layer
• RBFs have training algorithms that are significantly faster than those for MLPs
• RBFs can be implemented as support vector machines

Radial Basis Functions
• Discriminant function flexibility
– Non-linear
• But with sets of linear parameters at each layer
• Provably general function approximators given a sufficient number of nodes

• Error function
– Mixed: different error criteria are typically used for the hidden and output layers
– Hidden layer: input approximation
– Output layer: training error

• Optimization
– Simple least squares: with hybrid training, the solution is unique.

Two-Layer Non-linear NN
The non-linear hidden functions are radially symmetric 'kernels' with free parameters.

Input-to-Hidden Mapping
• Each region of feature space is represented by means of a radially symmetric function
– Activation of a hidden unit is determined by the DISTANCE between the input vector x and a prototype vector µ
• Choice of radial basis function
– Although several forms of radial basis may be used, Gaussian kernels are most commonly used
– The Gaussian kernel may have a full-covariance structure, which requires D(D+3)/2 parameters to be learned, or a diagonal structure with only (D+1) independent parameters
– In practice, a trade-off exists between using a small number of basis functions with many parameters or a larger number of less flexible functions [Bishop, 1995]
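As a concrete illustration, here is a minimal sketch (assuming NumPy; the function and variable names are illustrative, not from the lecture) of a Gaussian hidden-unit activation with an isotropic width versus a full covariance, matching the parameter counts above:

import numpy as np

def gaussian_activation_isotropic(x, mu, sigma):
    # D mean components plus one shared width -> (D + 1) free parameters per unit
    d = x - mu
    return np.exp(-0.5 * np.dot(d, d) / sigma**2)

def gaussian_activation_full(x, mu, Sigma):
    # D mean components plus D(D+1)/2 covariance entries -> D(D+3)/2 parameters per unit
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))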

Builds up a function out of 'spheres'.

RBF vs. NN


RBFs have their origins in techniques for performing exact interpolation. These techniques place a basis function at each of the training examples and compute the coefficients w_k so that the resulting "mixture model" has zero error at the examples.

Formally: Exact Interpolation
Goal: find a function h(x) such that h(x_i) = t_i for inputs x_i and targets t_i:

y = h(x) = \sum_{i=1}^{n} w_i \, \phi(\lVert x - x_i \rVert)

Form the matrix of basis-function values, \Phi_{ij} = \phi(\lVert x_i - x_j \rVert):

\begin{pmatrix} \Phi_{11} & \Phi_{12} & \cdots & \Phi_{1n} \\ \Phi_{21} & \Phi_{22} & \cdots & \Phi_{2n} \\ \vdots & & & \vdots \\ \Phi_{n1} & \Phi_{n2} & \cdots & \Phi_{nn} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_n \end{pmatrix}

Solve for w:

\Phi w = t, \qquad w = (\Phi^T \Phi)^{-1} \Phi^T t

(Recall the kernel trick.)
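A short sketch of exact interpolation (assuming NumPy; names and the Gaussian choice of phi are illustrative):

import numpy as np

def exact_interpolation_weights(X, t, phi=lambda r: np.exp(-r**2)):
    # Place one basis function at every training point and solve Phi w = t
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # ||x_i - x_j||
    Phi = phi(r)                                                # Phi_ij = phi(||x_i - x_j||)
    return np.linalg.solve(Phi, t)                              # zero training error: h(x_i) = t_i

def interpolate(x_new, X, w, phi=lambda r: np.exp(-r**2)):
    r = np.linalg.norm(x_new - X, axis=-1)
    return phi(r) @ w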

Solving Conditions
• Micchelli's Theorem: if the points x_i are distinct, then the matrix \Phi will be nonsingular.
[Mhaskar and Micchelli, 1992. Approximation by superposition of sigmoidal and radial basis functions. Advances in Applied Mathematics, 13, 350-373.]


Radial Basis Function Networks
With n outputs and m basis functions:

y_k = h_k(x) = \sum_{i=1}^{m} w_{ki}\,\phi_i(x) + w_{k0} = \sum_{i=0}^{m} w_{ki}\,\phi_i(x), \qquad \phi_0(x) = 1

Rewrite this in matrix form as

y = W\,\phi(x), \qquad \phi(x) = \big(\phi_0(x), \phi_1(x), \ldots, \phi_m(x)\big)^T

e.g. \phi_j(x) = \exp\!\left( -\frac{\lVert x - \mu_j \rVert^2}{2\sigma_j^2} \right)
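A minimal NumPy sketch of this forward pass (names are illustrative, not from the lecture):

import numpy as np

def rbf_forward(X, centers, sigmas, W):
    # X: (L, D) inputs, centers: (m, D), sigmas: (m,), W: (n_outputs, m + 1)
    d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)   # squared distances ||x - mu_j||^2
    Phi = np.exp(-d2 / (2.0 * sigmas**2))                     # phi_j(x), j = 1..m
    Phi = np.hstack([np.ones((X.shape[0], 1)), Phi])          # prepend the bias basis phi_0 = 1
    return Phi @ W.T                                          # y_k = sum_j w_kj phi_j(x)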

Learning
RBFs are commonly trained following a hybrid procedure that operates in two stages:
• Unsupervised selection of RBF centers
– RBF centers are selected so as to match the distribution of training examples in the input feature space
– This is the critical step in training, normally performed in a slow, iterative manner
– A number of strategies are used to solve this problem
• Supervised computation of output weight vectors
– Hidden-to-output weight vectors are determined so as to minimize the sum-squared error between the RBF outputs and the desired targets
– Since the outputs are linear, the optimal weights can be computed using fast, linear least squares

Least-Squares Solution for Output Weights
Given L input vectors x_l with labels t_l, form the least-squares error function

E = \sum_{l=1}^{L} \sum_{k=1}^{n} \{ y_k(x_l) - t_l^k \}^2,

where t_l^k is the k-th value of the target vector t_l. Equivalently,

E = \sum_{l=1}^{L} \big( y(x_l) - t_l \big)^T \big( y(x_l) - t_l \big)

Stacking the basis outputs into the design matrix \Phi (rows \phi(x_l)^T) and the targets into T (rows t_l^T),

E = (\Phi W^T - T)^T (\Phi W^T - T) = W \Phi^T \Phi W^T - 2\, W \Phi^T T + T^T T

Setting the derivative to zero, \frac{\partial E}{\partial W} = 0 = 2\,\Phi^T \Phi W^T - 2\,\Phi^T T, gives

W^T = (\Phi^T \Phi)^{-1} \Phi^T T
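A hedged NumPy sketch of this supervised step (illustrative names; lstsq is used instead of the explicit inverse for numerical stability):

import numpy as np

def fit_output_weights(Phi, T):
    # Phi: (L, m+1) matrix of basis outputs, T: (L, n) target matrix
    # Solves the normal equations W^T = (Phi^T Phi)^{-1} Phi^T T
    Wt, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return Wt.T                     # (n, m+1); predictions are Phi @ Wt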

Solving for the Input Parameters
• We now have a linear solution if we know the input parameters. How do we set the input parameters?
– Gradient descent, as in back-propagation
– Density-estimation viewpoint: by looking at the meaning of the RBF network for classification, the input parameters can be estimated via density estimation and/or clustering techniques
– We will skip these in lecture

Unsupervised Methods Overview
• Random selection of centers
– The simplest approach is to randomly select a number of training examples as RBF centers
– This method has the advantage of being very fast, but the network will likely require an excessive number of centers
– Once the center positions have been selected, the spread parameters σ_j can be estimated, for instance, from the average distance between neighboring centers
• Clustering
– Alternatively, RBF centers may be obtained with a clustering procedure such as the k-means algorithm
– The spread parameters can be computed as before, or from the sample covariance of the examples of each cluster
• Density estimation
– The positions of the RBF centers may also be obtained by modeling the feature-space density with a Gaussian mixture model fit by the Expectation-Maximization algorithm
– The spread parameters for each center are then obtained automatically from the covariance matrices of the corresponding Gaussian components
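A sketch of the clustering strategy, assuming scikit-learn's KMeans and NumPy (names are illustrative); the spreads here use each center's nearest-neighboring-center distance, one common variant of the distance heuristic above:

import numpy as np
from sklearn.cluster import KMeans

def select_centers_kmeans(X, m, random_state=0):
    km = KMeans(n_clusters=m, n_init=10, random_state=random_state).fit(X)
    centers = km.cluster_centers_
    # Spread heuristic: distance from each center to its nearest neighboring center
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    sigmas = d.min(axis=1)
    return centers, sigmas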

Probabilistic Interpretation of RBFs for Classification
Introduce a mixture model over the feature space.

Probabilistic RBF, continued
y_k(x) = \sum_j w_{kj}\,\phi_j(x), where
– Basis function outputs \phi_j(x): posterior probability of the j-th feature (mixture component) given x
– Weights w_{kj}: class probability given the j-th feature value

"w ) ( y ."w ) + #w T $w Solution CSCI 5521: Paul Schrater r w = (" " + #$) " y T %1 T Generalized Ridge Regression ."w ) ( y ."w ) + #w T $w T Make a penalty on large curvature $ d 2# i ( x ) '$ d 2# j ( x ) ' "ij = * & )dx )& 2 2 % dx (% dx ( Minimize T ! L = ( y .Regularized Spline Fits Given $ #1 ( X1 ) L # N ( X1 ) ' & ) "=& M ) & %#1 ( X M ) L # N ( X M )) ( y pred = "w Minimize L = ( y .


Equivalent Kernel
• Linear regression solutions admit a kernel interpretation:

w = ( \Phi^T \Phi + \lambda \Omega )^{-1} \Phi^T y

y_{pred}(x) = \phi(x)^T w = \phi(x)^T ( \Phi^T \Phi + \lambda \Omega )^{-1} \Phi^T y

Let S = ( \Phi^T \Phi + \lambda \Omega )^{-1} \Phi^T. Then

y_{pred}(x) = \phi(x)^T S\, y = \sum_i \phi(x)^T S_i\, y_i = \sum_i K(x, x_i)\, y_i
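A short NumPy sketch of the equivalent kernel for a single query point (illustrative names; S_i denotes the i-th column of S):

import numpy as np

def smoother_and_kernel(Phi, lam, Omega, phi_x):
    # Phi: (L, m) design matrix; phi_x: (m,) basis vector at a query point x
    S = np.linalg.solve(Phi.T @ Phi + lam * Omega, Phi.T)   # (m, L)
    k = phi_x @ S                                           # K(x, x_i) for i = 1..L
    return S, k

# prediction at x:  y_pred = k @ y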

Eigenanalysis and Degrees of Freedom
• An eigenanalysis of the effective kernel gives the effective degrees of freedom (DF), i.e. the effective number of basis functions.
Evaluate K at all training points, K_{ij} = K(x_i, x_j), and perform the eigenanalysis

K = V D V^{-1} = U S U^T
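A hedged sketch (NumPy, illustrative names) that evaluates the equivalent kernel on the training points and reads off its spectrum; here the effective DF is taken as the sum of the eigenvalues, i.e. the trace of the smoother matrix, a common convention:

import numpy as np

def effective_df(Phi, lam, Omega):
    # K_ij = K(x_i, x_j): the smoother (hat) matrix evaluated on the training set
    K = Phi @ np.linalg.solve(Phi.T @ Phi + lam * Omega, Phi.T)
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)     # symmetrize for numerical stability
    return eigvals, eigvals.sum()                   # spectrum and effective DF (trace)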

Equivalent Basis Functions


Computing Error Bars on Predictions
• It is important to be able to compute error bars on linear-regression predictions. We can do that by propagating the error in the measured values to the predicted values.
Simple least squares: given the Gram (design) matrix

\Phi = \begin{pmatrix} \phi_1(X_1) & \cdots & \phi_N(X_1) \\ \vdots & & \vdots \\ \phi_1(X_M) & \cdots & \phi_N(X_M) \end{pmatrix}, \qquad w = ( \Phi^T \Phi )^{-1} \Phi^T y,

the predictions are

y_{pred} = \Phi w = \Phi ( \Phi^T \Phi )^{-1} \Phi^T y = S\, y

and the propagated covariance is

\mathrm{cov}[ y_{pred} ] = \mathrm{cov}[ S y ] = S\, \mathrm{cov}[y]\, S^T = \sigma^2 S S^T \quad (\text{assuming } \mathrm{cov}[y] = \sigma^2 I)
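A minimal NumPy sketch of this error propagation (illustrative names; sigma is the assumed measurement-noise standard deviation):

import numpy as np

def prediction_error_bars(Phi, y, sigma):
    S = Phi @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T   # hat / smoother matrix
    y_pred = S @ y
    cov_pred = sigma**2 * (S @ S.T)                 # propagated covariance, cov[y] = sigma^2 I
    return y_pred, np.sqrt(np.diag(cov_pred))       # one-standard-deviation error bars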

