
Visualization and Implementation of Feedforward Neural Networks

Sigbert Klinke
Humboldt-University of Berlin, Department of Economics, Institute of Statistics and Econometrics, Spandauer Strasse 1, D-10178 Berlin
e-mail: URL: sigbert/index.html

Janet Grassmann
DKFZ Heidelberg, Department of Biostatistics, P.O. Box 101949, D-69009 Heidelberg
e-mail: URL: /grassmann.html

Feedforward neural networks are frequently used methods for regression and classification. Mostly, however, they are treated as black boxes that will find the "right" model by themselves. The advantage of flexibility is then offset by the nontransparency of the training process and the final model. To understand the internal behaviour of a feedforward neural network we have applied non-metric multidimensional scaling. The weights of the connections between the units of different layers are transformed into distances between these units. Finally, we get two-dimensional figures of projected units, where the distances between them give us an idea of the influence of individual input or hidden units on other units. This method should be seen as an opportunity to play with a feedforward neural network by removing or adding units or clusters of them and then seeing what happens. To illustrate the idea of our work we have chosen two real-world applications: the credit data from Fahrmeier and Hamerle (1981) and the protein folding class data from Reczko et al. (1994).

1 Introduction
The paper is organized into seven chapters. In the second chapter we describe the general structure of a feedforward neural network as it is well known from the literature. The third chapter describes how we get from weights to distances and which transformation functions we have used. Then the technique of multidimensional scaling is described briefly. Finally we give an example of what we expect to see for a very simple network. The fourth chapter describes the implementation of our technique and of feedforward neural networks in XploRe 3. The implementation is based on four commands and one macro. In the following two chapters we apply our technique to the two real-world datasets mentioned in the abstract. We analyze the usefulness of both transformation functions. The last chapter contains a discussion of further improvements for the visualization and implementation.

2 Feedforward neural networks

The most frequently used neural networks for regression and classification tasks are feedforward networks (FFN). For some applications in economics (e.g., credit scoring) and molecular biology (e.g., protein structure prediction, prediction of functional sites on the DNA), they belong to the large set of standard tools. But too often they are treated as black boxes in daily use. The advantage of flexibility is mostly offset by the non-transparency of the training process and the final model.

2.1 The architecture of a FFN

A feedforward neural network consists of a set of layers of computational units (see Figure 1). The weighted connections from the input units via the hidden units to the output units run only in the forward direction. No feedback connections are allowed, so that the output of such a network can be expressed as a deterministic function of the inputs. The output of each unit is calculated by applying an activation function to its input. Common activation functions are monotonically increasing functions with values between 0 and 1, e.g., the jump function or a continuous sigmoidal function (see section 4.1). A multilayer FFN with p input units, one hidden layer of m units and linear or logistic output units can be written as

$$f_k(x) = G_1\left( w^{(2)}_{k0} + \sum_{j=1}^{m} w^{(2)}_{kj}\, G_0\left( w^{(1)}_{j0} + \sum_{i=1}^{p} w^{(1)}_{ji}\, x_i \right) \right),$$

where $G_0$ is a nonlinear activation function and $G_1$ is typically either a linear or a sigmoidal activation function. $f_k(x)$ represents the value of the kth response variable given by the kth output unit.

Figure 1: Multilayer feedforward neural network
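The formula for $f_k(x)$ can be illustrated with a few lines of NumPy. This is only a minimal sketch: the function name `ffn_forward`, the array layout and the random weights are assumptions made for the illustration and are not part of the XploRe implementation described later.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def ffn_forward(x, W1, b1, W2, b2, g_hidden=sigmoid, g_out=sigmoid):
    """Single-hidden-layer FFN.

    x : (n, p) inputs, W1 : (m, p) weights w^(1)_ji, b1 : (m,) biases w^(1)_j0,
    W2 : (K, m) weights w^(2)_kj, b2 : (K,) biases w^(2)_k0.
    Returns the (n, K) matrix of outputs f_k(x).
    """
    hidden = g_hidden(x @ W1.T + b1)      # G_0 applied to the hidden net input
    return g_out(hidden @ W2.T + b2)      # G_1 applied to the output net input

# Hypothetical example: p = 3 inputs, m = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
p, m, K = 3, 4, 2
W1, b1 = rng.normal(size=(m, p)), rng.normal(size=m)
W2, b2 = rng.normal(size=(K, m)), rng.normal(size=K)
x = rng.normal(size=(5, p))
y = ffn_forward(x, W1, b1, W2, b2)
```

Passing `g_out=lambda z: z` would give linear output units, the other case mentioned in the text.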

2.2 Error functions

Several learning rules can be applied to minimize the error between the actual net output $y_k^n$ and a target output $t_k^n$ for the nth object. During this iterative minimization procedure the weights (= parameters of the FFN model) are estimated. In the regression case the average squared error

$$E_S = \sum_n \sum_k (t_k^n - y_k^n)^2,$$

$n = 1, \ldots, N$; $k = 1, \ldots, K$, with N the sample size, is minimized. An often used error function for classification tasks is the Kullback-Leibler distance

$$E_{KL} = \sum_n \sum_k \left\{ t_k^n \log\frac{t_k^n}{y_k^n} + (1 - t_k^n) \log\frac{1 - t_k^n}{1 - y_k^n} \right\}.$$

Its minimization is equivalent to minimizing the negative log-likelihood function

$$E_{ML} = -\log(L) = -\sum_n \sum_k \left\{ t_k^n \log y_k^n + (1 - t_k^n) \log(1 - y_k^n) \right\},$$

because $E_{KL}$ and $E_{ML}$ differ only by terms in $t_k^n$ that do not depend on the weights.

For the minimization we have to choose an appropriate optimization method, which depends on the smoothness of the activation functions. For our examples we implemented a stochastic search algorithm, simulated annealing, and a quadratic approximation algorithm.
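The relation between the Kullback-Leibler distance and the negative log-likelihood can be checked numerically. The following sketch uses hypothetical function names and made-up targets and outputs; it adopts the usual convention $0 \log 0 = 0$, under which the two error functions coincide for binary targets.

```python
import numpy as np

def squared_error(t, y):
    """E_S: sum of squared differences between targets and outputs."""
    return np.sum((t - y) ** 2)

def neg_log_likelihood(t, y):
    """E_ML: negative Bernoulli log-likelihood."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def _a_log_a_over_b(a, b):
    """a * log(a / b) with the convention 0 * log(0 / b) = 0."""
    out = np.zeros_like(a, dtype=float)
    mask = a > 0
    out[mask] = a[mask] * np.log(a[mask] / b[mask])
    return out

def kullback_leibler(t, y):
    """E_KL: Kullback-Leibler distance between targets and outputs."""
    return np.sum(_a_log_a_over_b(t, y) + _a_log_a_over_b(1 - t, 1 - y))

# Made-up binary targets and network outputs (N = 4 objects, K = 1 output).
t = np.array([[1.0], [0.0], [1.0], [0.0]])
y = np.array([[0.9], [0.2], [0.6], [0.4]])
e_kl, e_ml = kullback_leibler(t, y), neg_log_likelihood(t, y)
```

For targets strictly between 0 and 1 the difference `kullback_leibler(t, y) - neg_log_likelihood(t, y)` is a constant that depends only on `t`, so both functions have the same minimizer in the weights.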

2.3 Network complexity

There are several ways to control the effective complexity of a neural network (see Bishop (1995) for an overview). We have considered only two very common approaches: the first is weight decay and the second is the early stopping method. Weight decay means adding a fraction d of the sum of squared weights of the current model to the error function, which gives

$$E_d = E_S + d \sum_{i,j} w_{ij}^2$$

with $w_{ij}$ the weights in the network, in order to penalize large weights during the training process. Before applying this regularization method the input variables should be normalized. One way to choose the weight decay parameter d is by cross-validation (CV) (Stone 1974, Efron & Tibshirani 1993), that is, by minimizing an estimate of the generalization error obtained with the following algorithm. We divide the training set D into a specified number k of subsets $D_j$ of sizes $N_j$,

$$\bigcup_j D_j = D, \qquad D_i \cap D_j = \emptyset \ (i \neq j), \qquad \sum_j N_j = N,$$

train the network on all but one of these subsets, $D \setminus D_j$, and estimate the generalization error $CV_{D_j}$ using the subset left out as a test set:

$$CV_{D_j}(d) = \sum_{(x_l, Y_l) \in D_j} \left( Y_l - f_d(x_l; D \setminus D_j) \right)^2.$$

This is repeated for all subsets. Finally, the generalization error is averaged over all subsets, which gives the cross-validation error

$$CV(d) = \frac{1}{k} \sum_{j=1}^{k} CV_{D_j}(d) \to \min_d.$$
The optimal d (out of a given one-dimensional grid) is the one with the minimal cross-validation error within the training data set. Another common approach to control the effective complexity of a network is "early stopping". While the training error in general decreases monotonically during the training process, the test error often begins to increase as the network starts to overfit the data. The idea is to reduce the effective number of degrees of freedom of the network by stopping the training process at the point of the smallest test error. It is expected that the resulting network will give a low generalization error for previously unseen examples. To estimate the future performance of the finally chosen model we use a test data set that was considered neither during the training process nor during the calibration of the network complexity.
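The grid search over the weight decay parameter d can be sketched as follows. This is an illustration under stated assumptions, not the XploRe macro: the network is reduced to a FFN without a hidden layer, it is trained by plain gradient descent on the mean squared error plus $d \sum w^2$, and the data are simulated. All function names are hypothetical.

```python
import numpy as np

def fit_no_hidden_ffn(X, t, d, steps=500, lr=0.1):
    """Gradient descent on mean squared error + d * sum(w^2), logistic output unit."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (2 * (y - t) * y * (1 - y)) / len(t) + 2 * d * w
        w -= lr * grad
    return w

def cv_error(X, t, d, k=5, seed=0):
    """Average over k folds of the squared error on the fold left out."""
    idx = np.random.default_rng(seed).permutation(len(t))
    folds = np.array_split(idx, k)
    errors = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        w = fit_no_hidden_ffn(X[train], t[train], d)
        y = 1.0 / (1.0 + np.exp(-X[test] @ w))
        errors.append(np.sum((t[test] - y) ** 2))   # CV_{D_j}(d)
    return np.mean(errors)                           # CV(d)

# Simulated two-class data with standardized inputs (assumption for the sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X = (X - X.mean(0)) / X.std(0)
t = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)

grid = [0.0, 0.001, 0.01, 0.1, 1.0]
cv = {d: cv_error(X, t, d) for d in grid}
best_d = min(cv, key=cv.get)
```

The same folds are used for every value of d (fixed `seed`), so the comparison across the grid is not disturbed by different random splits.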

Neural networks with at least one hidden layer and sigmoidal activation functions of the hidden units can approximate arbitrarily well any nonlinear function from one finite-dimensional space to another, provided there are enough units in the hidden layer. Thus they are also able to approximate any decision boundary to arbitrary accuracy. In that case they directly estimate posterior class probabilities, if the response (output) variables represent the binary coded values for each class in a classification problem (Richard & Lippmann 1991). In the case of high-dimensional unknown interactions of the input variables and a large number of cases such a model can be very useful. One disadvantage is that an interpretation is hard to achieve, especially in a fully connected network. By pruning unimportant connections or units one can achieve better interpretability and also better generalization ability.

3 Visualization
To understand how a neural network works and what it learns it is helpful to understand the topological behaviour of the weights of the network during and after the training process. In statistical modeling it is often desirable to interpret the model, that is, to find out which variables contribute to one or more response variables and of what kind the contributions are (e.g., linear, nonlinear, interactive). To see the topological behaviour of the units we embed them in a high-dimensional space and visualize them in a two-dimensional projection via a statistical technique called "multidimensional scaling".

3.1 From weights to distances

The first step is to transform the weights of the connections of the network into some kind of distances. Small weights should produce large distances and large weights small distances, such that we can easily identify weakly connected units and subsets of strongly connected units. The first choice is the transformation

$$\delta^{(1)}_{i,j} = \frac{1}{0.1 + |w_{i,j}|}.$$

It uniquely transforms each weight into a distance such that large weights become small distances and vice versa. The addition of 0.1 is necessary to prevent a small weight from being transformed into a huge number, which would distort the whole graphic. This transformation has the disadvantage that linearly related weights will be transformed into non-linearly related distances. Thus another possible choice is the transformation

$$\delta^{(2)}_{i,j} = \max_{i,j} |w_{i,j}| - |w_{i,j}|.$$

The location of the units in a high-dimensional space can be determined by the computed distances $\delta_{i,j}$. The fastest way to find a low-dimensional projection would be metric scaling, but for that the $\delta_{i,j}$ would have to be distances in the mathematical sense. Since the weights are varied independently we will not be able to fulfill the triangle inequality. Thus we will use the non-metric scaling method, where $f(d_{r,s}) = \delta^{(1)}_{r,s}$ or $\delta^{(2)}_{r,s}$ with $d_{r,s}$ the projected distances.

Footnote: Which connections are important obviously depends on the problem. In the classification context these are the ones that are large in magnitude.
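Both transformations are easy to state in code, and a small example shows why the resulting dissimilarities need not satisfy the triangle inequality. The function names and the three weights below are assumptions chosen for illustration.

```python
import numpy as np

def delta1(w):
    """First transformation: 1 / (0.1 + |w|)."""
    return 1.0 / (0.1 + np.abs(w))

def delta2(w):
    """Second transformation: max|w| - |w|, linear in |w|."""
    a = np.abs(w)
    return a.max() - a

# Hypothetical weights between three units: one weak and two strong connections,
# ordered as (w_12, w_13, w_23).
w = np.array([0.01, 2.0, 2.0])
d12, d13, d23 = delta1(w)

# The weak connection gives a dissimilarity larger than the sum of the other two,
# so the three values cannot be the side lengths of a triangle.
violates = d12 > d13 + d23
```

With `delta2` the same weights give dissimilarities 1.99, 0 and 0, which violate the triangle inequality as well; this is the reason for turning to non-metric scaling.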

3.2 Multidimensional scaling

The aim of multidimensional scaling (MDS) is to find a low-dimensional (dim = 1, 2, 3) space such that the distances $d_{r,s}$ between the objects r and s in the low-dimensional space match as well as possible the original dissimilarities $\delta_{r,s}$ of a higher-dimensional configuration space. For an overview see Cox & Cox (1994). We will look at three MDS models: classical (metric) scaling, least-squares scaling, and non-metric scaling. In metric scaling the dissimilarities $\delta_{r,s}$ are taken directly as Euclidean distances. We minimize the stress function

$$S_{ms} = \sqrt{\frac{\sum_{r \neq s} (d_{r,s} - \delta_{r,s})^2}{\sum_{r \neq s} d_{r,s}^2}}.$$

Such a stress function is defined to measure the discrepancy between the ranks of the real and the computed distances. If the $\delta_{r,s}$ are distances from a metric (e.g. Euclidean distances) we can compute $d_{r,s}$ easily by a non-iterative algorithm. In least-squares scaling a monotone (parametric) transformation f of the dissimilarities $\delta_{r,s}$ is added. To find a good low-dimensional projection the (stress) functional

$$S_{ls} = \frac{\sum_{r \neq s} \left( d_{r,s} - f(\delta_{r,s}) \right)^2}{\sum_{r \neq s} d_{r,s}^2}$$

Footnote: A metric $d: M \times M \to \mathbb{R}$ is given by the three properties: 1) $d(x, y) = 0 \Leftrightarrow x = y$, 2) $d(x, y) = d(y, x)$ and 3) $d(x, z) \le d(x, y) + d(y, z)$.


is minimized. Non-metric scaling assumes that the level of measurement is at the nominal or at best ordinal scale. The transformation function f is a monotone function such that

$$f(d_{r,s}) \le f(d_{t,u}) \quad \text{if} \quad \delta_{r,s} < \delta_{t,u},$$

which means that the dissimilarity influences the stress function only indirectly:

$$S_{ns}^2 = \frac{\sum_{r,s} \left( d_{r,s} - f(d_{r,s}) \right)^2}{\sum_{r,s} d_{r,s}^2}.$$


The stress function $S_{ns}$ and its minimization were proposed in Kruskal (1964a, 1964b). The algorithm is

1. Choose an initial location X in the high-dimensional space for the units
2. Normalize the location, e.g. such that mean(X) = 0 and var(X) = 1
3. Compute $d_{r,s}$
4. Fit $f(d_{r,s})$, e.g. by monotonic least squares regression
5. Compute a new configuration X by minimizing the stress function
6. Go to 2.
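The six steps can be turned into a compact NumPy sketch. It is not the NNVISU command: as a simplification, step 5 is replaced by a single gradient step on the numerator of the stress with the fitted values held fixed, the monotone regression is done with a pool-adjacent-violators routine, and the loop runs for a fixed number of iterations. All names and the example configuration are assumptions.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to the sequence y."""
    blocks = []                                   # each block holds [sum, count]
    for v in y:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def pair_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def disparities(d, order):
    """Monotone fit f(d): non-decreasing in the rank order of the dissimilarities."""
    dhat = np.empty_like(d)
    dhat[order] = pava(d[order])
    return dhat

def nmds(delta, dim=2, iters=300, lr=0.05, seed=0):
    n = delta.shape[0]
    iu = np.triu_indices(n, 1)
    order = np.argsort(delta[iu], kind="stable")
    X = np.random.default_rng(seed).normal(size=(n, dim))     # step 1
    for _ in range(iters):                                     # step 6: loop
        X = (X - X.mean(0)) / X.std()                          # step 2
        D = pair_distances(X)                                  # step 3
        F = np.zeros((n, n))
        F[iu] = disparities(D[iu], order)                      # step 4
        F = F + F.T
        ratio = np.where(D > 0, (D - F) / np.where(D > 0, D, 1.0), 0.0)
        grad = 2 * (ratio[:, :, None] * (X[:, None, :] - X[None, :, :])).sum(1)
        X = X - lr * grad                                      # step 5 (one gradient step)
    D = pair_distances(X)
    dhat = disparities(D[iu], order)
    stress = np.sqrt(((D[iu] - dhat) ** 2).sum() / (D[iu] ** 2).sum())
    return X, stress

# Hypothetical dissimilarities: squared distances of five planar points, i.e. a
# monotone transform of true distances that only preserves their rank order.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 2]], dtype=float)
delta = pair_distances(pts) ** 2
X, stress = nmds(delta)
```

Because only the rank order of `delta` enters through `order`, any monotone transformation of the dissimilarities leads to the same iteration, which is the property that makes the method usable for weights that violate the triangle inequality.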

Drawbacks. The drawback of MDS is that the true structure is high-dimensional. If we use two or three dimensions for visualization we will get only the best two- or three-dimensional approximation of the true structure. Nevertheless we hope to grasp some important properties of the network structure.

Figure 2: A simple feedforward network with one input, one hidden and one output unit and with a direct connection from the input to the output unit. The connections are labelled $w_{1,2}$, $w_{1,3}$ and $w_{2,3}$.

3.3 Examples
A simple net. To clarify the idea let us assume a very simple network with three units and three weights $w_{1,2}$, $w_{1,3}$ and $w_{2,3}$ as in Figure 2, with one input unit, one hidden unit and one output unit. If we have a common activation function G for the hidden and the output unit, then we can build the model

$$\hat{y} = G(w_{1,3}\, x + w_{2,3}\, G(w_{1,2}\, x)).$$

no.   small weights          model
1     -                      $\hat{y} = G(w_{1,3} x + w_{2,3} G(w_{1,2} x))$
2     $w_{1,2}$              $\hat{y} = G(w_{1,3} x + w_{2,3} G(0))$
3     $w_{1,3}$              $\hat{y} = G(w_{2,3} G(w_{1,2} x))$
4     $w_{2,3}$              $\hat{y} = G(w_{1,3} x)$
5     $w_{1,2}, w_{1,3}$     $\hat{y} = G(w_{2,3})$

Table 1: Models which can be built with 3 units.

We can set some weights $w_{i,j}$ to 0 and get the models in Table 1. The missing case $w_{1,2}, w_{2,3} \approx 0$ is equal to model 4 in Table 1, and $w_{1,3}, w_{2,3} \approx 0$ is a special case of model 5. Figure 3 shows the graphical representation of all five models. Figure 3 can be analyzed as follows:

Model 1 The weights are chosen equally with $w_{1,2} = w_{1,3} = w_{2,3} = c$, which results in a triangle with equidistant vertices.

Model 2 The weight $w_{1,2}$ is small and the weights $w_{1,3}$ and $w_{2,3}$ are chosen equally. The weight between the input unit and the hidden unit is nearly 0. Thus the input unit and the hidden unit have to be placed at a large distance from each other. The weight between the input unit and the output unit and the weight between the hidden unit and the output unit are large and have the same value. Thus the input unit and the hidden unit have to be placed at the same distance from the output unit. This results in placing all three units on a line, starting with the input unit, then the output unit and finally the hidden unit.

Model 3, 4 The result can be constructed in the same way as in model 2, but the roles of the units have to be exchanged.

Model 5 In this model we have two small weights and one large weight. The distance between the hidden unit and the output unit has to be small since the weight is large. The distances from the input unit to the other units have to be large since the weights are small.

Figure 3: Visualization of the models 1-5 of Table 1. + is the input unit, o is the output unit and * is the hidden unit.

We can recognize if the weights are chosen equally on the whole network (or on a subset of units). In that case we will observe more or less regular triangles. We are able to identify weights which are small and thus may indicate superfluous connections between units. As in any approach to visualizing multivariate data we have to learn to interpret the patterns which represent several features in the neural network.

Logistic regression. A FFN with no hidden layer and one output unit with a logistic activation function is equivalent to a dichotomous logistic regression model


$$P(Y = 1 \mid x) = \frac{1}{1 + \exp(-w_{1,0} - w_{1,1} x_1 - \ldots - w_{1,p} x_p)}.$$

Figure 4: A FFN without a hidden layer and only one output unit is equivalent to a dichotomous logistic regression model

If we use in the non-metric multidimensional scaling (NMDS) just those weights $w_{k,i}$ ($i = 1, \ldots, p$; $k = 1, \ldots, K$) given in the model, we can interpret only the distance from the input units to the output unit as an indicator of the size of the weights (see Figure 4). In principle we also have weights between the input units, but they are assumed to be 0. If we included these implicit zero weights in the NMDS routine we would get a more or less regular star with the output unit in the center. We rejected this possibility because

1. the computational effort in the NMDS increases with $n(n-1)/2$ (n the number of units) instead of n,
2. the visualization error becomes larger since a higher-dimensional space would be required to match the distances exactly, and
3. the additional 0-weights would distort the representation of the real weights and we would need to use weighted distances.

In the case of a FFN without a hidden layer, but with more than one output,

$$P(Y = k \mid x) = \frac{1}{1 + \exp(-w_{k0} - w_{k1} x_1 - \ldots - w_{kp} x_p)} \quad \text{for } k = 1, \ldots, K,$$

we have to find a low-dimensional representation of all weights to all of the output units. In this case the geometric information given by a two-dimensional plot will really tell us which input units are important for which output unit.
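A small sketch makes the bookkeeping concrete for a FFN without a hidden layer and K = 2 output units. The weight values and function names are made up for illustration; the distance transformation is the first one from section 3.1.

```python
import numpy as np

def logistic_outputs(x, W, w0):
    """x: (n, p) inputs, W: (K, p) weights w_ki, w0: (K,) constants w_k0.

    Returns the (n, K) matrix of P(Y = k | x), one logistic model per output unit.
    """
    return 1.0 / (1.0 + np.exp(-(x @ W.T + w0)))

def input_output_distances(W):
    """Dissimilarity 1 / (0.1 + |w_ki|) for each input-output pair only."""
    return 1.0 / (0.1 + np.abs(W))

# Hypothetical weights: p = 3 input units, K = 2 output units.
W = np.array([[2.0, 0.0, -1.0],
              [0.1, 1.5, 0.0]])
w0 = np.zeros(2)
probs = logistic_outputs(np.zeros((1, 3)), W, w0)

delta = input_output_distances(W)
closest_input = delta.argmin(axis=1)          # most influential input per output unit

n_units = W.shape[0] + W.shape[1]
pairs_used = W.size                            # only real connections enter the NMDS
pairs_all = n_units * (n_units - 1) // 2       # if implicit 0-weights were included
```

Here only 6 of the 10 possible unit pairs carry a dissimilarity, which is the saving referred to in the first of the three reasons above.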

A single-hidden layer FFN. A FFN with one hidden layer will contain at least two hidden units. A layered FFN has connections from each unit of a layer to each unit of the next layer in the forward direction (see Figure 1). With our technique we can interpret the influence from one layer to the next, but not from the input layer to the output layer if there are no direct connections between them. Again, we would need the implicitly assumed 0-weights.

A multi-hidden layer FFN. For more than one hidden layer the problem is very similar, since we only have to generalize the interpretation process from the single-hidden layer FFN. Theoretically, a multi-hidden layer FFN is not necessary, since in theory a FFN with one hidden layer can approximate any measurable function, provided that enough hidden units are available.

A complex FFN. A complex FFN will allow connections within the individual layers of units. To represent these connections appropriately we should not include the implicitly assumed 0-weights.

4 Implementation in XploRe
The implementation of multivariate regression and classification methods shows the trade-off between speed and transparency of the method. For example, in S-Plus the multivariate regression methods are implemented as commands and are therefore very quick in execution, but the user gets no insight into the algorithm used. In contrast, in XploRe we (Härdle, Klinke & Turlach 1995, Klinke 1995) implemented these methods mostly in the macro language. Therefore, the computation is slower compared to S-Plus, but the user can get a good insight into how the methods work. It is the permanent responsibility of the programmer to balance these two approaches when producing software.

4.1 Commands
The feedforward network implemented in XploRe 3.2 is based on four commands written in the C programming language: NNINIT for the initialization of a FFN, NNFUNC and NNUNIT to compute the outputs for a set of observations X and NNVISU to visualize the geometry of the network. The syntax is
l=NNINIT(weight unit)

with weight the weight matrix (n × 3), which contains the unit where a connection starts (first column) and the unit where it ends (second column). The starting weights (third column) are also given. The parameter unit (m × 2) contains an identifier for the unit type (−1 input unit, 1 output unit, otherwise a hidden unit) in the first column and in the second column a number for the different kinds of activation functions:

1. the identity, $G(x) = x$,
2. the jump function, $G(x) = 0$ if $x < 0$ and $G(x) = 1$ otherwise,
3. the sigmoid function, $G(x) = \dfrac{1}{1 + \exp(-x)}$,
4. the hyperbolic tangent, $G(x) = \tanh(x) = \dfrac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$,
5. and the arc tangent function, $G(x) = \arctan(x)$.
The output parameter l tells us how our weight matrix should be reordered with the XploRe command INDEX, so that it can be used to calculate the network output appropriately. In the optimization loop we can use the command NNFUNC with the syntax
y = NNFUNC (weight unit x).

It computes for a set of input values x the function values y using the net given by unit and weight. NNUNIT has the same syntax as NNFUNC but there the variable y contains the output of all units of the network and not only that of the output units. This command can be used to analyse the individual behaviour of each neuron. The last command NNVISU tries to visualize the FFN. The syntax is
(v verr) = NNVISU (d v).

The input parameter d comprises a set of distances whereas the input parameter v contains starting coordinates for the visualization. During the training process of the FFN the current state of the two-dimensional coordinates of the units is returned in v, while the corresponding visualization error is given by verr.
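The connection-list representation used by these commands can be mimicked in Python. The sketch below is a hypothetical analogue, not the XploRe commands themselves: it assumes 0-based unit numbers, units numbered so that every connection runs from a lower to a higher number, and no bias terms. `unit_outputs` plays the role described for NNUNIT and `network_outputs` the role described for NNFUNC.

```python
import numpy as np

# Activation codes as listed in the text (1 identity, 2 jump, 3 sigmoid, 4 tanh, 5 arctan).
ACTIVATIONS = {
    1: lambda z: z,
    2: lambda z: np.where(z < 0, 0.0, 1.0),
    3: lambda z: 1.0 / (1.0 + np.exp(-z)),
    4: np.tanh,
    5: np.arctan,
}

def unit_outputs(weight, unit, x):
    """Output of every unit for one observation x.

    weight : rows (from, to, value); unit : rows (type, activation code) with
    type -1 for input units, 1 for output units and anything else for hidden units.
    """
    out = np.zeros(len(unit))
    inputs = [i for i, (kind, _) in enumerate(unit) if kind == -1]
    out[inputs] = x
    for j, (kind, act) in enumerate(unit):
        if kind == -1:
            continue
        net = sum(w * out[int(a)] for a, b, w in weight if int(b) == j)
        out[j] = ACTIVATIONS[act](net)
    return out

def network_outputs(weight, unit, x):
    """Outputs of the output units only."""
    out = unit_outputs(weight, unit, x)
    return out[[i for i, (kind, _) in enumerate(unit) if kind == 1]]

# The three-unit net of Figure 2 with hypothetical weights w_12, w_13, w_23.
w12, w13, w23 = 0.5, -1.0, 2.0
unit = [(-1, 1), (0, 3), (1, 3)]                 # input, hidden (sigmoid), output (sigmoid)
weight = [(0, 1, w12), (0, 2, w13), (1, 2, w23)]
y_hat = network_outputs(weight, unit, [0.7])
```

Because the architecture is just a list of connections, a direct input-to-output connection such as `(0, 2, w13)` needs no special treatment.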

4.2 Macro
A macro, called NN, can be used to create, run, optimize and visualize a FFN. An example program is given by
proc()=main()
  func ("nn")                                  ; load the NN-macro
  x=read("kredit")                             ; load the credit data
  t=read("tkredit")                            ; load training, test and validation
  y=x[,1]                                      ; create y
  x=x[,2:21]                                   ; create x
  x=(x-mean(x)')./sqrt(var(x)')~matrix(1000)   ; standardizes the data
  nn(x y t)                                    ; run the NN-macro
endp
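For readers unfamiliar with the macro language, the data preparation of the example program can be mirrored in NumPy. This is a sketch under assumptions: the data are simulated instead of read from the credit files, the sample variance is taken with divisor n − 1 (the convention of the XploRe `var` command is not stated here), and the indicator coding 0 = training, 1 = test, other values = validation is the one used by the NN macro.

```python
import numpy as np

def standardize_with_constant(x):
    """Center and scale each column, then append a column of ones."""
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    return np.hstack([z, np.ones((x.shape[0], 1))])

# Simulated stand-in for the covariates (10 observations, 3 variables).
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(10, 3))
x = standardize_with_constant(raw)

# Indicator vector: 0 = training, 1 = test, anything else = validation.
t = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 0])
train = x[t == 0]
test = x[t == 1]
valid = x[(t != 0) & (t != 1)]
```

Appending the constant column to 20 standardized covariates would give 21 inputs, which is consistent with the FFN (21-1) notation used for the credit data.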

The macro requires as input parameters a matrix x of the values of the input variables, a matrix y of the corresponding values of the output variables and a vector t of indicators for the training and test data sets (0 for training data, 1 for test data and 2 or other entries for validation data, which will be ignored during the training and optimization process of the FFN). By default a FFN with no hidden layers is generated. The resulting model is equivalent to a (multiple) logistic regression model. In the example we have 21 input units and 1 output unit (FFN (21-1)). Then a menu with the following items will appear in the upper right corner of the screen:

Generate: You will be asked to input the number of hidden layers, the number of units and the activation functions in each hidden layer. The default is a feedforward network without a hidden layer, logistic output units and random starting weights chosen from the standard normal distribution.

Run: Runs a net, that is, trains it on a training data set, tests it on a test data set and visualizes it. If you used the test error to optimize the network complexity you should have left a validation set out to validate your final model.

Run CV: Runs a net as above, but the generalization (test) error is estimated by cross-validation (CV). You are asked for the number of CV subsets.

Error: Here you can determine the error function that shall be used (mean-squared error (default), Maximum-Likelihood). This value is saved in the parameter err with err = 0 for MSE and err = 1 for ML.

Optimize: One of three optimization methods (stochastic search (QSA) (default), Simulated Annealing (SA), Quadratic approximation (Q-Approx)) can be chosen, and the chosen value is saved in the parameter opt with opt = 0 for QSA, opt = 1 for SA and opt = 2 for Q-Approx. The early stopping method may not be applicable in the case of SA.

Init: Choose the initialization of the weights (ran < 0: normal distribution with zero mean and standard deviation ran; ran > 0: uniform distribution within the interval (−ran, ran), with ran a positive real value).

Decay: Select a weight decay parameter dec (default = 0.001).

Restart: It is possible to give a number res of restarts for the network training.

View Wei: You can see the weight matrix for the current network.

View Unit: If you want to recall the architecture of the current network, you can choose this item and see the unit matrix containing the kind of units with the corresponding activation functions.

View Res: For completeness you can look at the residuals of the best training model (0), that is, the model with the lowest training error during the optimization process, or of the best test model (1), which is the model with the lowest test error.

Read: Asks you for the filenames of the files containing a weight matrix (.wei), the units (.unt) and the parameters dec, opt, err, ran, res (.par). Since all these files are ASCII files you can generate your own networks, especially if you want to have a different architecture than a layered network (see B for further explanation of these files).

Write: You can give the filenames to save the current weight matrix, the matrix of units and the parameters dec, opt, err, ran, res.

CV Decay: If you want to optimize the weight decay parameter dec by cross-validation (CV) you can do so by clicking this item. You will be asked for the minimum value, the maximum value and the step width of the weight decay parameter as well as for the number of CV data sets. Then the optimization of dec is done for the current number of hidden units.

Regress: Shows a plot of the residuals against the original values of y.

Class: Shows a table of misclassifications. In the case of one output you get the rates for different threshold values. In the case of more than one output you will get a confusion matrix.

Macros. There are several macros written in XploRe that are involved in the neural network program.

The first macro transforms the weights w into distances $d^{(1)}$ (see section 3.1).

weidist2 transforms the weights w into distances $d^{(2)}$.

erfqua and erfkl contain the quadratic squared error function and the negative log-likelihood function, respectively. The input parameters are the fitted (output) value y, the target value yt, the vector of the values of the weights w and a weight decay parameter d. The output parameter erf yields the value of the error function. The user can replace the error function formula by a version of their own.

nnlayer generates the architecture of a FFN given the number of input units, the number of hidden units and the number of output units and the corresponding activation functions saved in v. As output parameters we have the weight matrix w, the matrix u containing the occurring units and their activation functions and a mask vector m containing the visualization parameters like colour and symbol for each unit.

runinit calculates function values, training and test errors and coordinates for the visualization, all for the network with initial weights.

runshow changes the display (menu, visualization window, training and test error functions and info panel) after each successful iteration.

runnet plays a central role. Depending on the error function err and the optimization algorithm opt it carries out the model fitting process. The number res is the number of restarts of the optimization algorithm from the final point of the previous iteration cycle with increased step width.

cv estimates the generalization error by cross-validation.

Info. In the Info window one can find information about the current project. Before starting to train the network one can see the indicators for the chosen optimization algorithm in OPTIM, the error function in ERROR and the starting weight generation in RAND. The decay parameter DECAY and the number of restarts RESTART can be checked. All these parameters can be changed by using the menu or by manipulating the main macro nn. During the training process the number of restarts not yet done is given by RESTART. DECAY gives the current weight decay parameter, VERR the current visualization error, N the number of training/test data and E the current training/test error.

5 Application to credit scoring

The data. The credit data are from a south German bank, collected by Fahrmeier & Hamerle (1981). The response variable is "creditability" and we have 20 covariates which are assumed to influence creditability. For a bank it is of interest to predict whether a client will pay back the credit as agreed by contract. A detailed description of the variables can be found in Table 10. The labels given in Table 10 are not chosen by chance; they represent a system of points whose sum gives an indicator for having a bad client or a good client; see Haussler (1981).

Earlier results. Fahrmeier & Hamerle (1981) used different models to find a good classification. They applied Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) in SPSS to subsets of the variables. Table 2 shows the error rates in percent for each class for the resubstitution method (p. 338).

             LDA                        QDA
Variables    bad client  good client   bad client  good client
1            20.0        43.3          20.0        43.3
5            27.3        31.6          27.3        30.6
10           26.7        30.1          26.7        26.7
15           27.0        27.4          25.3        25.3
20           26.0        27.9          18.3        30.3

Table 2: Percentage of misclassification by LDA and QDA.

They used the full logit model

$$P(Y = 1 \mid x) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_1 - \ldots - \beta_{20} X_{20})}$$

with the variable $X_4$ recoded as a binary variable ("professional" or "private" use). Since we restrict ourselves to a graphical representation we only give rounded results. They found that at a 5% significance level the hypothesis $H_0: \beta_i = 0$ against $H_1: \beta_i \neq 0$ is not rejected for i = 10, 11, 13, 14, 15, 16, 17, 18, 19 (p. 283). Fahrmeier & Tutz (1994) applied a logit model to a subset of variables ($X_1[1]$, $X_1[2]$, $X_3$, $X_5$, $X_6$, $X_7$) recoded as binary (dummy) variables. Fahrmeier and Tutz state (p. 33-34) that an increasing credit duration increases the probability of being not creditworthy. Females are more creditworthy than males, and intended "private" use is more creditworthy than "professional" use. The hypothesis $H_0: \beta_r = 0$ is rejected for all parameters except the intercept $\beta_0$.


The variables $X_4$ and $X_7$ of the above-mentioned subset are eliminated by forward and backward selection. Turlach (1994) used the subset of metrical variables $X_2$, $X_5$ and $X_{13}$. He fitted a generalized additive model to the data and obtained Figure 2.5 on page 38. He believes that the duration of the credit has a linear influence whereas the influence of the amount of the credit is quadratic with a minimum at 300 DM. The influence of the age of the client can be separated into a linear influence if the client is younger than 40 and a constant influence if the client is older than 40. He fitted a generalized linear model with the variables

$$\tilde{X}_1 = X_2, \quad \tilde{X}_2 = X_5, \quad \tilde{X}_3 = X_{13}, \quad \tilde{X}_4 = (X_5 - 300)^2, \quad \tilde{X}_5 = I_{X_{13} \ge 40}, \quad \tilde{X}_6 = I_{X_{13} < 40}\, X_{13}.$$

He showed that this generalized linear model behaves better than the model built on $X_2$, $X_5$ and $X_{13}$.

Preprocessing data. To compare the variables we will always standardize them such that $E(X_i) = 0$ and $Var(X_i) = 1$. In the logistic regression this allows us to compare the influence of the variables directly. Another possibility is to rescale the data such that they fall into the interval [0, 1]. Since both methods involve just a linear transformation we can easily recompute the weights to the first layer. Since we need a training set, a test set and an evaluation set of data we have randomly selected about 400 observations for the training set, about 400 observations for the test set and about 200 observations for the evaluation set. See Table 3 for the exact number of observations in each class.
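The derived variables of the generalized linear model can be written as a small feature-construction function. The function name and the two example observations are invented for illustration, and treating an age of exactly 40 as "older" is an assumption about the boundary case.

```python
import numpy as np

def turlach_features(x2, x5, x13):
    """Columns X~_1 ... X~_6 built from duration x2, amount x5 and age x13."""
    x2, x5, x13 = (np.asarray(v, dtype=float) for v in (x2, x5, x13))
    old = (x13 >= 40).astype(float)              # indicator for clients aged 40 or more
    return np.column_stack([
        x2,                    # X~_1: duration, linear
        x5,                    # X~_2: amount, linear part
        x13,                   # X~_3: age
        (x5 - 300.0) ** 2,     # X~_4: quadratic amount term centered at 300
        old,                   # X~_5: constant influence for older clients
        (1.0 - old) * x13,     # X~_6: linear age influence for younger clients
    ])

# Two made-up observations.
features = turlach_features(x2=[12, 24], x5=[250, 400], x13=[30, 55])
```

For the first observation (age 30) the last two columns are 0 and 30, for the second (age 55) they are 1 and 0, which reproduces the piecewise form of the age influence described in the text.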

                   Training   Test   Evaluation
not creditworthy   132        115    53
creditworthy       278        283    139
sum                410        398    192

Table 3: Distribution of the credit data for the different sets.

The FFN (21-1). A logistic regression is a FFN with no hidden unit. The input units have the identity as activation function and the one output unit the logistic function. Figure 5 shows the best training network while Figure 6 shows the best test network. We can interpret only the distances to the output unit as a measure of the size of the weights. If we compare the weights or the graphics we see that both networks have a very similar structure.


Figure 5: The best training network for the logistic regression with one output unit and weight transformation $\delta^{(1)}$. The input units (variables) are denoted by $X_1, \ldots, X_{21}$ and the single output unit by o. (Info panel: TRAINING, DECAY 0.0010, VERR 0.0000, N 410/398, E 0.53.)

Figure 6: The best generalization network for the logistic regression with one output unit and weight transformation $\delta^{(1)}$. The input units (variables) are denoted by $X_1, \ldots, X_{21}$ and the single output unit by o. (Info panel: TEST, DECAY 0.0010, VERR 0.0000, N 410/398, E 0.53.)


From the weights we can easily derive an order of importance of the variables since the data are standardized. This is shown in Table 4. The interpretation of the order of variables given in Table 4 is very simple. The general problem is that we have more "good" clients than "bad" clients (besides a sample selection problem). This results in a large constant term which pushes the logistic term towards 1. We can see three important variables: X21, X1 and X8. If we had to exclude variables we would choose X4, X10, X17, X18 and X20. We can compare the weights obtained by the neural network with the weights obtained by a Generalized Linear Model with a binomial exponential family and a logistic link function. We see that they are very similar. The differences come from the different search algorithms (stochastic search versus iterated least squares) and from the small weight decay (0.001) that we used to avoid fitting a constant function. A fit of a constant function would result in large weights.

The FFN (21-2). If we have a classification with two classes we can also use two output units such that the first neuron estimates

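$$P(Y = 1 \mid x) = \frac{1}{1 + \exp(-\beta_{1,0} - \beta_{1,1} X_1 - \dots - \beta_{1,20} X_{20})}$$

and the second neuron estimates

$$P(Y = 0 \mid x) = \frac{1}{1 + \exp(-\beta_{0,0} - \beta_{0,1} X_1 - \dots - \beta_{0,20} X_{20})},$$

where $\beta_{k,j}$ denotes the weight from input $j$ to the output unit for class $k$ ($j = 0$: constant term).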

This can easily be achieved by using a second response variable 1 - Y. The weights for both logistic regression models can be seen in the fourth and fifth columns of Table 4. It can be checked how well the probabilities are estimated. Figure 7 shows the density estimates of P(Y = 1 | X = x) + P(Y = 0 | X = x) for all subgroups. The bandwidths are optimized by least-squares cross-validation; see Bianchi (1995). By drawing a connection from both output units to each input unit we get two possible positions. For example, the line drawn in Figure 8 describes all input units which contribute equally to both output units. Thus the most interesting variables are those which are far away from the line. But we are not interested in the variables which are too far away; in our case the variable X9 (marital status and sex) contributes a lot to being a "good" client (0.2888), but not to being a "bad" client (-0.1441). This also hints at the symmetry problem of the weights. Since we put in both Y and 1 - Y we would expect that the weights have exactly the same absolute


Table 4: Weights for the best test and training network (decay = 0.001, res = 50, init = -0.1, Maximum-Likelihood) for the logistic regression with one output unit. The GLM weights were computed by the GLM macro of XploRe 3.2 (exponential family: binomial, link function: logit). The last two columns give the weights of the best test network for the case of two output units (decay = 0.001, res = 50, init = -0.1, Maximum-Likelihood); one for estimating P(Y = 1 | x) and one for P(Y = 0 | x). The variables are ordered by the absolute value of the weight for the best test network with one output unit.

                                   One output unit               Two output units
Variable                           Test      Training   GLM      P(Y=1|x)  P(Y=0|x)
constant term                      1.0721    0.9993     1.1483   0.9857   -1.0087
running account (1)                0.8464    0.8048     0.7290   0.6837   -0.7541
monthly payment (8)               -0.6213   -0.6288    -0.3336  -0.6803    0.4418
savings (6)                        0.3585    0.3525     0.3776   0.3520   -0.4515
previous credit (3)                0.3472    0.3290     0.4138   0.3570   -0.4186
age of client (13)                 0.3248    0.3218     0.1012   0.4463   -0.4026
telephone (19)                     0.2680    0.2446     0.1446   0.2090   -0.1765
amount of credit (5)              -0.2564   -0.2624    -0.2635  -0.2479   -0.0180
duration of credit (2)            -0.2502   -0.2394    -0.2961  -0.2550    0.4899
number of previous credits (16)   -0.2431   -0.2453    -0.1406  -0.1925    0.2550
properties (12)                   -0.2334   -0.2002    -0.1919  -0.1428    0.1384
housing (15)                       0.2303    0.2232     0.1553   0.1273   -0.0753
other credits (14)                 0.2277    0.2388     0.1706   0.2049   -0.2032
time in apartment (11)            -0.2235   -0.2248    -0.0156  -0.1162    0.1695
time in current job (7)            0.1969    0.2182     0.1832   0.1988   -0.3127
marital status and sex (9)         0.1898    0.1727     0.1822   0.2888   -0.1441
debtors and guarantors (10)        0.0801    0.1042     0.1658   0.1168   -0.1764
purpose of credit (4)              0.0646    0.0433     0.0865   0.2228   -0.0483
profession (17)                    0.0218    0.0170     0.0123   0.1040    0.0482
persons to maintain (18)           0.0411    0.0254    -0.0618  -0.1045   -0.0082
foreign worker (20)                0.0161    0.0008     0.2186   0.2574    0.0649










Figure 7: Density estimate for the sum of the two output units for the FFN (21-2). The pictures are all scaled to the same range (from 0.67 to 1.52).
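Why the sum of the two outputs need not equal 1 can be checked with a short sketch. It uses only the two constant terms from the two-output columns of Table 4 and assumes a client whose standardized variables are all zero; the function name is chosen here for illustration.

```python
import math

def logistic(eta):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-eta))

# Constant terms of the two output units (Table 4, two-output columns).
const_good = 0.9857   # unit estimating P(Y = 1 | x)
const_bad = -1.0087   # unit estimating P(Y = 0 | x)

# With all standardized inputs equal to zero only the constants contribute.
p_good = logistic(const_good)
p_bad = logistic(const_bad)
total = p_good + p_bad

# If the second unit had exactly the negated weights, the sum would be 1.
symmetric_total = logistic(const_good) + logistic(-const_good)
```

Because the two units are fitted separately, their weights are only approximately mirror images of each other, so the estimated probabilities sum to a value close to, but not exactly, 1.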

[Figure 8 status line: TEST, DECAY 0.0010, VERR 0.0071, N 410/398, E 1.06]

Figure 8: The best generalization network for the logistic regression with two output units and weight transformation (1). The line is drawn manually in the picture. The input units (variables) are denoted by X1,...,X21 and the two output units by o.


[Figure 9 status line: TEST, DECAY 0.0010, VERR 0.0207, N 410/398, E 0.18]

Figure 9: The best generalization network for the FFN (21-2-1) with weight transformation (1). The input units (variables) are denoted by X1,...,X21, the hidden units by A1, A2 and the single output unit by Y1.

value but different signs. There are two likely reasons why this is not the case: the network needs to be trained longer, or these variables make no important contribution to the data.

The FFN (21-2-1). The network we analyze now has two hidden units and one output unit, see Figures 9, 10 and 11. From Figures 9 and 10 we get two impressions: the least important variables are X4 (purpose of credit), X7 (time in current job), X17 (profession), X18 (persons to maintain) and X20 (foreign worker), and both hidden units seem to contribute nearly equally to the output unit. Table 5 shows all connections with an absolute weight larger than 0.5. If we look at the weights we see that only Figure 11 shows us the truth, namely that the unit A1 contributes more strongly to the output unit than the unit A2. Both units push towards a classification as a "good" client. However, all three figures tell us the same about the most important units: X1 (running account), X6 (savings), X8 (monthly payment) and X10 (debtors and guarantors). That the behaviour of the variable X10 has changed so much compared with the logistic model is a consequence of the high association between the independent variables; Table 17 shows the high variability of the coefficients for the FFN 21-1. The surprising fact is that both units have the same sign for the contribution to



[Figure 10 status line: TEST, DECAY 0.0010, VERR 0.0207, N 410/398, E 0.18]

Figure 10: The best generalization network for the FFN (21-2-1) with weight transformation (2) and zoomed to the inner region. The input units (variables) are denoted by X1,...,X21, the hidden units by A1, A2 and the single output unit by Y1.

[Figure 11 status line: TEST, DECAY 0.0010, VERR 0.0021, N 410/398, E 0.18]

Figure 11: The best generalization network for the FFN (21-2-1) with weight transformation (2). The input units (variables) are denoted by X1,...,X21, the hidden units by A1, A2 and the single output unit by Y1.


[The columns naming the hidden unit of each row and its coefficient to the output unit are not preserved.]

Coeff.    Unit   Meaning
 1.1466   X1     running account
-0.6873   X10    debtors and guarantors
-0.6813   X8     monthly payment
 0.6717   X6     savings
 0.6295   X3     previous credit
 0.6218   X7     time in current job
-0.5657   X2     duration of credit
 1.0471   X1     running account
-0.5938   X8     monthly payment
 0.5241   X6     savings

Table 5: All weights with an absolute value larger than 0.5 in the FFN with 2 hidden units.

the output variables and the same sign for the contribution of the most important variables to the output units. Both hidden units summarize the ad hoc criteria someone would use to judge whether a client should get a credit.

The FFN (21-2-2). Again we use two output units; one to estimate Y and the other one to estimate 1 - Y. Both output units coincide in Figure 12 and the weights have nearly the same absolute value but different signs (to Y: 1.0564 and -1.0603; to 1 - Y: 1.7605 and -1.7752). The graphic in Figure 12 at first makes us believe that many input units contribute equally to both output units, but the zoom into the inner region in Figure 13 shows that this is not true. The density estimate in Figure 14 shows a surprising result. The range is much smaller than in Figure 7, and we can clearly distinguish two peaks in the sum of the probabilities, one peak for the "good" clients (upper row) nearer to 1 and another peak for the "bad" clients (lower row).

The FFN (21-5-1). The network with 5 hidden neurons already shows the problem of our approaches: the high dimensionality. Whereas Table 6 shows that Figure 15 completely fails to show the important connections of the network, the second transformation function (Figure 16) shows a much better behaviour. However, we already see an effect that will become much more visible in the next, much more complex network.

The FFN (21-10-1). A more complex network is shown in Figure 17; here we have a network with 10 hidden units. We can see that the non-linear transformation (1) has a larger visualization error (VERR = 0.1919) than the linear transformation (2) (VERR = 0.0346, Figure 18). Note that with a visualization error of 0.2 and a true


[Figure 12 status line: TEST, DECAY 0.0010, VERR 0.0350, N 410/398, E 1.10]

Figure 12: The best generalization network for the FFN (21-2-2) with weight transformation (1). The input units (variables) are denoted by X1,...,X21, the hidden units by A1, A2 and the two output units by o. The input units nearest to the output units are marked by * to avoid overplotting in the picture.

[Figure 13 status line: TEST, DECAY 0.0010, VERR 0.0350, N 410/398, E 1.10]

Figure 13: The best generalization network for the FFN (21-2-2), zoomed to the inner region.










Figure 14: Density estimate for the sum of the two output units for the FFN (21-2-2). The pictures are all scaled to the same range (from 0.998 to 1.00).

[Figure 15 status line: TEST, DECAY 0.0010, VERR 0.1164, N 410/398, E 0.57]

Figure 15: The best generalization network for the FFN (21-5-1) with weight transformation (1). The input units (variables) are denoted by X1,...,X21, the hidden units by A1,...,A5 and the single output unit by Y1.


[Figure 16 status line: TEST, DECAY 0.0010, VERR 0.0194, N 410/398, E 0.61]

Figure 16: The best generalization network for the FFN (21-5-1) with weight transformation (2). The input units (variables) are denoted by X1,...,X21, the hidden units by A1,...,A5 and the single output unit by Y1.

[Figure 17 status line: TEST, DECAY 0.0010, VERR 0.1919, N 410/398, E 0.63]

Figure 17: Visualization of a FFN with 10 hidden units using the transformation function (1). The input units (variables) are denoted by X1,...,X21, the hidden units by A1,...,A10 and the single output unit by Y1.



Connections to the output unit Y1 are listed for the hidden units A5, A3, A4, A2, A1 (in this order); the only coefficient preserved is 2.1842.

Connections from input units to hidden units [the column naming the hidden unit of each row is not preserved]:

Coeff.    Unit   Meaning
-1.6310   X8     monthly payment
-1.1086   X10    debtors and guarantors
 1.0968   X1     running account
-1.1163   X3     previous credit
-0.8419   X1     running account
-0.7937   X14    other credits
 0.7609   X2     duration of credit
 1.1982   X1     running account
-1.0577   X11    time in apartment
-1.0147   X5     amount of credit
 0.9868   X7     time in current job
 0.9517   X21    constant term
 1.3634   X9     marital status and sex
 0.9399   X5     amount of credit
 0.8795   X19    telephone
 0.8404   X10    debtors and guarantors
 1.1148   X5     amount of credit
 1.0552   X20    foreign worker
-0.9805   X14    other credits
-0.9805   X16    number of previous credits
-0.8473   X18    persons to maintain
-0.7811   X7     time in current job

Table 6: All weights with an absolute value larger than 0.75 in the FFN with 5 hidden units.


[Figure 18 status line: TEST, DECAY 0.0010, VERR 0.0346, N 410/398, E 0.63]

Figure 18: Visualization of a FFN with 10 hidden units using the transformation function (2). The input units (variables) are denoted by X1,...,X21, the hidden units by A1,...,A10 and the single output unit by Y1.

coefficient of 1.2 we can get a visual approximation of a coefficient in the interval [1, 1.5]. Which units are finally the important ones in Figure 18? A clear sequence of important hidden units can be discovered (A2, A7, A8). The also important unit A5 (see Table 7) vanishes behind the cluster of hidden units. The coefficients of the remaining units, that is all except A2, A7, A8 and A5, range from 0.6638 to -0.4429. This may give us a hint that 4 hidden units are already sufficient. We now examine the units near the hidden units to interpret their meaning. They describe various aspects of a client which have to be considered by the bank:

A2 balances the existence of debtors and guarantors against the behaviour of the account and the existence of other credits. It is interesting to note that a debtor or guarantor has a large negative influence.
A8 is mainly influenced by the behaviour on previous credits.
A5 is mainly influenced by savings.
A7 describes the properties of the credit (purpose and amount) and the running account.



Connections to the output unit Y1 are listed for the hidden units A2, A8, A5, A7, A9, A3, A4, A10, A1 (in this order); the coefficients preserved, in the order listed, are 1.9155, 1.7798, -1.5030, -1.3840, 0.6338, 0.3990, 0.3854, -0.2674.

Connections from input units to hidden units [the column naming the hidden unit of each row is not preserved]:

Coeff.    Unit   Meaning
-1.5624   X10    debtors and guarantors
 1.0740   X1     running account
 1.0696   X14    other credits
 1.0651   X3     previous credit
-1.5021   X6     savings
-1.8687   X1     running account
-1.4527   X4     purpose of credit
-1.0024   X5     amount of credit
-1.3846   X2     duration of credit
 1.0173   X12    properties
 1.0468   X20    foreign worker
 1.1105   X21    constant term
-1.0937   X11    time in current apartment
-1.1510   X9     marital status and sex
 1.0089   X18    persons to maintain

Table 7: Largest weights in the FFN with 10 hidden units.

Again the transformation (1) fails to show the important connections whereas the visualization with (2) behaves much better.

Conclusion. Let us now compare all classification methods in Table 8 by their misclassification rates. Since in Fahrmeier & Hamerle (1981) the rates were given for "bad" and "good" clients, we give the rates for both classes separately. First we give the rates for the whole dataset. Since we have used the two sets "Training" and "Test" for constructing the neural network, we give the rates based on the best "Test" network. The "Validation" rates give us a hint how well we can classify new observations. For LDA and QDA we have only the overall rates. The architecture of the FFN is described by the numbers in parentheses, e.g. (21-1) means 21 input units (20 variables and the constant term), no hidden unit and 1 output unit. If we use one output unit we can define a threshold value t such that we decide for a "good" client if ŷ > t and for a "bad" client otherwise. We see from the tables that QDA behaves better than LDA. With the FFN (21-1) we can achieve the LDA quality if we choose a threshold value t ≈ 0.7. To choose the threshold value well we would need a cost function. A possible choice, which is used in banks, is to obtain the same misclassification rate on "bad" and "good" clients. Tables 12 ff. show that a choice of t = 0.7 is not too bad in this sense. Due to the smaller number of "bad" clients, which is already artificially increased,


the FFNs behave much better on the "good" clients than on the "bad" clients. We should remark that we did not investigate the stability of the solution. Since we use stochastic search algorithms and can start with different weights, we may get different results!

Method          Client   t     Overall   Training   Test    Validation
LDA             bad              26.0
                good             27.9
QDA             bad              18.3
                good             30.3
GLM             bad      0.7     26.0      28.0     24.3      24.5
                good     0.7     27.0      24.8     31.8      21.5
FFN (21-1)      bad      0.7     23.7      23.5     20.9      30.2
                good     0.7     32.7      28.1     39.9      27.3
FFN (21-2-1)    bad      0.7     22.2      20.4     22.6      24.5
                good     0.7     30.8      25.2     38.9      25.9
FFN (21-5-1)    bad      0.7     21.7      13.6     27.0      30.2
                good     0.7     28.4      19.4     38.2      26.6
FFN (21-10-1)   bad      0.7     21.0       7.6     28.7      37.7
                good     0.7     28.1      19.4     35.0      31.7
FFN (21-2)      bad              47.3      48.5     45.2      49.0
                good             14.9      12.2     19.1      11.5
FFN (21-2-2)    bad             100.0     100.0    100.0     100.0
                good              0.0       0.0      0.0       0.0
FFN (21-5-2)    bad              47.0      43.2     47.8      54.7
                good             12.7       8.3     18.7       9.4
FFN (21-10-2)   bad              49.7      47.0     53.9      47.2
                good             10.6       7.5     14.5       8.6

Table 8: Misclassification rates for different classification methods.

In terms of solving the classification task a FFN with 2 hidden neurons seems to be the best choice. However, it is still worse than the simple GLM or the QDA. The visualization of the resulting networks seems to be a difficult problem. The first transformation function fails to visualize the network if it becomes too complex, as Table 6 and Table 7 show. The second approach behaves much better. But both figures, Figure 16 and Figure 18, show the hidden units near a line through the center of the picture and the input units (variables) on lines nearly parallel to this central line. What we have to judge in these pictures are the distances between the hidden units and the input units. The main point that the eye can see is the clustering of units along the parallel lines, although we have no direct connections between these units.
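The threshold rule discussed in the conclusion (decide for a "good" client if the output exceeds t, and pick t so that the misclassification rates on "bad" and "good" clients are similar) can be sketched as follows. The scores, labels and function names are hypothetical and serve only as an illustration.

```python
def error_rates(scores, labels, t):
    """Misclassification rates (bad, good) for the rule: 'good' if score > t."""
    bad = [s for s, y in zip(scores, labels) if y == 0]
    good = [s for s, y in zip(scores, labels) if y == 1]
    bad_error = sum(s > t for s in bad) / len(bad)       # bad client accepted
    good_error = sum(s <= t for s in good) / len(good)   # good client rejected
    return bad_error, good_error

def balanced_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the smallest gap between both rates."""
    def gap(t):
        bad_error, good_error = error_rates(scores, labels, t)
        return abs(bad_error - good_error)
    return min(candidates, key=gap)

# Hypothetical network outputs and true classes (1 = good, 0 = bad).
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.65, 0.60, 0.88, 0.72, 0.55, 0.40, 0.30]
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
t_best = balanced_threshold(scores, labels, [0.1, 0.3, 0.5, 0.7, 0.9])
```

A cost function, as mentioned in the text, would replace the gap criterion by a weighted sum of the two error rates.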

6 Application to protein data

The data. The following application comes from a very popular field of molecular biology: protein structure prediction from the amino acid sequence, that is, finding a model which describes the relationship between the amino acid sequence and the three-dimensional structure of a protein. Because proteins are macromolecules with a very complex spatial structure, there is no experimental method that determines the atomic coordinates of each newly isolated protein at an acceptable expense. For many proteins the experimental structure determination is not possible at all with current methods. In order to make the statistical approach realistic we can simplify the prediction problem to a classification problem. For that we define structural (folding) classes of the proteins and try to classify a new molecule into one of those classes only on the basis of its amino acid sequence information. See Grassmann (1996) for more details.

Here we will consider a very simple case. A protein consists of a sequence of amino acids. As input variables we have chosen the relative amino acid frequencies within the protein. There are 20 different amino acids in nature, giving 20 input variables $x_i$ ($i = 1, \dots, 20$) with $x_i = f_i / N$, where $f_i$ is the absolute frequency of amino acid $i$ in a protein consisting of a chain of $N$ amino acids. We say that the protein has length $N$. For the current purpose a rather rough class definition of four supersecondary structural classes is selected. Secondary structural elements (e.g. α-helix, β-strand, coil) are composed to the classes

α (only α-helices)
α+β (one part α and one part β)
α/β (α and β alternating)
β (only β-strands)

We have chosen 268 proteins from a structural protein database that was used by Reczko, Bohr, Subramaniam, Pamidighantam & Hatzigeorgiou (1994).
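The input variables described above, the relative amino acid frequencies $x_i = f_i / N$, can be computed as in the following sketch. The one-letter alphabet, its ordering and the example sequence are assumptions for illustration; the paper does not state which ordering of the amino acids was used.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def relative_frequencies(sequence):
    """Return the 20 relative frequencies x_i = f_i / N in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    n = len(sequence)  # protein length N
    return [counts[a] / n for a in AMINO_ACIDS]

# Hypothetical short sequence, for illustration only.
sequence = "MKTAYIAKQR"
x = relative_frequencies(sequence)
```

For a sequence that contains only the 20 standard letters the frequencies sum to one, so together with a constant term they give the 21 input units of the networks discussed below.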



classification error rate (%)

Method                     training    (decay, CV(10))   test        CV(10)
                           (n = 143)                     (n = 125)   (n = 268)
LDA                          30.1                          36.0        29.5
QDA                           0.0                          60.0        60.4
QDA(mono)                    18.2                          28.0        20.1
FFN(0)                       30.8      (0.001, 44.1)       33.6        36.9
FFN(0) -1 -15                30.8                          32.8        36.9
FFN(3)                       25.9      (0.001, 43.4)       30.4        33.6
FFN(3) -1 -15                25.9                          30.4        34.7
FFN(6)                        2.8      (0.001, 39.9)       29.6        27.2
FFN(6) -1 -15                 0.7                          28.8        27.2
FFN(6) -1 -12 -15 -18         2.1                          29.6        31.7
FFN(11)                       0.0      (0.001, 35.7)       23.2        22.4
FFN(16)                       2.8      (0.005, 38.5)       20.0        19.8

Table 9: Classification error rates for several classification methods: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), QDA with only pure quadratic terms (QDA(mono)) and FFNs with 0, 3, 6, 11 and 16 hidden units. The first column contains the classification error rate for the training data set. For the optimization of the weight decay parameter, 10-fold cross-validation (CV(10)) was used within the training data set. The classification error rate for the test data set and the CV(10) for the whole data set give an estimate of the prediction error. Line 5 means that a FFN without hidden units was fitted without using the variables X1 and X15.

The classification and visualization results. To classify a protein with known amino acid sequence into one of K = 4 classes we first build the model

$$f_k(x) = P(y_k = 1 \mid x) = \frac{1}{1 + \exp(-w_k^T x)}$$

with $k = 1, \dots, K$, $x = (1, x_1, \dots, x_p)$ and $w = (w_{10}, w_{11}, \dots, w_{Kp})$, where $w_k = (w_{k0}, \dots, w_{kp})$ collects the weights of output unit $k$. As shown in section 3.3, this is a feedforward neural network without a hidden layer and with logistic output units. This can be convenient if the objects do not necessarily belong to exactly one of the given classes. To make the outputs sum to one, the "softmax" method (Bridle 1990) can be chosen, that is to model

$$f_k(x) = P(y_k = 1 \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}.$$
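The difference between the two output models can be made concrete with a small sketch. The weight matrix and the input vector are hypothetical; the code only illustrates that independent logistic outputs need not sum to one, whereas softmax outputs do, and that for one fixed weight matrix both select the same class because both are increasing in $w_k^T x$.

```python
import math

def linear_terms(W, x):
    """Return w_k^T x for every output unit k."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def logistic_outputs(W, x):
    """K independent logistic units: f_k(x) = 1 / (1 + exp(-w_k^T x))."""
    return [1.0 / (1.0 + math.exp(-eta)) for eta in linear_terms(W, x)]

def softmax_outputs(W, x):
    """Softmax units: f_k(x) = exp(w_k^T x) / sum_j exp(w_j^T x)."""
    etas = linear_terms(W, x)
    shift = max(etas)  # subtracting the maximum avoids overflow
    exps = [math.exp(eta - shift) for eta in etas]
    total = sum(exps)
    return [e / total for e in exps]

def predicted_class(outputs):
    """Index of the class with the largest output."""
    return max(range(len(outputs)), key=outputs.__getitem__)

# Hypothetical weights for K = 4 classes, constant term plus p = 2 inputs.
W = [[0.2, 1.5, -0.5],
     [-0.1, -1.0, 0.8],
     [0.0, 0.3, 0.3],
     [-0.4, -0.2, -1.1]]
x = [1.0, 0.6, 0.1]  # the leading 1 multiplies the constant term

p_logistic = logistic_outputs(W, x)
p_softmax = softmax_outputs(W, x)
```

In practice the two models are fitted separately and will generally end up with different weights, so agreement of the fitted classifiers is an empirical question rather than a consequence of this sketch.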



Both approaches should yield the same classification results as the polytomous logistic regression (Schumacher, Rossner & Vach 1996, Vach, Rossner & Schumacher 1996). In each case the class with the maximum corresponding output is chosen, as derived from the Bayesian decision rule. While the FFN models were fitted in S-Plus using the nnet library, the visualization was done in XploRe using the nn macro. Figure 19 shows the visualized version of the FFN(21-4).
[Figure 19 status line: TRAINING, DECAY 0.0010, VERR 0.0350, N 72/72, E 14.32]

Figure 19: Visualization of a FFN(21-4) for the protein data. The input units are denoted by X and the output units by Y.

Because there are no hidden units, the connections between the input units and the output units can be interpreted directly. No input variable seems to be especially important or unimportant in general. The variable X14 can be found near the output unit Y4; the fitted weight of their connection indeed has a rather large value. The input unit X19 and the output unit Y1 behave similarly. Besides, one can find two clusters of input variables, one with stronger connections to the output units Y1 and Y4 and the other closer to Y2 and Y3. Within the lower left cluster the input variables X1, which describes the constant term, and X15 are most weakly connected with all output units. Removing these two variables from the model leads to similar or slightly better prediction results (see Table 9).

We can replace the linear term $w^T x$ by a non-linear one, for instance by adding hidden units. Then we obtain a multilayer network as described in section 2.1. In Table 9 the classification error rates for different numbers of hidden units as well as for classical statistical classification methods are listed. LDA achieves a CV error of 29.5 %. QDA(mono), that is QDA with only pure quadratic terms, is flexible enough to be better than LDA with a CV error of 20.1 %, whereas QDA in its full version clearly overfits the data. To reach a result similar to or better than LDA, 6 hidden units were needed in a feedforward neural network. To be better than QDA(mono), 16 hidden units were necessary. Thus, in each case there are more parameters to fit in the neural network than in the classical statistical approach in order to get a similar prediction ability. The visualization results are given in Figures 20, 21 and 22.
[Figure 20 status line: TRAINING, DECAY 0.0010, VERR 0.0093, N 72/72, E 23.68]

Figure 20: Visualization of a FFN(21-3-4) for the protein data. The input units are denoted by X, the hidden units by A and the output units by Y.

Here it is not allowed to interpret the positions of the input and output units relative to each other, because there are no direct connections between them. In particular, in Figure 22 the hidden units are distributed relatively uniformly around both the input units and the output units. This could be interpreted as all hidden units having a similar importance for the model. Following the visualized models in increasing order from 3 to 11 hidden units, the importance of individual hidden units seems to decrease. But in all models one can find input units that are further away from the hidden units than others. Removing the corresponding variables can give better or at least similar results while decreasing the number of parameters of the model. When too many variables are removed there is a danger of losing too much information, and the prediction ability will get worse (see FFN(6) with 4 variables left out in Table 9).


[Figure 21 status line: TRAINING, DECAY 0.0010, VERR 0.0412, N 72/72, E 73.04]

Figure 21: Visualization of a FFN(21-6-4) for the protein data.

[Figure 22 status line: TRAINING, DECAY 0.0010, VERR 0.0479, N 72/72, E 71.72]

Figure 22: Visualization of a FFN(21-11-4) for the protein data.


7 Discussion
In this article we implemented and visualized feedforward neural networks (FFNs) in order to make the black box more transparent. In addition to looking at the list of fitted weights, we see a two-dimensional representation of them found by non-metric multidimensional scaling (NMDS). The idea was to transform the fitted weights of the connections between the units of a neural network into distances between the corresponding units. Then a distance-preserving projection of the unit space onto a lower-dimensional space was carried out. In order to get good prediction results with neural networks, often a large number of parameters has to be fitted, and their individual meaning gets lost. We applied our algorithm to two real-world examples, the credit and the protein data, where the visualization made it easier to recognize weakly connected units. Removing input units that were "far away" from the hidden or output units resulted in similar or better prediction results. For the protein data it was characteristic that the more hidden units were used, the more uniform was the influence of the individual ones. Of course, our results towards a better understanding of the internal behaviour of feedforward neural networks are limited, and the current state of model selection is just a heuristic method by trial and error. The criterion was to remove the most weakly connected units until the prediction error begins to increase. Since the input variables are on the same scale, this should be realistic. Nevertheless, the visualization is a useful method to get an idea of the relations between individual parts of a model. As at any stage of solving a problem, there is potential for improvement. It would be useful, for example, to show coloured connections between units indicating negative or positive values of the weights, whereas the thickness could give information about the local visualization error.
On request, the values of the weights and the calculated distances between the units should be displayed. A further improvement in visualizing the weights and understanding the network can be achieved by dynamic graphics in order to diminish the problem of dimensionality. An analysis of a network would then be done step by step, starting with the strongest connections and then continuing with single units.
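The trial-and-error criterion described in the discussion (remove the most weakly connected unit until the prediction error begins to increase) can be written as a simple greedy loop. This is only a sketch: the two callbacks stand in for a fitted network and an error estimate such as cross-validation, and all names and numbers are hypothetical. In a real application the network would be refitted after every removal.

```python
def prune_inputs(variables, connection_strength, prediction_error, tolerance=0.0):
    """Drop the most weakly connected input while the error does not increase.

    variables           -- list of input names
    connection_strength -- function(name, kept) -> non-negative strength
    prediction_error    -- function(kept) -> estimated prediction error
    """
    kept = list(variables)
    best_error = prediction_error(kept)
    while len(kept) > 1:
        weakest = min(kept, key=lambda name: connection_strength(name, kept))
        candidate = [name for name in kept if name != weakest]
        error = prediction_error(candidate)
        if error > best_error + tolerance:
            break  # the prediction error begins to increase: stop
        kept, best_error = candidate, error
    return kept, best_error

# Hypothetical stand-ins for a fitted network and its estimated error.
strength = {"X1": 0.05, "X5": 0.9, "X14": 1.2, "X15": 0.1, "X19": 1.0}
informative = {"X5", "X14", "X19"}

def toy_strength(name, kept):
    return strength[name]

def toy_error(kept):
    # Error grows with each missing informative variable,
    # plus a small penalty per remaining parameter.
    missing = len(informative - set(kept))
    return 0.20 + 0.10 * missing + 0.01 * len(kept)

kept, error = prune_inputs(list(strength), toy_strength, toy_error)
```

In the visualization, the connection strength would correspond to a small distance between an input unit and the hidden or output units; the sketch leaves that mapping abstract.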


A The credit data and results in detail

                                                                          relative percentage
Symbol  Variable                     Label  Meaning                  Y = 0    Y = 1
Y       creditability                0      not creditworthy
                                     1      creditworthy
X1      running account              1      no account               45.00    19.86
                                     2      account in minus         35.00    23.43
                                     3      account [0, 200) DM       4.67     7.00
                                     4      account ≥ 200 DM         15.33    49.71
X2      duration of credit                  in months
X3      payment of previous credit   0      bad payment               8.33     2.14
                                     1      other credits             9.33     3.00
                                     2      no credits before        56.33    51.57
                                     3      good payment              9.33     8.57
                                     4      all credits finished     16.67    34.71
X4      purpose of credit            0      miscellaneous            29.67    20.71
                                     1      car (new)                 5.67    12.29
                                     2      car (used)               19.33    17.57
                                     3      furniture                20.67    31.14
                                     4      radio/tv                  1.33     1.14
                                     5      household                 2.67     2.00
                                     6      reconstruction            7.33     4.00
                                     7      education                 0.00     0.00
                                     8      holidays                  0.33     1.14
                                     9      retraining               11.33     9.00
                                     10     company                   1.67     1.00
X5      amount of credit                    in DM
X6      savings                      1      no savings               72.33    55.14
                                     2      < 100 DM                 11.33     9.86
                                     3      100 ... < 500 DM          3.67     7.43
                                     4      500 ... < 1000 DM         2.00     6.00
                                     5      ≥ 1000 DM                10.67    21.57

Table 10: Variables and labels of the credit data of Fahrmeier & Hamerle (1981).


                                                                          relative percentage
Symbol  Variable                     Label  Meaning                  Y = 0    Y = 1
X7      time in current job          1      jobless                   7.67     5.57
                                     2      < 1 year                 23.33    14.57
                                     3      1 ... < 4 years          34.67    33.57
                                     4      4 ... < 7 years          13.00    19.29
                                     5      ≥ 7 years                21.33    27.00
X8      monthly payment in           1      ≥ 35                     11.33    14.57
        percentage of income         2      25 ... < 35              20.67    24.14
                                     3      20 ... < 25              15.00    16.00
                                     4      < 20                     53.00    45.29
X9      marital status and sex       1      male, divorced            6.67     4.29
                                     2      female                   11.33    10.29
                                     2      male, non-married        25.00    18.43
                                     3      male                     48.67    57.43
                                     4      female, non-married       8.33     9.57
X10     other debtors and            1      none                     90.67    90.71
        guarantors                   2      other debtors             6.00     3.29
                                     3      guarantor                 3.33     6.00
X11     time in current apartment    1      < 1 year                 12.00    13.43
                                     2      1 ... < 4 years          32.33    30.14
                                     3      4 ... < 7 years          14.33    15.14
                                     4      ≥ 7 years                41.33    41.29
X12     properties                   1      none                     20.00    31.71
                                     2      car                      23.67    23.00
                                     3      life assurance           34.00    32.86
                                     4      house, land              22.33    12.43
X13     age of client                       in years
X14     other credits                1      at another bank          19.00    11.71
                                     2      at a warehouse            6.33     4.00
                                     3      none                     74.67    84.29
X15     housing                      1      no rent                  23.33    15.57
                                     2      rented                   62.00    75.29
                                     3      property                 14.67     9.14
X16     number of previous credits   1      1                        66.67    61.86
        incl. the actual             2      2 or 3                   30.67    34.43
                                     3      4 or 5                    2.00     3.14
                                     4      ≥ 6                       0.67     0.57
X17     profession                   1      jobless                   2.33     2.14
                                     2      no training              18.67    20.57
                                     3      blue collar              62.00    63.43
                                     4      white collar             17.00    13.86
X18     persons to maintain          1      ≥ 3                      15.33    15.57
                                     2      0, 1 or 2                84.67    84.43
X19     telephone                    1      no                       62.33    58.43
                                     2      yes                      37.67    41.57
X20     foreign worker               1      yes                       1.33     4.71
                                     2      no                       98.67    95.29

Table 11: Table 10 continued.

Method: GLM

Client   t     Overall   Training   Test    Validation
bad      0.1    98.7      99.2      97.4     100.0
         0.3    78.0      78.8      74.8      83.0
         0.5    52.0      51.5      48.7      60.4
         0.7    26.0      28.0      24.3      24.5
         0.9     5.7       5.3       5.2       7.5
good     0.1     0.0       0.0       0.0       0.0
         0.3     2.7       2.5       3.2       2.2
         0.5    11.0       9.3      13.4       9.3
         0.7    27.0      24.8      31.8      21.5
         0.9    70.6      71.2      64.0      64.0

Table 12: Misclassification rates for different threshold values for GLM.
Method: FFN (21-1)

Client   t     Overall   Training   Test    Validation
bad      0.1    95.0      93.9      93.9     100.0
         0.3    69.0      70.5      66.1      71.7
         0.5    47.3      47.0      46.1      50.9
         0.7    23.7      23.5      20.9      30.2
         0.9     5.3       2.3       6.1      11.3
good     0.1     0.0       0.0       0.0       0.0
         0.3     4.4       1.8       6.7       5.0
         0.5    14.9      12.2      19.1      11.5
         0.7    32.7      28.1      39.9      27.3
         0.9    70.7      69.0      73.1      69.0

Table 13: Misclassification rates for different threshold values for FFN (21-1).
Method: FFN (21-2-1)

Client   t     Overall   Training   Test    Validation
bad      0.1   100.0     100.0     100.0     100.0
         0.3   100.0     100.0     100.0     100.0
         0.5   100.0     100.0     100.0     100.0
         0.7    22.2      20.4      22.6      24.5
         0.9     1.6       0.7       3.4       0.0
good     0.1     0.0       0.0       0.0       0.0
         0.3     0.0       0.0       0.0       0.0
         0.5     0.0       0.0       0.0       0.0
         0.7    30.8      25.2      38.9      25.9
         0.9    88.3      87.7      89.4      87.1

Table 14: Misclassification rates for different threshold values for FFN (21-2-1).

Method: FFN (21-5-1)

Client   t     Overall   Training   Test    Validation
bad      0.1    91.3      89.4      91.3      96.2
         0.3    59.0      54.6      62.6      62.2
         0.5    38.0      34.8      40.0      41.5
         0.7    21.7      13.6      27.0      30.2
         0.9     6.7       0.8       9.5      15.1
good     0.1     0.1       0.0       0.4       0.0
         0.3     5.6       2.5       8.1       6.4
         0.5    13.6       8.3      13.8      11.5
         0.7    28.4      19.4      38.2      26.6
         0.9    61.7      58.6      65.4      60.4

Table 15: Misclassification rates for different threshold values for FFN (21-5-1).

Method         Client  t    Overall  Training  Test   Validation
FFN (21-10-1)  bad     0.1    91.6     89.4    91.3      98.1
                       0.3    64.0     55.3    68.7      75.5
                       0.5    40.3     28.0    46.1      58.5
                       0.7    21.0      7.6    28.7      37.7
                       0.9     6.7      2.3    10.4       9.4
               good    0.1     0.3      0.0     0.4       0.7
                       0.3     5.4      1.8     8.5       6.5
                       0.5    12.3      5.8    18.4      12.9
                       0.7    28.1     19.4    35.0      31.7
                       0.9    59.7     54.0    66.1      58.3

Table 16: Misclassification rates for different threshold values for FFN (21-10-1).
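
The threshold tables report, for each client group, the share of clients that end up in the wrong class for a given threshold t. The following Python sketch shows one way such rates could be computed. It assumes that Y = 0 codes a bad and Y = 1 a good client, and that a client is classified as good when the model output exceeds t; this reading is consistent with the pattern in the tables but is not stated explicitly here. The function name and the sample outputs are hypothetical.

```python
def misclassification_rates(y_true, y_prob, threshold):
    """Return (bad_rate, good_rate) in percent for one threshold.

    Assumption: y_true uses 0 for a bad and 1 for a good client, and a
    client is classified as good when the output exceeds the threshold.
    """
    bad = [p for y, p in zip(y_true, y_prob) if y == 0]
    good = [p for y, p in zip(y_true, y_prob) if y == 1]
    # A bad client is misclassified when it is predicted to be good.
    bad_rate = 100.0 * sum(p > threshold for p in bad) / len(bad)
    # A good client is misclassified when it is predicted to be bad.
    good_rate = 100.0 * sum(p <= threshold for p in good) / len(good)
    return bad_rate, good_rate


# Invented model outputs for four bad and four good clients.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.4, 0.6, 0.8, 0.35, 0.65, 0.85, 0.95]

for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    bad_rate, good_rate = misclassification_rates(y_true, y_prob, t)
    print(f"t={t:.1f}  bad={bad_rate:5.1f}  good={good_rate:5.1f}")
```

Under this convention, raising t lowers the error rate for bad clients and raises it for good clients, which is the trade-off visible in each table.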


Variable                         Mean coefficient  Std. Dev.  Minimum  Maximum
constant term                          1.0606       0.09411    0.9643   1.3699
running account (1)                    0.8086       0.07444    0.6895   1.0330
monthly payment (8)                   -0.6604       0.06094   -0.7719  -0.5499
savings (6)                            0.3762       0.06070    0.2586   0.5443
previous credit (3)                    0.3739       0.11550    0.0539   0.6274
age of client (13)                     0.2689       0.07805    0.0000   0.3557
telephone (19)                         0.2317       0.04494    0.1202   0.3053
amount of credit (5)                  -0.3146       0.07001   -0.4866  -0.2265
duration of credit (2)                -0.2684       0.07995   -0.5065  -0.1740
number of previous credits (16)       -0.2436       0.08877   -0.4986  -0.0242
properties (12)                       -0.2388       0.08883   -0.5218  -0.0825
housing (15)                           0.2386       0.07224    0.0744   0.3460
other credits (14)                     0.1961       0.07711   -0.0276   0.3454
time in apartment (11)                -0.1418       0.13450   -0.2577   0.3354
time in current job (7)                0.1797       0.06994    0.0362   0.3434
marital status and sex (9)             0.2183       0.06131    0.1078   0.3649
debtors and guarantors (10)            0.1378       0.06397    0.0589   0.2906
purpose of credit (4)                  0.0732       0.05326   -0.0429   0.1583
profession (17)                        0.0397       0.06406   -0.0480   0.1918
persons to maintain (18)              -0.0324       0.07790   -0.2711   0.0298
foreign worker (20)                    0.0714       0.08454   -0.1057   0.2788

Table 17: Result of 20 replications of the FFN 21-1 (logit model) on the credit data.
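
A summary of this kind can be produced from the fitted coefficients of the individual replications. The sketch below is a minimal Python illustration, not the procedure used for the table: the function name and the sample values are invented, and it assumes the sample standard deviation (divisor n - 1), since the variant behind the reported figures is not stated.

```python
import statistics


def summarize(replications):
    """Return (mean, std. dev., minimum, maximum) for each coefficient.

    replications: one list of fitted coefficients per training run.
    Assumption: the sample standard deviation (divisor n - 1) is used.
    """
    summary = []
    for values in zip(*replications):  # regroup the runs by coefficient
        summary.append((statistics.mean(values), statistics.stdev(values),
                        min(values), max(values)))
    return summary


# Invented coefficients of three runs for two variables (not the paper's data).
runs = [[1.0, -0.5], [1.2, -0.7], [0.8, -0.6]]
for mean, sd, lo, hi in summarize(runs):
    print(f"{mean:8.4f} {sd:8.5f} {lo:8.4f} {hi:8.4f}")
```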


B What should the files that characterize a neural network in XploRe look like?
To explain the form of the files that XploRe reads to characterize a neural network, we give a simple example based on figure 23.
[Network diagram: units numbered 1 to 6, with connection labels w14, w15, w35, w46 and w56.]

Figure 23: A simple example for the explanation of the numbering of the units of a FFN

The weights. The ".wei" file should contain four columns (see Table 18). The first column gives the number of the unit at which a connection starts, and the second column the number of the unit at which it ends. The corresponding weights have to be written in the third column for the best training network and in the fourth column for the best test network.

The units. The ".unt" file specifies the units of the network. The first column gives the layer in which the unit occurs (-1 for an input, 0 for a hidden and 1 for an output unit). The second column codes the activation function (0 for the identity, 1 for a jump function and 2 for the logistic function). The third and fourth columns give the label of the unit in the graphical output of the visualization and the internal code of the corresponding symbol.

The fitting parameters. The ".par" file lists the weight decay parameter dec, the code opt for the optimization method, the code err for the error function, the kind of initialization ran of the weights and the number of restarts res of the optimization procedure.

The XploRe macros. The XploRe macro NN is available via WWW under: sigbert/nn.xpl.

".wei" file:
1  4  w^0_{14}  w^1_{14}
2  4  w^0_{24}  w^1_{24}
3  4  w^0_{34}  w^1_{34}
1  5  w^0_{15}  w^1_{15}
2  5  w^0_{25}  w^1_{25}
3  5  w^0_{35}  w^1_{35}
4  6  w^0_{46}  w^1_{46}
5  6  w^0_{56}  w^1_{56}

".unt" file:
-1  0  X1  110
-1  0  X2  110
-1  0  X3  110
 0  2  A1   45
 0  2  A2   45
 1  2  Y1   95

".par" file:
0.001  0  1  -0.25  25

Table 18: The form of the files that characterize the neural network in figure 23.
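
To make the layout of the weight and unit files concrete, the following Python sketch reads both and evaluates the network of figure 23. It is an illustration under several assumptions, not part of the XploRe macro: the files are taken to be whitespace-separated text, units are numbered by their row in the ".unt" file, every connection runs from a lower to a higher unit number, the jump function switches at 0, and the numeric weights are invented because Table 18 only gives symbols. All function names are hypothetical.

```python
import math

# Invented file contents in the layout of Table 18 (weights are made up).
WEI = """\
1 4  0.5  0.4
2 4 -0.3 -0.2
3 4  0.8  0.7
1 5 -0.6 -0.5
2 5  0.1  0.2
3 5  0.4  0.3
4 6  1.2  1.1
5 6 -0.9 -0.8
"""
UNT = """\
-1 0 X1 110
-1 0 X2 110
-1 0 X3 110
0 2 A1 45
0 2 A2 45
1 2 Y1 95
"""

ACTIVATIONS = {
    0: lambda s: s,                            # identity
    1: lambda s: 1.0 if s > 0 else 0.0,        # jump function (assumed at 0)
    2: lambda s: 1.0 / (1.0 + math.exp(-s)),   # logistic
}


def parse_units(text):
    """Map unit number (row position, assumed) to its specification."""
    units = {}
    for number, line in enumerate(text.splitlines(), start=1):
        layer, act, label, symbol = line.split()
        units[number] = {"layer": int(layer), "act": int(act),
                         "label": label, "symbol": int(symbol)}
    return units


def parse_weights(text, column=2):
    """Return (from, to, weight); column 2 = best training, 3 = best test."""
    weights = []
    for line in text.splitlines():
        fields = line.split()
        weights.append((int(fields[0]), int(fields[1]), float(fields[column])))
    return weights


def forward(units, weights, inputs):
    """Propagate the inputs through the network and return the outputs."""
    values = {}
    input_ids = [n for n in sorted(units) if units[n]["layer"] == -1]
    for n, x in zip(input_ids, inputs):
        values[n] = ACTIVATIONS[units[n]["act"]](x)
    for n in sorted(units):  # relies on the assumed numbering order
        if units[n]["layer"] == -1:
            continue
        s = sum(w * values[src] for src, dst, w in weights if dst == n)
        values[n] = ACTIVATIONS[units[n]["act"]](s)
    return [values[n] for n in sorted(units) if units[n]["layer"] == 1]


units = parse_units(UNT)
print(forward(units, parse_weights(WEI), [1.0, 0.0, -1.0]))
```

Passing column=3 to parse_weights evaluates the best test network instead of the best training network.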

References

Bianchi, M. (1995). Bandwidth selection in density estimation, in W. Härdle, S. Klinke & B. Turlach (eds), XploRe - an interactive statistical computing environment, Springer, pp. 169–194.
Bishop, C. (1995). Neural Networks for Pattern Recognition, Clarendon Press, Oxford.
Bridle, J. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationship to statistical pattern recognition, in F. F. Soulié & J. Hérault (eds), Neurocomputing: Algorithms, Architectures and Applications, Springer-Verlag, New York, pp. 227–236.
Cox, T. & Cox, M. (1994). Multidimensional scaling, Chapman and Hall.
Efron, B. & Tibshirani, R. (1993). An introduction to the bootstrap, Monographs on Statistics and Applied Probability, Chapman & Hall.
Fahrmeier, L. & Hamerle, A. (1981). Multivariate statistische Verfahren, de Gruyter.
Fahrmeier, L. & Tutz, G. (1994). Multivariate statistical modelling based on Generalized Linear Models, Springer.
Grassmann, J. (1996). Statistical classification methods for protein fold class prediction, in A. Prat (ed.), COMPSTAT. Proceedings in Computational Statistics. 12th Symposium held in Barcelona, Spain, Physica-Verlag, Heidelberg, pp. 277–282.
Härdle, W., Klinke, S. & Turlach, B. (1995). XploRe - an interactive statistical computing environment, Springer.
Haussler, W. (1981). Über Verfahren der Punktebewertung und Diskrimination mit Anwendung auf Kreditscoringsysteme, PhD thesis, Institut für Rechentechnik, TU Braunschweig.
Klinke, S. (1995). Data Structures in Computational Statistics, PhD thesis, Institute of Statistics, Catholic University of Louvain.
Kruskal, J. (1964a). Multidimensional scaling by optimizing goodness-of-fit to nonmetric hypothesis, Psychometrika 29: 1–27.
Kruskal, J. (1964b). Nonmetric multidimensional scaling: a numerical method, Psychometrika 29: 115–129.
Reczko, M., Bohr, H., Subramaniam, S., Pamidighantam, S. & Hatzigeorgiou, A. (1994). Fold class prediction by neural networks, in H. Bohr & S. Brunak (eds), Protein Structure by Distance Analysis, IOS Press, Amsterdam, pp. 277–286.
Richard, M. & Lippmann, R. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities, Neural Computation 3: 461–483.
Schumacher, M., Rossner, R. & Vach, W. (1996). Neural networks and logistic regression: Part I, Computational Statistics & Data Analysis 21(6): 661–682.
Stone, M. (1974). Cross-validation choice and assessment of statistical predictions, Journal of the Royal Statistical Society pp. 111–147.
Turlach, B. (1994). Computer-aided Additive Modeling, PhD thesis, Institute of Statistics, Catholic University of Louvain.
Vach, W., Rossner, R. & Schumacher, M. (1996). Neural networks and logistic regression: Part II, Computational Statistics & Data Analysis 21(6): 683–701.