
Introduction

What is an artificial neural network? An artificial neural network is a system based on the operation of biological neural networks; in other words, it is an emulation of a biological neural system. Why would the implementation of artificial neural networks be necessary? Although computing these days is truly advanced, there are certain tasks that a program made for a common microprocessor is unable to perform. A software implementation of a neural network can be made, with its own advantages and disadvantages. Advantages:

- A neural network can perform tasks that a linear program cannot.
- When an element of the neural network fails, the network can continue without any problem thanks to its parallel nature.
- A neural network learns and does not need to be reprogrammed.
- It can be implemented in any application, without any problem.

Disadvantages:

- The neural network needs training to operate.
- The architecture of a neural network is different from the architecture of microprocessors, and therefore needs to be emulated.
- Large neural networks require high processing time.

Another aspect of artificial neural networks is that there are different architectures, which consequently require different types of algorithms; but despite being an apparently complex system, a neural network is relatively simple. Artificial neural networks are among the newest signal processing technologies nowadays. The field of work is very interdisciplinary, but the explanation I will give here is restricted to an engineering perspective. In the world of engineering, neural networks have two main functions: as pattern classifiers and as nonlinear adaptive filters. Like its biological predecessor, an artificial neural network is an adaptive system. By adaptive, we mean that each parameter is changed during its operation while it is deployed to solve the problem at hand. This is called the training phase. An artificial neural network is developed with a systematic step-by-step procedure that optimizes a criterion commonly known as the learning rule. The input/output training data are fundamental for these networks, as they convey the information necessary to discover the optimal operating point. In addition, their nonlinear nature makes neural network processing elements very flexible systems.

Basically, an artificial neural network is a system. A system is a structure that receives an input, processes the data, and provides an output. Commonly, the input consists of a data array, which can be anything: data from an image file, a WAVE sound, or any kind of data that can be represented in an array. Once an input is presented to the neural network and a corresponding desired or target response is set at the output, an error is computed from the difference between the desired response and the real system output. The error information is fed back to the system, which adjusts its parameters in a systematic fashion (commonly according to the learning rule). This process is repeated until the output is acceptable. It is important to notice that the performance hinges heavily on the data; hence, the data should be pre-processed with third-party algorithms such as DSP algorithms. In neural network design, the engineer or designer chooses the network topology, the trigger (or performance) function, the learning rule, and the criteria for stopping the training phase. It is thus pretty difficult to determine the size and parameters of the network, as there is no rule or formula for doing it. The best we can do to succeed with our design is to experiment with it. The problem with this method is that when the system does not work properly, it is hard to refine the solution. Despite this issue, neural-network-based solutions are very efficient in terms of development time and resources. By experience, I can tell that artificial neural networks provide real solutions that are difficult to match with other technologies. Fifteen years ago, Denker said that artificial neural networks are the second best way to implement a solution, motivated by their simplicity, design and universality. Nowadays, neural network technologies are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification and control.

The Biological Model

Artificial neural networks were born after McCulloch and Pitts introduced a set of simplified neurons in 1943. These neurons were represented as models of biological networks, turned into conceptual components for circuits that could perform computational tasks. The basic model of the artificial neuron is founded upon the functionality of the biological neuron. By definition, neurons are basic signaling units of the nervous system of a living being, in which each neuron is a discrete cell whose several processes arise from its cell body.

The biological neuron has four main regions to its structure. The cell body, or soma, has two offshoots from it: the dendrites, and the axon, which ends in pre-synaptic terminals. The cell body is the heart of the cell. It contains the nucleolus and maintains protein synthesis. A neuron has many dendrites, which look like a tree structure and receive signals from other neurons. A single neuron usually has one axon, which expands off from a part of the cell body called the axon hillock. The axon's main purpose is to conduct electrical signals generated at the axon hillock down its length. These signals are called action potentials. The other end of the axon may split into several branches, which end in pre-synaptic terminals. The electrical signals (action potentials) that neurons use to convey the information of the brain are all identical. The brain determines which type of information is being received based on the path of the signal. The brain analyzes all patterns of signals sent, and from that information it interprets the type of information received. The myelin is a fatty tissue that insulates the axon. The non-insulated parts of the axon are called Nodes of Ranvier. At these nodes, the signal traveling down the axon is regenerated. This ensures that the signal traveling down the axon is fast and constant. The synapse is the area of contact between two neurons. The neurons do not physically touch, because they are separated by a cleft; the signals are sent through chemical interaction. The neuron sending the signal is called the pre-synaptic cell, and the neuron receiving the signal is called the postsynaptic cell. The electrical signals are generated by the membrane potential, which is based on the differences in concentration of sodium and potassium ions inside and outside the cell membrane.

Biological neurons can be classified by their function or by the quantity of processes they carry out. When classified by processes, they fall into three categories: unipolar neurons, bipolar neurons and multipolar neurons. Unipolar neurons have a single process; their dendrites and axon are located on the same stem. These neurons are found in invertebrates. Bipolar neurons have two processes; their dendrites and axon form two separate processes. Multipolar neurons are commonly found in mammals; some examples of these neurons are spinal motor neurons, pyramidal cells and Purkinje cells. When biological neurons are classified by function, they also fall into three categories. The first group is the sensory neurons; these neurons provide all information for perception and motor coordination. The second group provides information to muscles and glands; these are called motor neurons. The last group, the interneurons, contains all other neurons and has two subclasses. One group, called relay or projection interneurons, are usually found in the brain and connect different parts of it. The other group, called local interneurons, are only used in local circuits.

The Mathematical Model


When modeling an artificial functional model of the biological neuron, we must take into account three basic components. First, the synapses of the biological neuron are modeled as weights. Let's remember that the synapse of the biological neuron is what interconnects the neural network and gives the strength of the connection. For an artificial neuron, the weight is a number that represents the synapse. A negative weight reflects an inhibitory connection, while positive values designate excitatory connections. The next components of the model represent the actual activity of the neuron cell: all inputs are summed together and modified by the weights. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or between -1 and 1. Mathematically, this process is described in the figure below.

From this model, the internal activity of the neuron can be shown to be:

vk = Σj wkj xj

where the xj are the inputs and the wkj are the corresponding synaptic weights. The output of the neuron, yk, would therefore be the outcome of some activation function applied to the value of vk, i.e. yk = φ(vk).

Activation functions
As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron in a neural network is between certain values (usually 0 and 1, or -1 and 1). In general, there are three types of activation functions, denoted by φ(·). First, there is the Threshold Function, which takes on a value of 0 if the summed input v is less than a certain threshold value, and the value 1 if the summed input is greater than or equal to the threshold value.

Secondly, there is the Piecewise-Linear function. This function can again take on the values of 0 or 1, but it can also take on values in between, depending on the amplification factor in a certain region of linear operation.

Thirdly, there is the sigmoid function. This function can range between 0 and 1, but it is also sometimes useful to use the -1 to 1 range. An example of the sigmoid function is the hyperbolic tangent function.

The artificial neural networks which we describe are all variations on the parallel distributed processing (PDP) idea. The architecture of each neural network is based on very similar building blocks which perform the processing. In this chapter we first discuss these processing units and the different neural network topologies, and then learning strategies as a basis for an adaptive system.

A framework for distributed representation


An artificial neural network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections. A set of major aspects of a parallel distributed model can be distinguished:

- a set of processing units ('neurons,' 'cells');
- a state of activation yk for every unit, which is equivalent to the output of the unit;
- connections between the units; generally each connection is defined by a weight wjk which determines the effect which the signal of unit j has on unit k;
- a propagation rule, which determines the effective input sk of a unit from its external inputs;
- an activation function Fk, which determines the new level of activation based on the effective input sk(t) and the current activation yk(t) (i.e., the update);
- an external input (aka bias, offset) θk for each unit;
- a method for information gathering (the learning rule);
- an environment within which the system must operate, providing input signals and, if necessary, error signals.

Processing units
Each unit performs a relatively simple job: receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time. Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i) which receive data from outside the neural network, output units (indicated by an index o) which send data out of the neural network, and hidden units (indicated by an index h) whose input and output signals remain within the neural network. During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.

Neural Network topologies


In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data. As for this pattern of connections, the main distinction we can make is between:

- Feed-forward neural networks, where the data flow from input to output units is strictly feed-forward. The data processing can extend over multiple (layers of) units, but no feedback connections are present; that is, there are no connections extending from outputs of units to inputs of units in the same layer or previous layers.
- Recurrent neural networks, which do contain feedback connections. Contrary to feed-forward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the neural network evolves to a stable state in which these activations do not change anymore. In other applications, the changes of the activation values of the output neurons are significant, such that the dynamical behaviour constitutes the output of the neural network (Pearlmutter, 1990).

Classical examples of feed-forward neural networks are the Perceptron and the Adaline. Examples of recurrent networks have been presented by Anderson (Anderson, 1977), Kohonen (Kohonen, 1977), and Hopfield (Hopfield, 1982).
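As a concrete illustration of the feed-forward case, the following sketch (Python with NumPy; the layer sizes, random weights, and sigmoid activation are hypothetical choices, not prescribed by the text) propagates an input strictly from the input units to the output units:

    import numpy as np

    def sigmoid(s):
        # squashing activation: output lies between 0 and 1
        return 1.0 / (1.0 + np.exp(-s))

    def feedforward(x, weights, biases):
        # data flow strictly from input to output: no feedback connections
        y = x
        for W, b in zip(weights, biases):
            y = sigmoid(W @ y + b)   # propagation rule: weighted sum, then activation
        return y

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 input units -> 3 hidden units
    W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden units -> 1 output unit
    print(feedforward(np.array([0.5, -1.0]), [W1, W2], [b1, b2]))

A recurrent network would differ in that the output of a unit could feed back, directly or indirectly, into its own input, so the loop above would have to be iterated over time.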

Training of artificial neural networks


A neural network has to be configured such that the application of a set of inputs produces (either 'direct' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. We can categorise the learning situations in two distinct sorts. These are:

Supervised learning or Associative learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised).

Unsupervised learning or Self-organisation, in which an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.

Reinforcement Learning, which may be considered an intermediate form of the above two types of learning. Here the learning machine performs some action on the environment and gets a feedback response from the environment. The learning system grades its action as good (rewarding) or bad (punishable) based on the environmental response, and adjusts its parameters accordingly. Generally, parameter adjustment is continued until an equilibrium state is reached, following which there are no more changes in the parameters. Self-organizing neural learning may be categorized under this type of learning.

Modifying patterns of connectivity of Neural Networks


Both learning paradigms, supervised and unsupervised learning, result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (1949) (Hebb, 1949). The basic idea is that if two units j and k are active simultaneously, their interconnection must be strengthened. If j receives input from k, the simplest version of Hebbian learning prescribes modifying the weight wjk with

Δwjk = γ yj yk

where γ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit k but the difference between the actual and desired activation for adjusting the weights:

Δwjk = γ yj (dk - yk)

in which dk is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter. Many variants (often very exotic ones) have been published in the last few years. Neural networks were started about 50 years ago. Their early abilities were exaggerated, casting doubts on the field as a whole. There is a recent renewed interest in the field, however, because of new techniques and a better theoretical understanding of their capabilities.
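As a sketch of these two rules (Python; the learning rate γ = 0.1 and the activation values are hypothetical numbers chosen only for illustration):

    gamma = 0.1   # the positive constant of proportionality (learning rate)

    def hebb_update(w_jk, y_j, y_k):
        # Hebbian rule: strengthen w_jk when units j and k are active together
        return w_jk + gamma * y_j * y_k

    def delta_update(w_jk, y_j, y_k, d_k):
        # Widrow-Hoff (delta) rule: move toward the desired activation d_k
        return w_jk + gamma * y_j * (d_k - y_k)

    w = 0.0
    w = hebb_update(w, y_j=1.0, y_k=0.8)             # both active -> w grows
    w = delta_update(w, y_j=1.0, y_k=0.8, d_k=1.0)   # small error -> small step
    print(w)   # approximately 0.1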

Motivation for neural networks:


- Scientists are challenged to use machines more effectively for tasks currently solved by humans.
- Symbolic rules don't reflect the processes actually used by humans.
- Traditional computing excels in many areas, but not in others.

Types of Applications
Machine learning:

Having a computer program itself from a set of examples, so you don't have to program it yourself. This will be a strong focus of this course: neural networks that learn from a set of examples.

- Optimization: given a set of constraints and a cost function, how do you find an optimal solution? E.g. the traveling salesman problem.
- Classification: grouping patterns into classes, e.g. handwritten characters into letters.
- Associative memory: recalling a memory based on a partial match.
- Regression: function mapping.

Cognitive science:

- Modelling higher level reasoning:
  o language
  o problem solving
- Modelling lower level reasoning:
  o vision
  o audition / speech recognition
  o speech generation

Neurobiology: modelling how the brain works, at different levels:

- neuron-level
- higher levels: vision, hearing, etc. Overlaps with the cognitive folks.

Mathematics:

Nonparametric statistical analysis and regression.

Philosophy:

Can human souls/behavior be explained in terms of symbols, or does it require something lower level, like a neurally based model?

Where are neural networks being used?


- Signal processing: suppressing line noise with adaptive echo canceling, blind source separation.
- Control: e.g. backing up a truck (cab position, rear position, and match with the dock are converted to steering instructions); controlling automated machines in manufacturing plants. Siemens successfully uses neural networks for process automation in basic industries; e.g., in rolling mill control, more than 100 neural networks do their job, 24 hours a day.
- Robotics: navigation, vision recognition.
- Pattern recognition, i.e. recognizing handwritten characters; e.g. the current version of Apple's Newton uses a neural net.
- Medicine, i.e. storing medical records based on case information.
- Speech production: reading text aloud (NETtalk).
- Speech recognition.
- Vision: face recognition, edge detection, visual search engines.
- Business: e.g. rules for mortgage decisions are extracted from past decisions made by experienced evaluators, resulting in a network that has a high level of agreement with human experts.
- Financial applications: time series analysis, stock market prediction.
- Data compression: speech signal, image (e.g. faces).
- Game playing: backgammon, chess, go, ...

Neural Networks in the Brain


The brain is not homogeneous. At the largest anatomical scale, we distinguish cortex, midbrain, brainstem, and cerebellum. Each of these can be hierarchically subdivided into many regions, and areas within each region, either according to the anatomical structure of the neural networks within it, or according to the function performed by them. The overall pattern of projections (bundles of neural connections) between areas is extremely complex, and only partially known. The best mapped (and largest) system in the human brain is the visual system, where the first 10 or 11 processing stages have been identified. We distinguish feedforward projections that go from earlier processing stages (near the sensory input) to later ones (near the motor output), from feedback connections that go in the opposite direction. In addition to these long-range connections, neurons also link up with many thousands of their neighbours. In this way they form very dense, complex local networks.

Artificial Neuron Models


Computational neurobiologists have constructed very elaborate computer models of neurons in order to run detailed simulations of particular circuits in the brain. As computer scientists, we are more interested in the general properties of neural networks, independent of how they are actually "implemented" in the brain. This means that we can use much simpler, abstract "neurons", which (hopefully) capture the essence of neural computation even if they leave out much of the details of how biological neurons work. People have implemented model neurons in hardware as electronic circuits, often integrated on VLSI chips. Remember though that computers run much faster than brains - we can therefore run fairly large networks of simple model neurons as software simulations in reasonable time. This has obvious advantages over having to use special "neural" computer hardware.

A Simple Artificial Neuron


Our basic computational element (model neuron) is often called a node or unit. It receives input from some other units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to model synaptic learning. The unit computes some function f of the weighted sum of its inputs:

yi = f(neti) = f(Σj wij yj)

Its output, in turn, can serve as input to other units.

The weighted sum Σj wij yj is called the net input to unit i, often written neti. Note that wij refers to the weight from unit j to unit i (not the other way around). The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit.
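A minimal sketch of such a unit (Python with NumPy; the weight and input values are hypothetical):

    import numpy as np

    def unit_output(w_i, y, f=lambda net: net):
        # net input to unit i: the weighted sum of incoming activations
        net_i = np.dot(w_i, y)
        return f(net_i)   # with the identity f, this is a linear unit

    w_i = np.array([0.2, -0.5, 0.1])   # weights w_ij from units j to unit i
    y = np.array([1.0, 0.3, -2.0])     # activations of the sending units j
    print(unit_output(w_i, y))             # linear unit: output = net input
    print(unit_output(w_i, y, f=np.tanh))  # same unit with a squashing f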

Neurons and Synapses


The basic computational unit in the nervous system is the nerve cell, or neuron. A neuron has:

- Dendrites (inputs)
- Cell body
- Axon (output)

A neuron receives input from other neurons (typically many thousands). Inputs sum (approximately). Once the input exceeds a critical level, the neuron discharges a spike - an electrical pulse that travels from the body, down the axon, to the next neuron(s) (or other receptors). This spiking event is also called depolarization, and is followed by a refractory period, during which the neuron is unable to fire. The axon endings (output zone) almost touch the dendrites or cell body of the next neuron. Transmission of an electrical signal from one neuron to the next is effected by neurotransmitters, chemicals which are released from the first neuron and which bind to receptors in the second. This link is called a synapse. The extent to which the signal from one neuron is passed on to the next depends on many factors, e.g. the amount of neurotransmitter available, the number and arrangement of receptors, the amount of neurotransmitter reabsorbed, etc.

Synaptic Learning
Brains learn. Of course. From what we know of neuronal structures, one way brains learn is by altering the strengths of connections between neurons, and by adding or deleting connections between neurons. Furthermore, they learn "on-line", based on experience, and typically without the benefit of a benevolent teacher. The efficacy of a synapse can change as a result of experience, providing both memory and learning through long-term potentiation. One way this happens is through release of more neurotransmitter. Many other changes may also be involved.
Long-term Potentiation: An enduring (>1 hour) increase in synaptic efficacy that results from high-frequency stimulation of an afferent (input) pathway

Hebb's Postulate:
"When an axon of cell A ... excite[s] cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency as one of the cells firing B is increased."

Bliss and Lømo discovered LTP in the hippocampus in 1973.

Points to note about LTP:


- Synapses become more or less important over time (plasticity)
- LTP is based on experience
- LTP is based only on local information (Hebb's postulate)

Summary
The following properties of nervous systems will be of particular interest in our neurally-inspired models:

- parallel, distributed information processing
- high degree of connectivity among basic units
- connections are modifiable based on experience
- learning is a constant process, and usually unsupervised
- learning is based only on local information
- performance degrades gracefully if some units are removed

Examples of Activation Functions


Identity Function
The identity function is given by f(x) = x.

> f1 := proc(x) x; end;
> plot(f1(x),x=-10..10);

Step Function
The step function is f(x) = 0 if x < 0 and f(x) = 1 if x >= 0. This is also called the Heaviside function. Another common variation is for it to take on the values -1 and +1, as shown below.

> f2 := proc(x) 2*Heaviside(x) - 1 end;
> plot(f2(x),x=-10..10);

Logistic Function (Sigmoid)

The logistic function has the form f(x) = 1/(1 + exp(-a x)).

> f3 := proc(x,a) 1/(1 + exp(-a*x)) end;
> plot(f3(x,1),x=-10..10);
> with(plots):

The parameter "a" in the logistic function determines how steep it is. The larger "a", the steeper it is.
> display([plot(f3(x,1),x=-10..10), plot(f3(x,2),x=-10..10, color = blue), plot(f3(x,.5),x=-10..10, color = green)]);

Symmetric Sigmoid
The symmetric sigmoid is simply the sigmoid stretched so that its y range is 2 and then shifted down by 1, so that it ranges between -1 and 1. If g(x) is the standard sigmoid, then the symmetric sigmoid is f(x) = 2 g(x) - 1.

> f4 := proc(x) 2*f3(x,1)-1 end;
> plot(f4(x),x=-10..10);

The symmetric sigmoid is the hyperbolic tangent with its argument scaled by a constant factor: 2g(x) - 1 = tanh(x/2). As you can see, the graph below is identical to the graph for the symmetric sigmoid.

> plot(tanh(.5*x),x=-10..10);

Radial Basis Functions

A radial basis function is simply a Gaussian, f(x) = exp(-a x²). It is called local because, unlike the previous functions, it is essentially zero everywhere except in a small region.

> f5 := proc(x,a) exp(-a * x^2); end;
> plot(f5(x,1),x=-10..10);

Derivatives

The derivative of the identity function is just 1. That is, if f(x) is the identity, then f'(x) = 1. The derivative of the step function is not defined, which is exactly why it isn't used. The nice feature of sigmoids is that their derivatives are easy to compute. If f(x) is the logistic function above, then f'(x) = a f(x) (1 - f(x)), as the following Maple commands confirm:

> diff(f3(x,a),x);
> simplify(f3(x,a)*(1-f3(x,a)));

This is also true of the hyperbolic tangent: if f(x) = tanh(x), then f'(x) = 1 - tanh(x)² = 1 - f(x)².

> diff(tanh(x),x);


Linear Regression

Fitting a Model to Data


Consider the data below (for more complete auto data, see data description, raw data, and maple plots):

(Fig. 1) Each dot in the figure provides information about the weight (x-axis, units: U.S. pounds) and fuel consumption (y-axis, units: miles per gallon) for one of 74 cars (data from 1979). Clearly weight and fuel consumption are linked, so that, in general, heavier cars use more fuel. Now suppose we are given the weight of a 75th car, and asked to predict how much fuel it will use, based on the above data. Such questions can be answered by using a model - a short mathematical description - of the data (see also optical illusions). The simplest useful model here is of the form

y = w1 x + w0    (1)

This is a linear model: in an xy-plot, equation 1 describes a straight line with slope w1 and intercept w0 with the y-axis, as shown in Fig. 2. (Note that we have rescaled the coordinate axes - this does not change the problem in any fundamental way.) How do we choose the two parameters w0 and w1 of our model? Clearly, any straight line drawn somehow through the data could be used as a predictor, but some lines will do a better job than others. The line in Fig. 2 is certainly not a good model: for most cars, it will predict too much fuel consumption for a given weight.

(Fig. 2)

The Loss Function


In order to make precise what we mean by being a "good predictor", we define a loss (also called objective or error) function E over the model parameters. A popular choice for E is the sum-squared error:

E = Σi (ti - yi)²    (2)

In words, it is the sum over all points i in our data set of the squared difference between the target value ti (here: actual fuel consumption) and the model's prediction yi, calculated from the input value xi (here: weight of the car) by equation 1. For a linear model, the sum-squared error is a quadratic function of the model parameters. Figure 3 shows E for a range of values of w0 and w1. Figure 4 shows the same function as a contour plot.

(Fig. 3)

(Fig. 4)

Minimizing the Loss


The loss function E provides us with an objective measure of predictive error for a specific choice of model parameters. We can thus restate our goal of finding the best (linear) model as finding the values for the model parameters that minimize E.

For linear models, linear regression provides a direct way to compute these optimal model parameters. (See any statistics textbook for details.) However, this analytical approach does not generalize to nonlinear models (which we will get to by the end of this lecture). Even though the solution cannot be calculated explicitly in that case, the problem can still be solved by an iterative numerical technique called gradient descent. It works as follows:
1. Choose some (random) initial values for the model parameters.
2. Calculate the gradient G of the error function with respect to each model parameter.
3. Change the model parameters so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e., in the direction of -G.
4. Repeat steps 2 and 3 until G gets close to zero.

How does this work? The gradient of E gives us the direction in which the loss function at the current setting of the w has the steepest slope. In order to decrease E, we take a small step in the opposite direction, -G (Fig. 5).

(Fig. 5) By repeating this over and over, we move "downhill" in E until we reach a minimum, where G = 0, so that no further progress is possible (Fig. 6).

(Fig. 6) Fig. 7 shows the best linear model for our car data, found by this procedure.

(Fig. 7)
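To make the procedure concrete, here is a minimal sketch of gradient descent on the linear model of equation 1 (Python with NumPy; the five data points, learning rate and step count are hypothetical stand-ins for the real car data):

    import numpy as np

    # hypothetical (weight, mpg) pairs standing in for the 74-car data set
    x = np.array([2.0, 2.5, 3.0, 3.5, 4.0])      # car weight (1000s of pounds)
    t = np.array([30.0, 26.0, 22.0, 19.0, 16.0]) # fuel consumption (mpg)

    w1, w0 = 0.0, 0.0   # step 1: (arbitrary) initial parameter values
    eta = 0.01          # size of the "short distance" moved per step

    for step in range(5000):
        y = w1 * x + w0                    # model prediction, equation 1
        g1 = -2.0 * np.sum((t - y) * x)    # step 2: gradient of E = sum (t-y)^2
        g0 = -2.0 * np.sum(t - y)
        w1 -= eta * g1                     # step 3: move in the direction of -G
        w0 -= eta * g0

    print(w0, w1)   # step 4: after enough steps, G is near zero at the minimum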

It's a neural network!


Our linear model of equation 1 can in fact be implemented by the simple neural network shown in Fig. 8. It consists of a bias unit, an input unit, and a linear output unit. The input unit makes external input x (here: the weight of a car) available to the network, while the bias unit always has a constant output of 1. The output unit computes the sum:

y2 = y1 w21 + 1.0 w20    (3)

It is easy to see that this is equivalent to equation 1, with w21 implementing the slope of the straight line, and w20 its intercept with the y-axis.

Linear Neural Networks

Multiple regression
Our car example showed how we could discover an optimal linear function for predicting one variable (fuel consumption) from one other (weight). Suppose now that we are also given one or more additional variables which could be useful as predictors. Our simple neural network model can easily be extended to this case by adding more input units (Fig. 1). Similarly, we may want to predict more than one variable from the data that we're given. This can easily be accommodated by adding more output units (Fig. 2). The loss function for a network with multiple outputs is obtained simply by adding the loss for each output unit together. The network now has a typical layered structure: a layer of input units (and the bias), connected by a layer of weights to a layer of output units.

(Fig. 1)

(Fig. 2)

Computing the gradient


In order to train neural networks such as the ones shown above by gradient descent, we need to be able to compute the gradient G of the loss function with respect to each weight wij of the network. It tells us how a small change in that weight will affect the overall error E. We begin by splitting the loss function into separate terms for each point p in the training data:

E = Σp E^p   with   E^p = (1/2) Σo (to^p - yo^p)²    (1)

where o ranges over the output units of the network, and the factor 1/2 is included merely to simplify the derivatives below; it does not change where the minimum lies. (Note that we use the superscript p to denote the training point - this is not an exponentiation!) Since differentiation and summation are interchangeable, we can likewise split the gradient into separate components for each training point:

∂E/∂wij = Σp ∂E^p/∂wij    (2)

In what follows, we describe the computation of the gradient for a single data point, omitting the superscript p in order to make the notation easier to follow.

First use the chain rule to decompose the gradient into two factors:

∂E/∂wij = (∂E/∂yi) (∂yi/∂wij)    (3)

The first factor can be obtained by differentiating Eqn. 1 above:

∂E/∂yi = -(ti - yi)    (4)

Using yi = Σj wij yj, the second factor becomes

∂yi/∂wij = yj    (5)

Putting the pieces (equations 3-5) back together, we obtain

∂E/∂wij = -(ti - yi) yj    (6)

To find the gradient G for the entire data set, we sum, for each weight, the contribution given by equation 6 over all the data points. We can then subtract a small proportion η (called the learning rate) of G from the weights to perform gradient descent.

The Gradient Descent Algorithm


1. Initialize all weights to small random values.
2. REPEAT until done:
   1. For each weight wij set Δwij := 0.
   2. For each data point (x, t)^p:
      1. set the input units to x;
      2. compute the value of the output units;
      3. for each weight wij set Δwij := Δwij + (ti - yi) yj.
   3. For each weight wij set wij := wij + η Δwij.

The algorithm terminates once we are at, or sufficiently near to, the minimum of the error function, where G = 0. We say then that the algorithm has converged. In summary:

                        general case                linear network
Training data           (x, t)                      (x, t)
Model parameters        w                           w
Model                   y = g(w, x)                 yi = Σj wij yj
Error function          E(y, t)                     E = (1/2) Σi (ti - yi)²
Gradient w.r.t. wij     ∂E/∂wij                     -(ti - yi) yj
Weight update rule      Δwij = -η ∂E/∂wij           Δwij = η (ti - yi) yj

The Learning Rate


An important consideration is the learning rate η, which determines by how much we change the weights w at each step. If η is too small, the algorithm will take a long time to converge (Fig. 3).

(Fig. 3) Conversely, if η is too large, we may end up bouncing around the error surface out of control - the algorithm diverges (Fig. 4). This usually ends with an overflow error in the computer's floating-point arithmetic.

(Fig. 4)

Batch vs. Online Learning


Above we have accumulated the gradient contributions for all data points in the training set before updating the weights. This method is often referred to as batch learning. An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.

(Fig. 5) Online learning has a number of advantages:


- it is often much faster, especially when the training set is redundant (contains many similar data points);
- it can be used when there is no fixed training set (new data keep coming in);
- it is better at tracking nonstationary environments (where the best model gradually changes over time);
- the noise in the gradient can help to escape from local minima (which are a problem for gradient descent in nonlinear models).

These advantages are, however, bought at a price: many powerful optimization techniques (such as: conjugate and second-order gradient methods, support vector machines, Bayesian methods, etc.) - which we will not talk about in this course! - are batch methods that cannot be used online. (Of course this also means that in order to implement batch learning really well, one has to learn an awful lot about these rather complicated methods!) A compromise between batch and online learning is the use of "mini-batches": the weights are updated after every n data points, where n is greater than 1 but smaller than the training set size.
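The two regimes differ only in where the weight update happens, as this sketch shows (Python with NumPy, reusing the delta-rule update Δw = η(t - y)y from the earlier summary table; the data, learning rate and epoch counts are hypothetical):

    import numpy as np

    x = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
    t = np.array([30.0, 26.0, 22.0, 19.0, 16.0])
    eta = 0.01

    def batch_learning(w1, w0, epochs=2000):
        for _ in range(epochs):
            y = w1 * x + w0
            # accumulate the gradient over ALL points, then update once
            w1 += eta * np.sum((t - y) * x)
            w0 += eta * np.sum(t - y)
        return w1, w0

    def online_learning(w1, w0, epochs=2000):
        for _ in range(epochs):
            for xi, ti in zip(x, t):   # update IMMEDIATELY after each point
                y = w1 * xi + w0       # a noisy approximation of the full G
                w1 += eta * (ti - y) * xi
                w0 += eta * (ti - y)
        return w1, w0

    print(batch_learning(0.0, 0.0))
    print(online_learning(0.0, 0.0))

A mini-batch version would simply accumulate the per-point updates over n points at a time before applying them.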

Multi-layer networks

A nonlinear problem
Consider again the best linear fit we found for the car data. Notice that the data points are not evenly distributed around the line: for low weights, we see more miles per gallon than our model predicts. In fact, it looks as if a simple curve might fit these data better than the straight line. We can enable our neural network to do such curve fitting by giving it an additional node which has a suitably curved (nonlinear) activation function. A useful function for this purpose is the S-shaped hyperbolic tangent (tanh) function (Fig. 1).

(Fig. 1)

(Fig. 2)

Fig. 2 shows our new network: an extra node (unit 2) with a tanh activation function has been inserted between input and output. Since such a node is "hidden" inside the network, it is commonly called a hidden unit. Note that the hidden unit also has a weight from the bias unit. In general, all non-input neural network units have such a bias weight. For simplicity, the bias unit and weights are usually omitted from neural network diagrams - unless it's explicitly stated otherwise, you should always assume that they are there.

(Fig. 3)

When this network is trained by gradient descent on the car data, it learns to fit the tanh function to the data (Fig. 3). Each of the four weights in the network plays a particular role in this process: the two bias weights shift the tanh function in the x- and y-direction, respectively, while the other two weights scale it along those two directions. Fig. 2 gives the weight values that produced the solution shown in Fig. 3.

Hidden Layers
One can argue that in the example above we have cheated by picking a hidden unit activation function that could fit the data well. What would we do if the data looks like this (Fig. 4)?

(Fig. 4)

(Relative concentration of NO and NO2 in exhaust fumes as a function of the richness of the ethanol/air mixture burned in a car engine.) Obviously the tanh function can't fit this data at all. We could cook up a special activation function for each data set we encounter, but that would defeat our purpose of learning to model the data. We would like to have a general, non-linear function approximation method which would allow us to fit any given data set, no matter what it looks like.

(Fig. 5)

Fortunately there is a very simple solution: add more hidden units! In fact, a network with just two hidden units using the tanh function (Fig. 5) can fit the data in Fig. 4 quite well - can you see how? The fit can be further improved by adding yet more units to the hidden layer. Note, however, that having too large a hidden layer - or too many hidden layers - can degrade the network's performance (more on this later). In general, one shouldn't use more hidden units than necessary to solve a given problem. (One way to ensure this is to start training with a very small network. If gradient descent fails to find a satisfactory solution, grow the network by adding a hidden unit, and repeat.) Theoretical results indicate that given enough hidden units, a network like the one in Fig. 5 can approximate any reasonable function to any required degree of accuracy. In other words, any function can be expressed as a linear combination of tanh functions: tanh is a universal basis function. Many functions form a universal basis; the two classes of activation functions commonly used in neural networks are the sigmoidal (S-shaped) basis functions (to which tanh belongs), and the radial basis functions.
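For illustration, a network like the one in Fig. 5 computes a function of the form sketched below (Python; the particular weight values are hypothetical, not the ones from the figure):

    import numpy as np

    def two_tanh_net(x, w, v, b_hidden, b_out):
        # hidden layer: two tanh units, each with its own input weight and bias
        h = np.tanh(w * x + b_hidden)
        # output unit: a linear combination of the hidden activations plus a bias
        return np.dot(v, h) + b_out

    w = np.array([2.0, 2.0])          # input-to-hidden weights
    v = np.array([1.0, -1.0])         # hidden-to-output weights
    b_hidden = np.array([1.0, -1.0])  # shifts the two tanh curves apart
    for x in (-2.0, 0.0, 2.0):
        print(x, two_tanh_net(x, w, v, b_hidden, b_out=0.5))

With suitable weights, the two shifted and scaled tanh curves combine into a bump-like shape, which is how the network in Fig. 5 can fit data like that in Fig. 4.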

Error Backpropagation
We have already seen how to train linear networks by gradient descent. In trying to do the same for multi-layer networks we encounter a difficulty: we don't have any target values for the hidden units. This seems to be an insurmountable problem - how could we tell the hidden units just what to do? This unsolved question was in fact the reason why neural networks fell out of favor after an initial period of high popularity in the 1950s. It took 30 years before the error backpropagation (or in short: backprop) algorithm popularized a way to train hidden units, leading to a new wave of neural network research and applications.

(Fig. 1)

In principle, backprop provides a way to train networks with any number of hidden units arranged in any number of layers. (There are clear practical limits, which we will discuss later.) In fact, the network does not have to be organized in layers - any pattern of connectivity that permits a partial ordering of the nodes from input to output is allowed. In other words, there must be a way to order the units such that all connections go from "earlier" (closer to the input) to "later" ones (closer to the output). This is equivalent to stating that their connection pattern must not contain any cycles. Networks that respect this constraint are called feedforward networks; their connection pattern forms a directed acyclic graph or dag.

The Algorithm
We want to train a multi-layer feedforward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x,t). The vector x represents a pattern of input to the network, and the vector t the corresponding target (desired output). As we have seen before, the overall gradient with respect to the entire training set is just the sum of the gradients for each pattern; in what follows we will therefore describe how to compute the gradient for just a single training pattern. As before, we will number the units, and denote the weight from unit j to unit i by wij. 1. Definitions:
o the error signal for unit j:  δj = -∂E/∂netj
o the (negative) gradient for weight wij:  Δwij = -∂E/∂wij
o the set of nodes anterior to unit i:  Ai = { j : wij exists }
o the set of nodes posterior to unit j:  Pj = { i : wij exists }

2. The gradient. As we did for linear networks before, we expand the gradient into two factors by use of the chain rule:

   Δwij = -∂E/∂wij = -(∂E/∂neti) (∂neti/∂wij)

The first factor is the error of unit i. The second is

   ∂neti/∂wij = ∂(Σk∈Ai wik yk)/∂wij = yj

Putting the two together, we get

   Δwij = δi yj

To compute this gradient, we thus need to know the activity and the error for all the relevant nodes in the network.

3. Forward activation. The activity of the input units is determined by the network's external input x. For all other units, the activity is propagated forward:

   yi = fi( Σj∈Ai wij yj )

Note that before the activity of unit i can be calculated, the activity of all its anterior nodes (forming the set Ai) must be known. Since feedforward networks do not contain cycles, there is an ordering of nodes from input to output that respects this condition.
4. Calculating output error. Assuming that we are using the sum-squared loss

   E = (1/2) Σo (to - yo)²

the error for output unit o is simply

   δo = to - yo

5. Error backpropagation. For hidden units, we must propagate the error back from the output nodes (hence the name of the algorithm). Again using the chain rule, we can expand the error of a hidden unit in terms of its posterior nodes:

   δj = Σi∈Pj δi (∂neti/∂yj) (∂yj/∂netj)

Of the three factors inside the sum, the first is just the error of node i. The second is

   ∂neti/∂yj = wij

while the third is the derivative of node j's activation function:

   ∂yj/∂netj = fj'(netj)

For hidden units h that use the tanh activation function, we can make use of the special identity tanh(u)' = 1 - tanh(u)², giving us

   fh'(neth) = 1 - yh²

Putting all the pieces together we get

   δj = fj'(netj) Σi∈Pj δi wij

Note that in order to calculate the error for unit j, we must first know the error of all its posterior nodes (forming the set Pj). Again, as long as there are no cycles in the network, there is an ordering of nodes from the output back to the input that respects this condition. For example, we can simply use the reverse of the order in which activity was propagated forward.

Matrix Form
For layered feedforward networks that are fully connected - that is, each node in a given layer connects to every node in the next layer - it is often more convenient to write the backprop algorithm in matrix notation rather than the more general graph form given above. In this notation, the bias weights, net inputs, activations, and error signals for all units in a layer are combined into vectors, while all the non-bias weights from one layer to the next form a matrix W. Layers are numbered from 0 (the input layer) to L (the output layer). The backprop algorithm then looks as follows:

1. Initialize the input layer:

   y_0 = x

2. Propagate activity forward: for l = 1, 2, ..., L,

   y_l = f_l(W_l y_{l-1} + b_l)

   where b_l is the vector of bias weights.

3. Calculate the error in the output layer:

   δ_L = t - y_L

4. Backpropagate the error: for l = L-1, L-2, ..., 1,

   δ_l = f_l'(net_l) ⊙ (W_{l+1}^T δ_{l+1})

   where T is the matrix transposition operator and ⊙ denotes element-wise multiplication.

5. Update the weights and biases:

   ΔW_l = η δ_l y_{l-1}^T,   Δb_l = η δ_l

You can see that this notation is significantly more compact than the graph form, even though it describes exactly the same sequence of operations.
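A compact sketch of these five steps for a single hidden layer (Python with NumPy; the layer sizes, training data, learning rate, and the choice of tanh hidden units with a linear output are all hypothetical assumptions, not part of the text above):

    import numpy as np
    rng = np.random.default_rng(0)

    # hypothetical network: 1 input -> 3 tanh hidden units -> 1 linear output
    W1, b1 = rng.normal(scale=0.5, size=(3, 1)), np.zeros((3, 1))
    W2, b2 = rng.normal(scale=0.5, size=(1, 3)), np.zeros((1, 1))
    eta = 0.01

    X = np.linspace(-2, 2, 20).reshape(1, -1)   # one column per training pattern
    T = np.tanh(2 * X) + 0.5                    # hypothetical targets

    for _ in range(5000):
        for p in range(X.shape[1]):
            y0 = X[:, [p]]                    # 1. initialize the input layer
            y1 = np.tanh(W1 @ y0 + b1)        # 2. propagate activity forward
            y2 = W2 @ y1 + b2                 #    (linear output layer)
            d2 = T[:, [p]] - y2               # 3. error in the output layer
            d1 = (1 - y1**2) * (W2.T @ d2)    # 4. backpropagate (tanh' = 1 - y^2)
            W2 += eta * d2 @ y1.T; b2 += eta * d2   # 5. update weights and biases
            W1 += eta * d1 @ y0.T; b1 += eta * d1

    Y = W2 @ np.tanh(W1 @ X + b1) + b2
    print(np.abs(T - Y).max())   # residual error shrinks as training proceeds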

Classification

Pattern Classification and Single Layer Networks


Intro
We have just seen how a network can be trained to perform linear regression. That is, given a set of inputs (x) and output/target values (y), the network finds the best linear mapping from x to y.

Given an x value that we have not seen, our trained network can predict what the most likely y value will be. The ability to (correctly) predict the output for an input the network has not seen is called generalization.

This style of learning is referred to as supervised learning (or learning with a teacher) because we are given the target values. Later we will see examples of unsupervised learning which is used for finding patterns in the data rather than modeling input/output mappings. We now step away from linear regression for a moment and look at another type of supervised learning problem called pattern classification. We start by considering only single layer networks.

Pattern classification
A classic example of pattern classification is letter recognition. We are given, for example, a set of pixel values associated with an image of a letter, and we want the computer to determine what letter it is. The pixel values are referred to as the inputs or the decision variables, and the letter categories are referred to as classes.

Now, a given letter such as "A" can look quite different depending on the type of font that is used or, in the case of handwritten letters, different people's handwriting. Thus, there will be a range of values for the decision variables that map to the same class. That is, if we plot the values of the decision variables, different regions will correspond to different classes.

Example 1:

Two Classes (class 0 and class 1), Two Inputs (x1 and x2).

Example 2:

Another example (see data description, data, Maple plots): class = types of iris; decision variables = sepal and petal sizes.
Example 3:

example of zipcode digits in Maple

Single layer Networks for Pattern Classification


We can apply a similar approach as in linear regression where the targets are now the classes. Note that the outputs are no longer continuous but rather take on discrete values.

Two Classes:

What does the network look like? If there are just 2 classes we only need 1 output node. The target is 1 if the example is in, say, class 1, and the target is 0 (or -1) if the example is in class 0. It seems reasonable to use a binary step function to guarantee an appropriate output value.

Training Methods:

We will discuss two kinds of methods for training single-layer networks that do pattern classification:

- Perceptron - guaranteed to find the right weights if they exist
- The Adaline (uses the Delta Rule) - can easily be generalized to multi-layer nets (nonlinear problems)

But how do we know if the right weights exist at all?


Let's look to see what a single layer architecture can do ....

Single Layer with a Binary Step Function


Consider a network with 2 inputs and 1 output node (2 classes). The net input of the network is a linear function of the weights and the inputs:

net = W X = x1 w1 + x2 w2
y = f(net)

The equation x1 w1 + x2 w2 = 0 defines a straight line through the input space:

x2 = -(w1/w2) x1   <- a line through the origin with slope -w1/w2

Bias
What if the line dividing the 2 classes does not go through the origin?

Other interesting geometric points to note:

The weight vector (w1, w2) is normal to the decision boundary. Proof: suppose z1 and z2 are points on the decision boundary. Then W z1 = 0 and W z2 = 0, so W (z1 - z2) = 0. Since the difference vector z1 - z2 lies along the decision boundary, W is perpendicular to it.

Linear Separability
Classification problems for which there is a line that exactly separates the classes are called linearly separable. Single layer networks are only able to solve linearly separable problems. Most real-world problems are not linearly separable.

The Perceptron

The perceptron learning rule is a method for finding the weights in a network.

We consider the problem of supervised learning for classification, although other types of problems can also be solved. A nice feature of the perceptron learning rule is that if there exists a set of weights that solve the problem, then the perceptron will find these weights. This is true for either binary or bipolar representations.

Assumptions:

We have a single layer network whose output is, as before, output = f(net) = f(W X), where f is a binary step function whose values are ±1.

We assume that the bias is treated as just an extra input whose value is always 1. Let p = the number of training examples (x, t), where t = +1 or -1.

Geometric Interpretation:
With this binary function f, the problem reduces to finding weights such that sign(W X) = t. That is, the weights must be chosen so that the projection of pattern X onto W has the same sign as the target t. But the boundary between positive and negative projections is just the plane W X = 0, i.e. the same decision boundary we saw before.

The Perceptron Algorithm


1. Initialize the weights (either to zero or to a small random value).
2. Pick a learning rate η (a number between 0 and 1).
3. Until the stopping condition is satisfied (e.g. the weights don't change), for each training pattern (x, t):
   - compute the output activation y = f(w x);
   - if y = t, don't change the weights;
   - if y != t, update the weights: w(new) = w(old) + 2 η t x, or equivalently w(new) = w(old) + η (t - y) x.

Consider what happens below when the training pattern p1 or p2 is chosen. Before updating the weight W, we note that both p1 and p2 are incorrectly classified (the red dashed line is the decision boundary). Suppose we choose p1 to update the weights, as in the picture below on the left; p1 has target value t = 1, so the weight is moved a small amount in the direction of p1. Suppose instead we choose p2 to update the weights; p2 has target value t = -1, so the weight is moved a small amount in the direction of -p2. In either case, the new boundary (blue dashed line) is better than before.
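A minimal sketch of the perceptron algorithm (Python with NumPy; the four toy patterns are hypothetical, and a constant third input of 1 plays the role of the bias):

    import numpy as np

    # toy, linearly separable patterns; the last component is the bias input
    X = np.array([[ 2.0,  1.0, 1.0], [ 1.0,  3.0, 1.0],
                  [-1.0, -2.0, 1.0], [-2.0,  1.0, 1.0]])
    t = np.array([1, 1, -1, -1])

    w = np.zeros(3)   # 1. initialize the weights to zero
    eta = 0.5         # 2. pick a learning rate between 0 and 1

    for epoch in range(100):                      # 3. repeat until no change
        changed = False
        for x_p, t_p in zip(X, t):
            y = 1 if np.dot(w, x_p) >= 0 else -1  # binary step output
            if y != t_p:                          # update only on mistakes
                w = w + eta * (t_p - y) * x_p
                changed = True
        if not changed:
            break

    print(w, [1 if np.dot(w, x_p) >= 0 else -1 for x_p in X])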

Comments on Perceptron

The choice of learning rate does not matter because it just changes the scaling of w. The decision surface (for 2 inputs and one bias) has equation:

x2 = -(w1/w2) x1 - w3/w2

where we have defined w3 to be the bias: W = (w1, w2, b) = (w1, w2, w3).

From this we see that the equation remains the same if W is scaled by a constant.

Outline of the perceptron convergence proof: find a lower bound L(k) for |w|² as a function of iteration k. Then find an upper bound U(k) for |w|². Then show that the lower bound grows at a faster rate than the upper bound. Since the lower bound can't be larger than the upper bound, there must be a finite k at which the weights are no longer updated. However, this can only happen if all patterns are correctly classified.

Perceptron Decision Boundaries

Delta Rule

Also known by the names:


- Adaline Rule
- Widrow-Hoff Rule
- Least Mean Squares (LMS) Rule

Change from Perceptron:


- Replace the step function with a continuous (differentiable) activation function, e.g. a linear function.
- For classification problems, use the step function only to determine the class, not to update the weights.

Note: this is the same algorithm we saw for regression. All that really differs is how the classes are determined.

Delta Rule: Training by Gradient Descent Revisited


Construct a cost function E that measures how well the network has learned. For example,

E = (1/2) Σi=1..n (ti - yi)²    (one output node)

where
n = number of examples
ti = desired target value associated with the i-th example
yi = output of the network when the i-th input pattern is presented to the network

To train the network, we adjust the weights in the network so as to decrease the cost (this is where we require differentiability). This is called gradient descent.

Algorithm

Initialize the weights with some small random value Until E is within desired tolerance, update the weights according to where E is evaluated at W(old), is the learning rate.: and the gradient is

More than Two Classes.


If there are more than 2 classes, we could still use the same network, but instead of having a binary target we can let the target take on discrete values. For example, if there are 5 classes, we could have t = 1, 2, 3, 4, 5 or t = -2, -1, 0, 1, 2. It turns out, however, that the network has a much easier time if we have one output per class. We can think of each output node as trying to solve a binary problem (the example is either in the given class or it isn't).

Two-Layer Net: The above is not the most general region. Here, we have assumed that the top layer acts as an AND function. Problem: in the general 2- and 3-layer cases, there is no simple way to determine the weights.


Doing Classification Correctly

The Old Way


When there are more than 2 classes, we so far have suggested doing the following:

- Assign one output node to each class.
- Set the target value of each node to be 1 if it is the correct class and 0 otherwise.
- Use a linear network with a mean squared error function.
- Determine the network's class prediction by picking the output node with the largest value.

There are problems with this method. First, there is a disconnect between the definition of the error function and the determination of the class. A minimum error does not necessarily produce the network with the largest number of correct predictions.

By varying the above method a little bit we can remove this inconsistency. Let us start by changing the interpretation of the output:

The New Way


New Interpretation: The output yi is interpreted as the probability that i is the correct class. This means that:

- The output of each node must be between 0 and 1.
- The sum of the outputs over all nodes must be equal to 1.

How do we achieve this? There are several things to vary.


- We can vary the activation function, for example by using a sigmoid. Sigmoids range continuously between 0 and 1. Is a sigmoid a good choice?
- We can vary the cost function. We need not use mean squared error (MSE). What are our other options?

To decide, let's start by thinking about what makes sense intuitively. With a linear network using gradient descent on an MSE function, we found that the weight updates were proportional to the error (t - y). This seems to make sense. If we use a sigmoid activation function with MSE, we obtain a more complicated formula:

Δwij ∝ (ti - yi) yi (1 - yi) yj

See the derivatives of activation functions to see where the extra factor yi(1 - yi) comes from. This is not quite what we want: that factor makes the update vanish when the output saturates near 0 or 1, even if the error is large. It turns out that there is a better error function/activation function combination that gives us what we want.

Error Function:
Cross entropy is defined as

E = - Σ_{i=1..c} t_i log(y_i)

where c is the number of classes (i.e. the number of output nodes).

This equation comes from information theory and is often applied when the outputs (y) are interpreted as probabilities. We won't worry about where it comes from but let's see if it makes sense for certain special cases.

o Suppose the network is trained perfectly so that the targets exactly match the network output, and suppose class 3 is chosen. This means that the output of node 3 is 1 (i.e. the probability is 1 that 3 is correct) and the outputs of the other nodes are 0 (i.e. the probability is 0 that any class != 3 is correct). In this case the above equation evaluates to 0, as desired (taking 0 log 0 = 0).
o Suppose the network gives the same output y = 1/c for every node (0.5 in the two-class case), i.e. there is complete uncertainty about which is the correct class. It turns out that E takes its maximum value in this case. Thus, the more uncertain the network is, the larger the error E. This is as it should be.
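Both special cases are easy to check numerically. A small sketch (the three-class setup and probability values are illustrative):

    import numpy as np

    def cross_entropy(t, y):
        # E = -sum over the c output nodes of t_i * log(y_i)
        return -np.sum(t * np.log(y))

    t = np.array([0.0, 0.0, 1.0])               # class 3 is the correct class
    y_perfect = np.array([1e-12, 1e-12, 1.0])   # (near-)perfect prediction
    y_uniform = np.array([1/3, 1/3, 1/3])       # complete uncertainty

    print(cross_entropy(t, y_perfect))          # ~0, as desired
    print(cross_entropy(t, y_uniform))          # log(3) ~ 1.10, the maximum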

Activation function:
Softmax is defined as

y_i = f_i(u) = exp(u_i) / Σ_{j=1..c} exp(u_j)

where f_i is the activation function of the i-th output node, u_j is the net input to the j-th output node, and c is the number of classes. Note that this has the following good properties:

o it is always a number between 0 and 1, and the outputs sum to 1, so they can be read as probabilities
o when combined with the cross entropy error function, it gives a weight update proportional to (t - y)
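The second property can be verified numerically: for cross entropy composed with softmax, the gradient with respect to the net inputs is exactly y - t. A sketch comparing the analytic gradient with a finite-difference estimate (the net-input values are arbitrary):

    import numpy as np

    def softmax(u):
        e = np.exp(u - u.max())        # subtract the max for numerical stability
        return e / e.sum()

    def cross_entropy(t, u):
        return -np.sum(t * np.log(softmax(u)))

    u = np.array([0.2, -1.0, 0.5])     # net inputs to the 3 output nodes
    t = np.array([0.0, 1.0, 0.0])      # class 2 is the correct class

    analytic = softmax(u) - t          # dE/du_i = y_i - t_i, i.e. -(t - y)

    eps = 1e-6                         # central finite differences
    numeric = np.array([
        (cross_entropy(t, u + eps * np.eye(3)[i]) -
         cross_entropy(t, u - eps * np.eye(3)[i])) / (2 * eps)
        for i in range(3)])

    print(np.allclose(analytic, numeric, atol=1e-6))   # True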

Optimal Weight and Learning Rates for Linear Networks


Regression Revisited
Suppose we are given a set of data (x(1),y(1)), (x(2),y(2)), ..., (x(p),y(p)).

If we assume that g is linear, then finding the best line that fits the data (linear regression) can be done algebraically. The solution is based on minimizing the squared error (cost) between the network output and the data:

E = (1/2) Σ_{k=1..p} (y(k) - w x(k))^2

where the network output for input x is w x.

Finding the best set of weights


1 input, 1 output, 1 weight:

dE/dw = - Σ_k (y(k) - w x(k)) x(k)

But the derivative of E is zero at the minimum, so we can solve for w_opt:

w_opt = ( Σ_k x(k) y(k) ) / ( Σ_k x(k)^2 )

n inputs, m outputs: nm weights

The same analysis can be done in the multi-dimensional case, except that now everything becomes matrices:

w_opt = P H^{-1}

where w_opt is an m x n matrix, H = <x x^T> is the n x n input correlation matrix, and P = <y x^T> is an m x n matrix. Matrix inversion is an expensive operation. Also, if the input dimension, n, is very large, then H is huge and may not even be possible to compute. If we are not able to compute the inverse Hessian, or if we don't want to spend the time, then we can use gradient descent.
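A small numerical sketch of the matrix solution (the sizes and data are illustrative, and in practice one would use a linear solver rather than an explicit matrix inverse):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, p = 3, 2, 200                     # inputs, outputs, examples (assumed sizes)
    X = rng.normal(size=(p, n))             # inputs, one row per example
    W_true = rng.normal(size=(m, n))        # the linear map we hope to recover
    Y = X @ W_true.T                        # noise-free targets

    H = X.T @ X / p                         # correlation matrix <x x^T>, n x n
    P = Y.T @ X / p                         # cross-correlation <y x^T>, m x n
    W_opt = P @ np.linalg.inv(H)            # the algebraic solution, m x n

    print(np.allclose(W_opt, W_true))       # True, since the data here are noise-free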

Gradient Descent: Picking the Best Learning Rate

For linear networks, E is quadratic in the weights, so we can write

E(w) = E(w0) + E'(w0)(w - w0) + (1/2) E''(w0)(w - w0)^2

so that we have an exact expression for E around any point w0. But this is just a Taylor series expansion of E(w) about w0. Now, suppose we want to determine the optimal weight, w_opt. We can differentiate E(w) and evaluate the result at w_opt, noting that E'(w_opt) is zero:

0 = E'(w_opt) = E'(w0) + E''(w0)(w_opt - w0)

Solving for w_opt we obtain:

w_opt = w0 - [E''(w0)]^{-1} E'(w0) = w0 - H^{-1} ∇E

Comparing this to the update equation, we find that the learning "rate" that takes us directly to the minimum is equal to the inverse Hessian, which is a matrix and not a scalar. Why do we need a matrix?
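For a quadratic cost, this matrix-valued "rate" really does reach the minimum in one step, as a quick sketch shows (the toy data and true weights are assumptions):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 2))
    t = X @ np.array([2.0, -1.0])             # noise-free targets

    w0 = np.zeros(2)                          # arbitrary starting point
    H = X.T @ X / len(X)                      # Hessian of E = (1/2n) sum (t - y)^2
    grad = -(t - X @ w0) @ X / len(X)         # gradient of E at w0

    w_opt = w0 - np.linalg.inv(H) @ grad      # a single step with the matrix rate H^-1
    print(w_opt)                              # [ 2. -1.] (up to floating point error)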

2-D Example

Curvature axes aligned with the coordinate axes: here the Hessian is diagonal, so we can use a separate learning rate for each weight,

Δw1 = - η1 ∂E/∂w1
Δw2 = - η2 ∂E/∂w2

or in matrix form:

Δw = - [ η1  0
          0  η2 ] ∇E

η1 and η2 are inversely related to the size of the curvature along each axis. Using the above learning rate matrix has the effect of scaling the gradient differently along each axis, to make the surface "look" spherical. If the curvature axes are not aligned with the coordinate axes, then we need a full matrix of learning rates. This matrix is just the inverse Hessian. In general, H^-1 is not diagonal. We can obtain the curvature along each axis, however, by computing the eigenvalues of H. Anyone remember what eigenvalues are??
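(An eigenvalue λ and eigenvector v of H satisfy H v = λ v; for a symmetric Hessian the eigenvectors are the curvature axes and the eigenvalues the curvatures along them.) A short sketch with an arbitrary non-diagonal Hessian:

    import numpy as np

    H = np.array([[3.0, 1.0],                 # a Hessian whose curvature axes are
                  [1.0, 2.0]])                # not aligned with the coordinate axes

    eigvals, eigvecs = np.linalg.eigh(H)      # eigendecomposition of symmetric H
    print(eigvals)                            # curvature along each principal axis
    print(1.0 / eigvals)                      # per-direction learning rates ~ 1/eigenvalue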

Taking a Step Back


We have been spending a lot of time on some pretty tough math. Why? Because training a network can take a long time if you just blindly apply the basic algorithms. There are techniques that can improve the rate of convergence by orders of magnitude, but understanding them requires a deep understanding of the underlying characteristics of the problem (i.e. the mathematics). Knowing which speed-up techniques to apply can make the difference between a net that takes 100 iterations to train and one that takes 10000 iterations (assuming it trains at all). The previous slides are trying to make the following points for linear networks (i.e. those networks whose cost function is a quadratic function of the weights):
1. The shape of the cost surface has a significant effect on how fast a net can learn. Ideally, we want a spherically symmetric surface.
2. The correlation matrix is defined as the average over all inputs of x x^T.
3. The Hessian is the second derivative of E with respect to w. For linear nets, the Hessian is the same as the correlation matrix.
4. The Hessian tells you about the shape of the cost surface:
5. The eigenvalues of H are a measure of the steepness of the surface along the curvature directions.
6. a large eigenvalue => steep curvature => need a small learning rate
7. the learning rate should be proportional to 1/eigenvalue
8. If we are forced to use a single learning rate for all weights, then we must use one that does not cause divergence along the steep directions (large-eigenvalue directions). Thus, we must choose a learning rate on the order of 1/λmax, where λmax is the largest eigenvalue.
9. If we can use a matrix of learning rates, this matrix is proportional to H^-1.
10. For real problems (i.e. nonlinear ones), you don't know the eigenvalues, so you just have to guess. Of course, there are algorithms that will estimate λmax... We won't be considering these here.
11. An alternative way to speed up learning is to transform the inputs (that is, x -> Px, for some transformation matrix P) so that the resulting correlation matrix, <(Px)(Px)^T>, is equal to the identity (a sketch follows this list).
12. The above statements are only really true for linear networks. However, the cost surface of a nonlinear network can be modeled as a quadratic in the vicinity of the current weights. We can then apply similar techniques, though they will only be approximations.
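A sketch of the input transformation from point 11 ("whitening"), built from the eigendecomposition of the correlation matrix; the correlated input distribution is an arbitrary example:

    import numpy as np

    rng = np.random.default_rng(3)
    A = np.array([[2.0, 0.5],
                  [0.0, 1.0]])
    X = rng.normal(size=(1000, 2)) @ A        # correlated inputs, one row per example

    H = X.T @ X / len(X)                      # correlation matrix <x x^T>
    eigvals, V = np.linalg.eigh(H)            # H = V diag(eigvals) V^T
    P = np.diag(eigvals ** -0.5) @ V.T        # whitening transform: x -> P x

    Xw = X @ P.T                              # transformed inputs
    print(np.round(Xw.T @ Xw / len(Xw), 6))   # identity matrix, as desired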

Summary of Linear Nets


Characteristics of Networks

number of layers
number of nodes per layer
activation function (linear, binary, softmax)
error function (mean squared error (MSE), cross entropy)
type of learning algorithm (gradient descent, perceptron, delta rule)

Types of Applications and Associated Nets

Regression
o uses a one-layer linear network (activation function is the identity)
o uses the MSE cost function
o uses gradient descent learning

Classification - Perceptron Learning
o uses a one-layer network with a binary step activation function
o uses the MSE cost function
o uses the perceptron learning algorithm (identical to gradient descent when the targets are +1 and -1)

Classification - Delta Rule
o uses a one-layer network with a linear activation function
o uses the MSE cost function
o uses gradient descent
o the network chooses the class by picking the output node with the largest output

Classification - Gradient Descent (the right way)
o uses a one-layer network with a softmax activation function
o uses the cross entropy error function
o outputs are interpreted as probabilities
o the network chooses the class with the highest probability

Modes of Learning for Gradient Descent


Batch
o At each iteration, the gradient is computed by averaging over all inputs.

Online (stochastic)
o At each iteration, the gradient is estimated by picking one (or a small number) of the inputs.
o Because the gradient is only being estimated, there is a lot of noise in the weight updates. The error comes down quickly but then tends to jiggle around. To remove this noise, one can either switch to batch mode at the point where the error levels out, or continue to use online learning but decrease the learning rate (called annealing the learning rate). One way of annealing is to use η = η0/t, where η0 is the original learning rate and t is the number of timesteps after annealing is turned on. A sketch of online mode with annealing follows.
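A minimal sketch of online (stochastic) learning with an annealed rate for a linear unit; the data, base rate η0, and the step at which annealing starts are all assumptions:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    t = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=200)   # noisy linear targets

    w = np.zeros(2)
    eta0, T0 = 0.05, 1000                  # base learning rate; annealing starts at step T0
    for step in range(4000):
        i = rng.integers(len(X))           # online mode: one example per update
        grad = -(t[i] - X[i] @ w) * X[i]   # noisy one-example gradient estimate
        eta = eta0 if step < T0 else eta0 / (step - T0 + 1)   # eta = eta0 / t after T0
        w -= eta * grad
    print(w)                               # close to [1. 2.]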

Picking Learning Rates


o Learning rates that are too big cause the algorithm to diverge.
o Learning rates that are too small cause the algorithm to converge very slowly.
o The optimal learning "rate" for a linear network trained in batch mode is the inverse Hessian H^-1, where the Hessian H is defined as the second derivative of the cost function with respect to the weights. Unfortunately, this is a matrix whose inverse can be costly to compute. More details if you are interested (a sketch of estimating the largest eigenvalue follows this list):
o The next best thing is to use a separate learning rate for each weight. If the Hessian is diagonal, these learning rates are just one over the eigenvalues of the Hessian. Fat chance that the Hessian is diagonal, though!
o If using a single scalar learning rate, the best one to use is 1 over the largest eigenvalue of the Hessian. There are fairly inexpensive algorithms for estimating this. However, many people just use the ol' brute force method of picking the learning rate: trial and error.
o For linear networks, the Hessian is <x x^T> and is independent of the weights. For nonlinear networks (i.e. any network with an activation function that isn't the identity), the Hessian depends on the value of the weights and so changes every time the weights are updated - arrgh! That is why people love the trial and error approach.
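One standard inexpensive way to estimate the largest eigenvalue (the notes do not name a method; power iteration is my choice here):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(500, 10))
    H = X.T @ X / len(X)                   # Hessian of a linear net: <x x^T>

    v = rng.normal(size=10)                # power iteration: repeatedly apply H and normalize
    for _ in range(100):
        v = H @ v
        v /= np.linalg.norm(v)
    lam_max = v @ H @ v                    # Rayleigh quotient ~ largest eigenvalue

    eta = 1.0 / lam_max                    # a safe scalar learning rate, ~ 1/lambda_max
    print(lam_max, np.linalg.eigvalsh(H).max())   # the two values agree closely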

Limitations of Linear Networks


o For regression, we can only fit a straight line through the data points. Many problems are not linear.
o For classification, we can only lay down linear boundaries between classes. This is inadequate for most real-world problems.

(Detail from the derivation of the softmax/cross entropy weight update:) where δ_ij = 1 if i = j and 0 otherwise. Note that if r is the correct class, then t_r = 1 and the RHS of the above equation reduces to (t_r - y_r) x_s. If some q != r is the correct class, then t_r = 0 and the above also reduces to (t_r - y_r) x_s. Thus, in both cases, we have a weight update proportional to (t_r - y_r) x_s, i.e. proportional to (t - y), as claimed.
