What is an artificial neural network? An artificial neural network is a system based on the operation of biological neural networks; in other words, it is an emulation of a biological neural system. Why would the implementation of artificial neural networks be necessary? Although computing these days is truly advanced, there are certain tasks that a program made for a common microprocessor is unable to perform; even so, a software implementation of a neural network can be made, with its advantages and disadvantages.

Advantages:
A neural network can perform tasks that a linear program cannot. When an element of the neural network fails, the network can continue without any problem thanks to its parallel nature. A neural network learns and does not need to be reprogrammed. It can be implemented in any application without any problem.
Disadvantages:

The neural network needs training to operate. The architecture of a neural network is different from the architecture of microprocessors, and therefore needs to be emulated. Large neural networks require high processing time.

Another aspect of artificial neural networks is that there are different architectures, which consequently require different types of algorithms; but despite being an apparently complex system, a neural network is relatively simple. Artificial neural networks are among the newest signal processing technologies nowadays. The field of work is very interdisciplinary, but the explanation I will give you here will be restricted to an engineering perspective. In the world of engineering, neural networks have two main functions: pattern classifiers and nonlinear adaptive filters. Like its biological predecessor, an artificial neural network is an adaptive system; by adaptive, we mean that each parameter is changed during its operation as it is deployed to solve the problem at hand. This is called the training phase. An artificial neural network is developed with a systematic step-by-step procedure which optimizes a criterion commonly known as the learning rule. The input/output training data are fundamental for these networks, as they convey the information necessary to discover the optimal operating point. In addition, their nonlinear nature makes neural network processing elements very flexible systems.
Basically, an artificial neural network is a system. A system is a structure that receives an input, processes the data, and provides an output. Commonly, the input consists of a data array, which can be anything such as data from an image file, a WAVE sound, or any kind of data that can be represented in an array. Once an input is presented to the neural network, and a corresponding desired or target response is set at the output, an error is composed from the difference between the desired response and the real system output. The error information is fed back to the system, which adjusts its parameters in a systematic fashion (commonly known as the learning rule). This process is repeated until the output is acceptable. It is important to notice that the performance hinges heavily on the data; hence, the data should be pre-processed with third-party algorithms such as DSP algorithms. In neural network design, the engineer or designer chooses the network topology, the trigger (or performance) function, the learning rule, and the criteria for stopping the training phase. So, it is pretty difficult to determine the size and parameters of the network, as there is no rule or formula for doing it. The best we can do to succeed with our design is to experiment with it. The problem with this method is that when the system does not work properly, it is hard to refine the solution. Despite this issue, neural-network-based solutions are very efficient in terms of development time and resources. From experience, I can tell that artificial neural networks provide real solutions that are difficult to match with other technologies. Fifteen years ago, Denker said: "artificial neural networks are the second best way to implement a solution", motivated by their simplicity, design and universality.
Nowadays, neural network technologies are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification and control.
The Biological Model
Artificial neural networks were born after McCulloch and Pitts introduced a set of simplified neurons in 1943. These neurons were represented as models of biological networks, turned into conceptual components for circuits that could perform computational tasks. The basic model of the artificial neuron is founded upon the functionality of the biological neuron. By definition, "neurons are basic signaling units of the nervous system of a living being in which each neuron is a discrete cell whose several processes arise from its cell body".
The biological neuron has four main regions to its structure. The cell body, or soma, has two offshoots from it: the dendrites, and the axon, which ends in pre-synaptic terminals. The cell body is the heart of the cell; it contains the nucleus and maintains protein synthesis. A neuron has many dendrites, which look like a tree structure and receive signals from other neurons. A single neuron usually has one axon, which expands off from a part of the cell body called the axon hillock. The axon's main purpose is to conduct electrical signals generated at the axon hillock down its length. These signals are called action potentials. The other end of the axon may split into several branches, which end in pre-synaptic terminals. The electrical signals (action potentials) that the neurons use to convey the information of the brain are all identical. The brain can determine which type of information is being received based on the path of the signal: it analyzes all patterns of signals sent, and from that information it interprets the type of information received. The myelin is a fatty tissue that insulates the axon. The non-insulated parts of the axon are called Nodes of Ranvier; at these nodes, the signal traveling down the axon is regenerated. This ensures that the signal travels down the axon quickly and at constant strength. The synapse is the area of contact between two neurons. The neurons do not physically touch, because they are separated by a cleft; the electric signals are sent through chemical interaction. The neuron sending the signal is called the pre-synaptic cell and the neuron receiving the signal is called the postsynaptic cell. The electrical signals are generated by the membrane potential, which is based on the differences in concentration of sodium and potassium ions inside and outside the cell membrane.
Biological neurons can be classified by their function or by the quantity of processes they carry out. When biological neurons are classified by function, they fall into three categories. The first group is sensory neurons; these neurons provide all information for perception and motor coordination. The second group provides information to muscles and glands; these are called motor neurons. The last group, the interneuronal, contains all other neurons and has two subclasses. One group, called relay or projection interneurons, is usually found in the brain and connects different parts of it. The other group, called local interneurons, is only used in local circuits.

When they are classified by processes, they fall into three categories: unipolar neurons, bipolar neurons and multipolar neurons. Unipolar neurons have a single process; their dendrites and axon are located on the same stem. These neurons are found in invertebrates. Bipolar neurons have two processes; their dendrites and axon have two separated processes. Multipolar neurons are commonly found in mammals. Some examples of these neurons are spinal motor neurons, pyramidal cells and Purkinje cells.

The Mathematical Model

Once modeling an artificial functional model from the biological neuron, we must take into account three basic components. First off, the synapses of the biological neuron are modeled as weights. Let's remember that the synapse of the biological neuron is what interconnects the neural network and gives the strength of the connection. For an artificial neuron, the weight is a number, and represents the synapse. A negative weight reflects an inhibitory connection, while positive values designate excitatory connections. The following components of the model represent the actual activity of the neuron cell. All inputs are summed together and modified by the weights; this activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be -1 and 1. Mathematically, this process is described in the figure.
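The three components just described can be illustrated with a minimal Python sketch (the function names and input values below are hypothetical, chosen only for illustration):

```python
import math

def logistic(v):
    # Squashing activation: the output always lies between 0 and 1.
    return 1.0 / (1.0 + math.exp(-v))

def neuron(inputs, weights, activation):
    # 1) synapses as weights, 2) linear combination, 3) activation function.
    v = sum(w * x for w, x in zip(weights, inputs))
    return activation(v)

# A negative weight models an inhibitory synapse, a positive one an excitatory synapse.
y = neuron([1.0, 0.5], [2.0, -1.0], logistic)  # v = 2*1.0 + (-1)*0.5 = 1.5
```

The activation keeps the output in the acceptable range (here 0 to 1) no matter how large the weighted sum becomes.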
From this model the interval activity of the neuron can be shown to be:

vk = Σj wkj xj

The output of the neuron, yk, would therefore be the outcome of some activation function on the value of vk.

Activation functions

As mentioned previously, the activation function acts as a squashing function, such that the output of a neuron in a neural network is between certain values (usually 0 and 1, or -1 and 1). In general, there are three types of activation functions, denoted by Φ(·). First, there is the Threshold Function, which takes on a value of 0 if the summed input is less than a certain threshold value (v), and the value 1 if the summed input is greater than or equal to the threshold value. Secondly, there is the Piecewise-Linear function. This function again can take on the values of 0 or 1, but can also take on values between those, depending on the amplification factor in a certain region of linear operation. Thirdly, there is the sigmoid function. This function can range between 0 and 1, but it is also sometimes useful to use the -1 to 1 range. An example of the sigmoid function is the hyperbolic tangent function.

Learning strategies as a basis for an adaptive system

The architecture of each neural network is based on very similar building blocks which perform the processing. The artificial neural networks which we describe are all variations on the parallel distributed processing (PDP) idea. In this chapter we first discuss these processing units and different neural network topologies.
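The three activation functions described above can be sketched in Python (the threshold of 0 and the unit-width linear region are assumed here for illustration):

```python
import math

def threshold(v, theta=0.0):
    # Threshold function: 0 below the threshold, 1 at or above it.
    return 1.0 if v >= theta else 0.0

def piecewise_linear(v):
    # 0 or 1 outside the region of linear operation; linear (slope 1) inside it.
    if v <= -0.5:
        return 0.0
    if v >= 0.5:
        return 1.0
    return v + 0.5

def sigmoid(v):
    # Smoothly ranges between 0 and 1.
    return 1.0 / (1.0 + math.exp(-v))
```

All three squash an unbounded summed input into the interval [0, 1]; only the sigmoid does so smoothly, which matters later when derivatives are needed.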
A framework for distributed representation

An artificial neural network consists of a pool of simple processing units which communicate by sending signals to each other over a large number of weighted connections. A set of major aspects of a parallel distributed model can be distinguished: a set of processing units ('neurons', 'cells'); a state of activation yk for every unit, which is equivalent to the output of the unit; connections between the units (generally each connection is defined by a weight wjk which determines the effect which the signal of unit j has on unit k); a propagation rule, which determines the effective input sk of a unit from its external inputs; an activation function Fk, which determines the new level of activation based on the effective input sk(t) and the current activation yk(t) (i.e. the update); an external input (aka bias, offset) θk for each unit; a method for information gathering (the learning rule); and an environment within which the system must operate, providing input signals and, if necessary, error signals.

Processing units

Each unit performs a relatively simple job: receive input from neighbours or external sources and use this to compute an output signal which is propagated to other units. Apart from this processing, a second task is the adjustment of the weights. The system is inherently parallel in the sense that many units can carry out their computations at the same time. Within neural systems it is useful to distinguish three types of units: input units (indicated by an index i) which receive data from outside the neural network, output units (indicated by an index o) which send data out of the neural network, and hidden units (indicated by an index h) whose input and output signals remain within the neural network. During operation, units can be updated either synchronously or asynchronously. With synchronous updating, all units update their activation simultaneously; with asynchronous updating, each unit has a (usually fixed) probability of updating its activation at a time t, and usually only one unit will be able to do this at a time. In some cases the latter model has some advantages.

Neural Network topologies

In the previous section we discussed the properties of the basic processing unit in an artificial neural network. This section focuses on the pattern of connections between the units and the propagation of data. As for this pattern of connections, the main distinction we can make is between: Feed-forward neural networks, where the data flow from input to output units is strictly feedforward. The data processing can extend over multiple (layers of) units, but no
feedback connections are present, that is, connections extending from outputs of units to inputs of units in the same layer or previous layers. Recurrent neural networks do contain feedback connections. Contrary to feedforward networks, the dynamical properties of the network are important. In some cases, the activation values of the units undergo a relaxation process such that the neural network will evolve to a stable state in which these activations do not change anymore. In other applications, the change of the activation values of the output neurons is significant, such that the dynamical behaviour constitutes the output of the neural network (Pearlmutter, 1990). Classical examples of feed-forward neural networks are the Perceptron and Adaline. Examples of recurrent networks have been presented by Anderson (Anderson, 1977), Kohonen (Kohonen, 1977), and Hopfield (Hopfield, 1982).
Training of artificial neural networks
A neural network has to be configured such that the application of a set of inputs produces (either 'direct' or via a relaxation process) the desired set of outputs. Various methods to set the strengths of the connections exist. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the neural network by feeding it teaching patterns and letting it change its weights according to some learning rule. We can categorise the learning situations in two distinct sorts. These are:
Supervised learning or Associative learning in which the network is trained by providing it with input and matching output patterns. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised).
Unsupervised learning or Self-organisation in which an (output) unit is trained to respond to clusters of patterns within the input. In this paradigm the system is supposed to discover statistically salient features of the input population. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather, the system must develop its own representation of the input stimuli.
Reinforcement Learning This type of learning may be considered as an intermediate form of the above two types of learning. Here the learning machine does some action on the environment and gets a feedback response from the environment. The learning system grades its action as good (rewarding) or bad (punishable) based on the environmental response and accordingly adjusts its parameters. Generally, parameter adjustment is continued until an equilibrium state occurs, following which there will be no more changes in its parameters. Self-organizing neural learning may be categorized under this type of learning.
Modifying patterns of connectivity of Neural Networks
Both learning paradigms, supervised learning and unsupervised learning, result in an adjustment of the weights of the connections between units, according to some modification rule. Virtually all learning rules for models of this type can be considered as a variant of the Hebbian learning rule suggested by Hebb in his classic book Organization of Behaviour (1949) (Hebb, 1949). The basic idea is that if two units j and k are active simultaneously, their interconnection must be strengthened. If j receives input from k, the simplest version of Hebbian learning prescribes to modify the weight wjk with

Δwjk = γ yj yk

where γ is a positive constant of proportionality representing the learning rate. Another common rule uses not the actual activation of unit k but the difference between the actual and desired activation for adjusting the weights:

Δwjk = γ yj (dk - yk)

in which dk is the desired activation provided by a teacher. This is often called the Widrow-Hoff rule or the delta rule, and will be discussed in the next chapter. Many variants (often very exotic ones) have been published in the last few years. Neural networks were started about 50 years ago. Their early abilities were exaggerated, casting doubts on the field as a whole. There is a recent renewed interest in the field, however, because of new techniques and a better theoretical understanding of their capabilities.
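The two update rules can be written out as a short Python sketch, using the notation above (γ as the learning rate, yj and yk as activations, dk as the desired activation; the function names are ours):

```python
def hebbian_update(w_jk, y_j, y_k, gamma):
    # Hebbian rule: strengthen the connection when units j and k
    # are active at the same time.
    return w_jk + gamma * y_j * y_k

def delta_update(w_jk, y_j, y_k, d_k, gamma):
    # Widrow-Hoff (delta) rule: move the weight in proportion to the
    # difference between desired and actual activation of unit k.
    return w_jk + gamma * y_j * (d_k - y_k)
```

Note the design difference: the Hebbian rule grows weights without bound whenever the two activations are correlated, while the delta rule stops changing the weight once the actual activation matches the desired one.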
Motivation for neural networks:
Scientists are challenged to use machines more effectively for tasks currently solved by humans. Symbolic rules don't reflect the processes actually used by humans. Traditional computing excels in many areas, but not in others.
Types of Applications
Learning: having a computer program itself from a set of examples, so you don't have to program it yourself. This will be a strong focus of this course: neural networks that learn from a set of examples. Optimization: given a set of constraints and a cost function, how do you find an optimal solution? E.g. the traveling salesman problem. Classification: grouping patterns into classes, e.g. handwritten characters into letters. Associative memory: recalling a memory based on a partial match. Regression: function mapping.
Modelling higher level reasoning:
  o language
  o problem solving
Modelling lower level reasoning:
  o vision
  o audition, speech recognition
  o speech generation
Neurobiology: building models of how the brain works, at the neuron level and at higher levels (vision, hearing, etc.). Overlaps with cognitive folks.
Nonparametric statistical analysis and regression.
Can human souls/behavior be explained in terms of symbols, or does it require something lower level, like a neurally based model?
Where are neural networks being used?
Signal processing: suppressing line noise, adaptive echo canceling, blind source separation.
Control: e.g. backing up a truck: cab position, rear position, and match with the dock get converted to steering instructions. Manufacturing plants for controlling automated machines. Siemens successfully uses neural networks for process automation in basic industries; e.g., in rolling mill control, more than 100 neural networks do their job 24 hours a day.
Robotics: navigation, vision recognition.
Pattern recognition: e.g. recognizing handwritten characters; the current version of Apple's Newton uses a neural net.
Medicine: e.g. storing medical records based on case information.
Speech production: reading text aloud (NETtalk).
Speech recognition.
Vision: face recognition, edge detection, visual search engines.
Financial applications: time series analysis, stock market prediction.
Data compression: speech signal, image, e.g. faces.
Game playing: backgammon, chess, go.
Business: e.g. rules for mortgage decisions are extracted from past decisions made by experienced evaluators, resulting in a network that has a high level of agreement with human experts.

Neural Networks in the Brain

The brain is not homogeneous. At the largest anatomical scale, we distinguish cortex, midbrain, brainstem, and cerebellum. Each of these can be hierarchically subdivided into many regions, and areas within each region, either according to the anatomical structure of the neural networks within it, or according to the function performed by them. The overall pattern of projections (bundles of neural connections) between areas is extremely complex, and only partially known. The best mapped (and largest) system in the human brain is the visual system, where the first 10 or 11 processing stages have been identified. We distinguish feedforward projections that go from earlier processing stages (near the sensory input) to later ones (near the motor output), from feedback connections that go in the opposite direction. In addition to these long-range connections, neurons also link up with many thousands of their neighbours. In this way they form very dense, complex local networks.

Artificial Neuron Models

Computational neurobiologists have constructed very elaborate computer models of neurons in order to run detailed simulations of particular circuits in the brain. As Computer Scientists, we are more interested in the general properties of neural networks, independent of how they are actually "implemented" in the brain. This means that we can use much simpler, abstract "neurons", which (hopefully) capture the essence of neural computation even if they leave out much of the details of how biological neurons work. People have implemented model neurons in hardware as electronic circuits, often integrated on VLSI chips. Remember though that computers run much faster than brains; we can therefore run fairly large networks of simple model neurons as software simulations in reasonable time. This has obvious advantages over having to use special "neural" computer hardware.

A Simple Artificial Neuron

Our basic computational element (model neuron) is often called a node or unit. It receives input from some other units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to model synaptic learning. The unit computes some function f of the weighted sum of its inputs:

yi = f(Σj wij yj)

Its output, in turn, can serve as input to other units. The weighted sum Σj wij yj is called the net input to unit i, often written neti. Note that wij refers to the weight from unit j to unit i (not the other way around). The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit.
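A linear unit as just described takes only a few lines of Python (the names are illustrative):

```python
def net_input(weights_i, outputs):
    # net_i: weighted sum of incoming signals; weights_i[j] holds w_ij,
    # the weight from unit j to unit i.
    return sum(w_ij * y_j for w_ij, y_j in zip(weights_i, outputs))

def linear_unit(weights_i, outputs):
    # In the simplest case f is the identity, so the unit's output
    # is just its net input.
    return net_input(weights_i, outputs)
```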
Neurons and Synapses

The basic computational unit in the nervous system is the nerve cell, or neuron. A neuron has: dendrites (inputs), a cell body, and an axon (output). A neuron receives input from other neurons (typically many thousands). Inputs sum (approximately). Once input exceeds a critical level, the neuron discharges a spike - an electrical pulse that travels from the body, down the axon, to the next neuron(s) (or other receptors). This spiking event is also called depolarization, and is followed by a refractory period, during which the neuron is unable to fire. The axon endings (Output Zone) almost touch the dendrites or cell body of the next neuron. Transmission of an electrical signal from one neuron to the next is effected by neurotransmittors, chemicals which are released from the first neuron and which bind to receptors in the second. This link is called a synapse. The extent to which the signal from one neuron is passed on to the next depends on many factors, e.g. the amount of neurotransmittor available, the number and arrangement of receptors, the amount of neurotransmittor reabsorbed, etc.
Synaptic Learning

Brains learn, based on experience, and typically without the benefit of a benevolent teacher; they learn "on-line". From what we know of neuronal structures, one way brains learn is by altering the strengths of connections between neurons, and by adding or deleting connections between neurons. The efficacy of a synapse can change as a result of experience, providing both memory and learning through long-term potentiation. One way this happens is through the release of more neurotransmitter. Many other changes may also be involved.

Long-term Potentiation: an enduring (>1 hour) increase in synaptic efficacy that results from high-frequency stimulation of an afferent (input) pathway.

Hebb's Postulate: "When an axon of cell A ... excite[s] cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency, as one of the cells firing B, is increased." Bliss and Lomo discovered LTP in the hippocampus in 1973.
Points to note about LTP: synapses become more or less important over time (plasticity); LTP is based on experience; LTP is based only on local information (Hebb's postulate).

Summary

The following properties of nervous systems will be of particular interest in our neurally-inspired models: parallel, distributed information processing; a high degree of connectivity among basic units; connections that are modifiable based on experience; learning as a constant process, and usually unsupervised; learning based only on local information; performance that degrades gracefully if some units are removed.

Examples of Activation Functions

Identity Function

The identity function is given by f1(x) = x.

> f1 := proc(x) x end;
> plot(f1(x), x=-10..10);
Step Function

The step function is f2(x) = 0 if x < 0 and f2(x) = 1 if x >= 0; this is also called the Heaviside function. Another common variation is for it to take on the values -1 and +1, as shown below.

> f2 := proc(x) 2*Heaviside(x) - 1 end;
> plot(f2(x), x=-10..10);
Logistic Function (Sigmoid)

The logistic function has the form

> f3 := proc(x,a) 1/(1 + exp(-a*x)) end;
> plot(f3(x,1), x=-10..10);

The parameter "a" in the logistic function determines how steep it is; the larger "a", the steeper it is.

> with(plots):
> display([plot(f3(x,1), x=-10..10),
>          plot(f3(x,.5), x=-10..10, color = blue),
>          plot(f3(x,2), x=-10..10, color = green)]);
Symmetric Sigmoid

The symmetric sigmoid is simply the sigmoid stretched so that the y range is 2 and then shifted down by 1, so that it ranges between -1 and 1. If g(x) is the standard sigmoid then the symmetric sigmoid is f4(x) = 2 g(x) - 1.

> f4 := proc(x) 2*f3(x,1) - 1 end;
> plot(f4(x), x=-10..10);
The symmetric sigmoid differs from the hyperbolic tangent by a constant factor. As you can see, the graph below is identical to the graph for the symmetric sigmoid.

> plot(tanh(.5*x), x=-10..10);

Radial Basis Functions

A radial basis function is simply a Gaussian. It is called local because, unlike the previous functions, it is essentially zero everywhere except in a small region.

> f5 := proc(x,a) exp(-a * x^2) end;
> plot(f5(x,1), x=-10..10);
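Although the worksheet examples here use Maple, the claimed relationship between the symmetric sigmoid and the hyperbolic tangent is easy to check numerically; below is a Python sketch that re-implements f3 and f4 from above:

```python
import math

def f3(x, a):
    # The logistic function from above.
    return 1.0 / (1.0 + math.exp(-a * x))

def f4(x):
    # The symmetric sigmoid: stretched to a y range of 2, shifted down by 1.
    return 2.0 * f3(x, 1) - 1.0

# 2*sigmoid(x) - 1 equals tanh(x/2), which is why the two graphs coincide.
match = all(abs(f4(x) - math.tanh(0.5 * x)) < 1e-12
            for x in [-3.0, -1.0, 0.0, 0.7, 2.5])
```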
Derivatives

The derivative of the identity function is just 1; that is, if f(x) is the identity then f'(x) = 1. The derivative of the step function is not defined, which is exactly why it isn't used. The nice feature of sigmoids is that their derivatives are easy to compute. If f(x) is the logistic function above, then f'(x) = a f(x) (1 - f(x)):

> simplify(diff(f3(x,a), x));
> f3(x,a)*(1-f3(x,a));
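The same derivative identity can be verified numerically in Python (re-implementing f3; the finite-difference check is our own addition, not part of the original worksheet):

```python
import math

def f3(x, a):
    return 1.0 / (1.0 + math.exp(-a * x))

def df3(x, a):
    # Analytic derivative of the logistic function: a * f * (1 - f).
    return a * f3(x, a) * (1.0 - f3(x, a))

# Compare against a central finite difference at a few sample points.
h = 1e-6
close = all(
    abs(df3(x, 2.0) - (f3(x + h, 2.0) - f3(x - h, 2.0)) / (2 * h)) < 1e-6
    for x in [-1.0, 0.0, 0.7]
)
```

This cheap derivative is what makes the sigmoid attractive for gradient-based training later on.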
This is also true of the hyperbolic tangent:

> diff(tanh(x), x);

Linear Regression

Fitting a Model to Data

Consider the data below (for more complete auto data, see data description, raw data, and maple plots):
(Fig. 1)

Each dot in the figure provides information about the weight (x-axis, units: U.S. pounds) and fuel consumption (y-axis, units: miles per gallon) for one of 74 cars (data from 1979). Clearly weight and fuel consumption are linked, so that, in general, heavier cars use more fuel. Now suppose we are given the weight of a 75th car, and asked to predict how much fuel it will use, based on the above data. Such questions can be answered by using a model - a short mathematical description - of the data (see also optical illusions). The simplest useful model here is of the form
y = w1 x + w0    (1)

This is a linear model: in an xy-plot, equation 1 describes a straight line with slope w1 and intercept w0 with the y-axis, as shown in Fig. 2. (Note that we have rescaled the coordinate axes - this does not change the problem in any fundamental way.) How do we choose the two parameters w0 and w1 of our model? Clearly, any straight line drawn somehow through the data could be used as a predictor, but some lines will do a better job than others. The line in Fig. 2 is certainly not a good model: for most cars, it will predict too much fuel consumption for a given weight.
2) The Loss Function In order to make precise what we mean by being a "good predictor". A popular choice for E is the sum-squared error: .(Fig. we define a loss (also called objective or error) function E over the model parameters.
E = Σi (ti - yi)^2    (2)

In words, it is the sum over all points i in our data set of the squared difference between the target value ti (here: actual fuel consumption) and the model's prediction yi, calculated from the input value xi (here: weight of the car) by equation 1. For a linear model, the sum-squared error is a quadratic function of the model parameters. Figure 3 shows E for a range of values of w0 and w1. Figure 4 shows the same function as a contour plot.

(Fig. 3)
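Equation 2 translates directly into Python; the data points below are made up for illustration and are not the actual 74-car data set:

```python
def sum_squared_error(w0, w1, data):
    # E = sum_i (t_i - y_i)^2, with y_i = w1 * x_i + w0 (equation 1).
    return sum((t - (w1 * x + w0)) ** 2 for x, t in data)

# Hypothetical (weight, fuel-consumption)-style points, chosen so that
# the line y = -6x + 42 fits them exactly.
data = [(2.0, 30.0), (3.0, 24.0), (4.0, 18.0)]
```

A perfect fit gives E = 0; any other choice of (w0, w1) gives a strictly larger value, which is what makes E a usable objective.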
(Fig. 4)

Minimizing the Loss

The loss function E provides us with an objective measure of predictive error for a specific choice of model parameters. We can thus restate our goal of finding the best (linear) model as finding the values for the model parameters that minimize E. For linear models, linear regression provides a direct way to compute these optimal model parameters. (See any statistics textbook for details.) However, this analytical approach does not generalize to nonlinear models (which we will get to by the end of this lecture). Even though the solution cannot be calculated explicitly in that case, the problem can still be solved by an iterative numerical technique called gradient descent. It works as follows:

1. Choose some (random) initial values for the model parameters.
2. Calculate the gradient G of the error function with respect to each model parameter.
3. Change the model parameters so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e., in the direction of -G.
4. Repeat steps 2 and 3 until G gets close to zero.

How does this work? The gradient of E gives us the direction in which the loss function at the current setting of the w has the steepest slope. In order to decrease E, we take a small step in the opposite direction, -G (Fig. 5). By repeating this over and over, we move "downhill" in E until we reach a minimum, where G = 0, so that no further progress is possible (Fig. 6). Fig. 7 shows the best linear model for our car data, found by this procedure.

(Fig. 5) (Fig. 6)
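The four steps above can be sketched in Python for the linear model of equation 1, using the analytic gradient of the sum-squared error (the toy data set is hypothetical):

```python
def gradient(w0, w1, data):
    # Analytic gradient of E = sum_i (t_i - (w1*x_i + w0))^2.
    g0 = sum(-2.0 * (t - (w1 * x + w0)) for x, t in data)
    g1 = sum(-2.0 * (t - (w1 * x + w0)) * x for x, t in data)
    return g0, g1

def gradient_descent(data, mu=0.01, steps=5000):
    w0, w1 = 0.0, 0.0                    # step 1: initial parameter values
    for _ in range(steps):
        g0, g1 = gradient(w0, w1, data)  # step 2: compute the gradient G
        w0 -= mu * g0                    # step 3: short move in direction -G
        w1 -= mu * g1
    return w0, w1                        # step 4: approximated by a fixed step count

# Toy data lying exactly on y = 2x + 1.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
```

With a small enough step size mu, the iterates settle at the minimum of E; here that recovers slope 2 and intercept 1.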
(Fig. 7)

It's a neural network!

Our linear model of equation 1 can in fact be implemented by the simple neural network shown in Fig. 8. It consists of a bias unit, an input unit, and a linear output unit. The input unit makes external input x (here: the weight of a car) available to the network, while the bias unit always has a constant output of 1. The output unit computes the sum:
y2 = y1 w21 + 1.0 w20    (3)

It is easy to see that this is equivalent to equation 1, with w21 implementing the slope of the straight line, and w20 its intercept with the y-axis.

Linear Neural Networks

Multiple regression

Our car example showed how we could discover an optimal linear function for predicting one variable (fuel consumption) from one other (weight). Suppose now that we are also given one or more additional variables which could be useful as predictors. Our simple neural network model can easily be extended to this case by adding more input units (Fig. 1). Similarly, we may want to predict more than one variable from the data that we're given. This can easily be accommodated by adding more output units (Fig. 2). The network now has a typical layered structure: a layer of input units (and the bias), connected by a layer of weights to a layer of output units. The loss function for a network with multiple outputs is obtained simply by adding the loss for each output unit together.
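Such a layered linear network can be sketched in Python, with the bias folded into the input vector as a constant 1 (the names and numbers are illustrative):

```python
def forward(W, x):
    # W[i][j] is the weight from input unit j to output unit i;
    # x includes the bias unit's constant output 1 as its last entry.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def network_loss(W, data):
    # Multiple-output loss: add the sum-squared error of each output unit.
    total = 0.0
    for x, t in data:
        y = forward(W, x)
        total += sum((ti - yi) ** 2 for ti, yi in zip(t, y))
    return total

W = [[2.0, 1.0],    # output 1: slope 2, intercept 1
     [0.5, -1.0]]   # output 2: slope 0.5, intercept -1
```

Each row of W is one output unit's weight vector, so adding predictors or predictions is just adding columns or rows.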
Computing the gradient

In order to train neural networks such as the ones shown above by gradient descent, we need to be able to compute the gradient G of the loss function with respect to each weight w_ij of the network. It tells us how a small change in that weight will affect the overall error E. We begin by splitting the loss function into separate terms for each point p in the training data:

E = 1/2 Σ_p Σ_o (t_o^p - y_o^p)^2    (1)

where o ranges over the output units of the network. (Note that we use the superscript p to denote the training point - this is not an exponentiation!) Since differentiation and summation are interchangeable, we can likewise split the gradient into separate components for each training point:

G = ∂E/∂w_ij = Σ_p ∂E^p/∂w_ij    (2)

In what follows, we describe the computation of the gradient for a single data point, omitting the superscript p in order to make the notation easier to follow.
First use the chain rule to decompose the gradient into two factors:

∂E/∂w_ij = ∂E/∂y_i · ∂y_i/∂w_ij    (3)

The first factor can be obtained by differentiating Eqn. 1 above:

∂E/∂y_i = -(t_i - y_i)    (4)

Using y_i = Σ_j w_ij y_j, the second factor becomes

∂y_i/∂w_ij = y_j    (5)

Putting the pieces (equations 3-5) back together, we obtain

-∂E/∂w_ij = (t_i - y_i) y_j    (6)

To find the gradient G for the entire data set, we sum at each weight the contribution given by equation 6 over all the data points. We can then subtract a small proportion µ (called the learning rate) of G from the weights to perform gradient descent.

The Gradient Descent Algorithm

1. Initialize all weights to small random values.
2. REPEAT until done
   1. For each weight w_ij set Δw_ij = 0
   2. For each data point (x, t)^p
      1. set input units to x
      2. compute value of output units
      3. For each weight w_ij set Δw_ij = Δw_ij + (t_i - y_i) y_j
   3. For each weight w_ij set w_ij = w_ij + µ Δw_ij

The algorithm terminates once we are at, or sufficiently near to, the minimum of the error function, where G = 0. We say then that the algorithm has converged.

In summary:

general case:
- Training data: (x, t)
- Model parameters: w
- Model: y = g(w, x)
- Error function: E(y, t)
- Gradient with respect to w_ij: ∂E/∂w_ij
- Weight update rule: w_ij(new) = w_ij(old) - µ ∂E/∂w_ij

linear network:
- Training data: (x, t)
- Model parameters: w
- Model: y_i = Σ_j w_ij y_j
- Error function: E = 1/2 Σ_i (t_i - y_i)^2
- Gradient with respect to w_ij: -(t_i - y_i) y_j
- Weight update rule: w_ij(new) = w_ij(old) + µ Σ_p (t_i - y_i) y_j

The Learning Rate

An important consideration is the learning rate µ, which determines by how much we change the weights w at each step. If µ is too small, the algorithm will take a long time to converge (Fig. 3).
Conversely, if µ is too large, we may end up bouncing around the error surface out of control - the algorithm diverges (Fig. 4). This usually ends with an overflow error in the computer's floating-point arithmetic.

Batch vs. Online Learning

Above we have accumulated the gradient contributions for all data points in the training set before updating the weights. This method is often referred to as batch learning. An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.
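The difference between the two update schedules can be sketched for the one-weight linear model y = w·x; the data and learning rate below are invented for illustration:

```python
# Batch vs. online gradient descent for y = w * x on toy data.
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]
mu = 0.02

def batch_epoch(w):
    # accumulate the gradient over ALL points, then update once
    G = sum(-(t - w * x) * x for x, t in data)
    return w - mu * G

def online_epoch(w):
    # update immediately after each point (stochastic gradient descent)
    for x, t in data:
        G = -(t - w * x) * x   # noisy single-point estimate of the gradient
        w = w - mu * G
    return w

wb = wo = 0.0
for _ in range(200):
    wb = batch_epoch(wb)
    wo = online_epoch(wo)
print(round(wb, 2), round(wo, 2))  # both end up near the least-squares slope
```

The online weight never settles exactly: because each single-point gradient is noisy, it keeps hovering near the minimum rather than converging to it.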
Online learning has a number of advantages:

- it is often much faster, especially when the training set is redundant (contains many similar data points),
- it can be used when there is no fixed training set (new data keeps coming in),
- it is better at tracking nonstationary environments (where the best model gradually changes over time),
- the noise in the gradient can help to escape from local minima (which are a problem for gradient descent in nonlinear models).

These advantages are, however, bought at a price: many powerful optimization techniques (such as: conjugate and second-order gradient methods, support vector machines, Bayesian methods, etc.) - which we will not talk about in this course! - are batch methods that cannot be used online. (Of course this also means that in order to implement batch learning really well, one has to learn an awful lot about these rather complicated methods!) A compromise between batch and online learning is the use of "mini-batches": the weights are updated after every n data points, where n is greater than 1 but smaller than the training set size.

Multi-layer networks

A nonlinear problem

Consider again the best linear fit we found for the car data. Notice that the data points are not evenly distributed around the line: for low weights, we see more miles per gallon than our model predicts. In fact, it looks as if a simple curve might fit these data better than the straight line. We can enable our neural network to do such curve fitting by giving it an additional node which has a suitably curved (nonlinear) activation function. A useful function for this purpose is the S-shaped hyperbolic tangent (tanh) function (Fig. 1).
Fig. 2 shows our new network: an extra node (unit 2) with tanh activation function has been inserted between input and output. Since such a node is "hidden" inside the network, it is commonly called a hidden unit. Note that the hidden unit also has a weight from the bias unit. In general, all non-input neural network units have such a bias weight. For simplicity, the bias unit and weights are usually omitted from neural network diagrams; unless it's explicitly stated otherwise, you should always assume that they are there. (Fig. 2)
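The network of Fig. 2 can be written out directly as a nested function; the four weight values below are invented for illustration, not the trained values from the figure:

```python
import math

# input -> tanh hidden unit -> linear output, with bias weights throughout.
w_hx, b_h = 1.5, -0.5   # input-to-hidden weight and hidden bias weight
w_oh, b_o = 2.0, 0.25   # hidden-to-output weight and output bias weight

def forward(x):
    hidden = math.tanh(w_hx * x + b_h)  # b_h shifts tanh in x, w_hx scales it
    return w_oh * hidden + b_o          # b_o shifts in y, w_oh scales in y

print(round(forward(0.0), 4))
```

Note how the two bias weights shift the curve while the other two weights scale it, exactly the four roles described above.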
When this network is trained by gradient descent on the car data, it learns to fit the tanh function to the data (Fig. 3). Fig. 2 gives the weight values that produced the solution shown in Fig. 3. Each of the four weights in the network plays a particular role in this process: the two bias weights shift the tanh function in the x- and y-direction, respectively, while the other two weights scale it along those two directions.

Hidden Layers

One can argue that in the example above we have cheated by picking a hidden unit activation function that could fit the data well. What would we do if the data looks like this (Fig. 4)?
(Fig. 4) (Relative concentration of NO and NO2 in exhaust fumes as a function of the richness of the ethanol/air mixture burned in a car engine.)

Obviously the tanh function can't fit this data at all. We could cook up a special activation function for each data set we encounter, but that would defeat our purpose of learning to model the data. We would like to have a general, non-linear function approximation method which would allow us to fit any given data set, no matter what it looks like. Fortunately there is a very simple solution: add more hidden units! In fact, a network with just two hidden units using the tanh function (Fig. 5) can fit the data in Fig. 4 quite well - can you see how? The fit can be further improved by adding yet more units to the hidden layer. Note, however, that having too large a hidden layer - or too many hidden layers - can degrade the network's performance (more on this later). In general, one shouldn't use more hidden units than necessary to solve a given problem. (One way to ensure this is to start training with a very small
network. If gradient descent fails to find a satisfactory solution, grow the network by adding a hidden unit, and repeat.)

Theoretical results indicate that given enough hidden units, a network like the one in Fig. 5 can approximate any reasonable function to any required degree of accuracy. (There are clear practical limits, which we will discuss later.) In other words, any function can be expressed as a linear combination of tanh functions: tanh is a universal basis function. Many functions form a universal basis; the two classes of activation functions commonly used in neural networks are the sigmoidal (S-shaped) basis functions (to which tanh belongs), and the radial basis functions.

Error Backpropagation

We have already seen how to train linear networks by gradient descent. In trying to do the same for multi-layer networks we encounter a difficulty: we don't have any target values for the hidden units. This seems to be an insurmountable problem - how could we tell the hidden units just what to do? This unsolved question was in fact the reason why neural networks fell out of favor after an initial period of high popularity in the 1950s. It took 30 years before the error backpropagation (or in short: backprop) algorithm popularized a way to train hidden units, leading to a new wave of neural network research and applications.

(Fig. 1) In principle, backprop provides a way to train networks with any number of hidden units arranged in any number of layers. In fact, the network does not have to be organized in layers - any pattern of connectivity that permits a partial ordering of the nodes from input to output is allowed. In other words, there must be a way to order the units such that all connections go from "earlier" (closer to the input) to "later" ones (closer to the output). This is equivalent to stating that their connection pattern must not contain any cycles. Networks that respect this constraint are called feedforward networks; their connection pattern forms a directed acyclic graph or dag.
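The "partial ordering" condition can be checked mechanically: a connection pattern is feedforward exactly when its units admit a topological ordering. A small sketch, with invented node names:

```python
# Each unit maps to the units it feeds into. Note h1 -> h2 is allowed:
# a feedforward network need not be organized in layers.
edges = {
    "x": ["h1", "h2"],
    "h1": ["h2", "y"],
    "h2": ["y"],
    "y": [],
}

def topological_order(edges):
    order, seen = [], set()
    def visit(node, path):
        if node in path:
            raise ValueError("cycle found: not a feedforward network")
        if node not in seen:
            for nxt in edges[node]:
                visit(nxt, path | {node})
            seen.add(node)
            order.append(node)
    for node in edges:
        visit(node, set())
    return order[::-1]   # "earlier" (closer to the input) units first

print(topological_order(edges))  # ['x', 'h1', 'h2', 'y']
```

Adding an edge "y" -> "x" would create a cycle, and the check would raise an error: such a pattern is not feedforward.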
The Algorithm

We want to train a multi-layer feedforward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x, t). The vector x represents a pattern of input to the network, and the vector t the corresponding target (desired output). As we have seen before, the overall gradient with respect to the entire training set is just the sum of the gradients for each pattern; in what follows we will therefore describe how to compute the gradient for just a single training pattern. As before, we will number the units, and denote the weight from unit j to unit i by w_ij.

1. Definitions:

- the error signal for unit j: δ_j = -∂E/∂net_j
- the (negative) gradient for weight w_ij: Δw_ij = -∂E/∂w_ij
- the set of nodes anterior to unit i: A_i
- the set of nodes posterior to unit j: P_j

2. The gradient. As we did for linear networks before, we expand the gradient into two factors by use of the chain rule:

Δw_ij = -∂E/∂w_ij = -∂E/∂net_i · ∂net_i/∂w_ij

The first factor is the error of unit i: δ_i. The second is ∂net_i/∂w_ij = y_j. Putting the two together, we get

Δw_ij = δ_i y_j

To compute this gradient, we thus need to know the activity and the error for all relevant nodes in the network.
3. Forward activation. The activity of the input units is determined by the network's external input x. For all other units, the activity is propagated forward:

y_i = f_i( Σ_{j in A_i} w_ij y_j )

Note that before the activity of unit i can be calculated, the activity of all its anterior nodes (forming the set A_i) must be known. Since feedforward networks do not contain cycles, there is an ordering of nodes from input to output that respects this condition.

4. Calculating output error. Assuming that we are using the sum-squared loss

E = 1/2 Σ_o (t_o - y_o)^2

the error for output unit o is simply

δ_o = t_o - y_o

5. Error backpropagation. For hidden units, we must propagate the error back from the output nodes (hence the name of the algorithm). Again using the chain rule, we can expand the error of a hidden unit in terms of its posterior nodes:

δ_j = Σ_{i in P_j} δ_i w_ij f_j'(net_j)

Of the three factors inside the sum, the first is just the error of node i. The second is

∂net_i/∂y_j = w_ij

while the third is the derivative of node j's activation function:

f_j'(net_j) = ∂y_j/∂net_j

For hidden units h that use the tanh activation function, we can make use of the special identity tanh(u)' = 1 - tanh(u)^2, giving us

f_h'(net_h) = 1 - y_h^2
Putting all the pieces together we get

δ_j = f_j'(net_j) Σ_{i in P_j} δ_i w_ij

Note that in order to calculate the error for unit j, we must first know the error of all its posterior nodes (forming the set P_j). Again, as long as there are no cycles in the network, there is an ordering of nodes from the output back to the input that respects this condition. For example, we can simply use the reverse of the order in which activity was propagated forward.

Matrix Form

For layered feedforward networks that are fully connected - that is, each node in a given layer connects to every node in the next layer - it is often more convenient to write the backprop algorithm in matrix notation rather than the more general graph form given above. In this notation, the bias weights, net inputs, activations, and error signals for all units in a layer are combined into vectors, while all the non-bias weights from one layer to the next form a matrix W. Layers are numbered from 0 (the input layer) to L (the output layer). The backprop algorithm then looks as follows:

1. Initialize the input layer: y_0 = x
2. Propagate activity forward: for l = 1, 2, ..., L: y_l = f_l(W_l y_{l-1} + b_l), where b_l is the vector of bias weights.
3. Calculate the error in the output layer: δ_L = t - y_L (assuming sum-squared loss and linear output units).
4. Backpropagate the error: for l = L-1, L-2, ..., 1: δ_l = f_l'(net_l) · (W_{l+1}^T δ_{l+1}), where T is the matrix transposition operator.
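A minimal sketch of these steps in code, for an assumed tiny 1-2-1 network (tanh hidden layer, linear output unit); all weight values, the input x, and the target t are invented, and the weight update of step 5 is included at the end:

```python
import math

W1, b1 = [[0.5], [-0.5]], [0.0, 0.0]   # hidden layer: 2x1 weights, 2 biases
W2, b2 = [[1.0, -1.0]], [0.0]          # output layer: 1x2 weights, 1 bias
x, t, mu = [1.0], [1.0], 0.1

# steps 1-2: initialize the input layer and propagate activity forward
y1 = [math.tanh(W1[i][0] * x[0] + b1[i]) for i in range(2)]
y2 = [W2[0][0] * y1[0] + W2[0][1] * y1[1] + b2[0]]   # linear output

# step 3: error in the output layer (f' = 1 for a linear output unit)
d2 = [t[0] - y2[0]]

# step 4: backpropagate, d1 = f'(net1) * (W2^T d2), with tanh' = 1 - y^2
d1 = [(1.0 - y1[i] ** 2) * W2[0][i] * d2[0] for i in range(2)]

# step 5: update weights and biases, dW_l = mu * d_l * y_{l-1}^T
W2[0] = [W2[0][i] + mu * d2[0] * y1[i] for i in range(2)]
b2[0] += mu * d2[0]
for i in range(2):
    W1[i][0] += mu * d1[i] * x[0]
    b1[i] += mu * d1[i]

print(round(y2[0], 4))   # the forward output before the update
```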
5. Update the weights and biases:

ΔW_l = µ δ_l y_{l-1}^T
Δb_l = µ δ_l

You can see that this notation is significantly more compact than the graph form, even though it describes exactly the same sequence of operations.

*************************************************************************************

Classification

Pattern Classification and Single Layer Networks

Intro

We have just seen how a network can be trained to perform linear regression. That is, given a set of inputs (x) and output/target values (y), the network finds the best linear mapping from x to y. Given an x value that we have not seen, our trained network can predict what the most likely y value will be. The ability to (correctly) predict the output for an input the network has not seen is called generalization. This style of learning is referred to as supervised learning (or learning with a teacher) because we are given the target values. Later we will see examples of unsupervised learning, which is used for finding patterns in the data rather than modeling input/output mappings.

We now step away from linear regression for a moment and look at another type of supervised learning problem called pattern classification. We start by considering only single layer networks.

Pattern classification

A classic example of pattern classification is letter recognition. We are given, for example, a set of pixel values associated with an image of a letter. We want the computer to determine what letter it is. The pixel values are referred to as the inputs or the decision variables, and the letter categories are referred to as classes.
Note that the outputs are no longer continuous but rather take on discrete values. Now, a given letter such as "A" can look quite different depending on the type of font that is used or, in the case of handwritten letters, different people's handwriting. Thus, there will be a range of values for the decision variables that map to the same class. That is, if we plot the values of the decision variables, different regions will correspond to different classes.

Example 1: Two Classes (class 0 and class 1), Two Inputs (x1 and x2).

Example 2: Another example (see data description, data, Maple plots): class = types of iris; decision variables = sepal and petal sizes.

Example 3: example of zipcode digits in Maple.

Single layer Networks for Pattern Classification

We can apply a similar approach as in linear regression where the targets are now the classes.
Two Classes: What does the network look like? If there are just 2 classes we only need 1 output node. The target is 1 if the example is in, say, class 1, and the target is 0 (or -1) if the example is in class 0. It seems reasonable that we use a binary step function to guarantee an appropriate output value.

Training Methods: We will discuss two kinds of methods for training single-layer networks that do pattern classification:

- Perceptron - guaranteed to find the right weights if they exist
- The Adaline (uses Delta Rule) - can easily be generalized to multi-layer nets (nonlinear problems)

But how do we know if the right weights exist at all? Let's look to see what a single layer architecture can do.

Single Layer with a Binary Step Function

Consider a network with 2 inputs and 1 output node (2 classes). The net output of the network is a linear function of the weights and the inputs:

net = W X = x1 w1 + x2 w2
y = f(net)
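In code, such a unit is just a weighted sum followed by a threshold; the weight values below are invented for illustration:

```python
# A single-layer classifier with 2 inputs and a binary step output.
w1, w2 = 1.0, -1.0

def classify(x1, x2):
    net = w1 * x1 + w2 * x2      # net input: W X
    return 1 if net >= 0 else 0  # binary step function -> class label

print(classify(2.0, 1.0), classify(1.0, 2.0))  # 1 0
```

With these weights the decision boundary is the line x1 = x2: points below it go to class 1, points above it to class 0.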
x1 w1 + x2 w2 = 0 defines a straight line through the input space:

x2 = -(w1/w2) x1

This is a line through the origin with slope -w1/w2.

Bias

What if the line dividing the 2 classes does not go through the origin? Then we add a bias.

Other interesting geometric points to note: The weight vector (w1, w2) is normal to the decision boundary. Proof: Suppose z1 and z2 are points on the decision boundary. Then W z1 = 0 and W z2 = 0, so W (z1 - z2) = 0. Since z1 - z2 is a vector lying along the decision boundary, W must be normal to it.
Linear Separability

Classification problems for which there is a line that exactly separates the classes are called linearly separable. Single layer networks are only able to solve linearly separable problems. Most real world problems are not linearly separable.

The Perceptron

The perceptron learning rule is a method for finding the weights in a network. We consider the problem of supervised learning for classification, although other types of problems can also be solved.
A nice feature of the perceptron learning rule is that if there exists a set of weights that solve the problem, then the perceptron will find these weights. This is true for either binary or bipolar representations.

Assumptions:

- We have a single layer network whose output is, as before, output = f(net) = f(W X), where f is a binary step function whose values are +-1.
- The bias is treated as just an extra input whose value is always 1.
- p = number of training examples (x, t), where t = +1 or -1.

Geometric Interpretation: With this binary function f, the problem reduces to finding weights such that

sign(W X) = t

That is, the weights must be chosen so that the projection of pattern X onto W has the same sign as the target t. But the boundary between positive and negative projections is just the plane W X = 0, i.e., the same decision boundary we saw before.

The Perceptron Algorithm

1. Initialize the weights (either to zero or to a small random value).
2. Pick a learning rate µ (this is a number between 0 and 1).
3. Until stopping condition is satisfied (e.g., weights don't change):
   For each training pattern (x, t):
   - compute output activation y = f(w x)
   - If y = t, don't change weights
   - If y != t, update the weights:

     w(new) = w(old) + 2 µ t x

     or equivalently, since t - y = 2t whenever y != t,

     w(new) = w(old) + µ (t - y) x, for all t

Consider what happens below when the training pattern p1 or p2 is chosen. Before updating the weight W, we note that both p1 and p2 are incorrectly classified (red dashed line is the decision boundary). Suppose we choose p1 to update the weights, as in the picture below on the left. P1 has target value t=1, so the weight is moved a small amount in the direction of p1. Suppose instead we choose p2 to update the weights. P2 has target value t=-1, so the weight is moved a small amount in the direction of -p2. In either case, the new boundary (blue dashed line) is better than before.
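The complete algorithm can be sketched as follows, using the bias-as-extra-input trick and an invented linearly separable data set:

```python
# Perceptron learning rule on a toy linearly separable problem.
# Each input carries a constant 1.0 as its last component (the bias);
# targets are bipolar (+1 / -1).
data = [([2.0, 1.0, 1.0], 1), ([1.5, 2.5, 1.0], -1),
        ([3.0, 0.5, 1.0], 1), ([0.5, 2.0, 1.0], -1)]
w = [0.0, 0.0, 0.0]   # 1. initialize the weights
mu = 0.5              # 2. pick a learning rate in (0, 1]

def f(net):           # binary step function with values +-1
    return 1 if net >= 0 else -1

changed = True
while changed:        # 3. until the weights stop changing
    changed = False
    for x, t in data:
        y = f(sum(wi * xi for wi, xi in zip(w, x)))
        if y != t:    # w(new) = w(old) + mu * (t - y) * x
            w = [wi + mu * (t - y) * xi for wi, xi in zip(w, x)]
            changed = True

# every pattern is now classified correctly
print(all(f(sum(wi * xi for wi, xi in zip(w, x))) == t for x, t in data))
```

Because this data set is linearly separable, the loop is guaranteed to terminate, which is exactly the convergence property discussed below.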
Perceptron Decision Boundaries

Comments on Perceptron

- The choice of learning rate does not matter, because it just changes the scaling of w. The decision surface (for 2 inputs and one bias) has equation x2 = -(w1/w2) x1 - w3/w2, where we have defined w3 to be the bias: W = (w1, w2, b) = (w1, w2, w3). From this we see that the equation remains the same if W is scaled by a constant.
- Convergence proof outline: Find a lower bound L(k) for |w|^2 as a function of iteration k. Then find an upper bound U(k) for |w|^2. Then show that the lower bound grows at a faster rate than the upper bound. Since the lower bound can't be larger than the upper bound, there must be a finite k such that the weight is no longer updated. This can only happen if all patterns are correctly classified.

Delta Rule

Also known by the names:

- Adaline Rule
- Widrow-Hoff Rule
- Least Mean Squares (LMS) Rule

Change from Perceptron: Replace the step function in the network with a continuous (differentiable) activation function, e.g. linear. For classification problems, use the step function only to determine the class and not to update the weights. Note: this is the same algorithm we saw for regression. All that really differs is how the classes are determined.
For example (one output node) where n = number of examples ti = desired target value associated with the i-th example yi = output of network when the i-th input pattern is presented to network To train the network.Delta Rule: Training by Gradient Descent Revisited Construct a cost function E that measures how well the network has learned. we adjust the weights in the network so as to decrease the cost (this is where we require differentiability). This is called gradient descent. .
For example of there ar 5 classes. update the weights according to where E is evaluated at W(old). that the network has a much easier time if we have one output for class.Algorithm Initialize the weights with some small random value Until E is within desired tolerance.: and the gradient is More than Two Classes. If there are mor ethan 2 classes we could still use the same network but instead of having a binary target.2. is the learning rate. however. .4.1. It turns out.0.-1.5 or t= -2. We can think of each output node as trying to solve a binary problem (it is either in the given class or it isn't).3.2. we can let the target take on discrete values. we could have t=1.
Two Layer Net

The above is not the most general region. Here, we have assumed the top layer is an AND function.

Problem: In the general 2- and 3-layer cases, there is no simple way to determine the weights.
Doing Classification Correctly

The Old Way

When there are more than 2 classes, we have so far suggested doing the following:

- Assign one output node to each class.
- Set the target value of each node to be 1 if it is the correct class and 0 otherwise.
- Use a linear network with a mean squared error function.
- Determine the network class prediction by picking the output node with the largest value.

There are problems with this method. First, there is a disconnect between the definition of the error function and the determination of the class: a minimum error does not necessarily produce the network with the largest number of correct predictions.
By varying the above method a little bit we can remove this inconsistency. Let us start by changing the interpretation of the output.

The New Way

New Interpretation: The output y_i is interpreted as the probability that i is the correct class. This means that:

- The output of each node must be between 0 and 1.
- The sum of the outputs over all nodes must be equal to 1.

How do we achieve this? There are several things to vary. We can vary the activation function, for example, by using a sigmoid. Sigmoids range continuously between 0 and 1. This seems to make sense. Is a sigmoid a good choice? We can also vary the cost function; we need not use mean squared error (MSE). With a linear network using gradient descent on an MSE function, we found that the weight updates were proportional to the error (t - y). If we use a sigmoid activation function with MSE, we obtain a more complicated formula (see derivatives of activation functions to see where this comes from). This is not quite what we want. What are our other options? To decide, let's start by thinking about what makes sense intuitively. It turns out that there is a better error function/activation function combination that gives us what we want.

Error Function: Cross Entropy is defined as

E = - Σ_{i=1}^{c} t_i log(y_i)

where c is the number of classes (i.e., the number of output nodes).
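A quick numerical check of this error function; the target and output vectors below are invented for illustration:

```python
import math

# Cross entropy E = -sum_i t_i * log(y_i) for one training example.
def cross_entropy(t, y):
    # terms with t_i = 0 contribute nothing, so skip them (avoids log(0))
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y) if ti > 0)

t = [0.0, 0.0, 1.0, 0.0]   # class 3 (of 4) is the correct class
perfect = cross_entropy(t, [0.0, 0.0, 1.0, 0.0])
uniform = cross_entropy(t, [0.25, 0.25, 0.25, 0.25])
print(perfect == 0.0)      # True: zero error for a perfect prediction
print(round(uniform, 4))   # log(4): large error at complete uncertainty
```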
This equation comes from information theory and is often applied when the outputs (y) are interpreted as probabilities. We won't worry about where it comes from, but let's see if it makes sense for certain special cases.

- Suppose the network is trained perfectly, so that the targets exactly match the network output, and suppose class 3 is the correct class. This means that the output of node 3 is 1 (i.e., the probability is 1 that 3 is correct) and the outputs of the other nodes are 0 (i.e., the probability is 0 that any class != 3 is correct). In this case, do you see that the above equation gives 0?
- Suppose the network gives an output of y = .5 for all of the output nodes, i.e., there is complete uncertainty about which is the correct class. It turns out that E has a maximum value in this case. This is as it should be: the more uncertain the network is, the larger the error E.

Activation function: Softmax is defined as

f_i = exp(net_i) / Σ_{j=1}^{c} exp(net_j)

where f_i is the activation function of the ith output node and c is the number of classes. Note that this has the following good properties:

- It is always a number between 0 and 1.
- When combined with the cross entropy error function, it gives a weight update proportional to (t - y), as desired.

Optimal Weight and Learning Rates for Linear Networks

Regression Revisited

Suppose we are given a set of data (x(1), y(1)), (x(2), y(2)), ..., (x(p), y(p)):
If we assume that g is linear, then finding the best line that fits the data (linear regression) can be done algebraically. The solution is based on minimizing the squared error (cost) between the network output and the data:

E = Σ_i (y(i) - w x(i))^2, where y = w x

Finding the best set of weights

1 input, 1 output: 1 weight. The derivative of E is zero at the minimum, so we can solve for w_opt:

w_opt = Σ_i x(i) y(i) / Σ_i x(i)^2

n inputs, m outputs: nm weights. The same analysis can be done in the multi-dimensional case, except that now everything becomes matrices:
Here w_opt is an m x n matrix, H is an n x n matrix, and the input-target correlation term is an m x n matrix. Matrix inversion is an expensive operation, and if the input dimension n is very large, then H is huge and may not even be possible to compute. If we are not able to compute the inverse Hessian, or if we don't want to spend the time, then we can use gradient descent.

Gradient Descent: Picking the Best Learning Rate

For linear networks, suppose we want to determine the optimal weight w_opt. Since E is quadratic, we can write

E(w) = E(w0) + E'(w0) (w - w0) + 1/2 (w - w0)^T H (w - w0)

But this is just a Taylor series expansion of E(w) about w0. We can differentiate E(w) and evaluate the result at w_opt, noting that E'(w_opt) is zero:

0 = E'(w0) + H (w_opt - w0)
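For the one-input, one-output case, the algebraic solution and gradient descent can be compared directly; the data and learning rate below are invented for illustration:

```python
# Closed-form vs. iterative solution for the linear model y = w * x.
data = [(1.0, 1.1), (2.0, 1.8), (4.0, 4.2)]

# algebraic: set dE/dw = 0 and solve, w_opt = sum(x*t) / sum(x*x)
w_opt = sum(x * t for x, t in data) / sum(x * x for x, t in data)

# iterative: plain gradient descent on the same cost
w, mu = 0.0, 0.02
for _ in range(500):
    G = sum(-(t - w * x) * x for x, t in data)
    w -= mu * G

print(round(w_opt, 4), round(w, 4))  # the two answers agree
```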
Solving for w_opt we obtain:

w_opt = w0 - H^{-1} E'(w0)

Comparing this to the gradient descent update equation, we find that the learning "rate" that takes us directly to the minimum is equal to the inverse Hessian, which is a matrix and not a scalar. Why do we need a matrix?

2-D Example
Suppose the curvature axes are aligned with the coordinate axes. Then the learning rates µ1 and µ2 along the two axes are inversely related to the size of the curvature along each axis; in matrix form, the learning rate is a diagonal matrix. Ideally, we want a spherically symmetric cost surface; using the above learning rate matrix has the effect of scaling the gradient differently along each axis, to make the surface "look" spherical. If the axes are not aligned with the coordinate axes, then we need a full matrix of learning rates. This matrix is just the inverse Hessian. In general, H^{-1} is not diagonal. However, we can obtain the curvature along each axis by computing the eigenvalues of H. (Anyone remember what eigenvalues are?)

Taking a Step Back

We have been spending a lot of time on some pretty tough math. Why? Because training a network can take a long time if you just blindly apply the basic algorithms. There are techniques that can improve the rate of convergence by orders of magnitude. Knowing what speed-up techniques to apply can make the difference between having a net that takes 100 iterations to train vs. 10000 iterations (assuming it trains at all). However, understanding these techniques requires a deep understanding of the underlying characteristics of the problem (i.e., the mathematics).

The previous slides are trying to make the following points for linear networks (i.e., those networks whose cost function is a quadratic function of the weights):

1. The shape of the cost surface has a significant effect on how fast a net can learn.
2. The correlation matrix is defined as the average over all inputs of x x^T.
3. For linear nets, the Hessian is the same as the correlation matrix.
4. The Hessian is the second derivative of E with respect to w.
5. The Hessian tells you about the shape of the cost surface.
6. The eigenvalues of H are a measure of the steepness of the surface along the curvature directions.
7. A large eigenvalue => steep curvature => need a small learning rate in that direction.
8. Thus, the learning rate should be proportional to 1/eigenvalue. If we can use a matrix of learning rates, this matrix is proportional to H^{-1}.
9. However, if we are forced to use a single learning rate for all weights, then we must use a learning rate that will not cause divergence along the steep directions (large eigenvalue directions). Thus, we must choose a learning rate that is on the order of 1/λmax, where λmax is the largest eigenvalue.
10. For real problems, you don't know the eigenvalues, so you just have to guess. However, there are algorithms that will estimate λmax.
11. The above suggestions are only really true for linear networks. However, the cost surface of nonlinear networks can be modeled as a quadratic in the vicinity of the current weight. We can then apply similar techniques as above; of course, they will only be approximations.
12. An alternative solution to speeding up learning is to transform the inputs (that is, x -> Px, for some transformation matrix P) so that the resulting correlation matrix, (Px)(Px)^T, is equal to the identity. We won't be considering these here.

Summary of Linear Nets

Characteristics of Networks

- number of layers
- number of nodes per layer
- activation function (linear, binary, sigmoid, softmax)
- error function (mean squared error (MSE), cross entropy)
- type of learning algorithm (gradient descent, perceptron, delta rule)

Types of Applications and Associated Nets

Regression:
o uses a one-layer linear network (activation function is the identity)
o uses MSE cost function
o uses gradient descent learning

Classification - Perceptron Learning:
o uses a one-layer network with a binary step activation function
o uses MSE cost function
o uses the perceptron learning algorithm (identical with gradient descent when targets are +1 and -1)

Classification - Delta Rule:
o uses a one-layer network with a linear activation function
o uses MSE cost function
o uses gradient descent
o the network chooses the class by picking the output node with the largest output

Classification - Gradient Descent (the right way):
o uses a one-layer network with a softmax activation function
o uses the cross entropy error function
o outputs are interpreted as probabilities
o the network chooses the class with the highest probability

Modes of Learning for Gradient Descent

Batch:
o At each iteration, the gradient is computed by averaging over all inputs.
o The best (matrix) learning rate for batch learning is the inverse Hessian. Unfortunately, this is a matrix whose inverse can be costly to compute. More details if you are interested: the next best thing is to use a separate learning rate for each weight; if the Hessian is diagonal, these learning rates are just one over the eigenvalues of the Hessian. (Fat chance that the Hessian is diagonal, though!)
o If using a single scalar learning rate, then the best one to use is 1 over the largest eigenvalue of the Hessian. There are fairly inexpensive algorithms for estimating this.

Online (stochastic):
o At each iteration, the gradient is estimated by picking one (or a small number) of inputs.
o Because the gradient is only being estimated, there is a lot of noise in the weight updates. The error comes down quickly but then tends to jiggle around. To remove this noise, one can switch to batch at the point where the error levels out, or continue to use online but decrease the learning rate (called annealing the learning rate). One way of annealing is to use µ = µ0/t, where µ0 is the original learning rate and t is the number of timesteps after annealing is turned on.

Picking Learning Rates:
o Learning rates that are too big cause the algorithm to diverge; learning rates that are too small cause the algorithm to converge very slowly.
o The optimal learning rate for linear networks is determined by the Hessian, the second derivative of the cost function with respect to the weights. For linear networks the Hessian is <x x^T> and is independent of the weights.
o For nonlinear networks (i.e., any network that has an activation function that isn't the identity), the Hessian depends on the value of the weights and so changes every time the weights are updated - arrgh! That is why many people just use the ol' brute force method of picking the learning rate: trial and error.

Limitations of Linear Networks

- For regression, we can only fit a straight line through the data points. This is often inadequate for most real world problems.
- For classification, we can only lay down linear boundaries between classes. Many problems are not linear.
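Returning to the annealing schedule mentioned under online learning (µ = µ0/t after annealing is switched on), a minimal sketch, with the initial rate and switch point invented for illustration:

```python
# Annealing the learning rate: constant until a switch point, then mu0/t.
mu0, switch_at = 0.1, 100

def learning_rate(step):
    if step < switch_at:
        return mu0
    t = step - switch_at + 1   # timesteps after annealing is turned on
    return mu0 / t

print(learning_rate(0), learning_rate(99), round(learning_rate(109), 4))
```

Early on, the full rate lets the error come down quickly; once the error levels out, the shrinking rate damps the jiggling caused by the noisy gradient estimates.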
Thus we have .where ij = 0 if i=j and zero otherwise. If q!=r is the correct class then tr = 0 the above also reduces to (tr-yr)xs. Note that if r is the correct class then tr = 1 and RHS of the above equation reduces to (tr-yr)xs.