
INTRODUCTORY LECTURE

ON
MACHINE LEARNING

Example
Let’s start with an example. Suppose we are charged with providing automated access
control to a building. Before entering the building each person has to look into a camera so
we can take a still image of their face. For our purposes it suffices just to decide based on
the image whether the person can enter the building. It might be helpful to (try to) also
identify each person, but this might require information we do not have (e.g., names,
or whether any two face images correspond to the same person). We only have face images
of people recorded while access control was still provided manually. As a result of this
experience we have labeled images. An image is labeled positive if the person in question
should gain entry and negative otherwise. To supplement the set of negatively labeled
images (as we would expect only few cases of refused entries under normal circumstances)
we can use any other face images of people who we do not expect to be permitted to
enter the building. Images taken with similar camera-face orientation (e.g., from systems
operational in other buildings) would be preferred. Our task then is to come up with a
function – a classifier – that maps pixel images to binary (±1) labels. And we only have
the small set of labeled images (the training set) to constrain the function.
Let’s make the task a bit more formal. We assume that each image (grayscale) is represented
as a column vector x of dimension d. So, the pixel intensity values in the image, column by
column, are concatenated into a single column vector. If the image has 100 by 100 pixels,
then d = 10000. We assume that all the images are of the same size. Our classifier is a
binary valued function f : Rd → {−1, 1} chosen on the basis of the training set alone.
For our task here we assume that the classifier knows nothing about images (or faces for
that matter) beyond the labeled training set. So, for example, from the point of view
of the classifier, the images could have been measurements of weight, height, etc. rather
than pixel intensities. The classifier only has a set of n training vectors x1 , . . . , xn with
binary ±1 labels y1 , . . . , yn . This is the only information about the task that we can use to
constrain what the function f should be.

What kind of solution would suffice?

Suppose now that we have n = 50 labeled pixel images that are 128 by 128, and the pixel
intensities range from 0 to 255. It is therefore possible that we can find a single pixel,
say pixel i, such that each of our n images has a distinct value for that pixel. We could
then construct a simple binary function based on this single pixel that perfectly maps the
training images to their labels. In other words, if xti refers to pixel i in the t-th training
image, and x′i is the i-th pixel in any image x′, then

\[
f_i(x') =
\begin{cases}
y_t, & \text{if } x_{ti} = x'_i \text{ for some } t = 1,\ldots,n \text{ (taking the first such } t\text{)} \\
-1, & \text{otherwise}
\end{cases}
\tag{1}
\]
would appear to solve the task. In fact, it is always possible to come up with such a
“perfect” binary function if the training images are distinct (no two images have identical
pixel intensities for all pixels). But do we expect such rules to be useful for images not
in the training set? Even an image of the same person varies somewhat each time the
image is taken (orientation is slightly different, lighting conditions may have changed, etc).
These rules provide no sensible predictions for images that are not identical to those in the
training set. The primary reason why such trivial rules do not suffice is that our task is
not to correctly classify the training images. Our task is to find a rule that works well for
all new images we would encounter in the access control setting; the training set is merely
a helpful source of information to find such a function. To put it a bit more formally, we
would like to find classifiers that generalize well, i.e., classifiers whose performance on the
training set is representative of how well it works for yet unseen images.
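
To make the point concrete, the lookup rule of Eq. (1) is trivial to implement. The sketch below is illustrative only (names such as make_pixel_classifier, X_train and y_train are assumptions, not from the notes): it memorizes pixel i of every training image and falls back to −1 otherwise, which is exactly why it cannot generalize.

```python
import numpy as np

def make_pixel_classifier(X_train, y_train, i):
    """Build the lookup rule f_i of Eq. (1) from pixel i of the training images.

    X_train: (n, d) array, one flattened image per row
    y_train: (n,) array of +1/-1 labels
    """
    # Memorize the label attached to each observed value of pixel i.
    lookup = {X_train[t, i]: y_train[t] for t in range(len(y_train))}

    def f_i(x):
        # Return the memorized label if pixel i matches some training image,
        # otherwise fall back to -1, as in Eq. (1).
        return lookup.get(x[i], -1)

    return f_i
```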

Model selection

So how can we find classifiers that generalize well? The key is to constrain the set of
possible binary functions we can entertain. In other words, we would like to find a class of
binary functions such that if a function in this class works well on the training set, it is also
likely to work well on the unseen images. The “right” class of functions to consider cannot
be too large in the sense of containing too many clearly different functions. Otherwise we
are likely to find rules similar to the trivial ones that are close to perfect on the training
set but do not generalize well. The class of functions should not be too small either or we
run the risk of not finding any functions in the class that work well even on the training
set. If they don’t work well on the training set, how could we expect them to work well on
the new images? Finding the class of functions is a key problem in machine learning, also
known as the model selection problem.

Linear classifiers through origin

Let’s just fix the function class for now. Specifically, we will consider only one type of
classifier: linear classifiers, i.e., thresholded linear mappings from images to labels. More formally,
we only consider functions of the form
\[
f(x; \theta) \;=\; \mathrm{sign}\bigl( \theta_1 x_1 + \cdots + \theta_d x_d \bigr) \;=\; \mathrm{sign}\bigl( \theta^T x \bigr)
\tag{2}
\]

where θ = [θ1 , . . . , θd ]T is a column vector of real valued parameters. Different settings of


the parameters give different functions in this class, i.e., functions whose value or output
in {−1, 1} could be different for some input images x. Put another way, the functions in
our class are parameterized by θ ∈ Rd .
We can also understand these linear classifiers geometrically. The classifier changes its
prediction only when the argument to the sign function changes from positive to negative
(or vice versa). Geometrically, in the space of image vectors, this transition corresponds to
crossing the decision boundary where the argument is exactly zero: all x such that θT x = 0.
The equation defines a plane in d-dimensions, a plane that goes through the origin since
x = 0 satisfies the equation. The parameter vector θ is normal (orthogonal) to this plane;
this is clear since the plane is defined as all x for which θT x = 0. The θ vector as the
normal to the plane also specifies the direction in the image space along which the value of
θT x would increase the most. Figure 1 below tries to illustrate these concepts.
[Figure: training points marked x on both sides of the decision boundary θT x = 0; the parameter
vector θ is normal to the boundary, points with θT x > 0 are labeled +1, and points with
θT x < 0 are labeled −1.]
Figure 1: A linear classifier through origin.

Before moving on, let’s ask whether we lost some useful properties of images as a
result of restricting ourselves to linear classifiers. In fact, we did. Consider, for example,
how nearby pixels in face images relate to each other (e.g., continuity of skin). This
information is completely lost. The linear classifier is perfectly happy (i.e., its ability
to classify images remains unchanged) if we get images where the pixel positions have
been reordered provided that we apply the same transformation to all the images. This
permutation of pixels merely reorders the terms in the argument to the sign function in Eq.
(2). A linear classifier therefore does not have access to information about which pixels are
close to each other in the image.

Learning algorithm: the perceptron

Now that we have chosen a function class (perhaps suboptimally) we still have to find a
specific function in this class that works well on the training set. This is often referred to
as the estimation problem. Let’s be a bit more precise. We’d like to find a linear classifier
that makes the fewest mistakes on the training set. In other words, we’d like to find θ that
minimizes the training error
\[
\hat{E}(\theta) \;=\; \frac{1}{n} \sum_{t=1}^{n} \bigl( 1 - \delta(y_t, f(x_t; \theta)) \bigr)
\;=\; \frac{1}{n} \sum_{t=1}^{n} \mathrm{Loss}\bigl( y_t, f(x_t; \theta) \bigr)
\tag{3}
\]
where δ(y, y′) = 1 if y = y′ and 0 otherwise. The training error merely measures the
fraction of training images on which the function predicts a label different from the label
provided for that image. More generally, we could compare our predictions to labels in
terms of a loss function Loss(yt , f (xt ; θ)). This is useful if errors of a particular kind are
more costly than others (e.g., letting a person enter the building when they shouldn’t). For
simplicity, we use the zero-one loss that is 1 for mistakes and 0 otherwise.
What would be a reasonable algorithm for setting the parameters θ? Perhaps we can just
incrementally adjust the parameters so as to correct any mistakes that the corresponding
classifier makes. Such an algorithm would seem to reduce the training error that counts
the mistakes. Perhaps the simplest algorithm of this type is the perceptron update rule.
We consider each training image one by one, cycling through all the images, and adjust
the parameters according to

\[
\theta' \leftarrow \theta + y_t x_t \quad \text{if } y_t \neq f(x_t; \theta)
\tag{4}
\]
In other words, the parameters (and hence the classifier) are changed only if we make a mistake. These
updates tend to correct mistakes. To see this, note that when we make a mistake the sign
of θT xt disagrees with yt and the product yt θT xt is negative; the product is positive for
correctly classified images. Suppose we make a mistake on xt . Then the updated parameters
are given by θ′ = θ + yt xt , written here in vector form. If we consider classifying the
same image xt after the update, then

\[
y_t \theta'^T x_t \;=\; y_t (\theta + y_t x_t)^T x_t \;=\; y_t \theta^T x_t + y_t^2\, x_t^T x_t \;=\; y_t \theta^T x_t + \|x_t\|^2
\tag{5}
\]
In other words, the value of yt θT xt increases as a result of the update (becomes more
positive). If we consider the same image repeatedly, then we will necessarily change the
parameters such that the image is classified correctly, i.e., the value of yt θT xt becomes
positive. Mistakes on other images may steer the parameters in different directions, however,
so it is not immediately clear that the algorithm converges to something useful if we repeatedly cycle
through the training images.
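
For concreteness, the whole algorithm fits in a few lines. The following is a minimal sketch (function and variable names are assumptions, not from the notes; the update condition uses ≤ 0 so that the very first pass, starting from θ = 0, also triggers updates):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron through the origin: theta <- theta + y_t x_t on mistakes (Eq. 4)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for t in range(n):
            if y[t] * (theta @ X[t]) <= 0:     # mistake (or undecided at exactly 0)
                theta = theta + y[t] * X[t]    # perceptron update
                mistakes += 1
        if mistakes == 0:                      # every training point classified correctly
            break
    return theta
```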

Analysis of the perceptron algorithm

The perceptron algorithm ceases to update the parameters only when all the training
images are classified correctly (no mistakes, no updates). So, if the training images are
possible to classify correctly with a linear classifier, will the perceptron algorithm find such
a classifier? Yes, it does, and it will converge to such a classifier in a finite number of
updates (mistakes). We’ll show this in lecture 2.
LECTURE
ON
PERCEPTRON, CONVERGENCE, AND GENERALIZATION

Perceptron, convergence, and generalization


Recall that we are dealing with linear classifiers through origin, i.e.,

\[
f(x; \theta) = \mathrm{sign}\bigl( \theta^T x \bigr)
\tag{1}
\]

where θ ∈ Rd specifies the parameters that we have to estimate on the basis of training
examples (images) x1 , . . . , xn and labels y1 , . . . , yn .
We will use the perceptron algorithm to solve the estimation task. Let k denote the number
of parameter updates we have performed and θ(k) the parameter vector after k updates.
Initially k = 0 and θ(k) = 0. The algorithm then cycles through all the training instances
(xt , yt ) and updates the parameters only in response to mistakes, i.e., when the label is
predicted incorrectly. More precisely, we set θ(k+1) = θ(k) + yt xt when yt (θ(k) )T xt < 0
(mistake), and otherwise leave the parameters unchanged.

Convergence in a finite number of updates

Let’s now show that the perceptron algorithm indeed converges in a finite number of
updates. The same analysis will also help us understand how the linear classifier generalizes
to unseen images. To this end, we will assume that all the (training) images have bounded
Euclidean norms, i.e., ‖xt ‖ ≤ R for all t and some finite R. This is clearly the case for any
pixel images with bounded intensity values. We also make a much stronger assumption
that there exists a linear classifier in our class with finite parameter values that correctly
classifies all the (training) images. More precisely, we assume that there is some γ > 0 such
that yt (θ∗ )T xt ≥ γ for all t = 1, . . . , n. The additional number γ > 0 is used to ensure that
each example is classified correctly with a finite margin.
The convergence proof is based on combining two results: 1) we will show that the inner
product (θ∗ )T θ(k) increases at least linearly with each update, and 2) the squared norm
‖θ(k) ‖² increases at most linearly in the number of updates k. By combining the two
we can show that the cosine of the angle between θ(k) and θ∗ has to increase by a finite
increment due to each update. Since cosine is bounded by one, it follows that we can only
make a finite number of updates.
Part 1: we simply take the inner product (θ∗ )T θ(k) before and after each update. When
making the k-th update, say due to a mistake on image xt , we get

\[
(\theta^*)^T \theta^{(k)} \;=\; (\theta^*)^T \theta^{(k-1)} + y_t (\theta^*)^T x_t \;\ge\; (\theta^*)^T \theta^{(k-1)} + \gamma
\tag{2}
\]

since, by assumption, yt (θ∗ )T xt ≥ γ for all t (θ∗ is always correct). Thus, after k updates,

\[
(\theta^*)^T \theta^{(k)} \;\ge\; k\gamma
\tag{3}
\]

Part 2: Our second claim follows simply from the fact that updates are made only on
mistakes:

\[
\begin{aligned}
\|\theta^{(k)}\|^2 &= \|\theta^{(k-1)} + y_t x_t\|^2 & (4)\\
&= \|\theta^{(k-1)}\|^2 + 2\, y_t (\theta^{(k-1)})^T x_t + \|x_t\|^2 & (5)\\
&\le \|\theta^{(k-1)}\|^2 + \|x_t\|^2 & (6)\\
&\le \|\theta^{(k-1)}\|^2 + R^2 & (7)
\end{aligned}
\]

since yt (θ(k−1) )T xt < 0 whenever an update is made and, by assumption, ‖xt ‖ ≤ R. Thus,

\[
\|\theta^{(k)}\|^2 \;\le\; k R^2
\tag{8}
\]

We can now combine parts 1) and 2) to bound the cosine of the angle between θ∗ and θ(k) :

\[
\cos\bigl(\theta^*, \theta^{(k)}\bigr) \;=\; \frac{(\theta^*)^T \theta^{(k)}}{\|\theta^{(k)}\| \, \|\theta^*\|}
\;\overset{1)}{\ge}\; \frac{k\gamma}{\|\theta^{(k)}\| \, \|\theta^*\|}
\;\overset{2)}{\ge}\; \frac{k\gamma}{\sqrt{kR^2} \, \|\theta^*\|}
\tag{9}
\]

Since cosine is bounded by one, we get

\[
1 \;\ge\; \frac{k\gamma}{\sqrt{kR^2} \, \|\theta^*\|}
\qquad\text{or}\qquad
k \;\le\; \frac{R^2 \|\theta^*\|^2}{\gamma^2}
\tag{10}
\]

Margin and geometry

It is worthwhile to understand this result a bit further. For example, does ‖θ∗ ‖²/γ² relate
to how difficult the classification problem is? Indeed, it does. We claim that the inverse of its
square root, i.e., γ/‖θ∗ ‖, is the smallest distance in the image space from any example (image) to the
decision boundary specified by θ∗ . In other words, it serves as a measure of how well the
two classes of images are separated (by a linear boundary). We will call this the geometric
margin or γgeom (see Figure 1). Its inverse, 1/γgeom , is then a fair measure of how difficult the problem is:
the smaller the geometric margin that separates the training images, the more difficult the
problem.
To calculate γgeom we measure the distance from the decision boundary θ∗T x = 0 to one of
the images xt for which yt θ∗T xt = γ. Since θ∗ specifies the normal to the decision boundary,

[Figure: points on both sides of the decision boundary θ∗T x = 0, with the normal vector θ∗ and
the geometric margin γgeom shown as the distance from the boundary to the closest points.]
Figure 1: Geometric margin

the shortest path from the boundary to the image xt will be parallel to the normal. The
image for which yt θ∗ T xt = γ is therefore among those closest to the boundary. Now, let’s
define a line segment from x(0) = xt , parallel to θ∗ , towards the boundary. This is given
by
\[
x(\xi) \;=\; x(0) \;-\; \xi \, \frac{y_t \theta^*}{\|\theta^*\|}
\tag{11}
\]
where ξ defines the length of the line segment since it multiplies a unit length vector. It
remains to find the value of ξ such that θ∗ T x(ξ) = 0, or, equivalently, yt θ∗ T x(ξ) = 0. This
is the point where the segment hits the decision boundary. Thus
\[
\begin{aligned}
y_t \theta^{*T} x(\xi) &= y_t \theta^{*T} \Bigl( x(0) - \xi\, \frac{y_t \theta^*}{\|\theta^*\|} \Bigr) & (12)\\
&= y_t \theta^{*T} x_t - \xi\, y_t^2\, \frac{\theta^{*T} \theta^*}{\|\theta^*\|} & (13)\\
&= y_t \theta^{*T} x_t - \xi\, \frac{\|\theta^*\|^2}{\|\theta^*\|} & (14)\\
&= \gamma - \xi \|\theta^*\| \;=\; 0 & (15)
\end{aligned}
\]

implying that the distance is exactly ξ = γ/‖θ∗ ‖ as claimed. As a result, the bound on the
number of perceptron updates can be written more succinctly in terms of the geometric
margin γgeom (distance to the boundary):

\[
k \;\le\; \left( \frac{R}{\gamma_{\text{geom}}} \right)^2
\tag{16}
\]

with the understanding that γgeom is the largest geometric margin that could be achieved
by a linear classifier for this problem. Note that the result does not depend (directly) on
the dimension d of the examples, nor the number of training examples n. It is nevertheless
tempting to interpret (R/γgeom )² as a measure of difficulty (or complexity) of the problem of
learning linear classifiers in this setting. You will see later in the course that this is exactly
the case, cast in terms of a measure known as VC-dimension.
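
A small sketch makes the quantities in the bound tangible: given any separating θ∗, the geometric margin is the smallest signed distance of a training point to the boundary, and (R/γgeom)² is the resulting mistake bound. The function name and the assumption that θ∗ actually separates the data are illustrative, not from the notes.

```python
import numpy as np

def perceptron_mistake_bound(X, y, theta_star):
    """Evaluate (R / gamma_geom)^2 for a parameter vector theta_star assumed to separate X, y."""
    margins = y * (X @ theta_star) / np.linalg.norm(theta_star)  # signed distances to the boundary
    gamma_geom = margins.min()            # geometric margin (positive if theta_star separates the data)
    R = np.linalg.norm(X, axis=1).max()   # bound on the norm of the examples
    return (R / gamma_geom) ** 2
```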

Generalization guarantees

We have so far discussed the perceptron algorithm only in relation to the training set but
we are more interested in how well the perceptron classifies images we have not yet seen,
i.e., how well it generalizes to new images. Our simple analysis above actually provides
some information about generalization. Let’s assume then that all the images and labels
we could possibly encounter satisfy the same two assumptions. In other words, 1) �xt � ≤ R
and 2) yt θ∗ T xt ≥ γ for all t and some finite θ∗ . So, in essence, we assume that there is a
linear classifier that works for all images and labels in this problem, we just don’t know
what this linear classifier is to start with. Let’s now imagine getting the images and labels
one by one and performing only a single update per image, if misclassified, and move on.
The previous situation concerning the training set corresponds to encountering the same
set of images repeatedly. How many mistakes are we now going to make in this infinite
arbitrary sequence of images and labels, subject only to the two assumptions? The same
number k ≤ (R/γgeom )2 . Once we have made this many mistakes we would classify all the
new images correctly. So, provided that the two assumptions hold, especially the second
one, we obtain a nice guarantee of generalization. One caveat here is that the perceptron
algorithm does need to know when it has made a mistake. The bound is after all cast in
terms of the number of updates based on mistakes.

Maximum margin classifier?

We have so far used a simple on-line algorithm, the perceptron algorithm, to estimate a
linear classifier. Our reference assumption has been, however, that there exists a linear
classifier that has a large geometric margin, i.e., whose decision boundary is well separated
from all the training images (examples). Can’t we find such a large margin classifier
directly? Yes, we can. The classifier is known as the Support Vector Machine or SVM for
short. See the next lecture for details.

LECTURE
ON
SUPPORT VECTOR MACHINE

The Support Vector Machine


So far we have used a reference assumption that there exists a linear classifier that has
a large geometric margin, i.e., whose decision boundary is well separated from all the
training images (examples). Such a large margin classifier seems like one we would like to
use. Can’t we find it more directly? Yes, we can. The classifier is known as the Support
Vector Machine or SVM for short.
You could imagine finding the maximum margin linear classifier by first identifying any
classifier that correctly classifies all the examples (Figure 1a) and then increasing the
geometric margin until the classifier “locks in place” at the point where we cannot increase
the margin any further (Figure 1b). The solution is unique.

[Figure: panel a) shows a separating boundary θT x = 0 that passes close to some training points;
panel b) shows the maximum margin boundary θ̂T x = 0 with geometric margin γgeom .]

Figure 1: a) A linear classifier with a small geometric margin, b) maximum margin linear
classifier.

More formally, we can set up an optimization problem for directly maximizing the geometric
margin. We will need the classifier to be correct on all the training examples or yt θT xt ≥ γ
for all t = 1, . . . , n. Subject to these constraints, we would like to maximize γ/‖θ‖, i.e.,
the geometric margin. We can alternatively minimize the inverse ‖θ‖/γ or the inverse
squared (1/2)(‖θ‖/γ)² subject to the same constraints (the factor 1/2 is included merely for
later convenience). We then have the following optimization problem for finding θ̂:

\[
\text{minimize } \; \tfrac{1}{2}\,\|\theta\|^2/\gamma^2
\quad \text{subject to} \quad y_t \theta^T x_t \ge \gamma \;\; \text{for all } t = 1, \ldots, n
\tag{1}
\]

We can simplify this a bit further by getting rid of γ. Let’s first rewrite the optimization
problem in a way that highlights how the solution depends (or doesn’t depend) on γ:

\[
\text{minimize } \; \tfrac{1}{2}\,\|\theta/\gamma\|^2
\quad \text{subject to} \quad y_t (\theta/\gamma)^T x_t \ge 1 \;\; \text{for all } t = 1, \ldots, n
\tag{2}
\]

In other words, our classification problem (the data, setup) only tells us something about
the ratio θ/γ, not θ or γ separately. For example, the geometric margin is defined only on
the basis of this ratio. Scaling θ by a constant also doesn’t change the decision boundary.
We can therefore fix γ = 1 and solve for θ from
\[
\text{minimize } \; \tfrac{1}{2}\,\|\theta\|^2
\quad \text{subject to} \quad y_t \theta^T x_t \ge 1 \;\; \text{for all } t = 1, \ldots, n
\tag{3}
\]

This optimization problem is in the standard SVM form and is a quadratic programming
problem (objective is quadratic in the parameters with linear constraints). The resulting
geometric margin is 1/‖θ̂‖ where θ̂ is the unique solution to the above problem. Neither the
decision boundary (separating hyper-plane) nor the value of the geometric margin was
affected by our choice γ = 1.
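
In practice one would hand this quadratic program to a solver rather than work it out by hand. As an illustration only (not part of the original notes), scikit-learn's linear SVC with a very large C behaves approximately like the hard-margin problem above, apart from also fitting an offset term of the kind introduced in the next section:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, assumed for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

theta_hat = clf.coef_[0]                             # estimated parameter vector
geometric_margin = 1.0 / np.linalg.norm(theta_hat)   # margin = 1 / ||theta_hat||
print(theta_hat, clf.intercept_[0], geometric_margin)
```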

General formulation, offset parameter

We will modify the linear classifier here slightly by adding an offset term so that the decision
boundary does not have to go through the origin. In other words, the classifier that we
consider has the form

f (x; θ, θ0 ) = sign( θT x + θ0 ) (4)

with parameters θ (normal to the separating hyper-plane) and the offset parameter θ0 , a
real number. As before, the equation for the separating hyper-plane is obtained by setting
the argument to the sign function to zero or θT x + θ0 = 0. This is a general equation for a
hyper-plane (line in 2-dim). The additional offset parameter can lead to a classifier with a
larger geometric margin. This is illustrated in Figures 2a and 2b. Note that the vector θ̂
corresponding to the maximum margin solution is different in the two figures.
The offset parameter changes the optimization problem only slightly:
\[
\text{minimize } \; \tfrac{1}{2}\,\|\theta\|^2
\quad \text{subject to} \quad y_t (\theta^T x_t + \theta_0) \ge 1 \;\; \text{for all } t = 1, \ldots, n
\tag{5}
\]
Note that the offset parameter only appears in the constraints. This is different from
simply modifying the linear classifier through origin by feeding it with examples that have
an additional constant component, i.e., x′ = [x; 1]. In the above formulation we do not bias
in any way where the separating hyper-plane should appear, only require that it maximize
the geometric margin.

[Figure: panel a) shows the maximum margin boundary through the origin, θ̂T x = 0; panel b)
shows the maximum margin boundary with an offset, θ̂T x + θ̂0 = 0, which achieves a larger
geometric margin γgeom on the same data.]

Figure 2: a) Maximum margin linear classifier through origin; b) Maximum margin linear
classifier with an offset parameter

Properties of the maximum margin linear classifier

The maximum margin classifier has several very nice properties, and some not so
advantageous features.
Benefits. On the positive side, we have already motivated these classifiers based on the
perceptron algorithm as the “best reference classifiers”. The solution is also unique based
on any linearly separable training set. Moreover, drawing the separating boundary as far
from the training examples as possible makes it robust to noisy examples (though not
noisy labels). The maximum margin linear boundary also has the curious property that
the solution depends only on a subset of the examples, those that appear exactly on the
margin (the dashed lines parallel to the boundary in the figures). The examples that lie
exactly on the margin are called support vectors (see Figure 3). The rest of the examples
could lie anywhere outside the margin without affecting the solution. We would therefore
get the same classifier if we had only received the support vectors as training examples. Is
this a good thing? To answer this question we need a somewhat more formal (and fair) way of
measuring how good a classifier is.
One possible “fair” performance measure evaluated only on the basis of the training
examples is cross-validation. This is simply a method of retraining the classifier with subsets of
training examples and testing it on the remaining held-out (and therefore fair) examples,
pretending we had not seen them before. A particular version of this type of procedure
is called leave-one-out cross-validation. As the name suggests, the procedure is defined as
follows: select each training example in turn as the single example to be held out, train the
classifier on the basis of all the remaining training examples, test the resulting classifier on
the held-out example, and count the errors.

[Figure 3: Support vectors (circled) associated with the maximum margin linear classifier; they
lie exactly on the margin of the decision boundary θ̂T x + θ̂0 = 0.]

More precisely, let the superscript ’−i’ denote
the parameters we would obtain by finding the maximum margin linear separator without
the ith training example. Then

\[
\text{leave-one-out CV error} \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}\bigl( y_i, f(x_i; \theta^{-i}, \theta_0^{-i}) \bigr)
\tag{6}
\]

where Loss(y, y′) is the zero-one loss. We are effectively trying to gauge how well the
classifier would generalize to each training example if it had not been part of the training
set. A classifier that has a low leave-one-out cross-validation error is likely to generalize
well though it is not guaranteed to do so.
Now, what is the leave-one-out CV error of the maximum margin linear classifier? Well,
examples that lie outside the margin would be classified correctly regardless of whether
they are part of the training set. Not so for support vectors. They are key to defining the
linear separator and thus, if removed from the training set, may be misclassified as a result.
We can therefore derive a simple upper bound on the leave-one-out CV error:
\[
\text{leave-one-out CV error} \;\le\; \frac{\#\text{ of support vectors}}{n}
\tag{7}
\]
A small number of support vectors – a sparse solution – is therefore advantageous. This is
another argument in favor of the maximum margin linear separator.
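
The bound in Eq. (7) can be checked numerically by counting the support vectors of the full-data solution and running an explicit leave-one-out loop. The sketch below is illustrative only; it assumes linearly separable data and uses scikit-learn's SVC (with a large C) as a stand-in for the maximum margin classifier.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def loo_error_and_bound(X, y, C=1e6):
    """Compare the leave-one-out CV error with (# of support vectors) / n."""
    n = len(y)
    full = SVC(kernel="linear", C=C).fit(X, y)
    bound = len(full.support_) / n           # upper bound from Eq. (7)

    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
    return errors / n, bound
```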
Problems. There are problems as well, however. Even a single training example, if
mislabeled, can radically change the maximum margin linear classifier. Consider, for example,
what would happen if we switched the label of the top right support vector in Figure 3.

Allowing misclassified examples, relaxation

Labeling errors are common in many practical problems and we should try to mitigate
their effect. We typically do not know whether examples are difficult to classify because
of labeling errors or because they simply are not linearly separable (there isn’t a linear
classifier that can classify them correctly). In either case we have to articulate a trade-off
between misclassifying a training example and the potential benefit for other examples.
Perhaps the simplest way to permit errors in the maximum margin linear classifier is to
introduce “slack” variables for the classification/margin constraints in the optimization
problem. In other words, we measure the degree to which each margin constraint is
violated and associate a cost with the violation. The costs of violating constraints are minimized
together with the norm of the parameter vector. This gives rise to a simple relaxed
optimization problem:

\[
\text{minimize } \; \tfrac{1}{2}\,\|\theta\|^2 + C \sum_{t=1}^{n} \xi_t
\tag{8}
\]
\[
\text{subject to } \; y_t (\theta^T x_t + \theta_0) \ge 1 - \xi_t \;\text{ and }\; \xi_t \ge 0 \;\; \text{for all } t = 1, \ldots, n
\tag{9}
\]

where ξt are the slack variables. The margin constraint is violated when we have to set
ξt > 0 for some example. The penalty for this violation is Cξt and it is traded off against
the possible gain in minimizing the squared norm of the parameter vector, ‖θ‖². If we
increase the penalty C for margin violations then at some point all ξt = 0 and we get back
the maximum margin linear separator (if possible). On the other hand, for small C many
margin constraints can be violated. Note that the relaxed optimization problem specifies
a particular quantitative trade-off between the norm of the parameter vector and margin
violations. It is reasonable to ask whether this is indeed the trade-off we want.
Let’s try to understand the setup a little further. For example, what is the resulting margin
when some of the constraints are violated? We can still take 1/‖θ̂‖ as the geometric margin.
This is indeed the geometric margin based on examples for which ξt∗ = 0, where ‘∗’ indicates
the optimized value. So, is it the case that we get the maximum margin linear classifier
for the subset of examples for which ξt∗ = 0? No, we don’t. The examples that violate the
margin constraints, including those training examples that are actually misclassified (larger
violation), do affect the solution. In other words, the parameter vector θ̂ is defined on the
basis of examples right at the margin, those that violate the constraint but not enough to
be misclassified, and those that are misclassified. All of these are support vectors in this
sense.
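
The role of C is easy to see by re-fitting the relaxed problem with different values: a small C tolerates many margin violations, while a large C approaches the separable solution when one exists. A hedged sketch, again using scikit-learn's SVC, whose C parameter plays the same role as in Eq. (8):

```python
import numpy as np
from sklearn.svm import SVC

def margin_vs_C(X, y, C_values=(0.01, 1.0, 100.0)):
    """Report the margin 1/||theta|| and the number of support vectors as C varies."""
    for C in C_values:
        clf = SVC(kernel="linear", C=C).fit(X, y)
        margin = 1.0 / np.linalg.norm(clf.coef_[0])
        print(f"C={C:g}: margin={margin:.3f}, support vectors={len(clf.support_)}")
```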
LECTURE
ON
ARTIFICIAL NEURAL NETWORKS
What are Neural Networks?
• Neural Networks are networks of neurons, for example, as found in
real (i.e. biological) brains
• Artificial neurons are crude approximations of the neurons found in
real brains. They may be physical devices, or purely mathematical
constructs.
• Artificial Neural Networks (ANNs) are networks of Artificial
Neurons and hence constitute crude approximations to parts of real
brains. They may be physical devices, or simulated on conventional
computers.
• From a practical point of view, an ANN is just a parallel
computational system consisting of many simple processing
elements connected together in a specific way in order to perform a
particular task
• One should never lose sight of how crude the approximations are,
and how over-simplified our ANNs are compared to real brains.

Why are Artificial Neural Networks worth studying?

• They are extremely powerful computational devices


• Massive parallelism makes them very efficient
• They can learn and generalize from training data – so there is no
need for enormous feats of programming
• They are particularly fault tolerant
• They are very noise tolerant – so they can cope with situations
where normal symbolic systems would have difficulty
• In principle, they can do anything a symbolic/logic system can do,
and more

What are Neural Networks used for?
There are two basic goals for neural network research:

Brain modelling: The biological goal of constructing models of how real brains
work. This can potentially help us understand the nature of perception,
actions, learning and memory, thought and intelligence, and/or formulate
medical solutions for brain-damaged patients

Artificial System Construction: The engineering goal of building efficient


systems for real world applications. This may make machines more
powerful and intelligent, relieve humans of tedious tasks, and may even
improve upon human performance.

Both methodologies should be regarded as complementary and not


competing. We often use exactly the same network architectures and
methodologies for both. Progress is made when the two approaches are
allowed to feed one another. There are fundamental differences though,
e.g. the need for biological plausibility in brain modelling, and the need for
computational efficiency in artificial system construction.

Learning Processes in Neural Networks

Among the many interesting properties of a neural network is the


ability of the network to learn from its environment, and to improve
its performance through learning. The improvement in
performance takes place over time in accordance with some
prescribed measure.

A neural network learns about its environment through an iterative


process of adjustments applied to its synaptic weights and
thresholds. Ideally, the network becomes more knowledgeable
about its environment after each iteration of the learning process.

There are three broad types of learning:


1. Supervised learning (i.e. learning with an external teacher)
2. Unsupervised learning (i.e. learning with no help)
3. Reinforcement learning (i.e. learning with limited feedback)

Historical Notes
1943 McCulloch and Pitts proposed the McCulloch-Pitts neuron model
1949 Hebb published his book The Organization of Behaviour, in which the
Hebbian learning rule was introduced
1958 Rosenblatt introduced the simple single layer networks called Perceptrons
1969 Minsky and Papert’s book Perceptrons demonstrated the limitation of
single layer perceptrons
1980 Grossberg introduced his Adaptive Resonance Theory (ART)
1982 Hopfield published a series of papers on Hopfield networks
1982 Kohonen developed the Self-Organizing Feature Maps
1986 Back-propagation learning algorithm for multi-layer perceptrons was re-
discovered, and the whole field took off again
1990s ART-variant networks were developed
1990s Radial Basis Functions were developed
1990s Support Vector Machines were developed

Neural Network Applications

Brain modelling
Aid our understanding of how the brain works, how behaviour emerges from
the interaction of networks of neurons, what needs to “get fixed” in brain
damaged patients

Real world applications


Financial modelling – predicting the stock market
Time series prediction – climate, weather, seizures
Computer games – intelligent agents, chess, backgammon
Robotics – autonomous adaptable robots
Pattern recognition – speech recognition, seismic activity, sonar signals
Data analysis – data compression, data mining
Bioinformatics – DNA sequencing, alignment

The Nervous System

The human nervous system can be broken down into three stages that can be
represented in block diagram form as

Stimulus → Receptors → Neural Net → Effectors → Response

(adapted from Arbib, 1987)

The receptors convert stimuli from the external environment into electrical
impulses that convey information to the neural net (brain)
The effectors convert electrical impulses generated by the neural net into
responses as system outputs
The neural net (brain) continually receives information, perceives it and
makes appropriate decisions.
The flow of information is represented by arrows – feedforward and feedback

Brains vs. Computers

Processing elements: There are 10^14 synapses in the brain,
compared with 10^8 transistors in the computer
Processing speed: 100 Hz for the brain compared to 10^9 Hz for the
computer
Style of computation: The brain computes in parallel and distributed
mode, whereas the computer mostly serially and centralized.
Fault tolerant: The brain is fault tolerant, whereas the computer is not
Adaptive: The brain learns fast, whereas the computer doesn’t even
compare with an infant’s learning capabilities
Intelligence and consciousness: The brain is highly intelligent and
conscious, whereas the computer shows lack of intelligence
Evolution: Brains have been evolving for tens of millions of years, whereas
computers have been evolving for decades.

Levels of Organization in the Brain

In the brain there are both small-scale and large-scale anatomical


organizations, and different functions take place at lower and
higher levels.

There is a hierarchy of interwoven levels of organization:


1. Behaviour
2. Systems
3. Microcircuits
4. Neurons
5. Dendrites
6. Synapses
7. Molecules
8. Genes

Microscopic View of the Nervous System

• Nervous system is made up of


cells
• A cell has a fatty membrane,
which is filled with liquid and
proteins known as cytoplasm as
well as smaller functional parts
called organelles
• There are two major types of
brain cells: (1) neurons, and (2)
glia
• Neurons are the principal
elements involved in
information processing in the
brain
• Glia provide support and
homeostasis to neurons.

Schematic Diagram of a Biological Neuron

Basic Components of Biological Neurons

• The majority of neurons encode their activation or outputs as a


series of brief electrical pulses (i.e. spikes or action potentials)
• The neuron’s cell body (soma) processes the incoming activations
and converts them into output activations
• The neuron’s nucleus contains the genetic material (DNA)
• Dendrites are fibres which emanate from the cell body and provide
the receptive zone that receive activation from other neurons
• Axons are fibres acting as transmission lines that send action
potentials to other neurons
• The junctions that allow signal transmission between the axons and
the dendrites are called synapses. The process of transmission is
by diffusion of chemicals called neurotransmitters across the
synaptic cleft.

The McCulloch-Pitts Neuron

• This vastly simplified model of real neurons is also known as a Threshold


Logic Unit:
[Figure: inputs I1 , I2 , I3 , . . . , In enter through weights Wj1 , . . . , Wjn , are summed into the
activation Aj , and produce the output Yj .]

1. A set of synapses (i.e. connections) brings in activations from other


neurons
2. A processing unit sums the inputs, and then applies a non-linear
activation function
3. An output line transmits the result to other neurons

How the Model Neuron Works

• Each input Ii is multiplied by a weight wji (synaptic strength)


• These weighted inputs are summed to give the activation level, Aj
• The activation level is then transformed by an activation function to
produce the neuron’s output, Yj
• Wji is known as the weight from unit i to unit j
– Wji > 0, synapse is excitatory
– Wji < 0, synapse is inhibitory
• Note that Ii may be
– External input
– The output of some other neuron

The McCulloch-Pitts Neuron Equation

We can now write down the equation for the output Yj of a McCulloch-Pitts
neuron as a function of its inputs Ii:

Yj = sgn( ∑_{i=1..n} Ii − θ )

where θ is the neuron’s activation threshold. Then

Yj = 1 if ∑_{k=1..n} Ik ≥ θ,   and   Yj = 0 if ∑_{k=1..n} Ik < θ

Note that the McCulloch-Pitts neuron is an extremely simplified model of


real biological neurons. Nevertheless, they are computationally very
powerful. One can show that assemblies of such neurons are capable of
universal computation.

Recap: McCulloch-Pitts Neuron

• This vastly simplified model of real neurons is also known as a Threshold


Logic Unit:
[Figure: inputs I1 , I2 , I3 , . . . , In enter through weights Wj1 , . . . , Wjn , are summed into the
activation Aj , and produce the output Yj .]

1. A set of synapses (i.e. connections) brings in activations from other


neurons
2. A processing unit sums the inputs, and then applies a non-linear
activation function
3. An output line transmits the result to other neurons

Networks of McCulloch-Pitts Neurons

One neuron can’t do much on its own. Usually we will have many neurons
labelled by indices k, i, j, and activation flows between them via synapses with
strengths wki , wij :

[Figure: neuron k sends the input Iki to neuron i, which sums its inputs, applies its threshold θi ,
and sends its output Yi on to neuron j through the synapse ij with weight wij .]

Iki = Yk ⋅ wki        Yi = sgn( ∑_{k=1..n} Iki − θi )        Iij = Yi ⋅ wij

The Perceptron
We can connect any number of McCulloch-Pitts neurons together in
any way we like

An arrangement of one input layer of McCulloch-Pitts neurons feeding


forward to one output layer of McCulloch-Pitts neurons is known as
a Perceptron.

[Figure: N input units i = 1, . . . , N fully connected through weights wij to M output units
j = 1, . . . , M with thresholds θj .]

Yj = sgn( ∑_{i=1..N} Yi ⋅ wij − θj )

Implementing Logic Gates with MP Neurons

We can use McCulloch-Pitts neurons to implement the basic logic


gates (e.g. AND, OR, NOT).

It is well known from logic that we can construct any logical function
from these three basic logic gates.

All we need to do is find the appropriate connection weights and neuron


thresholds to produce the right outputs for each set of inputs.

We shall see explicitly how one can construct simple networks that
perform NOT, AND, and OR.

Implementation of Logical NOT, AND, and OR

 NOT:  in  out        AND:  in1  in2  out        OR:  in1  in2  out
        0   1                 0    0    0               0    0    0
        1   0                 0    1    0               0    1    1
                              1    0    0               1    0    1
                              1    1    1               1    1    1

[Figure: small networks with unknown weights and thresholds (marked ‘?’) that must be trained
to implement these gates.]

Problem: Train network to calculate the appropriate weights and thresholds in


order to classify correctly the different classes (i.e. form decision boundaries
between classes).

Decision Surfaces

Decision surface is the surface at which the output of the unit is


precisely equal to the threshold, i.e. ∑wiIi=θ

In 1-D the surface is just a point:


[Figure: the 1-D decision surface is the single point I1 = θ/w1 on the I1 axis, with Y = 0 on one
side and Y = 1 on the other.]

In 2-D, the surface is

I1 ⋅ w1 + I 2 ⋅ w2 − θ = 0

which we can re-write as


I2 = −(w1 /w2 ) I1 + θ/w2

So, in 2-D the decision boundaries are always straight lines.
Decision Boundaries for AND and OR

We can now plot the decision boundaries of our logic gates


AND: w1 = 1, w2 = 1, θ = 1.5          OR: w1 = 1, w2 = 1, θ = 0.5

 AND:  I1  I2  out        OR:  I1  I2  out
        0   0   0               0   0   0
        0   1   0               0   1   1
        1   0   0               1   0   1
        1   1   1               1   1   1

[Figure: for each gate, the four input points (0,0), (0,1), (1,0), (1,1) are plotted in the (I1 , I2 )
plane; a single straight line separates the 0-outputs from the 1-outputs.]
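
These weights and thresholds can be plugged straight into a McCulloch-Pitts unit. The short sketch below (illustrative code, not from the slides) reproduces the AND and OR truth tables with w1 = w2 = 1 and thresholds 1.5 and 0.5:

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts unit: output 1 if the weighted input sum reaches the threshold."""
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= theta else 0

# AND uses theta = 1.5, OR uses theta = 0.5, both with weights (1, 1).
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2,
              mp_neuron((i1, i2), (1, 1), 1.5),   # AND output
              mp_neuron((i1, i2), (1, 1), 0.5))   # OR output
```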

Decision Boundary for XOR

The difficulty in dealing with XOR is rather obvious. We need two straight
lines to separate the different outputs/decisions:
 XOR:  I1  I2  out
        0   0   0
        0   1   1
        1   0   1
        1   1   0

[Figure: the four XOR input points plotted in the (I1 , I2 ) plane; no single straight line can
separate the 1-outputs (0,1), (1,0) from the 0-outputs (0,0), (1,1).]

Solution: either change the transfer function so that it has more than one
decision boundary, or use a more complex network that is able to generate
more complex decision boundaries.
ANN Architectures
Mathematically, ANNs can be represented as weighted directed graphs. The
most common ANN architectures are:

Single-Layer Feed-Forward NNs: One input layer and one output layer of
processing units. No feedback connections (e.g. a Perceptron)

Multi-Layer Feed-Forward NNs: One input layer, one output layer, and one or
more hidden layers of processing units. No feedback connections (e.g. a
Multi-Layer Perceptron)

Recurrent NNs: Any network with at least one feedback connection. It may, or
may not, have hidden units

Further interesting variations include: sparse connections, time-delayed


connections, moving windows, …

Examples of Network Architectures

[Figure: three example architectures, a Single-Layer Feed-Forward network, a Multi-Layer
Feed-Forward network, and a Recurrent Network.]

Types of Activation/Transfer Function

Threshold Function:

    f(x) = 1 if x ≥ 0,   f(x) = 0 if x < 0

Piecewise-Linear Function:

    f(x) = 1 if x ≥ 0.5,   f(x) = x + 0.5 if −0.5 ≤ x ≤ 0.5,   f(x) = 0 if x ≤ −0.5

Sigmoid Function:

    f(x) = 1 / (1 + e^(−x))
The Threshold as a Special Kind of Weight
The basic Perceptron equation can be simplified if we consider that the
threshold is another connection weight:
∑_{i=1..n} Ii ⋅ wij − θj = I1 ⋅ w1j + I2 ⋅ w2j + … + In ⋅ wnj − θj

If we define w0j = −θj and I0 = 1, then

∑_{i=1..n} Ii ⋅ wij − θj = I1 ⋅ w1j + I2 ⋅ w2j + … + In ⋅ wnj + I0 ⋅ w0j = ∑_{i=0..n} Ii ⋅ wij

The Perceptron equation then becomes

Yj = sgn( ∑_{i=1..n} Ii ⋅ wij − θj ) = sgn( ∑_{i=0..n} Ii ⋅ wij )

So, we only have to compute the weights.

Example: A Classification Task

A typical neural network application is classification. Consider the simple example of


classifying trucks given their masses and lengths:

    Mass   Length   Class
    10.0     6      Lorry
    20.0     5      Lorry
     5.0     4      Van
     2.0     5      Van
     2.0     5      Van
     3.0     6      Lorry
    10.0     7      Lorry
    15.0     8      Lorry
     5.0     9      Lorry

How do we construct a neural network that can classify any Lorry and Van?
Cookbook Recipe for Building Neural Networks
Formulating neural network solutions for particular problems is a multi-stage
process:
1. Understand and specify the problem in terms of inputs and required
outputs
2. Take the simplest form of network you think might be able to solve your
problem
3. Try to find the appropriate connection weights (including neuron
thresholds) so that the network produces the right outputs for each input
in its training data
4. Make sure that the network works on its training data and test its
generalization by checking its performance on new testing data
5. If the network doesn’t perform well enough, go back to stage 3 and try
harder
6. If the network still doesn’t perform well enough, go back to stage 2 and
try harder
7. If the network still doesn’t perform well enough, go back to stage 1 and
try harder
8. Problem solved – or not

Building a Neural Network (stages 1 & 2)

For our truck example, our inputs can be direct encodings of the masses and
lengths. Generally we would have one output unit for each class, with
activation 1 for ‘yes’ and 0 for ‘no’. In our example, we still have one output
unit, but the activation 1 corresponds to ‘lorry’ and 0 to ‘van’ (or vice versa).
The simplest network we should try first is the single layer Perceptron. We
can further simplify things by replacing the threshold by an extra weight as
we discussed before. This gives us:

Class = sgn( w0 + w1 ⋅ Mass + w2 ⋅ Length )

[Figure: a single output unit with inputs 1, Mass, and Length connected through the weights
w0 , w1 , and w2 .]

Training the Neural Network (stage 3)
Whether our neural network is a simple Perceptron, or a much more
complicated multi-layer network, we need to develop a systematic
procedure for determining appropriate connection weights.

The common procedure is to have the network learn the appropriate


weights from a representative set of training data.

For classifications a simple Perceptron uses decision boundaries (lines


or hyperplanes), which it shifts around until each training pattern is
correctly classified.

The process of “shifting around” in a systematic way is called learning.

The learning process can then be divided into a number of small steps.

Supervised Training

1. Generate a training pair or pattern:


- an input x = [ x1 x2 … xn]
- a target output ytarget (known/given)
2. Then, present the network with x and allow it to generate an
output y
3. Compare y with ytarget to compute the error
4. Adjust weights, w, to reduce error
5. Repeat 2-4 multiple times

Perceptron Learning Rule
1. Initialize weights at random
2. For each training pair/pattern (x, ytarget)
- Compute output y
- Compute error, δ=(ytarget – y)
- Use the error to update weights as follows:
∆w = wnew − wold = η*δ*x   or   wnew = wold + η*δ*x
where η is called the learning rate or step size and it determines how
smoothly the learning process takes place.
3. Repeat 2 until convergence (i.e. error δ is zero)

The Perceptron Learning Rule is then given by


wnew = wold + η*δ*x
where
δ=(ytarget – y)
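
Combining this rule with the earlier truck data gives a complete worked example. The sketch below is illustrative only (the learning rate, initial weight range, and epoch limit are assumptions); it learns w0, w1, w2 for Class = sgn(w0 + w1·Mass + w2·Length), using the extra input I0 = 1 in place of the threshold:

```python
import numpy as np

# Truck data from the earlier slide: mass, length, and class (1 = Lorry, 0 = Van).
data = np.array([[10.0, 6], [20.0, 5], [5.0, 4], [2.0, 5], [2.0, 5],
                 [3.0, 6], [10.0, 7], [15.0, 8], [5.0, 9]])
targets = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1])

X = np.hstack([np.ones((len(data), 1)), data])     # prepend I0 = 1 so w0 replaces the threshold
w = np.random.uniform(-0.5, 0.5, size=3)           # small random initial weights
eta = 0.1                                          # learning rate

for epoch in range(10000):                         # the data is separable, so the loop terminates
    total_error = 0
    for x, y_target in zip(X, targets):
        y = 1 if w @ x > 0 else 0                  # threshold output
        delta = y_target - y                       # error for this pattern
        w = w + eta * delta * x                    # perceptron learning rule
        total_error += abs(delta)
    if total_error == 0:                           # converged: every pattern classified correctly
        break
print(w)
```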

Mathematical Preliminaries: Vector Notation

Vectors appear in lowercase bold font


e.g. input vector: x = [x0 x1 x2 … xn]

Dot product of two vectors:


w·x = w0 x0 + w1 x1 + … + wn xn = ∑_{i=0..n} wi xi

E.g.: x = [1,2,3], y = [4,5,6]:  x·y = (1*4)+(2*5)+(3*6) = 4+10+18 = 32

Review of the McCulloch-Pitts/Perceptron Model

[Figure: inputs I1 , I2 , I3 , . . . , In enter through weights Wj1 , . . . , Wjn , are summed into the
activation Aj , and produce the output Yj .]

Neuron sums its weighted inputs:

w0 x0 + w1 x1 + … + wn xn = ∑_{i=0..n} wi xi = w·x

Neuron applies a threshold activation function:

y = f(w·x)
where, e.g., f(w·x) = +1 if w·x > 0 and f(w·x) = −1 if w·x ≤ 0

Review of Geometrical Interpretation
[Figure: the (X1 , X2 ) input plane split by the line w·x = 0, with Y = 1 on one side and Y = −1
on the other.]

Neuron defines two regions in input space where it outputs −1 and 1.
The regions are separated by a hyperplane w·x = 0 (i.e. the decision
boundary)

Review of Supervised Learning
[Figure: a Generator produces inputs x; a Supervisor provides the target output ytarget for
each x; the Learning Machine sees x and produces its own output y.]

Training: Learn from training pairs (x, ytarget)


Testing: Given x, output a value y close to the supervisor’s output ytarget

Learning by Error Minimization

The Perceptron Learning Rule is an algorithm for adjusting the network


weights w to minimize the difference between the actual and the
desired outputs.

We can define a Cost Function to quantify this difference:

E(w) = (1/2) ∑_p ∑_j ( y_desired − y )²

Intuition:
• Square makes error positive and penalises large errors more
• ½ just makes the math easier
• Need to change the weights to minimize the error – How?
• Use principle of Gradient Descent

Principle of Gradient Descent

Gradient descent is an optimization algorithm that approaches a local


minimum of a function by taking steps proportional to the negative of
the gradient of the function at the current point.

So, calculate the derivative (gradient) of the Cost Function with respect
to the weights, and then change each weight by a small increment in
the negative (opposite) direction to the gradient

∂E/∂w = (∂E/∂y) ⋅ (∂y/∂w) = −( y_desired − y ) x = −δ x

To reduce E by gradient descent, move/increment weights in the


negative direction to the gradient, -(-δx)= +δx

Graphical Representation of Gradient Descent

Widrow-Hoff Learning Rule
(Delta Rule)

∆w = w − wold = −η ∂E/∂w = +η δ x,   or   w = wold + η δ x

where δ = ytarget – y and η is a constant that controls the learning rate


(amount of increment/update ∆w at each training step).
Note: Delta rule (DR) is similar to the Perceptron Learning Rule
(PLR), with some differences:
1. Error (δ) in DR is not restricted to having values of 0, 1, or -1
(as in PLR), but may have any value
2. DR can be derived for any differentiable output/activation
function f, whereas the PLR only works for a threshold output
function
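
As a minimal sketch (illustrative names and data handling, not from the slides), the delta rule for a single linear unit y = w·x looks as follows; because the output is linear, −δx is exactly the gradient of the squared error for each pattern:

```python
import numpy as np

def delta_rule(X, y_target, eta=0.01, epochs=200):
    """Online delta-rule training of a single linear unit y = w . x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, y_target):
            y = w @ x                  # linear output (differentiable, unlike the threshold)
            w = w + eta * (t - y) * x  # move down the gradient of (t - y)^2 / 2
    return w
```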

Convergence of PLR/DR

The weight changes ∆wij need to be applied repeatedly for each weight wij in
the network and for each training pattern in the training set.

One pass through all the weights for the whole training set is called an epoch
of training.

After many epochs, the network outputs match the targets for all the training
patterns, all the ∆wij are zero and the training process ceases. We then say
that the training process has converged to a solution.

It has been shown that if a set of weights that solves the problem correctly exists for a
Perceptron, then the Perceptron Learning Rule/Delta Rule (PLR/DR) will find such
weights in a finite number of iterations.

Furthermore, if the problem is linearly separable, then the PLR/DR will find a
set of weights in a finite number of iterations that solves the problem
correctly.

Review of Gradient Descent Learning
1. The purpose of neural network training is to minimize the output
errors on a particular set of training data by adjusting the network
weights w
2. We define a Cost Function E(w) that measures how far the
current network’s output is from the desired one
3. Partial derivatives of the cost function ∂E(w)/∂w tell us which
direction we need to move in weight space to reduce the error
4. The learning rate η specifies the step sizes we take in weight
space for each iteration of the weight update equation
5. We keep stepping through weight space until the errors are ‘small
enough’.

Review of Perceptron Training
1. Generate a training pair or pattern x that you wish your network to
learn
2. Set up your network with N input units fully connected to M output
units.
3. Initialize weights, w, at random
4. Select an appropriate error function E(w) and learning rate η
5. Apply the weight change ∆w = - η ∂E(w)/ ∂w to each weight w for
each training pattern p. One set of updates for all the weights for
all the training patterns is called one epoch of training.
6. Repeat step 5 until the network error function is ‘small enough’

Review of XOR and Linear Separability

Recall that it is not possible to find weights that enable Single Layer Perceptrons
to deal with non-linearly separable problems like XOR

 XOR:  in1  in2  out
        0    0    0
        0    1    1
        1    0    1
        1    1    0

[Figure: the four XOR points in the (I1 , I2 ) plane; the two classes cannot be separated by a
single straight line.]

The proposed solution was to use a more complex network that is able to
generate more complex decision boundaries. That network is the Multi-Layer
Perceptron.
Multi-Layer Perceptrons (MLPs)

[Figure: input layer units Xi feed a hidden layer of units Oj , which in turn feed an output layer
of units Yk .]

Hidden layer, j:    Oj = f( ∑_i wij ⋅ Xi )
Output layer, k:    Yk = f( ∑_j wjk ⋅ Oj )
Can We Use a Generalized Form of the
PLR/Delta Rule to Train the MLP?
Recall the PLR/Delta rule: Adjust neuron weights to reduce error at neuron’s
output:
w = wold + η δ x    where    δ = ytarget − y

Main problem: How to adjust the weights in the hidden layer, so they reduce
the error in the output layer, when there is no specified target response in the
hidden layer?
Solution: Alter the non-linear Perceptron (discrete threshold) activation
function to make it differentiable and hence, help derive Generalized DR for
MLP training.
[Figure: the discrete threshold function (step between −1 and +1) next to the smooth sigmoid
function that replaces it.]
Sigmoid (S-shaped) Function Properties

• Approximates the threshold function


• Smoothly differentiable (everywhere), and hence DR applicable
• Positive slope
• Popular choice is
1
y = f (a) =
1 + e− a

• Derivative of sigmoidal function is:


f ' (a) = f (a) ⋅ (1 − f (a))

Weight Update Rule

Generally, weight change from any unit j to unit k by gradient descent (i.e.
weight change by small increment in negative direction to the gradient)
is now called Generalized Delta Rule (GDR or Backpropagation):

∆w = w − wold = −η ∂E/∂w = +η δ x

So, the weight change from input layer unit i to hidden layer unit j is:

∆wij = η ⋅ δj ⋅ xi    where    δj = oj (1 − oj ) ∑_k wjk ⋅ δk

The weight change from hidden layer unit j to output layer unit k is:

∆wjk = η ⋅ δk ⋅ oj    where    δk = ( ytarget,k − yk ) yk (1 − yk )

Training of a 2-Layer Feed Forward Network

1. Take the set of training patterns you wish the network to learn
2. Set up the network with N input units fully connected to M hidden non-
linear hidden units via connections with weights wij, which in turn are
fully connected to P output units via connections with weights wjk
3. Generate random initial weights, e.g. from range [-wt, +wt]
4. Select appropriate error function E(wjk) and learning rate η
5. Apply the weight update equation ∆wjk=-η∂E(wjk)/∂wjk to each weight
wjk for each training pattern p.
6. Do the same to all hidden layers.
7. Repeat step 5-6 until the network error function is ‘small enough’

Graphical Representation of GDR

[Figure: the total error plotted against a weight wi , showing a local minimum, the global
minimum, and the ideal weight value.]
Practical Considerations for Learning Rules

There are a number of important issues about training single layer neural
networks that need further resolving:
1. Do we need to pre-process the training data? If so, how?
2. How do we choose the initial weights from which we start the training?
3. How do we choose an appropriate learning rate η?
4. Should we change the weights after each training pattern, or after the
whole set?
5. Are some activation/transfer functions better than others?
6. How do we avoid local minima in the error function?
7. How do we know when we should stop the training?
8. How many hidden units do we need?
9. Should we have different learning rates for the different layers?

We shall now consider each of these issues one by one.

Pre-processing of the Training Data
In principle, we can just use any raw input-output data to train our networks.
However, in practice, it often helps the network to learn appropriately if we
carry out some pre-processing of the training data before feeding it to the
network.

We should make sure that the training data is representative – it should not
contain too many examples of one type at the expense of another. On the
other hand, if one class of pattern is easy to learn, having large numbers of
patterns from that class in the training set will only slow down the over-all
learning process.

Choosing the Initial Weight Values
The gradient descent learning algorithm treats all the weights in the same way,
so if we start them all off with the same values, all the hidden units will end
up doing the same thing and the network will never learn properly.

For that reason, we generally start off all the weights with small random values.
Usually we take them from a flat distribution around zero [–wt, +wt], or from
a Gaussian distribution around zero with standard deviation wt.

Choosing a good value of wt can be difficult. Generally, it is a good idea to


make it as large as you can without saturating any of the sigmoids.

We usually hope that the final network performance will be independent of the
choice of initial weights, but we need to check this by training the network
from a number of different random initial weight sets.

Choosing the Learning Rate

Choosing a good value for the learning rate η is constrained by two opposing
facts:
1. If η is too small, it will take too long to get anywhere near the minimum
of the error function.
2. If η is too large, the weight updates will over-shoot the error minimum
and the weights will oscillate, or even diverge.

Unfortunately, the optimal value is very problem and network dependent, so one
cannot formulate reliable general prescriptions.

Generally, one should try a range of different values (e.g. η = 0.1, 0.01, 1.0,
0.0001) and use the results as a guide.

Batch Training vs. On-line Training

Batch Training: update the weights after all training patterns have
been presented.

On-line Training (or Sequential Training): A natural alternative is to


update all the weights immediately after processing each training
pattern.

On-line learning does not perform true gradient descent, and the
individual weight changes can be rather erratic. Normally a much
lower learning rate η will be necessary than for batch learning.
However, because each weight now has N updates (where N is the
number of patterns) per epoch, rather than just one, overall the
learning is often much quicker. This is particularly true if there is a
lot of redundancy in the training data, i.e. many training patterns
containing similar information.

Choosing the Transfer Function

We have already seen that having a differentiable transfer/activation


function is important for the gradient descent algorithm to work. We
have also seen that, in terms of computational efficiency, the
standard sigmoid (i.e. logistic function) is a particularly convenient
replacement for the step function of the Simple Perceptron.

The logistic function ranges from 0 to 1. There is some evidence that


an anti-symmetric transfer function, i.e. one that satisfies f(–x) = –
f(x), enables the gradient descent algorithm to learn faster.

When the outputs are required to be non-binary, i.e. continuous real


values, having sigmoidal transfer functions no longer makes sense.
In these cases, a simple linear transfer function f(x) = x is
appropriate.

Local Minima
Cost functions can quite easily have more than one minimum:

If we start off in the vicinity of the local minimum, we may end up at the local
minimum rather than the global minimum. Starting with a range of different initial
weight sets increases our chances of finding the global minimum. Any variation
from true gradient descent will also increase our chances of stepping into the
deeper valley.

When to Stop Training
The Sigmoid(x) function only takes on its extreme values of 0 and 1 at x = ±∞.
In effect, this means that the network can only achieve its binary targets
when at least some of its weights reach ±∞. So, given finite gradient
descent step sizes, our networks will never reach their binary targets.

Even if we offset the targets (to 0.1 and 0.9, say) we will generally require an
infinite number of increasingly small gradient descent steps to achieve
those targets.

Clearly, if the training algorithm can never actually reach the minimum, we
have to stop the training process when it is ‘near enough’. What constitutes
‘near enough’ depends on the problem. If we have binary targets, it might
be enough that all outputs are within 0.1 (say) of their targets. Or, it might
be easier to stop the training when the sum squared error function becomes
less than a particular small value (0.2 say).

How Many Hidden Units?
The best number of hidden units depends in a complex way on many factors,
including:
1. The number of training patterns
2. The numbers of input and output units
3. The amount of noise in the training data
4. The complexity of the function or classification to be learned
5. The type of hidden unit activation function
6. The training algorithm

Too few hidden units will generally leave high training and generalisation errors
due to under-fitting. Too many hidden units will result in low training errors,
but will make the training unnecessarily slow, and will result in poor
generalisation unless some other technique (such as regularisation) is
used to prevent over-fitting.

Virtually all “rules of thumb” you hear about are actually nonsense. A sensible
strategy is to try a range of numbers of hidden units and see which works
best.

Different Learning Rates for Different Layers?
A network as a whole will usually learn most efficiently if all its neurons are
learning at roughly the same speed. So maybe different parts of the network
should have different learning rates η. There are a number of factors that
may affect the choices:
1. The later network layers (nearer the outputs) will tend to have larger local
gradients (deltas) than the earlier layers (nearer the inputs).
2. The activations of units with many connections feeding into or out of them tend to
change faster than units with fewer connections.
3. Activations required for linear units will be different from those for Sigmoidal units.
4. There is empirical evidence that it helps to have different learning rates η for the
thresholds/biases compared with the real connection weights.

In practice, it is often quicker to just use the same rates η for all the weights
and thresholds, rather than spending time trying to work out appropriate
differences. A very powerful approach is to use evolutionary strategies to
determine good learning rates.

References

1. Tommi Jaakkola, Course Materials for 6.867 Machine Learning, Fall 2006.
   MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology, USA
2. Lecture Notes of Professor Nando de Freitas, University of Oxford, UK
3. Lecture Notes of Professor Kevin Swingler and Professor Bruce Graham,
   University of Stirling, UK
4. Professor Osvaldo Simeone, A Brief Introduction to Machine Learning for Engineers,
   Department of Informatics, King's College London
