
Supervised Learning with Neural Networks

We now look at how an agent might learn to solve a general problem by seeing examples. Aims:

to present an outline of supervised learning as part of AI; to introduce much of the notation and terminology used; to introduce the classical perceptron, and to show how it can be applied more generally using kernels; to introduce multilayer perceptrons and the backpropagation algorithm for training them.

Reading: Russell and Norvig, chapters 18 and 19.

Copyright © Sean Holden 2002-2005.

An example

A common source of problems in AI is medical diagnosis. Imagine that we want to automate the diagnosis of an embarrassing disease (call it D) by constructing a machine: measurements taken from the patient (e.g. heart rate, blood pressure, presence of green spots, etc.) are fed into the machine, which outputs 1 if the patient suffers from D and 0 otherwise.

Could we do this by explicitly writing a program that examines the measurements and outputs a diagnosis? Experience suggests that this is unlikely.

An example, continued...

Let's look at an alternative approach. Each collection of measurements can be written as a vector

x = (x_1, x_2, ..., x_n)

where, for example,

x_1 = heart rate
x_2 = blood pressure
x_3 = 1 if the patient has green spots, 0 otherwise

and so on.

An example, continued...

A vector of this kind contains all the measurements for a single patient and is generally called a feature vector or instance. The measurements are usually referred to as attributes or features. (Technically, there is a difference between an attribute and a feature, but we won't have time to explore it.) Attributes or features generally appear as one of three basic types:

binary: x_i ∈ {0, 1} or x_i ∈ {-1, +1};

discrete: x_i can take one of a finite number of values, say x_i ∈ {v_1, v_2, ..., v_k};

continuous: x_i ∈ R or x_i ∈ [a, b].

An example, continued...

Now imagine that we have a large collection of patient histories (m in total) and for each of these we know whether or not the patient suffered from D. The i-th patient history gives us an instance x_i. This can be paired with a single bit, y_i = 1 or y_i = 0, denoting whether or not the i-th patient suffers from D. The resulting pair (x_i, y_i) is called an example or a labelled example. Collecting all the examples together we obtain a training sequence

s = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m)).

An example, continued...

In the form of machine learning to be emphasised here, we aim to design a learning algorithm L which takes the training sequence s and produces a hypothesis h:

s → Learning Algorithm L → h.

Intuitively, a hypothesis is something that lets us diagnose new patients.

But what's a hypothesis?

Denote by X the set of all possible instances:

X = {x : x is a possible instance}.

Then a hypothesis h could simply be a function from X to {0, 1}:

h : X → {0, 1}.

In other words, h can take any instance and produce a 0 or a 1, depending on whether (according to h) the patient the measurements were taken from is suffering from D.

But what's a hypothesis?

There are some important issues here, which are effectively right at the heart of machine learning. Most importantly: the hypothesis h can assign a 0 or a 1 to any x ∈ X. This includes instances x that did not appear in the training sequence. The overall process therefore involves rather more than memorising the training sequence.

Ideally, the aim is that h should be able to generalize. That is, it should be possible to use it to diagnose new patients.

But what's a hypothesis?

In fact we need a slightly more flexible definition of a hypothesis:

a hypothesis is a function from X to some suitable set C,

because:

there may be more than two classes, for example

C = {No disease, Disease 1, Disease 2, Disease 3};

or, we might want h(x) to indicate how likely it is that the patient has disease D, for example

C = [0, 1],

where 1 denotes "definitely does have the disease" and 0 denotes "definitely does not have it". An intermediate value close to 1 might for example denote that the patient is reasonably certain to have the disease.

But what's a hypothesis?

One way of thinking about the previous case is in terms of probabilities:

h(x) = Pr(x is in class "Disease").

We may also have C = R. For example, x might contain several recent measurements of a currency exchange rate, with h(x) intended as a prediction of what the rate will be some fixed number of minutes later. Such problems are generally known as regression problems.

Types of learning

The form of machine learning described is called supervised learning. This introduction will concentrate on this kind of learning. In particular, the literature also discusses:

1. unsupervised learning;
2. learning using membership queries and equivalence queries;
3. reinforcement learning.

Some further examples

Speech recognition. Deciding whether or not to give credit. Detecting credit card fraud. Deciding whether to buy or sell a stock option. Deciding whether a tumour is benign. Data mining: that is, extracting interesting but hidden knowledge from existing, large databases, for example databases containing financial transactions or loan applications. Deciding whether driving conditions are dangerous. Automatic driving. (See Pomerleau, 1989, in which a car is driven for 90 miles at 70 miles per hour on a public road, with other cars present, but with no assistance from humans!) Playing games. (For example, see Tesauro, 1992 and 1995, where a world-class backgammon player is described.)

What have we got so far?

Extracting what we have so far, we get the following central ideas:

A collection X of possible instances x.

A collection C of classes to which any instance can belong. In some scenarios we might instead have C = R.

A training sequence containing m labelled examples,

s = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m)),

with x_i ∈ X and y_i ∈ C for i = 1, 2, ..., m.

A learning algorithm L which takes s and produces a hypothesis h. We can write h = L(s).

What have we got so far?

But we need some more ideas as well:

We often need to state what kinds of hypotheses are available to L. The collection of hypotheses available to L is called the hypothesis space and denoted H.

Some learning algorithms do not always return the same h for each run on a given training sequence s. In this case L(s) is a probability distribution on H.

What's the right answer?

We may sometimes assume that there is a correct function that governs the relationship between the x's and the labels. This is called the target concept and is denoted c. It can be regarded as the perfect hypothesis, and so we have c ∈ H.

This is not always a sufficient way of thinking about the problem, as we'll see...

Generalization

The learning algorithm never gets to know exactly what the correct relationship between instances and classes is: it only ever gets to see a finite number of examples. So:

generalization corresponds to the ability of L to pick a hypothesis which is close in some sense to the best possible; however, we have to be careful about what "close" means here. For example, what if some instances are much more likely than others?

Generalization performance

How can generalization performance be assessed?

Model the generation of training examples using a probability distribution P on X × C. All examples are assumed to be independent and identically distributed (i.i.d.) according to P.

Given a hypothesis h and any example (x, y) we can introduce a measure ℓ(h(x), y) of the error that h makes in classifying that example. For example, for a classification problem when C = {0, 1} the definition

ℓ(h(x), y) = 0 if h(x) = y, 1 otherwise

might be appropriate, while for regression problems we might use

ℓ(h(x), y) = (h(x) - y)².

Generalization performance

A reasonable definition of generalization performance is then the expected error

er(h) = E[ℓ(h(x), y)],

where the expectation is taken with respect to P. In the case of the definition of ℓ for classification problems given on the previous slide this gives

er(h) = Pr(h(x) ≠ y),

the probability that h misclassifies an example drawn according to P. In the case of the definition of ℓ given for regression problems, er(h) is the expected square of the difference between true label and predicted label.
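The definition of er(h) can be made concrete with a small simulation. The sketch below is not from the slides: the distribution, target relationship and hypothesis are invented for illustration. It estimates er(h) under the 0/1 loss as the average error over a large i.i.d. sample, which by the law of large numbers converges to Pr(h(x) ≠ y).

```python
import random

def sample_example(rng):
    # Hypothetical setup: x is uniform on [0, 1]^2, and the true class is
    # 1 exactly when x1 + x2 > 1.
    x = (rng.random(), rng.random())
    y = 1 if x[0] + x[1] > 1 else 0
    return x, y

def h(x):
    # A deliberately imperfect hypothesis: it thresholds on x1 alone.
    return 1 if x[0] > 0.5 else 0

def estimate_er(h, n, seed=0):
    # er(h) = E[loss] is approximated by the mean 0/1 loss on n i.i.d. examples.
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        x, y = sample_example(rng)
        errors += (h(x) != y)
    return errors / n

print(estimate_er(h, 100000))  # close to the true error probability, 0.25 here
```

For this particular setup er(h) can also be computed exactly by integrating over the region where h disagrees with the true class, which gives 0.25; the simulation is just the general-purpose version of that calculation.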

Problems encountered in practice

In practice there are various problems that can arise:

there may be noise present;

classifications in s may be incorrect;

measurements may be missing from the x vectors.

The practical techniques to be presented have their own approaches to dealing with such problems. Similarly, problems arising in practice are addressed by the theory.

[Figure: the overall scenario. A target concept, which is not known to us, generates a set of examples; some classifications may be noisy, and some measurements may be missing. The learning algorithm, possibly making use of prior knowledge, takes the available examples and produces a hypothesis.]

The perceptron: a review

The initial development of linear discriminants was carried out by Fisher in 1936, and since then they have been central to supervised learning. Their influence continues to be felt in the recent and ongoing development of support vector machines, of which more later...

The perceptron: a review

We have a two-class classification problem in R^n. The perceptron computes the weighted sum

f(x) = w^T x + b = Σ_{i=1}^n w_i x_i + b,

where w is a vector of n weights and b is a bias weight. (It is convenient here to label the two classes +1 and -1.) We output class +1 if f(x) ≥ 0, or class -1 if f(x) < 0. So the hypothesis is

h(x) = σ(w^T x + b),

where

σ(z) = +1 if z ≥ 0, -1 otherwise.

The primal perceptron algorithm

w ← 0; b ← 0
do
  for (each example (x_i, y_i) in s)
    if ( y_i (w^T x_i + b) ≤ 0 )
      w ← w + η y_i x_i
      b ← b + η y_i R²
while (mistakes are made in the for loop)
return w and b.

Here η is a small positive learning rate and R = max_i ||x_i||.
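The pseudocode translates directly into a short program. This is a minimal sketch, assuming labels y_i ∈ {+1, -1} and the variant of the algorithm (as in Cristianini and Shawe-Taylor) in which the bias is updated by η y_i R², with R the largest instance norm; the toy data is invented.

```python
def train_primal_perceptron(examples, eta=0.1, max_epochs=1000):
    """Primal perceptron; labels y_i must be +1 or -1."""
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    R2 = max(sum(xi * xi for xi in x) for x, _ in examples)  # R^2
    for _ in range(max_epochs):
        mistakes = False
        for x, y in examples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y * R2
                mistakes = True
        if not mistakes:
            return w, b
    raise RuntimeError("no convergence: data may not be linearly separable")

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: class +1 exactly when x1 > x2.
s = [((2.0, 1.0), 1), ((1.0, 2.0), -1), ((3.0, 0.0), 1), ((0.0, 3.0), -1)]
w, b = train_primal_perceptron(s)
print([classify(w, b, x) for x, _ in s])  # [1, -1, 1, -1]
```

On this data the algorithm converges after a couple of passes; on data that is not linearly separable it would cycle forever, which is why the loop is capped.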

Novikoff's theorem

The perceptron algorithm does not converge if s is not linearly separable. However, Novikoff proved the following:

Theorem 1. If s is non-trivial and linearly separable, where there exists a hyperplane (w_optimum, b_optimum) with ||w_optimum|| = 1 and

y_i (w_optimum^T x_i + b_optimum) ≥ γ for i = 1, 2, ..., m,

then the perceptron algorithm makes at most

(2R/γ)²

mistakes, where R = max_i ||x_i||.
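Novikoff's bound of (2R/γ)² mistakes can be checked empirically. A sketch, with invented data: generate a sample that is separable with margin at least 0.1 by a known unit-norm hyperplane, run the perceptron (η = 1, with the R² bias update) counting every update as a mistake, and compare with the bound.

```python
import math, random

def perceptron_mistakes(examples, eta=1.0):
    # Run the perceptron to convergence, counting the total number of updates.
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    R2 = max(sum(xi * xi for xi in x) for x, _ in examples)
    mistakes = 0
    changed = True
    while changed:
        changed = False
        for x, y in examples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y * R2
                mistakes += 1
                changed = True
    return mistakes

rng = random.Random(1)
w_opt = [1 / math.sqrt(2), -1 / math.sqrt(2)]   # ||w_opt|| = 1, b_opt = 0
examples = []
while len(examples) < 50:
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    f = w_opt[0] * x[0] + w_opt[1] * x[1]
    if abs(f) > 0.1:                            # enforce a margin of at least 0.1
        examples.append((x, 1 if f > 0 else -1))

R = max(math.sqrt(x[0] ** 2 + x[1] ** 2) for x, _ in examples)
gamma = min(y * (w_opt[0] * x[0] + w_opt[1] * x[1]) for x, y in examples)
bound = (2 * R / gamma) ** 2
print(perceptron_mistakes(examples) <= bound)  # True, as the theorem guarantees
```

The observed mistake count is usually far below the bound; the theorem only promises a worst-case guarantee.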

Dual form of the perceptron algorithm

If we set w = 0 initially, then the primal perceptron algorithm operates by adding and subtracting misclassified points x_i to w at each step. As a result, when it stops we can represent the final w as

w = Σ_{i=1}^m α_i y_i x_i.

Note:

the values α_i are positive and proportional to the number of times x_i is misclassified;

if s is fixed then the vector α = (α_1, α_2, ..., α_m) is an alternative representation of w.

Using these facts, the hypothesis can be re-written

h(x) = σ( Σ_{i=1}^m α_i y_i x_i^T x + b ).

Dual form of the perceptron algorithm

α ← 0; b ← 0
do
  for (each example (x_i, y_i) in s)
    if ( y_i ( Σ_{j=1}^m α_j y_j x_j^T x_i + b ) ≤ 0 )
      α_i ← α_i + 1
      b ← b + y_i R²
while (mistakes are made in the for loop)
return α and b.
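A sketch of the dual algorithm, assuming labels in {+1, -1} and the same R² bias update as the primal variant above. Since the instances appear only through inner products, those products can be precomputed in a Gram matrix; on linearly separable data the result is the same hypothesis as the primal algorithm produces.

```python
def train_dual_perceptron(examples, max_epochs=1000):
    """Dual perceptron; labels must be +1 or -1."""
    m = len(examples)
    alpha = [0.0] * m
    b = 0.0
    dot = lambda x, z: sum(a * c for a, c in zip(x, z))
    R2 = max(dot(x, x) for x, _ in examples)
    G = [[dot(xi, xj) for xj, _ in examples] for xi, _ in examples]  # Gram matrix
    for _ in range(max_epochs):
        mistakes = False
        for i, (x_i, y_i) in enumerate(examples):
            f = sum(alpha[j] * y_j * G[j][i]
                    for j, (_, y_j) in enumerate(examples)) + b
            if y_i * f <= 0:
                alpha[i] += 1.0
                b += y_i * R2
                mistakes = True
        if not mistakes:
            return alpha, b
    raise RuntimeError("no convergence: data may not be linearly separable")

# Same invented toy data as before: class +1 exactly when x1 > x2.
s = [((2.0, 1.0), 1), ((1.0, 2.0), -1), ((3.0, 0.0), 1), ((0.0, 3.0), -1)]
alpha, b = train_dual_perceptron(s)

def h(x):
    # h(x) = sign( sum_i alpha_i y_i x_i.x + b )
    f = sum(a * y * sum(p * q for p, q in zip(xi, x))
            for a, (xi, y) in zip(alpha, s)) + b
    return 1 if f >= 0 else -1

print([h(x) for x, _ in s])  # [1, -1, 1, -1]
```

The point of the dual form is that x only ever enters through inner products x_i^T x, which is exactly what makes the kernel substitution of the next slides possible.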

Mapping to a bigger space

There are many problems a perceptron can't solve. The classic example is the parity problem.

Mapping to a bigger space

But what happens if we add another element to x? For example we could use

Φ(x) = (x_1, x_2, x_1 x_2).

In the new three-dimensional space a perceptron can easily solve this problem.

[Figure: the parity examples plotted in the (x_1, x_2) plane, and again in three dimensions with the extra coordinate x_1 x_2, where the two classes become linearly separable.]

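The claim is easy to check by hand: in the mapped space a single plane separates the parity classes. The sketch below uses one valid choice of separating weights (invented for illustration; many others work) and verifies that it classifies all four parity examples correctly.

```python
def phi(x):
    # Map (x1, x2) to (x1, x2, x1*x2), adding the product as a third feature.
    return (x[0], x[1], x[0] * x[1])

# The parity (XOR) problem, with labels in {+1, -1}.
parity = [((0.0, 0.0), -1), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), -1)]

# One separating plane in the mapped space (a valid choice, not the only one):
w, b = (1.0, 1.0, -2.0), -0.5

preds = []
for x, y in parity:
    f = sum(wi * zi for wi, zi in zip(w, phi(x))) + b
    preds.append(1 if f >= 0 else -1)

print(preds)  # [-1, 1, 1, -1], matching the labels: the mapped problem is separable
```

No line in the original two-dimensional space achieves this, which is precisely why the extra feature helps.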
Mapping to a bigger space

This is an old trick, and the functions φ_i can be anything we like: we map an instance x to the feature vector

Φ(x) = (φ_1(x), φ_2(x), ..., φ_k(x)).

Example: in a multilayer perceptron,

φ_i(x) = g(v_i^T x + b_i),

where v_i is the vector of weights and b_i the bias associated with hidden node i, and g is an activation function. Note however that for the time being the functions φ_i are fixed, whereas in a multilayer perceptron they are allowed to vary as a result of varying the v_i and b_i.

Mapping to a bigger space

We can now use a perceptron in the high-dimensional space, obtaining a hypothesis of the form

h(x) = σ( Σ_{i=1}^k w_i φ_i(x) + b ),

where the w_i are the weights from the hidden nodes to the output node. So:

start with x;

use the φ_i to map it to a bigger space;

use a (linear) perceptron in the bigger space.

Mapping to a bigger space

What happens if we use the dual form of the perceptron algorithm in this process? We end up with a hypothesis of the form

h(x) = σ( Σ_{i=1}^m α_i y_i Φ(x_i)^T Φ(x) + b ),

where Φ(x) is the vector (φ_1(x), φ_2(x), ..., φ_k(x)).

Notice that this introduces the possibility of a tradeoff of m and k. The sum has (potentially) become smaller, containing m terms rather than k. The cost associated with this is that we may have to calculate Φ(x_i)^T Φ(x) several times.

Kernels

This suggests that it might be useful to be able to calculate Φ(x)^T Φ(z) easily. In fact such an observation has far-reaching consequences.

Definition 2. A kernel is a function K such that for all vectors x and z,

K(x, z) = Φ(x)^T Φ(z).

Note that a given kernel naturally corresponds to an underlying collection of functions φ_i. Ideally, we want to make sure that the dimension k of the feature space does not have a great effect on the calculation of K(x, z). If this is the case then h(x) is easy to evaluate.

Kernels: an example

We can use

K(x, z) = (x^T z)^p or K(x, z) = (x^T z + c)^p

to obtain polynomial kernels. In the latter case the features are monomials up to degree p, so the decision boundary obtained using K as described above will be a polynomial curve of degree p.
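Combining the dual perceptron with a polynomial kernel gives a non-linear classifier without ever constructing Φ explicitly. A sketch, using the parity data and invented parameter choices for illustration; since the feature space of (x^T z + c)^p already contains a constant feature, the explicit bias term is omitted here.

```python
def poly_kernel(x, z, c=1.0, p=2):
    # K(x, z) = (x.z + c)^p, one of the polynomial kernels above.
    return (sum(a * b for a, b in zip(x, z)) + c) ** p

parity = [((0.0, 0.0), -1), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), -1)]

def train(examples, kernel, max_epochs=1000):
    # Dual perceptron with the inner product replaced by the kernel.
    m = len(examples)
    alpha = [0.0] * m
    for _ in range(max_epochs):
        mistakes = False
        for i, (x_i, y_i) in enumerate(examples):
            f = sum(alpha[j] * y_j * kernel(x_j, x_i)
                    for j, (x_j, y_j) in enumerate(examples))
            if y_i * f <= 0:
                alpha[i] += 1.0
                mistakes = True
        if not mistakes:
            break
    return alpha

alpha = train(parity, poly_kernel)

def h(x):
    f = sum(a * y * poly_kernel(xi, x) for a, (xi, y) in zip(alpha, parity))
    return 1 if f >= 0 else -1

print([h(x) for x, _ in parity])  # [-1, 1, 1, -1]
```

With p = 2 the implicit features include x_1 x_2, which is exactly the extra coordinate that made parity separable earlier, so the kernel perceptron converges on this data.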

Kernels: an example

[Figure: the weighted sum Σ_i α_i y_i K(x_i, x) + b obtained from 100 examples using a polynomial kernel with p = 10 and c = 1, plotted over the (x_1, x_2) plane with the value truncated at ±1000.]

Gradient descent

An alternative method for training a basic perceptron works as follows. We define a measure of error for a given collection of weights, for example

E(w) = Σ_{i=1}^m (y_i - w^T x_i)²,

where the threshold function σ has not been used. Modifying our notation slightly, so that each instance gains a constant element x_0 = 1 and the bias b becomes an extra weight w_0, gives

E(w) = Σ_{i=1}^m (y_i - w^T x_i)² with w = (w_0, w_1, ..., w_n).

Gradient descent

E(w) is parabolic, and has a unique global minimum and no local minima. We therefore start with a random w and update it as follows:

w_{t+1} = w_t - η (∂E(w)/∂w)|_{w_t},

where the vector

∂E(w)/∂w = (∂E(w)/∂w_0, ∂E(w)/∂w_1, ..., ∂E(w)/∂w_n)

tells us the direction of the steepest increase in E(w), so that -∂E(w)/∂w is the direction of steepest decrease, and η is some small positive number.
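For the squared-error measure E(w) = Σ_i (y_i - w^T x_i)² the gradient can be written down directly: ∂E/∂w = -2 Σ_i (y_i - w^T x_i) x_i. A sketch of the resulting procedure, on invented data that fits y = -1 + 2x exactly (the constant element x_0 = 1 plays the role of the bias):

```python
def grad_E(w, examples):
    # Gradient of E(w) = sum_i (y_i - w.x_i)^2:  dE/dw = -2 * sum_i (y_i - w.x_i) * x_i.
    g = [0.0] * len(w)
    for x, y in examples:
        err = y - sum(wi * xi for wi, xi in zip(w, x))
        g = [gi - 2 * err * xi for gi, xi in zip(g, x)]
    return g

def gradient_descent(examples, eta=0.05, steps=2000):
    # Start at w = 0 and repeatedly step against the gradient.
    w = [0.0] * len(examples[0][0])
    for _ in range(steps):
        g = grad_E(w, examples)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Each instance is (1, x), so w_0 is the bias; the labels satisfy y = -1 + 2x.
s = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]
w = gradient_descent(s)
print([round(wi, 3) for wi in w])  # approximately [-1.0, 2.0]
```

Because E(w) is parabolic with a single minimum, any sufficiently small η reaches that minimum; too large an η would overshoot and diverge.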

Simple feedforward neural networks

We continue using the same notation as previously. Usually, we think in terms of a training algorithm finding a hypothesis h based on a training sequence s,

h = L(s),

and then classifying new instances x by evaluating h(x).

Usually with neural networks the training algorithm provides a vector w of weights. We therefore have a hypothesis that depends on the weight vector. This is often made explicit by writing h(w; x) and representing the hypothesis as a mapping depending on both w and the new instance x, so classification of x means evaluating h(w; x).

Backpropagation: the general case

First, let's look at the general case. We have a completely unrestricted feedforward structure: for the time being, there may be several outputs, and no specific layering is assumed.

Backpropagation: the general case

For each node j:

w_{i→j} connects node i to node j;

a_j = Σ_i w_{i→j} z_i is the weighted sum or activation for node j;

g is the activation function, and the output of node j is z_j = g(a_j).

Backpropagation: the general case

In addition, there is often a bias input for each node, which is always set to +1. This is not always included explicitly; sometimes the bias is included by writing the weighted summation as

a_j = b_j + Σ_i w_{i→j} z_i,

where b_j is the bias for node j.

Backpropagation: the general case

As usual we have a training sequence

s = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m))

of m instances x_i with labels y_i. We also define a measure of training error,

E(w) = some measure of the error of the network on s,

where w is the vector of all the weights in the network. Our aim is to find a set of weights that minimises E(w).

Backpropagation: the general case

How can we find a set of weights that minimises E(w)? The approach used by the backpropagation algorithm is very simple:

1. begin at step t = 0 with a randomly chosen collection of weights w_0;

2. at the t-th step, calculate the gradient (∂E(w)/∂w)|_{w_t} of E(w) at the point w_t;

3. update the weight vector by taking a small step in the direction of the negative gradient,

w_{t+1} = w_t - η (∂E(w)/∂w)|_{w_t};

4. repeat this process until E(w) is sufficiently small.

Backpropagation: the general case

In order to do this we have to calculate ∂E(w)/∂w. Often E(w) is the sum of separate components, one for each example in s,

E(w) = Σ_{p=1}^m E_p(w),

in which case

∂E(w)/∂w = Σ_{p=1}^m ∂E_p(w)/∂w.

We can therefore consider examples individually.

Backpropagation: the general case

Place example p at the inputs and calculate the values a_j and z_j for all the nodes. This is called forward propagation.

We have

∂E_p(w)/∂w_{i→j} = (∂E_p(w)/∂a_j)(∂a_j/∂w_{i→j}) = δ_j z_i,

where we've defined

δ_j = ∂E_p(w)/∂a_j

and used the fact that ∂a_j/∂w_{i→j} = z_i. So we now need to calculate the values δ_j for every node.

Backpropagation: the general case

When j is an output unit this is easy, as

δ_j = ∂E_p(w)/∂a_j = (∂E_p(w)/∂z_j) g'(a_j),

and the first term is in general easy to calculate for a given E. (For example, if E_p(w) = (1/2)(z_j - y_p)² then ∂E_p(w)/∂z_j = z_j - y_p.)

Backpropagation: the general case

When j is not an output unit we have

δ_j = ∂E_p(w)/∂a_j = Σ_k (∂E_p(w)/∂a_k)(∂a_k/∂a_j),

where the k are the nodes to which node j sends a connection. Then δ_k = ∂E_p(w)/∂a_k by definition, and

∂a_k/∂a_j = (∂a_k/∂z_j)(∂z_j/∂a_j) = w_{j→k} g'(a_j).

So

δ_j = g'(a_j) Σ_k δ_k w_{j→k}.
Backpropagation: the general case

Summary: to calculate ∂E_p(w)/∂w for one pattern:

1. Forward propagation: apply the pattern x_p and calculate the activations a_j and outputs z_j for all the nodes in the network.

2. Backpropagation 1: for outputs,

δ_j = (∂E_p(w)/∂z_j) g'(a_j).

3. Backpropagation 2: for other nodes,

δ_j = g'(a_j) Σ_k δ_k w_{j→k},

where the δ_k were calculated at an earlier step.

Finally, ∂E_p(w)/∂w_{i→j} = δ_j z_i.
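The three steps can be verified numerically: compute ∂E_p/∂w with the delta recursions, then compare one component against a finite-difference approximation of the same derivative. A sketch for a small one-hidden-layer network of sigmoid units (biases omitted for brevity; the weights and input are arbitrary):

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(W1, W2, x):
    # Forward propagation: hidden activations/outputs, then the single output node.
    a1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    z1 = [sigmoid(a) for a in a1]
    a2 = sum(w * z for w, z in zip(W2, z1))
    return z1, sigmoid(a2)

def E(W1, W2, x, y):
    out = forward(W1, W2, x)[1]
    return 0.5 * (out - y) ** 2

def backprop(W1, W2, x, y):
    # Deltas as in the summary; for the sigmoid, g'(a) = g(a)(1 - g(a)).
    z1, out = forward(W1, W2, x)
    delta_out = (out - y) * out * (1 - out)                  # output delta
    dW2 = [delta_out * z for z in z1]                        # dE/dw for output weights
    delta1 = [z * (1 - z) * delta_out * w for z, w in zip(z1, W2)]
    dW1 = [[d * xi for xi in x] for d in delta1]             # dE/dw for hidden weights
    return dW1, dW2

rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W2 = [rng.uniform(-1, 1) for _ in range(3)]
x, y = [0.5, -0.3], 1.0

dW1, dW2 = backprop(W1, W2, x, y)

# Central finite-difference check on one hidden weight.
eps = 1e-6
W1[0][0] += eps; E_plus = E(W1, W2, x, y)
W1[0][0] -= 2 * eps; E_minus = E(W1, W2, x, y)
W1[0][0] += eps
numeric = (E_plus - E_minus) / (2 * eps)
print(abs(numeric - dW1[0][0]) < 1e-6)  # True: the recursions match the numeric gradient
```

This kind of gradient check is a standard way to catch bugs in a hand-written backpropagation routine.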

Backpropagation: a specific example

Now consider a network with one hidden layer and a single output node, where each node has the sigmoid activation function

g(a) = 1/(1 + exp(-a)),

so

g'(a) = g(a)(1 - g(a)).

Also, take the error for pattern p to be

E_p(w) = (1/2)(y_p - z_output)²,

where z_output is the value produced by the output node.

Backpropagation: a specific example

For the output: we have

δ_output = (∂E_p(w)/∂z_output) g'(a_output) = -(y_p - z_output) z_output (1 - z_output),

and so

∂E_p(w)/∂w_{i→output} = δ_output z_i.

Backpropagation: a specific example

For the hidden nodes: in general

δ_j = g'(a_j) Σ_k δ_k w_{j→k},

but there is only one output, so

δ_j = z_j (1 - z_j) δ_output w_{j→output},

and we have a value for δ_output from the previous slide. So

∂E_p(w)/∂w_{i→j} = δ_j z_i = -z_j (1 - z_j) (y_p - z_output) z_output (1 - z_output) w_{j→output} z_i.

Putting it all together

We can then use the derivatives in one of two basic ways.

Batch: (as described previously) sum the derivatives over all patterns before making an update,

w_{t+1} = w_t - η Σ_p (∂E_p(w)/∂w)|_{w_t}.

Sequential: update using just one pattern at once,

w_{t+1} = w_t - η (∂E_p(w)/∂w)|_{w_t},

selecting patterns in sequence or at random.
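For the squared-error linear model of the gradient descent section the sequential scheme looks like this. The sketch below (invented data fitting y = -1 + 2x exactly) updates after each pattern, visiting the patterns in random order; the batch scheme would instead sum the per-pattern gradients before each update.

```python
import random

def sequential_train(examples, eta=0.01, epochs=500, seed=0):
    # Sequential updates: one small step per pattern, patterns in random order.
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in rng.sample(examples, len(examples)):
            err = y - sum(wi * xi for wi, xi in zip(w, x))
            # Per-pattern gradient of (y - w.x)^2 is -2 * err * x.
            w = [wi + eta * 2 * err * xi for wi, xi in zip(w, x)]
    return w

# Each instance is (1, x), so w_0 acts as the bias; labels satisfy y = -1 + 2x.
s = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]
print([round(wi, 2) for wi in sequential_train(s)])  # approximately [-1.0, 2.0]
```

On noise-free data both schemes reach the same minimum; with noisy data the sequential version jitters around it, which is one reason the step size η is kept small.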

Example: the classical parity problem

As an example we show the result of training a network with: two inputs; one output; one hidden layer containing a small number of units; all other details as above.

The problem is the classical parity problem, using a collection of noisy examples. The sequential approach is used, with repeated passes through the entire training sequence.

Example: the classical parity problem

[Figure: the training data and decision boundary in the (x_1, x_2) plane, before and after training.]

[Figure: the network output plotted as a surface over the (x_1, x_2) plane, before and after training.]

Example: the classical parity problem

[Figure: the training error plotted against the number of passes through the training sequence, from 100 up to 1000.]
