
Welcome back.

This is Machine Learning Lecture Number 2.


Let's start by reviewing basic concepts we've already
seen from lecture number 1.
Feature vectors x provide the context for the classifier
to make predictions.
They are vectors belonging to Rd, so they are d-dimensional
vectors in general; d equals 2 for points on the plane,
as we've already seen.
We have labels, or targets, or outputs
y that are plus minus 1.
The task in machine learning is illustrated by a training set--
a supervised learning task where you
are given the input and the corresponding output
that you want.
And you're supposed to learn a regularity between the two
in order to make predictions for future examples, or inputs,
or feature vectors x.
So the training set, as we've already seen,
is denoted by S subscript n, where
n refers to the number of training examples that we have.
And it is a collection of pairs of the input feature vector
where the superscript i denotes the i-th example
and the corresponding label.
And we assume that we have some number n of them.
So that is the task as given to the algorithm.
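To make this concrete, here is a minimal sketch, in Python, of how such a training set S subscript n might be represented; the specific points and labels below are invented purely for illustration.

```python
# A training set S_n: a collection of n pairs (x^(i), y^(i)),
# where each x^(i) is a d-dimensional feature vector and each
# y^(i) is a plus/minus 1 label. These values are made up.
training_set = [
    ((1.0, 2.0), +1),   # (x^(1), y^(1))
    ((-1.5, 0.5), -1),  # (x^(2), y^(2))
    ((2.0, -1.0), +1),  # (x^(3), y^(3))
]
n = len(training_set)        # number of training examples
d = len(training_set[0][0])  # dimension of each feature vector
```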
Now a classifier is a mapping that takes, as an input,
a feature vector.
So it maps any point in Rd to a plus 1 or minus 1 label.
So when we apply the classifier to a particular point,
we will get either a plus 1 or minus 1 label for that example,
OK?
So a classifier essentially divides the space
into two halves so we can look at--
we've seen shaded areas on the plane
that correspond to the set of positively
labeled points on the plane.
So this is, then, the set of all x
in Rd such that the classifier returns plus 1 label.
And similarly for the negative half, OK?
And those, together, comprise the Rd space.
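As a concrete illustration of a mapping of this kind, here is a toy classifier in Python; the particular decision rule (the sign of the first coordinate) is an assumption made only for this example, not something fixed by the lecture.

```python
# A toy classifier h: R^d -> {+1, -1}, chosen arbitrarily for
# illustration: it labels a point by the sign of its first
# coordinate (with 0 counted as positive).
def h(x):
    return +1 if x[0] >= 0 else -1

# The positive half of the space is {x : h(x) = +1}; the
# negative half is its complement. Together they comprise R^d.
```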

All right.
Now once we have a notion of a classifier,
we need to be able to evaluate how good that classifier is.
And we have seen two ways to measure
how good a classifier is: one that we have access
to, the training error, and the other one that we really
want to minimize, which is the test error.
The training error is denoted by big E subscript n,
where n
refers to the number of examples that you are
using to calculate the error.
It's a function of any classifier.
I pick any classifier.
I can evaluate the training error for that classifier.
And it is the fraction of misclassified examples
on the training set.
So we take a sum over the training samples,
divide by the number of training samples,
and for each example, count 1
when there is a mistake, and 0, otherwise.
So we apply that classifier to the i-th training sample,
and compare it with the i-th label as
given in the training set, and ask whether they disagree.
If they do, then this returns 1, indicating an error, and 0,
otherwise.
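The training error just described can be sketched in Python as follows; the helper name training_error is chosen here for illustration.

```python
def training_error(h, S):
    """Fraction of examples in S misclassified by h.

    S is a list of (x, y) pairs. For each example, the indicator
    [[h(x^(i)) != y^(i)]] contributes 1 on a mistake and 0
    otherwise; we sum these and divide by n.
    """
    n = len(S)
    return sum(1 if h(x) != y else 0 for x, y in S) / n
```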
All right.
Test error is defined similarly over the test
examples.
And now, we typically drop the n there,
assuming that the test set is relatively large,
and it can be measured in principle for any classifier.
So it is defined similarly to the training error,
but over a disjoint set of examples--
those future examples on which you actually wish to do well.
And much of machine learning, really, the theory part
is in relating how a classifier that
might do well on the training set
would also do well on the test set.
That's the problem called Generalization,
as we've already seen.
We can effect generalization by limiting the choices
that we have at the time of considering
minimizing the training error.
The choices that we have, the set of hypotheses,
set of alternatives, is also called the set of classifiers.
So our classifier here belongs to a set
of classifiers, capital H here, that's
not the set of all mappings.
It's a limited set of options that we constrain ourselves to.

Let's look at this geometrically, just as
a review, here on the plane.
So each x here has two coordinates, x1 and x2.
So it is clearly a vector in Rd where d is 2.

And a training example is given as a point here.


So here we have a training example.
Let's say it's the first training example
with the associated label.
And in this case, the label is minus 1.
That defines the task for our classifier.
This is how the supervised learning task is illustrated.

A classifier, then, is a function
that we wish to select from our set of classifiers,
such that it does well on the training set.

So here, a classifier h is a linear classifier,


divides the space into two halves, linearly,
where the shaded area corresponds
to all the points in the 2d plane
that the classifier returns plus 1, and the rest minus 1.
So this classifier would have a training error equal to 0.
It correctly classifies all the training examples.
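A linear classifier of the kind just described can be sketched as follows; the parameters theta and theta0 are hypothetical, and counting the boundary case (a sum of exactly 0) as plus 1 is an assumption made for the sketch.

```python
def linear_classifier(theta, theta0):
    """Builds h(x) = sign(theta . x + theta0).

    theta is a d-dimensional vector of coefficients and theta0
    an offset; the returned h divides R^d linearly into a plus 1
    half and a minus 1 half. sign(0) is taken as +1 here.
    """
    def h(x):
        s = sum(t * xi for t, xi in zip(theta, x)) + theta0
        return +1 if s >= 0 else -1
    return h
```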

Now the real task that we wish to solve


is to correctly classify those test examples
that we have not yet seen.
And the trick here is to somehow guide
the selection of the classifier based on the training
example, such that it would do well on the examples
that we have not yet seen.
So in order for this to be possible at all,
you have to have some relationship
between the training samples and the test examples.
Typically, it is assumed that both sets are
sampled, as random subsets, from some large collection
of examples.
So you get a random subset as a training set.
And then the rest are considered test examples, OK?
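This assumed split can be sketched like so; the helper name split and the fixed seed are illustrative choices, not part of the lecture.

```python
import random

def split(examples, n_train, seed=0):
    """Take a random subset of size n_train as the training set
    and treat the remaining examples as the test set, so the two
    sets are disjoint and together cover the whole collection."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```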
Today we're going to limit our choices of the classifier
to the set of linear classifiers and just assume
that since that is a restricted set of classifiers,
that if we do well in the training set,
we will also generalize well.
We will return to the question of generalization
more formally later on in this course.
So what is this lecture going to be about?
We're going to formally define the set
of linear classifiers, the set H restricted set of classifiers.
We need to introduce parameters that index classifiers in this
set so that we can search over the possible classifiers
in the set.
Then we'll talk about linear separation,
what a linear classifier or the set of classifiers can
and cannot do.
So what exactly is the limitation?
How are we constraining the set of classifiers?
Once we understand the set of classifiers
we are operating with, we need to define the learning
algorithm that takes in the training
set and the set of classifiers and tries to find a classifier,
in that set, that somehow best fits the training set.
We will initially consider the perceptron algorithm,
which is a very simple online, mistake-driven algorithm that
is still useful, as it can be generalized
to high-dimensional problems.
So perceptron algorithm finds a classifier h hat,
where hat denotes an estimate from the data.
It's an algorithm that takes, as an input, the training
set and the set of classifiers and then returns
that estimated classifier, OK?
And then we can apply that estimated classifier
on any new example to get the predicted label for that point.
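Here is a minimal sketch of the perceptron in Python, using the through-origin variant for simplicity; the exact formulation the lecture develops (for example, whether an offset is included) may differ.

```python
def perceptron(S, T=100):
    """A sketch of the perceptron algorithm (through the origin).

    S is a list of (x, y) pairs with y in {+1, -1}; T bounds the
    number of sweeps over the data. On each mistake, update
    theta <- theta + y * x.
    """
    d = len(S[0][0])
    theta = [0.0] * d
    for _ in range(T):
        mistakes = 0
        for x, y in S:
            # A mistake: theta does not classify (x, y) strictly
            # correctly (the mistake-driven update fires).
            if y * sum(t * xi for t, xi in zip(theta, x)) <= 0:
                theta = [t + y * xi for t, xi in zip(theta, x)]
                mistakes += 1
        if mistakes == 0:
            break  # a full sweep with no mistakes: stop early
    return theta

def predict(theta, x):
    # The estimated classifier h hat applied to a new example x.
    s = sum(t * xi for t, xi in zip(theta, x))
    return +1 if s >= 0 else -1
```

Applying predict with the returned theta plays the role of the estimated classifier h hat on new examples.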
