
Lecture 1 – The Learning Problem

Welcome to machine learning. Let me start with an outline of the course, and then
go into the material of today's lecture.

As you see from the outline, the topics are given colors, and that designates their
main content, whether it's mathematical or practical.
Machine learning is a very broad subject. It goes from very abstract theory to
extreme practice as in rules of thumb. So some mathematics is useful because it
gives you the conceptual framework, and then some practical aspects are useful
because they give you the way to deal with real learning systems.
And the subject of this lecture is the learning problem. It's an introduction to what
learning is. And I'm going to start with one example that captures the essence of
machine learning. The example of machine learning that I'm going to start with is
how a viewer would rate a movie. Now that is an interesting problem, and it's
interesting for us because we watch movies, and very interesting for a company
that rents out movies, because it can recommend movies in a better way.
Now if you look at the problem of rating a movie, it captures the essence of
machine learning, and the essence has three components. If you find these
three components in a problem you have in your field, then you know that machine
learning is ready as an application tool. What are the three?
The first one is that a pattern exists. If a pattern didn't exist, there would be
nothing to look for. So what is the pattern in our example? There is no question that
the way a person rates a movie is related to how they rated other movies, and is
also related to how other people rated that movie. We know that much. So there is a
pattern to be discovered. However, we cannot really pin it down
mathematically. I cannot ask you to write a 17th-order polynomial that captures
how people rate movies. If we could pin it down mathematically, we would just use the
mathematical definition and get the best possible solution. You could definitely still use
learning in that case, but it wouldn't be the recommended method, because learning
introduces some error in performance. Since we cannot write down the system on our own,
we are going to depend on data in order to find the system. That brings us to the third
component, which is very important: we have to have data. We are learning from
data. If you have these three components, you are ready to apply machine learning.
Now, let me give you a solution to the movie rating in order to start getting a feel
for it. So here is a system:

We are going to describe a viewer as a vector of factors, a profile if you will. So,
for example, the first factor could be comedy content (from the viewer's point of
view): does the viewer like comedy? Do they like action? Do they like
blockbusters? And you can go on all the way, even to asking whether the
person likes the lead actor or not. Now you go to the content of the movie itself,
and you get the corresponding part. Does the movie have comedy? Does it have
action? Is it a blockbuster? And so on. Now you compare the two, and you realize
that if there is a mismatch -- let's say you hate comedy and the movie has a lot of
comedy -- then the chances are you're not going to like it and the rating is going to
be low. But if there is a match across many coordinates (and the number
of factors here could really be something like 300), then the chances are you'll like the
movie. And if there are many mismatches, the chances are you're not going to like the
movie. So what you do is match the movie factors against the viewer factors, and
then you add up their contributions. As a result of that, you get
the predicted rating. This is all good except for one problem, which is this is
really not machine learning. In order to produce this thing, you have to watch
the movie, and analyze the content. You have to interview the viewer, and ask
about their taste. And then after that, you combine them and try to get a prediction
for the rating.
Now the idea of machine learning is that you don't have to do any of that. So let's
look at the learning approach. In the learning approach, we describe the viewer
as a vector of factors, with a component for every factor. Same
for the movie.
And the way we said we compute the rating is by simply taking these two vectors and
combining them to get the rating, in this setting

Now what machine learning will do is reverse-engineer that process. It starts from
the rating, and then tries to find out what factors would be consistent with
that rating.
So think of it this way. You start, let's say, with completely random factors for
movies and for viewers. For every viewer and every movie, that's your starting point.
Obviously, there is no chance in the world that when you take the inner product
between these two random factor vectors, you'll get anything that looks like
the rating that actually took place (that you actually have in your database), right?
But what you do is take a rating that actually happened, and then you
start nudging the factors ever so slightly toward that rating -- making the
inner product get closer to the rating you have. Now it looks like a
hopeless thing. I start with so many factors, they are all random, and I'm trying to
make them match a rating. What are the chances? Well, the point is that you are
going to do this not for one rating, but for a million ratings. And you keep cycling
through the million, over and over and over. And eventually, you find that the
factors are now meaningful in terms of the ratings. And if you take a
viewer who didn't watch a movie, take the viewer vector that resulted from
that learning process and the movie vector that resulted from that process, and take
the inner product, you get a rating which is actually consistent with how that viewer
would rate the movie. That's the idea. And this is for real; this actually can be used.
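The nudging process described above can be sketched as a small loop. The following is only an illustration: the ratings, the factor count (k = 3), and the learning rate are all invented for this sketch, not values from the lecture.

```python
import numpy as np

# A minimal sketch of the "nudge the factors toward the rating" idea.
rng = np.random.default_rng(0)
n_viewers, n_movies, k = 4, 5, 3

U = rng.normal(scale=0.1, size=(n_viewers, k))   # random viewer factors
M = rng.normal(scale=0.1, size=(n_movies, k))    # random movie factors

# Ratings that "actually happened": (viewer, movie, rating).
ratings = [(0, 1, 5.0), (0, 3, 1.0), (1, 1, 4.0), (2, 2, 3.0), (3, 4, 2.0)]

lr = 0.05
for _ in range(2000):              # cycle through the ratings over and over
    for i, j, r in ratings:
        err = r - U[i] @ M[j]      # inner product = predicted rating
        # Nudge both factor vectors ever so slightly toward the rating.
        U[i] += lr * err * M[j]
        M[j] += lr * err * U[i]

for i, j, r in ratings:
    print(f"viewer {i}, movie {j}: actual {r}, predicted {U[i] @ M[j]:.2f}")
```

After enough cycles, the inner products reproduce the known ratings, and the same inner product can then be taken for a viewer-movie pair that was never rated.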
Now with this example in mind, let's actually go to the components of learning. So
now I would like to abstract from the learning problems that I see, what are the
mathematical components that make up the learning problem? And I'm going to use
a metaphor now from another application domain, which is a financial
application. So the metaphor we are going to use is credit approval. You apply
for a credit card, and the bank wants to decide whether it's a good idea to extend a
credit card for you or not. From the bank's point of view, if they're going to make
money, they are happy. If they are going to lose money, they are not happy. Now,
very much like we didn't have a magic formula for deciding how a viewer will rate a
movie, the bank doesn't have a magic formula for deciding whether a person is
creditworthy or not. What they're going to do, they're going to rely on historical
records of previous customers, and how their credit behavior turned out, and then
try to reverse-engineer the system to construct a model and to apply it to a future
customer. That's the deal.
What are the components here? First, you have the applicant information that could
be this

Again, pretty much like we did in the movie example, there is no question that these
fields are related to the creditworthiness. They don't necessarily uniquely determine
it, but they are related. And the bank doesn't want a sure bet. They want to get the
credit decision as reliable as possible. So they want to use that pattern, in order to
be able to come up with a good decision. And they take this input, and they want to
approve the credit or deny it. So let's formalize this.
First, we are going to have an input that happens to be the customer application

So we can think of it as a d-dimensional vector, where the first component could
be the salary, then years in residence, outstanding debt, whatever the components are.
You put it as a vector, and that becomes the input. Then we get the output y

Where the output is simply the decision, either to extend credit or not to extend
credit, +1 and -1. Now we have after that, the target function.

The target function is a function from a domain X, which is the set of all of these
inputs (it is the set of vectors of d-dimensions, it's a d-dimensional Euclidean space,
in this case). And then the Y is the set of y's that can only be +1 or -1, accept or
deny. And therefore this is just a binary co-domain. And this target function is the
ideal credit approval formula, which we don't know.
In all of our endeavors in machine learning, the target function is unknown to us. If
it were known, nobody would need learning; we would just go ahead and implement it. But we
need to learn it because it is unknown to us. So what are we going to do to learn it?
We are going to use data, examples. So the data in this case is based on previous
customer application records. The input, which is the information in their
applications, and the output, which is how they turned out in hindsight. So this is
the data

And then you use the data, which is the historical records, in order to get the
hypothesis. The hypothesis is the formula we get to approximate the target
function. That's the goal of learning.
Now, let's put it in a diagram in order to analyze it a little bit more
If you look at the diagram, on top is the target function which is unknown -- that is
the ideal credit approval that's what we're hoping to get to approximate. And we
don't see it, we see it only through the eyes of the training examples. This is our
vehicle of understanding what the target function is. And eventually, we would like
to produce the final hypothesis. The final hypothesis is the formula the bank is going
to use in order to approve or deny credit, with the hope that g
approximates f. Now what connects those two (the data and the final hypothesis)? This
will be the learning algorithm. So the learning algorithm takes the examples, and
will produce the final hypothesis.
Now there is another component that goes into the learning algorithm. So what the
learning algorithm does, it creates the formula from a set of candidate formulas.
And these we are going to call the hypothesis set, a set of hypotheses from which
we are going to pick one hypothesis. So from this H comes a bunch of small h's,
which are functions that can be candidates for being the credit approval. And one of
them will be picked by the learning algorithm, which happens to be g, hopefully
approximating f.
But why do we have this hypothesis set? Why not let the algorithm pick from
anything? Just create the formula, without being restricted to a particular set of
formulas H. There are two reasons, and I want to explain them.
1. THERE IS NO DOWNSIDE One of them is that there is no downside to
including a hypothesis set in the formalization. There is no downside because
when you decide to use, say, a linear formula, a neural network, or a support
vector machine, you are already dictating a set of hypotheses. And if you
don't want to restrict yourself at all, then your hypothesis set is the set of all
possible hypotheses. So there is no loss of generality in putting it in.
2. THERE IS AN UPSIDE The upside is not obvious here, but it will become obvious as
we go through the theory. The hypothesis set will play a pivotal role in the
theory of learning. It will tell us whether we can learn, how well we can learn, and
so on. Therefore, having it as an explicit component in the problem statement will
make the theory go through.
Now, let me focus on the solution components of that figure. Given the machine
learning problem, what are the solution components? The learning algorithm and
the hypothesis set are your solution tools. These are the things you choose in
order to solve the problem.

So, here is the hypothesis set

We chose the notation H for the set, and an element of it will be given the symbol small
h. So h is a function, pretty much like the final hypothesis g; g is just the one
that you happen to select. When we select it, we call it g. If it's sitting there
generically, we call it h.
And you have your learning algorithm that will select in some way g from H. And
then, when you put them together, they are referred to as the learning model. So
if you're asked what is the learning model you are using, you're actually choosing
both a hypothesis set and a learning algorithm. For example, if you are using the
perceptron model, you could use the perceptron learning algorithm (PLA); if you use a
neural network, you could use backpropagation as the algorithm; if you use a
support vector machine, you could use a radial basis function version, or, let's say,
quadratic programming as the learning algorithm. So every time you have a model,
there is a hypothesis set, and then there is an algorithm that will do the searching
and produce the final hypothesis. So this is the standard form for the solution.
Now, let me go through a simple hypothesis set – the ‘perceptron’. So, let’s say
we have a d-dimensional input vector that corresponds to a customer's attributes.
So what does the perceptron model do? It does a very simple formula

So, it takes the attributes you have and gives them different weights. So, let's say
the salary is important; the chances are the weight w corresponding to the salary will be big.
However, consider outstanding debt: outstanding debt is bad news. If you owe
a lot, that's not good. So the chances are the weight will be negative for outstanding
debt, and so on.
Now you add them together, and you add them in a linear form -- that's what makes
it a perceptron -- and you can look at the result as a “credit score” that you compare
with a threshold. If you exceed the threshold, they approve the credit card; if
you don't, they deny the credit card.
Now, we take this and we put it in the formalization we had

If the credit score quantity is positive, you will approve credit. If it's negative, you
will deny credit. And that will be the form of your hypothesis. Now, realize that what
defines h is your choice of w and the threshold. These are the parameters that
define one hypothesis versus the other in this case. x is an input that will be put into
any hypothesis. As far as we are concerned, when we are in the learning process,
the inputs and outputs are already determined. These are the data set. But what the
algorithm needs to vary in order to choose the final hypothesis, are those
parameters which, in this case, are w and the threshold.
So let's look at it visually. Let's assume that the data you are working with is
linearly separable like this
And if you look at the nine data points, some of them were good customers and
some of them were bad customers. And you would now like to apply the perceptron
model (which corresponds to the purple line) in order to separate them correctly.
Note that a particular purple line encodes a choice of parameters

But when you start, you start with random weights, and the random weights will
give you any line, like the one on the left. So, you can see that the learning
algorithm is playing around with these parameters, and therefore moving the line
around (in this case in a 2-dimensional space – but in reality in a d-dimensional
space), trying to arrive at the solution on the right.
Now we are going to have a simple change of notation. Instead of calling it
threshold, we're going to treat it as if it's another weight, like

And now we are going to introduce an artificial coordinate that will allow me to
simplify the formula
So now we are down to this formula for the perceptron hypothesis that is the inner
product between two vectors.
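As a small illustration of this change of notation (the weights, threshold, and input values here are made up), the original thresholded form and the simplified inner-product form always agree:

```python
import numpy as np

# Made-up illustrative numbers, not values from the lecture.
w = np.array([0.4, -0.7])        # e.g. weights for salary, outstanding debt
threshold = 0.2
x = np.array([1.5, 0.3])         # one applicant's (scaled) attributes

# Original form: compare the weighted credit score against a threshold.
h1 = np.sign(w @ x - threshold)

# Simplified form: absorb the threshold as weight w0 = -threshold,
# attached to an artificial coordinate x0 = 1.
w_aug = np.array([-threshold, *w])
x_aug = np.array([1.0, *x])
h2 = np.sign(w_aug @ x_aug)      # now just the sign of an inner product

print(h1, h2)                    # the two forms agree
```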
Now that we have the hypothesis set, let's look for the learning algorithm that goes
with it. The hypothesis set tells you the resources you can work with. Now
we need the algorithm that is going to look at the data and navigate through the
space of hypotheses to bring out the one that is going to be the final hypothesis that you
give to your customer. This one is called the perceptron learning algorithm
(PLA), and what it does is the following

It takes the training data and it tries to make the w correct. So if a point is
misclassified, it updates the weight vector towards some value that will
“push” the hypothesis to the correct classification. It changes the weight,
which changes the hypothesis, so that it behaves better on that particular
point.
And that is the intuition about what this formula is doing
Remember that the inner product between two vectors (in this case w and x) with an
acute angle between them is positive, so the sign function gives you +1; on
the other hand, if the angle is obtuse, the inner product is negative and the sign
gives you -1. So, being misclassified means that either x and w have an acute
angle between them while the output should be -1 (the second case above), or an
obtuse angle while the output should be +1 (the first case above). Given this, note from the
diagram above that the update rule w ← w + y·x moves w to repair the misclassified
example in both cases. So this is the intuition behind it. However, it is not the
intuition that makes this work. There are a number of problems with this approach. I
just motivated that to show you that this update is not a crazy rule.
Now, let's look at the iterations of the perceptron learning algorithm. Here is one
iteration of PLA (perceptron learning algorithm)

In this case the purple line corresponds to a specific hypothesis. But in the picture
above we have one misclassified example. So now you would like to adjust the
weights, by moving around that purple line, such that the point is classified
correctly. If you apply the update rule, you'll find that the line actually moves in
the direction of the arrow, which means that the blue point will likely be correctly
classified after that iteration.
There is a problem here, because if I move the line in that direction, the negative
example near the line could be misclassified now. And if you think about it, by
taking care of one point, I may be messing up all other points, because I'm not
taking them into consideration. Well, the good news for the perceptron learning
algorithm is that all you need to do is pick a misclassified point, any one you
like, and then apply the update rule that we saw. And you keep doing this until
there are no more misclassified points. If the data is linearly separable, then you will
end up with a correct solution (the algorithm will converge). This is not an
obvious statement; it requires a proof. But it gives us the simplest possible learning
model we can think of.
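The whole loop can be sketched in a few lines. This is a minimal illustration on an invented, linearly separable toy data set (the hidden separating line and the points are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy data: x0 = 1 is the artificial coordinate, and the labels
# come from a hidden line, so the set is linearly separable by construction.
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, size=(20, 2))])
y = np.sign(X @ np.array([0.1, 1.0, -1.0]))

w = np.zeros(3)                               # any starting weights will do
while True:
    mis = np.flatnonzero(np.sign(X @ w) != y) # indices of misclassified points
    if mis.size == 0:
        break                                 # no misclassification left: done
    i = mis[0]                                # pick any misclassified point
    w = w + y[i] * X[i]                       # the PLA update rule

print("all points classified correctly:", np.all(np.sign(X @ w) == y))
```

On separable data this loop is guaranteed to terminate; on non-separable data it would cycle forever, which is exactly the issue raised in the Q&A at the end.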
Now, given the solution found, you can use it to make predictions on new
customers. You may ask the question: if I match the historical records,
does this mean that I'm getting future customers right, which is the only thing
that matters? Well, that's a loaded question which will be handled in extreme detail,
when we talk about the theory of learning. That's why we have to develop all of the
theory. So, that's it. And that is the perceptron learning algorithm.
Now let me go into the bigger picture of learning, defining these types. So let's talk
about the premise of learning, from which the different types of learning came
about

This is the premise that is common between any problem that you would consider
learning: you use a set of observations, what we call data, to uncover an
underlying process; in our case, the target function. You can see that this is a
very broad premise. And therefore, you can see that people have rediscovered that
over and over in so many disciplines. For example in statistics, where the underlying
process is a probability distribution and the observations are samples generated by
that distribution. And you want to take the samples, and predict what the probability
distribution is.
Now let’s talk about the different types of learning; where these are the most
important ones

So let's take them one by one.

1. Supervised learning
So what is supervised learning? Any time the data is given to you with the
output explicitly provided -- as if a supervisor is helping you out -- so that you can
classify future instances, we call it supervised.
Let's take an example of coin recognition, just to be able to contrast it with
unsupervised learning in a moment. Let's say you have a vending machine, and you
would like to make the system able to recognize the coins. Now, given the
physical measurements of the coin (mass and size) and the correct output
(quarters, nickels, pennies, or dimes) we can construct this diagram

And because this is supervised learning, the diagram is colored: I gave you
those points and told you which are pennies, which are nickels, et cetera. So you use those in order to
train a system, and the system will then be able to classify a future one. For
example, if we stick to the linear approach, you may be able to find separator lines
like those

And those separator lines will separate regions, based on the data. And once you
have those, you can discard the data because you don't need it anymore.
And when you get a future coin that is now unlabeled, that you don't know what it
is, when the vending machine is actually working, then the coin will lie in one region
or another, and you're going to classify it accordingly, giving it a label. So that is
an example of supervised learning.
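As an illustration of the supervised setting: below, invented (size, mass) coin measurements are used with their labels, and a simple nearest-centroid rule stands in for the linear separators in the figure (the numbers and the classifier choice are mine, not the lecture's).

```python
import numpy as np

# Invented coin measurements: (size in mm, mass in g), with labels given.
train_x = np.array([[19.0, 2.5], [21.2, 5.0], [17.9, 2.3], [24.3, 5.7],
                    [19.1, 2.4], [21.0, 5.1], [18.0, 2.2], [24.2, 5.6]])
train_y = np.array(["penny", "nickel", "dime", "quarter"] * 2)

# "Train": summarize each labeled cluster by its centroid. After this,
# the data can be discarded -- the centroids define the regions.
labels = np.unique(train_y)
centroids = {c: train_x[train_y == c].mean(axis=0) for c in labels}

def classify(coin):
    # A new, unlabeled coin gets the label of the nearest region.
    return min(centroids, key=lambda c: np.linalg.norm(coin - centroids[c]))

print(classify(np.array([24.0, 5.5])))   # lands in the "quarter" region
```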

2. Unsupervised learning
For unsupervised learning, instead of having the examples with the correct target,
we are going to have examples that carry less information: I'm just going to tell you
what the input is

And I'm not going to tell you what the target function is at all. I'm not
going to tell you anything about the target function. I'm just going to tell you,
here is the data of a customer. Good luck, try to predict the credit. Now, although
this seems pretty difficult, let me explain how this can help us.
Let's go for the coin example. For the coin example, we have data that looks like
this

Now notice that even though I don’t know the labels, things tend to cluster together. So
I may be able to group the points into clusters, without knowing what the
categories are. That would be quite an achievement already. You still don't know
whether a cluster is pennies or quarters or whatever; but the data actually enabled you to do
something that is a significant step. So you're going to be able to come up with
these boundaries, where it’s not very clear whether there are three or four clusters. And indeed, in unsupervised
learning, the number of clusters is ambiguous at times. And now, you are so close
to finding the full system. So unlabeled data actually can be pretty useful. Because
if you categorize the clusters into types, like this

...if someone comes with a single example of a quarter, a dime, etc., then you are
ready to go. Whereas before, you had to have lots of examples in order to decide
where exactly to put the boundary. And this is why a data set like that, which looks like
a complete jungle, is actually useful.
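The clustering step itself can be sketched with k-means, one standard clustering algorithm, used here purely for illustration on invented unlabeled measurements (the blob centers, spread, and k are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented unlabeled coin measurements: four blobs of (size, mass) points,
# two of which overlap -- which is why 3 vs. 4 clusters can be ambiguous.
X = np.vstack([rng.normal(c, 0.15, size=(30, 2))
               for c in [(1.8, 0.23), (2.1, 0.50), (1.9, 0.25), (2.4, 0.57)]])

k = 4
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(50):
    # Assign each point to its nearest center...
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # ...then move each center to the mean of its assigned points.
    centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])

print(np.round(centers, 2))   # cluster centers found without any labels
```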
Let me give you another interesting example of unsupervised learning, where I give
you the input without the output, and you are actually in a better situation to learn.
Let's say that your company, or your school in this case, is sending you for a
semester in Rio de Janeiro. So you're very excited, and you decide that you'd better
learn some Portuguese. And it happens that the only resource you have is a
radio station in Portuguese in your car. So what you do is just turn it on whenever
you drive. And for an entire month, you're bombarded with Portuguese, with things
whose meaning you don’t really know: "tudo bem", "como vai", "valeu". After a while,
without knowing anything -- it's unsupervised; nobody told you the meaning of any
word -- you start to develop a model of the language in your mind. You become very
eager to know what "tudo bem" actually means. You are ready to
learn, and once you learn it, it's actually fixed in your mind. Then when you go
there, you will learn the language faster than if you didn't go through this
experience. So you can think of unsupervised learning as a way of getting a
higher-level representation of the input, where in the extreme this higher level
corresponds to clusters -- a better representation in your mind than just the
raw input.

3. Reinforcement learning
In this case, it's not as bad as unsupervised learning. Again, without the benefit
of supervision, you don't get the correct output, but instead you have this

I'm going to give you some output and I'm going to grade it. So that is the information
provided to you. So I'm not explicitly giving you the output, but when you choose an
output, I'm going to tell you how well you're doing. Reinforcement learning is
interesting because it describes much of our own experience in learning (like a toddler
learning not to touch a hot iron). The most important application, or one of
the most important applications, of reinforcement learning is in playing games,
like backgammon. In backgammon, you want to take the current state of the
board and decide what the optimal move is, in order to stand the best
chance of winning. So the target function is the best move given a state.
Now, if I have to generate the optimal moves in order for the system to learn, then I must
be a pretty good backgammon player already. So it's a vicious cycle. This is where
reinforcement learning comes in handy. What you're going to do is have
the computer choose any output -- a crazy move -- and then see what happens
eventually. So one computer plays against another computer, and both of them
want to learn. You make a move, and eventually you win or lose. Then you
propagate the credit for winning or losing back, according to a very specific
and sophisticated formula, to all the moves that happened. Now you might think that's
completely hopeless, because maybe it was not this move that determined the final
result, but another move. But remember that you are going to do this a billion
times. And maybe after three days of CPU time, you go back to the computer, and you
have a backgammon champion. Actually, that's true. The world champion, at some
point, was a neural network that learned the way I described.
So now I'm going to give you a learning puzzle that is a supervised learning problem
The problem is: given the labeled examples (9 bits of information and the
corresponding label), what is the label for the new example? So, your task is to look at
the examples, learn a target function, apply it to the test point, and then decide
what the value of the function is, +1 or -1.
Now, the answer to that question is that this is an impossible task, because I told
you the target function is unknown. It could be anything. There is an
infinite number of functions that fit those examples, each giving its own value
for the test point. For example, if the function is “the top-left block being white means +1”,
then f = −1. However, if the target function is “a symmetric pattern means +1”, then
f = +1. So, the function is unknown, and since you gave me a finite sample, it can be
anything outside that sample. Now, how in the world am I going to tell what the function
is outside the data? How do I learn the correct hypothesis among an infinite number of them? We
will see; and we will see that, indeed, it’s possible to learn a good approximation
without having the target function a priori.

Lecture 1 - Q&A
Q1. How do you determine if a set of points is linearly separable, and what do you
do if they're not separable?
The linear separability assumption is a very simplistic assumption that mostly
doesn't apply in practice. And I chose it only because it goes with a very simple
algorithm, the perceptron learning algorithm.
There are two ways to deal with the case of linear inseparability. There are specific
algorithms, and most algorithms actually deal with that case, and there's also a
technique that we are going to study, which will take a set of points which is not
linearly separable, and create a mapping that makes them linearly separable in
another space. So there is a way to deal with it.
However, as for the question of how you determine whether a set is linearly separable: the right way
of doing it in practice is that, when someone gives you data, you assume in general
that it's not linearly separable. It will hardly ever be, and therefore you take techniques
that can deal with that case as well. There is a simple modification of the perceptron
learning algorithm, called the pocket algorithm, that applies the same
rule with a very minor modification and deals with the case where the data is not
separable.
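A sketch of that pocket idea, under stated assumptions (this is my illustrative rendering of the modification, not the lecture's code): run the usual PLA updates, but keep "in the pocket" the best weights seen so far, so a bad update can never lose you a good solution.

```python
import numpy as np

def pocket(X, y, iters=1000, seed=0):
    """PLA with the pocket modification, for possibly non-separable data."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(iters):
        mis = np.flatnonzero(np.sign(X @ w) != y)
        if mis.size == 0:
            return w                    # separable case: PLA converged
        i = rng.choice(mis)             # pick any misclassified point
        w = w + y[i] * X[i]             # the usual PLA update
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:              # pocket the weights only if better
            best_w, best_err = w.copy(), err
    return best_w
```

Because only improvements replace the pocketed weights, the returned hypothesis is never worse than the best one encountered during the run.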
However, if you apply the perceptron learning algorithm -- which is guaranteed to
converge to a correct solution in the linearly separable case -- to a problem that is not
linearly separable, bad things happen. Not only is it not going to converge
(obviously, since it terminates only when there are no
misclassified points), but it can go from a very good solution to a terrible solution in
one iteration. And that’s bad.
Q2. How does the rate of convergence (speed of convergence) of the perceptron
change with the dimensionality of the data?
Badly! That's the answer. Let me put it this way: you can build pathological cases
where it really will take forever. Remember: the perceptron is a very simple
algorithm, and in general it can behave very badly computationally.
Q3. Regarding the items for learning, you mentioned that there must be a pattern.
Can you be more specific about that? How do you know if there's a pattern?
You don't. When we get to the theory-- is learning feasible?-- it will become very
clear that there is a separation between the target function-- there is a pattern to
detect-- and whether we can learn it. The essence of it is that you take the data, you
apply your learning algorithm, and there is something you can explicitly detect
that will tell you whether you learned or not. So in some cases, you're not
going to be able to learn and in some cases you do. And the key is that you're going
to be able to tell by running your algorithm (and not by looking at the data).
Q4. Is the hypothesis set, in a topological sense, continuous?
The hypothesis set can be anything, in principle. So it can be continuous and it can
be discrete. For example, in the next lecture I take the simplest case where we have
a finite hypothesis set. In reality, almost all the hypothesis sets that you find are
continuous and infinite. And nonetheless, we will be able to see that under one
condition, which comes from the theory, we'll be able to learn even if the hypothesis
set is huge and complicated.
Q5. I don't understand the second example you gave about credit approval. So how
do we collect our data? Should we give credit to everyone, or should we make our
data biased? For example, should we give credit or not to persons we rejected?
So, let's say the bank uses historical records. So it sees the people who applied and
were accepted, and for those guys, it can actually predict what the credit behavior
is, because it has their credit history. Now, for those who were rejected, there's
really no way to tell in this case whether they were falsely rejected, that they would
have been good customers or not. The data set in this case is not completely
representative, and there is a particular principle in learning that we'll talk about,
which is sampling bias, that deals with this case.
Q6. How do you decide how much data is required for a particular
problem, in order to be able to come up with a reasonable model?
So let me tell you the theoretical answer and the practical answer. The theoretical answer
is that this is exactly the crux of the theory part that we're going to talk about. In
the theory, we are going to see whether we can learn, and how much data is
necessary to do so. All of this will be answered in a mathematical way. The practical
answer is: that's not under your control. When you are facing a learning
problem, you have however much data you have, and you need to construct a system
from that. So in practice, you really have no control over the data size in almost all
practical cases.
Q7. The larger the hypothesis set is, probably I'll be able to better fit the data. But,
as you were explaining, it might be a bad thing to do because when the new data
point comes, there might be troubles. So how do you decide the size of your
hypothesis set?
As we mentioned, learning is about being able to predict. So given the data, the
idea is not to memorize it, but to figure out what the pattern is. And if you figure out
a pattern that applies to all the data, and it's a reasonable pattern, then you have a
chance that it will generalize outside. Now the problem is that, if I give you points,
and you use a 10th-order polynomial, you will fit the heck out of the data. You will fit
it so much with so many degrees of freedom to spare, but you haven't learned
anything. You just memorized it in a fancy way. You put it in a polynomial form, and
that actually carries all the information about the data that you have and you don't
expect at all that this will generalize outside. And that intuitive observation will be
formalized when we talk about the theory.
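The memorization point above can be seen in a minimal sketch (assuming NumPy;
the noisy target and the number of points are hypothetical choices): a 10th-order
polynomial fit to 11 points passes through every one of them exactly, yet that
perfect fit tells you nothing reliable about points in between.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 11)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(11)  # noisy sample

coeffs = np.polyfit(x, y, deg=10)   # 11 coefficients for 11 points
fit = np.polyval(coeffs, x)

# The polynomial "learns" the sample perfectly (residual is round-off)...
print(np.max(np.abs(fit - y)))

# ...but between the sample points it is free to swing wildly,
# because all the degrees of freedom went into memorizing the sample.
x_new = np.linspace(-1, 1, 201)
print(np.max(np.abs(np.polyval(coeffs, x_new))))
```

Zero in-sample error here is exactly the "fancy memorization" described above,
not evidence of generalization.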
Q8. Suppose I have a data set and an algorithm, and I get an output. But wouldn't it
also be important to use the output on a new example, as a form of feedback?
You are alluding to different techniques here. But one of them would be validation,
which is after you learn, you validate your solution. And this is an extremely
established and core technique in machine learning that will be covered in one of
the lectures.
Q9. In practice, how many dimensions would be considered easy, medium, and hard
for a perceptron problem?
What people consider hard, before they get into machine learning, is the
computational time. For machine learning, the bottleneck has never been the
computation time, even with incredibly big data sets. The bottleneck for machine
learning is being able to generalize outside the data that you have seen.
So to answer your question, the perceptron can behave badly in terms of
computational behavior, but in general it is good in terms of generalization.
Q10. Also, in the example you explained the use of a binary output function. Can
you use multi-valued or real-valued functions?
Obviously there are hypotheses that cover all types of co-domains; y could be
anything.
Q11. In the learning process you showed, when do you pick your learning algorithm,
when do you pick your hypothesis set, and what liberty do you have?
The hypothesis set is the most important aspect of determining the
generalization behavior that we'll talk about. The learning algorithm does play
a role, although it is a secondary role, as we will see in the discussion.
So in general, the learning algorithm has the form of minimizing an error function.
So you can think of the PLA, what does this algorithm do? It tries to minimize the
classification error. That is your error function, and you're minimizing it using that
particular update rule (learning algorithm). So the question now translates into what
is the choice of the error function or error measure that will help or not help. And
that will be covered also next week under the topic, Error and Noise.
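The PLA description above can be sketched in a few lines (a minimal sketch
assuming NumPy; the toy data and the helper name `pla` are hypothetical): pick
any misclassified point (x, y) and apply the update rule w ← w + y·x, which nudges
the weights toward classifying that point correctly and drives down the
classification error.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm on linearly separable data.
    X: (N, d) inputs with a leading column of 1s for the threshold term."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        preds = np.sign(X @ w)
        wrong = np.nonzero(preds != y)[0]
        if wrong.size == 0:          # zero classification error: done
            return w
        i = wrong[0]                 # any misclassified point will do
        w = w + y[i] * X[i]          # the PLA update rule
    return w

# Toy separable data: labels come from sign(x1 + x2 - 0.5).
X = np.array([[1, 0, 0], [1, 1, 1], [1, 1, 0], [1, 0, 1]], dtype=float)
y = np.sign(X @ np.array([-0.5, 1.0, 1.0]))
w = pla(X, y)
print(np.all(np.sign(X @ w) == y))   # True once PLA has converged
```

The error function being minimized here is simply the number of misclassified
points; the update rule is the particular way PLA chases that minimum.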
Q12. Back to the perceptron, what happens if your hypothesis gives you exactly
zero in this case?
So, in the perceptron algorithm, remember that the quantity you compute and
compare with the threshold was your credit score. I told you what happens if you
are above threshold, and what happens if you're below threshold. So what happens
if your score is exactly the value of the threshold?
There are technical ways of defining that point. You can define it as zero, in which
case you are always making an error, because you are never +1 or -1, when you
should be. Or you could make it belong to the +1 category or to the -1 category.
But, as far as you're concerned, the easiest way to consider it is that the output will
be zero, and therefore you will be making an error regardless of whether it's +1 or
-1.
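The three conventions above can be illustrated in a small sketch (assuming NumPy;
the variable names are hypothetical). Note that NumPy's `np.sign` returns 0 at
exactly 0, which corresponds to the "always counted as an error" convention.

```python
import numpy as np

score = 0.0                  # credit score exactly at the threshold

# Convention 1: output 0 at the threshold -- neither +1 nor -1,
# so it is counted as an error regardless of the true label.
print(np.sign(score))        # 0

# Conventions 2 and 3: break the tie toward one of the classes.
tie_to_plus = 1 if score >= 0 else -1    # threshold point joins +1
tie_to_minus = 1 if score > 0 else -1    # threshold point joins -1
print(tie_to_plus, tie_to_minus)
```

Which convention you pick rarely matters in practice, since a score landing
exactly on the threshold is a measure-zero event for real-valued inputs.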
...
