
[MUSIC] Hi. In this video, we'll discuss linear models. They are among the simplest models in machine
learning, but linear models are building blocks for the deep neural networks that we will discuss in
our course. So they are quite important for us. Let's start with an example. Suppose you are given an
image and the goal is to count the sea lions in this image. This is a real-world problem that was
hosted on kaggle.com. So we want to write a program, a function a, that takes an image as input and
returns the number of sea lions in the image. Of course, we can come up with some heuristics, like
detecting the edges of the objects in the photograph and trying to count connected components. But
this approach is inferior to machine learning. In machine learning, we try to collect a labeled set of
images.


So, we try to collect 1,000 or maybe even 1 million such photographs and label them: we count the
sea lions. We have a ground truth for every image, and then we try to learn a function from the data.
We try to find a function a that fits this data best.


Let's give some definitions that will be very useful for us. An image, or any other object that we
try to analyze in machine learning, is called an example. And if it's an example that we train our
model on, it's a training example. We describe each example with d characteristics that we call
features. For example, for an image, the features are the intensities of every pixel in the image, or
something else. So, we have examples. And in supervised learning, we have target values: a
ground-truth answer for each example. For example, in the problem of counting sea lions, we have the
number of sea lions for every example, for every image. We denote these target values by y. So for
example x_i, the target value is y_i. As I said, in machine learning we try to collect a set of
labeled examples. We denote it by capital X, and it's a set of L pairs, each consisting of an example
with its feature description and its target value. And finally, we want to find a model, a function
that maps examples to target values. We denote a model, or hypothesis, by a(x), and the goal of
machine learning is to find a model that fits the training set X in the best way. There are two main
classes of supervised learning problems,
regression and classification. In regression, the target value is a real number. For example, if we
count sea lions, the target value is real. Actually, it's a natural number, but it's still regression.
Or, for example, if we are given a job description and we try to predict what salary will be offered
for this job, that's also regression, since salary is a real value. Or, for example, if we're given a
movie review from some user and we try to determine what rating this user will give to the movie on a
scale from one to five, it can also be solved as a regression problem. On the other hand, if the
number of target values is finite, it's a classification task. For example, if we want to recognize
some objects in images, say we want to find out whether there are cats or dogs or grass or maybe
clouds or a bicycle in the image, it's an object recognition task. Since the number of answers is
finite, there is a finite number of objects, we are solving a classification task. Or, for example, if
we are analyzing news articles and want to find out what topic an article belongs to, is it about
politics or sports or entertainment, then it's also a classification task, since the number of target
values is, once again, finite.
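
To make the definitions above concrete, here is a minimal Python sketch of a labeled training set and
of "learning a function from data". Everything in it is an assumption made for illustration: the
feature values, the targets, and the helper name fit_mean_baseline are invented, and the baseline
that always predicts the average target is only a stand-in for a real model, not a method from the
lecture.

```python
# A labeled training set: a list of L pairs (feature vector x_i, target value y_i).
# The numbers are invented; think of two pixel intensities and a sea lion count.
training_set = [
    ([0.1, 0.9], 3),
    ([0.4, 0.2], 5),
    ([0.8, 0.5], 4),
]


def fit_mean_baseline(examples):
    """Return a model a(x) that ignores x and predicts the mean training target."""
    targets = [y for _, y in examples]
    mean_target = sum(targets) / len(targets)

    def a(x):
        # A real model would use the features x; this baseline does not.
        return mean_target

    return a


a = fit_mean_baseline(training_set)
num_examples = len(training_set)          # L, the number of labeled pairs
num_features = len(training_set[0][0])    # d, the number of features per example
print(num_examples, num_features, a([0.3, 0.3]))
```

Here the targets are counts, so this is a regression setting; for classification the second element
of each pair would be a class label from a finite set instead.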


Let's discuss this very simple dataset. Each object, each example, is described with one feature and
has a real-valued target. Here is the dataset, and we can see that there is a linear trend. If the
feature increases two times, then the target decreases somewhere about two times. So maybe we could
use some linear model to describe this data, to build a predictive model. Here's the linear model:
a(x) = w_1 x + w_0. It's very simple and has just two parameters, w_1 and w_0. And if we find the best
weights, w_1 and w_0, then we'll have a model like this one. It describes the data very well. It isn't
perfect. It doesn't predict the exact target value for each example, but it fits the data quite well.
Of course, in most machine learning tasks, there are many features. So, we can use a generic linear
model like this one. It takes each feature x_j, multiplies it by the weight w_j, sums these products
over all the features, and then adds a bias term, b.
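
The generic linear model just described, a(x) = b + w_1 x_1 + ... + w_d x_d, can be sketched in a few
lines of Python. The function name linear_model and the numbers below are assumptions chosen for
illustration, not values from the lecture.

```python
def linear_model(x, w, b):
    """Return b + sum_j w[j] * x[j] for one example with feature vector x."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of components")
    return b + sum(w_j * x_j for w_j, x_j in zip(w, x))


# One feature: a(x) = w_1 * x + w_0, with w_1 = 2 and w_0 = 1.
one_feature = linear_model([3.0], [2.0], 1.0)

# Three features: d = 3 weights plus one bias, so d + 1 = 4 parameters.
three_features = linear_model([1.0, 0.0, 2.0], [0.5, -1.0, 0.25], 1.0)

print(one_feature, three_features)  # 7.0 2.0
```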


This is a linear model. It has d+1 parameters, where d is the number of features in our dataset. There
are d weights, or coefficients, and one bias term, b. It's a very simple model: for example, neural
networks have many more parameters for the same number of features. And to make it even simpler,
we'll suppose that in every sample there is a fake feature that always has a value of one. So the
coefficient of this feature is the bias. So in the following slides, we don't treat the bias
separately. We suppose it is among the weights. It is very convenient to write our linear model in
vector form. It's known from linear algebra that the dot product is exactly what we've written on the
previous slide: we multiply the vectors component by component and then sum the products. So, our
linear model is basically a dot product of the weight vector w and the feature vector x. And if we
want to apply our model to the whole training set, or maybe to some other set of examples, then we do
the following. We form a matrix from our sample. This matrix is capital X. It has L rows and d
columns. Each row corresponds to one sample, to one example, and each column corresponds to the
values of one feature on every example. Then, to apply our model to this set X, we multiply the
matrix X by the vector w, and that gives our predictions. This multiplication gives us a vector of
size L, and each component is the prediction of our linear model for one example.
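
As a rough illustration of the vector form, the sketch below appends the constant feature to every
example and then computes the matrix-vector product Xw, one dot product per row. The helper names
(add_constant_feature, dot, predict) and the numbers are assumptions made for this example; a real
project would normally use a linear algebra library instead of plain lists.

```python
def add_constant_feature(X):
    """Append the constant feature 1.0 to every row, so the last weight acts as the bias."""
    return [list(row) + [1.0] for row in X]


def dot(u, v):
    """Dot product: multiply component by component, then sum."""
    return sum(u_j * v_j for u_j, v_j in zip(u, v))


def predict(X, w):
    """Matrix-vector product Xw: one prediction per row (example) of X."""
    return [dot(row, w) for row in X]


X = add_constant_feature([[1.0, 2.0], [3.0, 4.0]])  # L = 2 rows; columns include the constant
w = [0.5, -1.0, 2.0]                                # the last entry plays the role of the bias
predictions = predict(X, w)
print(predictions)  # [0.5, -0.5]
```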


The next question in machine learning is how to measure the quality, or the error, of a model on
some set: the training set, or maybe a test set.


One of the most popular choices of loss function in regression is mean squared error. It goes like
this. We take a particular example, x_i. We calculate the prediction of our model for this example;
for the linear model, it is the dot product of w and x_i. Then we subtract the target value. So we
calculate the deviation of the predicted value from the target value. Then we take the square of it,
and then we average these squared deviations over the whole training set. This is mean squared error.
It measures how well our model fits the data. The smaller the mean squared error, the better the
model fits the data. And of course, we can write mean squared error in vector form. We multiply the
matrix X by the vector w and get a vector of predictions for all the examples in the set. Then we
subtract the vector of target values, of real answers, take the squared Euclidean norm of this vector,
and divide it by the number of examples, L. That is the same as the mean squared error I described
before.
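
A small sketch of mean squared error for a linear model follows. It averages the squared deviations
over all examples, which is the same number as the squared Euclidean norm of Xw - y divided by L. The
function name mse and the toy data are assumptions made for illustration.

```python
def mse(X, y, w):
    """Mean squared error of the linear model with weights w on the set (X, y)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(w_j * x_j for w_j, x_j in zip(w, x_i))  # dot product of w and x_i
        total += (prediction - y_i) ** 2                         # squared deviation
    return total / len(y)                                        # average over L examples


# Toy data: one real feature plus the constant feature 1.0 in the last column.
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [2.0, 4.0, 7.0]

loss_a = mse(X, y, [2.0, 0.0])   # predictions 2, 4, 6: only the last one is off, by 1
loss_b = mse(X, y, [0.0, 0.0])   # predicting zero everywhere fits much worse
print(loss_a, loss_b)
```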


So we have a loss function that measures how well our model fits the data. Then all we have to do is
minimize it with respect to w, our parameters. So, we want to find the parameter vector w that gives
us the smallest mean squared error on our training set. This is the essence of machine learning. We
optimize a loss to find the best model.


Actually, if you do some calculus, if you take derivatives and solve the equations, then you'll get
the analytical solution for this optimization problem. It goes like this: w = (X^T X)^(-1) X^T y. But
it involves inverting a matrix, which is a very expensive operation. And if you have more than 100 or
1,000 features, then it's very hard to invert the matrix X transposed times X. We can reduce this
problem to solving a system of linear equations, but it's still quite hard and requires lots of
computational resources. So later, we'll try to find a framework for better, more scalable
optimization of such problems.

In this video, we discussed linear models for regression. They are very simple, but they are very
useful for deep neural networks. We discussed mean squared error, a loss function for regression
problems, and found out that it has an analytical solution, but it's not very practical and it's hard
to compute. So in the following videos, we'll try to find a better way to optimize such models. But
first of all, in the next video, we'll discuss how to apply linear models to classification tasks.
[MUSIC]
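
The analytical solution mentioned in this video can be sketched as follows. The standard
least-squares result is that the best w satisfies the linear system (X^T X) w = X^T y; the sketch
solves that system with Gaussian elimination instead of forming an inverse. All function names and the
toy data are assumptions made for illustration, and a production implementation would normally rely
on a numerical library rather than this hand-written solver.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Eliminate this column from all rows below the pivot row.
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - known) / M[i][i]
    return w


def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y for the weight vector w."""
    X_t = transpose(X)
    gram = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in X_t] for row_i in X_t]
    rhs = [sum(x * y_i for x, y_i in zip(column, y)) for column in X_t]
    return solve_linear_system(gram, rhs)


# Toy data generated from y = 2 * x + 1; the last column is the constant feature.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = fit_least_squares(X, y)
print(w)  # close to [2.0, 1.0]
```

The elimination step costs on the order of d^3 operations for d features, which is the scaling
problem the lecture points to.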

In this video, we will discuss how to adapt linear methods to classification problems. Let's start
with the simplest classification problem, binary classification. Here we have only two values for the
target: minus one and one, the negative class and the positive class. Essentially, a linear model
calculates the dot product between w, the weight vector, and x, the feature vector. This dot product
is a real value, and we should somehow transform it to minus one or one. To do it, we can just take
the sign of the dot product. So the linear classifier looks like the sign of w transposed times x. It
has d parameters. And if you remember, we agreed that there is also a constant feature that has a
value of one on every example. So we don't have to explicitly include the bias in our model. The
coefficient of this constant feature will be the bias. So actually, there are d+1 parameters in this
model.

Geometrically, it looks like this. Suppose that we have two features, so the axes on this graph
correspond to our features, and we denote the negative class by red points and the positive class by
blue points. The linear model tries to find some line that separates the blue points from the red
points. And as we know from geometry, the sign of the dot product indicates on which side of the line
the point lies. So, if the dot product is positive, then the point lies on the positive side of this
line, and if the dot product is negative, then the point lies on the negative side of our line.

Okay. Let's switch to the multi-class classification problem with K classes, 1 to K. In this case, we
should use some more complicated techniques to build our classifier. One of the most popular
approaches is to build a separate classifier for each class. So, for example, for the first class,
we'll have a linear model, a linear classifier, that separates the points of the first class from all
other points. So essentially, we try to fit a model so that the points of the first class lie on the
positive side of this line, of this hyperplane, and the points from all other classes lie on the
negative side of this hyperplane. And the dot product of this model is essentially a score. The higher
the score, the more confident the model is that this point belongs to the first class. Then we build
such a model for every class, and we have K linear models. Each model calculates a score, and then we
assign our new example to the class that has the largest score, the class with the highest
confidence. For example, if we have three classes and our score vector is (7, -7.5, 10), then we
assign our example to the third class, because the third component of the score vector is the
largest.

Okay. Now we have the model, and we should somehow learn it. So we need a loss function. Let's start
with the simplest loss function, accuracy loss. To define it, we'll need some notation, Iverson
brackets. They are written as ordinary square brackets with some logical statement inside. If the
statement is true, then the value of the bracket is one, and if the statement is false, then the
value of the bracket is zero. So, now let's define the accuracy metric. Let's take an example x_i,
find the prediction a(x_i), compare it to the true class y_i, and write the equality of the predicted
class and the true class in Iverson brackets. So, the value of the bracket will be one if we guessed
the class correctly, and it will be zero if we misclassified this point. And then we just average
these brackets over all our data points, over the whole training set. So, essentially, accuracy is
just the fraction of correctly classified points in our training set.

This metric is good and easy to interpret, but it has two large disadvantages. First, we'll learn
from our next videos that we need a gradient to optimize our loss function effectively. Accuracy is
piecewise constant in the model parameters, so it doesn't provide a useful gradient. So we cannot
optimize it with gradient methods; we cannot train the model directly on the accuracy score. Second,
this metric doesn't take into account the confidence of our model in its predictions. We have the dot
product of the weight vector w and the feature vector x, and the larger this score, the more
confident the model is in its prediction. If this dot product has a positive sign and a large value,
then the model is confident. But if the sign is positive and the value is close to zero, then the
model is not confident. And we want not only a model that makes correct decisions, that guesses the
classes, but a confident model. It's known from machine learning that models with high confidence
generalize better. Okay. Accuracy doesn't fit, so we need some other loss function.
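
The pieces introduced so far in this video (the sign classifier, the one-model-per-class scheme with
the largest score winning, and accuracy as an average of Iverson brackets) could look roughly like
this in Python. The function names and the example numbers are assumptions made for illustration. In
particular, mapping a score of exactly zero to the positive class is an arbitrary tie-break that the
lecture does not specify.

```python
def dot(w, x):
    """Dot product of the weight vector w and the feature vector x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def predict_binary(w, x):
    """Binary linear classifier sign(w^T x); a zero score is mapped to +1 (assumption)."""
    return 1 if dot(w, x) >= 0 else -1


def predict_multiclass(W, x):
    """W holds one weight vector per class; return the 1-based class with the largest score."""
    scores = [dot(w_k, x) for w_k in W]
    return scores.index(max(scores)) + 1


def accuracy(predictions, targets):
    """Average of the Iverson brackets [prediction == target] over all examples."""
    brackets = [1 if p == t else 0 for p, t in zip(predictions, targets)]
    return sum(brackets) / len(brackets)


w = [1.0, -1.0, 0.0]                            # two features plus the constant feature
print(predict_binary(w, [2.0, 1.0, 1.0]))       # score  1.0, positive side
print(predict_binary(w, [0.0, 3.0, 1.0]))       # score -3.0, negative side

W = [[7.0], [-7.5], [10.0]]                     # chosen so the scores are 7, -7.5 and 10
print(predict_multiclass(W, [1.0]))             # 3

print(accuracy([1, -1, 1, 1], [1, 1, 1, -1]))   # 0.5
```

Note that accuracy changes only when some prediction flips, which is why small changes in the weights
usually leave it unchanged.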
Maybe we can use mean squared error. Suppose that we have some example x_i that belongs to the
positive class, to class one, and consider the squared loss on this example. We take the dot product
between w and x_i, compare it to one, and take the square of the difference. So, if our model
predicts one, then the guess is correct and the loss is zero. If our model gives a prediction between
zero and one, then it is not confident in its decision, and we penalize it for low confidence. If the
model gives a value lower than zero, then it misclassifies this point, so we give it an even larger
penalty. That's okay. But if the model predicts a value larger than one, then we also penalize it. We
penalize it for high confidence, and that's not very good. We should give a small or zero loss for
high-confidence decisions. Okay. So we can just take one branch of our squared loss: penalize low
confidence and misclassification, and give zero loss for high confidence. Actually, there are many
loss functions that look like this one, and each of them leads to its own classification method.

We'll discuss one of the methods most important for us, logistic regression. To talk about it, we
should first find a way to convert the scores from our linear classifiers to probabilities, to a
distribution. So, we have some vector of scores z, whose components are dot products w transposed
times x; these are the scores for each of our classes. Dot products can have any sign and any
magnitude, so we cannot interpret them as a probability distribution, and we should somehow change
that. We'll do it in two steps. In the first step, we take the first component of our vector and
raise e to the power of this component. We do the same with the second component, et cetera, up to
the last component. After this step, we have a vector e to the power of z that has only positive
coordinates. So now we only need to normalize these components to get a distribution. To do it, we
just sum all the components of this e-to-the-z vector and divide each component by the sum. After
that, we get a vector sigma of z that is normalized and has only non-negative components. So we can
interpret it as a probability distribution. This transform is called the softmax function, the
softmax transform. Consider an example with three classes and scores 7, -7.5, and 10. If we apply the
softmax transform to this vector, then we get the vector sigma of z with components of approximately
0.05, zero, and 0.95. So the third component was the largest before the transform, and it has the
largest probability after the softmax transform. Okay. Now we have an approach to transform our
scores into probabilities. These are the predicted probabilities of the classes.
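
The two softmax steps (exponentiate every score, then divide by the sum) can be sketched as below,
using the scores 7, -7.5 and 10 from the example. The function name softmax is an assumption, and so
is the subtraction of the largest score before exponentiating: that step is not part of the lecture,
it leaves the result unchanged mathematically, and it is a common precaution against overflow for
large scores.

```python
import math


def softmax(z):
    """Turn a list of real scores z into a probability distribution."""
    shift = max(z)                                # assumption: stability trick, not in the lecture
    exps = [math.exp(z_k - shift) for z_k in z]   # step 1: exponentiate, all values positive
    total = sum(exps)
    return [e / total for e in exps]              # step 2: normalize so the values sum to one


probs = softmax([7.0, -7.5, 10.0])
print([round(p, 2) for p in probs])  # [0.05, 0.0, 0.95]
```

The largest score, the third one, receives the largest probability, so the predicted class is the
same before and after the transform.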
And now we need some target vector, the vector that we want our probabilities to be equal to. Of
course, we want the probability of the true class to be one and the probabilities of all other
classes to be zero. So, we form a vector b. It's a target vector that is just a binary vector of size
K, where K is the number of classes, and it has a one in the component that corresponds to the true
class of the current example and zeros in all other coordinates. Now we have the target vector b and
the vector of predicted class probabilities sigma of z, and we should somehow measure the distance
between these probability distributions. To do it, we can use cross entropy. Essentially, cross
entropy is just minus the log of the predicted probability of the true class. We can also write it as
minus the sum over all classes k of the indicator that our class y equals k, multiplied by the log of
the predicted probability of class k.

Let's look at some examples of cross entropy. Suppose that we have three classes and our example
belongs to the first class, so y equals one. Suppose that we have some model that assigns a
probability of one to the first class and zero probabilities to the second and third classes. This
model makes a correct guess, and the cross entropy is zero, because it's a perfect model for us. If
we have a model that predicts 0.5 for the first class and 0.25 for each of the two other classes,
then the cross entropy is approximately 0.7. So there is some loss here. But if we have a model that
assigns a probability of one to the second class and zero probability to the first class, then the
cross entropy equals plus infinity, because we multiply one by the logarithm of zero. So, cross
entropy gives a very high penalty to models that are confident in wrong decisions.

Okay. Now we can just sum the cross entropies over all examples from our training set, and that will
be our loss function. It's quite complicated, and we cannot find an analytical solution for this
problem. So we need some numerical methods to optimize it, and we'll discuss these methods in the
following videos.

So in this video, we discussed how to apply linear models to classification problems, both binary and
multi-class, and discussed what a loss for classification problems should look like. One of the most
important methods for learning linear classifiers is logistic regression, and we discussed what the
loss looks like in this case. In the next video, we'll talk about gradient descent, a numerical
method that optimizes any differentiable loss function.
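
The three cross entropy examples from this video can be reproduced with a short sketch. The names
one_hot and cross_entropy are assumptions made for illustration. The code also assumes the usual
convention that a term with a zero target contributes nothing, and it returns infinity when the true
class is given probability zero, matching the "one times the logarithm of zero" case.

```python
import math


def one_hot(y, num_classes):
    """Target vector b: 1.0 in the position of the true class y (1-based), 0.0 elsewhere."""
    return [1.0 if k == y else 0.0 for k in range(1, num_classes + 1)]


def cross_entropy(target, probs):
    """Minus the sum over classes of target[k] * log(probs[k])."""
    loss = 0.0
    for t_k, p_k in zip(target, probs):
        if t_k == 0.0:
            continue             # convention: a zero target contributes nothing
        if p_k == 0.0:
            return math.inf      # true class was given zero probability
        loss -= t_k * math.log(p_k)
    return loss


b = one_hot(1, 3)                                  # the example belongs to the first class
perfect = cross_entropy(b, [1.0, 0.0, 0.0])        # 0.0
unsure = cross_entropy(b, [0.5, 0.25, 0.25])       # -log(0.5), about 0.693
wrong = cross_entropy(b, [0.0, 1.0, 0.0])          # inf
print(perfect, round(unsure, 3), wrong)
```

The training loss described above would then be the sum of cross_entropy over all examples, with
probs computed from each example's scores.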

In this video, we'll talk about gradient descent, a generic method that can optimize any
differentiable loss function. So, we already know loss functions for regression, like mean squared
error, and for classification, like cross-entropy. Of course, there are many other loss functions, and
it would be good to have some generic method that can take any differentiable loss function and find
its minimum. This method is gradient descent and its extensions.

So suppose that we have some loss function L(w) and we want to minimize it. How can we do it? Let's
take some initialization w_0. It's just some point on our surface. And of course, the surface of our
function can be very complicated. It can have multiple minima and maxima, like, for example, on this
graph. We want to find some local minimum of our function. Okay. So how can we improve w_0? We should
somehow find a direction in which the function decreases at this point and take a step in this
direction. To do it, we can recall some calculus. There is the gradient vector, which is essentially
the vector of partial derivatives with respect to all parameters of our function, all the w's. The
gradient points in the direction of steepest ascent of our function, and minus the gradient points in
the direction of steepest descent. So, if we want to minimize our function, to decrease the value of
the loss at w_0, we should just calculate the gradient at the point w_0 and step in the direction of
the anti-gradient, of minus the gradient. So, we take w_0, calculate the gradient, which is denoted
by nabla L of w_0, multiply it by some learning rate, some coefficient eta_t, here eta_1, and
subtract this gradient multiplied by eta_1 from the approximation w_0. This is how we get the next
approximation, w_1. Then we can continue this process.

So gradient descent looks like this. We initialize our parameters with w_0, and on each step we
calculate the gradient at the current point and take a step in the direction of the anti-gradient.
Then we check some stopping criterion. For example, if w_t, the parameter vector at this step, is
very close to w_{t-1}, the previous parameters, then we stop, because it looks like we've reached
some minimum. For example, if the level lines of our function look like these, we take some
initialization w_0. We calculate the gradient. We step in the direction of the anti-gradient and get
to the point w_1. Then we again calculate the gradient at this point, take a step, etc. And at some
point, we converge to a local minimum of this function.

Of course, there are many open questions about gradient descent. For example, how to initialize w_0?
Or how to select the step size eta_t? Should it be constant or change at every iteration? Or when to
stop? How to choose a stopping criterion? We've discussed one criterion, checking whether our new
parameter vector is close to the previous one, but there are many others. And of course, to calculate
the gradient of our function, we should calculate the gradient for every example from our training
set, because the loss function is essentially a sum of the losses on each example from the training
set. This can be expensive. So in practice, we usually approximate the gradient vector by some
methods that we'll discuss in the following videos.

Now that we have gradient descent, we can take any loss function, for example, mean squared error. We
can calculate its gradient, which, up to a constant factor, looks like X transposed multiplied by
(Xw minus y), and take steps in the direction opposite to this gradient. As you remember, there is an
analytical solution for mean squared error and linear regression, but it is inferior to gradient
descent. First of all, gradient descent is easier to implement. You don't need to solve systems of
linear equations to calculate the minimum of our function. Gradient descent is a very general
framework that allows us to minimize any differentiable loss function, not only mean squared error
but also cross-entropy and all other losses. And also, if we use stochastic versions of gradient
descent, which approximate gradients, then this method is much more efficient both in terms of
computation and memory.

In this video, we discussed gradient descent, a method that can optimize any differentiable function,
and discussed that it raises many questions, like how to choose the learning rate, how to initialize
w, and some others. We'll discuss them in the following videos and in the following weeks of our
course.
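
As a rough sketch of the procedure described in this video, the code below runs gradient descent on
mean squared error for a linear model. The update is w_t = w_{t-1} - eta * gradient, and the loop
stops when consecutive parameter vectors are very close. The function names, the constant learning
rate of 0.1, the tolerance, the step limit, the zero initialization, and the toy data are all
assumptions made for this example; the lecture leaves these choices open.

```python
def mse_gradient(X, y, w):
    """Gradient of (1/L) * ||Xw - y||^2 with respect to w, i.e. (2/L) * X^T (Xw - y)."""
    n = len(X)
    residuals = [sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - y_i
                 for x_i, y_i in zip(X, y)]
    return [2.0 / n * sum(r * x_i[j] for r, x_i in zip(residuals, X))
            for j in range(len(w))]


def gradient_descent(grad, w0, learning_rate=0.1, tolerance=1e-10, max_steps=100000):
    """Repeat w <- w - learning_rate * grad(w) until the step becomes very small."""
    w = list(w0)
    for _ in range(max_steps):
        g = grad(w)
        w_next = [w_j - learning_rate * g_j for w_j, g_j in zip(w, g)]
        # Stopping criterion: Euclidean distance between consecutive parameter vectors.
        step = sum((a - b) ** 2 for a, b in zip(w_next, w)) ** 0.5
        w = w_next
        if step < tolerance:
            break
    return w


# Toy data generated from y = 2 * x + 1; the last column is the constant feature.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = gradient_descent(lambda w_current: mse_gradient(X, y, w_current), w0=[0.0, 0.0])
print([round(w_j, 4) for w_j in w])  # [2.0, 1.0]
```

With a learning rate that is too large for the data, the same loop can diverge instead of converging,
which is one reason the choice of step size is listed among the open questions.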
