
Machine Learning Foundations: A Case Study Approach
Week 1:

Welcome
WEEK - Regression: Predicting House Prices (2 hours to complete)


https://www.coursera.org/lecture/ml-foundations/loading-exploring-house-sale-data-v8QUF
WEEK - Classification: Analyzing Sentiment (2 hours to complete)


False +ve (false positive)
https://www.coursera.org/lecture/ml-foundations/loading-exploring-product-review-data-5RD7r
Okay, so one way to represent this trade-off between something that's common locally but rare globally is something that's called TF-IDF, or term frequency-inverse document frequency.
We end up with the log of some large number (the number of documents) divided by 1 + another large number (the number of documents containing the word), and we'll say this is approximately log of 1, which is equal to 0. So what we see here is that we're gonna be very, very strongly downweighting, all the way to zero, the counts of any word that appears extremely frequently, that is, any word that appears in all of our documents.
We simply multiply these two factors together. So there's some numbers here, where the word "the" turns into a 0, and then these are some other numbers, and then the word "Messi" is gonna be upweighted, so a weight of 20. And again, there's some other computation we're doing for all the other words in our vocabulary. But the point that we wanna make here is the fact that these very common words like "the" get downweighted, and the rare and potentially very important words like "Messi" are getting upweighted.
Important TF-IDF
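As a rough sketch of that computation in Python (assuming raw word counts for term frequency and the log-ratio inverse document frequency described above; the tf_idf helper and the toy documents below are just for illustration):

```python
import math
from collections import Counter

def tf_idf(documents):
    # documents: list of token lists, e.g. [["the", "messi", ...], ...]
    # Term frequency: raw word counts within each document.
    tfs = [Counter(doc) for doc in documents]
    # Inverse document frequency, as described above:
    # idf(word) = log(# documents / (1 + # documents using the word)).
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    idf = {w: math.log(n_docs / (1 + df)) for w, df in doc_freq.items()}
    # TF-IDF: multiply the two factors together for each word in each document.
    return [{w: count * idf[w] for w, count in tf.items()} for tf in tfs]

# A word that shows up in every document, like "the", gets a weight near (or below) zero,
# while a word that shows up in few documents, like "messi", keeps a positive weight.
docs = [["the", "messi", "goal"], ["the", "game"], ["the", "score"]]
print(tf_idf(docs)[0])
```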
WEEK

2 hours to complete

Recommending Products
Never watched
They love romance, they really hate drama
That was for a specific combination of a movie and a user; let's talk about how we can think about representing our predictions over the entire set of users and movies.
This is called a matrix factorization model, because I'm taking this matrix, and approximating it
with this factorization here. But the key thing is the output of this, is a set of estimated parameters
here
Whereas the features that are discovered from matrix factorization can capture groups of users who behave similarly, so for example, women from Seattle who teach and are also moms. Okay, so the question is how we combine these two different approaches.

we can weight more heavily on our matrix factorization approach and use those learned features more strongly when we're forming our recommendations. So it's very simple to think about
combining the ideas of a user-specified feature-based model with our learned features from matrix
factorization.
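A minimal sketch of this kind of matrix factorization (assuming a squared-error objective on the observed ratings, plain gradient updates, and a small L2 penalty; names like n_factors are illustrative, not the course's notebook code):

```python
import numpy as np

def matrix_factorization(ratings, n_factors=10, n_iters=100, lr=0.01, reg=0.1):
    """Approximate a (users x movies) ratings matrix as L @ R.T.

    ratings: 2-D array with np.nan for unobserved entries.
    Returns user features L and movie features R, the estimated parameters.
    """
    n_users, n_movies = ratings.shape
    observed = ~np.isnan(ratings)
    L = 0.1 * np.random.randn(n_users, n_factors)   # one feature vector per user
    R = 0.1 * np.random.randn(n_movies, n_factors)  # one feature vector per movie
    for _ in range(n_iters):
        pred = L @ R.T
        err = np.where(observed, ratings - pred, 0.0)  # residuals on observed ratings only
        # Gradient steps on the squared error, with a small L2 penalty on the features.
        L += lr * (err @ R - reg * L)
        R += lr * (err.T @ L - reg * R)
    return L, R

# The predicted rating for user u and movie m is the dot product L[u] @ R[m];
# the rows of L and R play the role of the learned features discussed above.
```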
But another issue relates to the cost of making these different decisions.
When we talk about precision what we're gonna look at is all of the recommended items. So when
we're talking about recall our world when we were measuring this fraction here was looking at the
magenta boxes. That was the world that we looked at everything else could disappear from the
slide. But when we are talking about precision, we're gonna look at all the recommended items
and then everything else can disappear from the slide. So the recommended items are highlighted
by these green boxes.
we're thinking about precision we're thinking about basically how much garbage do I have to look
at compared to the number of items that I like. So, it's a measure of when I have a limited attention
span, how much am I gonna be wasting my efforts on products that I do not like? [MUSIC]
But what's the resulting precision of doing this? Well, it can actually be arbitrarily small. Because if
you have tons and tons and tons of products out there, and if I like only a very, very small number
of them, then if you recommend everything, I'm gonna have very small precision. So let's just say
this is small and maybe very small. So that's not really a great strategy. On the other hand, what
would be the optimal recommender? What's the best recommender you can imagine? Well, the
best recommender is one that recommends all the products I like but only the products I like. So
everything that was not liked was never shown by the recommender. So that would be great. No
wasted effort, capture everything I like, gonna make lots of money with this recommender system.
What's the precision and recall? Both are 1 in this case. And you can go through and verify that
using the equations from the previous slides. [MUSIC]
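For concreteness, here is a small sketch of how precision and recall could be computed over sets of recommended and liked items (the helper below is illustrative):

```python
def precision_recall(recommended, liked):
    """recommended, liked: sets of item ids."""
    true_positives = recommended & liked
    # Precision: of everything recommended, how much did the user actually like?
    precision = len(true_positives) / len(recommended) if recommended else 0.0
    # Recall: of everything the user likes, how much did we manage to recommend?
    recall = len(true_positives) / len(liked) if liked else 0.0
    return precision, recall

# Recommending everything: recall is 1, but precision can be arbitrarily small.
# The ideal recommender returns exactly the liked items: both are 1.
print(precision_recall({"a", "b", "c"}, {"b"}))   # (0.33..., 1.0)
print(precision_recall({"b"}, {"b"}))             # (1.0, 1.0)
```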
So now, let's talk about how we can use these metrics of precision and recall to compare our
different algorithms that we might think about using. And to do this we can draw something that's
called the precision recall curve.
What I'm gonna ask you to do is vary the threshold on how many items that recommender system is allowed to recommend to me.
Because you want that precision to be as large as possible for the constraint of recommending
that number of products. And so these are two examples of metrics you might use to compare
between different algorithms using this notion of precision and recall.
So we said there are about a million ratings, or listen counts, in this data set. Let's see how many users are involved here.
so basically, not that exciting. Everybody gets recommended the same things. This is a problem
with this model.
So if I take this same similarity model, I can also get similar items. Before, we could get similar users: for a given user, find other users that are like this one, like her. But here we're gonna get similar items, and the list I'm gonna give is a little different.
So, it does it pretty fast. It's comparing, it's evaluating first the performance of the popularity
model. That's model zero. Then the performance of the second model, model one, which was the
personalized model. And finally, what we get here are precision-recall curves, the last thing that we saw in the lectures, and this compares the recall and precision of these two models: the personalized model in green and the popularity model in blue. And the closer you are to the top here, the better. You see that adding personalization significantly improves the performance of our recommender. So, personalization's great. Just recommending popular items, not so great. And so, with this, we've built a song recommender and we explored it; we saw it recommend cool songs for somebody like me who likes U2 and Buena Vista Social Club. But of course you can go and explore all the things that you might like and see what other song recommendations you can come up with.
WEEK - Welcome (1 hour to complete)
Simple Linear Regression (3 hours to complete)


We have just a simple line, and a question is: what's the slope and what's the intercept of this line that minimizes this goodness-of-fit metric? So this blue curve I'm showing here is showing how good the fit is, where lower means a better fit, and we're gonna try and minimize over all possible combinations of intercepts and slopes.
This gradient descent algorithm is an iterative algorithm that's gonna take multiple steps and eventually go to this optimal solution that minimizes this metric measuring how well this line fits.
And a concave function is a function where, for any value of a and any value of b that you choose, when you go and draw the line segment between those two points on the function, it's always gonna lie below the actual curve of the function itself. (A function can also be neither concave nor convex.)
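Written as formulas (the standard definitions, matching the chord-below-the-curve picture above):

```latex
\text{concave: } g(\lambda a + (1-\lambda) b) \;\ge\; \lambda\, g(a) + (1-\lambda)\, g(b)
\qquad
\text{convex: } g(\lambda a + (1-\lambda) b) \;\le\; \lambda\, g(a) + (1-\lambda)\, g(b)
\qquad \text{for all } a, b \text{ and all } \lambda \in [0, 1].
```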
But we're looking at an optimization objective, where either our goal is to find the minimum or
maximum of a function. So typically if we're looking at a concave function, our interest is in finding
the maximum, and for convex it's in finding the minimum
Okay, so that's one way to find the maximum or minimum of a function depending on which
scenario we're in. It's just taking the derivative of the function, setting it equal to zero. And that
solution will be unique assuming we're either in this convex or concave situation.

So, if we're looking at these concave situations and our interest is in finding the max over all w of g(w), one thing we can look at is something called a hill-climbing algorithm, where it's going to be an iterative algorithm: we start somewhere in this space of possible w's and then we keep changing w, hoping to get closer and closer to the optimum.
So, this hill climbing algorithm, the way we can write it is, we say. While our algorithm has not
converged. So, while not converged, if I can spell converged. I'm gonna take my previous w,
where I was at iteration t. So this is an iteration counter. And I'm gonna move in the direction
indicated by the derivative. So, if the derivative of the function is positive, I'm going to be
increasing w, and if the derivative is negative, I'm going to be decreasing w, and that's exactly
what I want to be doing. But instead of moving exactly the amount specified by the derivative at
that point, we can introduce something, I'll call it eta. And eta is what's called a step size. It says, when I go, so let me just complete this statement here, so it's a little bit more interpretable. So, when I go to compute my next w value, I'm gonna take my previous w value and I'm going to move an amount based on the derivative, as determined by the step size. Okay, so let's look at a
picture of how this might work. So, let's say I happen to start on this left hand side at this w value
here. Compute the derivative, and I take a step. Determined by that step size. And at this point the
derivative is pretty large. This function's pretty steep. So, I'm going to be taking a big step. Then, I
compute the derivative. I'm still taking a fairly big step. I keep stepping increasing. What I mean by
each of these is I keep increasing w. Keep taking a step in w. Going, computing the derivative and
as I get closer to the optimum, the size of the derivative has decreased. And so if I assume that eta is just some constant (we'll get back to this in a couple slides), if I assume that there is just a fixed step size that I am taking, then I'm gonna be decreasing how much I'm moving as I get to the optimum. And if I end up on the other side of the optimum, then the derivative there is negative, it's going to push me back towards the optimum. So, eventually I'm gonna converge to the optimum itself. And note that if I'm close to the optimum, I'm not gonna jump very far. Again, because when you're close to the optimum, that derivative is really, really small. So this term here is gonna be really small and I'm not gonna be changing w very much. [MUSIC]
Well, we can use the same type of algorithm to find the minimum of a function. So, here, our
interest is min over all w g of w. And on this picture here, for this convex function, that's this point
right here. And, but let's think a little bit about what happens in this case. So let's say we're
starting at some w value here, and I'd like to know whether I should move again to the left or to the
right. So increase or decrease w? Well let's look at the derivative of the function. And what I see is
that the derivative is negative. The derivative is negative, and yet in this case, I want to be moving
to the right and increasing w. Now, let's look at a point on the other side of the optimum, so some
point w here. Look at the derivative. In this case the derivative is positive and when I ask whether I
want to move to the left or to the right, the answer in this case is, I want to decrease the value of
w. I want to move to the left. So what we're saying is that, when the derivative is positive we want
to decrease w and when the derivative is negative, you wanna increase w. So again, in this
picture I have that the derivative of this function g everywhere on the left-hand side of the
optimum, in this case, is negative, everywhere on the right-hand side is positive. So when I go to
do what I'm gonna call a hill descent algorithm to contrast with the hill climbing algorithm, the
update is gonna look almost exactly the same as the hill climbing. Except because of what we just
discussed, instead of having a plus sign here, and moving in the same direction, meaning the
same sign of the derivative, we're going to move in the opposite. Okay, so when the derivative,
just to be very clear, when the derivative is positive, what's going to happen? Well this term is
going to be negative, we're going to decrease w. When the derivative is negative, this term, this
joint term here is going to be positive, we're going to increase w. So that satisfies exactly what we
stated here. Okay. So that is finding the minimum of a convex function. And I wanna emphasize
this slide right here because we're gonna be looking at a lot of convex functions in this course, and
in this module. So this is really the picture that I want you to have in mind. [MUSIC]
okay, that's good enough. We're close enough to the optimum. I'm gonna terminate this algorithm.
And I'm gonna say that this is my solution. So what we're saying is that, what we're gonna need to
specify is something where we say: when the absolute value of the derivative (I don't care if I'm a little bit to the right or a little bit to the left of the optimum, just what the absolute value is) is less than some epsilon. This is a threshold I'm setting. Then, if this is satisfied, I'm gonna terminate the algorithm and return whatever solution I have, w at iteration t. So in practice, we're just gonna choose epsilon to be very small. And I wanna emphasize that what very small means depends on the data that you're looking at and what the form of this function is. What are the range of gradients we might expect? In practice, you can also look at a plot of the value of the function over iterations, and you'll tend to see the value decrease until it's basically not changing very much. And at that point, you know that you've converged. [MUSIC]
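A minimal sketch of this 1D loop in Python (assuming a fixed step size eta and the |derivative| < epsilon stopping rule just described; the function and its derivative are illustrative):

```python
def hill_descent(dg, w0, eta=0.1, epsilon=1e-6, max_iters=10000):
    """Minimize a convex 1-D function given its derivative dg.

    Repeatedly steps opposite the derivative: w <- w - eta * dg(w),
    stopping once |dg(w)| < epsilon (or after max_iters steps).
    """
    w = w0
    for _ in range(max_iters):
        deriv = dg(w)
        if abs(deriv) < epsilon:   # close enough to the optimum
            break
        w = w - eta * deriv        # for hill *climbing* (maximizing), use + eta * deriv
    return w

# Example: minimize g(w) = (w - 3)^2, whose derivative is 2*(w - 3).
print(hill_descent(lambda w: 2 * (w - 3), w0=0.0))   # approaches 3
```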
Let's talk about how we're going to move to functions defined over multiple variables.
If I want to look at the gradient at any point on this surface well I'm just going to plug in whatever
the W one and W zero values are at this point. So there's some W zero, W one value, and I'm
going to compute the gradient. And it'll be some vector. It's just some number in the first
component, some number in the second component, and that forms some vector. [MUSIC]
Value of this function
A slice of the 3D surface where all values here have the same value of this function. Sorry, our
functions are called g(w0, w1). So let me just step up and say what I'm trying to say here, which is
every w0, w1 pair along this ellipse here Has the same value of the function g because it was just
a flat slice through that 3D contour that we're looking at. Okay, so each of these rings have, in this
case, they are increasing values of the function as we go from blue, blue means a low value of the
function, all the way out to red, that means a high value of the function. We're looking at different
slices. We're slicing the function at different values, and that creates these different contours.
Okay, so this is what's called a contour plot, and it's useful because it's easier to work with 2D
when we're on a 2D surface here. So drawing things will be easier with this representation. Okay,
so that was just a little detour into contour plots. So that I can talk about the gradient descent
algorithm, which is the analogous algorithm to what I call the hill decent algorithm in 1D. But, in
place of the derivative of the function, we've now specified the gradient of the function. And other
than that, everything looks exactly the same. So what we're doing is, we now have a vector of parameters, and we're updating them all at once. We're taking our previous vector and updating it by subtracting eta times our gradient, which is also a vector. So, it's just the vector analog of the hill descent algorithm. But I wanna show this a little bit in pictures here. Again, switching back to red because it'll be easier to see on this plot. Well, if I'm out here at a point, the gradient is actually pointing in the direction of steepest ascent. So that's uphill. It's pointing
this way. But we're moving in the negative gradient direction. So let me specify that this thing here
is our gradient, gradient direction, but then our steps are gonna be in the opposite direction. So let
me actually draw- sorry to take up a little time here but I think it's worthwhile for clarity. Let me just
happen to draw the gradient so that it's a purple vector so it's different from the vectors I'm going
to be drawing right now. Okay. Cuz the other vectors that I'm gonna be drawing right now are the
steps of my gradient descent algorithm. So the actual steps I'm taking are gonna be moving here,
Towards this optimal value. So, it's exactly like what we saw in the 1D case, but now we're moving
it in a 2D space. Or really any dimensional space but what I'm drawing is just a 2D space. And in
terms of assessing convergence in this case, well, in place of looking at the absolute value of the derivative, we're going to look at the magnitude of the gradient. And when the magnitude of the gradient is less than some epsilon that we're fixing, we're gonna say that the algorithm has converged. [MUSIC]
[MUSIC] Okay, so now that all of you are optimization experts we can think about applying these
optimization notions and optimization algorithms that we described to our specific scenario of
interest. Which is searching over all possible lines and finding the line that best fits our data. So
the first thing that's important to mention is the fact that our objective is convex.

we know that our gradient descent algorithm will converge to this minimum.
so let's return to the definition of our cost, which is the residual sum of squares of our two
parameters, (w0, w1),
But let's hold off on that because what do we know is another way to solve for the minimum of this
function? Well, we know we can, just like we talked about in 1D, take the derivative and set it equal to zero; that was the first approach for solving for the minimum. Well, here we can
take the gradient and set it equal to zero. [MUSIC]

Take this gradient, set it equal to zero. Solve for W0 and W1. Those are gonna be our estimates
of our two parameters of our model that define our fitted line.
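For reference, the cost and its gradient for simple linear regression, consistent with the definitions above, are:

```latex
\text{RSS}(w_0, w_1) = \sum_{i=1}^{N} \bigl(y_i - (w_0 + w_1 x_i)\bigr)^2,
\qquad
\nabla \text{RSS}(w_0, w_1) =
\begin{bmatrix}
-2 \sum_{i=1}^{N} \bigl(y_i - (w_0 + w_1 x_i)\bigr) \\
-2 \sum_{i=1}^{N} \bigl(y_i - (w_0 + w_1 x_i)\bigr)\, x_i
\end{bmatrix}
```

Setting this gradient to zero gives the standard closed-form estimates \(\hat{w}_1 = \frac{\sum_i x_i y_i - \frac{1}{N}\sum_i x_i \sum_i y_i}{\sum_i x_i^2 - \frac{1}{N}(\sum_i x_i)^2}\) and \(\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}\).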
Then when we go to write our gradient descent algorithm, what does the algorithm say? Well, we have: while not converged, we're gonna take our previous vector of W0 at iteration t, W1 at iteration t, and what are we gonna do? We're gonna subtract. Going to write it up here, we're going to
subtract eta times the gradient,
and that is my update to form my next estimate of w0 and w1.
Well, this term here is positive. We're multiplying by a positive thing, and adding that to our W. So,
W zero is going to increase. And that makes sense, because we have some current estimate of
our regression fit. But if generally we're under predicting our observations that means probably
that line is too low. So, we wanna shift it up. And what does that mean? That means increasing
W0. So, there's a lot of intuition in this formula for what's going on in this gradient descent
algorithm. And that's just talking about this first term W0, but then there's this second term W1,
which is the slope of the line. And in this case there's a similar intuition. So, I'll say similar intuition,
For W1. But we need to multiply by this xi, accounting for the fact that this is a slope term. Okay.
So that's our gradient decent algorithm for minimizing our residual sum of squares where, when
we assess convergence what we're gonna output is w hat zero, W hat one. That's going to be our
fitted regression line. And this is an alternative approach to studying the gradient equal to zero
and solving for W hat zero and W hat one in that way. [MUSIC]
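A small sketch of that loop (assuming NumPy arrays x and y, a fixed step size, and the gradient above; illustrative code rather than the course notebook):

```python
import numpy as np

def simple_regression_gd(x, y, eta=1e-3, epsilon=1e-6, max_iters=100000):
    """Fit y ~ w0 + w1 * x by gradient descent on the residual sum of squares."""
    w0, w1 = 0.0, 0.0
    for _ in range(max_iters):
        residuals = y - (w0 + w1 * x)
        # Partial derivatives of RSS with respect to the intercept and slope.
        d_w0 = -2.0 * residuals.sum()
        d_w1 = -2.0 * (residuals * x).sum()
        if np.sqrt(d_w0**2 + d_w1**2) < epsilon:   # magnitude of the gradient
            break
        w0 -= eta * d_w0   # under-predicting on average pushes the intercept up
        w1 -= eta * d_w1
    return w0, w1

# x = np.array([...]); y = np.array([...])
# w0_hat, w1_hat = simple_regression_gd(x, y)
```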
[MUSIC] Okay, so let's take a moment to compare the two approaches that we've gone over,
either setting the gradient equal to zero or doing gradient descent. Well, in the case of minimizing
residual sum of squares, we showed that both were fairly straightforward to do. But for a lot of the machine learning methods that we're interested in, if we take the gradient and set it equal to zero, well, there's just no closed-form solution to that problem. So, often we have to turn to methods like gradient descent. And likewise, as we're gonna see in the next module, where we turn to having lots of different inputs, lots of different features in our regression, even though there might be a closed-form solution to setting the gradient equal to zero, sometimes in practice it can be much
more efficient computationally to implement the gradient descent approach. And finally one thing
that I should mention about the gradient descent approach is the fact that in that case, we had to
choose a stepsize and a convergence criteria. Whereas, of course, if we take the gradient and are
able to set it to zero, we don't have to make any of those choices. So that is a downside to the
gradient descent approach, is having to specify these parameters of the algorithm. But sometimes
we're relying on these types of optimization algorithms to solve our optimization objective.
[MUSIC]
When Center City was in our dataset, we said that the average house value decreased by an amount of $576 per unit increase in crime rate. Remember, we know how to interpret these coefficients, and that's what I'm doing right now. In contrast, when I remove Center City, what is the predicted decrease in house value that I'm getting per unit increase in crime rate? Now, just removing one observation, my predicted
decrease is $2,287. That's significantly different. So when I'm going and I'm making an
interpretation about how much crime rate affects drops in house value. I have significantly
different interpretations when I include Center City in the dataset versus removing it.
[MUSIC] So that's our discussion on high leverage points and influential observations, but I wanna think about another case, so let's go back to our data. I think it's easier to discuss this looking at our observations. Here, what we see on the top part is a collection of five different observations. These are five different towns that have very high value compared to what you see for all of the other towns. So the question is, even though these points aren't high leverage points, because they are in this typical x range, are they influential observations? Meaning, if we remove these observations, will the fit change very much?
So, now let's just see what happens in this data set. Okay, so we're gonna remove these, what
we're saying here we're gonna remove these high value outlier neighborhoods and redo our
analysis. So what we're doing here is we're creating a data set, which I'm gonna call sales
underscore no high end for no high end towns. Which takes our data set, still with center city
removed, and just filters out all the towns that have average values greater than $350,000. Okay,
so let's fit this new data set. And again, let's compare coefficients. So I'm gonna compare the
coefficients to our fit with Center City removed to the fit that further removes these high end
houses, or sorry, these high end towns. And what you see is, yeah, there is some influence on the estimated coefficient, but not nearly as significant as what we saw by simply removing Center City. So in this case, we've removed five observations out of a total of 97 observations. And we
see that impact of crime rate on predicted decrease and house value changes by a couple
hundred dollars, but not by the amount that we saw by just removing that one center city
observation earlier on. So this shows that high leverage points can be much more likely to be influential observations, even for just small deviations from the data set, than outlier observations that are within our typical x range. Okay, so the summary of all of this analysis and discussion is
the fact that when you have your data, and you're making some fit and making predictions or
interpreting the coefficients, it's really, really important to do some data analysis, to do visualizations of your data or different checks for whether you have these high leverage points or these outlier observations, and to check whether they might potentially be these influential observations. Because that can dramatically change how you're interpreting or what you're
predicting based on your estimated fit. [MUSIC]

And the question is, do we actually believe that is the case? And what happens if there might not be symmetric costs to these errors? What if the cost of listing my house sales price as too high is
bigger than the cost if I listed it as too low? So for example, if I list the value as too high, then
maybe no one will even come see the house. Or they come see it and they say oh it's definitely
not worth it.

So as a seller, that's a big cost to me. I've gone through this whole thing of trying to sell my house.
I want or need to sell my house. And I get no offers. On the other hand, if I list the sales price as
too low, of course I won't get offers as high as I could have if I had more accurately estimated the
value of the house. But I still get offers, and maybe that cost to me is less bad than getting no
offers at all. For example if I have to move to another state I have no choice but to sell my house.
Okay, so in this case it might be more appropriate to use an asymmetric cost function where the
errors are not weighed equally between these two types of mistakes. And instead of this dashed orange line here, which represents our fit when we're minimizing residual sum of squares, I will get some different solution. Again, sorry, I love to write over my animations. Here, this is the fit minimizing residual sum of squares, and this other orange line here is the other solution using an asymmetric loss, or let me just say asymmetric cost, where I prefer to underestimate the value rather than overestimate it. And that's what you see
here is, in general, we're predicting the values as lower. As compared to the line that we got or our
predictions using residual sum of squares. Okay so that's just a little bit of intuition about what
would happen using different cost functions and again we're gonna talk a lot more about this later
on in this course. [MUSIC]
WEEK - Multiple Regression (3 hours to complete)
And it appears in so many applications. Another one that you might not think of is Motion capture,
just trying to model how a person walks over time. And if you look at the data, if you put sensors
over a person's body and look at how they walk, if you take those recordings, you're gonna get
these kind of up and down, up and down swings, as the person's going through their different
motions, raising their knees or swinging their arms as they walk. And so in this plot here, I'm looking at some sensor trajectories from a person wearing a motion capture suit as they're going through different
behaviors, and you clearly see this type of seasonality here as well. [MUSIC]
so on. And we're gonna put these all together into a vector x. And I'm gonna use this bold-faced
notation to represent the fact that this is a vector. In some communities they might put a bar
under, a little arrow over, or there are lots of other options. But we're gonna use bold x to denote
some little d dimensional vector, meaning there are little d different inputs in our model. Okay,
then we're going to assume that our output is just some scalar, so that's just gonna be a normal y,
no boldness to it. Might be very bold, but it's not bold faced. Okay, so our notational conventions
are gonna be, bold x, square brackets of j is gonna take this vector x. And just like you would in
Python, it's gonna grab out that jth element. Okay, so the result of that is gonna be the jth input to
our model, and that's just a scalar, as opposed to this entire d-dimensional vector. Then, when we use bold x sub i, what we're saying is that's our
ith input, so the input associated with our ith observation, the ith house in our data set for our
housing application. And then just to be very, very clear. Bold x, sub i, square brackets j, means
we're gonna look at this d dimensional input vector for this ith house. So, the number of square
feet, bedrooms, bathrooms, etc., for this ith house. And we're gonna grab out the jth input, which
might be number of bathrooms. Okay, so that's, again, just a scalar.
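As a quick illustration of this notation in code (a hypothetical NumPy array, just to mirror the indexing convention):

```python
import numpy as np

# X holds one d-dimensional input per row: X[i] is bold x_i (the ith house),
# and X[i, j] is x_i[j], the jth input (e.g. number of bathrooms) for that house.
X = np.array([[2500, 3, 2.5],    # sqft, bedrooms, bathrooms for house 0
              [1800, 2, 1.0]])   # ... for house 1
y = np.array([650000.0, 420000.0])  # one scalar output per house

print(X[1])     # the full input vector for house i = 1
print(X[1, 2])  # its jth element, j = 2 (bathrooms)
```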
So what would be the interpretation of the jth coefficient? Well I have to think about fixing
everything else in the model, but if everything else is a power of this one input that I'm changing, I
can't do that. So here, unfortunately, we can't think about interpreting the coefficients in the way
that we did in the simple hyperplane example of the previous slide. So, just a little word of warning
that if you're ever in a situation where you can't hold all your other features fixed, then you can't
think about interpreting the coefficient. [MUSIC]
Our algorithm is going to search over all these different fits to minimize this cost. And so the result here is -2H^T(y - Hw). So we have the -2 in both cases; the little scalar h becomes this big matrix H in our case, and y - hw in the scalar case becomes this big vector-matrix expression y - Hw here. Okay, so just
believe that this is the gradient. We didn't wanna bog you down in too much linear algebra, or too
much in terms of derivatives. But if we have this notation, then we can derive everything we need
to for our two different solutions to fitting this model. [MUSIC]
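In matrix notation, what is being described is:

```latex
\text{RSS}(\mathbf{w}) = (\mathbf{y} - H\mathbf{w})^\top (\mathbf{y} - H\mathbf{w}),
\qquad
\nabla \text{RSS}(\mathbf{w}) = -2\, H^\top (\mathbf{y} - H\mathbf{w})
```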
Okay so if you have lots and lots and lots of features this can be really, really, really
computationally intensive to do. So, computationally intensive that it might actually be
computationally impossible to do. So, especially if we're looking at applications with lots and lots of
features, and again assuming we have more observations still than these number of features,
we're gonna wanna use some other solution than forming this big matrix and taking its inverse.
Even though there are actually some really fancy ways of doing this matrix inverse, and so know
that those fancy ways exist, but still, there are some very simple alternatives to this closed-form
solution. [MUSIC]
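For reference, the closed-form solution from setting that gradient to zero is \(\hat{\mathbf{w}} = (H^\top H)^{-1} H^\top \mathbf{y}\); here is a minimal NumPy sketch (illustrative, and using a linear solve rather than an explicit inverse):

```python
import numpy as np

def closed_form_regression(H, y):
    """Solve the normal equations (H^T H) w = H^T y for the least squares weights.

    H: (N, D) feature matrix (include a column of ones for the intercept).
    Forming and solving with H^T H costs roughly O(N D^2 + D^3), which is why
    this becomes expensive (or infeasible) when there are very many features.
    """
    return np.linalg.solve(H.T @ H, H.T @ y)

# H = np.column_stack([np.ones(len(sqft)), sqft, bedrooms, bathrooms])
# w_hat = closed_form_regression(H, y)
```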
Observations that have more of the feature, for example more bathrooms, should weigh more heavily in our assessment of the fit. So, that's why whenever we look at the residual, we weight by the value of the feature that we're considering. Okay, so this gives us a little bit of intuition behind this gradient descent algorithm, particularly looking feature by feature at what the algorithm looks like. [MUSIC]
[MUSIC] Now, let's actually summarize the entire gradient descent algorithm for multiple
regression. Stepping through, very carefully, every step of this algorithm. So in particular at first,
what we're gonna do, is we're just gonna initialize all of our different parameters to be zero at the
first iteration. Or you could initialize them randomly or you could do something a bit smarter. But
let's just assume that they're all initialized to zero and we're gonna start our iteration counter at
one. And then what we're doing is we're saying while we're not converged, and what was the
condition we talked about before in our simple regression module for not being converged? We
said while the gradient of our residual sum of squares, the magnitude of that gradient is sufficiently
large. Larger than some tolerance epsilon then we were going to keep going. So what is the
magnitude of residual sum of squares? Here this thing just to be very explicit is the square root of
the square, so what are the elements of the gradient of residual sum of squares? Well, it's a vector
where every element is the partial derivative with respect to some parameter. I'm gonna refer to
that as partial of j, okay? So, when I take the magnitude of the vector, I multiply the vector by its transpose and take the square root. That's equivalent to saying I'm gonna sum up the partial
derivative with respect to the first feature squared plus all the way up to the partial derivative of the
capital Dth feature. Sorry, I guess I should start, the indexing really starts with zero, squared, and
then I take the square root. So if the result of this is greater than epsilon then I'm gonna continue
my gradient descent iterates. If it's less than epsilon then I'm gonna stop. But let's talk about what
the actual iterates are. Well, for every feature in my multiple regression model, first thing I'm going
to do is I'm going to calculate this partial derivative, with respect to the jth feature. And I'm going to
store that, because that's going to be useful in both taking the gradient step as well as monitoring
convergence, as I wrote here. So this jth partial, we derived it on the previous slide; it has this form. And then my gradient step takes that jth coefficient at time t and subtracts my step size
times that partial derivative. And then once I cycle through all the features in my model, then I'm
gonna increment this t counter. I'm gonna check whether I've achieved convergence or not. If not
I'm gonna loop through, and I'm gonna do this until this condition, this magnitude of my gradient is
less than epsilon. Okay, so I wanna take a few moments to talk about this gradient descent
algorithm. Because we presented it specifically in the context of multiple regression, and also for
the simple regression case. But this algorithm is really, really important. It's probably the most
widely used machine learning algorithm out there. And we're gonna see it when we talk about
classification, all the way to talking about deep learning. So even though we presented this in the
context of multiple regression, this is a really really useful algorithm, actually an extremely useful
algorithm, as the title of this slide shows. [MUSIC]
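A compact sketch of this whole loop (assuming a NumPy feature matrix H whose first column is all ones, a fixed step size, and the gradient above; illustrative rather than the course's notebook code):

```python
import numpy as np

def multiple_regression_gd(H, y, eta=1e-7, epsilon=1e-3, max_iters=100000):
    """Gradient descent on RSS(w) = ||y - H w||^2 for multiple regression."""
    w = np.zeros(H.shape[1])            # initialize all coefficients to zero
    for _ in range(max_iters):
        residuals = y - H @ w
        gradient = -2.0 * (H.T @ residuals)         # vector of partial derivatives
        if np.sqrt(gradient @ gradient) < epsilon:  # magnitude of the gradient
            break                                   # converged
        w -= eta * gradient             # step opposite the gradient
    return w

# H = np.column_stack([np.ones(N), sqft, bedrooms, bathrooms])
# w_hat = multiple_regression_gd(H, y)
```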
WEEK - Assessing Performance (2 hours to complete)

So for example in the housing application, if we list the house value as too low, then maybe we
get low offers. And that's a cost to me relative to having made a better prediction. Or if I list the
value as too high, maybe people don't come see the house and I don't get any offers. Or maybe
people notice that not many people are showing up to look at the house and they make me a very
low offer. So, again, I'm in the situation of being in a worse financial state having made a poor
prediction of the value of my house. So a question is, how much am I losing compared to having
made perfect predictions? Of course we can never make perfect predictions, the way in which the
world works is really complicated, and we can't hope to perfectly model that as well as the noise
that's inherent in the process of any observations we might see. But let's just imagine that we
could perfectly predict the value, then we'd say, in that case, our loss is 0. We're not losing any
money because we did perfectly. So a question is, how do we formalize this notion of how much
we're losing? And in machine learning, we do this by defining something called a loss function.
And what the loss function specifies is the cost incurred when the true observation is y, and I
make some other prediction. So, a bit more explicitly, what we're gonna do, is we're gonna
estimate our model parameters. And those are w hat. We're gonna use those to form predictions.
So, this notation here, f sub w hat is something we've equivalently written as f hat, but for reasons
that we'll see later in this module, this notation is very convenient. And what it is, is it's our
predicted value at some input x. And y is the true value. And this loss function, L, is somehow
measuring the difference between these two things. And there are a couple ways in which we
could define loss function. Well, there's actually many, many ways, but I'm just gonna go through
a couple examples. And in particular, these examples that I'm gonna go through assume that the
cost you incur by doing an overestimate, relative to an underestimate, are exactly the same. So
there's no difference in listing my house as $1,000 too high, relative to $1,000 too low. Okay, so
we're assuming what's called a symmetric loss function in these examples. And very common
choices include assuming something that's called absolute error, which just looks at the absolute
value of the difference between your true value and your predicted value. And another common
choice is something called squared error, where, instead of just looking at the absolute value, you
look at the square of that difference. And so that means that you have a very high cost if that
difference is large, relative to just absolute error. So as we're going through this module, it's useful
to keep in the back of your mind this quote by George Box. Which says that, Remember that all
models are wrong; the practical question is how wrong do they have to be to not be useful. Okay,
so we have spent a lot of time defining different models, and now we're gonna have tools to
assess the performance of these methods, to think about these questions of whether they can be
useful to us in practice. [MUSIC]
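The two symmetric loss functions just mentioned, written out:

```latex
\text{Absolute error: } L\bigl(y, f_{\hat{\mathbf{w}}}(\mathbf{x})\bigr) = \bigl|\,y - f_{\hat{\mathbf{w}}}(\mathbf{x})\,\bigr|
\qquad
\text{Squared error: } L\bigl(y, f_{\hat{\mathbf{w}}}(\mathbf{x})\bigr) = \bigl(y - f_{\hat{\mathbf{w}}}(\mathbf{x})\bigr)^2
```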
So, what's going wrong here? The issue is the fact that training error is overly optimistic when
we're going to assess predictive performance. And that's because these parameters, w-hat, were
fit on the training data. They were fit to minimize this training error. Sorry, minimize residual sum
of squares, which can often be related to training error. And then we're using training error to
assess predictive performance but that's gonna be very very optimistic as this picture shows. So,
in general, having small training error does not imply having good predictive performance unless
your training data set is really representative of everything that you might see there out in the
world. [MUSIC]
[MUSIC] So, instead of using training error to assess our predictive performance. What we'd really
like to do is analyze something that's called generalization or true error. So, in particular, we really
want an estimate of what the loss is averaged over all houses that we might ever see in our
neighborhood.
So specifically we estimate our model parameters on our training data set so that's what gives us
w hat. That defines the model we're using for prediction, and then we have our loss function,
assessing the cost of predicting f, this f sub w hat at our square foot x when the true value was y.
And then what we're gonna do is we're gonna average over all possible (x, y) pairs, but weighted by how
likely they are according to those distributions over square feet and value given square feet.
But importantly, in contrast to training error we can't actually compute generalization error.
Because everything was relative to this true distribution, the true way in which the world works.
How likely houses are to appear in our dataset over all possible square feet and all possible house
values. And of course, we don't know what that is. So, this is our ideal picture or our cartoon of
what would happen. But we can't actually go along and compute these different points. [MUSIC]
And when we go to fit our models, we're just going to fit our models on the training data set. But
then when we go to assess our performance of that model, we can look at these test houses, and
these are hopefully going to serve as a proxy of everything out there in the world. So hopefully,
our test data set is a good measure of other houses that we might see, or at least in order to think
of how well a given model is performing.
This has nothing to do with our model nor our estimation procedure, it's just something that we
have to deal with. And so this is called Irreducible error because it's nothing that we can reduce
through choosing a better model or a better estimation procedure.
let's get back to this notion of bias. So what we are saying is, over all possible data sets of size N
that we might have been presented with of house sales, what do we expect our fit to look like? So
for one data set of size N we get this fit. Here's another dataset. Here's another data set. Or the
fits associated with those data sets. And of course there's a continuum of possible fits we might
have gotten. And for all those possible fits, here this dashed green line represents our average fit,
averaged over all those fits weighted by how likely they were to have appeared. Okay, so now we
can start talking about bias. What bias is, is it's the difference between this average fit and the true
function, f true. Okay, so, that's what this equation shows here, and we're seeing this with this
gray shaded region. That's the difference between the true function and our average fit. And so
intuitively what bias is saying is, is our model flexible enough to on average be able to capture the
true relationship between square feet and house value. And what we see is that for this very
simple constant model, this low complexity model has high bias. It's not flexible enough to have a
good approximation to the true relationship. And because of these differences, because of this
bias, this leads to errors in our prediction. [MUSIC]
as I'm looking at different possible data sets?
And if they could vary dramatically from one data set to the other, then you would have very
erratic predictions. Your prediction would just be sensitive to what data set you got. So, that would
be a source of error in your predictions. And to see this, we can start looking at high-complexity
models.
to write is "sweet spot", and this is what we'd love to get at. That's the model complexity that we'd want. But, just like with generalization error, and I'm gonna write this down alongside generalization error, can we compute this? So, think about that while I'm writing. We cannot compute bias, variance, or mean squared error. And why? Well, the reason is because, just like with
generalization error, they were defined in terms of the true function. Well, bias was defined very
explicitly in terms of the relationship relative to the true function. And when we think about defining
variance, we have to average over all possible data sets, and the same was true for bias too. But
all possible data sets of size n, we could have gotten from the world, and we just don't know what
that is. So, we can't compute these things exactly. But throughout the rest of this course, we're
gonna look at ways to optimize this tradeoff between bias and variance in a practical way.
[MUSIC]
So they converge to exactly the same point in the limit. Where that difference again, is the bias
inherent from the lack of flexibility of the model, plus the noise inherent in the data. Okay, so just
to write this down: in the limit, as I'm getting lots and lots of data points, this curve is gonna flatten out to how well the model can fit the true relationship f sub true. Okay, so I feel like I should annotate here also, saying that in the limit our true error equals our training error. Okay, so what we've seen so far in
this module are three different measures of error. Our training, our true generalization error as well
as our test error approximation of generalization error. And we've seen three different
contributions to our errors. Thinking about that inherent noise in the data and then thinking about
this notion of bias in variance. And we finally concluded with this discussion on the tradeoff
between bias in variance and how bias appears no matter how much data we have. We can't
escape the bias from having a specified model of a given complexity. Okay, and in the subsequent
few videos, we're gonna look at these notions more formally. [MUSIC]
So in these cases, what we'd like to do is average our performance over all possible fits that we might get. What I mean by that is all possible training data sets that might have appeared, and the resulting fits on those data sets. So formally, we're gonna define this thing called expected
prediction error which is the expected value of our generalization error, over different training data
sets.
so all those other factors out there in the world are captured by our noise term, which here we
write as just an additive term plus epsilon. So epsilon is our noise, and we said that this noise term
So between a given square feet and the house value whatever the true relationship is between
that input and the observation versus this average relationship estimated over all possible training
data sets. So that is the formal notion of bias of xt, and let's just remember that when it comes in
as our error term, we're looking at bias squared.

But the thing that we're really interested in, is over all possible fits we might see. How much do
they deviate from this expected fit? So thinking about again, specifically at our target xt, how much
variation is there in the training dataset specific fits across all training datasets we might see? And
that's this variance term and now again, let's define it very formally. Well let me first state what
variance is in general. So variance of some random variable is simply looking at the expected
value of that random variable minus its mean, squared. So in this context, when we're looking at the
variability of these functions at xt,
Then I take this, squared, and it represents a notion of how much deviation a specific fit has from the expected fit at xt. And then, when I think about what the expectation is taken over, it's taken over
all possible training data sets of size N. So that's my variance term. And when we think intuitively
about why it makes sense that we have the sum of these three terms in this specific form. Well
what we're saying is variance is telling us how much can my specific function that I'm using for
prediction. I'm just gonna use one of these functions for prediction. I get a training dataset that
gives me an f sub w hat, I'm using that for prediction. Well, how much can that deviate from my
expected fit over all datasets I might have seen. So again, going back to our analogy, I'm a real
estate agent, I grab my data set, I fit a specific function to that training data. And I wanna know
well, how wild of a fit could this be relative to what I might have seen on average over all possible
datasets that all these other realtors are using out there? And so of course, if the function from
one realtor to another realtor looking at different data sets can vary dramatically, that can be a
source of error in our predictions. But another source of error which the biases is capturing is over
all these possible datasets, all these possible realtors. If this average function just can never
capture anything close to their true relationship between square feet and house value, then we
can't hope to get good predictions either and that's what our bias is capturing. And why are we
looking at bias squared? Well, that's putting it on an equal footing of these variance terms
because remember bias was just the difference between the true value and our expected value.
But these variance terms are looking at these types of quantities but squared. So that's intuitively
why we get bias squared. And then finally, what's our third source of error? Well, let's say I have no variance in my estimator, or always very low variance, and the model happens to be a very good fit, so neither of these things are sources of error; I'm doing basically magically perfectly on my
modeling side, while still inherently there's noise in the data. There are things that just trying to
form predictions from square feet alone can't capture. And so that's where irreducible error or this
sigma squared is coming through. And so intuitively this is why our prediction errors are a sum of
these three different terms that now we've defined much more formally. [MUSIC]
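Putting the three terms together, the decomposition being described at a target point x_t takes the standard form:

```latex
\mathbb{E}\Bigl[\bigl(y_t - f_{\hat{\mathbf{w}}}(x_t)\bigr)^2\Bigr]
= \underbrace{\sigma^2}_{\text{irreducible error}}
+ \underbrace{\Bigl(f_{\text{true}}(x_t) - \mathbb{E}\bigl[f_{\hat{\mathbf{w}}}(x_t)\bigr]\Bigr)^{2}}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\Bigl[\bigl(f_{\hat{\mathbf{w}}}(x_t) - \mathbb{E}[f_{\hat{\mathbf{w}}}(x_t)]\bigr)^2\Bigr]}_{\text{variance}}
```

where the expectations are taken over training data sets of size N (and the noise in y_t).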
But then we're gonna select our model complexity as the model that performs best on the
validation set has the lowest validation error. And then we're gonna assess the performance of
that selected model on the test set. And we're gonna say that that test error is now an
approximation of our generalization error. Because that test set was never used in either fitting our
parameters, w hat, or selecting our model complexity lambda, that other tuning parameter
Because you have a large enough validation set, and you still have a large enough test set in
order to assess the generalization error of the resulting model. And if this isn't the case, we're
gonna talk about other methods that allow us to do these same types of notions, but not with this
type of hard division between training, validation, and test. [MUSIC]
WEEK 4 - Ridge Regression (3 hours to complete)
[MUSIC] In the last module, we talked about the potential for high complexity models to become
overfit to the data. And we also discussed this idea of a bias-variance tradeoff, where high
complexity models could have very low bias, but high variance. Whereas low complexity models
have high bias, but low variance. And we said that we wanted to trade off between bias and
variance to get to that sweet spot of having good predictive performance. And in this module, what
we're gonna do is talk about a way to automatically balance between bias and variance using
something called ridge regression.
So, previously we had discussed a very formal notion of what it means for a model to be overfit. In
terms of the training error being less than the training error of another model, whose true error is
actually smaller than the true error of the model with smaller training error. Okay, hopefully you
remember that from the last module. But a question we have now is, is there some type of
quantitative measure that's indicative of when a model is overfit? And to see this, let's look at the
following demo, where what we're going to show is that when models become overfit, the
estimated coefficients of those models tend to become really, really, really large in magnitude.
[MUSIC]
So yeah, whoa, these coefficients are crazy. So what ridge regression is gonna do is it's going to
quantify overfitting through this measure of the magnitude of the coefficients. [MUSIC]

So let's talk about overfitting as a function of the number of observations that we have, as well as a
function of the number of inputs. Or the complexity of the model.
But now imagine a model where you have square feet, number of bathrooms, number of bedrooms, lot size, year built, and a massive list of features. And if you wanna cover all possible
combinations of these things that you might see, that's really basically impossibly hard to do. So
this is a much much harder problem, and you're much more subject to your models becoming
overfit in these situations.
Okay, so clearly we want to balance between these two measures, because if I just optimize the
magnitude of the coefficients, I'd set all the coefficients to zero and that would sure not be overfit,
but it also would not fit the data well. So that would be a very high bias solution.
On the other hand, if I just focused on optimizing the measure of fit, that's what we did before.
That's the thing that was subject to becoming overfit in the face of complex models. So somehow
we want to trade off between these two terms, and that's what we're going to discuss now.
But we're gonna be operating in a regime where lambda is somewhere in between 0 and infinity.
And in this case, we know that the magnitude of our estimated coefficients is gonna be less than or equal to the magnitude of our least squares coefficients; in particular, the two-norm
will be less than. But we also know it's gonna be greater than or equal to 0. So we're gonna be
somewhere in between these two regions. And a key question is, what lambda do we actually
want? How much do we want to bias away from our least square solution, which was subject to
potentially over-fitting, down to this really simple, the most trivial model you can consider which is
nothing, no model? So, well not no model, no coefficients in the model. What's the model if all the
coefficients are 0? Just noise, we just have y equals epsilon, that noise term. Okay, so we're
gonna think about somehow trading off between these two extremes. Okay, I wanted to mention
that this is referred to as Ridge regression. And that's also known as doing L2 regularization.
Because, for reasons that we'll describe a little bit more later in this module, we're regularizing the
solution to the old objective that we had, using this L2 norm term. [MUSIC]
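Written out, the objective being traded off here is:

```latex
\hat{\mathbf{w}}_{\text{ridge}} = \arg\min_{\mathbf{w}} \ \text{RSS}(\mathbf{w}) + \lambda\, \|\mathbf{w}\|_2^2,
\qquad
\|\mathbf{w}\|_2^2 = w_0^2 + w_1^2 + \cdots + w_D^2
```

with lambda = 0 recovering the least squares solution and lambda going to infinity driving all the coefficients to 0.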
Okay, but what if we increase the strength of our penalty? So let's consider a very large L2 penalty. Here we're considering a value of 100, whereas in the case above we were considering a value of 1e to the -25, so really really tiny. Well, in this case, we end up with much smaller coefficients. Actually, they look really really small. So let's look at what the fit looks like. And we
see a really, really smooth curve. And very flat, actually probably way too simple of a description
for what's really going on in the data. It doesn't seem to capture this trend of the data. The value's
increasing and then decreasing. It just gets a constant fit followed by a decrease. So, this seems
to be under-fit, and so, as we expect, what we have is that when lambda is really really small we get something similar to our least squares solution, and when lambda becomes really really large we start approaching all the coefficients going to 0. Okay, so now what we're gonna do is look at the fit
as a function of a series of different lambda values going from our 1e to the minus 25 all the way
up to the value of 100. But looking at some other intermediate values as well to look at what the fit
and coefficients look like as we increase lambda. So we're starting with these crazy, crazy large
values. By the time we're at 1e to the -10 for lambda, the values have decreased by two orders of magnitude, so they're on the order of 10 to the 4th now. Then we keep increasing lambda, to 1e to the -6. And
we get values on the order of hundreds for our coefficients, so in terms of reasonability of these
values I'd say that they start looking a little bit more realistic. And then we keep going and you see
that the value of the coefficients keep decreasing, and when we get to this value of lambda that's
100 we get these really small coefficients. But now lets look at what the fits are for these different
lambda values. And here's the plot that we've been showing before for this really small lambda.
Increasing the lambda a bit smoother fit, still pretty wiggly and crazy, especially on these boundary
points. Increase lambda more, things start looking better. When we get to 1e to the -3, this looks
pretty good. Especially here, it's hard to tell whether the function should be going up or down. I want to emphasize that at boundaries where you have few observations, it's very hard to control the fit, so we trust the fit much more in intermediate regions of our x range where we have observations. Okay, but then we get to this really large lambda and we see that clearly we're over
smoothing across the data. So a natural question is, out of all these possible lambda values we
might consider, and all the associated fits, which is the one that we should use for forming our
predictions? Well, it would be really nice if there were some automatic procedure for selecting this
lambda value instead of me having to go through, specify a large set of lambdas, look at the
coefficients, look at the fit, and somehow make some judgment call about which one I want to use.
Well, the good news is that there is a way to automatically choose lambda. And this is something
we're gonna discuss later in this module. So one method that we're gonna talk about is something
called leave one out cross validation. And what leave one out cross validation does is it
approximates, so minimizing this leave one out cross-validation error that we're gonna talk about,
approximates minimizing the average mean squared error in our predictions. So, what we're
gonna do here is we're gonna define this leave one out cross-validation function and then apply it
to our data. And, this leave one out cross validation function, you're not gonna understand what's
going on here yet. But you will by the end of this module. You'll be able to implement this method
yourself. But what it's doing is it's looking at prediction error of different lambda values and then
choosing one to minimize. But of course we're not looking at that on the training error or on the,
sorry on the training set or the test set, we're using a validation set but in a very specific way.
Okay, so now that we've applied this leave one out function to our data in some set of specified
penalty values, we can look at what the plot of this leave one out cross validation error looks like
as a function of our considered lambda values. And in this case, we actually see a curve that's
pretty flat in a bunch of regions. And what this means is that our fits are not so sensitive to the
choice of lambda in these regions. But there is some minimum and we can figure out what that
minimum is here. So here we're just selecting the lambda that has the lowest cross validation
error. And then we're gonna fit our polynomial ridge regression model using that specific lambda
value. And we're printing our coefficients and what you see is we have very reasonable numbers.
Things on the order of 1, .2, .5, and let's look at the associated fit. And things look really nice in
this case. So, there is a really nice trend throughout most of the range of x. The only place that
things look a little bit crazy is out here in the boundary. But again, at this boundary region we
actually don't have any data to really pin down this function. So, considering it's a 16th-order
polynomial, we're shrinking coefficients but we don't really have much information about what the
function should do out here. But what we've seen is that this leave one out cross validation
technique really nicely selects a lambda value that provides a good fit and automatically does this
balance of bias and variance for us. [MUSIC]
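To make this concrete, here's a minimal numpy sketch of the kind of leave-one-out procedure being described, for a 1-D dataset and a degree-16 polynomial ridge fit. The arrays x and y, the helper names, and the lambda grid are assumptions for illustration, not the course's own notebook code.

import numpy as np

def polynomial_features(x, degree):
    # Columns 1, x, x^2, ..., x^degree (intercept included).
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(H, y, lam):
    # Closed-form ridge solution: w = (H'H + lam*I)^(-1) H'y.
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

def loo_cv_error(x, y, degree, lam):
    # Average squared prediction error when each observation is left out in turn.
    n = len(x)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        w = ridge_fit(polynomial_features(x[keep], degree), y[keep], lam)
        pred_i = polynomial_features(x[i:i + 1], degree) @ w
        errors.append((y[i] - pred_i[0]) ** 2)
    return np.mean(errors)

# Select the lambda with the lowest leave-one-out error over a grid:
# lambdas = np.logspace(-25, 2, 30)
# best_lam = min(lambdas, key=lambda lam: loo_cv_error(x, y, 16, lam))

Minimizing this quantity over the lambda grid is what picks out the "reasonable" coefficients and the nice fit described above.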
we saw that when lambda was 0, we get our least squares solution. When lambda goes to infinity,
we get very, very small coefficients approaching 0. And in between, we get some other set of
coefficients and then we explore this experimentally in this polynomial regression demo.
how do the coefficients change? So how does my solution change as a function of lambda? And
what we're doing in this plot here is we're drawing this for our housing example, where we have
eight different features. Number of bedrooms, bathrooms, square feet of the living space, number
of square feet of the lot size. Number of floors, the year the house was built, the year the house
was renovated, and whether or not the property is waterfront. And for each one of these different
inputs to our model, which we're just gonna use as different features, we're drawing what the
coefficient is. So this would be the coefficient value for square feet living, for some
specific choice of lambda and how that coefficient varies as I increase lambda and I'm showing
this for each one of the eight different coefficients. And I just want to briefly mention that in this
figure, we've rescaled the features so that they all have unit norm, each one of these different
inputs. That's why all of these coefficients are roughly on the same scale. They're roughly the
same order of magnitude. Okay, and so what we see in this plot is, as lambda goes towards 0, or
when it's specifically at 0, our solution here. The value of each of these coefficients, so each of
these circles touching this line, this is gonna be my w hat least squares solution. And as I increase
lambda out towards infinity, I see that my solution, w hat, approaches 0. The whole vector of
coefficients is going to 0. And we haven't made lambda large enough in this plot to see them
actually really, really, really, really close to 0, but you see the trend happening here. And then
there's some sweet spot in this plot, which we're gonna talk
about later in this module. Whoops, I should draw it actually hitting some of these circles. One of
these considered points. So this is gonna represent, erase this, this is gonna represent some
lambda star. Which will be the value of lambda that we wanna use when we're selecting our
specific regularized model to use for forming predictions. And we're gonna discuss how we
choose which lambda to use later in the module. But for now, the main point of this plot is to
realize that for every value of lambda, every slice of this plot, we get a different solution, a different
w hat vector. [MUSIC]
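As a rough sketch of how such a coefficient path plot could be generated (the feature matrix H, the targets y, and the lambda grid are assumptions, not the actual housing data pipeline):

import numpy as np

def ridge_coefficient_path(H, y, lambdas):
    # One ridge solution per lambda; row k of the result is w hat for lambdas[k].
    D = H.shape[1]
    return np.array([np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)
                     for lam in lambdas])

# H_unit = H / np.linalg.norm(H, axis=0)     # rescale each feature column to unit norm
# path = ridge_coefficient_path(H_unit, y, np.logspace(-2, 6, 50))
# Plotting each column of `path` against lambda traces one coefficient,
# e.g. the square feet living coefficient, as it shrinks toward 0.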
So we have the gradient of this first term plus the gradient of our model complexity term, or the
first term is our measure of fit, that residual sum of squares term. And we know that the gradient,
or the residual sum of squares has the following form. -2H transpose (y-Hw). The question is,
what's the gradient of this model complexity term? What we see is that the gradient of this is 2 * w.
Why is it 2*w? Well, instead of deriving it, I'll leave that for a little mini challenge for you guys. It's
fairly straightforward to derive, just taking partials with respect to each w component. Just write w
transpose w as w0 squared + w1 squared + blah, blah, blah, all the way up to wD squared, and
then take a derivative just with respect to one of the w's. But for now what I'm gonna do is I'm just
gonna draw an analogy to the 1d case where w transpose w is analogous to just w squared. If w
weren't a vector and were instead just a scalar. And what's the derivative of w squared? It's 2w.
Okay, so proof by analogy here. [MUSIC]
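Written out, the gradient referred to in this segment is, in the same notation (H the feature matrix, w the coefficient vector):

\nabla\left[\mathrm{RSS}(w) + \lambda \|w\|_2^2\right]
  = -2H^{T}(y - Hw) + 2\lambda w,
\qquad\text{since}\qquad
\frac{\partial}{\partial w_j}\left(w_0^2 + w_1^2 + \dots + w_D^2\right) = 2 w_j .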
So when lambda is equal to zero, this closed-form solution we have is exactly equal to our least
squares solution. That's what we had discussed at the very beginning of this module. And likewise,
when we crank lambda all the way up to infinity, our solution is equal to zero.
Even if the number of observations or number of linearly independent observations is less
than the number of features. So this is really important when you have lots of features, so for
large D. And remember, that's how we motivated using ridge regression: we're in these really
complicated models where you have lots and lots of features, a lot of flexibility, and the potential
to overfit. Now we see something very explicit about how it helps
us. And just to return to the discussion on the naming of ridge regression being called a
regularization technique. If you remember, I said that we're regularizing our standard least squares
solution. Well we can see that here, because lambda times the identity is making H transpose H
plus lambda identity more regular. That's what's allowing us to do this inverse even in this other
situation, this harder situation, and because this result is more regular, we call it regularized.
Okay. But, the complexity of the inverse is still cubic in the number of features we have and often
when we're thinking about ridge regression, like I said we're thinking about cases where you have
lots and lots of features, so computing this closed-form solution that we've shown here can be
computationally prohibitive. [MUSIC]
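A tiny numpy sketch of the point being made here, on synthetic data with more features than observations (the sizes and random data are assumptions, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, D = 5, 10                       # fewer observations than features
H = rng.normal(size=(n, D))
y = rng.normal(size=n)

lam = 1.0
A = H.T @ H                        # rank at most n < D, so not invertible
A_reg = A + lam * np.eye(D)        # "regularized" matrix: strictly positive definite

print(np.linalg.matrix_rank(A))    # 5, i.e. rank deficient
w_ridge = np.linalg.solve(A_reg, H.T @ y)   # still solvable; plain least squares is not
# The solve itself is still O(D^3), which is the cubic cost mentioned above.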
This is what we had again two modules ago, where we just initialize our weights vector, and while
not converged, for every feature dimension we're going to go through and compute this partial
derivative of the residual sum of squares. That's that update term I was talking about on the last
slide. And then I'm going to use that to modify my previous coefficient according to the step size
eta. And I'm gonna keep iterating until convergence. Okay, so this is what you coded up before.
Now let's look at how we have to modify this code to get a ridge regression variant. And
notice that the only change is in this line for updating wj where, instead of taking wjt and doing this
update with what I'm calling partial here, we're taking (1 - 2 eta lambda) times wj and doing the update.
So it's a trivial, trivial, trivial modification to your code. And that's really cool. It's very, very, very
simple to implement ridge regression given a specific Lambda value. [MUSIC]
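Here is a minimal numpy sketch of that modification, with eta as the step size (a sketch under the assumption that H and y are numpy arrays, not the course's notebook code):

import numpy as np

def ridge_gradient_descent(H, y, lam, eta, max_iter=10000, tol=1e-6):
    # The only change from unregularized gradient descent is the
    # (1 - 2*eta*lam) shrinkage applied to w before the usual update.
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        partial = -2 * H.T @ (y - H @ w)          # partials of RSS w.r.t. each w_j
        w_new = (1 - 2 * eta * lam) * w - eta * partial
        if np.max(np.abs(w_new - w)) < tol:       # stop when the update is tiny
            return w_new
        w = w_new
    return w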
what if we don't have enough data to reasonably do a divide into these three different sets? What
can you do? And we're gonna talk about this in the context of ridge regression, but again, this holds
for any tuning parameter lambda that you might have in selecting between different model
complexities. Or any other tuning parameter controlling your model.
Well, no, clearly the answer is no. We're saying that it's just a small set. It's not gonna be
representative of the space of things that we might see out there. Okay, so what can we do
better? We're stuck with just this dataset. Well, did we have to use the last set of tabulated
observations as the observations to define this validation set? No, I could of used the first few
observations, or next set of observations, or any random subset of observations in this dataset.
And a question is, which subset of observations should I use as my validation set? And the
answer, and this is the key insight, is use all of the data subsets. Because if you're doing that, then
you can think about averaging your performance across these validation sets. And avoid any
sensitivity you might have to one specific choice of validation set that might give some strange
numbers because it just has a few observations. It doesn't give a good assessment of comparison
between different model complexities. [MUSIC]
[MUSIC] So how are we gonna do this? How are we gonna use all of our data as our validation
set? We're gonna use something called K-fold cross validation.
gonna keep track of the error for the value of lambda for each block, and then I'm gonna do this for
every value of lambda. Okay, so I'm gonna move on to the next block, treat that as my validation
set, fit the model on all the remaining data, compute the error of that fitted model on that second
block of data. Do this on a third block, fit data on all the remaining data, assess the performance
on the third block, and cycle through each of my blocks like this. And at the end, I've tabulated my
error across each of these K different blocks for this value of lambda. And what I'm gonna do is
I'm gonna compute what's called the cross validation error of lambda, which is simply an average
of the error that I had on each of the K different blocks. So now I explicitly see how my measure of
error, my summary of error for the specific value of lambda, uses all of the data. It's an average across the
validation sets in each of the different blocks. Then, I'm gonna repeat this procedure for every
value that I'm considering of lambda and I'm gonna choose the lambda that minimizes this cross
validation error. So I had to divide my data into K different blocks in order to run this K-fold cross
validation algorithm. So a natural question is what value of K should I use? Well you can show
that the best approximation to the generalization error of the model is given when you take K to be
equal to N. And what that means is that every block has just one observation. So this is called
leave-one-out cross validation. So although it has the best approximation of what you're trying to
estimate, it tends to be very computationally intensive, because what do we have to do for every
value of lambda? We have to do N fits of our model. And if N is even reasonably large, and if it's
complicated to fit our model each time, that can be quite intensive. So, instead what people tend
to do is use K = 5 or 10, this is called 5-fold or 10-fold cross validation. Okay, so this summarizes
our cross validation algorithm, which is a really, really important algorithm for choosing tuning
parameters. And even though we discussed this option of forming a training, validation, and test
set, typically you're in a situation where you don't have enough data to form each one of those. Or
at least you don't know if you have enough data to have an accurate approximation of
generalization error as well as assessing the difference between different models, so typically
what people do is cross validation. They hold out some test set and then they do either leave one
out, 5-fold, or 10-fold cross validation to choose their tuning parameter lambda. And a really
critical step in the machine learning workflow is choosing these tuning parameters in order to
select a model and use that for the predictions or various tasks that you're interested in. [MUSIC]
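A compact numpy sketch of this K-fold procedure; the fit and error callables (and the pre-shuffled data) are assumptions standing in for whatever model and error metric you are tuning:

import numpy as np

def k_fold_cv_error(H, y, lam, K, fit, error):
    # Cross-validation error CV(lambda): average validation error over K blocks.
    n = len(y)
    blocks = np.array_split(np.arange(n), K)      # assumes rows already shuffled
    errors = []
    for valid_idx in blocks:
        train_idx = np.setdiff1d(np.arange(n), valid_idx)
        w = fit(H[train_idx], y[train_idx], lam)  # fit on all the remaining data
        errors.append(error(H[valid_idx], y[valid_idx], w))
    return np.mean(errors)

# Choose lambda by minimizing CV(lambda) over a grid, e.g. with K = 10:
# best_lam = min(lambdas, key=lambda lam: k_fold_cv_error(H, y, lam, 10, ridge_fit, mse))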
And here it's gonna be very simple, we just add in a special case that if we're updating our
intercept term, so if we're looking at that zeroth feature, we're just gonna use our old least squares
update. No shrinkage to w0, but otherwise, for all other features we're gonna do the ridge update.
Okay so we see algorithmically it's very straightforward to make this modification where we don't
want to penalize that intercept term. But there's another option we have which is to transform the
data. So in particular if we center the data about 0 as a pre-processing step then it doesn't matter
so much that we're shrinking the intercept towards 0 and not correcting for that, because when we
have data centered about 0 in general we tend to believe that the intercept will be pretty small. So
here what I'm saying is step one, first we transform all our y observations to have mean 0. And
then as a second step we just run exactly the ridge regression we described at the beginning of
this module. Where we don't account for the fact that there's this intercept term at all. So, that's
another perfectly reasonable solution to this problem. [MUSIC]
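A minimal sketch of that second option; the data here is a synthetic placeholder (an assumption for illustration only):

import numpy as np

# Placeholder training data standing in for the real features and observations.
rng = np.random.default_rng(1)
H_train = rng.normal(size=(20, 3))
y_train = 100 + H_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=20)

# Step 1: center the observations so the intercept is expected to be roughly 0.
y_mean = y_train.mean()                  # computed on training data only
y_centered = y_train - y_mean

# Step 2: run ordinary ridge regression on the centered targets,
# with no special handling of the intercept term.
lam = 0.1
w = np.linalg.solve(H_train.T @ H_train + lam * np.eye(3), H_train.T @ y_centered)

# At prediction time, add the training mean back in:
# y_pred = H_test @ w + y_mean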
ridge regression is allowing us to do is automatically perform this bias variance tradeoff. So we
thought about how to perform ridge regression for a specific value of lambda, and then we talked
about this method of cross validation in order to select the actual lambda we're gonna use for our
models that we would use to make predictions. So in summary, we've described why ridge
regression might be a reasonable thing to do, motivating that penalizing the magnitude term that
ridge regression introduces, the magnitude of the coefficients, makes sense from the standpoint
that overfit models tend to have very large magnitude coefficients. Then we talked
about the actual ridge regression objective and thinking about how it's balancing fit with the
magnitude of these coefficients. And we talked about how to fit the model both as a closed form
solution as well as with gradient descent. And then how to choose our value of lambda using cross
validation, and that method generalizes well beyond regression, let alone just ridge regression.
And then finally, we talked about how to deal with the intercept term, if you wanna handle that
specifically. [MUSIC]
WEEK

3 hours to complete

Feature Selection & Lasso

[MUSIC] Okay, so how are we going to go about this feature selection task? Well one option we
have, is the obvious choice, which is to search over every possible combination of features we
might want to include in our model and look at the performance of each of those models
the set is connecting the points which represent the set of all best possible models, each with a
given number of features. Then the question is, which of these models of these best models with k
features do we want to use for our predictions?
Well hopefully it's clear from this course, as well as from this slide, that we don't just wanna
choose the model with the lowest training error, because as we know at this point, as we increase
model complexity, our training error is gonna go down, and that's what we're seeing in this plot. So
instead, it's the same type of choices that we've had previously in this course for choosing
between models of various complexity. One choice is, if you have enough data, you can assess
performance on the validation set. That's separate from your training and test set. We also talked
about doing cross validation. And in this case there are many other metrics we can look at for how to
think about penalizing model complexity. There are things called BIC and a long list of other
methods that people have for choosing amongst these different models. But we're not going to go
through the details of that, for our course we're gonna focus on this notion of error on the
validation set. [MUSIC]
How many models did we have to evaluate? Well, clearly we evaluated all models, but let's
quantify what that is. So we looked at the model that was just noise. We looked at the model with
just the first feature, second feature, all the way up to the model with just the first two features, the
second two features and every possible model up to the full model of all D features. And what we
can do is we can index each one of these models that we searched over by a feature vector. And
this feature vector is going to say, so for
And how many entries are there? Well, with my new indexing, instead of D there's really D plus
one. That's just a little notational choice. And I did a little back of the envelope calculation for a
couple choices of D. So for example, if we had a total of eight different features we were looking
over, then we would have to search over 256 models.
That actually might be okay. But if we had 30 features, all of a sudden we have to search over a
billion or so different models. And if we have 1,000 features, which really is not that
many in applications we look at these days, all of a sudden we have to search over 1.07 times 10
to the 301. And for the example I gave with 100 billion features, I don't even know what that
number is. Well, I'm sure I could go and compute it, but I didn't bother and it's clearly just huge.
So, what we can see is that typically, and in most situations we're faced with these days, it's just
computationally prohibitive to do this all subset search. [MUSIC]
select the next best feature and keep iterating.
One is the fact that from iteration to iteration, training error can never increase. And the reason for that is,
even if there's nothing else that you want to add to your model. Even if everything else is junk, or
perhaps only making things a lot worse in terms of predictions. You can always just set the weight
equal to zero on that additional feature. And then you'd have exactly the same model you had,
which had one fewer feature. And so, the training error would be exactly the same. So, we know the
training error will never increase. And the other thing that we mentioned is that eventually, the
training error will be equivalent to that of all subsets when you're using all of your features, and
perhaps, before that as well. Well, when do we stop this greedy procedure? Just to really hammer
this home, I'll just ask the question again. Do we stop when the training error is low enough? No.
Remember that training error will decrease as we keep adding model complexity. So, as we keep
adding more and more features, our training error will go down. What about when test error's low
enough? Again, no. Remember that we never touch our test data when we're searching over
model complexity, or choosing any kind of tuning parameter. Again, what we're gonna do is use
our validation set or cross validation to choose when to stop this greedy procedure. [MUSIC]
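To make the stopping discussion concrete, here is a rough sketch of greedy forward selection with a validation set; the fit and error callables are assumed helpers (e.g. a least squares fit and mean squared error), not code from the course:

import numpy as np

def forward_selection(H_tr, y_tr, H_va, y_va, fit, error):
    # Greedily add the feature that most reduces training error; record the
    # validation error of each step's model and return the best feature set.
    D = H_tr.shape[1]
    selected, remaining, history = [], list(range(D)), []
    while remaining:
        def train_err(j):
            cols = selected + [j]
            return error(H_tr[:, cols], y_tr, fit(H_tr[:, cols], y_tr))
        best_j = min(remaining, key=train_err)
        selected.append(best_j)
        remaining.remove(best_j)
        w = fit(H_tr[:, selected], y_tr)
        history.append((list(selected), error(H_va[:, selected], y_va, w)))
    # Stop/choose based on validation error, never training or test error.
    return min(history, key=lambda item: item[1])[0]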
Okay, so this procedure is gonna be significantly more computationally efficient or feasible than
doing all subsets. Well this was just one of many possible choices you have for greedy algorithms
for doing feature selection.

[MUSIC] Well, for our third option for feature selection, we're gonna explore a completely different
approach which is using regularized regression to implicitly perform feature selection for us. And
the algorithm we're gonna explore is called Lasso.
Let's recall regularized regression in the context of ridge regression first. Where, remember, we
were balancing between the fit of our model on our training data and a measure of the magnitude
of our coefficients, where we said that smaller magnitudes of coefficients indicated that things
were not as overfit as if you had crazy, large magnitudes. And we introduced this tuning
parameter, lambda, which balanced between these two competing objectives. So for our measure
of fit, we looked at residual sum of squares. And in the case of ridge regression, when we looked
at our measure of the magnitude of the coefficients, we used what's called the L2 norm, so this is
just the two norm squared in this case, which is the sum of each of our feature weights squared.
Okay, this ridge regression penalty we said encouraged our weights to be small. But one thing I
want to emphasize is that it encourages them to be small, but not exactly 0. We can see this if we
look at the coefficient path that we described for ridge regression, where we see the magnitude of
our coefficients shrinking and shrinking towards 0, as we increase our lambda value. And we said
in the limit as lambda goes to infinity, in that limit, the coefficients become exactly 0. But for any
finite value of lambda, even a really really large value of lambda, we're still just going to have very,
very, very small coefficients but they won't be exactly 0. So why does it matter that they're not
exactly 0? Why am I emphasizing so much this concept of the coefficients being 0? Well, this is
this concept of sparsity that we talked about before, where if we have coefficients that are exactly
0, well then, for efficiency of our predictions, that's really important because we can just
completely remove all the features where their coefficients are 0 from our prediction operation and
just use the other coefficients and the other features. And likewise, for interpretability, if we say
that one of the coefficients is exactly 0, what we're saying is that that feature is not in our model.
So that is doing our feature selection. So a question though, is can we use regularization to get at
this idea of doing feature selection, instead of what we talked about before? Where before, when
we're talking about all subsets, or greedy algorithms, what we were doing is we were searching
over a discrete set of possible solutions, we're searching over the solution that included the first
and the fifth feature, or the second and the seventh, or this entire collection of these discrete
solutions. But what we'd like to ask here is whether we can start with, for example, our full model. And
then just shrink some coefficients not towards 0, but exactly to 0. Because if we shrink them
exactly to 0, then we're knocking out those coefficients, we're knocking those features out from
our model. And instead, the non-zero coefficients are going to indicate our selected features.
[MUSIC]
Then ridge regression is gonna prefer a solution that places a bunch of smaller weights on all the
features, rather than one large weight on one of the features. Because remember the cost under
the ridge regression model is the size of that weight squared. And so if you have one really big
one, that's really gonna blow up that cost, that L2 penalty term. Whereas the fit of the model is
gonna be basically about the same. Whether I distribute the weights over redundant features or if I
put a big one on just one of them and zeros elsewhere. So what's gonna happen is I'm going to
get a bunch of these small weights over the redundant features. And if I think about simply
thresholding, I'm gonna discard all of these redundant features. Whereas one of them, or
potentially the whole set, really were relevant to my prediction task. So hopefully, it's clear from
this illustration that just taking ridge regression and thresholding out these small weights, is not a
solution to our feature selection problem. So instead we're left with this question of, can we use
regularization to directly optimize for sparsity? [MUSIC]
Lambda's governing the sparsity of the solution and that's in contrast to, for example, the all
subsets or greedy approaches, where we talked about those searching over a discrete set of
possible solutions.
this penalty term is completely going to disappear, and our objective is simply going to be to
minimize residual sum of squares. That was our old least squares objective. So, we're going to get
w hat, what I'll call w hat lasso. The solution to our lasso problem is going to be exactly equal to w hat
least squares. So, this is equal to our unregularized solution. And in contrast, if we set lambda
equal to infinity, this is where we are completely favoring this magnitude penalty and completely
ignoring the residual sum of squares fit. In this case, what's the thing that minimizes the L1 norm? So,
what value of our regression coefficients is gonna have the sum of the absolute values being the
smallest? Well again, just like in ridge, when lambda's equal to infinity we're gonna get w hat
lasso equal to the zero vector. And if lambda is in between, we're gonna get that the
one norm of our lasso solution is gonna be less than or equal to the one norm of our least
squares solution, and it's gonna be greater than or equal to zero. Not the zero vector, just the
number zero, since it's a number once we've taken this norm. Okay. So, as of yet, it's not
clear why this L1 norm is leading to sparsity, and we're going to get to that, but let's first just
explore this visually. And one way we can see this is from the coefficient path. But first, let's just
remember the coefficient path for ridge regression, where we saw that even for a large value of
lambda, everything was in our model, just with small coefficients. So, every w hat j is nonzero,
but all the w hat j are small in magnitude for large values of our tuning parameter lambda. In
contrast, when we look at the coefficient path for lasso, we see a very different pattern. What we
see is that at certain critical values of this tuning parameter lambda, certain ones of our features
jump out of our model. So, for example, here square feet of the lot size disappears from
the model. Here number of bedrooms drops out, almost simultaneously with number of floors and
number of bathrooms, followed by the year the house was built. And one thing that we see, so let
me just be clear, that for let's say a value of lambda like this, we have a sparse set of features
included in our model. So, the ones I've circled are the only features in our
model. And all the other ones have dropped exactly to zero. And one thing that we
see is that when lambda is very large, like the large value I showed on the previous plot, the only
thing in our model is square feet living. And note that square feet living still has a really
significantly large weight on it. So, I'll say large weight on square feet living when everything else
is out of the model. Meaning not included in the model. So, square feet living is still very valuable
to our predictions, and it would take quite a large lambda value to say that square feet living, even
that was not relevant. Eventually, square feet living would be shrunk exactly to 0, but for a much
larger value of lambda. But, if I go back to my ridge regression solution, I see that I had a much
smaller value on square feet living, because I was distributing weights across many other features
in the model. So, that individual impact of square feet living wasn't as clear. [MUSIC]
So let's say I'm just trying to minimize my two norm. What's the solution to that? Well I'm going to
jump down these contours to my minimum. And my minimum, let me do it directly in red this time.
My minimum, oops, that didn't switch colors. Directly in red. My minimum is setting w0 and w1
equal to zero. So this is min over w0, w1 of my two norm, which I'll write explicitly in this 2D case as w0
squared + w1 squared, and the solution is 0. Okay, so this would be our ridge regression solution
if lambda, the weighting on this 2 norm, were infinity. We talked about that before. [MUSIC]
[MUSIC] For any specific value of lambda, we get some balance between this residual sum of
squares, and this two norm. And so what I'm gonna do, is in this movie, I'm gonna add these two
contour plots together. I'm gonna add, so let me write this down. Add contour plots together,
where I'm getting residual sum of squares of w plus lambda 2 norm of w. Where here the residual
sum of squares were these ellipses, centered about my least squares solution, and there's the two
norm, where these circles are centered about zero. And lambda is some weighting on how much I'm
including that two norm penalty or the cost. And what I'm going to do is I'm going to show a movie
as a function of lambda. So movie, function of increasing lambda, where I have my ellipses, and
I'm weighting more and more heavily these contours that are coming from the circle. The circle
terms from this two norm penalty. Okay, so this is the movie right here, and my lovely assistant
Carlos, will click the mouse to play the movie. [LAUGH] Since I don't know how to control it from
the tablet, unfortunately. [LAUGH] Thank you, Vanna. That reference was probably lost on most
people. And in doing so, you didn't get me describing the movie so let's watch it again. But what
we see this x, let me be clear, that the x is going to mark the optimal solution, For a specific
lambda and we're varying lambda so this x is gonna move. Okay where's the x gonna start? Well
when lambda's equal to zero, we're starting at our least squares solution, and as lambda increases
we know that as lambda goes to infinity the coefficients are gonna shrink to zero. But let's
visualize the path that it takes as we increase lambda. Okay so let's play this movie again, gonna
start it at the least squares solution, and we see that the magnitude of our coefficients w0 and w1 are
shrinking smaller and smaller towards zero. So, maybe we'll play that once more just to visualize
this, and what we see, again this was just the tail end of the movie, is this shrinking magnitude of
the coefficients. Carlos is very excited about this movie, so we're gonna watch it one more time.
It's pretty cool. We've never actually seen somebody do this visualization. We think it's really
intuitive. So again, as that lambda penalty is increasing, the magnitude of the coefficients is
getting shrunk. Okay, well now let's talk about what the solution looks like for a given value of
lambda. Oops, sorry let me turn my pen on. So for a specific lambda value. We have some
balance between residual sum of squares and the magnitude of our coefficients. Lambda's
automatically doing some trade-off between the two. So some balance Between RSS and our two
norm. And specifically, in this plot, this is our solution. So, it has some RSS that happens to be,
five thousand two hundred fifteen, that's what the number on this contour is indicating, and it has
some two norm, which has value 4.75, and so this lambda has chosen a specific trade-off, and
we see that our solution is somewhere here, which has shrunk from where our least square
solution was. Let's remember our least square solution was somewhere around here. And the
optimal for lambda equals infinity was that zero. So it's somewhere in between these two values.
And if we had chosen a different value of lambda, let's say a larger value of lambda, we would have
had a different solution. And when I'm drawing all these contours, what I'm saying is, let me just
go back to the original one before this drawing. What I'm saying is, every other point along this
circle, has exactly the same residual sum of squares. But larger l2 norm of w and everywhere
along this circle has exactly the same w2 norm, sorry, l2 norm of w, but it has larger residual sum
of squares. So that's why this is the optimal trade off for this lambda. Then, like I drew here, if I
chose a larger lambda, I would get a solution that preferred a smaller two norm and a larger residual
sum of squares. So, this would be solution for a larger lambda value. Okay, so this is just a little
visualization of what a ridge regression solution looks like. [MUSIC]
just like in ridge, the solution is to make the magnitude as small as possible, which is zero. So this
is min over w of my one norm. Okay. So, this is a really important visualization for the one norm
and we're going to return to it in a couple slides. But first what I wanna do is show exactly the
same type of movie that we showed for ridge objective but now for lasso so again this is a movie
where we're adding the two contour plots, so adding RSS + lambda times the one norm of w in this
case. So we're adding ellipses plus some weighting lambda of a set of diamonds. And then we're gonna solve for
the minimum, that's gonna be x. So, x is again our optimal W hat for a specific lambda. And we're
gonna look at how that solution changes as we increase the value of lambda. And again if we set
lambda equal to zero we're gonna be at our least squares solution, so we're gonna start at exactly
the same point that we did in our ridge regression movie. But now as we're increasing lambda, the
solution's gonna look very different than what it did for ridge regression. We know that the final
solution is gonna go towards zero, but let's look at what the path looks like. Okay, Vanna, you're
up, play the movie. So, what we see is that the solution eventually gets to the point where w0 is
exactly equal to 0. So, if we watch this movie again, we see that this X is moving along shrinking
and then it hits the y axis and it moves along that y axis. So, the coefficients shrink, and at some
point the solution hits the point where w0 becomes exactly 0, and then our w1 term, the weighting
on this second feature, h1, is going to decrease and decrease and decrease as we continue to
increase our penalty term lambda. So, it's
going to continue to walk down this axis. So, let's watch this one more time with this in mind. Our
solution hits that zero point, that sparse solution where w0 hat is equal to zero, and then it continues
to shrink the coefficients to zero. And you see that our contours become more and more like the
diamonds that are defined by that L1 norm, as the weighting on that norm increases. Now, let's go
ahead and visualize what the lasso solution looks like. And this is where we're gonna get our
geometric intuition beyond what was just shown in the movie for why lasso solutions are sparse.
So, we already saw in the movie that for certain values of lambda, we're gonna get coefficients
exactly equal to zero. But now let's just look at just one value of lambda. And here is our solution,
and what you see is that, because of this diamond, so let me write this as our solution, because of
this diamond shape of our L1 objective or the penalty that we're adding, we're gonna have some
probability of hitting those corners of this diamond. And at those corners we're gonna get sparse
solutions. So, like Carlos likes to say, it's like a ninja star that's stabbing our RSS contours. So,
maybe that's a little brutal of a description, but maybe you'll remember it. So, this is why lasso
leads to sparse solutions. And another thing I want to mention is this visualization is just in two
dimensions, but as we get to higher dimensions, instead of diamonds they're called rhomboids, and
they're very pointy objects. So, in high dimensions we're very likely to hit one of those corners
of this L1 penalty for any value of lambda. [MUSIC]
But as we increase this lambda value, we get more and more sparsity in our solution. So the
number of nonzeroes here is 14, then we get five coefficients being nonzero. And by the time we
have a penalty strength of ten, we only have two of our coefficients being nonzero. So you can
see very explicitly from this how lasso is leading to sparse solutions, especially as you're
increasing the strength of that l1 penalty term. But now, let's just look at the fits associated with
these different estimated models. So this is for our very small penalty value. This function doesn't
look as crazy as the least squares solution, still fairly wiggly. But remember that in lasso, just like in
ridge, the coefficients are shrunk relative to the least squares solution. So even in this case
where we don't have any sparsity and none of the features have been knocked out of our model,
we still have that the coefficients are a little bit shrunk relative to those of the least squares solution
and that's providing enough regularization to lead to the smoother fit in this case. But as we
increase this lambda value, we see that we actually do get smoother and smoother fits. This starts
to look like the fit that we had for our optimal setting of our ridge regression objective or the one
that minimized our leave-one-out cross validation error. But again, when we get to really large
lambdas just like in ridge regression, we start to have fits that are over smoothing things. So this
was a case where we only had two nonzero coefficients and we see that it's really just insufficient
for describing what's going on in this case. So again, to choose our lambda value here, we could
do the same leave-one-out cross validation that we did in our ridge regression demo. But the point
that I wanted to show here is how we get these sparse solutions where we're knocking out, in this
case, different powers of x in our polynomial regression fit. And this is in contrast to ridge
regression, which simply shrank the coefficient of each one of these powers of x in our degree 16
polynomial fit. [MUSIC]
But just know that they exist, and they're crucial to the derivation of the algorithms we're gonna
talk about for the lasso objective. And if you're interested in learning more, stay tuned for our
optional advanced video. But even if you could compute this derivative that we're saying doesn't
exist for this absolute value function, there's still no closed-form solution for our lasso objective. So
we can't do the closed-form option, but we could do our gradient descent algorithm. But again not
using gradients, using subgradients. [MUSIC]
how to apply coordinate descent to solving our lasso objective. So, our goal here is to minimize
some function g. So, this is the same objective that we have whether we are talking about our
closed form solution, gradient descent, or this coordinate descent algorithm. But, let me just be
very explicit: we're saying we wanna minimize over all possible w some g(w), where here,
we're assuming g(w) is a function of multiple variables. Let's call it g(w0, w1, ..., wD). So this w I'm
trying to write in some bold font here. And often, minimizing over a large set of variables can be a
very challenging problem. But in contrast, often it's possible to think about optimizing just a single
dimension, keeping all of the other dimensions fixed. So it's easy for each coordinate when keeping
others fixed, because that turns into just a 1D optimization problem. And so, that's the motivation
behind coordinate descent, where the coordinate descent algorithm, it's really intuitive. We're just
gonna start by initializing our w vector, which will denote w hat equal to 0. Or you could use some
smarter initialization, if you have it. And then while the algorithm's not converged, we're just gonna
pick a coordinate out of 0, 1, all the way to D. And we're gonna set w hat j equal to the value that
minimizes. I'm gonna write this as an omega, so I'm gonna search over all possible values of
omega that minimize g(w hat 0, ..., w hat (j-1), omega, w hat (j+1), ..., w hat D). Here, these values,
everything with a hat on it, are values from previous iterations of this coordinate descent algorithm.
And our objective here, we're just minimizing over this jth coordinate. Okay, so
whatever omega value minimizes this joint objective, plugging omega in, is gonna be what we set
w hat j equal to at this iteration, and then we're gonna keep cycling through until convergence.
So, let's look at a little illustration of this just in two dimensions. So, here it would just be w0, w1.
And we're gonna pick a dimension, so let's say we choose to optimize along the w1 dimension,
keeping w0 fixed. So I'm just gonna look at this slice, and what's the minimum along this slice?
Well the minimum occurs here, if I look at the contours this is the minimum along this dimension,
and then what I'm gonna do is, in this case, just cycle through my coordinates. I'm next
gonna look at keeping W1 fixed and optimizing over W0, and here was the minimum, I'm gonna
drop down to the minimum which is here and I'm gonna optimize over W1 holding W0 fixed to this
value. And I'm gonna keep taking these steps and if I look at the path of coordinate descent, it's
gonna look like the following, where I get these axis-aligned moves. So how are we gonna pick
which coordinate to optimize next? Well there are a couple choices. One is to just choose
coordinates randomly. This is called either random or stochastic coordinate descent. Another
option is round robin, where I just cycle through, zero, one, two, three, all the way up to D and
then cycle through again. And there are other choices you can make as well. And, I want to make
a few comments about coordinate descent. The first is, it's really important to note that there is no
stepsize that we're using in this algorithm. So, that's in contrast to gradient descent, where we
have to choose our stepsize or stepsize schedule. And hopefully, in the practice that you've had,
you've seen the importance of carefully choosing that step size. So it's nice not having to choose
that parameter. And this algorithm is really useful in many, many problems. And you can show
that this coordinate descent algorithm converges for certain objective functions. One example
of this is if the objective is strongly convex, and we talked about strongly convex functions
before. Another example is, for lasso which is very important and that's why we're talking about
using coordinate descent for lasso. We're not gonna go through proof of convergence in this case,
but know that it does indeed converge. [MUSIC]
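Here is a minimal sketch of that round-robin loop on a toy smooth 2-D objective, just to show the stepsize-free structure; the objective and its hand-derived 1-D minimizers are assumptions for illustration, not the lasso update itself.

import numpy as np

# Toy objective g(w) = (w0 - 1)^2 + (w1 + 2)^2 + w0*w1, minimized by
# round-robin coordinate descent: each step exactly minimizes over one
# coordinate while holding the other fixed -- no step size needed.
def argmin_coord(w, j):
    # 1-D minimizers derived by hand for this toy g:
    # dg/dw0 = 2(w0 - 1) + w1 = 0  ->  w0 = 1 - w1/2
    # dg/dw1 = 2(w1 + 2) + w0 = 0  ->  w1 = -2 - w0/2
    return 1 - w[1] / 2 if j == 0 else -2 - w[0] / 2

w = np.zeros(2)
for sweep in range(100):
    max_step = 0.0
    for j in range(2):                    # round robin over coordinates
        new_wj = argmin_coord(w, j)
        max_step = max(max_step, abs(new_wj - w[j]))
        w[j] = new_wj
    if max_step < 1e-8:                   # converged: largest step in the sweep is tiny
        break

The same max-step-over-a-full-sweep check reappears later as the suggested convergence criterion for the lasso variant.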
an operation for a specific observation, and that operation would be different between different
observations, and things would be all on different scales between our different observations, and
that wouldn't make any sense. So please never do that. Please do this operation on columns of
the training data. And what's this operation we're gonna do? Well, we're gonna take our column, our
specific feature like square feet over our n different observations in our training set, and we're
going to normalize each of these entries, each of these feature values by the following quantity,
and doing this operation is gonna place each one of our features into the same numeric range,
because the range of values we might see for square feet versus number of floors or number of
bathrooms is dramatically different. After we do this transformation though, everything will be in
the same numeric range, so that's why it's a practically important transformation. But I wanna
emphasize one thing: when you transform your features, when you normalize them in this
way or any way that you choose to transform your features on your training dataset, you have to
do exactly the same operation on your test dataset. Otherwise you're gonna be mixing up the
apples and oranges because for example, you think about transforming your training data from
measuring square feet to measuring square meters for every house, but then your test data is still
in square feet. When I learn a model in my training data, which is now in square meters, and I go
to test it on my test data which is square feet, it's gonna give me bogus answers. Okay, so make
sure you do the same operation on both training and test. And so, if we're gonna call this
normalizer, which we'll use this notation later, Zj, we're going to divide by exactly the same
quantity for each column of our test set. So this same Zj appears here where, just to emphasize,
we're summing over all of our training data points in doing this operation here, and we're applying
it to one of our test data points. [MUSIC]
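A minimal numpy sketch of this normalization, applying the training-set normalizer z_j to the test columns as well (array names are placeholders for illustration):

import numpy as np

def normalize_features(H_train, H_test):
    # z_j = sqrt( sum over training rows of h_j(x_i)^2 ), one value per column.
    z = np.sqrt((H_train ** 2).sum(axis=0))
    # Divide every column of BOTH train and test by the same z_j,
    # so the two sets stay in the same units.
    return H_train / z, H_test / z, z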
Well what rho j is doing is it's a measure of the correlation between our feature j and this residual
between the predicted value. Where remember that prediction was formed excluding that jth
feature from the model and the true observation. So we're looking at our prediction, the difference
relative to the truth, when j is not included in the model. And then we're seeing how correlated is
that with the jth feature. And if that correlation is really high, so if they tend to agree, then what
that means is that jth feature is important for the model. And it can be added into the model and
reduce the residuals, so equivalently improve our predictions. So what this is going to mean is, if
these things tend to agree, rho j is going to be really large, a high correlation. And we're gonna
set w hat j large, we're gonna put a weight strongly on this feature being in the model. But in
contrast if they're uncorrelated, if they don't tend to agree across observations, or if the residuals
themselves are already very small, because the predictions are already very good. Then what it's
saying is that rho j is gonna be small and as a result, we're not gonna put much weight on this
feature in the model. [MUSIC]
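In the notation of this module, that correlation term can be written as

\rho_j \;=\; \sum_{i=1}^{N} h_j(x_i)\,\Bigl(y_i - \hat{y}_i(\hat{w}_{-j})\Bigr),
\qquad
\hat{y}_i(\hat{w}_{-j}) \;=\; \sum_{k \neq j} h_k(x_i)\,\hat{w}_k ,

so a large magnitude of rho j means feature j can still explain a lot of the leftover residual.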
[MUSIC] So, now we're gonna describe what the variant of this coordinate descent algorithm looks
like in the case of lasso. Again, we're gonna be looking at these normalized features. And just
remember this is where we left off with our coordinate descent algorithm for just least squares
unregularized regression. And remember the key point was that we set w hat j equal to rho j. This
correlation between our features and the residuals from a prediction, leaving j out of the model.
Well in the case of lasso what we're gonna do is, how we set w hat j is gonna depend of the value
of our tuning parameter lambda. And how that relates to this rho j correlation term. So, in
particular if rho j is small, if it's in this minus lambda over 2 to lambda over 2 range, where again
what small means is determined by lambda. What we're gonna do is we're gonna set that w hat j
exactly equal to zero. And here we see the sparsity of our solutions coming out directly here. But
in contrast, if rho j is really large or on the flip side very small, what that means is that the
correlation is either very positive or very negative. Then we're gonna include that feature in the
model. Just like we did in our least squares solution but relative to our least squares solution,
we're gonna decrease the weight. So, in the positive case if we've a strong correlation rho j.
Instead of putting w hat j equal to rho j, we're gonna set it equal to rho j minus lambda over 2. And
on the negative side, we're going to add lambda over 2. So let's look at this function of how we're
setting w hat j visually. Okay, well this operation that we are performing here in these lasso
updates is something called soft thresholding. And so, let's just visualize this. And to do this we're
gonna make a plot of rho j, that correlation we've been talking about, versus w hat j, our coefficient
that we're setting. And remember, in the least squares solution, we set w hat j equal to rho j for
least squares. And we can see that by setting lambda equal to zero. Remember, lambda
equals zero returns us to our least squares solution, so I'll specifically write least squares there.
So that's why we get this line y equals x, this green line appearing here. So this represents as a
function of rho j how we would set w hat j for least squares. And in contrast this fuchsia line here
we're showing is for lasso. And what we see is that in the range minus lambda over 2 to lambda
over 2. If this correlation is within this range, meaning that there's not much of a relationship between
our feature and the residuals from predictions without feature j in our model, we're just gonna
completely eliminate that feature. We're gonna set its weight exactly equal to 0. But if we're
outside that range we're still gonna include the feature in the model. But we're gonna shrink the
weight on that feature relative to the least square solution by an amount lambda over 2. So this is
why it's called soft thresholding, we're shrinking the solution everywhere, but we're strictly driving it
to zero from minus lambda over 2 to lambda over 2. And I just want to mention to contrast with, let
me choose a color that we don't have here, I guess red will work. I wanna contrast with the ridge
regression solution where you can show, which we're not going to do here, but you can show that
the ridge regression solution shrinks the coefficients everywhere, but never strictly to zero. So
this is the line w hat ridge. And let me just write that this is w hat lasso. Okay, so here we got a
very clear visualization of the difference between least squares, ridge, and lasso. [MUSIC]
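Putting the pieces of this segment together, here is a minimal numpy sketch of the lasso coordinate descent update with soft thresholding, assuming the columns of H have already been normalized as described earlier (a sketch, not the course's reference implementation):

import numpy as np

def lasso_coordinate_descent(H, y, lam, tol=1e-6, max_sweeps=1000):
    # Assumes each column of H has unit two-norm (normalized features).
    n, D = H.shape
    w = np.zeros(D)
    for _ in range(max_sweeps):
        max_step = 0.0
        for j in range(D):
            y_hat_minus_j = H @ w - H[:, j] * w[j]      # prediction without feature j
            rho_j = H[:, j] @ (y - y_hat_minus_j)       # correlation with the residual
            # Soft thresholding (an intercept feature would instead keep rho_j unshrunk).
            if rho_j < -lam / 2:
                new_wj = rho_j + lam / 2
            elif rho_j > lam / 2:
                new_wj = rho_j - lam / 2
            else:
                new_wj = 0.0
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:      # converged: largest step over a full sweep is tiny
            break
    return w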
[MUSIC] Well, in our coordinate descent algorithm for lasso, and actually all of the coordinate
descent algorithms that we've presented, we have this line that says while not converged. And the
question is how are we assessing convergence? Well, when should I stop in coordinate descent?
In gradient descent, remember we're looking at the magnitude of that gradient vector. And
stopping when the magnitude of the vector was below some tolerance epsilon. Well here, we don't
have these gradients we're computing, so we have to do something else. One thing we know,
though, is that for convex objectives, the steps that we're taking as we're going through this
algorithm are gonna become smaller and smaller and smaller as we're moving towards our
optimum. Well at least in strongly convex functions, we know that we're converging to our optimal
solution. And so one thing that we can do is we can measure the size of these steps that we're
taking through a full cycle of our different coordinates. Because I wanna emphasize, we have to
cycle through all of our coordinates, zero to d. Before judging whether to stop, because it's
possible that one coordinate or a few coordinates might have small steps, but then you get to
another coordinate, and you still take a large step. But if, over an entire sweep of all coordinates, if
the maximum step that you take in that entire cycle is less than your tolerance epsilon, then that's
one way you can assess that your algorithm's converged. I also wanna mention that this
coordinate descent algorithm is just one of many possible ways of solving this lasso objective. So
classically, lasso was solved using what's called LARS, least angle regression and shrinkage. And
that was popular up until roughly 2008 when an older algorithm was kinda rediscovered and
popularized, which is doing this coordinate descent approach for lasso. But more recently there's
been a lot, a lot of activity in the area of coming up with efficient parallelized and distributed
implementations of lasso solvers. These include a parallel version of coordinate descent. And
other parallel learning approaches like parallel stochastic gradient descent or thinking about this
kind of distribute and average approach that's fairly popular as well. And one of the most popular
approaches specifically for lasso is something called alternating direction method of multipliers,
or ADMM, and that's been really popular within the community of people using lasso. [MUSIC]
[MUSIC] So finally, I just wanted to present the coordinate descent algorithm for lasso if you don't
normalize your features. So this is the most generic form of the algorithm, because of course it
applies to normalized features as well. But let's just remember our algorithm for our normalized
features. So, here it is now. And relative to this, the only changes we need to make are what's
highlighted in these green boxes. And what we see is that we need to precompute for each one of
our features this term Zj, and that's exactly equivalent to the normalizer that we described
when we normalized our features. So if you don't normalize, you still have to compute this
normalizer. But we're gonna use it in a different way as we're going through this algorithm. Where,
when we go to compute rho j, we're looking at our unnormalized features. And when we're forming
our predictions, y hat sub i, so our prediction for the ith observation, again, that prediction is using
unnormalized features. So there are two places in the rho j computation where you would need to
change things for unnormalized features. And then finally, when we're setting w hat j according to
the soft thresholding rule, instead of just looking at rho j plus lambda over two, or rho j minus
lambda over two, or zero, we're gonna divide each of these terms by zj, this normalizer. Okay, so
you see that it's fairly straightforward to implement this for unnormalized features, but the intuition
we provided was much clearer for the case of normalized features. [MUSIC]
we can put it all together and get our subgradient of our entire lasso cost, with respect to wj, and
here this part, this is from our residual sum of squares. Whereas this part is from the L1 penalty.
Or really, lambda times the L1 penalty. And when we put these things together we get three different
cases, because of the three cases for the L1 objective. So, for example, we get 2 zj, that normalizer,
times wj, minus 2 rho j, minus lambda in one of the cases. I won't read all of the different cases. But
this now is our subgradient of our entire lasso cost. Okay, well, let's not get lost in the weeds here.
Let's pop up a level and say, remember
before we would take the gradient and set it equal to 0, or in the coordinate descent algorithm we
talked about taking the partial with respect to one coordinate then setting that equal to zero to get
the update for that dimension. Here now instead of this partial derivative we're taking the
subgradient and we have to set it equal to zero to get the update for this specific jth coordinate.
So, let's do that, so we've taken our subgradient, we're setting it equal to 0. Again we have three
different cases we need to consider. So, in case 1, where wj is less than 0, let's solve for w hat j.
You get w hat j equals (2 rho j + lambda) divided by 2 zj, which I'm just gonna rewrite as (rho j +
lambda over 2) divided by zj. So, I've multiplied the top and bottom by one-half here. Okay, but to
be in this case, to have w hat j less than zero, we need a constraint on rho j. Remember this is that
correlation term, and that's something we can compute because it's a function of all the other
variables except for wj. So, if rho j is less than minus lambda over 2, then we know that w hat j will
be less than 0 according to this
formula. Okay, then we get to the second case, which is wj = 0. In that case, we've already
solved for w hat j, there's only one solution. But in order to have that be the optimum, we know
that this subgradient, when wj = 0, has to contain zero; otherwise we would never
get that this case where wj equals zero is an optimum. So, we need this range to
contain zero, so that w hat j equals 0 is an optimum of our objective. And for that to be true, we
need minus 2 rho j plus lambda to be greater than 0. So, we need this upper term of the interval
greater than 0, which is equivalent to saying that rho j is less than lambda over two, and we need
this bottom term of the interval to be less than zero. So, minus two rho j minus lambda less than zero, which
is equivalent to saying rho j is greater than minus lambda over two. And if we put these together,
what this is saying is that rho j is less than lambda over two and greater than minus lambda over
two. And actually we could put the equal sign here. So, let's just do, so it has to be less than or
equal to, less than or equal to, less than or equal to. Okay, and our final case, let's just quickly
work through it we get w hat j equals. Row l- lambda over 2 divided by Zj and in order to have W
hat J be greater than 0, we need row j would be greater than lambda over two. So, let me just talk
through this kind of in the other direction now that we've done the derivation which is saying if rho
J is less than minus lambda over two, then we'll set W hat J as follows. If row j is in this interval,
we'll set w hat j equal to 0, and if row j is greater than lambda over 2, we're gonna set w hat j as
this third case. Okay, so, this slide just summarizes what I just said. So, this is our optimal 1 D
optimization for this lasso objective. So, let's talk about this more general form of the soft thresholding rule for lasso in the case of our unnormalized features. So, remember, for our normalized features there wasn't this z j here, and what we ended up with for our least squares solution when lambda was equal to 0 was just this line, w hat j equals rho j. But now what does our least squares line look like? Well, again, we can just set lambda equal to 0, and we see that this least squares line, w hat least squares, is equal to rho j over z j. Remember, z j is that normalizer, the sum of the squares of that feature over all our observations. So, that number will be positive, and it's typically gonna be larger than one, potentially much larger than one. So, relative to a slope of 1, a 45 degree angle, I'm saying that this line is typically shrunk more this way in the case of unnormalized features. And then, when I look at my lasso solution, w hat lasso in this case, again, in the range minus lambda over 2 to lambda over 2, I get this thresholding of the coefficients exactly to zero relative to my least squares solution. And outside that range, the difference between the least squares solution and my lasso solution is that my coefficients are each shrunk by an amount lambda over 2 z j. Okay. But remember that rho j here, as compared to when we talked about normalized features, was defined differently. It was defined in terms of our unnormalized features. So, for the same value of lambda that you would use with normalized features, you're getting a different relationship here, a different range where things are set to zero. So, in summary, we've derived this coordinate descent algorithm for lasso in the case of unnormalized features. And in particular, the key insight we had was that instead of just taking the partial of our objective with respect to wj, we had to take the subgradient of our objective with respect to wj, and that's what leads to these three different cases, because the gradient itself is defined for every value of wj except for this one critical point, wj = 0. But in particular, we also had a lot of insight into how this soft thresholding gives us the sparsity in our lasso solutions. [MUSIC]
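Putting those three cases together, the update we just derived can be written compactly as the following soft thresholding rule; this is just my own restatement of the derivation above, using the definitions of rho j and z j from this module.

$$
\hat{w}_j =
\begin{cases}
(\rho_j + \lambda/2)\,/\,z_j & \text{if } \rho_j < -\lambda/2 \\
0 & \text{if } -\lambda/2 \le \rho_j \le \lambda/2 \\
(\rho_j - \lambda/2)\,/\,z_j & \text{if } \rho_j > \lambda/2
\end{cases}
\quad\text{where}\quad
z_j = \sum_{i=1}^N h_j(x_i)^2,\qquad
\rho_j = \sum_{i=1}^N h_j(x_i)\bigl(y_i - \hat{y}_i(\mathbf{w}_{-j})\bigr).
$$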
[MUSIC] Okay, well, in summary, we've really learned a lot this module. We've thought about this task of feature selection, and we've described ways of searching over, first, all possible sets of features that we might want to include to come up with the best model. We talked about the challenges of that computationally, and then we turned to thinking about greedy algorithms, and then discussed this regularized regression approach of lasso for addressing the feature selection task. So really, we've covered a lot of ground, and these are really important concepts in machine learning.
And this lasso regularized regression approach, although really, really simple, has dramatically
transformed the field of machine learning statistics and engineering. It's shown its utility in a
variety of different applied domains. But I wanna mention a really important issue, which we kind
of alluded to, which is that for feature selection, not just lasso, but in general when you're thinking
about feature selection, you have to be really careful about interpreting the features that you
selected. And some reasons for this include the fact that the features you selected are always just
in the context of what you provided as the set of possible features to choose from to begin with.
And likewise, the set of selected features are really sensitive to correlations between features and,
in those cases, small changes in the data can lead to different features being included, too. So you have to be careful with statements like "this feature is important and that one isn't." And also, of course, the set of selected features depends on which algorithm you use. We
especially saw this when we talked about those greedy algorithms, like the forward stepwise
procedure. But I did want to mention that there are some nice theoretical guarantees for lasso
under very specific conditions. So in conclusion, here's a very long list of things that you can do
now that you've completed this module. Everything from thinking about searching over the discrete set of possible models to do feature selection, using all subsets or these greedy algorithms, to formulating a regularized regression approach, lasso, to implicitly do this feature selection,
searching over a continuous space, this tuning parameter lambda. We talked about formulating
the objective, geometric interpretations of why the lasso objective leads to sparsity. And we talked
about using coordinate descent as an algorithm for solving lasso. And so coordinate descent itself
was an algorithm that generalizes well beyond lasso, so that was an important concept that we got
out of this module as well. And finally, if you watch the optional video, we talked about some really
technical concepts relating to subgradients. And to conclude this module, we talked about some of the challenges associated with lasso, as well as some of the potential impact that this method has, because it's really quite an important tool. And like I've mentioned, it's really shown a lot of
promise in many different domains. [MUSIC]
WEEK

2 hours to complete

Nearest Neighbors & Kernel Regression

https://courses.cs.washington.edu/courses/cse446/17wi/slides/
https://courses.cs.washington.edu/courses/cse446/17wi/lectures.html
but the truth is that this cubic fit is kind of a bit too flexible, a bit too complicated for the regions
where we have smaller houses. It looks like it fits very well for our large houses but it's a little bit
too complex for lower values of square feet. Because, really for this data you could describe it as
having a linear relationship like, I talked about for low square feet value. And then, just a constant
relationship between square feet and value for some other region. And then, having this maybe
quadratic relationship between square feet and house value when you get to these really, really
large houses. So this motivates this idea of wanting to fit our function locally to different regions of
the input space, or have the flexibility to have a more local description of what's going on than our models which did these global fits allowed. So what are we gonna do? We want to flexibly define our f(x), the relationship in this case between square feet and house value, to have this type of local structure. But let's say we don't want to assume that there are what are called structural breaks, certain change points where the structure of our regression is gonna change; in that case, you'd have to infer where those break points are. Instead, let's consider a
really simple approach that works well when you have lots of data. [MUSIC]
and I'm gonna choose my predicted value for my house to be exactly the same as this other
observation. So we can see that this leads to this idea of having local fits where these local fits are
defined around each one of our observations. And how local they are, how far they stretch is
based on where the placement of other observations are. And this is called one nearest neighbor
regression.
So specifically, let's define our x nearest neighbor to be the house that minimizes, over all of our observations i, the distance between xi and our query house, xq.
And then we say, what is the value of that house, which is how much it sold for, y nearest neighbor. And that's exactly what we're gonna predict for our query house. So the key thing in our
nearest neighbor method is this distance metric which measures how similar this query house is to
any other house. And this defines our notion of in quotes closest house in the data set. And it's a
really, really key thing to how the algorithm's gonna perform. For example, what house it's gonna
say is most similar to my house. So we're gonna talk about distance metrics in a little bit. But first,
let's talk about what one nearest neighbor looks like in higher dimensions because so far, we've
assumed that we have just one input like square feet and in that case, what we had to do was
define these transition points. Where we go from one nearest neighbor to the next nearest
neighbor. And thus changing our predicted values across our input space square feet. But how do
we think about doing this in higher dimensions? Well, what we can do is look at something that's
called a Voronoi diagram or a Voronoi tessellation. And here, we're showing a picture of such a
Voronoi tessellation, but just in two dimensions, though this idea generalizes to higher dimensions
as well. And what we do is we're just gonna divide our input space into regions. So in the case of
two dimensions, they're just regions in this two dimensional space. So I'm just gonna highlight one
of these regions. And each one of these regions is defined by one observation. So here is an
observation. And what defines the region is the fact that any other point in this region, let's call it x.
Let's call this xi, and this other point some other x. Well, this point x is closer to xi, let me write this.
So x closer to xi than any other xj, for j not equal to i. Meaning any other observation. So in
pictures, what we're saying is that the distance from x to xi is less than the distance to any of
these other observations. In our dataset. So what that means is, let's say that x is our query point.
So let's now put a sub q. If x is our query point, then when we go to predict the value associated
with xq. We're just gonna look at what the value was associated with xi, the observation that's
contained within this region. So this Voronoi diagram might look really complicated, but we're not
actually going to explicitly form all these regions. All we have to do is be able to compute the
distance and define between any two points in our input space and define which is the closest.
[MUSIC]
[MUSIC] How are we defining distance? Well, in 1-D it's really straightforward because our distance
on continuous space is just gonna be Euclidean distance. Where we take our input-xi and our
query x-q and look at the absolute value between these numbers. So, these might represent
square feet for two houses and we just look at the absolute value of their difference.
In multiple dimensions, we can define what's called a scaled Euclidean distance, where we take the distance between now this vector of inputs, let's call it xj, and this vector of inputs associated with our query house, xq, and we're gonna component-wise look at their difference squared. But then we're gonna scale it by some number. And then we're gonna sum this over all our different dimensions, okay? So, in particular, I'm using this letter a to denote the scaling. So, a sub d is the scaling on our dth input,
and what this is capturing is the relative importance of these different inputs in computing this
similarity. And after we take the sum of all these squares we're gonna take the square root and if
all these a values were exactly equal to 1, meaning that all our inputs had the same importance
then this just reduces to standard Euclidean distance. So, this is just one example of a distance
metric we can define at multiple dimensions, there's lots and lots of other interesting choices we
might look at as well. But let's visualize what impact different distance metrics have on our resulting
nearest neighbor fit. So, if we just use standard Euclidean distance on the data shown here. We
might get this image, which is shown on the right where the different colors indicate what the
predicted value is in each one of these regions. Remember each region you're gonna assume any
point in that region, the predicted value is exactly the same because it has the same nearest
neighbor. So, that's why we get these different regions of constant color. But if we look at the plot
on the left hand side, where we're using a different distance metric, what we see is we're defining
different regions where again those regions mean that any point within that region is closer to the
one data point lying in that region, than any of the other data points in our training data set, but the
way this distance is defined is different, so the regions look different. For example, with this Manhattan distance, what this is saying is, just think of New York and driving along the streets of New York: it's measuring distance along these axis-aligned directions, so it's distance along the x direction plus distance along the y direction, which is different than our standard Euclidean distance. [MUSIC]
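As a quick illustration, here is a small Python/NumPy sketch of these two distance metrics; the function names and the vector-of-weights interface are just my own choices for the example.

```python
import numpy as np

def scaled_euclidean_distance(x_j, x_q, a):
    """Scaled Euclidean distance between two input vectors.

    x_j, x_q : (D,) arrays of inputs (e.g. the features of two houses)
    a        : (D,) array of non-negative weights a_d giving the relative
               importance of each input; a = all ones recovers standard
               Euclidean distance
    """
    return np.sqrt(np.sum(a * (x_j - x_q) ** 2))

def manhattan_distance(x_j, x_q):
    """Sum of absolute coordinate-wise differences (distance 'along the streets')."""
    return np.sum(np.abs(x_j - x_q))
```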
we can initialize what I'm calling distance to Nearest Neighbor to be infinity and initialize our
closest house to be the empty set. Then what we do is we're going to step through every house in
our dataset. And we're going to compute the distance from our query house to the house that
we're at at our current iteration, and if that distance is less than the current distance to our nearest
neighbor. Which at first is infinity, so at the first iteration. You're definitely gonna have a closer
house. So the first house that you search over you're gonna choose as your nearest neighbor for
that iteration. But remember, you're gonna continue on. And so what you're gonna do is, if the
distance is less than the distance to your nearest neighbor, you're gonna set your current nearest
neighbor equal to that house, or that x. And then you're gonna set your distance to nearest
neighbor equal to the distance that you had to that house. And then you're just gonna iterate
through all your houses. And in the end what you're gonna return is the house that was most
similar. And then we can use that for prediction to say that the value associated with that house is
the value we're predicting for our query house. So let's look at what this gives us on some actual
data. So, here, we drew some set of observations from a true function that had this kind of curved
shape to it. And, the blue line indicates the true function that was used to generate this data. And
what the green represents is our one nearest neighbor fit. And what you can see is that the fit
looks pretty good for data that's very dense in our input space. So, dense in x; we get lots of observations across our whole input space. But if we just remove some observations in a region of our input space, things start to look not as great, because nearest neighbor really struggles
to interpolate across regions of the input space where you don't have any observations or you
have very few observations. And likewise, if we look at a data set that's much noisier, we see that
our Nearest Neighbor fit is also quite wild. So this looks exactly like the types of plots we showed
when we talked about models that were overfitting our data. So what we see is that one nearest neighbor is also really sensitive to noise in the data. [MUSIC]
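Here is a minimal Python/NumPy sketch of that one nearest neighbor search, assuming a training matrix X_train with one row per house, a vector of house values y_train, and a query vector x_query; the function name and the Euclidean distance default are my own choices for the example.

```python
import numpy as np

def one_nearest_neighbor_predict(X_train, y_train, x_query,
                                 dist=lambda a, b: np.linalg.norm(a - b)):
    """Scan every training point, keep the closest one, and predict its value."""
    best_dist = np.inf           # "distance to nearest neighbor" starts at infinity
    best_index = None            # "closest house" starts out empty
    for i in range(X_train.shape[0]):
        d = dist(X_train[i], x_query)
        if d < best_dist:        # found a closer house: update the current nearest neighbor
            best_dist = d
            best_index = i
    return y_train[best_index]   # predict the value of the most similar house
```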
For completeness I've included what the k nearest neighbor algorithm looks like where the one
key insight here is to maintain a sorted queue of the k nearest neighbors as you're going through
the different steps of your search. So let's just walk through this algorithm here. What we're gonna
do is we're gonna initialize what I'm calling distance to k nearest neighbors, which is a list of
distances, the current distances to the k nearest neighbors I have at the current iteration. And so
to start with, we can just look at the first k houses in our data set and initialize those as our k
nearest neighbors. But we're gonna sort these distances for these k nearest neighbors for these
first k houses. However these distances sort, we're gonna likewise sort the first k houses in our
data set to be our current estimate of our nearest neighbors. Then for now, we're gonna just cycle
through houses k + 1 to capital N. What we're gonna do is we're gonna compute the distance of
this new house that we haven't seen before to our query house. And if that distance is less than
the distance to my kth nearest neighbor, the house that's furthest away, then what I'm going to do
is I'm going to go through my sorted list. Because remember just because this new house is closer
then the kth nearest neighbor, that doesn't mean that this new house becomes my kth nearest
neighbor because it could be closer than my second nearest neighbor or fourth. So what I need to
do is I need to go through and find which index this house should go into. Meaning there's some index j at which nearest neighbors one, two, up through j minus one are actually closer than this house that I'm looking at now, but this house is closer than all the other nearest neighbors in my
set. And what I'm gonna do, now that we have this index j, we're gonna go and try and insert this
house that we found that's one of our current estimates of our nearest neighbor into the queue.
And what we're gonna do is we're simply gonna knock off the kth nearest neighbor, which we know was our worst estimate of our nearest neighbor so far, and insert this house into this queue. So this step is just shifting the indices for our list of houses, just knocking off that kth, last house. And then this step is doing exactly the same thing, but for the distances associated with each of these nearest neighbors. So finally, here, this is where we're inserting the new nearest neighbor, where, in that jth slot, we're putting the distance to this newly found house. And in the jth slot of
our list of houses, we're putting the house itself. So once we cycle through all our observations in
our data set, what we're gonna end up with is a set of the k most similar houses to the query
house. And like we described before, when we wanna go and predict the value associated with
our query house, we're simply gonna average over the values associated with these k nearest
neighbors. [MUSIC]
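Here is a rough Python/NumPy sketch of that k nearest neighbor search with a sorted queue, under the same assumptions as the 1-NN sketch above (my own function name and a default Euclidean distance); it ends with the averaging step used for the prediction.

```python
import numpy as np

def k_nearest_neighbors_predict(X_train, y_train, x_query, k,
                                dist=lambda a, b: np.linalg.norm(a - b)):
    """Maintain a sorted queue of the k closest points seen so far,
    then predict by averaging their values."""
    N = X_train.shape[0]

    # initialize the queue with the first k houses, sorted by distance to the query
    first_dists = [dist(X_train[i], x_query) for i in range(k)]
    order = np.argsort(first_dists)
    nn_dist = [first_dists[i] for i in order]   # sorted distances to the current k nearest neighbors
    nn_idx = [int(i) for i in order]            # indices of the current k nearest neighbors

    for i in range(k, N):
        d = dist(X_train[i], x_query)
        if d < nn_dist[-1]:                     # closer than the kth (worst) neighbor so far
            # find the slot j where this house belongs in the sorted queue
            j = next(pos for pos, val in enumerate(nn_dist) if d < val)
            # insert it and knock off the old kth nearest neighbor
            nn_dist.insert(j, d); nn_dist.pop()
            nn_idx.insert(j, i);  nn_idx.pop()

    # predict by averaging the values of the k nearest neighbors
    return np.mean(y_train[nn_idx])
```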

And if we think about averaging the values of all these points, that results in the value of the
green line at this target point x0. And we can repeat this for every value of our input space, and
that's what's gonna give us this green curve. And so what we see here is that this fit looks much
more reasonable than that really noisy one nearest neighbor fit we showed before. But one thing that I do want to point out is that we get these boundary effects, and the same is true if we have limited data in any region of the input space. In particular, at the boundary, the reason that we get these constant fits is the fact that our nearest neighbors are exactly the same set of points for all these different input points. Because if I'm all the way over at the boundary, all my nearest neighbors are the k points to either the right or left of me, depending which boundary I'm at. And then if I shift over one point, I still have the same set of nearest neighbors, obviously except for the one point that is the query point. But aside from that, it's basically the same set of values that you're using at each one of these points along the boundary. Overall, though, we see that we've been able to cope with some of the noise that we had in the one nearest neighbor situation a lot better than we did before. But beyond the boundary issues, there's another fairly important issue with the k nearest neighbors fit, which is the fact that you get discontinuities. So, if you look closely at this green line, there are a bunch of jumps between values. And the reason you get those jumps is the
fact that as you shift from one input to the next input, a nearest neighbor is either completely in or
out of the window. So, there's this effect where all of a sudden a nearest neighbor changes, and
then you're gonna get a jump in the predicted value. And so the overall effect on predictive
accuracy might actually not be that significant. But there's some reasons we don't like fits with
these types of discontinuities. First, visually maybe it's not very appealing. But let's think in terms
of our housing application, where what this means is that if we go from a house, for example,
2640 square feet to a house of 2641 square feet. To you, that probably wouldn't make much of a
difference in assessing the value but if you have a discontinuity between these two points what it
means is there's a jump in the predicted value. So, I take my house at some predicted value, I just add one square foot, and the predicted value would perhaps have a significant increase or decrease. And so that is not very attractive in applications like housing. And more generally, we just don't tend to believe these types of fits.
[MUSIC] So this leads us to consider something called Weighted K-nearest Neighbors. And what
weighted k-nearest neighbors does, is it down weights the neighbors that are further from the
specific query point, or target point. And the effect that has is as we're shifting from target point to
target point, when a neighbor jumps in or out of our set of nearest neighbors, the effect of that isn't
as significant, because when I'm computing what my prediction is at a given value, I've already downweighted that neighbor, so when it drops out, it was not that significant in forming my predicted value to start with. So, in particular, when we talk about our predicted value, we're gonna weight each one of our nearest neighbor observations by some amount that I'll call c subscript q, for looking at query point q, and then I also have a nearest neighbor j, indexing which nearest neighbor that weight is associated with. And the question is, how do we wanna define these weights? Well, if the distance between my query point and its nearest neighbor is really large, then we want this constant that we're multiplying by to be very small, because we wanna downweight that observation. But in contrast, if the distance is very small, meaning this house is very similar to my house, I want that constant to be larger. So one simple choice is simply to take our weights to be equal to 1 over the distance between these points. So let's say that c, q, nearest neighbor j, is gonna be equal to one over the distance between xj and xq. So this will have the effect that when the distance is larger, the weight is smaller; when the distance is smaller, the weight is larger. More generally, we can think of defining these weights using something that's called a kernel, where here, in this slide, we're gonna focus on the simplest case of what are called isotropic
kernels.

And these kernels are just the function of the distance between any point and the target point, the
query X Q. And in this figure we're showing a couple examples of commonly used kernels. What
we see is this kernel is defining how the weights are gonna decay, if at all, as a function of the
distance between a given point and our query point. So as an example here, one commonly used
kernel is something called The Gaussian kernel. And what we see is that this kernel takes the
distance between XI and XQ, looks at the square of that distance. And then divides that by a
parameter of lambda and that lambda is going to define how quickly the function decays and
importantly, this whole thing that I described is exponentiated so that all these weights are
positive. But I wanna emphasize that this parameter lambda that all of these kernels have, that's
why I've written kernel sub lambda, define how quickly we're gonna have the weights decay as a
function of this distance. And another thing that's worth noting is the fact that for this Gaussian kernel that I've mentioned here, the weights never go exactly to zero, they just become very, very, very, very small as the distance increases. But for the other kernels I've shown on this plot here, once you get beyond this minus lambda to lambda region, the weights go exactly to zero.
So what we just described was assuming that we were just looking at a single input like square feet, and thus our distance was just our standard Euclidean distance, looking at the absolute value of the difference between the two inputs. But more generally, you can use the same kernels in higher dimensional spaces, where, again, you just compute the distance, however you've defined it in your multivariate space, and plug that into the kernel to define your weight. [MUSIC]
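For instance, here is a small Python/NumPy sketch of how such kernel weights could be used to form the weighted k-NN prediction, assuming you have already found the distances and values of the k nearest neighbors. The function names and the Gaussian kernel choice are mine; the simple 1-over-distance weights mentioned above would slot in the same way.

```python
import numpy as np

def gaussian_kernel(d, lam):
    """Gaussian kernel weight for a distance d; lam controls how quickly weights decay."""
    return np.exp(-d ** 2 / lam)

def weighted_knn_predict(nn_dists, nn_values, lam):
    """Weighted k-NN prediction: a kernel-weighted average of the neighbors' values.

    nn_dists  : distances from the query point to its k nearest neighbors
    nn_values : the target values of those k neighbors
    """
    weights = gaussian_kernel(np.asarray(nn_dists), lam)     # the c_{q, NN_j} weights
    return np.sum(weights * np.asarray(nn_values)) / np.sum(weights)
```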
[MUSIC] So, these ideas lead directly to a related, but slightly different, approach called kernel regression. Let's start by just recalling what we did for weighted k-NN. And what we did was we
took our set of k-NN, and we applied some weight to each nearest neighbor based on how close
that neighbor was to our query point. Well for kernel regression, instead of just weighting some set
of k-NN, we're gonna apply weights to every observation in our training set. So in particular, our
predicted value is gonna sum over every observation, and we're gonna have a weight C sub qi on
each one of these data points. And this weight is gonna be defined according to the kernels that
we talked about for weighted k-NN. And in statistics, this is often called Nadaraya-Watson kernel weighted averaging. So let's see what effect this kernel regression has in practice. And now what
we're showing is this yellow, curved region, represents some kernel that we're choosing. In this
case, it's called the Epanechnikov kernel, and what we see are some set of red, highlighted
observations for our given target point, in this case called x0. And I want to emphasize that we didn't set a fixed number of observations to highlight as red like we did in our k-NN. Here what we did is we chose a kernel with a given, what's called, bandwidth, that's the lambda parameter we discussed earlier, and that defines a region in which our observations are gonna be included in our weighting. Because, in this case, when we're looking at this Epanechnikov kernel, this kernel has
bounded support. And that's in contrast to, for example, the Gaussian kernel. And what that
means is that there's some region in which points will have this weighting, these decaying sets of
weights. And then outside this region the observations are completely discarded from our fit at this
one point. So, what we do is at every target point, x0, in our input space, we weight our
observations by this kernel. So we compute this weighted average, and we say that is our
predicted value. We do that at each point in our input space to carve out this green line as our
predicted fit. So the result of this kernel regression isn't very different from what the fit would look like from weighted k-NN. It's just that in this case, instead of specifying k, we're specifying a region
based on this kernel bandwidth for weighting the observations to form this fit. But what we see is,
in this case, for this kernel regression fit, which, like I've alluded to, should look fairly similar to our weighted k-NN, things look a lot smoother than our standard k-NN fit.
So there are two important questions when doing kernel regression. One is which kernel to use, and the other is, for a given kernel, what bandwidth should I use? But typically the choice of the bandwidth matters much more than the choice of kernel. So to motivate this, let's just look again at
this Epanechnikov kernel with a couple different choices of bandwidths. And what we see is that the fit dramatically changes. Here, for example, with a small bandwidth, we get a much wilder fit that's overfitting. Here, for this lambda value, things look pretty reasonable. But when we choose a bandwidth that's too large, we start to get oversmoothing. So this is an oversmoothed fit to the data that's not making very good
predictions. So we can think of this in terms of this bias variance trade-off. This lambda parameter
is controlling how much we're fitting our observations, which is having low bias but high variance.
Very sensitive to the observations. If I change the observations, I get a dramatically different fit for
kernel regression, versus over here for a very large bandwidth, it's the opposite. I have very low
variance as I'm changing my data. I'm not gonna change the fit very much, but high bias.
Significant differences relative to the true function, which is shown in blue. But just to show how
we're fairly insensitive to the choice of kernel, here in the middle plot I'm just gonna change the
kernel from this Epanechnikov to our boxcar kernel. And where the boxcar kernel, instead of
having decaying weights with distance, it's gonna have a fixed set of weights. But we're only
gonna include observations within some fixed window, just like the Epanechnikov kernel. So this
boxcar kernel starts to look very, very similar to our standard k-NN. But again, instead of fixing
what k is, you're fixing a neighborhood about your observation. But what you see is that although
this boxcar window has these types of discontinuities we saw with k-NN, because observations
are either in or they're out. The fit looks fairly similar to our Epanechnikov kernel. So, this is why
we're saying that the choice of bandwidth has a much larger impact than the choice of kernel.
So this leads to the next important question of, how are we gonna choose our bandwidth
parameter lambda that we said matters so much? Or when we're talking about k-NN, the
equivalent parameter we have to choose is k, the number of nearest neighbors we're gonna look
at. And again, in the k-NN, I just wanna mention this now, we saw a similar kind of bias variance
trade-off, where for one nearest neighbors we saw these crazy, wildly, overfit functions. But once
we got to k, for some larger k value, we had much more reasonable and well behaved fits. So
again, we have the same type of bias variance trade-off for that parameter as well. And so the
question is, how are we choosing these tuning parameters for these methods that we're looking
at? Well, it's the same story as before, so we don't have to go into the lengthy conversations that we've had in past modules, and hopefully you know the answer is cross validation, or using some validation set, assuming you have enough data to do that. [MUSIC]
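To tie this together, here is a minimal Python/NumPy sketch of Nadaraya-Watson kernel regression with an Epanechnikov kernel. The function names are mine, and in practice you would pick the bandwidth lam by cross validation or a validation set, as just discussed.

```python
import numpy as np

def epanechnikov_kernel(d, lam):
    """Epanechnikov kernel: weights decay with distance and are exactly zero beyond lam."""
    u = d / lam
    return 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def kernel_regression_predict(X_train, y_train, x_query, lam,
                              kernel=epanechnikov_kernel):
    """Nadaraya-Watson prediction: a kernel-weighted average over ALL training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    weights = kernel(dists, lam)                 # c_{q,i} for every training observation i
    if weights.sum() == 0.0:                     # no observations fall within the bandwidth
        return np.nan                            # (bandwidth too small for this query point)
    return np.sum(weights * y_train) / np.sum(weights)
```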
[MUSIC] At the beginning of this module, we talked about this idea of fitting globally versus fitting
locally. Now that we've seen k nearest neighbors and kernel regression, I wanna formalize this
idea. So
in particular, let's look at what happens when we just fit a constant function to our data. So in that
case, that's just computing what's called a global average, where we take all of our observations, add them together, and take the average, just dividing by that total number of observations. So
that's exactly equivalent to summing over a weighted set of our observations, where the weights
are exactly the same on each of our data points, and then dividing by the total sum of these
weights. So now that we've put our global average in this form, things start to look very similar to
the kernel regression ideas that we've looked at. Where here it's almost like kernel regression, but
we're including every observation in our fit, and we're having exactly the same weights on every
observation. So that's like using this box car kernel that puts the same weights on all observations,
and just having a really, really massively large bandwidth parameter such that, for every point in our input space, all the other observations are gonna be included in the fit.
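In symbols, and this is just my own compact restatement of what was described above, both fits share the same weighted average form:

$$
\hat{y}_q \;=\; \frac{\sum_{i=1}^{N} c_{qi}\, y_i}{\sum_{i=1}^{N} c_{qi}},
\qquad
c_{qi} = 1 \;\text{(global average)}
\quad\text{versus}\quad
c_{qi} = K_\lambda\bigl(\mathrm{distance}(x_i, x_q)\bigr) \;\text{(kernel regression)}.
$$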
But now let's contrast that with a more standard version of kernel regression, which leads to what we're gonna think of as locally constant fits. Because if we look at the kernel regression equation, what we see is that it's exactly what we had for our global average, but now it's gonna be weighted by this kernel. Where, in a lot of cases, what that kernel is doing is putting a hard limit, so that observations outside of our window around whatever target point we're looking at are out of our calculation. So the simplest case we can talk about is this boxcar kernel, that's gonna put equal weights over all observations, but just local to our target point x0. And so, we're gonna get a constant fit, but just at that one target point, and then we're going to get a different constant fit at the next target point, and the next one, and the next one. And I want to be clear that the resulting output isn't a staircase kind of function. It's not a collection of these constant fits over whole regions; it is a collection of the constant fits, but each evaluated just at a single point. So we're taking a single point, doing a constant fit, keeping only the value at that target, then doing another constant fit at the next target, and as we're doing
this over all our different inputs that's what's defining this green curve. Okay, but if we look at
another kernel, like our Epanechnikov kernel that has the weights decaying over this fixed region.
Well, it is still doing a constant fit, but how is it figuring out what the level of that line should be at
our target point? Well, what it's doing is, it's just down weighting observations that are further from
our target point and emphasizing more heavily the observations closer to our target point. So this is just like that weighted global average, but it's no longer global, it's local, because we're only looking at observations within this defined window. So we're doing this weighted average locally at each
one of our input points and tracing out this green curve. So, this hopefully makes very clear how, before, in the types of linear regression models we were talking about, we were doing these global fits, which in the simplest case was just a constant model, our most basic model, with just the constant feature. And now what we're talking about is doing exactly the same thing, but locally, and so locally that it's at every single point of our input space.
And this is referred to as locally weighted averages but instead of fitting a constant at each point in
our input space we could have likewise fit a line or polynomial. And so what this leads to is
something that's called locally weighted linear regression.
We are not going to go through the details of locally weighted linear regression in this module.
It's fairly straightforward. It's a similar idea to these local constant fits, but now plugging in a line or
polynomial. But I wanted to leave you with a couple rules of thumb for which fit you might choose
between a different set of polynomials that you have options over. And one thing that fitting a local
line instead of a local constant helps you with are those boundary effects that we talked about
before. The fact that you get these large biases at the boundary. So you can show very formally
that these local linear fits help with that bias, and if we talk about local quadratic fits, that helps
with bias that you get at points of curvature in the interior of the input space. So, for example, we see that blue curve we've been trying to fit, and maybe it's worth quickly jumping back to what our fit looks like: we see that towards the boundary we get large biases, and right at the point of curvature, we also have a bias where we're underfitting the true curvature of that blue function.
And so the local quadratic fit helps with fitting that curvature. But what it does is it actually leads to
a larger variance so that can be unattractive. So in general just a basic recommendation is to use
just a standard local linear regression, fitting lines at every point in the input space. [MUSIC]
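Since this module only sketches the idea, here is a rough, illustrative Python/NumPy sketch of what a locally weighted linear fit at a single query point could look like: kernel weights centered at the query, a weighted least squares solve, and a prediction from that local line. The function name, the inlined Epanechnikov weights, and the direct normal-equations solve are my own simplifications, and the sketch assumes enough observations fall within the bandwidth for the solve to be well posed.

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_query, lam):
    """Fit a line by weighted least squares with kernel weights centered at
    x_query, then predict with that local line at x_query itself."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Epanechnikov-style weights: decay with distance, exactly zero beyond the bandwidth lam
    w = np.maximum(1.0 - (dists / lam) ** 2, 0.0)

    # design matrix with a constant feature plus the raw inputs
    A = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    a_query = np.concatenate([[1.0], np.atleast_1d(x_query)])

    # weighted least squares via the weighted normal equations: (A^T W A) beta = A^T W y
    AW = A * w[:, None]
    beta = np.linalg.solve(A.T @ AW, AW.T @ y_train)
    return float(a_query @ beta)
```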
[MUSIC] So now let's step back and discuss some important theoretical and practical aspects of
K-nearest neighbors and kernel regression. If you remember the title of this module it was Going
Nonparametric, and we've yet to mention what that means. What is a nonparametric approach?
Well, K-nearest neighbors and kernel regression are examples of nonparametric approaches. And
the general goal of a non-parametric approach is to be really flexible in how you're defining f of x
and in general you want to make as few assumptions as possible. And the real key that defines a non-parametric method is that the complexity of the fit can grow as you get more data points.
We've definitely seen that with K-nearest neighbors and kernel regression, in particular the fit is a
function of how many observations you have. But these are just two examples of nonparametric
methods you might use for regression. There are lots of other choices. Things like splines, and
trees which we'll talk about in the classification course, and locally weighted structured versions of
the types of regression models we've talked about. So nonparametrics is all about this idea of
having the complexity grow with the number of observations. So now let's talk about what's the
limiting behavior of nearest neighbor regression as you get more and more data. And to start with,
let's just assume that we get completely noiseless data. So every observation we get lies exactly
on the true function. Well in this case, the mean squared error of one nearest neighbor regression
goes to zero, as you get more and more data. But let's just remember what mean squared error is
and if you remember from a couple modules ago, we talked about this bias-variance trade-off and
that mean squared error is the sum of bias squared plus variance. So having mean squared error
go to zero means that both bias and variance are going to zero. So to motivate this visually, let's
just look at a couple of movies. Here, in this movie, I'm showing what the one nearest neighbor fit
looks like as we're getting more and more data. So, remember the blue line is our true curve. The
green line is our current nearest neighbor fit based on some set of observations that are gonna lie
exactly on the true function at that blue curve. Okay, so here's our fit changing as we get more
and more data and what you see is that it's getting closer, and closer, and closer, and closer, and
closer to the true function. And hopefully you can believe that in limit of getting an infinite number
of observations spread over our input space this nearest neighbor fit is gonna lie exactly on top of
the true function. And that's true of all possible data sets of infinite number of observations that we
would get. In contrast, if we look at just doing a standard quadratic fit, just our standard least squares fit we talked about before, no matter how much data we get, there's always gonna be
some bias. So we can see this here, where especially at this point of curvature we see that this
green fit, even as we get lots and lots of observations, is never matching up to that true blue
curve. And that's because that true blue curve, that's actually a part of a sinusoid. We've just
zoomed in on a certain region of that sinusoid. And so this quadratic fit is never exactly gonna
describe what a sinusoid is describing.
So this is what we talked about before, about the bias that's inherent in our model, even if we have
no noise. So even if we eliminate this noise, we still have the fact that our true error, as we're
getting more and more and more data, is never gonna go exactly to zero. Unless, of course, the
data were generated from exactly the model that we're using to fit the data. But in most cases, for
example, maybe you have data that looks like the following. So this is our noiseless data, and, remember, this was all for fixed model complexity, so if we constrain our model to, say, be a quadratic, then maybe this will be our best quadratic fit. And no matter how many
observations I give you from this more complicated function, this quadratic is never gonna have
zero bias. In contrast, let's switch colors here, so that we can draw our one nearest neighbor fit.
Our one nearest neighbor, as we get more and more data, it's fitting these constants locally to
each observation. And as we get more and more and more data, the fit is gonna look exactly like
the true curve. And so when we talk about our true error with an increasing number of observations, our true error, and this is a plot of true error for one nearest neighbor, is going to go to zero for noiseless data. But now let's talk about the noisy case. This is the case that we're typically faced
with in most applications. And in this case what you can say is that the mean squared error of
nearest neighbor regression goes to zero. If you allow the number of neighbors or the k in our
nearest neighbor regression, to increase with the number of observations as well. Because if you
think about getting tons and tons and tons of observations. If you keep k fixed, you're just gonna
be looking at a very, very, very local region of your input space. And you're gonna have a lot of noise introduced from that, but if you allow k to grow, it's gonna smooth over the noise that's
being introduced. So let's look at a visualization of this. So here what we're showing is the same
true function we've shown throughout this module, but we're showing tons of observations, all
these dots. But they're noisy observations, they're no longer lying exactly on that blue curve.
That's why they're this cloud of blue points. And we see that our one nearest neighbor fit is very,
very noisy. Okay, it has this wild behavior, because like we discussed before, one nearest
neighbor is very sensitive to noise in the data. But in contrast, if we look at a large k, so here we're
looking at k equals 200. So our 200 nearest neighbors fit, it looks much, much better. So you can
imagine that as you get more and more observations, if you're allowing k to grow, you can smooth
over the noise being introduced by each one of these observations, and have the mean squared
error going to zero. But in contrast, again, if we look at just our standard least squares regression
here in this case of a quadratic fit, we're always gonna have bias. So nothing's different by having
introduced noise. It, if anything, will just make things worse. [MUSIC]
[MUSIC] So, we've talked about the performance of nearest neighbor regression, as you get tons
and tons of observations. And things look pretty good in that case. Actually, amazingly good, but
the picture isn't quite as nice when we start increasing the dimension of our input space or if we
just have a limited set of observations. Because the performance of nearest neighbors regression,
as we talked about before, is very sensitive to how well the data covers the input space. If you
remember we had this plot early on where we just removed observations from some point of our
input space and we got a pretty bad fit in that region. Well, now imagine you have a really huge, really, really high dimensional space that you are looking at. And you need your
observations to cover all points within this space. Well, you're gonna need an exponentially large
number of observations in the dimension of the space you're looking at in order to cover this
space and have good nearest neighbor performance. And that's typically really hard to do, to have
as many observations as you would need in these high-dimensional spaces. So, the cases where
you have limited data relative to the dimension of the input space you're looking at is where these
parametric models that we've talked about throughout the rest of this course become so important.
And another thing that's important to talk about is the complexity of our nearest neighbor search
and the naive type of algorithm that we've talked about so far which is just our search through our
entire data set can be quite computationally intensive. So, for example, if you have just one query
point, xq, and you're just doing one nearest neighbor search, you have to scan through every
single observation in your data set and compute the distance to each one of those observations.
So, going from our query house to each one of the different houses in our data set computing this
similarity between these houses in order to find the nearest neighbor. So, that has complexity,
that's linear in the number of observations that we have. Or if we wanna do our k nearest neighbors approach, we have to maintain a list of the k nearest neighbors, and if you do this with the sorted queue that we've talked about, then you can do this search in O(N log k). But both of these, either for our 1-NN or our k-NN, have complexity linear in the number of observations we have. But
what if N is huge? Because this is the situation where we've said these k-NN approaches work so
well. Well, that can be really, really intensive to do, and especially if you have to do many queries.
So, if you wanna predict the values at many points in your input space, this can be a really, really
intensive procedure. So, instead, we're gonna talk about more efficient methods for doing this type
of nearest neighbor search in our Clustering and Retrieval course later in this specialization.
[MUSIC]
[MUSIC] So we've talked about using k-NN for regression, but these methods can also be very,
very straightforwardly used for classification. So this is a little warm-up for the next course in this
specialization, which is our classification course. And let's start out by just recalling our
classification task, where we're gonna do this in the context of spam filtering. Where we have
some email as our input, and the output is gonna be whether the email is spam or not spam. And
we're gonna make this decision based on the text of the email, maybe information about the sender IP, and things like this. Well, what we can do is use k-NN for classification. Visually, we can
think about just taking all of the emails that we have labeled as spam or not spam and throwing
them down in some space. Where the distance between emails in this space represents how
similar the text or the sender IP information is. All the inputs or the features we're using to
represent these emails. And then what we can do, is we get some query email that comes in. So,
that's this little gray email here. And we're gonna say, is it spam or not spam? There's a very
intuitive way to do this, which is just search for the nearest neighbors of this email. The emails
most similar to this email. And then we're just gonna do a majority vote on the nearest neighbors
to decide whether this email is spam or not spam. And what we see in this case, is that four of the
neighbors are spam, and only one neighbor is not spam, so we're gonna label this email as spam.
And so this is the really, really straightforward approach of using k-NN for classification. [MUSIC]
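Here is a minimal Python sketch of that majority vote idea, assuming the emails have already been turned into numeric feature vectors X_train with string labels such as 'spam' and 'not spam'; the function name and the default Euclidean distance are my own choices for the example.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, labels, x_query, k,
                 dist=lambda a, b: np.linalg.norm(a - b)):
    """Classify a query point by a majority vote among its k nearest neighbors."""
    dists = [dist(x, x_query) for x in X_train]
    nearest = np.argsort(dists)[:k]                # indices of the k most similar examples
    votes = Counter(labels[i] for i in nearest)    # e.g. Counter({'spam': 4, 'not spam': 1})
    return votes.most_common(1)[0][0]              # the majority label
```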
[MUSIC] So in summary, we've talked about nearest neighbor and kernel regression. And these
are, as we've seen, really simple approaches. Very simple to think about intuitively, and really
simple to implement in practice. But they have surprisingly good performance in just a very wide
range of different applications. And some things in particular that we talked about in this module
are how to perform one nearest neighbor or k-NN regression. And we also talked about ideas of
weighting our k-NNs, leading us to this idea of doing kernel regression. And for this, there was this
choice of our bandwidth parameter, which is kind of akin to the k in k-NN. And we said we could
just choose this using cross validation. And then we talked about some of the theoretical and
practical aspects of k-NN and kernel regression. Talking about some really nice properties of k-NN
as you get lots and lots of data. But also some computational challenges that you run into. And
challenges you run into if you don't have a lot of data or if you're in really high dimensional input
spaces. And finally, we talked about how one can use k-NN for classification. And we're gonna
talk a lot more about classification in the next course, which is all about classification, so stay
tuned for that course. [MUSIC]
1 hour to complete
Closing Remarks
