
Linear Models and Learning via Optimization
CS771: Introduction to Machine Learning
Piyush Rai
2
Announcements
- Quiz 1 postponed by a week (now on Aug 23)
- Venue TBD, timing: 7pm, duration: 45 minutes
- Syllabus will be everything up to the Aug 21 class
- Homework 1 released by end of this week

3

Wrapping Up Decision Trees..

4
Avoiding Overfitting in DTs
- Desired: a DT that is not too big in size, yet fits the training data reasonably well
- Note: an example of a very simple DT is the "decision stump"
  - A decision stump only tests the value of a single feature (or a simple rule)
  - Not very powerful by itself, but often used in a large ensemble of decision stumps
- Some ways to keep a DT simple enough:
  - Control its complexity while building the tree (stopping early)
  - Prune after building the tree (post-pruning)
- Criteria for judging which nodes could potentially be pruned (from an already built, complex DT):
  - Use a validation set (separate from the training set)
  - Prune each node whose removal doesn't hurt the accuracy on the validation set
  - Greedily remove the node whose removal improves the validation accuracy the most
  - Stop when the validation set accuracy starts worsening
5
Ensemble of Trees
- An ensemble is a collection of models; popular in ML
  - Each model makes a prediction; take their majority as the final prediction
- An ensemble of trees is a collection of simple DTs
  - Often preferred over a single massive, complicated tree
- A popular example: Random Forest (RF)
  - Each tree is trained on a subset of the training inputs/features
  - All trees can be trained in parallel
  - (Figure: an RF with 3 simple trees; the majority prediction will be the final prediction)
- XGBoost is another popular ensemble of simple trees
  - Based on the idea of "boosting" (will study boosting later)
  - Sequentially trains a set of trees, with each tree correcting the errors of the previous ones
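To make the majority-vote idea concrete, here is a minimal scikit-learn sketch (not from the slides; the toy data and all hyperparameter values are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up toy data: 200 inputs with 5 features, binary labels
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A small forest of shallow trees; each tree is trained on a bootstrap
# sample of the inputs and a random subset of the features, and the
# trees can be fit independently of each other (in parallel)
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
rf.fit(X, y)

# The forest combines the individual trees' predictions (an averaged /
# majority vote over the trees) to produce the final label
print(rf.predict(X[:5]))
```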
6

Linear Models and Learning via Optimization

7
Linear Models
- Suppose we want to learn to map inputs x (a vector of D features) to real-valued outputs y
- Linear model: assume the output to be a linear weighted combination of the input features

  $$ y \approx \mathbf{w}^\top \mathbf{x} = \sum_{d=1}^{D} w_d x_d $$

  - This defines a linear model with parameters given by a "weight vector" $\mathbf{w} = [w_1, \dots, w_D]^\top$
  - Each of these weights has a simple interpretation: $w_d$ is the "weight" or contribution of the d-th feature in making this prediction
  - The "optimal" weights are unknown and have to be learned by solving an optimization problem, using some training data
- This simple model can be used for linear regression
- This simple model can also be used as a "building block" for more complex models, e.g.,
  - Classification (binary/multiclass/multi-output/multi-label) and various other ML/deep learning models
  - Unsupervised learning problems (e.g., dimensionality reduction models)
8
Linear Regression
- Given: training data with N input-output pairs $(\mathbf{x}_n, y_n)$, $n = 1, \dots, N$
- Goal: learn a model to predict the output for new test inputs
- Assume the function that approximates the I/O relationship to be a linear model:

  $$ y_n \approx f(\mathbf{x}_n) = \mathbf{w}^\top \mathbf{x}_n \qquad (n = 1, 2, \dots, N) $$

  - Can also write all of them compactly using matrix-vector notation as $\mathbf{y} \approx \mathbf{X}\mathbf{w}$, with $\mathbf{X}$ the N x D matrix of inputs and $\mathbf{y}$ the N x 1 vector of outputs
- Let's write the total error or "loss" of this model over the training data as

  $$ L(\mathbf{w}) = \sum_{n=1}^{N} \ell\big(y_n, f(\mathbf{x}_n)\big) $$

  where $\ell(y_n, f(\mathbf{x}_n))$ measures the prediction error or "loss" or "deviation" of the model on a single training input
- The goal of learning is to find the w that minimizes this loss, and also does well on test data
- Unlike models like KNN and DT, here we have an explicit, problem-specific objective (loss function) that we wish to optimize
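As a concrete, illustrative sketch (the data and the candidate weight vector are made up, and the squared error is used as an example loss; loss choices are discussed on the next slide), the model's predictions and its total loss over the training set can be computed as:

```python
import numpy as np

# Made-up training data: N=100 inputs with D=3 features, real-valued outputs
rng = np.random.RandomState(0)
X = rng.randn(100, 3)                    # N x D matrix of inputs
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.randn(100)    # outputs with a little noise

# Predictions of a linear model with some candidate weight vector w
w = np.array([1.0, -1.0, 0.0])
y_hat = X @ w                            # f(x_n) = w^T x_n for all n at once

# Total squared loss over the training data: sum_n (y_n - w^T x_n)^2
loss = np.sum((y - y_hat) ** 2)
print(loss)
```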
9
Linear Regression: Pictorially
- Linear regression is like fitting a line or (hyper)plane to a set of points
- (Figure: left panel, a line fit to data with a single original feature, input x vs. output y; right panel, a plane fit after mapping the input to two features $[z_1, z_2]^\top = \phi(x)$)
- What if a line/plane doesn't model the input-output relationship very well, e.g., if the relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases?
  - No. We can even fit a curve using a linear model, after suitably transforming the inputs
  - The transformation can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later
- The line/plane must also predict the outputs of the unseen (test) inputs well
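A minimal numpy sketch of the idea above: fit a curve with a model that is still linear in w by first transforming the single input into a few polynomial features (the data and the choice of transformation are made up for illustration):

```python
import numpy as np

# Made-up 1-D data whose input-output relationship is nonlinear
rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.randn(50)

# Predefined transformation: phi(x) = [x, x^2, x^3]
# (could instead be learned, e.g., by a neural network feature extractor)
Phi = np.column_stack([x, x**2, x**3])

# The model is linear in w, yet fits a curve in the original input space
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
y_hat = Phi @ w
print(np.sum((y - y_hat) ** 2))   # training squared loss of the fitted curve
```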
10
Loss Functions for Regression
- Many possible loss functions for regression problems
  - The choice of loss function usually depends on the nature of the data; also, some loss functions result in an easier optimization problem than others
- Squared loss: $\ell = (y_n - f(\mathbf{x}_n))^2$
  - Very commonly used for regression; leads to an easy-to-solve optimization problem
- Absolute loss: $\ell = |y_n - f(\mathbf{x}_n)|$
  - Grows more slowly than the squared loss; thus better suited when the data has some outliers (inputs on which the model makes large errors)
- Huber loss: squared loss for small errors (say up to $\delta$), absolute loss for larger errors
  - Good for data with outliers
- $\epsilon$-insensitive loss (a.k.a. Vapnik loss): $\ell = \max\big(0,\; |y_n - f(\mathbf{x}_n)| - \epsilon\big)$
  - Zero loss for small errors (say up to $\epsilon$); absolute loss for larger errors
  - Note: can also use squared loss instead of absolute loss for the larger errors
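An illustrative sketch of these losses for a single prediction (not from the slides; the function names and the default δ and ε values are made up, and the Huber constants are chosen just so the two pieces match at δ):

```python
import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    # Squared for small errors (|e| <= delta), linear (absolute) beyond that
    e = np.abs(y - f)
    return np.where(e <= delta, e ** 2, 2 * delta * e - delta ** 2)

def eps_insensitive_loss(y, f, eps=0.1):
    # Zero loss inside the epsilon tube, absolute loss beyond it
    return np.maximum(0.0, np.abs(y - f) - eps)

print(squared_loss(3.0, 2.5), huber_loss(3.0, 0.0), eps_insensitive_loss(3.0, 2.95))
```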
11
Minimizing Loss Func using First-Order Optimality
- Use basic calculus to find the minima
  - Called "first order" since only the gradient is used, and the gradient provides first-order information about the function being optimized
  - This approach works only for very simple problems where the objective is convex and there are no constraints on the values w can take
- First-order optimality: the gradient must be equal to zero at the optima

  $$ \nabla_{\mathbf{w}} L(\mathbf{w}) = 0 $$

- Sometimes, setting the gradient to zero and solving for w gives a closed-form solution
- If a closed-form solution is not available, the gradient vector can still be used in iterative optimization algos, like gradient descent (GD)
- Note: even if a closed-form solution is possible, GD can sometimes be more efficient
12
Minimizing Loss Func. using Iterative Optimization
- "Iterative" since it requires several steps/iterations to find the optimal solution
- Q: Can I use this approach to solve maximization problems? A: For maximization problems we can use gradient ascent, which moves in the direction of the gradient
- Fact: the gradient gives the direction of steepest change in the function's value
- For convex functions, GD will converge to the global minima; for non-convex functions, good initialization is needed
- Gradient Descent
  - Initialize w as $\mathbf{w}^{(0)}$
  - For iteration $t = 0, 1, 2, \dots$ (or until convergence):
    - Calculate the gradient $\mathbf{g}^{(t)}$ using the current iterate $\mathbf{w}^{(t)}$
    - Set the learning rate $\eta_t$
    - Move in the opposite direction of the gradient:

      $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\, \mathbf{g}^{(t)} $$

- The learning rate is very important and should be set carefully (fixed or chosen adaptively). Will discuss some strategies later
- Sometimes it may be tricky to assess convergence. Will see some methods later
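A minimal gradient descent sketch following the update rule above (illustrative only; the objective, learning rate, and stopping rule are made-up choices):

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, max_iters=1000, tol=1e-6):
    """Generic GD: repeatedly step opposite to the gradient."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_iters):
        g = grad_fn(w)                        # gradient at the current iterate
        w_new = w - lr * g                    # move against the gradient
        if np.linalg.norm(w_new - w) < tol:   # a simple convergence check
            return w_new
        w = w_new
    return w

# Example: minimize the convex function L(w) = ||w - a||^2, whose gradient
# is 2 (w - a); GD should converge to w = a
a = np.array([2.0, -1.0])
w_star = gradient_descent(lambda w: 2 * (w - a), w0=np.zeros(2))
print(w_star)   # approximately [2, -1]
```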
13
Linear Regression with Squared Loss
- In this case, the loss function will be

  $$ L(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 $$

  - In matrix-vector notation, can write it compactly as $L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
- Let us find the w that optimizes (minimizes) the above squared loss
- Let's use first-order optimality
  - This is the "least squares" (LS) problem (Gauss-Legendre, 18th century)
- The LS problem can be solved easily and has a closed-form solution:

  $$ \mathbf{w}_{LS} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} $$

  - Closed-form solutions to ML problems are rare
  - Requires inverting the D x D matrix $\mathbf{X}^\top \mathbf{X}$, which can be expensive. Ways to handle this; will see later
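A minimal numpy sketch of the least-squares closed-form solution (the data is made up; solving the linear system rather than explicitly inverting X^T X is a common numerical choice, not something the slides prescribe):

```python
import numpy as np

# Made-up data: N=500 inputs with D=4 features
rng = np.random.RandomState(0)
X = rng.randn(500, 4)
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.randn(500)

# Closed-form least-squares solution: w = (X^T X)^{-1} X^T y
# Solving (X^T X) w = X^T y avoids forming the inverse explicitly
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)   # should be close to w_true
```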
14
Proof: A bit of calculus/optim. (more on this later)
- We wanted to find the minima of $L(\mathbf{w}) = \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2$
- Let us apply the basic rule of calculus: take the first derivative of $L(\mathbf{w})$ w.r.t. w and set it to zero
  - Chain rule of calculus: the partial derivative of the dot product $\mathbf{w}^\top \mathbf{x}_n$ is taken w.r.t. each element of w; the result of this derivative is $\mathbf{x}_n$, the same size as w
- Using the fact $\frac{\partial\, \mathbf{w}^\top \mathbf{x}_n}{\partial \mathbf{w}} = \mathbf{x}_n$, we get

  $$ \sum_{n=1}^{N} 2\, \mathbf{x}_n \big(y_n - \mathbf{x}_n^\top \mathbf{w}\big) = 0 $$

- To separate w and get a solution, we write the above as

  $$ \sum_{n=1}^{N} y_n \mathbf{x}_n - \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top \mathbf{w} = 0 \;\;\Longleftrightarrow\;\; \mathbf{X}^\top \mathbf{y} - \mathbf{X}^\top \mathbf{X}\, \mathbf{w} = 0 $$

  which gives

  $$ \mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} $$

15
Problem(s) with the Solution!
- We minimized the objective w.r.t. w and got

  $$ \mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} $$

- Problem: the matrix $\mathbf{X}^\top \mathbf{X}$ may not be invertible
  - This may lead to non-unique solutions for w
- Problem: overfitting, since we only minimized the loss defined on the training data
  - Weights may become arbitrarily large to fit the training data perfectly
  - Such weights may, however, perform poorly on the test data
- One solution: minimize a regularized objective $L(\mathbf{w}) + \lambda R(\mathbf{w})$
  - $R(\mathbf{w})$ is called the regularizer and measures the "magnitude" of w
  - $\lambda$ is the regularization hyperparameter; it controls how much we wish to regularize (needs to be tuned via cross-validation)
  - The regularizer will prevent the elements of w from becoming too large
  - Reason: now we are minimizing training error + magnitude of the weight vector
16
Regularized Least Squares (a.k.a. Ridge Regression)
- Recall that the regularized objective is of the form $L(\mathbf{w}) + \lambda R(\mathbf{w})$
- One possible/popular regularizer: the squared Euclidean ($\ell_2$ squared) norm of w

  $$ R(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \mathbf{w}^\top \mathbf{w} $$

- With this regularizer, we have the regularized least squares problem

  $$ \mathbf{w}_{ridge} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} (y_n - \mathbf{w}^\top \mathbf{x}_n)^2 + \lambda\, \mathbf{w}^\top \mathbf{w} $$

- Proceeding just like the LS case, we can find the optimal w, which is given by

  $$ \mathbf{w}_{ridge} = (\mathbf{X}^\top \mathbf{X} + \lambda I_D)^{-1} \mathbf{X}^\top \mathbf{y} $$

- Why is the method called "ridge" regression? Look at the form of the solution: we are adding a small value $\lambda$ to the diagonals of the D x D matrix $\mathbf{X}^\top \mathbf{X}$ (like adding a ridge/mountain to some land)

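A minimal numpy sketch of the ridge closed-form solution (the data and the value of λ are made up; in practice λ would be tuned via cross-validation):

```python
import numpy as np

# Made-up data: N=500 inputs with D=4 features
rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.randn(500)

lam = 0.5                     # regularization hyperparameter (illustrative)
D = X.shape[1]

# Ridge solution: w = (X^T X + lam * I_D)^{-1} X^T y
# Adding lam to the diagonal keeps the matrix invertible and shrinks the weights
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_ridge)
```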
17
Other Ways to Control Overfitting
- Use a regularizer defined by other norms, e.g., the $\ell_1$ norm regularizer

  $$ \|\mathbf{w}\|_1 = \sum_{d=1}^{D} |w_d| $$

  or the $\ell_0$ norm regularizer, which counts the number of nonzeros in w

  $$ \|\mathbf{w}\|_0 = \text{nnz}(\mathbf{w}) $$

  - When should these be used instead of the $\ell_2$ regularizer? Use them if you have a very large number of features, many of which are irrelevant; these regularizers can help in automatic feature selection
  - Using such regularizers gives a sparse weight vector as the solution; "sparse" means many entries will be zero or near zero, so those features will be considered irrelevant by the model and will not influence the prediction (see the short sketch after this list)
  - Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later)
- Use non-regularization based approaches
  - Early stopping (stopping training as soon as we have a decent validation-set accuracy)
  - Dropout (in each iteration, don't update some of the weights)
  - Injecting noise in the inputs
  - All of these are very popular ways to control overfitting in deep learning models. More on these later when we talk about deep learning
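As a hedged illustration of the sparsity effect (not from the slides), the sketch below uses scikit-learn's Lasso, i.e., $\ell_1$-regularized least squares, alongside ridge; the data, feature counts, and alpha values are all made up:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: 20 features, but only 3 of them are actually relevant
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
w_true = np.zeros(20)
w_true[[0, 5, 10]] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.randn(200)

# The l1 penalty (Lasso) tends to drive irrelevant weights exactly to zero,
# while the l2 (ridge) penalty only shrinks them towards zero
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("nonzero weights with l1:", np.sum(np.abs(lasso.coef_) > 1e-6))
print("nonzero weights with l2:", np.sum(np.abs(ridge.coef_) > 1e-6))
```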
18
Gradient Descent for Linear/Ridge Regression
- Just use the GD algorithm with the gradient expressions we derived
- Iterative updates for linear regression will be of the form

  $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t\, \mathbf{g}^{(t)}, \qquad \mathbf{g}^{(t)} = -2 \sum_{n=1}^{N} \big(y_n - \mathbf{w}^{(t)\top} \mathbf{x}_n\big)\, \mathbf{x}_n $$

  - Unlike the closed-form solution of least squares regression, here we have iterative updates but do not require the expensive inversion of the D x D matrix $\mathbf{X}^\top \mathbf{X}$
- Similar updates for ridge regression as well (with the gradient expression being slightly different; left as an exercise)
- More on iterative optimization methods later
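A minimal numpy sketch of these updates (the data, learning rate, and iteration count are made-up choices; the gradient is computed in matrix form as -2 X^T (y - Xw)):

```python
import numpy as np

# Made-up data
rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.randn(500)

w = np.zeros(X.shape[1])
eta = 1e-3                        # fixed learning rate (illustrative)

for t in range(200):
    g = -2 * X.T @ (y - X @ w)    # gradient of the squared loss
    w = w - eta * g               # move opposite to the gradient
    # for ridge regression, the gradient gets an extra +2*lam*w term

print(w)   # approaches the least-squares solution, without any matrix inversion
```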

