
INDIAN INSTITUTE OF TECHNOLOGY ROORKEE

Linear Classifier
by
Dr. Sanjeev Kumar
Associate Professor
Department of Mathematics
IIT Roorkee, Roorkee-247 667, India
malikdma@gmail.com
malikfma@iitr.ac.in
Linear models
A strong high-bias assumption is linear separability:
 in 2 dimensions, can separate classes by a line

 in higher dimensions, need hyperplanes

A linear model is a model that assumes the data is linearly separable
Linear models
A linear model in n-dimensional space (i.e. n features) is defined by n+1 weights:

In two dimensions, a line:


0 = w_1 f_1 + w_2 f_2 + b    (where b = -a)

In three dimensions, a plane:


0 = w_1 f_1 + w_2 f_2 + w_3 f_3 + b
In m dimensions, a hyperplane:

0 = b + \sum_{j=1}^{m} w_j f_j
Perceptron learning algorithm
repeat until convergence (or for some # of iterations):
    for each training example (f1, f2, …, fm, label):
        prediction = b + \sum_{j=1}^{m} w_j f_j
        if prediction * label ≤ 0:   // they don't agree
            for each wj:
                wj = wj + fj * label
            b = b + label
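As a concrete reference, here is a minimal Python sketch of the update rule above (not the lecture's reference code). It assumes examples arrive as (feature list, label) pairs with labels in {-1, +1}; the function name and iteration count are illustrative only.

```python
# A minimal sketch of the perceptron learning algorithm above.
def train_perceptron(examples, num_iterations=100):
    m = len(examples[0][0])
    w = [0.0] * m                      # one weight per feature
    b = 0.0                            # bias term
    for _ in range(num_iterations):
        for features, label in examples:
            prediction = b + sum(w[j] * features[j] for j in range(m))
            if prediction * label <= 0:          # they don't agree
                for j in range(m):
                    w[j] += features[j] * label
                b += label
    return w, b
```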
Which line will it find?
Which line will it find?

Only guaranteed to find some line that separates the data
Linear models
Perceptron algorithm is one example of a linear classifier

Many, many other algorithms learn a line (i.e. a setting of a linear combination of weights)

Goals:
 Explore a number of linear training algorithms

 Understand why these algorithms work


Perceptron learning algorithm
repeat until convergence (or for some # of iterations):
    for each training example (f1, f2, …, fm, label):
        prediction = b + \sum_{j=1}^{m} w_j f_j
        if prediction * label ≤ 0:   // they don't agree
            for each wj:
                wj = wj + fj * label
            b = b + label
A closer look at why we got it wrong

Suppose w1 = 0, w2 = 1 and the training example is (-1, -1, positive).

prediction = 0 * f1 + 1 * f2 = 0 * (-1) + 1 * (-1) = -1

We'd like this value to be positive since it's a positive example:
w1 didn't contribute, but could have; w2 contributed in the wrong direction.

The update decreases both weights: w1: 0 -> -1, w2: 1 -> 0.

Intuitively these changes make sense. Why change by 1? Any other way of doing it?
Model-based machine learning
1. pick a model
 e.g. a hyperplane, a decision tree,…
 A model is defined by a collection of parameters

What are the parameters for DT? Perceptron?


Model-based machine learning
1. pick a model
 e.g. a hyperplane, a decision tree,…
 A model is defined by a collection of parameters

2. pick a criterion to optimize (aka objective function)

What criterion do decision tree learning and perceptron learning optimize?
Model-based machine learning
1. pick a model
 e.g. a hyperplane, a decision tree,…
 A model is defined by a collection of parameters

2. pick a criterion to optimize (aka objective function)


 e.g. training error

3. develop a learning algorithm


 the algorithm should try to minimize the criterion
 sometimes in a heuristic way (i.e. non-optimally)
 sometimes explicitly
Linear models in general
1. pick a model

0 = b + \sum_{j=1}^{m} w_j f_j

These are the parameters we want to learn

2. pick a criterion to optimize (aka objective function)


Some notation: indicator function

1[x] = \begin{cases} 1 & \text{if } x = \text{True} \\ 0 & \text{if } x = \text{False} \end{cases}

Convenient notation for turning T/F answers into numbers/counts:

drinks\_to\_bring\_for\_class = \sum_{x \in class} 1[x \ge 21]
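A small, purely hypothetical example of the same counting idea in Python (the ages listed are made up):

```python
# Indicator-style counting: add 1 when the condition holds, 0 otherwise.
ages_in_class = [18, 22, 25, 19, 21]                       # hypothetical data
drinks_to_bring = sum(1 if age >= 21 else 0 for age in ages_in_class)
print(drinks_to_bring)                                     # 3
```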
Some notation: dot-product
Sometimes it is convenient to use vector notation

We represent an example f1, f2, …, fm as a single vector, x

Similarly, we can represent the weight vector w1, w2, …, wm as a single vector, w

The dot-product between two vectors a and b is defined as:


a \cdot b = \sum_{j=1}^{m} a_j b_j
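Written out in Python, under the assumption that vectors are plain lists of numbers:

```python
# Dot product as defined above.
def dot(a, b):
    return sum(aj * bj for aj, bj in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))    # 1*4 + 2*5 + 3*6 = 32
```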
Linear models
1. pick a model

0 = b + \sum_{j=1}^{m} w_j f_j

These are the parameters we want to learn

2. pick a criterion to optimize (aka objective function)


\sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]

What does this equation say?


0/1 loss function
\sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]

distance = b + \sum_{j=1}^{m} w_j x_j = w \cdot x + b        distance from the hyperplane

incorrect: y_i (w \cdot x_i + b) \le 0        whether or not the prediction and label agree

0/1 loss = \sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]        total number of mistakes, aka 0/1 loss
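A sketch of the 0/1 loss computed over a dataset, assuming each example is an (x, y) pair with y in {-1, +1}:

```python
# Count the number of mistakes (0/1 loss) made by a linear model (w, b).
def zero_one_loss(w, b, data):
    total = 0
    for x, y in data:
        distance = b + sum(wj * xj for wj, xj in zip(w, x))   # w·x + b
        total += 1 if y * distance <= 0 else 0                # indicator of a mistake
    return total
```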
Model-based machine learning
1. pick a model

0 = b + \sum_{j=1}^{m} w_j f_j

2. pick a criterion to optimize (aka objective function)


\sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]

3. develop a learning algorithm


\mathrm{argmin}_{w,b} \sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]        Find w and b that minimize the 0/1 loss
Minimizing 0/1 loss

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]        Find w and b that minimize the 0/1 loss

How do we do this?
How do we minimize a function?
Why is it hard for this function?
Minimizing 0/1 in one dimension
\sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]

(figure: 0/1 loss as a function of w)

Each time we change w so that the example becomes right/wrong, the loss will decrease/increase
Minimizing 0/1 over all w
\sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]

(figure: loss surface over the weights)

Each new feature we add (i.e. each new weight) adds another dimension to this space!
Minimizing 0/1 loss

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} 1[y_i (w \cdot x_i + b) \le 0]        Find w and b that minimize the 0/1 loss

This turns out to be hard (in fact, NP-hard)


Challenge:
- small changes in any w can have large changes in
the loss (the change isn’t continuous)
- there can be many, many local minima
- at any given point, we don’t have much information
to direct us towards any minima
More manageable loss functions
(figure: loss as a function of w)
What property/properties do we want from our loss function?
More manageable loss functions

(figure: a smooth, convex loss as a function of w)

- Ideally, continuous (i.e. differentiable) so we get an indication of the direction of minimization
- Only one minimum
Convex functions
Convex functions look something like:

One definition: The line segment between any two points on the function is above the function
Surrogate loss functions
For many applications, we really would like to minimize the
0/1 loss

A surrogate loss function is a loss function that provides an upper bound on the actual loss function (in this case, 0/1)

We’d like to identify convex surrogate loss functions to make them easier to minimize

Key to a loss function is how it scores the difference between the actual label y and the predicted label y’
Surrogate loss functions

0/1 loss: l(y, y') = 1[yy' \le 0]

Ideas?
Some function that is a proxy for
error, but is continuous and convex
Surrogate loss functions

0/1 loss: l(y, y') = 1[yy' \le 0]

Hinge: l(y, y') = \max(0, 1 - yy')

Exponential: l(y, y') = \exp(-yy')

Squared loss: l(y, y') = (y - y')^2

Why do these work? What do they penalize?


Surrogate loss functions
0/1 loss: l(y, y') = 1[yy' \le 0]        Hinge: l(y, y') = \max(0, 1 - yy')
Squared loss: l(y, y') = (y - y')^2        Exponential: l(y, y') = \exp(-yy')
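The four losses written as small Python functions; here y is the true label in {-1, +1} and y_prime is the raw prediction w·x + b (the names are illustrative):

```python
import math

def zero_one_loss(y, y_prime):
    return 1 if y * y_prime <= 0 else 0

def hinge_loss(y, y_prime):
    return max(0.0, 1.0 - y * y_prime)

def exponential_loss(y, y_prime):
    return math.exp(-y * y_prime)

def squared_loss(y, y_prime):
    return (y - y_prime) ** 2
```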
Model-based machine learning
1. pick a model

0 = b + \sum_{j=1}^{m} w_j f_j

2. pick a criterion to optimize (aka objective function)


\sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b))        use a convex surrogate loss function

3. develop a learning algorithm


\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b))        Find w and b that minimize the surrogate loss
Finding the minimum

You’re blindfolded, but you can see out of the bottom of the
blindfold to the ground right by your feet. I drop you off
somewhere and tell you that you’re in a convex shaped valley
and escape is at the bottom/minimum. How do you get out?
Finding the minimum

(figure: convex loss as a function of w)

How do we do this for a function?


One approach: gradient descent

Partial derivatives give us the slope (i.e. direction to move) in that dimension

(figure: loss as a function of w)
One approach: gradient descent

Partial derivatives give us the slope (i.e. direction to move) in that dimension

(figure: loss as a function of w)

Approach:
 pick a starting point (w)
 repeat:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)
One approach: gradient descent

Partial derivatives give us the slope (i.e. direction to move) in that dimension

Approach:
 pick a starting point (w)
 repeat:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)
Gradient descent

 pick a starting point (w)


 repeat until loss doesn’t decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

w_j = w_j - \eta \frac{d}{dw_j} \mathrm{loss}(w)

What does this do?


Gradient descent

 pick a starting point (w)


 repeat until loss doesn’t decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

w_j = w_j - \eta \frac{d}{dw_j} \mathrm{loss}(w)

\eta: learning rate (how much we want to move in the error direction; often this will change over time)
Some maths

\frac{d}{dw_j} \mathrm{loss} = \frac{d}{dw_j} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b))

= \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) \, \frac{d}{dw_j} \big( -y_i (w \cdot x_i + b) \big)

= \sum_{i=1}^{n} -y_i x_{ij} \exp(-y_i (w \cdot x_i + b))
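One way to sanity-check this derivative is a finite-difference comparison; the data, weights, and index below are made-up values:

```python
import math

data = [([1.0, 2.0], 1), ([-1.5, 0.5], -1)]        # hypothetical (x, y) pairs
w, b, j, eps = [0.2, -0.3], 0.1, 0, 1e-6

def loss(w, b):
    return sum(math.exp(-y * (sum(wk * xk for wk, xk in zip(w, x)) + b))
               for x, y in data)

def analytic_dloss_dwj(w, b, j):
    return sum(-y * x[j] * math.exp(-y * (sum(wk * xk for wk, xk in zip(w, x)) + b))
               for x, y in data)

w_plus = list(w)
w_plus[j] += eps
numeric = (loss(w_plus, b) - loss(w, b)) / eps
print(analytic_dloss_dwj(w, b, j), numeric)   # the two values should agree closely
```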
Gradient descent

 pick a starting point (w)


 repeat until loss doesn’t decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

w_j = w_j + \eta \sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b))

What is this doing?
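A sketch of the full batch procedure implied by this update, with made-up learning rate and step count; the bias update is not shown on the slide but follows the same derivation:

```python
import math

# Batch gradient descent for the exponential surrogate loss (a sketch).
def gradient_descent_exponential(data, eta=0.1, num_steps=100):
    m = len(data[0][0])
    w, b = [0.0] * m, 0.0
    for _ in range(num_steps):
        grad_w, grad_b = [0.0] * m, 0.0
        for x, y in data:
            c = math.exp(-y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
            for j in range(m):
                grad_w[j] += y * x[j] * c
            grad_b += y * c
        for j in range(m):
            w[j] += eta * grad_w[j]        # move towards decreasing loss
        b += eta * grad_b
    return w, b
```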


Exponential update rule
w_j = w_j + \eta \sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b))

for each example x_i:

w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b))

Does this look familiar?


Perceptron learning algorithm!
repeat until convergence (or for some # of iterations):
    for each training example (f1, f2, …, fm, label):
        prediction = b + \sum_{j=1}^{m} w_j f_j
        if prediction * label ≤ 0:   // they don't agree
            for each wj:
                wj = wj + fj * label
            b = b + label

w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b))

or

w_j = w_j + x_{ij} y_i c    where c = \eta \exp(-y_i (w \cdot x_i + b))
The constant
c = \eta \exp(-y_i (w \cdot x_i + b))

\eta: learning rate    y_i: label    (w \cdot x_i + b): prediction

When is this large/small?


The constant
c = \eta \exp(-y_i (w \cdot x_i + b))

y_i: label    (w \cdot x_i + b): prediction

If they’re the same sign, as the predicted value gets larger the update gets smaller

If they’re different, the more different they are, the bigger the update
Perceptron learning algorithm!
repeat until convergence (or for some # of iterations):
    for each training example (f1, f2, …, fm, label):
        prediction = b + \sum_{j=1}^{m} w_j f_j
        if prediction * label ≤ 0:   // they don't agree
            for each wj:
                wj = wj + fj * label
            b = b + label

Note: for gradient descent, we always update

w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b))

or

w_j = w_j + x_{ij} y_i c    where c = \eta \exp(-y_i (w \cdot x_i + b))
Summary
Model-based machine learning:
 define a model, objective function (i.e. loss function),
minimization algorithm

Gradient descent minimization algorithm:
 requires that our loss function is convex
 makes small updates towards lower losses

Perceptron learning algorithm:


 gradient descent
 exponential loss function (modulo a learning rate)
Regularization
Introduction
Ill-posed Problems

 In finite domains, most inverse problems are referred to as ill-posed problems.

 The mathematical term well-posed problem stems from a definition given by Jacques Hadamard. He believed that mathematical models of physical phenomena should have the properties that:
•A solution exists
•The solution is unique
•The solution's behavior changes continuously with the initial conditions.
Ill-conditioned Problems

 Problems that are not well-posed in the sense of Hadamard are termed ill-posed. Inverse problems are often ill-posed.

 Even if a problem is well-posed, it may still be ill-conditioned, meaning that a small error in the initial data can result in much larger errors in the answers. An ill-conditioned problem is indicated by a large condition number.
Example: Curve Fitting
Some Examples: Linear System of Eqn

Solve [A]{x} = {b}


Ill-conditioning
• A system of equations is singular if det(A) = 0
• If a system of equations is nearly singular, it is ill-conditioned.

Systems which are ill-conditioned are extremely sensitive to small changes in the coefficients of [A] and {b}. These systems are inherently sensitive to round-off errors.
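A small numerical illustration; the nearly singular matrix below is a made-up example, and the snippet assumes NumPy is available:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])            # nearly singular: det(A) = 0.0001
b = np.array([2.0, 2.0001])

print(np.linalg.cond(A))                  # large condition number (~4e4)
print(np.linalg.solve(A, b))              # [1, 1]

# A tiny change in b produces a very different solution.
b_perturbed = np.array([2.0, 2.0002])
print(np.linalg.solve(A, b_perturbed))    # roughly [0, 2]
```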
One concern

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b))

(figure: loss as a function of w)

What is this calculated on?


Is this what we want to optimize?
Perceptron learning algorithm!
repeat until convergence (or for some # of iterations):
    for each training example (f1, f2, …, fm, label):
        prediction = b + \sum_{j=1}^{m} w_j f_j
        if prediction * label ≤ 0:   // they don't agree
            for each wj:
                wj = wj + fj * label
            b = b + label

Note: for gradient descent, we always update

w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b))

or

w_j = w_j + x_{ij} y_i c    where c = \eta \exp(-y_i (w \cdot x_i + b))
One concern

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b))

(figure: loss as a function of w)

We’re calculating this on the training set

We still need to be careful about overfitting!

The min over w, b on the training set is generally NOT the min for the test set
How did we deal with this for the perceptron algorithm?
Overfitting revisited: regularization
A regularizer is an additional criterion added to the loss function to make sure that we don’t overfit

It’s called a regularizer since it tries to keep the parameters more normal/regular

It is a bias on the model that forces the learning to prefer certain types of weights over others

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \mathrm{loss}(yy') + \lambda\, \mathrm{regularizer}(w, b)
Regularizers

0 = b + \sum_{j=1}^{m} w_j f_j

Should we allow all possible weights?

Any preferences?

What makes for a “simpler” model for a linear model?
Regularizers

0 = b + \sum_{j=1}^{m} w_j f_j

Generally, we don’t want huge weights

If weights are large, a small change in a feature can result in a large change in the prediction

Also gives too much weight to any one feature

Might also prefer weights of 0 for features that aren’t useful

How do we encourage small weights? or penalize large weights?


Regularizers

0 = b + \sum_{j=1}^{m} w_j f_j

How do we encourage small weights? or penalize large weights?

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \mathrm{loss}(yy') + \lambda\, \mathrm{regularizer}(w, b)
Common regularizers

sum of the weights:    r(w, b) = \sum_{w_j} |w_j|

sum of the squared weights:    r(w, b) = \sum_{w_j} w_j^2

What’s the difference between these?


Common regularizers

sum of the weights:    r(w, b) = \sum_{w_j} |w_j|

sum of the squared weights:    r(w, b) = \sum_{w_j} w_j^2

Squared weights penalizes large values more


Sum of weights will penalize small values more
p-norm

sum of the weights (1-norm):    r(w, b) = \sum_{w_j} |w_j|

sum of the squared weights (2-norm):    r(w, b) = \sum_{w_j} w_j^2

p-norm:    r(w, b) = \sqrt[p]{\sum_{w_j} |w_j|^p} = \|w\|_p

Smaller values of p (p < 2) encourage sparser vectors

Larger values of p discourage large weights more
p-norms visualized

(figure: curves in the (w1, w2) plane where the p-norm penalty equals 1)

For example, if w1 = 0.5, the w2 that makes the penalty equal 1 is:

p     w2
1     0.5
1.5   0.75
2     0.87
3     0.95
∞     1
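The table values can be recomputed directly; the snippet below solves |w1|^p + |w2|^p = 1 for w2 when w1 = 0.5:

```python
# Recompute the w2 values in the table above (up to rounding).
w1 = 0.5
for p in [1, 1.5, 2, 3]:
    w2 = (1 - abs(w1) ** p) ** (1.0 / p)
    print(p, round(w2, 2))    # 0.5, 0.75, 0.87, 0.96 (close to the table values)
# For p = infinity the constraint is max(|w1|, |w2|) = 1, so w2 = 1.
```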
p-norms visualized

All p-norms penalize larger weights.

p < 2 tends to create sparse weight vectors (i.e. lots of 0 weights).

p > 2 tends to like similar weights.
Model-based machine learning
1. pick a model

0 = b + \sum_{j=1}^{m} w_j f_j

2. pick a criterion to optimize (aka objective function)


\sum_{i=1}^{n} \mathrm{loss}(yy') + \lambda\, \mathrm{regularizer}(w)

3. develop a learning algorithm


\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \mathrm{loss}(yy') + \lambda\, \mathrm{regularizer}(w)        Find w and b that minimize this objective
Minimizing with a regularizer
We know how to solve convex minimization problems using
gradient descent:
\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \mathrm{loss}(yy')

If we can ensure that the loss + regularizer is convex, then we could still use gradient descent:

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \mathrm{loss}(yy') + \lambda\, \mathrm{regularizer}(w)

make convex
Convexity revisited

One definition: The line segment between any two points on the function is above the function.

Mathematically, f is convex if for all x1, x2:

f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad \forall\; 0 < t < 1

left-hand side: the value of the function at some point between x1 and x2
right-hand side: the value at some point on the line segment between x1 and x2
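A quick numerical spot-check of this inequality for a known convex function, f(x) = x^2 (purely illustrative):

```python
import random

def f(x):
    return x ** 2

for _ in range(1000):
    x1, x2, t = random.uniform(-5, 5), random.uniform(-5, 5), random.uniform(0, 1)
    # Convexity: the function value on the chord point is below the chord.
    assert f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-9
```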
Adding convex functions
Claim: If f and g are convex functions then so is the
function z=f+g

Prove:
z(t x_1 + (1-t) x_2) \le t z(x_1) + (1-t) z(x_2) \qquad \forall\; 0 < t < 1

Mathematically, f is convex if for all x1, x2:
f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) \qquad \forall\; 0 < t < 1

Adding convex functions
By definition of the sum of two functions:
z(t x_1 + (1-t) x_2) = f(t x_1 + (1-t) x_2) + g(t x_1 + (1-t) x_2)
t z(x_1) + (1-t) z(x_2) = t f(x_1) + t g(x_1) + (1-t) f(x_2) + (1-t) g(x_2)
                        = t f(x_1) + (1-t) f(x_2) + t g(x_1) + (1-t) g(x_2)

Then, given that:
f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)
g(t x_1 + (1-t) x_2) \le t g(x_1) + (1-t) g(x_2)

We know:
f(t x_1 + (1-t) x_2) + g(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2) + t g(x_1) + (1-t) g(x_2)

So: z(t x_1 + (1-t) x_2) \le t z(x_1) + (1-t) z(x_2)
Minimizing with a regularizer
We know how to solve convex minimization problems using
gradient descent:
\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \mathrm{loss}(yy')

If we can ensure that the loss + regularizer is convex, then we could still use gradient descent:

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \mathrm{loss}(yy') + \lambda\, \mathrm{regularizer}(w)

convex as long as both the loss and the regularizer are convex


p-norms are convex

r(w, b) = \sqrt[p]{\sum_{w_j} |w_j|^p} = \|w\|_p

p-norms are convex for p >= 1


Model-based machine learning
1. pick a model

0 = b + \sum_{j=1}^{m} w_j f_j

2. pick a criterion to optimize (aka objective function)


\sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \frac{\lambda}{2} \|w\|^2

3. develop a learning algorithm


\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \frac{\lambda}{2} \|w\|^2        Find w and b that minimize this objective


Our optimization criterion
\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \frac{\lambda}{2} \|w\|^2

Loss function: penalizes examples where the prediction is different than the label.
Regularizer: penalizes large weights.

Key: this function is convex, allowing us to use gradient descent


Gradient descent

 pick a starting point (w)


 repeat until loss doesn’t decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

w_j = w_j - \eta \frac{d}{dw_j} \big( \mathrm{loss}(w) + \mathrm{regularizer}(w, b) \big)

\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \frac{\lambda}{2} \|w\|^2
Some more maths

\frac{d}{dw_j} \mathrm{objective} = \frac{d}{dw_j} \left( \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \frac{\lambda}{2} \|w\|^2 \right)

(some math happens)

= -\sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) + \lambda w_j
Gradient descent

 pick a starting point (w)


 repeat until loss doesn’t decrease in all dimensions:
 pick a dimension
 move a small amount in that dimension towards decreasing loss (using the derivative)

w_j = w_j - \eta \frac{d}{dw_j} \big( \mathrm{loss}(w) + \mathrm{regularizer}(w, b) \big)

w_j = w_j + \eta \sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda w_j
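One batch step of this regularized update as a Python sketch; eta and lam are made-up hyperparameter values, and the bias is left unregularized, matching the update above:

```python
import math

# One batch gradient-descent step with the exponential loss + L2 regularizer.
def regularized_step(w, b, data, eta=0.1, lam=0.01):
    m = len(w)
    grad_w, grad_b = [0.0] * m, 0.0
    for x, y in data:
        c = math.exp(-y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
        for j in range(m):
            grad_w[j] += y * x[j] * c
        grad_b += y * c
    new_w = [w[j] + eta * grad_w[j] - eta * lam * w[j] for j in range(m)]
    new_b = b + eta * grad_b           # the bias is not regularized here
    return new_w, new_b
```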
The update
w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda w_j

\eta: learning rate    y_i x_{ij}: direction to update    \exp(\cdot): constant, how far from wrong    \eta \lambda w_j: regularization

What effect does the regularizer have?


The update
w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda w_j

\eta: learning rate    y_i x_{ij}: direction to update    \exp(\cdot): constant, how far from wrong    \eta \lambda w_j: regularization

If w_j is positive, the regularization term reduces w_j; if w_j is negative, it increases w_j. Either way it moves w_j towards 0.
L1 regularization
\mathrm{argmin}_{w,b} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \lambda \|w\|_1

\frac{d}{dw_j} \mathrm{objective} = \frac{d}{dw_j} \left( \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \lambda \|w\|_1 \right)

= -\sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) + \lambda\, \mathrm{sign}(w_j)
L1 regularization
w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda\, \mathrm{sign}(w_j)

\eta: learning rate    y_i x_{ij}: direction to update    \exp(\cdot): constant, how far from wrong    \eta \lambda\, \mathrm{sign}(w_j): regularization

What effect does the regularizer have?


L1 regularization
w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda\, \mathrm{sign}(w_j)

\eta: learning rate    y_i x_{ij}: direction to update    \exp(\cdot): constant, how far from wrong    \eta \lambda\, \mathrm{sign}(w_j): regularization

If w_j is positive, the regularization term reduces it by a constant; if w_j is negative, it increases it by a constant, regardless of magnitude. Either way it moves w_j towards 0.
Regularization with p-norms
L1: w_j = w_j + \eta\, (\mathrm{loss\_correction} - \lambda\, \mathrm{sign}(w_j))

L2: w_j = w_j + \eta\, (\mathrm{loss\_correction} - \lambda w_j)

Lp: w_j = w_j + \eta\, (\mathrm{loss\_correction} - \lambda c\, w_j^{p-1})

How do higher order norms affect the weights?
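The three updates for a single weight as a Python sketch; loss_correction stands for the y_i x_ij exp(-y_i(w·x_i + b)) term, and eta, lam, c are made-up constants:

```python
import math

def update_l1(wj, loss_correction, eta=0.1, lam=0.01):
    return wj + eta * (loss_correction - lam * math.copysign(1.0, wj))

def update_l2(wj, loss_correction, eta=0.1, lam=0.01):
    return wj + eta * (loss_correction - lam * wj)

def update_lp(wj, loss_correction, p, eta=0.1, lam=0.01, c=1.0):
    # sign(wj) * |wj|**(p-1) so the penalty always pushes wj towards 0;
    # p = 1 recovers the L1 update and p = 2 recovers the L2 update.
    return wj + eta * (loss_correction - lam * c * math.copysign(abs(wj) ** (p - 1), wj))
```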


Regularizers summarized
L1 is popular because it tends to result in sparse solutions (i.e. lots of zero weights).
However, it is not differentiable, so it only works for gradient descent solvers.

L2 is also popular because for some loss functions it can be solved directly (no gradient descent required, though iterative solvers are often still used).

Lp is less popular since these norms don’t tend to shrink the weights enough.
The other loss functions
Without regularization, the generic update is:
w_j = w_j + \eta\, y_i x_{ij} c
where
c = \exp(-y_i (w \cdot x_i + b))    exponential

c = 1[yy' < 1]    hinge loss

w_j = w_j + \eta\, (y_i - (w \cdot x_i + b))\, x_{ij}    squared error
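The per-example update constants above as small Python functions; y is the label in {-1, +1} and y_prime is the raw prediction w·x + b (names are illustrative):

```python
import math

def c_exponential(y, y_prime):
    return math.exp(-y * y_prime)

def c_hinge(y, y_prime):
    return 1.0 if y * y_prime < 1 else 0.0

# For squared error the update uses the residual directly:
# w_j = w_j + eta * (y - y_prime) * x_ij
def squared_error_step(wj, xij, y, y_prime, eta=0.1):
    return wj + eta * (y - y_prime) * xij
```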
