You are on page 1of 94

Background: Generative and

Discriminative Classifiers
Logistic
Regression
Logistic Regression

Important analytic tool in natural and


social sciences
Baseline supervised machine learning
tool for classification
Is also the foundation of neural
networks
Generative and Discriminative Classifiers
Naive Bayes is a generative classifier

by contrast:

Logistic regression is a discriminative


classifier
Generative and Discriminative Classifiers
Suppose we're distinguishing cat from dog images

imagenet imagenet
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?

Also build a model for dog images

Now given a new image:


Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats

Oh look, dogs have collars!


Let's ignore everything else
ISTIC R EGRESSION
Finding the correct class c from a document d in
Generative vs Discriminative Classifiers
ormally, recall that the naive Bayes assigns a class c to a document d no
computing P(c|d) but by computing a likelihood and a prior
Naive Bayes
ISTIC R EGRESSION
likelihood prior
z }| { z}|{
ĉ =the
rmally, recall that argmax P(d|c)assigns
naive Bayes P(c) a class c to a document
(5.1d
c2C
computing P(c|d) but by computing a likelihood and a prior
Logistic
ive model like Regression
naive Bayes makes use of this likelihood term, which
documentprior
how to generate the features oflikelihood
aposterior if we knew it was of class c.
z }| { z}|{
trast a discriminative model in this text categorization scenario attempts
ĉ = argmax P(d|c)
P(c|d) P(c) (
compute P(c|d). Perhapsc2Cit will learn to assign high weight to documen
at directly improve its ability to discriminate between possible classes
7
Components of a probabilistic machine learning
classifier
Given m input/output pairs (x(i),y(i)):

1. A feature representation of the input. For each input


observation x(i), a vector of features [x1, x2, ... , xn]. Feature j
for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes 𝑦, ! the estimated
class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
The two phases of logistic regression

Training: we learn weights w and b using stochastic


gradient descent and cross-entropy loss.

Test: Given a test example x we compute p(y|x)


using learned weights w and b, and return
whichever label (y = 1 or y = 0) is higher probability
Background: Generative and
Discriminative Classifiers
Logistic
Regression
Classification in Logistic Regression

Logistic
Regression
Classification Reminder

Positive/negative sentiment
Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition

Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}

Output: a predicted class 𝑦! Î C


Binary Classification in Logistic Regression

Given a series of input/output pairs:


◦ (x(i), y(i))
For each observation x (i)
◦ We represent x(i)
by a feature vector [x1, x2,…, xn]
(i)
◦ We compute an output: a predicted class 𝑦! Î {0,1}
Features in logistic regression
• For feature xi, weight wi tells is how important is xi
• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2
Logistic Regression for one observation x

Input observation: vector x = [x1, x2,…, xn]


Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class 𝑦! Î {0,1}

(multinomial logistic regression: 𝑦! Î {0, 1, 2, 3, 4})


her real number that’s n
added
! to the weighted inputs.
ecision X
Howon to a
do
z = test instance—
classification
wi xi + b after we’ve learned the
ssifier first multiplies each x i by its weight wi , sums up th
For each featurei=1 xi, weight wi tells us importance of xi
ds the◦ bias term
(Plus we'll b.a bias
have Theb) resulting single number z exp
kthe
we’ll represent
evidence for such
the sums using the dot product notatio
class.
We'll sum up all the weighted features and the bias
ot product of two vectors a and ! b, written as a · b is the s
orresponding elementsX n
of each vector. Thus the followin
to Eq. 5.2: z = wi xi + b
i=1
z = w·x+b
ook we’ll represent such sums using the dot
If this sum is high, we say y=1; if low, then y=0product no
But we want a probabilistic classifier

We need to formalize “sum is high”.


We’d like a principled classifier that gives us a
probability, just like Naive Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
ear around 0 but outlier values get squashed toward 0 or 1.
to Eq.
The5.2:
problem: z isn't a probability, it's just a
number!
e a probability, we’ll pass z through the sigmoid func
ction (named becausez =it w ·x+
looks b an s) is also called t
like
es logistic regression its name. The sigmoid has the fol
ically in Fig.
Solution: use5.1:
a function of z that goes from 0 to 1
g in Eq. 5.3 forces z to be a legal probabil
fact, since weights are1 real-valued,
1 the outp
y = s (z) = =
om • to •. 1+e z 1 + exp ( z)
rest of the book, we’ll use the notation exp(x) to mean e
r of advantages; it takes a real-valued number and maps
rangeswe’ll
bility, • to
frompass •.
z through the sigmoid function, s (z). T
The very useful sigmoid or logistic function
med because it looks like an s) is also called the logistic fu
regression its name. The sigmoid has the following equati
Fig. 5.1:

1
y = s (z) = z
(5
1+e

20
Idea of logistic regression

We’ll compute w∙x+b


And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
wo cases,
wo cases, p(y =probabilities
Making 1) and p(y =
= 0),
0), sum
withsum to
to 1.
1. We
sigmoids Wecan
cando
dothis
thisasasfoll
fol

P(y = 1) = s (w · xx +
+b)
b)
11
=
1 + exp
exp(( (w (w··xx+
+b))
b))
P(y =
P(y = 0) = 11 s
0) = (w··xx+
s(w +b)b)
11
= 1
= 1 1 + exp ( (w · x + b))
1 + exp ( (w · x + b))
exp
exp (
( (w
(w ··xx++ b))
b))
=
= 1 + exp ( (w · x + b))
1 + exp ( (w · x + b))
P(y = 0) = 1 s (w · x + b)
TheP(ysigmoid = s (w · xhas
= 1) function + b) the property
By the way: 1 = 1
1
= s (x) = s ( x) 1 + exp ( (w · x
1 + exp ( (w · x1+ b))
exp ( (w · x + b))
=
so we
P(ycould
= 0) also
= 1have · x + b) P(y = 0) =as s ( (w ·1x+
expressed
s (w b)).( (w · x + b
+exp
Now we have an algorithm 1 that given an instance x computes the
P(y = 1|x). How = 1 do The we sigmoid
make a function has For
decision? the property
a test instance x, we sa
1 + exp ( (w · x + b)) Because
probability P(y = exp 1|x)( is(w more than .5, and no otherwise. We call .5
· x + b)) 1 s (x) = s ( x)
boundary: = (5.5)
1 + exp ( (w · x + b))
so we could⇢ also have expressed P(y = 0) as s ( (w · x
tion has the property Now 1 if P(y = 1|x) > 0.5
ŷ =we have an algorithm that given an instance
0 otherwise
P(y = 1|x). How do we make a decision? For a test i
1 s (x) = s ( x) P(y = 1|x) is more than .5, (5.6)
probability and no otherwi
5.1.1 Example:decision
boundary sentiment classification
boundary:
How do we make a decision? For a test instance x, we sa
(y = 1|x) is more
Turning than .5, andinto
a probability no otherwise.
a classifierWe call .5 t


1 if P(y = 1|x) > 0.5
ŷ =
0 otherwise

ample: sentiment classification


0.5 here is called the decision boundary

n example. Suppose we are doing binary sentiment class


text, and we would like to know whether to assign the sen
ranges from • to •.
The probabilistic classifier P(y = 1) = s (w · x + b)
1
=
1 + e (w·x+b)
P(y=1)
P(y = 0) = 1 s (w · x +
1
= 1
1 + e (w·
e (w·x+b)
=
1+e (w·x+b)

Now we have an
wx algorithm
+b that given an instan
algorithm that given an instance x computes the probabil
we make a decision? For a test instance x, we say yes if t
Turning a probability into a classifier
is more than .5, and no otherwise. We call .5 the decisi


1 if P(y = 1|x) > 0.5 if w∙x+b > 0
ŷ =
0 otherwise if w∙x+b ≤ 0

sentiment classification
e. Suppose we are doing binary sentiment classification
Classification in Logistic Regression

Logistic
Regression
Logistic Regression: a text example
on sentiment classification
Logistic
Regression
Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .

29
1 if P(y = 1|x) > 0.5
ŷ x= =2
2 0 otherwise
x3=1
5.1.1 It'sExample:
hokey . Theresentiment
are virtually classification
no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
Let’s have
great an example.
. Another nice Suppose wemusic
touch is the are doing
. I wasbinary sentiment
overcome with theclassification
urge to get offon
movie the
review
couch text,
andand
startwe would. like
dancing to know
It sucked mewhether to assign
in , and it'll do the the
samesentiment
to you . class
+ or to a review document doc. We’ll represent each input observation by the 6
x4=3
x1input
features x1 ...x6 of the =3 shown
x5=0in the following
x6=4.19 table; Fig. 5.2 shows the features
in a sample mini test document.
Figure 5.2 A sample mini test document showing the extracted features in the vector x.
Var Definition Value in Fig. 5.2
x1 these
Given count(positive lexicon)
6 features and 2 doc)
the input review x, P(+|x)3 and P( |x) can be com-
putedx2usingcount(negative
⇢Eq. 5.5: lexicon) 2 doc) 2
1 if “no” 2 doc
x3 = P(Y = 1|x) = s (w · x + b)
p(+|x) 1
0 otherwise
x4 count(1st
⇢ = s
and 2nd pronouns
([2.5, doc)
5.0, 21.2, 3 · [3, 2, 1, 3, 0, 4.19] + 0.1)
0.5, 2.0, 0.7]
1 if “!”=2 doc
s (.833)
x5 0
0 otherwise
= 0.70 (5.6)
x6 log(word count of doc) ln(66) = 4.19 30
x2the following
tures x1 ...x6 of the input shown in count(negative
⇢ table; Fig. lexicon)
5.2 shows 2 doc)
the features
Classifying sentiment
a sample mini test document.
x
for input x 1 if “no” 2 doc
3
Var Definition 0 otherwise
Value in Fig. 5.2
x1 x4 2 doc)
count(positive lexicon) count(1st
⇢ and3 2nd pronouns 2 doc)
x2 count(negative lexicon) 2 doc) 1 if “!” 22doc

1 if “no” 2 doc 5
x
x3 0 otherwise
1
0 otherwise x6 log(word count of doc)
x4 count(1st
⇢ and 2nd pronouns 2 doc) 3
1 if “!” 2 doc
x5 0
0 otherwise
x6 Let’s assume
log(word count of doc) for the moment that
ln(66) = 4.19we’ve alrea
for each of these features, and that the 6 weights
Suppose ware= [2.5, 5.0, 1.2, 0.5, 2.0, 0.7], while b = 0.1. (
et’s assume for the moment that we’ve already learned a real-valued weight for
how the
b = 0.1
weights are learned.) The weight w 1 , for e
ch of these features, and that the 6 weights corresponding to the 6 features are
31
Figure 5.2 1 mini test5document showing
A sample 6 the extracted features in the vector x.

Classifying
Figure 5.2
sentiment for input x
A sample mini test document showing the extracted features in the vector x.
Given these 6 features and the input review x, P(+|x) and P( |x) can be com-
puted usingthese
Given Eq. 5.5:
6 features and the input review x, P(+|x) and P( |x) can be com-
puted using Eq. 5.5:
p(+|x) = P(Y = 1|x) = s (w · x + b)
p(+|x) = P(Y = 1|x) = s ([2.5,
(w · x + 5.0,
b) 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
= s (.833)
([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
= s (.833)
0.70 (5.6)
p( |x) = P(Y = 0|x) = 1 0.70s (w · x + b) (5.6)
p( |x) = P(Y = 0|x) = 0.30 1 s (w · x + b)
= 0.30
Logistic regression is commonly applied to all sorts of NLP tasks, and any property
of the input
Logistic can be aisfeature.
regression commonlyConsider
appliedthetotask
all of period
sorts disambiguation:
of NLP tasks, and any deciding
property
if
of athe
period
input is
canthe
beend of a sentence
a feature. Considerorthepart ofofa period
task word, by classifying each
disambiguation: period
deciding
32
erty of the input can be a feature. Consider the task of period disambiguation:
We can build features for logistic regression for
deciding if a period is the end of a sentence or part of a word, by classifying each
any classification task: period disambiguation
period into one of two classes EOS (end-of-sentence) and not-EOS. We might use
features like x1 below expressing that the current word is lower case and the class
is EOS (perhaps with a positive weight), or that the current word is in our abbrevia-
End ofwith
tions dictionary (“Prof.”) and the class is EOS (perhaps sentence
a negative weight). A
feature canThis ends ainquite
also express a period.
complex combination of properties. For example a
The house
period following at 465
a upper cased word Main St.to is
is a likely new.
be an EOS, but if the word itself is
St. and the previous word is capitalized, then the period is likely part of a shortening
of the word street. Not end

1 if “Case(wi ) = Lower”
x1 =
0 otherwise

1 if “wi 2 AcronymDict”
x2 =
0 otherwise

1 if “wi = St. & Case(wi 1 ) = Cap”
x3 =
0 otherwise 33
, which is just whatinwe
Classification want for
(binary) a probability.
logistic Because
regression: it is
summary
but has a sharp slope toward the ends, it tends to squash outlier
Given:
And it’s differentiable, which as we’ll see in Section 5.8 will
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
e. If we◦apply the sigmoid to
x1= count( "awesome") the sum of the weighted features,
ween 0 and 1. To make it a probability,
◦ x2 = log(number of words in review) we just need to make
s, p(y◦=A 1) and wp(y
vector = 0), sum
of weights to 1.w2,
[w1, We can
…, do
wn]this as follows:
◦ wi for each feature fi

P(y = 1) = s (w · x + b)
1
=
1 + e (w·x+b)
Logistic Regression: a text example
on sentiment classification
Logistic
Regression
Learning: Cross-Entropy Loss

Logistic
Regression
Wait, where did the W’s come from?

Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, 𝑦!
We want to set w and b to minimize the distance between our
estimate 𝑦! (i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss.
37
Learning components

A loss function:
◦ cross-entropy loss

An optimization algorithm:
◦ stochastic gradient descent
The distance between 𝑦! and y

We want to know how far is the classifier output:


𝑦! = σ(w·x+b)

from the true output:


y [= either 0 or 1]

We'll call this difference:


L(𝑦! ,y) = how much 𝑦! differs from the true y
Intuition of negative log likelihood loss
= cross-entropy loss

A case of conditional maximum likelihood


estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
is the negative log likelihood loss, generally called the cross-entrop
derive this loss function, applied to a single observation x. We’d
Deriving cross-entropy loss for a single observation
ghts that maximize the probability of the correct label p(y|x). Sinc
x
two discrete outcomes (1 or 0), this is a Bernoulli distribution, and
Goal: maximize probability of the correct label p(y|x)
he probability p(y|x) that our classifier produces for one observa
Since there
wing (keeping are only
in mind that2 if
discrete outcomes
y=1, Eq. (0 or 1) to
5.9 simplifies weŷ;can
if y=0,
express
s to 1 ŷ): the probability p(y|x) from our classifier (the thing
we want to maximize) as
y 1 y
p(y|x) = ŷ (1 ŷ)
noting:
take the log of both sides. This will turn
if y=1, this simplifies to 𝑦! out to be handy mathema
n’t hurt us; whatever values maximize a probability will also maxim
if y=0, this simplifies to 1- 𝑦!
e probability:
he probability p(y|x) that our classifier produces for one observa
y 1 y
wingDeriving in mind that ifp(y|x)
(keepingcross-entropy loss
y=1, for ŷsingle
= a5.9
Eq. ŷ)
observation
(1simplifies x
to ŷ; if y=0,
s to 1 ŷ):
Now weGoal:
takemaximize
the log ofprobability
both sides.of This
the correct
will turnlabeloutp(y|x)
to be handy m
nd doesn’t hurt us; whatever values y maximize 1 y a probability will also
Maximize: p(y|x) = ŷ (1 ŷ)
og of the probability:
Now take the log of both sides (mathematically handy)
take the log of both sides. This will turn ⇥ y out to be handy
⇤ mathema
1 y
n’t hurt us; whatever log
Maximize: p(y|x)
values = log aŷprobability
maximize (1 ŷ) will also maxim
e probability: = y log ŷ + (1 y) log(1 ŷ)
⇥ y 1 y

Eq. 5.10 log p(y|x)
Whatever
describes values log ŷ log
a logmaximize
likelihood
= that
(1 ŷ)
p(y|x)
should will be
alsomaximized.
maximize In or
p(y|x)
nto loss function (something thatŷ +we(1need
= y log y) to minimize),
log(1 ŷ) we’ll just
Eq. 5.10. The result is the cross-entropy loss LCE :
and doesn’t hurt us; whatever values maximize a probability will also maximiz
Now weNow wethe
take takelog
theof
logboth
of both sides.
sides. This Thiswill
willturn
turn out
out to
tobe
behandy
handy mathe
ma
logDeriving
of the
and
probability:
cross-entropy
doesn’t hurt us; loss
whatever for
values a single
maximize a observation
probability will x
also max
and doesn’t hurt us; whatever values⇥ maximize a⇤probability will also m
log of the probability: y 1 y
log of the probability: log p(y|x) = log ŷ (1 ŷ)
Goal: maximize probability of the correct label p(y|x)
⇥ y 1 y

= y log
log p(y|x) = ŷlog
+⇥(1ŷ (1y) log(1
ŷ) ŷ)
⇤ (
Maximize: y 1 y
log p(y|x) = =logy logŷ ŷ(1
+ (1 ŷ)y) log(1 ŷ)
Eq. 5.10 describes a log likelihood that should be maximized. In order to turn
= y log ŷ + (1 y) log(1 ŷ)
into loss
Eq.function (something
5.10 describes a logthat we needthat
likelihood to should
minimize), we’ll just flip
be maximized. the sig
In order t
Eq.Now
5.10.flip
intoThesign
loss toisturn
result this into that
the(something
function a loss:
cross-entropy loss something to minimize
LCE :to minimize),
we need we’ll just flip th
Eq. 5.10 describes a log likelihood that should
Eq. 5.10. The result is the cross-entropy loss LCE :be maximized. In orde
intoCross-entropy
loss function loss (because
(something is
that formula
we need for
to cross-entropy(y,
minimize),
LCE (ŷ, y) = log p(y|x) = [y log ŷ + (1 y) log(1 ŷ)] we’ll 𝑦
! ))
just fli(
Eq. 5.10. The result
Minimize: LCEis(ŷ,the cross-entropy
y) = log p(y|x) =loss [y
Llog
CE :ŷ + (1 y) log(1 ŷ)]
Finally, we can plug in the definition of ŷ = s (w · x + b):
Or, plugging
Finally, we in
candefinition of 𝑦:
!
plug in the definition of ŷ = s (w · x + b):
LCE (ŷ, y) = log p(y|x) = [y log ŷ + (1 y) log(1 ŷ)
LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b))] (
LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b))
Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .
x4=3
x1=3 x5=0 x6=4.19
Let'sFigure
see 5.2 if thisminiworks
A sample test documentfor our
showing sentiment
the extracted example
features in the vector x.

Given these 6 features and the input review x, P(+|x) and P( |x) can be com-
True value is y=1. How well is our model doing?
puted using Eq. 5.5:

p(+|x) = P(Y = 1|x) = s (w · x + b)


= s ([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
R 5 • L OGISTIC R EGRESSION= s (.833)
= 0.70 (5.6)
side of the
p(equation drops
|x) = P(Y out, =
= 0|x) leading
1 sto (wthe
· x +following
b) loss (we’ll use log to mean
natural log when the base is not specified):
Pretty well! What's the loss?
= 0.30
LCE (ŷ, y) = [y log s (w · x + b) + (1 y) log (1 s (w · x + b))]
Logistic regression is commonly applied to all sorts of NLP tasks, and any property
= can be a[log
of the input s (w Consider
feature. · x + b)] the task of period disambiguation: deciding
if a period
= is the end of a sentence
log(.70)or part of a word, by classifying each period
into one=of two classes EOS (end-of-sentence)
.36 and not-EOS. We might use features
like x1 below expressing that the current word is lower case and the class is EOS
CE
p(+|x) = P(Y = 1|x) = s (w · x + b)
[log s (w · x + b)]
Let's see= if this works for our sentiment example
=
= s
log(.70)
([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1
= s (.833)
= .36
Suppose true value instead was y=0.
= 0.70
By contrast, let’s pretend instead that the example in Fig. 5.2 was actually negative,
i.e., y = 0 p( |x) =
(perhaps theP(Y = 0|x)
reviewer s (w
went=on1to say ·x+
“But b) line, the movie is
bottom
terrible! I beg you not to see it!”). In this case our model is confused and we’d want
= 0.30
the loss to be higher. Now if we plug y = 0 and 1 s (w · x + b) = .31 from Eq. 5.7
intoWhat's theleftloss?
Eq. 5.12, the side of the equation drops out:
Logistic regression is commonly applied to all sorts of NLP tasks,
LCEof
(ŷ, the
y) =input can be sa(w
[y log feature. Consider
· x + b)+(1 s (w ·of
the task
y) log (1 x +period
b))] disambig
if a period
= is the end of a sentence [log (1or spart
(w · xof a word, by classif
+ b))]
into one= of two classes EOS (end-of-sentence)
log (.30) and not-EOS. We m
like x=1 below expressing that 1.2 the current word is lower case and
(perhaps with a positive weight), or that the current word is in o
Sure enough, the loss for the first classifier (.37) is less than the loss for the second
8 C HAPTER 5
LCE• (ŷ,Ly) = [y log s (w · x + b) + (1
OGISTIC R EGRESSION
y) log (1 s (w · x + b))]
Let's see if= this works for
[log s (w · x + b)] our sentiment example
side of the=equation drops out, log(.70)
leading to the following loss (we’ll use log to mean
The loss when model was right (if true y=1)
natural log=when the base is not specified):
.36
LCE (pretend
By contrast, let’s ŷ, y) = instead [y
that s (wexample
logthe · x + b) +in(1Fig.y)5.2
logwas s (w · x +negative,
(1 actually b))]
i.e., y = 0 (perhaps the = reviewer[log
wents (won· x to
+ b)]
say “But bottom line, the movie is
terrible! I beg you not to = see it!”). In this case
log(.70)our model is confused and we’d want
the loss to be higher. Now = if we plug y = 0 and .361 s (w · x + b) = .31 from Eq. 5.7
Is lower than the loss when model
into Eq. 5.12, the left side of the equation drops out:
was wrong (if true y=0):
By contrast, let’s pretend instead that the example in Fig. 5.2 was actually negative,
LCE (yŷ,=y)0=(perhaps the
i.e., [y log s (w · xwent
reviewer + b)+(1 y) log
on to say “But s (w ·line,
(1 bottom x + b))]
the movie is
terrible! I beg
= you not to see it!”). In this case[log
our (1
model is confused
s (w · x + b))]and we’d want
the loss to be higher. Now if we plug y = 0 and 1 s (w · x + b) = .31 from Eq. 5.7
into Eq. 5.12, logout:
= the left side of the equation drops (.30)
= 1.2
LCE (ŷ, y) = [y log s (w · x + b)+(1 y) log (1 s (w · x + b))]
Sure enough, the loss for
= the first classifier (.37) is less than
(1 the loss
s (w · x for the second
+ b))]
Sure enough,
classifier (1.17). loss
=
was bigger when model
[log
log (.30)
was wrong!
Cross-Entropy Loss

Logistic
Regression
Stochastic Gradient Descent

Logistic
Regression
th gradient descent is to find the optimal weights: minim
Our goal: minimize the loss
ve defined for the model. In Eq. 5.13 below, we’ll explici
the loss
Let'sfunction L is that
make explicit parameterized by the
the loss function weights, whic
is parameterized
by weights
e learning 𝛳=(w,b)as q (in the case of logistic regressio
in general
s to find
• theAndset of represent
we’ll weights which minimizes
𝑦! as f (x; θ ) to makethetheloss functi
mples: dependence on θ more obvious
We want the weights that minimize the loss, averaged
over all examples:
m
X
1
q̂ = argmin LCE ( f (x ; q ), y )
(i) (i)
q m
i=1
Intuition of gradient descent
How do I get to the bottom of this river canyon?

Look around me 360∘

Find the direction of


steepest slope down
x Go that way
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

Loss Should we move


right or left from here?

w1 wmin w
0 (goal)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

Loss

slope of loss at w1
is negative

So we'll move positive

w1 wmin w
0 (goal)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

Loss

one step
of gradient
slope of loss at w1 descent
is negative

So we'll move positive

w1 wmin w
0 (goal)
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.

Gradient Descent: Find the gradient of the loss


function at the current point and move in the
opposite direction.
How much do we move in that direction ?

• The value
GISTIC R EGRESSION of the gradient (slope in our example)
!
!"
𝐿(𝑓 𝑥; 𝑤 , 𝑦) weighted by a learning rate η
• Higher learning rate means move w faster

t+1 t d
w =w h L( f (x; w), y)
dw
’s extend the intuition from a function of one scalar variable
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
of a 2-dimensional gradient vector taken at the red point.
Imagine 2 dimensions, w and b
Cost(w,b)
Visualizing the
gradient vector at
the red point
It has two
dimensions shown
in the x-y plane
b
w
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i
tells us the slope with respect to that variable.
◦ “How much would a small change in wi influence the
total loss function L?”
◦ We express the slope as a partial derivative ∂ of the loss
∂wi
The gradient is then defined as a vector of these
partials.
ach dimension wi , we express the slope as a partial derivative ∂ wi of th
n. The gradient is then defined as a vector of these partials. We’ll repre
The gradient
q ) to make the dependence on q more 2 ∂obvious: 3
∂ w L( f (x; q ), y)
We’ll represent 𝑦! as f (x; θ ) to make
6 ∂the 1 dependence on θ more
7
obvious: 6
2 ∂ w L( f (x; q ), y)
37
6
—q L( f (x; q ), y)) = 6 ∂ w L( f (x;
∂ 2
q ), y) 77
1
.
.
4
6 ∂ . 75
6 ∂ w∂ 2 L( f (x; q ), y)7
—q L( f (x; q ), y)) = 6 6 ∂ wn L( f ..(x; q ), y)
7
7
4 . 5

nal equation for updating q based on ∂ wnthe (x; q ), y)is thus
L( fgradient
The final equation for updating θ based on the gradient is thus
final equation for updating q based on the gradient is thus
qt+1 = qt h—L( f (x; q ), y)
Gradient for Logistic Regression
4.1 What
The are
Gradient for Logistic
these partial Regression
derivatives for logistic regression?
te q , we need a definition for the gradient —L( f (x; q ), y). Re
order to update q , we need a definition for the gradient —L( f (x; q ), y). Recal
ession, the cross-entropy loss function is:
logistic regression, the cross-entropy loss function is:
The loss function
y) =LCE (ŷ,[yy)log=s (w ·x+
[y log b)· x++(1
s (w y) log
b) + (1 (1(1 ss(w
y) log (w ·· xx+ b))]
+b))]
It turns
that theoutderivative
that the derivative
of thisoffunction
this function
forfor oneobservation
one observation vecto
vec
. 5.18The
(theelegant derivative
interested reader of
canthis
seefunction
Section(see textbook
5.8 for 5.8 for derivation)
the derivation of this equa
erested reader can see Section 5.8 for the derivation of this eq
∂ LCE (ŷ, y)
∂ LCE (ŷ, y) ∂ w = [s (w · x + b) y]x j
=j [s (w · x + b) y]x j
∂ w j the gradient with respect to a single weight w j represe
Note in Eq. 5.18 that
y intuitive
5.18 that value: the difference
the gradient with between
respectthe
totrue y and our
a single estimated
weight ŷ =
w j repr
function S TOCHASTIC G RADIENT D ESCENT(L(), f (), x, y) returns q
# where: L is the loss function
# f is a function parameterized by q
# x is the set of training inputs x(1) , x(2) , ..., x(m)
# y is the set of training outputs (labels) y(1) , y(2) , ..., y(m)

q 0
repeat til done # see caption
For each training tuple (x(i) , y(i) ) (in random order)
1. Optional (for reporting): # How are we doing on this tuple?
Compute ŷ (i) = f (x(i) ; q ) # What is our estimated output ŷ?
Compute the loss L(ŷ (i) , y(i) ) # How far off is ŷ(i) ) from the true output y(i) ?
2. g —q L( f (x(i) ; q ), y(i) ) # How should we move q to maximize loss?
3. q q hg # Go the other way instead
return q

Figure 5.5 The stochastic gradient descent algorithm. Step 1 (computing the loss) is used
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by algorithm from
supervision (like regular parameters), they are
chosen by algorithm designer.
Stochastic Gradient Descent

Logistic
Regression
Stochastic Gradient Descent:
An example and more details
Logistic
Regression
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update [yθlog
CE (ŷ, y) =
is:s (w · x + b) + (1
w1 = w2 = b = 0 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Itqturns out
= q
that the h—L(
derivative
f (x; q
of this
),function
y) for one observation ve
t requires that we compute the gradient, multiplied by the
t+1 update step
The single
Eq. 5.18 (the
learning rate
interested reader can see Section 5.8 for the derivation of this eq
∂ LCEt(ŷ, y)
where q t+1
= q h—q L(
∂wj
; q ·),xy+
[s(i)(w
= f (x (i) b)
) y]x j
InNote
our mini example
in Eq. 5.18there
that are
thethree parameters,
gradient so the gradient
with respect vectorweight
to a single has 3 dimen-
w j rep
Gradient vector has 3 dimensions:
sions, for w1 , w2 , and b. We can compute the first gradient as follows:
very intuitive value: the difference between the true y and our estimated ŷ
2 ∂ LCEx(ŷ,y) 3 2
+ b) for that observation,
(s (w · x + b) y)x1
multiplied
3 2 by the
(s (0) 1)x1
corresponding
3 2
0.5x1
3input
2 value
1.5
3 x j.
∂ w1
6 ∂ LCE (ŷ,y) 7
—w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
5.4.2
∂ LCE (ŷ,y)
∂b
The Stochastic Gradient Descent Algorithm
s (w · x + b) y s (0) 1 0.5 0.5

Stochastic gradient descent is an online algorithm that minimizes


1 the loss
Now that we have a gradient, we compute the new parameter vector q by moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update [yθlog
CE (ŷ, y) =
is:s (w · x + b) + (1
w1 = w2 = b = 0 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Itqturns out
= q
that the h—L(
derivative
f (x; q
of this
),function
y) for one observation ve
t requires that we compute the gradient, multiplied by the
t+1 update step
The single
Eq. 5.18 (the
learning rate
interested reader can see Section 5.8 for the derivation of this eq
∂ LCEt(ŷ, y)
where q t+1
= q h—q L(
∂wj
; q ·),xy+
[s(i)(w
= f (x (i) b)
) y]x j
InNote
our mini example
in Eq. 5.18there
that are
thethree parameters,
gradient so the gradient
with respect vectorweight
to a single has 3 dimen-
w j rep
Gradient vector has 3 dimensions:
sions, for w1 , w2 , and b. We can compute the first gradient as follows:
very intuitive value: the difference between the true y and our estimated ŷ
2 ∂ LCEx(ŷ,y) 3 2
+ b) for that observation,
(s (w · x + b) y)x1
multiplied
3 2 by the
(s (0) 1)x1
corresponding
3 2
0.5x1
3input
2 value
1.5
3 x j.
∂ w1
6 ∂ LCE (ŷ,y) 7
—w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
5.4.2
∂ LCE (ŷ,y)
∂b
The Stochastic Gradient Descent Algorithm
s (w · x + b) y s (0) 1 0.5 0.5

Stochastic gradient descent is an online algorithm that minimizes


1 the loss
Now that we have a gradient, we compute the new parameter vector q by moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update [yθlog
CE (ŷ, y) =
is:s (w · x + b) + (1
w1 = w2 = b = 0 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Itqturns out
= q
that the h—L(
derivative
f (x; q
of this
),function
y) for one observation ve
t requires that we compute the gradient, multiplied by the
t+1 update step
The single
Eq. 5.18 (the
learning rate
interested reader can see Section 5.8 for the derivation of this eq
∂ LCEt(ŷ, y)
where q t+1
= q h—q L(
∂wj
; q ·),xy+
[s(i)(w
= f (x (i) b)
) y]x j
InNote
our mini example
in Eq. 5.18there
that are
thethree parameters,
gradient so the gradient
with respect vectorweight
to a single has 3 dimen-
w j rep
Gradient vector has 3 dimensions:
sions, for w1 , w2 , and b. We can compute the first gradient as follows:
very intuitive value: the difference between the true y and our estimated ŷ
2 ∂ LCEx(ŷ,y) 3 2
+ b) for that observation,
(s (w · x + b) y)x1
multiplied
3 2 by the
(s (0) 1)x1
corresponding
3 2
0.5x1
3input
2 value
1.5
3 x j.
∂ w1
6 ∂ LCE (ŷ,y) 7
—w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
5.4.2
∂ LCE (ŷ,y)
∂b
The Stochastic Gradient Descent Algorithm
s (w · x + b) y s (0) 1 0.5 0.5

Stochastic gradient descent is an online algorithm that minimizes


1 the loss
Now that we have a gradient, we compute the new parameter vector q by moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update [yθlog
CE (ŷ, y) =
is:s (w · x + b) + (1
w1 = w2 = b = 0 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Itqturns out
= q
that the h—L(
derivative
f (x; q
of this
),function
y) for one observation ve
t requires that we compute the gradient, multiplied by the
t+1 update step
The single
Eq. 5.18 (the
learning rate
interested reader can see Section 5.8 for the derivation of this eq
∂ LCEt(ŷ, y)
where q t+1
= q h—q L(
∂wj
; q ·),xy+
[s(i)(w
= f (x (i) b)
) y]x j
InNote
our mini example
in Eq. 5.18there
that are
thethree parameters,
gradient so the gradient
with respect vectorweight
to a single has 3 dimen-
w j rep
Gradient vector has 3 dimensions:
sions, for w1 , w2 , and b. We can compute the first gradient as follows:
very intuitive value: the difference between the true y and our estimated ŷ
2 ∂ LCEx(ŷ,y) 3 2
+ b) for that observation,
(s (w · x + b) y)x1
multiplied
3 2 by the
(s (0) 1)x1
corresponding
3 2
0.5x1
3input
2 value
1.5
3 x j.
∂ w1
6 ∂ LCE (ŷ,y) 7
—w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
5.4.2
∂ LCE (ŷ,y)
∂b
The Stochastic Gradient Descent Algorithm
s (w · x + b) y s (0) 1 0.5 0.5

Stochastic gradient descent is an online algorithm that minimizes


1 the loss
Now that we have a gradient, we compute the new parameter vector q by moving
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update [yθlog
CE (ŷ, y) =
is:s (w · x + b) + (1
w1 = w2 = b = 0 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Itqturns out
= q
that the h—L(
derivative
f (x; q
of this
),function
y) for one observation ve
t requires that we compute the gradient, multiplied by the
t+1 update step
The single
Eq. 5.18 (the
learning rate
interested reader can see Section 5.8 for the derivation of this eq
∂ LCEt(ŷ, y)
where q t+1
= q h—q L(
∂wj
; q ·),xy+
[s(i)(w
= f (x (i) b)
) y]x j
InNote
our mini example
in Eq. 5.18there
that are
thethree parameters,
gradient so the gradient
with respect vectorweight
to a single has 3 dimen-
w j rep
Gradient vector has 3 dimensions:
sions, for w1 , w2 , and b. We can compute the first gradient as follows:
very intuitive value: the difference between the true y and our estimated ŷ
2 ∂ LCEx(ŷ,y) 3 2
+ b) for that observation,
(s (w · x + b) y)x1
multiplied
3 2 by the
(s (0) 1)x1
corresponding
3 2
0.5x1
3input
2 value
1.5
3 x j.
∂ w1
6 ∂ LCE (ŷ,y) 7
—w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
5.4.2
∂ LCE (ŷ,y)
∂b
The Stochastic Gradient Descent Algorithm
s (w · x + b) y s (0) 1 0.5 0.5

Stochastic gradient descent is an online algorithm that minimizes


1 the loss
Now that we have a gradient, we compute the new parameter vector q by moving
6 ∂ w L( f (x; q ), y) 7
—q L( f (x; qthere 6
In our mini example
y))arefor=
),sions, 7
2there are three parameters, so the gradient vector has 3 dimen-
(5.15)
r mini Example
example
for w1 , w22, and b. We
of
can
w1gradient
three
,6
4
parameters,
compute the
.
.
w2 , and b. We can
. descent
so
first
the gradient
gradient
vector has 3 dimen-
7 first gradient as follows:
compute the
5 as follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w 1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb)
CE
L( f (x; q (sy)
),
n y)x 5 = 43(s (0)
(0) 1)x1
2 1)x 5 =34 0.5x
0.5x1
2 5 = 43 1.0 5
1.5
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 in the opposite direction from the gradient:
θ1 by moving θ0 in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the wnew
1 parameter
1.5 .15 q 1 by moving
vector
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
(x;
= 4qw),2 5y) h 4 1.0 5 =η =
4 0.1;
.1 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
1w2 =4.1, and5b = .05.
q = w2 h 4 1.0 5 = 4 .1 5
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter one step of gradient descent, the weights have shifted to be: w = .15,
6 ∂ w L( f (x; q ), y) 7
—q L( f (x; qthere 6
In our mini example
y))arefor=
),sions, 7
2there are three parameters, so the gradient vector has 3 dimen-
(5.15)
r mini Example
example
for w1 , w22, and b. We
of
can
w1gradient
three
,6
4
parameters,
compute the
.
.
w2 , and b. We can
. descent
so
first
the gradient
gradient
vector has 3 dimen-
7 first gradient as follows:
compute the
5 as follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w 1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb)
CE
L( f (x; q (sy)
),
n y)x 5 = 43(s (0)
(0) 1)x1
2 1)x 5 =34 0.5x
0.5x1
2 5 = 43 1.0 5
1.5
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 in the opposite direction from the gradient:
θ1 by moving θ0 in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the wnew
1 parameter
1.5 .15 q 1 by moving
vector
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
(x;
= 4qw),2 5y) h 4 1.0 5 =η =
4 0.1;
.1 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
1w2 =4.1, and5b = .05.
q = w2 h 4 1.0 5 = 4 .1 5
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter one step of gradient descent, the weights have shifted to be: w = .15,
6 ∂ w L( f (x; q ), y) 7
—q L( f (x; qthere 6
In our mini example
y))arefor=
),sions, 7
2there are three parameters, so the gradient vector has 3 dimen-
(5.15)
r mini Example
example
for w1 , w22, and b. We
of
can
w1gradient
three
,6
4
parameters,
compute the
.
.
w2 , and b. We can
. descent
so
first
the gradient
gradient
vector has 3 dimen-
7 first gradient as follows:
compute the
5 as follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w 1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb)
CE
L( f (x; q (sy)
),
n y)x 5 = 43(s (0)
(0) 1)x1
2 1)x 5 =34 0.5x
0.5x1
2 5 = 43 1.0 5
1.5
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 in the opposite direction from the gradient:
θ1 by moving θ0 in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the wnew
1 parameter
1.5 .15 q 1 by moving
vector
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
(x;
= 4qw),2 5y) h 4 1.0 5 =η =
4 0.1;
.1 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
1w2 =4.1, and5b = .05.
q = w2 h 4 1.0 5 = 4 .1 5
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter one step of gradient descent, the weights have shifted to be: w = .15,
6 ∂ w L( f (x; q ), y) 7
—q L( f (x; qthere 6
In our mini example
y))arefor=
),sions, 7
2there are three parameters, so the gradient vector has 3 dimen-
(5.15)
r mini Example
example
for w1 , w22, and b. We
of
can
w1gradient
three
,6
4
parameters,
compute the
.
.
w2 , and b. We can
. descent
so
first
the gradient
gradient
vector has 3 dimen-
7 first gradient as follows:
compute the
5 as follows:
∂ L (ŷ,y) 3 2 3 2 3 2 3 2 3
CE
∂ b) y)x1
∂w 1 (s (w · x +
2 — = 6 ∂ L (ŷ,y) 7 = 43(s (w2· x∂+wb)
CE
L( f (x; q (sy)
),
n y)x 5 = 43(s (0)
(0) 1)x1
2 1)x 5 =34 0.5x
0.5x1
2 5 = 43 1.0 5
1.5
w,b 4 5 2 2 2
(s (w · x + b) y)x1 s (w · x +(s
∂ w2
∂ LCE (ŷ,y) y 1)x1 s (0) 1 0.5x1
b)(0) 0.5 1.5 0.5
= 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
∂b
on for updating q based on the gradient is thus
s (wthat
Now we Now
· x + b) have
y thata we havesa (0)
gradient, we1compute
gradient, we compute the new
0.5
the new parameter
parameter 0.5q 1vector
vector by moving
q 0 in the opposite direction from the gradient:
θ1 by moving θ0 in the opposite direction from the gradient:
2 3 2 3 2 3
that we have a gradient, we compute the wnew
1 parameter
1.5 .15 q 1 by moving
vector
q
the opposite = q h—L( f
q
t+1directiont from the gradient:
1
(x;
= 4qw),2 5y) h 4 1.0 5 =η =
4 0.1;
.1 5 (5.16)
b 0.5 .05
2 3 2 3 2 3
So afterwone 1.5 descent, .15
1 step of gradient the weights have shifted to be: w1 = .15,
1w2 =4.1, and5b = .05.
q = w2 h 4 1.0 5 = 4 .1 5
Note that this observation x happened to be a positive example. We would expect
b 0.5 .05
that after seeing more negative examples with high counts of negative words, that
the weight w2 would shift to have a negative value.
ter oneNote
stepthat enough negative
of gradient descent,examples wouldhave
the weights eventually
shiftedmake w2 negative
to be: w = .15,
Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: entire dataset
Mini-batch training: m examples (512, or 1024)
Stochastic Gradient Descent:
An example and more details
Logistic
Regression
Regularization

Logistic
Regression
Overfitting
A model that perfectly match the training data has a
problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to
only occur in one class) will get a very high weight.
◦ Failing to generalize to a test set without this word.
A good model should be able to generalize
Overfitting Useful or harmless features

+ X1 = "this"
X2 = "movie
This movie drew me in, and it'll X3 = "hated"
do the same to you. X4 = "drew me in"

-
I can't tell you how much I
4gram features that just
"memorize" training set and
might cause problems
hated this movie. It sucked. X5 = "the same to you"
X7 = "tell you how much"
81
Overfitting
4-gram model on tiny data will just memorize the data
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data
◦ Low accuracy on test set
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
82
o the unseen test set,
Regularizationbut a model that overfits will have poor generalization
o avoid overfitting, a new regularization term R(q ) is added to the object
on in Eq. 5.13, resulting in the following objective for a batch of m exa
slightlyArewritten
solutionfrom
for Eq.
overfitting
5.13 to be maximizing log probability rather th
1
Addand
mizing loss, a regularization
removing the mterm R(θ) todoesn’t
term which the loss function
affect the argmax):
(for now written as maximizing logprob rather than minimizing loss)
X m
q̂ = argmax (i) (i)
log P(y |x ) aR(q ) (5.
q i=1
Idea: choose an R(θ) that penalizes large weights
ew regularization term R(q ) is used to penalize large weights. Thus a sett
◦ fitting the data well with lots of big weights not as good
weights that matches
as fitting thethe training
data a littledata
lessperfectly— but uses
well, with small many weights w
weights
values to do so—will be penalized more than a setting that matches the dat
ess well, but does so using smaller weights. There are two common ways
high
to dovalues to dobe
so—will so—will be penalized
penalized more thanmorea than a setting
setting that matches
that matches the the
datadaa
ell, lessL2
littlebut does Regularization
well, butusing
so does so
smaller (= ridge
using smaller
weights. regression)
weights.
There There
are twoare common
two common way
ways to
compute this regularization term R(q ). L2 regularization is
s regularization term R(q ). L2 regularization is a quadratic function o a quadratic function
the weight values, named because it uses the (square of the) L2 norm of the wei
values, Thenamed sumbecause
of the itsquares
uses the of(square
the of the) L2 norm of the weigh
weights
values. The L2 norm, ||q ||2 , is the same as the Euclidean distance of the vecto
L2 norm,
from theThe ||q || 2 , is
If q is the same as the Euclidean distance of the vector q
name
origin. because
consists this is the
of n weights, then:(square of the)
gin. If q consists of n weights, then:
L2 norm ||θ||2, = Euclidean distance
n
X of θ to the origin.
2
R(q ) = ||qn||
X2 = q j2 (5
2 2 j=1
R(q ) = ||q ||2 = qj (5.23
j=1
The L2 regularized objective function becomes:
L2 regularized objective function:
"
larized objective function becomes:m
# n
X X
2
q̂ ="argmax (i) (i)
log P(y # |x ) a qj (5
mq
X n
X
i=1 j=1
q i=1 j=1
i=1 j=1
L1 Regularization
regularization is a linear function(= lasso
of the regression)
weight values, named after the L1 n
ularization is a linear function of the weight values, named after
||1 , the sum of the absolute values of the weights, or Manhattan distance
the L1 n
the sum
nhattanThe of the
distance absolute
is the values
distance of
you’dthe weights,
have to or
walk Manhattan
between two distance
points in a
sum of the (absolute value of the) weights
htan distance
a street gridislike
theNew
distance
York):you’d have to walk between two points in a
Named
treet grid afterYork):
like New the L1 norm ||W||1, = sum of the absolute
n
values of the weights, = Manhattan
X
n
distance
R(q ) = ||q ||1 =
X |qi | (
R(q ) = ||q ||1 = |qi |
i=1 (
i=1
L1 regularized objective function becomes:
L1 regularized objective function:
regularized objective function
" mbecomes: #
X n
X
q̂ = argmaxX" m
(i) (i)#
log P(y |x ) aX
n |q j | (
q
q̂ = argmax 1=i (i) (i)
log P(y |x ) a j=1|q
j| (
Regularization

Logistic
Regression
Logistic Multinomial Logistic
Regression
Regression
Multinomial Logistic Regression
Often we need more than 2 classes
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes
If >2 classes we use multinomial logistic regression
= Softmax regression
= Multinomial logit
= (defunct names : Maximum entropy modeling or MaxEnt
So "logistic regression" will just mean binary (2 output classes)
88
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1

Need a generalization of the sigmoid called the softmax


◦ Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
◦ Outputs a probability distribution
◦ each value in the range [0,1]
◦ all the values summing to 1
89
akes a vector z = [z1 , z2 , ..., zk ] of k arbitrarye values and maps them to a probability
moid, it is an exponential
softmax(zi )function.
= Pk
distribution, with each value in the range (0,1), zj
1  i  k (5.32)
and all the values summing to 1.
e
ctor z of The softmax function
dimensionality k, the
Like the sigmoid, it is an exponential function. softmax
j=1
is defined as:
For
Thea softmax
vector zofofandimensionality
input vector z =k,[zthe
1 , z2softmax
, ..., zk ] isisthus
defined as: itself:
a vector
Turns a vector z = [z1, z2, ... , zk] of k arbitrary values into probabilities
"exp (z
exp i ) #
softmax(z i ) =
softmax(z ) P
=e z1
softmax(z) = i Pk kPk, Pk
(z
e z2i )
1 1

, ..., Pk
e
i
zk
 i k  k (5.33)(5.30)
ezi expexp (z
(z
e zi jj)) e zi
i=1j=1 j=1 i=1 i=1

Pk
The softmax of an input zvector z = [z , z , ..., z ] is thus a vector itself:
max of an input vector
The denominator i=1 e z = [z1 , z2 , ..., zk ] is thus a vector itself:
i is used to normalize
1 2 all
k the values into probabilities.
Thus for example given a vector:
" #
" exp1.1,
z = [0.6, exp
(z1 ) 1.5, 1.2, 3.2,
(z2 )1.1] exp (zk ) #
softmax(z) = Pk , Pk , ..., Pk (5.31)
exp (zi=1
1 )exp (zi ) exp (z2(z) i )
i=1 exp exp
i=1 exp (zi(z
) k)
max(z)
the result softmax(z)
= Pisk , Pk , ..., Pk
i=1 exp (z i ) i=1 exp (z i ) i=1 exp (z i )
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010] 90
z e z ze z ze z
For a vector z of dimensionality i=1e k, 1 thei softmaxe
i=1
2is idefined as:
i=1e k i
softmax(z) = Pk , Pk , ..., Pk
The softmax function
Pk z i=1 e zi
i=1 e zi
i=1 e zi
nominator i=1softmax(z e i is usedi ) = Pk
to normalize
exp (zi ) all
1ik the values into proba
(5.30)
xample given P
◦ Turns
a a vector
kvector:
z
z = [z 1 ,z2 ,...,z k ] of
j=1
k arbitrary
exp (z j ) values into probabilities
enominator i=1 e is used to normalize all the values into prob
i

example givenofaanvector:
The softmax input vector z = [z1 , z2 , ..., zk ] is thus a vector itself:
z = [0.6, 1.1, 1.5, 1.2, 3.2, 1.1]
" #
z = [0.6,
exp1.1,
(z1 ) 1.5,exp
1.2,
(z23.2,
) 1.1]exp (zk )
oftmax(z) is
softmax(z) = Pk , Pk , ..., Pk (5.31)
i=1 exp (zi ) i=1 exp (zi ) i=1 exp (zi )
softmax(z) is
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
ike the sigmoid, the input to the softmax will be the dot product b 91
vector w and an input vector x (plus a bias). But now we’ll need sepa
ectors (and bias) for each of the K classes.
Softmax in multinomial logistic regression
exp (wc · x + bc )
p(y = c|x) = k (5
X
exp (w j · x + b j )
j=1

Input is still the dot product between weight vector w


he sigmoid, the softmax has the property of squashing values toward 0
and input vector x
ne of the inputs is larger than the others, it will tend to push its probab
But now we’ll need separate weight vectors for each
and suppress the probabilities
of the K classes. of the smaller inputs.

92
inomial
gative logistic with
sentiment) regression function
the absolute valuesimilarly
indicatingtohow
binary logistit
important
Fordifference
ne multinominal
Featuresthat logistic regression,
we’ll versus
in binary need separateby contrast,
weight
multinomial with separate
vectors
logistic (and weigh
biases
regression
s, a feature can be evidence for or against each individual
sses. Recall our binary exclamation point feature x5 from page class.
In 3-way multiclass
Binary: sentiment
positive weight classification,
à y=1 neg for example,
weight à y=0we must a

ument one of the 3 classes +, , or 0 (neutral). Now a feature related
1 if “!” 2 doc w5 = 3.0 and a posit
ion marks might x5 have
= a negative weight for 0 documents,
0 otherwise
+ or documents:
Multinominal: separate weights for each class:
ssification a positive weight w5 on a feature influences the clas
ositive sentiment)
Feature and a negative
Definition
⇢ weight w influences
5,+ w 5, w it
5,0 toward y
ent) with the absolute1value
if “!”indicating
2 doc how important the fe
f5 (x) 3.5 3.1 5.3
inal logistic regression,0 by
otherwise
contrast, with separate weights for
an be evidence for or against each individual class. 93
Logistic Multinomial Logistic
Regression
Regression

You might also like