Professional Documents
Culture Documents
Discriminative Classifiers
Logistic
Regression
Logistic Regression
by contrast:
imagenet imagenet
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Logistic
Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}
1
y = s (z) = z
(5
1+e
20
Idea of logistic regression
P(y = 1) = s (w · xx +
+b)
b)
11
=
1 + exp
exp(( (w (w··xx+
+b))
b))
P(y =
P(y = 0) = 11 s
0) = (w··xx+
s(w +b)b)
11
= 1
= 1 1 + exp ( (w · x + b))
1 + exp ( (w · x + b))
exp
exp (
( (w
(w ··xx++ b))
b))
=
= 1 + exp ( (w · x + b))
1 + exp ( (w · x + b))
P(y = 0) = 1 s (w · x + b)
TheP(ysigmoid = s (w · xhas
= 1) function + b) the property
By the way: 1 = 1
1
= s (x) = s ( x) 1 + exp ( (w · x
1 + exp ( (w · x1+ b))
exp ( (w · x + b))
=
so we
P(ycould
= 0) also
= 1have · x + b) P(y = 0) =as s ( (w ·1x+
expressed
s (w b)).( (w · x + b
+exp
Now we have an algorithm 1 that given an instance x computes the
P(y = 1|x). How = 1 do The we sigmoid
make a function has For
decision? the property
a test instance x, we sa
1 + exp ( (w · x + b)) Because
probability P(y = exp 1|x)( is(w more than .5, and no otherwise. We call .5
· x + b)) 1 s (x) = s ( x)
boundary: = (5.5)
1 + exp ( (w · x + b))
so we could⇢ also have expressed P(y = 0) as s ( (w · x
tion has the property Now 1 if P(y = 1|x) > 0.5
ŷ =we have an algorithm that given an instance
0 otherwise
P(y = 1|x). How do we make a decision? For a test i
1 s (x) = s ( x) P(y = 1|x) is more than .5, (5.6)
probability and no otherwi
5.1.1 Example:decision
boundary sentiment classification
boundary:
How do we make a decision? For a test instance x, we sa
(y = 1|x) is more
Turning than .5, andinto
a probability no otherwise.
a classifierWe call .5 t
⇢
1 if P(y = 1|x) > 0.5
ŷ =
0 otherwise
Now we have an
wx algorithm
+b that given an instan
algorithm that given an instance x computes the probabil
we make a decision? For a test instance x, we say yes if t
Turning a probability into a classifier
is more than .5, and no otherwise. We call .5 the decisi
⇢
1 if P(y = 1|x) > 0.5 if w∙x+b > 0
ŷ =
0 otherwise if w∙x+b ≤ 0
sentiment classification
e. Suppose we are doing binary sentiment classification
Classification in Logistic Regression
Logistic
Regression
Logistic Regression: a text example
on sentiment classification
Logistic
Regression
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
29
1 if P(y = 1|x) > 0.5
ŷ x= =2
2 0 otherwise
x3=1
5.1.1 It'sExample:
hokey . Theresentiment
are virtually classification
no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
Let’s have
great an example.
. Another nice Suppose wemusic
touch is the are doing
. I wasbinary sentiment
overcome with theclassification
urge to get offon
movie the
review
couch text,
andand
startwe would. like
dancing to know
It sucked mewhether to assign
in , and it'll do the the
samesentiment
to you . class
+ or to a review document doc. We’ll represent each input observation by the 6
x4=3
x1input
features x1 ...x6 of the =3 shown
x5=0in the following
x6=4.19 table; Fig. 5.2 shows the features
in a sample mini test document.
Figure 5.2 A sample mini test document showing the extracted features in the vector x.
Var Definition Value in Fig. 5.2
x1 these
Given count(positive lexicon)
6 features and 2 doc)
the input review x, P(+|x)3 and P( |x) can be com-
putedx2usingcount(negative
⇢Eq. 5.5: lexicon) 2 doc) 2
1 if “no” 2 doc
x3 = P(Y = 1|x) = s (w · x + b)
p(+|x) 1
0 otherwise
x4 count(1st
⇢ = s
and 2nd pronouns
([2.5, doc)
5.0, 21.2, 3 · [3, 2, 1, 3, 0, 4.19] + 0.1)
0.5, 2.0, 0.7]
1 if “!”=2 doc
s (.833)
x5 0
0 otherwise
= 0.70 (5.6)
x6 log(word count of doc) ln(66) = 4.19 30
x2the following
tures x1 ...x6 of the input shown in count(negative
⇢ table; Fig. lexicon)
5.2 shows 2 doc)
the features
Classifying sentiment
a sample mini test document.
x
for input x 1 if “no” 2 doc
3
Var Definition 0 otherwise
Value in Fig. 5.2
x1 x4 2 doc)
count(positive lexicon) count(1st
⇢ and3 2nd pronouns 2 doc)
x2 count(negative lexicon) 2 doc) 1 if “!” 22doc
⇢
1 if “no” 2 doc 5
x
x3 0 otherwise
1
0 otherwise x6 log(word count of doc)
x4 count(1st
⇢ and 2nd pronouns 2 doc) 3
1 if “!” 2 doc
x5 0
0 otherwise
x6 Let’s assume
log(word count of doc) for the moment that
ln(66) = 4.19we’ve alrea
for each of these features, and that the 6 weights
Suppose ware= [2.5, 5.0, 1.2, 0.5, 2.0, 0.7], while b = 0.1. (
et’s assume for the moment that we’ve already learned a real-valued weight for
how the
b = 0.1
weights are learned.) The weight w 1 , for e
ch of these features, and that the 6 weights corresponding to the 6 features are
31
Figure 5.2 1 mini test5document showing
A sample 6 the extracted features in the vector x.
Classifying
Figure 5.2
sentiment for input x
A sample mini test document showing the extracted features in the vector x.
Given these 6 features and the input review x, P(+|x) and P( |x) can be com-
puted usingthese
Given Eq. 5.5:
6 features and the input review x, P(+|x) and P( |x) can be com-
puted using Eq. 5.5:
p(+|x) = P(Y = 1|x) = s (w · x + b)
p(+|x) = P(Y = 1|x) = s ([2.5,
(w · x + 5.0,
b) 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
= s (.833)
([2.5, 5.0, 1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
= s (.833)
0.70 (5.6)
p( |x) = P(Y = 0|x) = 1 0.70s (w · x + b) (5.6)
p( |x) = P(Y = 0|x) = 0.30 1 s (w · x + b)
= 0.30
Logistic regression is commonly applied to all sorts of NLP tasks, and any property
of the input
Logistic can be aisfeature.
regression commonlyConsider
appliedthetotask
all of period
sorts disambiguation:
of NLP tasks, and any deciding
property
if
of athe
period
input is
canthe
beend of a sentence
a feature. Considerorthepart ofofa period
task word, by classifying each
disambiguation: period
deciding
32
erty of the input can be a feature. Consider the task of period disambiguation:
We can build features for logistic regression for
deciding if a period is the end of a sentence or part of a word, by classifying each
any classification task: period disambiguation
period into one of two classes EOS (end-of-sentence) and not-EOS. We might use
features like x1 below expressing that the current word is lower case and the class
is EOS (perhaps with a positive weight), or that the current word is in our abbrevia-
End ofwith
tions dictionary (“Prof.”) and the class is EOS (perhaps sentence
a negative weight). A
feature canThis ends ainquite
also express a period.
complex combination of properties. For example a
The house
period following at 465
a upper cased word Main St.to is
is a likely new.
be an EOS, but if the word itself is
St. and the previous word is capitalized, then the period is likely part of a shortening
of the word street. Not end
⇢
1 if “Case(wi ) = Lower”
x1 =
0 otherwise
⇢
1 if “wi 2 AcronymDict”
x2 =
0 otherwise
⇢
1 if “wi = St. & Case(wi 1 ) = Cap”
x3 =
0 otherwise 33
, which is just whatinwe
Classification want for
(binary) a probability.
logistic Because
regression: it is
summary
but has a sharp slope toward the ends, it tends to squash outlier
Given:
And it’s differentiable, which as we’ll see in Section 5.8 will
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
e. If we◦apply the sigmoid to
x1= count( "awesome") the sum of the weighted features,
ween 0 and 1. To make it a probability,
◦ x2 = log(number of words in review) we just need to make
s, p(y◦=A 1) and wp(y
vector = 0), sum
of weights to 1.w2,
[w1, We can
…, do
wn]this as follows:
◦ wi for each feature fi
P(y = 1) = s (w · x + b)
1
=
1 + e (w·x+b)
Logistic Regression: a text example
on sentiment classification
Logistic
Regression
Learning: Cross-Entropy Loss
Logistic
Regression
Wait, where did the W’s come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, 𝑦!
We want to set w and b to minimize the distance between our
estimate 𝑦! (i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss.
37
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between 𝑦! and y
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .
x4=3
x1=3 x5=0 x6=4.19
Let'sFigure
see 5.2 if thisminiworks
A sample test documentfor our
showing sentiment
the extracted example
features in the vector x.
Given these 6 features and the input review x, P(+|x) and P( |x) can be com-
True value is y=1. How well is our model doing?
puted using Eq. 5.5:
Logistic
Regression
Stochastic Gradient Descent
Logistic
Regression
th gradient descent is to find the optimal weights: minim
Our goal: minimize the loss
ve defined for the model. In Eq. 5.13 below, we’ll explici
the loss
Let'sfunction L is that
make explicit parameterized by the
the loss function weights, whic
is parameterized
by weights
e learning 𝛳=(w,b)as q (in the case of logistic regressio
in general
s to find
• theAndset of represent
we’ll weights which minimizes
𝑦! as f (x; θ ) to makethetheloss functi
mples: dependence on θ more obvious
We want the weights that minimize the loss, averaged
over all examples:
m
X
1
q̂ = argmin LCE ( f (x ; q ), y )
(i) (i)
q m
i=1
Intuition of gradient descent
How do I get to the bottom of this river canyon?
w1 wmin w
0 (goal)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
Loss
slope of loss at w1
is negative
w1 wmin w
0 (goal)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
Loss
one step
of gradient
slope of loss at w1 descent
is negative
w1 wmin w
0 (goal)
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
• The value
GISTIC R EGRESSION of the gradient (slope in our example)
!
!"
𝐿(𝑓 𝑥; 𝑤 , 𝑦) weighted by a learning rate η
• Higher learning rate means move w faster
t+1 t d
w =w h L( f (x; w), y)
dw
’s extend the intuition from a function of one scalar variable
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
of a 2-dimensional gradient vector taken at the red point.
Imagine 2 dimensions, w and b
Cost(w,b)
Visualizing the
gradient vector at
the red point
It has two
dimensions shown
in the x-y plane
b
w
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i
tells us the slope with respect to that variable.
◦ “How much would a small change in wi influence the
total loss function L?”
◦ We express the slope as a partial derivative ∂ of the loss
∂wi
The gradient is then defined as a vector of these
partials.
ach dimension wi , we express the slope as a partial derivative ∂ wi of th
n. The gradient is then defined as a vector of these partials. We’ll repre
The gradient
q ) to make the dependence on q more 2 ∂obvious: 3
∂ w L( f (x; q ), y)
We’ll represent 𝑦! as f (x; θ ) to make
6 ∂the 1 dependence on θ more
7
obvious: 6
2 ∂ w L( f (x; q ), y)
37
6
—q L( f (x; q ), y)) = 6 ∂ w L( f (x;
∂ 2
q ), y) 77
1
.
.
4
6 ∂ . 75
6 ∂ w∂ 2 L( f (x; q ), y)7
—q L( f (x; q ), y)) = 6 6 ∂ wn L( f ..(x; q ), y)
7
7
4 . 5
∂
nal equation for updating q based on ∂ wnthe (x; q ), y)is thus
L( fgradient
The final equation for updating θ based on the gradient is thus
final equation for updating q based on the gradient is thus
qt+1 = qt h—L( f (x; q ), y)
Gradient for Logistic Regression
4.1 What
The are
Gradient for Logistic
these partial Regression
derivatives for logistic regression?
te q , we need a definition for the gradient —L( f (x; q ), y). Re
order to update q , we need a definition for the gradient —L( f (x; q ), y). Recal
ession, the cross-entropy loss function is:
logistic regression, the cross-entropy loss function is:
The loss function
y) =LCE (ŷ,[yy)log=s (w ·x+
[y log b)· x++(1
s (w y) log
b) + (1 (1(1 ss(w
y) log (w ·· xx+ b))]
+b))]
It turns
that theoutderivative
that the derivative
of thisoffunction
this function
forfor oneobservation
one observation vecto
vec
. 5.18The
(theelegant derivative
interested reader of
canthis
seefunction
Section(see textbook
5.8 for 5.8 for derivation)
the derivation of this equa
erested reader can see Section 5.8 for the derivation of this eq
∂ LCE (ŷ, y)
∂ LCE (ŷ, y) ∂ w = [s (w · x + b) y]x j
=j [s (w · x + b) y]x j
∂ w j the gradient with respect to a single weight w j represe
Note in Eq. 5.18 that
y intuitive
5.18 that value: the difference
the gradient with between
respectthe
totrue y and our
a single estimated
weight ŷ =
w j repr
function S TOCHASTIC G RADIENT D ESCENT(L(), f (), x, y) returns q
# where: L is the loss function
# f is a function parameterized by q
# x is the set of training inputs x(1) , x(2) , ..., x(m)
# y is the set of training outputs (labels) y(1) , y(2) , ..., y(m)
q 0
repeat til done # see caption
For each training tuple (x(i) , y(i) ) (in random order)
1. Optional (for reporting): # How are we doing on this tuple?
Compute ŷ (i) = f (x(i) ; q ) # What is our estimated output ŷ?
Compute the loss L(ŷ (i) , y(i) ) # How far off is ŷ(i) ) from the true output y(i) ?
2. g —q L( f (x(i) ; q ), y(i) ) # How should we move q to maximize loss?
3. q q hg # Go the other way instead
return q
Figure 5.5 The stochastic gradient descent algorithm. Step 1 (computing the loss) is used
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by algorithm from
supervision (like regular parameters), they are
chosen by algorithm designer.
Stochastic Gradient Descent
Logistic
Regression
Stochastic Gradient Descent:
An example and more details
Logistic
Regression
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
5.4.1 The Gradient for Logistic Regression
Let’s assume the initial weights and bias in q 0 are all set to 0, and the initial learning
Example of gradient descent
al equation for updating
Inrate
order updateqq ,based
h isto0.1: we needon the gradient
a definition is thus
for the gradient
for logistic regression, the cross-entropy loss function w
—L( f (x; q ), y). Re
is:1 = w2 = b = 0;
Update stepLfor update [yθlog
CE (ŷ, y) =
is:s (w · x + b) + (1
w1 = w2 = b = 0 x1(1= 3;s (w
y) log x2· x=+2 b))]
h = 0.1
Itqturns out
= q
that the h—L(
derivative
f (x; q
of this
),function
y) for one observation ve
t requires that we compute the gradient, multiplied by the
t+1 update step
The single
Eq. 5.18 (the
learning rate
interested reader can see Section 5.8 for the derivation of this eq
∂ LCEt(ŷ, y)
where q t+1
= q h—q L(
∂wj
; q ·),xy+
[s(i)(w
= f (x (i) b)
) y]x j
InNote
our mini example
in Eq. 5.18there
that are
thethree parameters,
gradient so the gradient
with respect vectorweight
to a single has 3 dimen-
w j rep
Gradient vector has 3 dimensions:
sions, for w1 , w2 , and b. We can compute the first gradient as follows:
very intuitive value: the difference between the true y and our estimated ŷ
2 ∂ LCEx(ŷ,y) 3 2
+ b) for that observation,
(s (w · x + b) y)x1
multiplied
3 2 by the
(s (0) 1)x1
corresponding
3 2
0.5x1
3input
2 value
1.5
3 x j.
∂ w1
6 ∂ LCE (ŷ,y) 7
—w,b = 4 ∂ w2 5 = 4 (s (w · x + b) y)x2 5 = 4 (s (0) 1)x2 5 = 4 0.5x2 5 = 4 1.0 5
5.4.2
∂ LCE (ŷ,y)
∂b
The Stochastic Gradient Descent Algorithm
s (w · x + b) y s (0) 1 0.5 0.5
Logistic
Regression
Overfitting
A model that perfectly match the training data has a
problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to
only occur in one class) will get a very high weight.
◦ Failing to generalize to a test set without this word.
A good model should be able to generalize
Overfitting Useful or harmless features
+ X1 = "this"
X2 = "movie
This movie drew me in, and it'll X3 = "hated"
do the same to you. X4 = "drew me in"
-
I can't tell you how much I
4gram features that just
"memorize" training set and
might cause problems
hated this movie. It sucked. X5 = "the same to you"
X7 = "tell you how much"
81
Overfitting
4-gram model on tiny data will just memorize the data
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data
◦ Low accuracy on test set
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
82
o the unseen test set,
Regularizationbut a model that overfits will have poor generalization
o avoid overfitting, a new regularization term R(q ) is added to the object
on in Eq. 5.13, resulting in the following objective for a batch of m exa
slightlyArewritten
solutionfrom
for Eq.
overfitting
5.13 to be maximizing log probability rather th
1
Addand
mizing loss, a regularization
removing the mterm R(θ) todoesn’t
term which the loss function
affect the argmax):
(for now written as maximizing logprob rather than minimizing loss)
X m
q̂ = argmax (i) (i)
log P(y |x ) aR(q ) (5.
q i=1
Idea: choose an R(θ) that penalizes large weights
ew regularization term R(q ) is used to penalize large weights. Thus a sett
◦ fitting the data well with lots of big weights not as good
weights that matches
as fitting thethe training
data a littledata
lessperfectly— but uses
well, with small many weights w
weights
values to do so—will be penalized more than a setting that matches the dat
ess well, but does so using smaller weights. There are two common ways
high
to dovalues to dobe
so—will so—will be penalized
penalized more thanmorea than a setting
setting that matches
that matches the the
datadaa
ell, lessL2
littlebut does Regularization
well, butusing
so does so
smaller (= ridge
using smaller
weights. regression)
weights.
There There
are twoare common
two common way
ways to
compute this regularization term R(q ). L2 regularization is
s regularization term R(q ). L2 regularization is a quadratic function o a quadratic function
the weight values, named because it uses the (square of the) L2 norm of the wei
values, Thenamed sumbecause
of the itsquares
uses the of(square
the of the) L2 norm of the weigh
weights
values. The L2 norm, ||q ||2 , is the same as the Euclidean distance of the vecto
L2 norm,
from theThe ||q || 2 , is
If q is the same as the Euclidean distance of the vector q
name
origin. because
consists this is the
of n weights, then:(square of the)
gin. If q consists of n weights, then:
L2 norm ||θ||2, = Euclidean distance
n
X of θ to the origin.
2
R(q ) = ||qn||
X2 = q j2 (5
2 2 j=1
R(q ) = ||q ||2 = qj (5.23
j=1
The L2 regularized objective function becomes:
L2 regularized objective function:
"
larized objective function becomes:m
# n
X X
2
q̂ ="argmax (i) (i)
log P(y # |x ) a qj (5
mq
X n
X
i=1 j=1
q i=1 j=1
i=1 j=1
L1 Regularization
regularization is a linear function(= lasso
of the regression)
weight values, named after the L1 n
ularization is a linear function of the weight values, named after
||1 , the sum of the absolute values of the weights, or Manhattan distance
the L1 n
the sum
nhattanThe of the
distance absolute
is the values
distance of
you’dthe weights,
have to or
walk Manhattan
between two distance
points in a
sum of the (absolute value of the) weights
htan distance
a street gridislike
theNew
distance
York):you’d have to walk between two points in a
Named
treet grid afterYork):
like New the L1 norm ||W||1, = sum of the absolute
n
values of the weights, = Manhattan
X
n
distance
R(q ) = ||q ||1 =
X |qi | (
R(q ) = ||q ||1 = |qi |
i=1 (
i=1
L1 regularized objective function becomes:
L1 regularized objective function:
regularized objective function
" mbecomes: #
X n
X
q̂ = argmaxX" m
(i) (i)#
log P(y |x ) aX
n |q j | (
q
q̂ = argmax 1=i (i) (i)
log P(y |x ) a j=1|q
j| (
Regularization
Logistic
Regression
Logistic Multinomial Logistic
Regression
Regression
Multinomial Logistic Regression
Often we need more than 2 classes
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes
If >2 classes we use multinomial logistic regression
= Softmax regression
= Multinomial logit
= (defunct names : Maximum entropy modeling or MaxEnt
So "logistic regression" will just mean binary (2 output classes)
88
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
Pk
The softmax of an input zvector z = [z , z , ..., z ] is thus a vector itself:
max of an input vector
The denominator i=1 e z = [z1 , z2 , ..., zk ] is thus a vector itself:
i is used to normalize
1 2 all
k the values into probabilities.
Thus for example given a vector:
" #
" exp1.1,
z = [0.6, exp
(z1 ) 1.5, 1.2, 3.2,
(z2 )1.1] exp (zk ) #
softmax(z) = Pk , Pk , ..., Pk (5.31)
exp (zi=1
1 )exp (zi ) exp (z2(z) i )
i=1 exp exp
i=1 exp (zi(z
) k)
max(z)
the result softmax(z)
= Pisk , Pk , ..., Pk
i=1 exp (z i ) i=1 exp (z i ) i=1 exp (z i )
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010] 90
z e z ze z ze z
For a vector z of dimensionality i=1e k, 1 thei softmaxe
i=1
2is idefined as:
i=1e k i
softmax(z) = Pk , Pk , ..., Pk
The softmax function
Pk z i=1 e zi
i=1 e zi
i=1 e zi
nominator i=1softmax(z e i is usedi ) = Pk
to normalize
exp (zi ) all
1ik the values into proba
(5.30)
xample given P
◦ Turns
a a vector
kvector:
z
z = [z 1 ,z2 ,...,z k ] of
j=1
k arbitrary
exp (z j ) values into probabilities
enominator i=1 e is used to normalize all the values into prob
i
example givenofaanvector:
The softmax input vector z = [z1 , z2 , ..., zk ] is thus a vector itself:
z = [0.6, 1.1, 1.5, 1.2, 3.2, 1.1]
" #
z = [0.6,
exp1.1,
(z1 ) 1.5,exp
1.2,
(z23.2,
) 1.1]exp (zk )
oftmax(z) is
softmax(z) = Pk , Pk , ..., Pk (5.31)
i=1 exp (zi ) i=1 exp (zi ) i=1 exp (zi )
softmax(z) is
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
ike the sigmoid, the input to the softmax will be the dot product b 91
vector w and an input vector x (plus a bias). But now we’ll need sepa
ectors (and bias) for each of the K classes.
Softmax in multinomial logistic regression
exp (wc · x + bc )
p(y = c|x) = k (5
X
exp (w j · x + b j )
j=1
92
inomial
gative logistic with
sentiment) regression function
the absolute valuesimilarly
indicatingtohow
binary logistit
important
Fordifference
ne multinominal
Featuresthat logistic regression,
we’ll versus
in binary need separateby contrast,
weight
multinomial with separate
vectors
logistic (and weigh
biases
regression
s, a feature can be evidence for or against each individual
sses. Recall our binary exclamation point feature x5 from page class.
In 3-way multiclass
Binary: sentiment
positive weight classification,
à y=1 neg for example,
weight à y=0we must a
⇢
ument one of the 3 classes +, , or 0 (neutral). Now a feature related
1 if “!” 2 doc w5 = 3.0 and a posit
ion marks might x5 have
= a negative weight for 0 documents,
0 otherwise
+ or documents:
Multinominal: separate weights for each class:
ssification a positive weight w5 on a feature influences the clas
ositive sentiment)
Feature and a negative
Definition
⇢ weight w influences
5,+ w 5, w it
5,0 toward y
ent) with the absolute1value
if “!”indicating
2 doc how important the fe
f5 (x) 3.5 3.1 5.3
inal logistic regression,0 by
otherwise
contrast, with separate weights for
an be evidence for or against each individual class. 93
Logistic Multinomial Logistic
Regression
Regression