You are on page 1of 62

Logistic Regression

Lin Li
CS@PVAMU
Logistic Regression
 Important analytic tool in natural and social sciences
 Baseline supervised machine learning tool for
classification
 Is also the foundation of neural networks
 Two phases of logistic regression:
 Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
 Test: Given a test example x we compute p(y|x) using
learned weights w and b, and return whichever label (y = 1
or y = 0) is higher probability

2
Generative vs. Discriminative Classifiers
Suppose we're distinguishing cat from dog images

ImageNet ImageNet

 Naive Bayes is a generative classifier


 K-Nearest Neighbor, Rocchio Classifier, Logistic regression,
and Linear Regression (while being used for classification, e.g.,
Perceptron) are discriminative classifiers
3
Generative Classifiers
 Build a model of what's in a cat image
 Knows about whiskers, ears, eyes
 Assigns a probability to any image
 how cat-y is this image?

 Also, build a model of what's in a dog image

 Given a new image, run both models


to see which one fits better

4
Discriminative Classifiers
 Just try to distinguish dogs from cats

Look, dogs have collars! Let's ignore everything else

5
Generative vs. Discriminative Classifiers
Generative models Discriminative models
 Specifying joint distribution  Specifying conditional
 Full probabilistic distribution
specification for all the  Only explain the target variable
random variables  Arbitrary features can be
 Dependence assumption incorporated for modeling
has to be specified for 𝑝 𝒕𝒘
𝑝 𝒘 𝒕 and 𝑝(𝒕)  Need labeled data, only
 Flexible, can be used in suitable for (semi-) supervised
unsupervised learning learning
 e.g., Naïve Bayes  e.g. Logistic Regression
𝑙𝑖𝑘𝑒𝑙𝑖ℎ𝑜𝑜𝑑 𝑝𝑟𝑖𝑜𝑟 𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟
𝑐 = argmax 𝑃(𝑑|𝑐) 𝑃(𝑐) 𝑐 = argmax 𝑃(𝑐|𝑑)
𝑐∈𝐶 𝑐∈𝐶
6
Probabilistic Learning Classifier
 Given m input/output pairs (x(i),y(i)), the
components include:
 A feature representation of the input. For each input
observation x(i), a vector of features [x1, x2, ... , xn].
Feature j for input x(i) is xj, more completely xj(i), or
sometimes fj(x).
 A classification function that computes 𝑦, the
estimated class, via p(y|x), like the sigmoid or softmax
functions.
 An objective function for learning, like mean squared
error or cross-entropy loss.
 An algorithm for optimizing the objective function:
stochastic gradient descent.

7
Error-based Learning

 Linear Regression
 Logistic Regression

8
Regression
 Given a dataset, how to use a parameterized prediction model to
represent the relationship between the descriptive features and the
target feature?

 Note: all chapter indexed pictures (e.g., Table 7.1) in this lecture are from the textbook of John
Kelleher, et al, “Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and
Case Studies”
9
Linear Regression

 Initiate a linear equation representing the relation between


descriptive features and the target feature
 E.g., between size (x) and price (y), y = w × x + b; which is a line with
w being the slope, and b being the y intercept.
 Key → find w and b (e.g., after exploration, price = 6.47 + 0.62 × size)
10
Measuring Prediction Error
 Linear regression formula for multiple descriptive features (e.g.,
k features)
t = w1 × d1 + w2 × d2 + … + wk × dk + w0
where w1 , w2 , …, wk are the weights of the respective features
 Measure the sum of the squared error (SSE, difference between
prediction and target value) or the sum of the squared residual1.
e.g., SSE is calculated as: 𝑛
1
𝐿2 𝑀𝑤 , 𝐷 = (𝑡𝑖 − 𝑀𝑤 (𝑑𝑖 ))2
2
𝑖=1

e.g., for the size (d1) and price (t) model, after expansion,
𝑛
1
𝐿2 𝑀𝑤 , 𝐷 = (𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2
2
𝑖=1

11
Error Surface

 Weight space and error surface: for example, t = w1 × d1 + w0


 Surface is convex and there is a global minimum
12
Gradient Descent
 Goal: adjust the weights to minimize the residual
 How to: find the optimal weights at the point where the partial
derivatives of the error surface with respect to w0 and w1 are
equal to 0, i is the index of the data with range [1, n]
𝑛
𝜕 1
(𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2 = 0
𝜕𝑤[0] 2
𝑖=1
𝑛
𝜕 1
(𝑡𝑖 − (𝑤 0 + 𝑤[1] × 𝑑𝑖 [1]))2 = 0
𝜕𝑤[1] 2
𝑖=1

 For multiple variable linear regression,


𝑚

𝑀𝑤 𝑑 = 𝑤 0 + 𝑤 1 × 𝑑 1 + ⋯ + 𝑤 𝑚 × 𝑑 𝑚 = 𝑤 0 + 𝑤[𝑗] × 𝑑 [𝑗]
𝑖=1

apply partial derivatives similarly

13
Gradient Descent (cont’d)

 Idea: starts by initiate the weights to “random” values, then adjust it along the
deepest direction (slope) of the current point toward the minimum.
14
Gradient Descent (cont’d)
 Gradient with respect to each weight: (partial derivative by
chain rule) 𝜕
𝐿 𝑀 ,𝐷 =
𝜕 1
(𝑡 − 𝑀 (𝑑))2
2 𝑤 𝑤
𝜕𝑤 𝑗 𝜕𝑤 𝑗 2
𝜕
= (𝑡 − 𝑀𝑤 (𝑑) × 𝑡 − 𝑀𝑤 (𝑑)
𝜕𝑤 𝑗
𝜕
= (𝑡 − 𝑀𝑤 (𝑑) × 𝑡 − (𝑤 ∙ 𝑑)
𝜕𝑤 𝑗
= (𝑡 − 𝑀𝑤 (𝑑) × −𝑑[𝑗]
 Update the weights one by one by tuning them toward the
opposite direction of the calculated gradient, using a learning
rate (α) to decide the change amount
𝑛

𝑤 𝑗 ←𝑤 𝑗 +𝑎× ((𝑡𝑖 − 𝑀𝑤 𝑑𝑖 ) × 𝑑𝑖 [𝑗])


𝑖=1
𝑒𝑟𝑟𝑜𝑟𝐷𝑒𝑙𝑡𝑎(𝐷,𝑤[𝑗])

 Then, re-measure the errors and keep updating the weights


as above until the error is less than a threshold (e.g., 0.1)
15
Error Change with Weights Update

16
Example
Rental Price = w0 + w1 × size + w2 × floor + w3 × Broadband Rate
Set α = 0.00000002, and initialize all weights between [-0.2, +0.2]

17
Example (cont’d)

After 100 iterations, w0 = -0.1513, w1 = 0.6270, w2 = -0.1781, w3 = 0.0714, and


the squared error of the final model is 2913.5
Rental Price = w0 + w1 × size + w2 × floor + w3 × Broadband Rate
18
Batch and Stochastic Gradient Descent
 Batch Gradient Descent: weight adjustment at each iteration
based on summation of the squared error made by the
candidate model (linear regression formula) for the whole
training set
 Stochastic Gradient Descent: weight adjustment based on the
error in the predication made by the candidate model (linear
regression formula) for each training instance individually
 Mini-batch (a compromise of both): train/adjust the weights
each time on a group of m examples that is less than the whole
dataset (if m = 1, same as stochastic gradient descent; if m = size
of the whole dataset, same as batch gradient descent)
 e.g., batch size is 512 or 1024

19
Interpreting Multivariable Linear Models
 Signs of Weights: indicating whether the feature has a positive or
negative impact on the prediction
 Magnitudes of Weights: how much the target value changes
according to the descriptive feature’s unit
 Mistake: the features with higher weights are more important
than those with lower weights
 Each descriptive feature has a different varying scale! Need to
performance statistical significance test before comparison.
 A test-statistic is computed (e.g., t-test) to decide whether the
feature has significant impact on the target
 P-value: the probability that the test-statistic is greater than or
equal to the one computed being the result of chance
 Null-hypothesis: the feature has no significant impact on the target
 P-value is compared with pre-defined significance threshold. If it is
≤ threshold (e.g., 5% or 1%), null hypothesis is rejected.
20
Interpreting Multivariable Linear Models (Example)

 First: calculate the standard error of the


model (n is sample size)

 Second: calculate the standard error on a


specific descriptive feature

 Third : calculate t-statistic for this feature


 Finally: check p-value associated with the
calculated t-test value in a look-up table,
and then Compare results

21
Interpreting Multivariable Linear Models (Example)

 Often shown in data analytics (e.g., R or Python)

22
Impact of Learning Rate (α)
 Large α:
converge
fast at
beginning,
but may
miss the
global
optimal
 Small α:
won’t miss
the global
optimal,
but may
converge
very slowly

23
Learning Rate (α) with Weight Decay

 Reason: reduce the residual (sum of squared error) fast at beginning, and
slow down gradually to converge to the global optimal
α0 is the initial learning rate (e.g., 0.5)
c is a constant which controls how quickly the rate decays (e.g., 10),
τ is the current iteration number of the gradient descent
24
Handling Categorical Descriptive Features

 Rental Price = w0 + w1 × size + w2 × floor + w3 × broadband rate +


w4 × energy rating A + w5 × energy rating B +
w6 × energy rating C
 Note: you can save one extra feature (e.g., use ratings A and B only)
Rental Price = w0⃰ + w1⃰ × size + w2⃰ × floor + w3 ⃰ × broadband rate +
w4 ⃰ × energy rating A + w5⃰ × energy rating B
where w0 + w6 = w0⃰ , w1 = w1⃰ , w2 = w2⃰ , w3 = w3⃰ , w4 - w6 = w4⃰ , w5 - w6 = w5⃰
25
Predicting Categorical Targets

 Train a linear model (line or hyper plane), named decision surface, to classify
the data.
f (RPM, vibration) = w0 + w1 × RPM + w2 × vibration
Status = f(RPM, vibration) > 0 ? Good : Faulty

26
Error-based Learning

 Linear Regression
 Logistic Regression

27
Regression for Classification
 Linear Regression
𝑦 ← 𝑤𝑇𝑋
 Relationship between a scalar dependent variable 𝑦 and one
or more explanatory variables Y is discrete in a
classification problem!

y
1 𝑤 𝑇 𝑋 > 0.5 1.00
𝑦=
0 𝑤 𝑇 𝑋 ≤ 0.5
0.75

What if we have 0.50 Optimal


an outlier? 0.25 regression
model
0.00 x

28
Regression for Classification
 Logistic Regression
1
𝑝 𝑦𝑥 =𝜎 𝑤𝑇𝑋 =
1 + exp(−𝑤 𝑇 𝑋)
 Directly modeling of class posterior Sigmoid function

P(y|x)

1.00

0.75
What if we have 0.50
an outlier?
0.25

0.00 x

29
Why Sigmoid function?
 Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1+
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)

Binomial P(y|x)

𝑃 𝑦=1 =𝛼 1.00

0.75 𝑃 𝑋 𝑦 = 1 = 𝑁(𝜇1 , 𝛿 2 )
𝑃 𝑋 𝑦 = 0 = 𝑁(𝜇0 , 𝛿 2 ) 0.50
0.25 Normal with identical variance
x
30
Why Sigmoid function?
 Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1 1
= =
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0) 𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1+ 1 + exp − ln
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1) 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
𝑉
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1) 𝑃(𝑦 = 1) 𝑃(𝑥𝑖 |𝑦 = 1)
ln = ln + ln
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0) 𝑃(𝑦 = 0) 𝑃(𝑥𝑖 |𝑦 = 0)
𝑖=1
𝑉 2 2
𝛼 𝜇1𝑖 − 𝜇0𝑖 𝜇1𝑖 − 𝜇0𝑖
= ln + 2 𝑥𝑖 −
1−𝛼 𝛿𝑖 2𝛿𝑖2
𝑖=1
𝑉
Origin of the name: 𝜇1𝑖 − 𝜇0𝑖 1 𝑥−𝜇 2
= 𝑤0 + 𝑥𝑖 𝑃 𝑥𝑦 = 𝑒

2𝛿 2
logit function 𝛿𝑖2 𝛿 2𝜋
𝑖=1
= 𝑤0 + 𝑤𝑇𝑋 = 𝑤𝑇𝑋
31
Why Sigmoid function?
 Logistic Regression
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
𝑃 𝑦=1𝑋 =
𝑃 𝑋 𝑦 = 1 𝑃 𝑦 = 1 + 𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1+
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1
=
𝑃 𝑋 𝑦 = 1 𝑃(𝑦 = 1)
1 + exp − ln
𝑃 𝑋 𝑦 = 0 𝑃(𝑦 = 0)
1
=
1 + exp(−𝑤 𝑇 𝑋)
Generalized Linear Model

Note: it is still a linear relation among the features!

32
Another Dataset

 logistic model is more functional


in derivative deduction
 Output range [0, 1], represent
either the normalized numerical
result or the probability of the
query falling into one target
category
1
𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐 𝑥 =
1 + 𝑒 −𝑥
33
Logistic Regression

1
 Sigmoid function: replace x in the logistic formula 𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐 𝑥 = with
1 + 𝑒 −𝑥
the linear model described before, i.e.,
1
𝑀 𝑅𝑃𝑀, 𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛 =
1 + 𝑒 −1∗(𝑤0+𝑤1∗𝑅𝑃𝑀+𝑤2∗𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛)
 Then, use gradient descent method to train the weights of the logistic model
34
Turning Probability into a Classifier
1
𝑃 𝑦 =1 =𝜎 𝑤∙𝑥+𝑏 =
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )

1
𝑃 𝑦 =0 =1−𝜎 𝑤∙𝑥+𝑏 =1−
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )
exp(− 𝑤 ∙ 𝑥 + 𝑏 )
= = 𝜎 −(𝑤 ∙ 𝑥 + 𝑏)
1 + exp(− 𝑤 ∙ 𝑥 + 𝑏 )

1 𝑖𝑓 𝑃 𝑦 = 1 𝑥 > 0.5 0.5 here is called the decision boundary


𝑦=
0 𝑂𝑡𝑕𝑒𝑟𝑤𝑖𝑠𝑒

𝑁𝑜𝑡𝑒: 1 − 𝜎 𝑥 = 𝜎 −𝑥

35
Weight Adjustment
 Goal: adjust the weights to minimize the residual
 Apply gradient descent to the sum of squared error with
respect to wj equal to 0 (partial derivative by chain rule)
𝜕
𝐿 𝑀 , 𝐷 = (𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗]
𝜕𝑤 𝑗 2 𝑤

Mw(d) represent the calculated output (prediction), t is the


target value coming with the training data.
 The weight adjustment amount by each training data is:
𝑤 𝑗 ← 𝑤 𝑗 + 𝛼 × (𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗]
 Considering all training data, the weight will be adjusted by
𝑛

𝑤 𝑗 ←𝑤 𝑗 +𝛼× ((𝑡 − 𝑀𝑤 𝑑 ) × 𝑀𝑤 (𝑑) × (1 − 𝑀𝑤 𝑑 ) × 𝑑[𝑗])


𝑖=1
i is the index of the training data, range [1,n]
36
Weight Update through Batch Gradient Descent

37
Logistic Model Change with Weights Update

1
Final model: M𝑤 RPM, Vibration =
1 + 𝑒 −(−0.4077+4.1697∗𝑅𝑃𝑀+6.0460∗𝑉𝑖𝑏𝑟𝑎𝑡𝑖𝑜𝑛)
38
Logistic Regression in Text Mining

 Sentiment Analysis: suppose we are doing binary


sentiment classification on movie review text, and
we would like to know whether to assign the
sentiment class 1 = positive or 0 = negative to the
following review

 e.g., what is the sentiment of the follow text?


It's hokey. There are virtually no surprises, and the writing is
second-rate. So why was it so enjoyable? For one thing, the
cast is great. Another nice touch is the music. I was overcome
with the urge to get off the couch and start dancing. It sucked
me in, and it'll do the same to you.

39
Feature Extraction

40
Classifying Sentiment for Input
 Let’s assume for the moment that we’ve already learned a
real-valued weight for each of the features, and that the 6
weights corresponding to the 6 features are as follows:
[2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
and b = 0.1
 Thus, we have the following and classify the text as “positive”
𝑝 + 𝑥 =𝑝 𝑦 =1 𝑥 =𝜎 𝑤∙𝑥+𝑏
= 𝜎 2.5, −5.0, −1.2, 0.5, 2.0,0.7 ∙ [3,2,1,3,0,4.19] + 0.1
= 𝜎 .833 = 0.70

𝑝 − 𝑥 = 𝑝 𝑦 = 0 𝑥 = 1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 = 0.30

41
Build Features
 By examining the training set with an eye to linguistic
intuitions and the linguistic literature on the domain
 Created automatically via feature templates
 Example: Period Disambiguation
End of sentence
This ends in a period.
The house at 465 Main St. is new.
Not end

42
Weight Adjustment: Revisit
 If the loss function f is twice differentiable and the domain is
real, then we can characterize it as follows:
 f is convex if and only if f ”(x) ≥ 0 for all x. Hence if the double
derivative of the loss function is ≥ 0, then we can claim it to be
convex (i.e., guarantee to find global optima)
 For logistic regression, if the loss function is sum of squared
error (SSE), it is NOT convex! (i.e., may stuck at local optima)
2
1
𝑓 𝑥 = (𝑦 − 𝑦)2 = 𝑦 −
1 + exp − 𝑤 ∙ 𝑥 + 𝑏
𝜕2𝑓 𝑥 2
Not always ≥0
= −2 𝑦 − 2𝑦𝑦 − 2𝑦 + 3𝑦 ∗ 𝑥𝑗 ∗ 𝑥𝑗 ∗ 𝑦(1 − 𝑦)
𝜕𝑤𝑗 2
 Is there a convex loss function? —Yes, cross-entropy loss!
https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c

43
Training of Logistic Regression
 Goal: maximize probability of the correct label p(y|x)
 Since there are only 2 discrete outcomes (0 or 1), we can
express the probability p(y|x) from our classifier as:
𝑝(𝑦|𝑥) = 𝑦 𝑦 (1 − 𝑦)1−𝑦
If y =1, simplifies to 𝑦, if y =0, simplifies to 1- 𝑦.
 Take the log of both sides — mathematically handy maximize
log p(y|x) will maximize p(y|x)
log 𝑝 𝑦 𝑥 = log 𝑦 𝑦 (1 − 𝑦)1−𝑦 = [𝑦 ∗ log 𝑦 + (1 − 𝑦) ∗ log(1 − 𝑦)]
 For a training dataset of N instances, maximize the log
conditional likelihood:
𝑁 𝑁 𝑁

log 𝑝(𝑦𝑖 |𝑥𝑖 ) = log(𝑦𝑖 𝑦𝑖 (1 − 𝑦𝑖 )1−𝑦𝑖 ) = 𝑦𝑖 ∗ log 𝑦𝑖 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝑦𝑖 )


𝑖=1 𝑖=1 𝑖=1

44
Cross-Entropy Loss: 𝑳𝑪𝑬 𝒚, 𝒚
 Flip the sign of target function to turn it into a loss
function 𝐿𝐶𝐸 𝑦, 𝑦 : (to minimize)
𝐿𝐶𝐸 𝑦, 𝑦 = −log 𝑝 𝑦 𝑥 = −[𝑦 ∗ log 𝑦 + (1 − 𝑦) ∗ log(1 − 𝑦)]

Plugging in the definition of 𝑦:


𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
 For a training dataset of N instances, 𝐿𝐶𝐸 𝑦, 𝑦 is:
𝑁

𝐿𝐶𝐸 𝑦, 𝑦 = − [𝑦𝑖 ∗ log 𝑦𝑖 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝑦𝑖 )]


𝑖=1
𝑁 x(i): the ith instance
=− [𝑦𝑖 ∗ log 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 + (1 − 𝑦𝑖 ) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 )]
𝑖=1

45
Cross-Entropy Loss: Example

 True value is y =1, given the weights values as [2.5, -5.0,


-1.2, 0.5, 2.0, 0.7] and b = 0.1, How well is our model?
𝑝 + 𝑥 = 𝑝 𝑦 = 1 𝑥 = 0.70 very good!

 How about the loss?


𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
= − log 𝜎 𝑤 ∙ 𝑥 + 𝑏 = − log 0.70 = 0.36
What about the loss if the true value is y = 0?
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]
= − log 1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 = − log 0.30 = 1.2

46
Gradient for Cross-Entropy Loss Function
 What’s the partial derivative of the loss function?
𝐿𝐶𝐸 𝑦, 𝑦 = −[𝑦 ∗ log 𝜎 𝑤 ∙ 𝑥 + 𝑏 + (1 − 𝑦) ∗ log(1 − 𝜎 𝑤 ∙ 𝑥 + 𝑏 )]

 Apply chain rule, the elegant derivative is:


𝜕𝐿𝐶𝐸 𝑦, 𝑦
= 𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 ∗ 𝑥𝑗
𝜕𝑤𝑗
Note: do the math, use textbook or the Appendix as reference
 For a training dataset of N instances, the gradient is:
𝑁 𝑁
𝜕 𝑖=1 𝐿𝐶𝐸 𝑦𝑖 , 𝑦𝑖
= 𝜎 𝑤 ∙ 𝑥(𝑖) + 𝑏 − 𝑦𝑖 ∗ 𝑥(𝑖)𝑗
𝜕𝑤𝑗
𝑖=1
x(i)j: the jth feature of the ith instance

47
Weight Adjustment for Logistic Regression
 Given a learning rate η (hyperparameter), train the model:

48
Stochastic Weight Adjustment: One-Step
 A mini-sentiment example, where y=1 (positive)
 Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
 Assume the three parameters (2 weights and 1 bias) of the
logistic regression model Θ are initialized to zeros:
w1 = w2 = b = 0, and η = 0.1
𝜕𝐿𝐶𝐸 𝑦, 𝑦
 Calculate gradient descent using: = 𝜎 𝑤∙𝑥+𝑏 −𝑦 ∗ 𝑥𝑗
𝜕𝑤𝑗
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝜕𝑤1
𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 𝑥1 𝜎 0 − 1 𝑥1 −0.5𝑥1 −1.5
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝛻𝑤,𝑏 = = 𝜎 𝑤 ∙ 𝑥 + 𝑏 − 𝑦 𝑥2 = 𝜎 0 − 1 𝑥2 = −0.5𝑥2 = −1.0
𝜕𝑤2 𝜎 𝑤∙𝑥+𝑏 −𝑦 −0.5
𝜎 0 −1 −0.5
𝜕𝐿𝐶𝐸 (𝑦, 𝑦)
𝜕𝑏 𝑤1 0 −1.5 .15
 Update weights and bias in mode: 𝜃 = 𝑤2 = 0 − η −1.0 = .1
𝑏 0 −0.5 .05
49
Overfitting
 A model that perfectly match the training data may
overfit to the data, modeling noise
 A random word that perfectly predicts y (it happens to only occur in
one class) will get a very high weight.
 Failing to generalize to a test set without this word
 A good model should be able to generalize!
 e.g., 4-gram model on tiny data will just memorize the
data, but be surprised by novel 4-grams in testing data!

50
Regularization
 How to avoid overfitting
 Regularization in regression
 Dropout in neural networks (will talk about it later)
 Regularization ― Add a regularization term R(θ) to the
loss function (for now written as maximizing log prob
rather than minimizing loss)
𝑚

𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼𝑅(𝜃)


𝜃
𝑖=1

 Idea: choose an R(θ) that penalizes large weights


 Fitting the data well with lots of big weights not as good as
fitting the data a little less well, with small weights

51
L2 Regularization (= Ridge Regression)
 The sum of the squares of the weights
 The name is because this is the (square of the) L2 norm
‖θ‖2, = Euclidean distance of θ to the origin.

 L2 regularized objective function:


𝑚 𝑛

𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼 𝜃𝑗 2


𝜃
𝑖=1 𝑗=1

52
L1 Regularization (= Lasso Regression)
 The sum of the (absolute value of the) weights
 Named after the L1 norm ‖W‖1, = sum of the absolute
values of the weights, = Manhattan distance

 L1 regularized objective function:


𝑚 𝑛

𝜃 = argmax log 𝑃 𝑦 (𝑖) |𝑥 (𝑖) − 𝛼 |𝜃𝑗 |


𝜃
𝑖=1 𝑗=1

53
Multinomial Logistic Regression
 Often we need more than 2 classes
 Positive/negative/neutral
 Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
 Classify emergency SMSs into different actionable classes
 If >2 classes, we use multinomial logistic regression,
which is also named as:
= Softmax regression
= Multinomial logistic regression
= (defunct names : Maximum entropy modeling or MaxEnt)
 So "logistic regression" just means binary classification

54
Multinomial Logistic Regression
 The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
 A generalization of the sigmoid called softmax, takes a
vector z = [z1, z2, ..., zk] of k arbitrary values, and outputs a
probability distribution
 each value in the range [0,1]
 all the values summing to 1
exp 𝑧𝑖
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧𝑖 = 𝑘 1≤𝑖≤𝑘
𝑗=1 exp 𝑧𝑗

𝑘 𝑧𝑖
 Denominator 𝑖=1 𝑒 normalizes values into probabilities
exp 𝑧1 exp 𝑧2 exp 𝑧𝑘
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧 = 𝑘
, 𝑘
,…, 𝑘
𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖

55
Example of Softmax
 Given a vector z = [z1, z2, ..., zk] of k arbitrary values:
𝑧 = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]
 Using softmax function:
exp 𝑧1 exp 𝑧2 exp 𝑧𝑘
𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑧 = 𝑘 , 𝑘 ,…, 𝑘
𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖 𝑖=1 exp 𝑧𝑖
 The probabilities of the class are:
[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]

56
Features in Multinomial Logistic Regression
 Softmax in multinomial logistic regression
 Input is still the dot product between weight vector w and
input vector x
 But now we’ll need separate weight vectors for each of the
K classes
 Features in binary logistic regression: one weight each
feature
w5 = 3.0

 Features in multinomial logistic regression: k weights


each feature (i.e., one weight per each class)

57
Binary vs. Multinomial Logistic Regression

58
What you should know
 Linear Regression
 Loss Function and Gradient Descent Optimization
 Mean Square Error
 Cross-Entropy
 Logistic Regression
 Binary Classes
 Multinomial Classes
 Overfitting and Regularization

59
Reading
 SLP Chapter 5
 Chapter 7 of “Machine Learning for Predictive Data
Analytics: Algorithms, Worked Examples, and Case
Studies”, MIT Press, 2015

Note: SLP: “Speech and Language Processing”, check


slide 30 of lecture 1 for the acronyms of the textbooks

60
Acknowledgement
 The slides of this lecture were prepared based on the
course materials of the following scholars:
 Text contents of John Kelleher, et al, “Machine Learning for Predictive
Data Analytics: Algorithms, Worked Examples, and Case Studies”
 Textbook and slides of Dan Jurafsky, Stanford University, “Speech and
Language Processing”
 Course materials of many other higher educators (e.g., Hongning Wang
from University of Virginia, Greg Durrett from UT Austin, etc.)

61
Appendix A: Deriving the Gradient Equation

62

You might also like