
Lecture 02:

“Supervised” Machine Learning

Pratik Desai
Asst. Prof., MCA Dept., NMITD
Mob No: 9867498362

Adapted from Chapter 18, "Learning from Examples," of Artificial Intelligence: A Modern Approach, 3rd Ed., by Stuart Russell & Peter Norvig

Lecture 1: Introduction 1/28


Learning Agents
• Agents that can improve their behavior through diligent study of
their own experiences.

• An agent is learning if it improves its performance on future
tasks after making observations about the world.

Supervised Learning:
• One class of learning problem: from a collection of input-output
pairs, learn a function that predicts the output for new inputs.

• Although this seems restricted, it has vast applicability.

Roger Grosse and Jimmy Ba


Lecture 2: Supervised Machine Learning 1/
Why would we want an agent to learn?
 If the design of the agent can be improved, why wouldn't the
designers just program in that improvement to begin with?

 Three main reasons why we need learning ability:


 Programmers cannot anticipate all possible situations that
the agent can find itself in
 Programmers cannot anticipate all changes over time – the
world is ever changing
 For certain tasks, human programmers have no idea how
to program a solution themselves – example: recognizing
faces, understanding speech, driving a car, etc.

Supervised Learning is Inductive Learning

 Inductive Learning: learning a (possibly incorrect) general function or
rule from specific input-output pairs (from facts to principles).

 Supervised Learning is a form of inductive learning.

 Deductive (or Analytical) Learning: going from a known general rule to
a new rule that is logically entailed (from principles to facts), but is
useful because it allows more efficient processing.

Formalizing the task of Supervised Learning
 The task of supervised learning is this:
• Given a training set of N example input-output pairs
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁, 𝑦𝑁)
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥),
• discover a function ℎ that approximates the true function 𝑓.
 Here 𝑥 and 𝑦 can be any value; they need not be numbers.
 The function ℎ is a hypothesis.
 Learning is a search through the space of possible hypotheses for
one that will perform well, even on new examples beyond the
training set.
 To measure the accuracy of a hypothesis we give it a test set of
examples that are distinct from the training set.
 We say a hypothesis generalizes well if it correctly predicts the
value of 𝑦 for novel examples.
 When the function 𝑓 is stochastic – not strictly a function of 𝑥 –
what we have to learn is a conditional probability distribution, 𝑃(𝑌 | 𝑥).
Classification vs. Regression

 When the output 𝑦 is one of a finite set of values (such as sunny,
cloudy or rainy), the learning problem is called classification.

 Binary (or Boolean) classification: classification where there are
only two possible values for the output 𝑦.

 When the output 𝑦 is a number (such as tomorrow's temperature),
the learning problem is called regression.

 Technically, solving a regression problem is finding a
conditional expectation or average value of 𝑦, because the
probability that we have found exactly the right real-valued
number for 𝑦 is 0.

Supervised Learning as Function (Curve) Fitting

 We have some example data points in the (𝑥, 𝑦) plane, where
𝑦 = 𝑓(𝑥).

 We don't know what 𝑓 is, but we will approximate it with a
function ℎ selected from a hypothesis space 𝐻.

 For this example, we will take the hypothesis space 𝐻 to be
the set of all polynomials, such as 𝑥^5 + 3𝑥^2 + 2.

 If a hypothesis function ℎ fits all the data points, it's called a
consistent hypothesis because it agrees with all the data.

Two consistent hypotheses: (a) linear, (b) non-linear

(a) Example (𝑥, 𝑓(𝑥)) pairs and a consistent, linear hypothesis.
(b) A consistent, degree-7 polynomial for the same data set.

The fundamental problem in Inductive Learning

Source: “Bringing Out Occam’s razor in peer-review” (Nature), Credit: Matteo De Ventra

How do we choose from among multiple consistent hypotheses?


Solution: Apply Ockham’s razor !!

Source: www.phdcomics.com

The Fundamental Problem in Inductive Learning
 The previous slide shows two hypothesis functions: (a) a linear
function – a straight line fitting the data; (b) a degree-7 polynomial
function – a complex curve fitting the data.

 Both are consistent with the same data.

 This illustrates a fundamental problem in inductive learning – how
do we choose from multiple consistent hypotheses?

 Solution – use the principle called Ockham's razor.

 The principle of Ockham's razor – named after the 14th-century
English philosopher William of Ockham, who argued against all sorts of
complications – advises us to prefer the simplest hypothesis
consistent with the data.

 Since a degree-1 polynomial is simpler than a degree-7 polynomial,
hypothesis (a) should be preferred over hypothesis (b).
A hypothesis that fits need not generalize, and vice versa

(c) A different dataset, which admits two hypotheses:
an exact degree-6 polynomial that fits, or an approximate linear fit.

Expanding the hypothesis space H

(c) A different dataset, which admits two hypotheses: an exact
degree-6 polynomial that fits, or an approximate linear fit.
(d) A simple, exact sinusoidal fit to the same dataset.

Note: Here we expand the hypothesis space H to include polynomials over
both 𝑥 and sin 𝑥 [e.g., functions like 𝑎𝑥 + 𝑏 + 𝑐 sin(𝑥)].
Assumptions & Issues in Evaluating & Choosing the Best Hypothesis

 Stationarity Assumption: past data (used for training) will be similar in
nature to the future data the model will face in the real world.

 How to define the best fit? Errors vs. loss (and thus error rates vs.
loss functions).

 The error rate of a hypothesis is the proportion of mistakes it
makes – the proportion of times that ℎ(𝑥) ≠ 𝑦 for an (𝑥, 𝑦) example.

 Just because a hypothesis ℎ has a low error rate on the training set does
not mean that it will generalize well.

Evaluation of a Hypothesis
 To get an accurate evaluation we need to test the hypothesis on a set of
examples it has not seen yet.
 Holdout cross-validation: randomly split the available data into a training
set, from which the learning algorithm produces ℎ, and a test set, on which
the accuracy of ℎ is evaluated.
  It fails to use all the available data. If we use half the data for the test
set, then we are only training on half the data, and we may get a poor
hypothesis.
  On the other hand, if we reserve only 10% of the data for the test set,
then we may, by statistical chance, get a poor estimate of the actual
accuracy.
 𝒌-fold cross-validation: We can squeeze more out of the data and still get
an accurate estimate. Each example serves double duty – as training data
and as test data.
  First we split the data into 𝑘 equal subsets. We then perform 𝑘 rounds
of learning; on each round 1/𝑘 of the data is held out as a test set and the
remaining examples are used as training data.
  The average test-set score of the 𝑘 rounds should then be a better
estimate than a single score. Popular values of 𝑘: 5 and 10.
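The k-fold procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' code: `learn` and `error_rate` are caller-supplied placeholders, and the toy majority-label learner at the bottom is invented purely to make the sketch runnable.

```python
import random

def k_fold_cv(examples, learn, error_rate, k=5, seed=0):
    """Estimate a learner's error by k-fold cross-validation.

    `learn` fits a hypothesis h from a training list; `error_rate`
    scores h on a held-out list. Both are caller-supplied.
    """
    data = examples[:]
    random.Random(seed).shuffle(data)          # shuffle once, then split
    folds = [data[i::k] for i in range(k)]     # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]                        # 1/k of the data held out
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        h = learn(train)
        scores.append(error_rate(h, test))
    return sum(scores) / k                     # average of the k rounds

# Toy use: "learn" the majority label, score by fraction misclassified.
examples = [(x, x >= 5) for x in range(10)]
learn = lambda train: max(set(y for _, y in train),
                          key=[y for _, y in train].count)
err = lambda h, test: sum(y != h for _, y in test) / len(test)
estimate = k_fold_cv(examples, learn, err, k=5)
```

Setting `k = len(examples)` in this same sketch gives the leave-one-out (LOOCV) case mentioned on the next slide.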
Evaluation of a Hypothesis

 Leave-one-out cross-validation (LOOCV): the extreme case of 𝑘-fold
cross-validation, where 𝑘 = 𝑛.

From error rates to loss

 Consider the problem of classifying email messages as spam or non-spam.

 It is worse to classify non-spam as spam (and thus potentially miss an
important message) than to classify spam as non-spam (and thus
suffer a few seconds of annoyance).

 So a classifier with a 1% error rate, where almost all the errors were
classifying spam as non-spam, would be better than a classifier with only
a 0.5% error rate, if most of those errors were classifying non-spam as
spam.

Loss Function as Amount of Utility Lost

 The loss function 𝐿(𝑥, 𝑦, ŷ) is defined as the amount of
utility lost by predicting ℎ(𝑥) = ŷ when the correct answer
is 𝑓(𝑥) = 𝑦.

 Often a simplified version is used, 𝐿(𝑦, ŷ), that is
independent of 𝑥.

 We can say it is 10 times worse to classify non-spam as
spam than vice versa:
𝐿(𝑠𝑝𝑎𝑚, 𝑛𝑜𝑠𝑝𝑎𝑚) = 1, 𝐿(𝑛𝑜𝑠𝑝𝑎𝑚, 𝑠𝑝𝑎𝑚) = 10

 Note that 𝐿(𝑦, 𝑦) is always zero: by definition there is no
loss when you guess exactly right.
L1 Loss, L2 Loss and L0/1 Loss
 Loss values for discrete-valued functions vs. loss values for
real-valued functions:

 𝑳𝟏 loss (the absolute-value loss):
𝐿1(𝑦, ŷ) = |𝑦 − ŷ|

 𝑳𝟐 loss (the squared-error loss):
𝐿2(𝑦, ŷ) = (𝑦 − ŷ)²

 𝑳𝟎/𝟏 loss (the 0/1 loss): a loss of 1 for an incorrect answer, 0 otherwise
𝐿0/1(𝑦, ŷ) = 0 if 𝑦 = ŷ, else 1
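The three losses are one-liners in code. A minimal sketch (the function names `L1`, `L2`, `L01` are my own):

```python
def L1(y, yhat):              # absolute-value loss |y - yhat|
    return abs(y - yhat)

def L2(y, yhat):              # squared-error loss (y - yhat)^2
    return (y - yhat) ** 2

def L01(y, yhat):             # 0/1 loss: 1 for any incorrect answer
    return 0 if y == yhat else 1

# L2 penalizes large errors more heavily than L1:
assert L1(10.0, 7.0) == 3.0 and L2(10.0, 7.0) == 9.0
# L0/1 applies to discrete-valued outputs such as the spam example:
assert L01("spam", "spam") == 0 and L01("spam", "nospam") == 1
```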

Generalization Loss
 The learning agent can theoretically maximize its expected
utility by choosing the hypothesis that minimizes expected
loss over all input-output pairs it will see.

 Let 𝜀 be the set of all possible input-output examples.

 The expected generalization loss for a hypothesis ℎ (with
respect to loss function 𝐿) is

GenLoss_𝐿(ℎ) = Σ_{(𝑥,𝑦)∈𝜀} 𝐿(𝑦, ℎ(𝑥)) 𝑃(𝑥, 𝑦)

 And the best hypothesis, ℎ*, is the one with the minimum
expected generalization loss:

ℎ* = argmin_{ℎ∈𝐻} GenLoss_𝐿(ℎ)
Estimating Generalization Loss with Empirical Loss

 Because 𝑃(𝑥, 𝑦) is not known, the learning agent can only
estimate generalization loss with the empirical loss on a set of
examples, 𝐸 (of size 𝑁):

EmpLoss_{𝐿,𝐸}(ℎ) = (1/𝑁) Σ_{(𝑥,𝑦)∈𝐸} 𝐿(𝑦, ℎ(𝑥))

 The estimated best hypothesis ĥ* is then the one with
minimum empirical loss:

ĥ* = argmin_{ℎ∈𝐻} EmpLoss_{𝐿,𝐸}(ℎ)
Regression and Classification
with Linear Models
Linear Regression

 Hypothesis space: the class of all linear functions of
continuous-valued inputs.

 Univariate Linear Regression

 Multivariate Linear Regression

Univariate Linear Regression
 A univariate linear function (a straight line) with input 𝑥 and
output 𝑦 has the form 𝑦 = 𝑤1𝑥 + 𝑤0.

 𝑤0 and 𝑤1 are real-valued coefficients to be learned.

 We think of the coefficients as weights (hence the letter 𝑤);
the value of 𝑦 is changed by changing the relative weight
of one term or another.

 Let 𝒘 be the vector [𝑤0, 𝑤1], and let us define
ℎ𝒘(𝑥) = 𝑤1𝑥 + 𝑤0

 The task of finding the ℎ𝒘 that best fits the data is called
linear regression.

Doing Linear Regression
• We see an example of a training set of 𝑁 points in the (𝑥, 𝑦)
plane; each point represents the size in square feet and the
price of a house offered for sale.

• To fit a line to the data, all we have to do is find the values of
the weights [𝑤0, 𝑤1] that minimize the empirical loss.

Figure: Data points of price versus floor space of houses for
sale in Berkeley, CA, in July 2009, along with the linear-function
hypothesis that minimizes squared-error loss: 𝑦 = 0.232𝑥 + 246.

Doing Linear Regression

• Usually empirical loss is computed using the squared loss
function, 𝐿2, summed over all the training examples:

𝐿𝑜𝑠𝑠(ℎ𝒘) = Σ_{𝑗=1}^{𝑁} 𝐿2(𝑦𝑗, ℎ𝒘(𝑥𝑗))
         = Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − ℎ𝒘(𝑥𝑗))²
         = Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))²

• We would like to find 𝒘* = argmin_𝒘 𝐿𝑜𝑠𝑠(ℎ𝒘)

Figure: Plot of the loss function Σ_{𝑗=1}^{𝑁} (𝑤1𝑥𝑗 + 𝑤0 − 𝑦𝑗)² for
various values of 𝑤0, 𝑤1. Note that the loss function is convex,
with a single global minimum.

Doing Linear Regression – It Has A Mathematical Solution

• The sum Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))² is minimized when its partial
derivatives with respect to 𝑤0 and 𝑤1 are zero,

• i.e., when – Equations (1)

∂/∂𝑤0 Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))² = 0
and
∂/∂𝑤1 Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))² = 0

• These equations have a unique solution: – Equations (2)

𝑤1 = (𝑁 Σ 𝑥𝑗𝑦𝑗 − (Σ 𝑥𝑗)(Σ 𝑦𝑗)) / (𝑁 Σ 𝑥𝑗² − (Σ 𝑥𝑗)²);  𝑤0 = (Σ 𝑦𝑗 − 𝑤1 (Σ 𝑥𝑗)) / 𝑁

• In our case, the solution is 𝒘𝟏 = 𝟎.𝟐𝟑𝟐, 𝒘𝟎 = 𝟐𝟒𝟔, and the line with those
weights is shown as a dashed line in figure (a).
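Equations (2) translate directly into code. A minimal sketch; the four data points below are made up for illustration (they are not the Berkeley housing data):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit y = w1*x + w0 (Equations 2)."""
    N = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of x_j * y_j
    sxx = sum(x * x for x in xs)               # sum of x_j^2
    w1 = (N * sxy - sx * sy) / (N * sxx - sx ** 2)
    w0 = (sy - w1 * sx) / N
    return w0, w1

# Points generated from y = 2x + 1 are recovered exactly:
w0, w1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
assert abs(w1 - 2.0) < 1e-9 and abs(w0 - 1.0) < 1e-9
```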

Doing Linear Regression – It Has A Mathematical Solution

In our case, the solution is 𝑤1 = 0.232, 𝑤0 = 246, and the line with those
weights is shown as a dashed line in figure (a).

Figure (a): Data points of price versus floor space of houses for sale in
Berkeley, CA, in July 2009, along with the linear-function hypothesis that
minimizes squared-error loss: 𝑦 = 0.232𝑥 + 246.
Figure (b): Plot of the loss function Σ_{𝑗=1}^{𝑁} (𝑤1𝑥𝑗 + 𝑤0 − 𝑦𝑗)² for various
values of 𝑤0, 𝑤1. Note that the loss function is convex, with a single
global minimum.

Gradient Descent Learning – When There Is No Mathematical Solution
 Involves adjusting weights in the weight space to minimize loss.
 When we go beyond linear models, the equations defining
minimum loss will often have no closed-form solution.
 In such cases, we face a general optimization search problem
in a continuous weight space.
 Such problems can be addressed by a hill-climbing algorithm that
follows the gradient of the function to be optimized.
 We use gradient descent because we want to minimize
the loss. Here α is the learning rate (step size):

𝑤𝑖 ← 𝑤𝑖 − α ∂/∂𝑤𝑖 𝐿𝑜𝑠𝑠(𝒘) – Equation (3)
Mathematical Derivations – Stochastic Gradient Descent

 We work out the partial derivative in the simplified case of one
training example (𝑥, 𝑦):

∂/∂𝑤0 (𝑦 − ℎ𝒘(𝑥))² = −2(𝑦 − ℎ𝒘(𝑥))
∂/∂𝑤1 (𝑦 − ℎ𝒘(𝑥))² = −2(𝑦 − ℎ𝒘(𝑥)) · 𝑥

 Applying Equation (3), the final learning rules are (the −2 folds into α):

𝑤0 ← 𝑤0 + α (𝑦 − ℎ𝒘(𝑥));  𝑤1 ← 𝑤1 + α (𝑦 − ℎ𝒘(𝑥)) · 𝑥 – Equations (4)

Stochastic Gradient Descent

 We consider only a single training point at a time, taking a step
after each one using Equations (4), given below:

𝑤0 ← 𝑤0 + α (𝑦 − ℎ𝒘(𝑥));  𝑤1 ← 𝑤1 + α (𝑦 − ℎ𝒘(𝑥)) · 𝑥 – Equations (4)

 Can be used in an online setting, where new data come in one at
a time.
 In an offline setting we cycle through the same data as many times
as necessary, taking a step after considering each single
example.
 Often faster than batch gradient descent.
 With a fixed learning rate α, convergence is not guaranteed
(there is a possibility of never-ending oscillations around the minimum).
 A schedule of decreasing learning rates does guarantee
convergence.
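The stochastic update of Equations (4) can be sketched in a few lines. The toy data (points on the line y = 2x + 1), the learning rate, and the epoch count are my own illustrative choices:

```python
def sgd_fit(points, alpha=0.01, epochs=2000):
    """Stochastic gradient descent for h_w(x) = w1*x + w0 (Equations 4)."""
    w0 = w1 = 0.0
    for _ in range(epochs):                   # offline setting: cycle repeatedly
        for x, y in points:                   # one step per single example
            err = y - (w1 * x + w0)           # prediction error on this point
            w0 += alpha * err                 # the -2 is folded into alpha
            w1 += alpha * err * x
    return w0, w1

# Approaches the exact fit y = 2x + 1 for a small enough alpha:
w0, w1 = sgd_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

Because this toy data is noise-free, the fixed learning rate happens to converge here; on noisy data the oscillation caveat on the slide applies.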

Mathematical Derivations – Batch Gradient Descent

 For 𝑁 training examples, we want to minimize the sum of the
individual losses for each example.

 Because the derivative of a sum is the sum of the derivatives,
we have the following update rule:

𝑤0 ← 𝑤0 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗));  𝑤1 ← 𝑤1 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗)) · 𝑥𝑗 – Equations (5)

Batch Gradient Descent

 We consider all 𝑁 training examples at each step and
minimize the sum of the individual losses using
Equations (5), given below:

𝑤0 ← 𝑤0 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗));  𝑤1 ← 𝑤1 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗)) · 𝑥𝑗 – Equations (5)

 Convergence to the unique global minimum is guaranteed, as
long as we pick a small enough α.

 Convergence may be very slow: we have to cycle through all
the training data for every step, and there may be many steps.
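For contrast with the stochastic version, a sketch of the batch update of Equations (5): every step sums the errors over all N examples before touching the weights. The toy data and hyperparameters are again my own choices:

```python
def batch_gd_fit(points, alpha=0.01, steps=2000):
    """Batch gradient descent (Equations 5): each step sums over all N examples."""
    w0 = w1 = 0.0
    for _ in range(steps):
        # One full pass over the training data per weight update:
        errs = [y - (w1 * x + w0) for x, y in points]
        w0 += alpha * sum(errs)
        w1 += alpha * sum(e * x for e, (x, _) in zip(errs, points))
    return w0, w1

# Same noise-free toy data as before; converges to y = 2x + 1:
w0, w1 = batch_gd_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```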

Multivariate Linear Regression
 We can easily extend to multivariate linear regression problems, in
which each example 𝒙𝑗 is an 𝑛-element vector.

 Our hypothesis space is the set of functions of the form
ℎ𝒘(𝒙𝑗) = 𝑤0 + 𝑤1𝑥𝑗,1 + ⋯ + 𝑤𝑛𝑥𝑗,𝑛 = 𝑤0 + Σ_𝑖 𝑤𝑖 𝑥𝑗,𝑖

 The 𝑤0 term, the intercept, stands out as different from the others.

 We can fix that by inventing a dummy input attribute, 𝑥𝑗,0, which is
defined as always equal to 1.

 Then ℎ is simply the dot product of the weights and the input vector (or
equivalently, the matrix product of the transpose of the weights and
the input vector):
ℎ𝒘(𝒙𝑗) = 𝒘 · 𝒙𝑗 = 𝒘ᵀ𝒙𝑗 = Σ_𝑖 𝑤𝑖 𝑥𝑗,𝑖

 The best vector of weights, 𝒘*, minimizes the squared-error loss over
the examples:
𝒘* = argmin_𝒘 Σ_𝑗 𝐿2(𝑦𝑗, 𝒘 · 𝒙𝑗)
Gradient Descent for Multivariate Linear Regression

 Gradient descent will reach the (unique) minimum of the loss function.

 The update equation for each weight 𝑤𝑖 is
𝑤𝑖 ← 𝑤𝑖 + α Σ_𝑗 𝑥𝑗,𝑖 (𝑦𝑗 − ℎ𝒘(𝒙𝑗))

Analytical Solution for Multivariate Linear Regression

 It is also possible to solve analytically for the 𝒘 that minimizes loss.

 Let 𝒚 be the vector of outputs for the training examples, and 𝑿 be the
data matrix, i.e., the matrix of inputs with one 𝑛-dimensional example
per row.

 The solution that minimizes the squared error is
𝒘* = (𝑿ᵀ𝑿)⁻¹ 𝑿ᵀ 𝒚
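The normal-equation solution w* = (XᵀX)⁻¹Xᵀy is a one-liner with NumPy. A sketch on a made-up dataset (points generated from y = 1 + 2·x1 + 5·x2); the column of 1s is the dummy attribute x0, so w[0] plays the role of the intercept:

```python
import numpy as np

# Data matrix X: one example per row, first column is the dummy input x0 = 1.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 3.0, 6.0, 8.0])       # generated from y = 1 + 2*x1 + 5*x2

# Solve (X^T X) w = X^T y rather than forming the inverse explicitly,
# which is numerically more stable:
w = np.linalg.solve(X.T @ X, X.T @ y)
```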

Overfitting and Regularization in Multivariate Linear Regression
 With univariate linear regression we didn’t have to worry about
overfitting.
 Problem of Overfitting – But with multivariate linear regression in
high-dimensional spaces it is possible that some dimension that is
actually irrelevant appears by chance to be useful, resulting in
overfitting.
Regularization:
 It is common to use regularization on multivariate linear functions to
avoid overfitting.
 We search for a hypothesis that directly minimizes the weighted sum
of the empirical loss and the complexity of the hypothesis, which we will
call the total cost:
𝐶𝑜𝑠𝑡(ℎ) = 𝐸𝑚𝑝𝐿𝑜𝑠𝑠(ℎ) + 𝜆 𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦(ℎ)
 𝜆, a positive number, is a hyperparameter; it serves as a conversion
rate between the loss and the complexity measure. If 𝜆 is chosen well, it
nicely balances the empirical loss of a simple function against a
complicated function's tendency to overfit.
 This process of explicitly penalizing complex hypotheses is
called regularization: we're looking for functions that are more
regular.
L1 and L2 Regularization for Linear Functions

 For linear functions the complexity can be specified as a function of
the weights.

 We can consider a family of regularization functions:
𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦(ℎ𝒘) = 𝐿𝑞(𝒘) = Σ_𝑖 |𝑤𝑖|^𝑞

 As with loss functions, with 𝑞 = 1 we have 𝑳𝟏 regularization, which
minimizes the sum of the absolute values.

 With 𝑞 = 2, we have 𝑳𝟐 regularization, which minimizes the sum of
squares.

 Which regularization function should you pick? That depends
on the specific problem.

 𝑳𝟏 regularization advantage: it tends to produce a sparse model
(setting many weights to zero, effectively declaring those attributes
irrelevant), resulting in hypotheses less likely to overfit.
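The complexity family and the total cost can be sketched directly; the helper names and the two example weight vectors below are mine, chosen to show why L1 favors sparsity:

```python
def complexity(w, q):
    """L_q complexity of a linear hypothesis: sum_i |w_i|**q."""
    return sum(abs(wi) ** q for wi in w)

def total_cost(emp_loss, w, lam, q=2):
    """Cost(h) = EmpLoss(h) + lambda * Complexity(h)."""
    return emp_loss + lam * complexity(w, q)

# A dense and a sparse weight vector can have the same L2 complexity
# yet differ sharply under L1 -- one reason L1 pushes weights to exactly zero:
dense, sparse = [0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 0.0, 0.0]
assert complexity(dense, 2) == complexity(sparse, 2) == 1.0
assert complexity(dense, 1) > complexity(sparse, 1)
```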
Linear Classifiers with a hard threshold
 In addition to regression, linear functions can also be used for
classification.

 Example: classifying a seismic signal as either a nuclear explosion
or an earthquake.

 (𝑥1, 𝑥2) ≡ (body wave magnitude, surface wave magnitude) – the two
features of a seismic signal.

 Classification task: given the training data, learn a hypothesis ℎ that
will take new (𝑥1, 𝑥2) points and return 0 for earthquakes or 1 for
explosions.

Figure (training data): Earthquakes (white circles) and nuclear
explosions (black circles) occurring between 1982 and 1990 in Asia
and the Middle East. Also shown is a decision boundary between
the classes.
Decision Boundary for a Linearly Separable Data
 A decision boundary is a line (or a surface, in higher dimensions)
that separates the two classes.
 A linear decision boundary is called a linear separator.
 Data that admit such a linear separator are called linearly separable.

 In our case, the linear separator is:
𝑥2 = 1.7𝑥1 − 4.9, or −4.9 + 1.7𝑥1 − 𝑥2 = 0

 The explosions, which we want to classify with value 1, are to the
right of this line, with higher values of 𝑥1 and lower values of 𝑥2.

 So explosions are the points for which −4.9 + 1.7𝑥1 − 𝑥2 > 0, while
earthquakes have −4.9 + 1.7𝑥1 − 𝑥2 < 0.
Decision Boundary for a Linearly Separable Data
 We can write the classification hypothesis as
ℎ𝒘(𝒙) = 1 if 𝒘 · 𝒙 ≥ 0 and 0 otherwise
(assuming 𝑥0 = 1 as a dummy input).

 We can think of ℎ as the result of passing the linear function 𝒘 · 𝒙
through a hard threshold function:
ℎ𝒘(𝒙) = 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑(𝒘 · 𝒙), where
𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑(𝑧) = 1 if 𝑧 ≥ 0 and 0 otherwise.

Figure: The hard threshold function 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑(𝑧) with 0/1 output.
Note that the function is nondifferentiable at 𝑧 = 0.
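The hard-threshold hypothesis can be sketched directly, using the slide's own separator −4.9 + 1.7x1 − x2 = 0 as the weight vector w = [w0, w1, w2] = [−4.9, 1.7, −1.0] with dummy input x0 = 1. The two test points are made up for illustration:

```python
def threshold_classify(w, x):
    """h_w(x) = Threshold(w . x): 1 if w . x >= 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z >= 0 else 0

# The separator from the slide, -4.9 + 1.7*x1 - x2 = 0:
w = [-4.9, 1.7, -1.0]
assert threshold_classify(w, [1, 6.0, 4.0]) == 1   # -4.9 + 10.2 - 4.0 > 0: explosion
assert threshold_classify(w, [1, 4.0, 4.0]) == 0   # -4.9 + 6.8 - 4.0 < 0: earthquake
```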

The Perceptron Learning Rule
 How do we choose the weights 𝒘 to minimize the loss?

 We cannot solve mathematically for the weights by
setting the gradient to zero.

 Also, we cannot use gradient descent in weight
space! Why? Because the hard threshold makes ℎ𝒘(𝒙)
piecewise constant: its gradient is zero almost
everywhere and undefined at 𝑧 = 0.

 The Perceptron Learning Rule:
• A simple weight-update rule that converges
to a solution, provided the data are linearly
separable.
• For a single example (𝒙, 𝑦), we have
𝑤𝑖 ← 𝑤𝑖 + α (𝑦 − ℎ𝒘(𝒙)) · 𝑥𝑖

Behavior of The Perceptron Learning Rule
 The Perceptron Learning Rule:
• A simple weight-update rule that converges to a
solution, provided the data are linearly separable.
• For a single example (𝒙, 𝑦), we have
𝑤𝑖 ← 𝑤𝑖 + α (𝑦 − ℎ𝒘(𝒙)) · 𝑥𝑖

 For our 0/1 classification problem the perceptron rule
behaves as follows:
• If the output is correct, i.e., 𝑦 = ℎ𝒘(𝒙), then the
weights are not changed.
• If 𝑦 is 1 but ℎ𝒘(𝒙) is 0, then 𝑤𝑖 is increased
when the corresponding input 𝑥𝑖 is positive and
decreased when 𝑥𝑖 is negative. (This makes sense:
it makes 𝒘 · 𝒙 bigger, pushing ℎ𝒘(𝒙) toward 1.)
• If 𝑦 is 0 but ℎ𝒘(𝒙) is 1, then 𝑤𝑖 is decreased
when the corresponding input 𝑥𝑖 is positive and
increased when 𝑥𝑖 is negative. (This makes 𝒘 · 𝒙
smaller, pushing ℎ𝒘(𝒙) toward 0.)
 Typically the learning rule is applied one example at a
time, as in stochastic gradient descent.
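The rule above can be sketched as a short training loop. The AND dataset (with dummy input x0 = 1), α = 1, and the epoch cap are my own illustrative choices; AND is linearly separable, so the rule is guaranteed to converge on it:

```python
def perceptron_fit(examples, alpha=1.0, max_epochs=100):
    """Perceptron rule: w_i <- w_i + alpha * (y - h_w(x)) * x_i, per example."""
    n = len(examples[0][0])
    w = [0.0] * n
    predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            err = y - predict(x)           # 0 if correct, else +1 or -1
            if err:
                mistakes += 1
                w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                  # a full clean pass: separator found
            break
    return w

# AND function with dummy input x0 = 1 (linearly separable):
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = perceptron_fit(data)
```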
Convergence not quick but guaranteed for Linearly Separable Data

 Training curve: a training curve measures the classifier
performance on a fixed training set as the learning process
proceeds on that same training set.

 The curve shows the update rule converging to a zero-error
linear separator.

 The "convergence" process isn't exactly pretty, but for linearly
separable data it always works.

 This particular run takes 657 steps to converge, for a data set
with 63 examples, so each example is presented roughly 10
times on average.

What If Data Isn’t Linearly Separable
• Adding missing data makes our data linearly non- Original partial data
separable.

 The perceptron learning rule failing to converge


even after 10,000 steps: even though it hits the
minimum-error solution (three errors) many times,
the algorithm keeps changing the weights.

 Training curve:
After adding missing data

 In general, the perceptron rule may not converge


to a stable solution for a fixed learning rate 𝛼

 For decaying learning rate it converges to a


minimum error solution after a very long time
Linear Classifiers with a hard threshold

Linear Classifiers with a soft threshold (using logistic regression)

• Softening the threshold function by using the logistic function:
𝐿𝑜𝑔𝑖𝑠𝑡𝑖𝑐(𝑧) = 1 / (1 + 𝑒^(−𝑧))
• Hypothesis:
ℎ𝒘(𝒙) = 𝐿𝑜𝑔𝑖𝑠𝑡𝑖𝑐(𝒘 · 𝒙) = 1 / (1 + 𝑒^(−𝒘·𝒙))
• The output can be interpreted as a probability.
• The hypothesis forms a soft boundary in the input space (fig. c).
Linear Classifiers with a soft threshold (using logistic regression)
• Logistic regression: fitting the weights to minimize loss on a dataset.
• There is no easy closed-form solution.
• Gradient descent with L2 loss works.
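A sketch of gradient descent with L2 loss on a logistic hypothesis. The per-example update w_i ← w_i + α(y − h)·h·(1 − h)·x_i comes from differentiating (y − Logistic(w·x))²; the OR dataset, learning rate, and epoch count are my own choices:

```python
import math

def logistic(z):
    """Logistic(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_sgd(examples, alpha=0.5, epochs=5000):
    """Gradient descent on L2 loss for h_w(x) = Logistic(w . x):
    w_i <- w_i + alpha * (y - h) * h * (1 - h) * x_i."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            h = logistic(sum(wi * xi for wi, xi in zip(w, x)))
            g = alpha * (y - h) * h * (1 - h)   # h*(1-h) is Logistic's derivative
            w = [wi + g * xi for wi, xi in zip(w, x)]
    return w

# OR function with dummy input x0 = 1:
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = logistic_sgd(data)
probs = [logistic(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data]
```

Unlike the hard threshold, the outputs in `probs` are graded: points near the boundary get values near 0.5 rather than an abrupt 0/1 flip.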

Linear Classifiers with a soft threshold (using logistic regression)

Training curves for the logistic classifier on our seismic-signal data:
 Slow but predictable convergence for linearly separable data (fig. a).
 Quick and reliable convergence for noisy, linearly non-separable data
(figs. b and c).

?? Open to Questions ! . . . 
