
Lecture 02:

“Supervised” Machine Learning

Pratik Desai
Asst. Prof., MCA Dept., NMITD
Mob No: 9867498362

Adapted from Chapter 18, "Learning from Examples," of Artificial Intelligence: A Modern Approach, 3rd Ed., by Stuart Russell & Peter Norvig

Lecture 1: Introduction 1/28


Learning Agents
• Agents that can improve their behavior through diligent study of
their own experiences.

• An agent is learning if it improves its performance on future
tasks after making observations about the world.

Supervised Learning:
• One class of learning problem: from a collection of input-output
pairs, learn a function that predicts the output for new inputs.

• Although this seems restricted, it has vast applicability.

Roger Grosse and Jimmy Ba


Lecture 2: Supervised Machine Learning 1/
Why would we want an agent to learn?
 If the design of the agent can be improved, why wouldn't the
designers just program in that improvement to begin with?

 Three main reasons why we need learning ability:


 Programmers cannot anticipate all possible situations that
the agent can find itself in
 Programmers cannot anticipate all changes over time – the
world is ever changing
 For certain tasks, human programmers have no idea how
to program a solution themselves – example: recognizing
faces, understanding speech, driving a car, etc.

Supervised Learning is Inductive Learning

 Inductive Learning: learning a (possibly incorrect) general function or
rule from specific input-output pairs (from facts to principles).

 Supervised Learning is a form of inductive learning.

 Deductive (or Analytical) Learning: going from a known general rule to
a new rule that is logically entailed (from principles to facts), but is
useful because it allows more efficient processing.

Formalizing the task of Supervised Learning
 The task of supervised learning is this:
• Given a training set of N example input-output pairs
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁, 𝑦𝑁)
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥),
• discover a function ℎ that approximates the true function 𝑓.
 Here 𝑥 and 𝑦 can be any value; they need not be numbers.
 The function ℎ is a hypothesis.
 Learning is a search through the space of possible hypotheses for
one that will perform well, even on new examples beyond the
training set.
 To measure the accuracy of a hypothesis we give it a test set of
examples that are distinct from the training set.
 We say a hypothesis generalizes well if it correctly predicts the
value of 𝑦 for novel examples.
 When the function 𝑓 is stochastic – not strictly a function of 𝑥 –
what we have to learn is a conditional probability distribution, 𝑃(𝑌 | 𝑥).
Classification vs. Regression

 When the output 𝑦 is one of a finite set of values (such as sunny,
cloudy or rainy), the learning problem is called classification.

 Binary (or Boolean) classification: classification where there are
only two possible values for the output 𝑦.

 When the output 𝑦 is a number (such as tomorrow's temperature),
the learning problem is called regression.

 Technically, solving a regression problem is finding a
conditional expectation or average value of 𝑦, because the
probability that we have found exactly the right real-valued
number for 𝑦 is 0.

Supervised Learning as Function (Curve) Fitting

 We have some example data points in the (𝑥, 𝑦) plane, where
𝑦 = 𝑓(𝑥).

 We don't know what 𝑓 is, but we will approximate it with a
function ℎ selected from a hypothesis space 𝐻.

 For this example, we will take the hypothesis space 𝐻 to be
the set of all polynomials, such as 𝑥^5 + 3𝑥^2 + 2.

 If a hypothesis function ℎ fits all the data points, it's called a
consistent hypothesis because it agrees with all the data.

Two consistent hypotheses: (a) linear, (b) non-linear

(a) Example (𝑥, 𝑓(𝑥)) pairs and a consistent, linear hypothesis.
(b) A consistent, degree-7 polynomial for the same data set.

The fundamental problem in Inductive Learning

Source: “Bringing Out Occam’s razor in peer-review” (Nature), Credit: Matteo De Ventra

How do we choose from among multiple consistent hypotheses?


Solution: Apply Ockham’s razor !!

Source: www.phdcomics.com

The Fundamental Problem in Inductive Learning
 The previous slide shows two hypothesis functions: (a) a linear
function – a straight line fitting the data; (b) a degree-7 polynomial
function – a complex curve fitting the data.

 Both are consistent with the same data.

 This illustrates a fundamental problem in inductive learning – how
do we choose from multiple consistent hypotheses?

 Solution – use the principle called Ockham's razor.

 The principle of Ockham's razor – named after the 14th-century
English philosopher William of Ockham, who argued against all sorts of
complications – advises us to prefer the simplest hypothesis
consistent with the data.

 Since a degree-1 polynomial is simpler than a degree-7 polynomial,
hypothesis (a) should be preferred over hypothesis (b).
A hypothesis that fits need not generalize, and vice versa

(c) A different dataset, which admits two hypotheses:
an exact degree-6 polynomial that fits, or an approximate linear fit.

Expanding the hypothesis space H

(c) A different dataset, which admits two hypotheses: an exact
degree-6 polynomial that fits, or an approximate linear fit.
(d) A simple, exact sinusoidal fit to the same dataset.

Note: Here we expand the hypothesis space H to include polynomials over
both 𝑥 and sin 𝑥 [e.g., functions like 𝑎𝑥 + 𝑏 + 𝑐 sin(𝑥)].
Assumptions & Issues in Evaluating & Choosing the Best Hypothesis

 Stationarity Assumption: past data (used for training) will be similar in
nature to the future data the model will face in the real world.

 How to define the best fit? Errors vs. loss (and thus error rates vs.
loss functions).

 The error rate of a hypothesis is the proportion of mistakes it
makes – the proportion of times that ℎ(𝑥) ≠ 𝑦 for an (𝑥, 𝑦) example.

 Just because a hypothesis ℎ has a low error rate on the training set does
not mean that it will generalize well.

Evaluation of a Hypothesis
 To get an accurate evaluation we need to test the hypothesis on a set of
examples it has not seen yet.
 Holdout cross-validation: randomly split the available data into a training
set, from which the learning algorithm produces ℎ, and a test set, on which
the accuracy of ℎ is evaluated.
  It fails to use all the available data. If we use half the data for the test
set, then we are only training on half the data, and we may get a poor
hypothesis.
  On the other hand, if we reserve only 10% of the data for the test set,
then we may, by statistical chance, get a poor estimate of the actual
accuracy.
 𝒌-fold cross-validation: We can squeeze more out of the data and still get
an accurate estimate. Each example serves double duty – as training data
and as test data.
  First we split the data into 𝑘 equal subsets. We then perform 𝑘 rounds
of learning; on each round 1/𝑘 of the data is held out as a test set and the
remaining examples are used as training data.
  The average test-set score of the 𝑘 rounds should then be a better
estimate than a single score. Popular values of 𝑘: 5 and 10.
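The k-fold procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' code: `learn` and `error_rate` are caller-supplied placeholders, and the toy majority-label learner at the bottom is invented purely to make the sketch runnable.

```python
import random

def k_fold_cv(examples, learn, error_rate, k=5, seed=0):
    """Estimate a learner's error by k-fold cross-validation.

    `learn` fits a hypothesis h from a training list; `error_rate`
    scores h on a held-out list. Both are caller-supplied.
    """
    data = examples[:]
    random.Random(seed).shuffle(data)          # shuffle once, then split
    folds = [data[i::k] for i in range(k)]     # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]                        # 1/k of the data held out
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        h = learn(train)
        scores.append(error_rate(h, test))
    return sum(scores) / k                     # average of the k rounds

# Toy use: "learn" the majority label, score by fraction misclassified.
examples = [(x, x >= 5) for x in range(10)]
learn = lambda train: max(set(y for _, y in train),
                          key=[y for _, y in train].count)
err = lambda h, test: sum(y != h for _, y in test) / len(test)
estimate = k_fold_cv(examples, learn, err, k=5)
```

Setting `k = len(examples)` in this same sketch gives the leave-one-out (LOOCV) case mentioned on the next slide.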
Evaluation of a Hypothesis

 Leave-one-out cross-validation (LOOCV): the extreme case of 𝑘-fold
cross-validation, where 𝑘 = 𝑛.

From error rates to loss

 Consider the problem of classifying email messages as spam or non-spam.

 It is worse to classify non-spam as spam (and thus potentially miss an
important message) than to classify spam as non-spam (and thus
suffer a few seconds of annoyance).

 So a classifier with a 1% error rate, where almost all the errors were
classifying spam as non-spam, would be better than a classifier with only
a 0.5% error rate, if most of those errors were classifying non-spam as
spam.

Loss Function as Amount of Utility Lost

 The loss function 𝐿(𝑥, 𝑦, ŷ) is defined as the amount of
utility lost by predicting ℎ(𝑥) = ŷ when the correct answer
is 𝑓(𝑥) = 𝑦.

 Often a simplified version is used, 𝐿(𝑦, ŷ), that is
independent of 𝑥.

 We can say it is 10 times worse to classify non-spam as
spam than vice versa:
𝐿(𝑠𝑝𝑎𝑚, 𝑛𝑜𝑠𝑝𝑎𝑚) = 1, 𝐿(𝑛𝑜𝑠𝑝𝑎𝑚, 𝑠𝑝𝑎𝑚) = 10

 Note that 𝐿(𝑦, 𝑦) is always zero: by definition there is no
loss when you guess exactly right.
L1 Loss, L2 Loss and L0/1 Loss
 Loss values for discrete-valued functions vs. loss values for
real-valued functions:

 𝑳𝟏 loss (the absolute-value loss):
𝐿1(𝑦, ŷ) = |𝑦 − ŷ|

 𝑳𝟐 loss (the squared-error loss):
𝐿2(𝑦, ŷ) = (𝑦 − ŷ)²

 𝑳𝟎/𝟏 loss (the 0/1 loss): a loss of 1 for an incorrect answer, 0 otherwise
𝐿0/1(𝑦, ŷ) = 0 if 𝑦 = ŷ, else 1
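The three losses are one-liners in code. A minimal sketch (the function names `L1`, `L2`, `L01` are my own):

```python
def L1(y, yhat):              # absolute-value loss |y - yhat|
    return abs(y - yhat)

def L2(y, yhat):              # squared-error loss (y - yhat)^2
    return (y - yhat) ** 2

def L01(y, yhat):             # 0/1 loss: 1 for any incorrect answer
    return 0 if y == yhat else 1

# L2 penalizes large errors more heavily than L1:
assert L1(10.0, 7.0) == 3.0 and L2(10.0, 7.0) == 9.0
# L0/1 applies to discrete-valued outputs such as the spam example:
assert L01("spam", "spam") == 0 and L01("spam", "nospam") == 1
```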

Generalization Loss
 The learning agent can theoretically maximize its expected
utility by choosing the hypothesis that minimizes expected
loss over all input-output pairs it will see.

 Let 𝜀 be the set of all possible input-output examples.

 The expected generalization loss for a hypothesis ℎ (with
respect to loss function 𝐿) is

GenLoss_𝐿(ℎ) = Σ_{(𝑥,𝑦)∈𝜀} 𝐿(𝑦, ℎ(𝑥)) 𝑃(𝑥, 𝑦)

 And the best hypothesis, ℎ*, is the one with the minimum
expected generalization loss:

ℎ* = argmin_{ℎ∈𝐻} GenLoss_𝐿(ℎ)
Estimating Generalization Loss with Empirical Loss

 Because 𝑃(𝑥, 𝑦) is not known, the learning agent can only
estimate generalization loss with the empirical loss on a set of
examples, 𝐸 (of size 𝑁):

EmpLoss_{𝐿,𝐸}(ℎ) = (1/𝑁) Σ_{(𝑥,𝑦)∈𝐸} 𝐿(𝑦, ℎ(𝑥))

 The estimated best hypothesis ĥ* is then the one with
minimum empirical loss:

ĥ* = argmin_{ℎ∈𝐻} EmpLoss_{𝐿,𝐸}(ℎ)
Regression and Classification
with Linear Models
Linear Regression

 Hypothesis space: the class of all linear functions of
continuous-valued inputs.

 Univariate Linear Regression

 Multivariate Linear Regression

Univariate Linear Regression
 A univariate linear function (a straight line) with input 𝑥 and
output 𝑦 has the form 𝑦 = 𝑤1𝑥 + 𝑤0.

 𝑤0 and 𝑤1 are real-valued coefficients to be learned.

 We think of the coefficients as weights (hence the letter 𝑤);
the value of 𝑦 is changed by changing the relative weight
of one term or another.

 Let 𝒘 be the vector [𝑤0, 𝑤1], and let us define
ℎ𝒘(𝑥) = 𝑤1𝑥 + 𝑤0

 The task of finding the ℎ𝒘 that best fits the data is called
linear regression.

Doing Linear Regression
• We see an example of a training set of 𝑁 points in the (𝑥, 𝑦)
plane; each point represents the size in square feet and the
price of a house offered for sale.

• To fit a line to the data, all we have to do is find the values of
the weights [𝑤0, 𝑤1] that minimize the empirical loss.

Figure: Data points of price versus floor space of houses for
sale in Berkeley, CA, in July 2009, along with the linear-function
hypothesis that minimizes squared-error loss: 𝑦 = 0.232𝑥 + 246.

Doing Linear Regression

• Usually empirical loss is computed using the squared loss
function, 𝐿2, summed over all the training examples:

𝐿𝑜𝑠𝑠(ℎ𝒘) = Σ_{𝑗=1}^{𝑁} 𝐿2(𝑦𝑗, ℎ𝒘(𝑥𝑗))
         = Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − ℎ𝒘(𝑥𝑗))²
         = Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))²

• We would like to find 𝒘* = argmin_𝒘 𝐿𝑜𝑠𝑠(ℎ𝒘)

Figure: Plot of the loss function Σ_{𝑗=1}^{𝑁} (𝑤1𝑥𝑗 + 𝑤0 − 𝑦𝑗)² for
various values of 𝑤0, 𝑤1. Note that the loss function is convex,
with a single global minimum.

Doing Linear Regression – It Has A Mathematical Solution

• The sum Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))² is minimized when its partial
derivatives with respect to 𝑤0 and 𝑤1 are zero,

• i.e., when – Equations (1)

∂/∂𝑤0 Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))² = 0
and
∂/∂𝑤1 Σ_{𝑗=1}^{𝑁} (𝑦𝑗 − (𝑤1𝑥𝑗 + 𝑤0))² = 0

• These equations have a unique solution: – Equations (2)

𝑤1 = (𝑁 Σ 𝑥𝑗𝑦𝑗 − (Σ 𝑥𝑗)(Σ 𝑦𝑗)) / (𝑁 Σ 𝑥𝑗² − (Σ 𝑥𝑗)²);  𝑤0 = (Σ 𝑦𝑗 − 𝑤1 (Σ 𝑥𝑗)) / 𝑁

• In our case, the solution is 𝒘𝟏 = 𝟎.𝟐𝟑𝟐, 𝒘𝟎 = 𝟐𝟒𝟔, and the line with those
weights is shown as a dashed line in figure (a).
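Equations (2) translate directly into code. A minimal sketch; the four data points below are made up for illustration (they are not the Berkeley housing data):

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit y = w1*x + w0 (Equations 2)."""
    N = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of x_j * y_j
    sxx = sum(x * x for x in xs)               # sum of x_j^2
    w1 = (N * sxy - sx * sy) / (N * sxx - sx ** 2)
    w0 = (sy - w1 * sx) / N
    return w0, w1

# Points generated from y = 2x + 1 are recovered exactly:
w0, w1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
assert abs(w1 - 2.0) < 1e-9 and abs(w0 - 1.0) < 1e-9
```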

Doing Linear Regression – It Has A Mathematical Solution

In our case, the solution is 𝑤1 = 0.232, 𝑤0 = 246, and the line with those
weights is shown as a dashed line in figure (a).

Figure (a): Data points of price versus floor space of houses for sale in
Berkeley, CA, in July 2009, along with the linear-function hypothesis that
minimizes squared-error loss: 𝑦 = 0.232𝑥 + 246.
Figure (b): Plot of the loss function Σ_{𝑗=1}^{𝑁} (𝑤1𝑥𝑗 + 𝑤0 − 𝑦𝑗)² for various
values of 𝑤0, 𝑤1. Note that the loss function is convex, with a single
global minimum.

Gradient Descent Learning – When There Is No Mathematical Solution
 Involves adjusting weights in the weight space to minimize loss.
 When we go beyond linear models, the equations defining
minimum loss will often have no closed-form solution.
 In such cases, we face a general optimization search problem
in a continuous weight space.
 Such problems can be addressed by a hill-climbing algorithm that
follows the gradient of the function to be optimized.
 We use gradient descent because we want to minimize
the loss. Here α is the learning rate (step size):

𝑤𝑖 ← 𝑤𝑖 − α ∂/∂𝑤𝑖 𝐿𝑜𝑠𝑠(𝒘) – Equation (3)
Mathematical Derivations – Stochastic Gradient Descent

 We work out the partial derivative in the simplified case of one
training example (𝑥, 𝑦):

∂/∂𝑤0 (𝑦 − ℎ𝒘(𝑥))² = −2(𝑦 − ℎ𝒘(𝑥))
∂/∂𝑤1 (𝑦 − ℎ𝒘(𝑥))² = −2(𝑦 − ℎ𝒘(𝑥)) · 𝑥

 Applying Equation (3), the final learning rules are (the −2 folds into α):

𝑤0 ← 𝑤0 + α (𝑦 − ℎ𝒘(𝑥));  𝑤1 ← 𝑤1 + α (𝑦 − ℎ𝒘(𝑥)) · 𝑥 – Equations (4)

Stochastic Gradient Descent

 We consider only a single training point at a time, taking a step
after each one using Equations (4), given below:

𝑤0 ← 𝑤0 + α (𝑦 − ℎ𝒘(𝑥));  𝑤1 ← 𝑤1 + α (𝑦 − ℎ𝒘(𝑥)) · 𝑥 – Equations (4)

 Can be used in an online setting, where new data come in one at
a time.
 In an offline setting we cycle through the same data as many times
as necessary, taking a step after considering each single
example.
 Often faster than batch gradient descent.
 With a fixed learning rate α, convergence is not guaranteed
(there is a possibility of never-ending oscillations around the minimum).
 A schedule of decreasing learning rates does guarantee
convergence.
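The stochastic update of Equations (4) can be sketched in a few lines. The toy data (points on the line y = 2x + 1), the learning rate, and the epoch count are my own illustrative choices:

```python
def sgd_fit(points, alpha=0.01, epochs=2000):
    """Stochastic gradient descent for h_w(x) = w1*x + w0 (Equations 4)."""
    w0 = w1 = 0.0
    for _ in range(epochs):                   # offline setting: cycle repeatedly
        for x, y in points:                   # one step per single example
            err = y - (w1 * x + w0)           # prediction error on this point
            w0 += alpha * err                 # the -2 is folded into alpha
            w1 += alpha * err * x
    return w0, w1

# Approaches the exact fit y = 2x + 1 for a small enough alpha:
w0, w1 = sgd_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```

Because this toy data is noise-free, the fixed learning rate happens to converge here; on noisy data the oscillation caveat on the slide applies.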

Mathematical Derivations – Batch Gradient Descent

 For 𝑁 training examples, we want to minimize the sum of the
individual losses for each example.

 Because the derivative of a sum is the sum of the derivatives,
we have the following update rule:

𝑤0 ← 𝑤0 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗));  𝑤1 ← 𝑤1 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗)) · 𝑥𝑗 – Equations (5)

Batch Gradient Descent

 We consider all 𝑁 training examples at each step and
minimize the sum of the individual losses using
Equations (5), given below:

𝑤0 ← 𝑤0 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗));  𝑤1 ← 𝑤1 + α Σ_𝑗 (𝑦𝑗 − ℎ𝒘(𝑥𝑗)) · 𝑥𝑗 – Equations (5)

 Convergence to the unique global minimum is guaranteed, as
long as we pick a small enough α.

 Convergence may be very slow: we have to cycle through all
the training data for every step, and there may be many steps.
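For contrast with the stochastic version, a sketch of the batch update of Equations (5): every step sums the errors over all N examples before touching the weights. The toy data and hyperparameters are again my own choices:

```python
def batch_gd_fit(points, alpha=0.01, steps=2000):
    """Batch gradient descent (Equations 5): each step sums over all N examples."""
    w0 = w1 = 0.0
    for _ in range(steps):
        # One full pass over the training data per weight update:
        errs = [y - (w1 * x + w0) for x, y in points]
        w0 += alpha * sum(errs)
        w1 += alpha * sum(e * x for e, (x, _) in zip(errs, points))
    return w0, w1

# Same noise-free toy data as before; converges to y = 2x + 1:
w0, w1 = batch_gd_fit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
```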

Multivariate Linear Regression
 We can easily extend to multivariate linear regression problems, in
which each example 𝒙𝑗 is an 𝑛-element vector.

 Our hypothesis space is the set of functions of the form
ℎ𝒘(𝒙𝑗) = 𝑤0 + 𝑤1𝑥𝑗,1 + ⋯ + 𝑤𝑛𝑥𝑗,𝑛 = 𝑤0 + Σ_𝑖 𝑤𝑖 𝑥𝑗,𝑖

 The 𝑤0 term, the intercept, stands out as different from the others.

 We can fix that by inventing a dummy input attribute, 𝑥𝑗,0, which is
defined as always equal to 1.

 Then ℎ is simply the dot product of the weights and the input vector (or
equivalently, the matrix product of the transpose of the weights and
the input vector):
ℎ𝒘(𝒙𝑗) = 𝒘 · 𝒙𝑗 = 𝒘ᵀ𝒙𝑗 = Σ_𝑖 𝑤𝑖 𝑥𝑗,𝑖

 The best vector of weights, 𝒘*, minimizes the squared-error loss over
the examples:
𝒘* = argmin_𝒘 Σ_𝑗 𝐿2(𝑦𝑗, 𝒘 · 𝒙𝑗)
Gradient Descent for Multivariate Linear Regression

 Gradient descent will reach the (unique) minimum of the loss function.

 The update equation for each weight 𝑤𝑖 is
𝑤𝑖 ← 𝑤𝑖 + α Σ_𝑗 𝑥𝑗,𝑖 (𝑦𝑗 − ℎ𝒘(𝒙𝑗))

Analytical Solution for Multivariate Linear Regression

 It is also possible to solve analytically for the 𝒘 that minimizes loss.

 Let 𝒚 be the vector of outputs for the training examples, and 𝑿 be the
data matrix, i.e., the matrix of inputs with one 𝑛-dimensional example
per row.

 The solution that minimizes the squared error is
𝒘* = (𝑿ᵀ𝑿)⁻¹ 𝑿ᵀ 𝒚
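The normal-equation solution w* = (XᵀX)⁻¹Xᵀy is a one-liner with NumPy. A sketch on a made-up dataset (points generated from y = 1 + 2·x1 + 5·x2); the column of 1s is the dummy attribute x0, so w[0] plays the role of the intercept:

```python
import numpy as np

# Data matrix X: one example per row, first column is the dummy input x0 = 1.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 3.0, 6.0, 8.0])       # generated from y = 1 + 2*x1 + 5*x2

# Solve (X^T X) w = X^T y rather than forming the inverse explicitly,
# which is numerically more stable:
w = np.linalg.solve(X.T @ X, X.T @ y)
```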

Overfitting and Regularization in Multivariate Linear Regression
 With univariate linear regression we didn’t have to worry about
overfitting.
 Problem of Overfitting – But with multivariate linear regression in
high-dimensional spaces it is possible that some dimension that is
actually irrelevant appears by chance to be useful, resulting in
overfitting.
Regularization:
 It is common to use regularization on multivariate linear functions to
avoid overfitting.
 We search for a hypothesis that directly minimizes the weighted sum
of the empirical loss and the complexity of the hypothesis, which we will
call the total cost:
𝐶𝑜𝑠𝑡(ℎ) = 𝐸𝑚𝑝𝐿𝑜𝑠𝑠(ℎ) + 𝜆 𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦(ℎ)
 𝜆, a positive number, is a hyperparameter; it serves as a conversion
rate between the loss and the complexity measure. If 𝜆 is chosen well, it
nicely balances the empirical loss of a simple function against a
complicated function's tendency to overfit.
 This process of explicitly penalizing complex hypotheses is
called regularization: we're looking for functions that are more
regular.
L1 and L2 Regularization for Linear Functions

 For linear functions the complexity can be specified as a function of
the weights.

 We can consider a family of regularization functions:
𝐶𝑜𝑚𝑝𝑙𝑒𝑥𝑖𝑡𝑦(ℎ𝒘) = 𝐿𝑞(𝒘) = Σ_𝑖 |𝑤𝑖|^𝑞

 As with loss functions, with 𝑞 = 1 we have 𝑳𝟏 regularization, which
minimizes the sum of the absolute values.

 With 𝑞 = 2, we have 𝑳𝟐 regularization, which minimizes the sum of
squares.

 Which regularization function should you pick? That depends
on the specific problem.

 𝑳𝟏 regularization advantage: it tends to produce a sparse model
(setting many weights to zero, effectively declaring those attributes
irrelevant), resulting in hypotheses less likely to overfit.
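The complexity family and the total cost can be sketched directly; the helper names and the two example weight vectors below are mine, chosen to show why L1 favors sparsity:

```python
def complexity(w, q):
    """L_q complexity of a linear hypothesis: sum_i |w_i|**q."""
    return sum(abs(wi) ** q for wi in w)

def total_cost(emp_loss, w, lam, q=2):
    """Cost(h) = EmpLoss(h) + lambda * Complexity(h)."""
    return emp_loss + lam * complexity(w, q)

# A dense and a sparse weight vector can have the same L2 complexity
# yet differ sharply under L1 -- one reason L1 pushes weights to exactly zero:
dense, sparse = [0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 0.0, 0.0]
assert complexity(dense, 2) == complexity(sparse, 2) == 1.0
assert complexity(dense, 1) > complexity(sparse, 1)
```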
Linear Classifiers with a hard threshold
 In addition to regression, linear functions can also be used for
classification.

 Example: classifying a seismic signal as either a nuclear explosion
or an earthquake.

 (𝑥1, 𝑥2) ≡ (body wave magnitude, surface wave magnitude) – the two
features of a seismic signal.

 Classification task: given the training data, learn a hypothesis ℎ that
will take new (𝑥1, 𝑥2) points and return 0 for earthquakes or 1 for
explosions.

Figure (training data): Earthquakes (white circles) and nuclear
explosions (black circles) occurring between 1982 and 1990 in Asia
and the Middle East. Also shown is a decision boundary between
the classes.
Decision Boundary for a Linearly Separable Data
 A decision boundary is a line (or a surface, in higher dimensions)
that separates the two classes.
 A linear decision boundary is called a linear separator.
 Data that admit such a linear separator are called linearly separable.

 In our case, the linear separator is:
𝑥2 = 1.7𝑥1 − 4.9, or −4.9 + 1.7𝑥1 − 𝑥2 = 0

 The explosions, which we want to classify with value 1, are to the
right of this line, with higher values of 𝑥1 and lower values of 𝑥2.

 So explosions are the points for which −4.9 + 1.7𝑥1 − 𝑥2 > 0, while
earthquakes have −4.9 + 1.7𝑥1 − 𝑥2 < 0.
Decision Boundary for a Linearly Separable Data
 We can write the classification hypothesis as
ℎ𝒘(𝒙) = 1 if 𝒘 · 𝒙 ≥ 0 and 0 otherwise
(assuming 𝑥0 = 1 as a dummy input).

 We can think of ℎ as the result of passing the linear function 𝒘 · 𝒙
through a hard threshold function:
ℎ𝒘(𝒙) = 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑(𝒘 · 𝒙), where
𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑(𝑧) = 1 if 𝑧 ≥ 0 and 0 otherwise.

Figure: The hard threshold function 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑(𝑧) with 0/1 output.
Note that the function is nondifferentiable at 𝑧 = 0.
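The hard-threshold hypothesis can be sketched directly, using the slide's own separator −4.9 + 1.7x1 − x2 = 0 as the weight vector w = [w0, w1, w2] = [−4.9, 1.7, −1.0] with dummy input x0 = 1. The two test points are made up for illustration:

```python
def threshold_classify(w, x):
    """h_w(x) = Threshold(w . x): 1 if w . x >= 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z >= 0 else 0

# The separator from the slide, -4.9 + 1.7*x1 - x2 = 0:
w = [-4.9, 1.7, -1.0]
assert threshold_classify(w, [1, 6.0, 4.0]) == 1   # -4.9 + 10.2 - 4.0 > 0: explosion
assert threshold_classify(w, [1, 4.0, 4.0]) == 0   # -4.9 + 6.8 - 4.0 < 0: earthquake
```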

The Perceptron Learning Rule
 How do we choose the weights 𝒘 to minimize the loss?

 We cannot solve mathematically for the weights by
setting the gradient to zero.

 Also, we cannot use gradient descent in weight
space! Why? Because the hard threshold makes ℎ𝒘(𝒙)
piecewise constant: its gradient is zero almost
everywhere and undefined at 𝑧 = 0.

 The Perceptron Learning Rule:
• A simple weight-update rule that converges
to a solution, provided the data are linearly
separable.
• For a single example (𝒙, 𝑦), we have
𝑤𝑖 ← 𝑤𝑖 + α (𝑦 − ℎ𝒘(𝒙)) · 𝑥𝑖

Behavior of The Perceptron Learning Rule
 The Perceptron Learning Rule:
• A simple weight-update rule that converges to a
solution, provided the data are linearly separable.
• For a single example (𝒙, 𝑦), we have
𝑤𝑖 ← 𝑤𝑖 + α (𝑦 − ℎ𝒘(𝒙)) · 𝑥𝑖

 For our 0/1 classification problem the perceptron rule
behaves as follows:
• If the output is correct, i.e., 𝑦 = ℎ𝒘(𝒙), then the
weights are not changed.
• If 𝑦 is 1 but ℎ𝒘(𝒙) is 0, then 𝑤𝑖 is increased
when the corresponding input 𝑥𝑖 is positive and
decreased when 𝑥𝑖 is negative. (This makes sense:
it makes 𝒘 · 𝒙 bigger, pushing ℎ𝒘(𝒙) toward 1.)
• If 𝑦 is 0 but ℎ𝒘(𝒙) is 1, then 𝑤𝑖 is decreased
when the corresponding input 𝑥𝑖 is positive and
increased when 𝑥𝑖 is negative. (This makes 𝒘 · 𝒙
smaller, pushing ℎ𝒘(𝒙) toward 0.)
 Typically the learning rule is applied one example at a
time, as in stochastic gradient descent.
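The rule above can be sketched as a short training loop. The AND dataset (with dummy input x0 = 1), α = 1, and the epoch cap are my own illustrative choices; AND is linearly separable, so the rule is guaranteed to converge on it:

```python
def perceptron_fit(examples, alpha=1.0, max_epochs=100):
    """Perceptron rule: w_i <- w_i + alpha * (y - h_w(x)) * x_i, per example."""
    n = len(examples[0][0])
    w = [0.0] * n
    predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            err = y - predict(x)           # 0 if correct, else +1 or -1
            if err:
                mistakes += 1
                w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                  # a full clean pass: separator found
            break
    return w

# AND function with dummy input x0 = 1 (linearly separable):
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = perceptron_fit(data)
```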
Convergence not quick but guaranteed for Linearly Separable Data

 Training curve: a training curve measures the classifier
performance on a fixed training set as the learning process
proceeds on that same training set.

 The curve shows the update rule converging to a zero-error
linear separator.

 The "convergence" process isn't exactly pretty, but for linearly
separable data it always works.

 This particular run takes 657 steps to converge, for a data set
with 63 examples, so each example is presented roughly 10
times on average.

What If Data Isn’t Linearly Separable
• Adding missing data makes our data linearly non- Original partial data
separable.

 The perceptron learning rule failing to converge


even after 10,000 steps: even though it hits the
minimum-error solution (three errors) many times,
the algorithm keeps changing the weights.

 Training curve:
After adding missing data

 In general, the perceptron rule may not converge


to a stable solution for a fixed learning rate 𝛼

 For decaying learning rate it converges to a


minimum error solution after a very long time
Linear Classifiers with a hard threshold

Linear Classifiers with a soft threshold (using logistic regression)

• Softening the threshold function by using the logistic function:
𝐿𝑜𝑔𝑖𝑠𝑡𝑖𝑐(𝑧) = 1 / (1 + 𝑒^(−𝑧))
• Hypothesis:
ℎ𝒘(𝒙) = 𝐿𝑜𝑔𝑖𝑠𝑡𝑖𝑐(𝒘 · 𝒙) = 1 / (1 + 𝑒^(−𝒘·𝒙))
• The output can be interpreted as a probability.
• The hypothesis forms a soft boundary in the input space (fig. c).
Linear Classifiers with a soft threshold (using logistic regression)
• Logistic regression: fitting the weights to minimize loss on a dataset.
• There is no easy closed-form solution.
• Gradient descent with L2 loss works.
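A sketch of gradient descent with L2 loss on a logistic hypothesis. The per-example update w_i ← w_i + α(y − h)·h·(1 − h)·x_i comes from differentiating (y − Logistic(w·x))²; the OR dataset, learning rate, and epoch count are my own choices:

```python
import math

def logistic(z):
    """Logistic(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_sgd(examples, alpha=0.5, epochs=5000):
    """Gradient descent on L2 loss for h_w(x) = Logistic(w . x):
    w_i <- w_i + alpha * (y - h) * h * (1 - h) * x_i."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in examples:
            h = logistic(sum(wi * xi for wi, xi in zip(w, x)))
            g = alpha * (y - h) * h * (1 - h)   # h*(1-h) is Logistic's derivative
            w = [wi + g * xi for wi, xi in zip(w, x)]
    return w

# OR function with dummy input x0 = 1:
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = logistic_sgd(data)
probs = [logistic(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data]
```

Unlike the hard threshold, the outputs in `probs` are graded: points near the boundary get values near 0.5 rather than an abrupt 0/1 flip.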

Linear Classifiers with a soft threshold (using logistic regression)

Training curves for the logistic classifier on our seismic-signal data:
 Slow but predictable convergence for linearly separable data (fig. a).
 Quick and reliable convergence for noisy, linearly non-separable data
(figs. b and c).

?? Open to Questions ! . . . 
