Pratik Desai
Asst. Prof., MCA Dept., NMITD
Mob No: 9867498362
Learning Agents:
• Agents that can improve their behavior through diligent study of
their own experiences.
Supervised Learning:
• One class of learning problem: from a collection of input-output
pairs, learn a function that predicts the output for new inputs.
Deductive (or Analytical) Learning:
• Going from a known general rule to a new rule that is logically entailed, but is useful because it allows more efficient processing.
[Cartoons: "Bringing Out Occam's razor in peer-review" (Nature), credit: Matteo De Ventra; www.phdcomics.com]
How to Define the Best Fit? Errors vs Loss (and thus Error Rates vs Loss Functions)
Just because a hypothesis h has a low error rate on the training set does
not mean that it will generalize well.
Moreover, not all errors are equally costly, so we weight errors with a loss function rather than simply counting them. For example, a classifier with a 1% error rate, where almost all the errors were classifying spam as non-spam, would be better than a classifier with only a 0.5% error rate, if most of those errors were classifying non-spam as spam: losing a legitimate message is worse than seeing an extra spam message.
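A rough sketch of this comparison in code; the cost values and error breakdowns below are hypothetical, chosen only to illustrate how a loss function can reverse a ranking by raw error rate:

```python
# Compare two spam classifiers by expected loss rather than raw error rate.
# Costs and error breakdowns are hypothetical, for illustration only.

COST_FP = 10.0  # non-spam classified as spam (a real email is lost)
COST_FN = 1.0   # spam classified as non-spam (an extra spam is seen)

def expected_loss(fp_rate: float, fn_rate: float) -> float:
    """Cost-weighted loss given false-positive and false-negative rates."""
    return COST_FP * fp_rate + COST_FN * fn_rate

# Classifier A: 1% total error rate, almost all errors are spam -> non-spam.
loss_a = expected_loss(fp_rate=0.001, fn_rate=0.009)
# Classifier B: 0.5% total error rate, most errors are non-spam -> spam.
loss_b = expected_loss(fp_rate=0.004, fn_rate=0.001)

print(f"A: error rate 1.0%, expected loss {loss_a:.4f}")  # 0.0190
print(f"B: error rate 0.5%, expected loss {loss_b:.4f}")  # 0.0410
# A has the higher error rate but the lower expected loss.
```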
The task of finding the $h_\mathbf{w}$ that best fits the data is called linear regression.
• $\mathrm{Loss}(h_\mathbf{w}) = \sum_{j=1}^{N} L_2\big(y_j, h_\mathbf{w}(x_j)\big) = \sum_{j=1}^{N} \big(y_j - h_\mathbf{w}(x_j)\big)^2 = \sum_{j=1}^{N} \big(y_j - (w_1 x_j + w_0)\big)^2$
• The sum $\sum_{j=1}^{N} \big(y_j - (w_1 x_j + w_0)\big)^2$ is minimized when its partial derivatives with respect to $w_0$ and $w_1$ are zero.
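Setting both partial derivatives to zero and solving gives the standard closed-form least-squares solution for the univariate case:

$$w_1 = \frac{N\sum_j x_j y_j - \big(\sum_j x_j\big)\big(\sum_j y_j\big)}{N\sum_j x_j^2 - \big(\sum_j x_j\big)^2}, \qquad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}$$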
• In our case, the solution is $w_1 = 0.232$, $w_0 = 246$, and the line with those weights is shown as a dashed line in figure (a).
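A minimal sketch of this computation, assuming a hypothetical toy dataset in place of the Berkeley housing data (which is not reproduced here):

```python
import numpy as np

# Hypothetical (floor space, price) pairs standing in for the Berkeley data.
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([480.0, 620.0, 700.0, 830.0, 950.0])
N = len(x)

# Closed-form least-squares solution for y ~ w1*x + w0.
w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N

# Squared-error loss at the optimum; np.polyfit(x, y, 1) yields the same weights.
loss = np.sum((y - (w1 * x + w0))**2)
print(f"w1 = {w1:.3f}, w0 = {w0:.1f}, loss = {loss:.1f}")
```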
Figure (a): Data points of price versus floor space of houses for sale in Berkeley, CA, in July 2009, along with the linear-function hypothesis that minimizes squared-error loss: y = 0.232x + 246.
Figure (b): Plot of the loss function $\sum_{j=1}^{N}\big(w_1 x_j + w_0 - y_j\big)^2$ for various values of $w_0, w_1$. Note that the loss function is convex, with a single global minimum.
Mathematical Derivations – Stochastic Gradient Descent
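For a single training example $(x, y)$, differentiating the squared-error loss with respect to each weight gives the standard gradient-descent update rules (learning rate $\alpha$; the constant factor 2 from the derivative is absorbed into $\alpha$):

$$w_0 \leftarrow w_0 + \alpha\,\big(y - h_\mathbf{w}(x)\big), \qquad w_1 \leftarrow w_1 + \alpha\,\big(y - h_\mathbf{w}(x)\big)\,x$$

Stochastic gradient descent applies this update to one randomly chosen example at a time rather than to the whole training set.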
The $w_0$ term, the intercept, stands out as different from the others. We can fix that by inventing a dummy input attribute $x_{j,0}$ that is always equal to 1.
Then h is simply the dot product of the weights and the input vector (or equivalently, the matrix product of the transpose of the weights and the input vector):
$h_\mathbf{w}(\mathbf{x}_j) = \mathbf{w} \cdot \mathbf{x}_j = \mathbf{w}^\top \mathbf{x}_j = \sum_i w_i\, x_{j,i}$
Because the loss function is convex, gradient descent will reach the (unique) global minimum of the loss function.
Let $\mathbf{y}$ be the vector of outputs for the training examples, and $\mathbf{X}$ be the data matrix, i.e., the matrix of inputs with one $n$-dimensional example per row. Then the weight vector that minimizes squared error has the closed-form solution $\mathbf{w}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$.
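A minimal sketch contrasting the closed-form solution with batch gradient descent; the data and weight values here are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 examples, 2 input features, plus a dummy x0 = 1 column
# so the intercept w0 is handled uniformly with the other weights.
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
true_w = np.array([2.0, -3.0, 0.5])            # illustrative ground truth
y = X @ true_w + rng.normal(scale=0.1, size=50)

# Closed form: w* = (X^T X)^{-1} X^T y (solved via lstsq for numerical stability).
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch gradient descent on the squared-error loss.
w = np.zeros(3)
alpha = 0.01
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ w)   # gradient of sum_j (y_j - w . x_j)^2
    w = w - alpha * grad / len(y)   # averaged over examples for stability

print("closed form:     ", np.round(w_closed, 3))
print("gradient descent:", np.round(w, 3))  # converges to the same unique minimum
```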
Training curves (after adding the missing data) for the logistic classifier on our seismic signal data:
• Slow but predictable convergence for linearly separable data (fig. a)
• Quick and reliable convergence for noisy and linearly non-separable data (figs. b and c)
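A sketch of the kind of experiment behind these curves, using synthetic stand-ins for the seismic data (which is not included here); the update rule is the squared-error gradient step for the logistic hypothesis:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, epochs=200):
    """Logistic regression trained by stochastic gradient descent."""
    w = np.zeros(X.shape[1])
    accuracy = []
    for _ in range(epochs):
        for j in rng.permutation(len(y)):        # one example at a time
            h = sigmoid(w @ X[j])
            w += alpha * (y[j] - h) * h * (1 - h) * X[j]
        accuracy.append(np.mean((sigmoid(X @ w) > 0.5) == y))
    return w, accuracy

def make_data(n=100, shift=3.0, noise=1.0):
    """Two-class data; `shift` controls how separable the classes are."""
    X0 = rng.normal(loc=-shift / 2, scale=noise, size=(n, 2))
    X1 = rng.normal(loc=+shift / 2, scale=noise, size=(n, 2))
    X = np.hstack([np.ones((2 * n, 1)), np.vstack([X0, X1])])  # dummy x0 = 1
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

for name, shift in [("separable", 6.0), ("noisy/overlapping", 1.0)]:
    X, y = make_data(shift=shift)
    _, acc = train_logistic(X, y)
    print(f"{name}: final training accuracy {acc[-1]:.2f}")
```

Plotting the per-epoch accuracy lists would reproduce the shape of the training curves described above.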