
Polynomial Curve Fitting

BITS F464 – Machine Learning


Navneet Goyal
Department of Computer Science, BITS-Pilani, Pilani Campus, India
Polynomial Curve Fitting

• Seems a very trivial concept!!


• All of us know it well!!
• Why are we discussing it in a Machine Learning course?
• A simple regression problem!!
• It motivates a number of key concepts of ML!!
• Let’s discover…
A word about…
• Predictive Modeling
• Parametric vs. Non-parametric ML models
Fundamentals of Modeling
• Abstract representation of a real-world process
• Y=3X+2 is a very simple model of how variable Y might
relate to variable X
• Instance of a more general model structure Y =aX+b
• a & b are parameters
• θ is generally used to denote a generic parameter or a
set (or vector) of parameters
• θ={a,b}
• Values of the parameters are chosen by estimation, i.e., by minimizing or maximizing an appropriate score function that measures the fit of the model to the data
• Before we can estimate the parameters, we must choose an appropriate functional form for the model itself
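A minimal sketch of this estimation step (the data values below are invented for illustration; np.polyfit with degree 1 performs the least-squares fit of Y = aX + b):

```python
import numpy as np

# Hypothetical observations of the process; roughly Y = 3X + 2 plus noise
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])

# Estimate theta = {a, b} by minimizing the sum-of-squared-errors score function
a, b = np.polyfit(X, Y, deg=1)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")
```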
Fundamentals of Modeling
• Predictive modeling
– PM can be thought of as learning a mapping from an input set
of vector measurements x to a scalar output y
– Vector output also possible but rarely used in practice
– One of the variables is expressed as a function of the others (the predictor variables)
– Response variable: Y; predictor variables: Xi
– ŷ = f(x1, x2, …, xp; θ)
– When Y is quantitative, the task of estimating a mapping from the p-dimensional X to Y is called regression
– When Y is categorical, the task of learning a mapping from X
to Y is called classification learning or supervised
classification
Predictive Modeling
– Predictive modeling
• Predicts the value of some target characteristic of an
object on the basis of observed values of other
characteristics of the object
• Examples: Regression in ML (Prediction in DM) &
Supervised Learning in ML (Classification in DM)
Parametric vs. Non-parametric
Parametric Models
• Parametric models assume some finite set of
parameters θ.
• Given the parameters, future predictions, x, are
independent of the observed data, D
P(x|θ,D) = P(x|θ)
• Therefore, θ captures everything there is to know about the data.
• Complexity of the model is bounded even if the amount
of data is unbounded.
• This makes parametric models "stiff" (not very flexible)
Parametric vs. Non-parametric
Non-parametric Models
• Non-parametric models assume that the data
distribution cannot be defined in terms of such a
finite set of parameters.
• But they can often be defined by assuming an infinite-dimensional θ (or a flexible number of parameters).
• Usually, we think of θ as a function.
• The amount of information that θ can capture about
the data D can grow as the amount of data grows.
• This makes non-parametric models more "flexible"
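To make the contrast concrete, here is an illustrative sketch (not from the slides; the data and the choice k = 5 are arbitrary): the parametric model compresses everything it has learned into two numbers, while the non-parametric k-nearest-neighbour regressor must keep the whole training set, so its effective capacity grows with the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 50)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 50)

# Parametric: a straight line -- after fitting, theta = (a, b) is all we keep
a, b = np.polyfit(x_train, t_train, deg=1)
def predict_parametric(x):
    return a * x + b                                  # training data no longer needed

# Non-parametric: k-NN regression -- prediction consults the stored training data itself
def predict_knn(x, k=5):
    idx = np.argsort(np.abs(x_train - x))[:k]         # indices of the k nearest inputs
    return t_train[idx].mean()

print(predict_parametric(0.25), predict_knn(0.25))
```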
Polynomial Curve Fitting
• Observe a real-valued input variable x
• Use x to predict the value of a target variable t
• Synthetic data generated from sin(2πx)
• Random noise added to the target values

[Figure: training data plotted as target variable t against input variable x]

Reference: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Polynomial Curve Fitting
• N observations of x:
  x = (x1, …, xN)T
  t = (t1, …, tN)T
• Goal is to exploit the training set to predict the value of the target variable t for a new value of the input variable x
• Inherently a difficult problem

Data generation:
• N = 10, spaced uniformly in the range [0, 1]
• Targets generated from sin(2πx) by adding small Gaussian noise
• Such noise is typical of real data, e.g. due to unobserved variables

[Figure: the N = 10 training points plotted as target variable t against input variable x]
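The toy data set described above can be reproduced with a short NumPy sketch; the noise standard deviation (0.2) and the seed are arbitrary choices, since the slides only say "small Gaussian noise":

```python
import numpy as np

N = 10
x = np.linspace(0, 1, N)                              # inputs spaced uniformly in [0, 1]
rng = np.random.default_rng(1)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)     # sin(2πx) plus small Gaussian noise
```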

Polynomial Curve Fitting
• Fit the data using a polynomial function of the form
  y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M = Σ j=0..M wj x^j
• where M is the order of the polynomial
• Is a higher value of M better? We'll see shortly!
• The coefficients w0, …, wM are collectively denoted by the vector w
• y(x, w) is a nonlinear function of x, but a linear function of the coefficients w
• Such models are therefore called linear models
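A small sketch (not from the slides; the function name and coefficients are mine) of evaluating y(x, w), which makes the point that the model is just a weighted sum of the fixed basis functions 1, x, x², …, x^M and hence linear in w:

```python
import numpy as np

def poly(x, w):
    """y(x, w) = sum_j w[j] * x**j: nonlinear in x, linear in the coefficients w."""
    powers = np.arange(len(w))                        # exponents 0, 1, ..., M
    return np.sum(w * np.asarray(x)[..., None] ** powers, axis=-1)

w = np.array([0.5, -1.0, 2.0])                        # an arbitrary M = 2 example
print(poly(np.array([0.0, 0.5, 1.0]), w))             # -> [0.5, 0.5, 1.5]
```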

Sum-of-Squares Error Function
• Fit the polynomial by minimizing the sum-of-squares error between the predictions y(xn, w) and the target values tn:
  E(w) = (1/2) Σ n=1..N { y(xn, w) − tn }^2
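Because y(x, w) is linear in w, minimizing E(w) is a linear least-squares problem. A minimal sketch using a Vandermonde design matrix (function names are mine; with the x, t generated earlier, fit_polynomial(x, t, 3) should recover a good fit to sin(2πx)):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Return w* minimizing E(w) = 0.5 * sum_n (y(x_n, w) - t_n)**2."""
    Phi = np.vander(x, M + 1, increasing=True)        # design matrix: Phi[n, j] = x_n**j
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution
    return w_star

def sum_of_squares_error(x, t, w):
    Phi = np.vander(x, len(w), increasing=True)
    return 0.5 * np.sum((Phi @ w - t) ** 2)
```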
Polynomial curve fitting
• Choice of M??
• Called model selection or model comparison

0th Order Polynomial

Poor representation of sin(2πx)


1st Order Polynomial

Poor representation of sin(2πx)


3rd Order Polynomial

Best Fit to sin(2πx)


9th Order Polynomial

Over-fit: poor representation of sin(2πx)


Polynomial Curve Fitting
• Good generalization is the objective
• How does generalization performance depend on M?
• Consider a separate test set of 100 points
• Calculate E(w*) for both the training data and the test data
• Choose the M that minimizes the error on the test data
• Root-Mean-Square (RMS) error:
  ERMS = √( 2 E(w*) / N )
  – It is often convenient to use ERMS because the division by N allows us to compare data sets of different sizes on an equal footing
  – The square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t
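A sketch of this model-selection experiment (reusing fit_polynomial and the training data x, t from the earlier sketches; the test-set seed and noise level are again arbitrary). Training RMS keeps falling as M grows, while test RMS rises once the model over-fits:

```python
import numpy as np

def rms_error(x, t, w):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))       # equals sqrt(2 E(w*) / N)

rng = np.random.default_rng(2)
x_test = rng.uniform(0, 1, 100)                       # separate test set of 100 points
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 100)

for M in range(10):
    w_star = fit_polynomial(x, t, M)
    print(M, rms_error(x, t, w_star), rms_error(x_test, t_test, w_star))
```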

Flexibility & Model Complexity

• M = 0: very rigid!! Only 1 parameter to play with!


Flexibility & Model Complexity

• M = 1: not so rigid!! 2 parameters to play with!


French Curves – Optimum Flexibility
Flexibility & Model Complexity
• So what value of M is most suitable?

• Any answers???
Over-fitting
• For small M (0, 1, 2): too inflexible to capture the oscillations of sin(2πx)
• For M = 3–8: flexible enough to capture the oscillations of sin(2πx)
• For M = 9: too flexible!! Training error (TE) = 0, but generalization error (GE) is high
• Why is this happening?
Polynomial Coefficients

The table of fitted coefficients w* for increasing M (shown in Bishop) illustrates how the coefficient magnitudes grow dramatically as M increases
Data Set Size
• M = 9
• The larger the data set, the more complex the model we can afford to fit to the data
• Heuristic: the number of data points should be no less than 5–10 times the number of adaptive parameters in the model

Over-fitting Problem
• Should we limit the number of parameters according to the size of the available training set?
• The complexity of the model should depend only on the complexity of the problem!
• Least-squares error (LSE) minimization is a specific case of maximum likelihood
• Over-fitting is a general property of maximum likelihood
• The over-fitting problem can be avoided by adopting a Bayesian approach!
Over-fitting Problem
• In the Bayesian approach, the effective number of parameters adapts automatically to the size of the data set
• In the Bayesian approach, models can have more parameters than the number of data points

Regularization
• Penalize large coefficient values by adding a penalty term to the error function:
  Ẽ(w) = (1/2) Σ n=1..N { y(xn, w) − tn }^2 + (λ/2) ‖w‖^2
  where ‖w‖^2 = w0^2 + w1^2 + … + wM^2, and λ governs the relative importance of the regularization term
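The penalized error function still has a closed-form minimizer because it remains quadratic in w. A minimal sketch (function name is mine; reuses the training data x, t from the earlier sketches) of ridge-regularized polynomial fitting, which can tame the wildly oscillating M = 9 fit:

```python
import numpy as np

def fit_polynomial_ridge(x, t, M, lam):
    """Minimize 0.5*sum_n (y(x_n, w) - t_n)**2 + 0.5*lam*||w||**2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)             # normal equations with ridge penalty
    return np.linalg.solve(A, Phi.T @ t)

# e.g. M = 9 with a small amount of regularization (lambda chosen arbitrarily here)
w_reg = fit_polynomial_ridge(x, t, 9, lam=np.exp(-18))
```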

Regularization: ln λ = −∞ (i.e. λ = 0, no regularization)

M = 9

Regularization: ERMS vs. ln λ

As λ increases: flexibility → optimal flexibility → rigidness

Polynomial Coefficients

The table of fitted coefficients w* for M = 9 with different values of λ (shown in Bishop) illustrates how increasing λ shrinks the coefficient magnitudes
Takeaways from Polynomial Curve Fitting
• Concept of over-fitting
• Model Complexity & Flexibility
• Model Selection

We will keep revisiting them from time to time…


Mixture of Distributions/Gaussians
• Real data typically possess an underlying regularity,
which we wish to learn
– But the individual observations are corrupted by random
noise
• Data need not come from a single
distribution/Gaussian
• Read about Mixtures of Gaussians
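As a pointer for that reading, here is a tiny sketch (all mixture parameters are invented for illustration) of data drawn from a two-component Gaussian mixture rather than a single Gaussian:

```python
import numpy as np

rng = np.random.default_rng(3)
means   = np.array([-2.0, 3.0])                       # made-up component means
stds    = np.array([0.5, 1.0])                        # made-up component std devs
weights = np.array([0.3, 0.7])                        # mixing coefficients (sum to 1)

component = rng.choice(2, size=1000, p=weights)       # pick a component per sample
data = rng.normal(means[component], stds[component])  # sample from the chosen Gaussian
```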
