
Polynomial Curve Fitting

BITS F464 – Machine Learning


Navneet Goyal
Department of Computer Science, BITS-Pilani, Pilani Campus, India
Polynomial Curve Fitting

• Seems a very trivial concept!!


• All of us know it well!!
• Why are we discussing it in a Machine Learning course?
• A simple regression problem!!
• It motivates a number of key concepts of ML!!
• Let’s discover…
A word about…
• Predictive Modeling
• Parametric vs. Non-parametric ML models
Fundamentals of Modeling
• Abstract representation of a real-world process
• Y=3X+2 is a very simple model of how variable Y might
relate to variable X
• Instance of a more general model structure Y =aX+b
• a & b are parameters
• θ is generally used to denote a generic parameter or a
set (or vector) of parameters
• θ={a,b}
• Values of the parameters are chosen by estimation, i.e., by minimizing or maximizing an appropriate score function that measures the fit of the model to the data
• Before we can estimate the parameters, we must choose an appropriate functional form for the model itself
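A minimal sketch of this estimation step (the data values below are invented for illustration; np.polyfit with degree 1 performs the least-squares fit of Y = aX + b):

```python
import numpy as np

# Hypothetical observations of the process; roughly Y = 3X + 2 plus noise
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])

# Estimate theta = {a, b} by minimizing the sum-of-squared-errors score function
a, b = np.polyfit(X, Y, deg=1)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")
```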
Fundamentals of Modeling
• Predictive modeling
– PM can be thought of as learning a mapping from an input set
of vector measurements x to a scalar output y
– Vector output also possible but rarely used in practice
– One of the variables is expressed as a function of the others (the predictor variables)
– Response variable: Y; predictor variables: Xi
– ŷ = f(x1, x2, …, xp; θ)
– When Y is quantitative, the task of estimating a mapping from the p-dimensional X to Y is called regression
– When Y is categorical, the task of learning a mapping from X
to Y is called classification learning or supervised
classification
Predictive Modeling
– Predictive modeling
• Predicts the value of some target characteristic of an
object on the basis of observed values of other
characteristics of the object
• Examples: Regression in ML (Prediction in DM) &
Supervised Learning in ML (Classification in DM)
Parametric vs. Non-parametric
Parametric Models
• Parametric models assume some finite set of
parameters θ.
• Given the parameters, future predictions, x, are
independent of the observed data, D
P(x|θ,D) = P(x|θ)
• Therefore, θ captures everything there is to know about the data.
• Complexity of the model is bounded even if the amount
of data is unbounded.
• This makes parametric models "stiff" (not very flexible)
Parametric vs. Non-parametric
Non-parametric Models
• Non-parametric models assume that the data
distribution cannot be defined in terms of such a
finite set of parameters.
• But they can often be defined by assuming an infinite-dimensional θ (or a flexible number of parameters).
• Usually, we think of θ as a function.
• The amount of information that θ can capture about
the data D can grow as the amount of data grows.
• This makes non-parametric models more "flexible"
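To make the contrast concrete, here is an illustrative sketch (not from the slides; the data and the choice k = 5 are arbitrary): the parametric model compresses everything it has learned into two numbers, while the non-parametric k-nearest-neighbour regressor must keep the whole training set, so its effective capacity grows with the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 50)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 50)

# Parametric: a straight line -- after fitting, theta = (a, b) is all we keep
a, b = np.polyfit(x_train, t_train, deg=1)
def predict_parametric(x):
    return a * x + b                                  # training data no longer needed

# Non-parametric: k-NN regression -- prediction consults the stored training data itself
def predict_knn(x, k=5):
    idx = np.argsort(np.abs(x_train - x))[:k]         # indices of the k nearest inputs
    return t_train[idx].mean()

print(predict_parametric(0.25), predict_knn(0.25))
```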
Polynomial Curve Fitting
• Observe a real-valued input variable x
• Use x to predict the value of a target variable t
• Synthetic data generated from sin(2πx)
• Random noise added to the target values

[Figure: training data plotted as target variable t against input variable x]

Reference: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Polynomial Curve Fitting
• N observations of x:
  x = (x1, …, xN)T
  t = (t1, …, tN)T
• Goal is to exploit the training set to predict the value of the target variable t for a new value of the input variable x
• Inherently a difficult problem

Data generation:
• N = 10, spaced uniformly in the range [0, 1]
• Targets generated from sin(2πx) by adding small Gaussian noise
• Such noise is typical of real data, e.g. due to unobserved variables

[Figure: the N = 10 training points plotted as target variable t against input variable x]
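The toy data set described above can be reproduced with a short NumPy sketch; the noise standard deviation (0.2) and the seed are arbitrary choices, since the slides only say "small Gaussian noise":

```python
import numpy as np

N = 10
x = np.linspace(0, 1, N)                              # inputs spaced uniformly in [0, 1]
rng = np.random.default_rng(1)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)     # sin(2πx) plus small Gaussian noise
```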

Polynomial Curve Fitting
• Fit the data using a polynomial function of the form
  y(x, w) = w0 + w1 x + w2 x^2 + … + wM x^M = Σ j=0..M wj x^j
• where M is the order of the polynomial
• Is a higher value of M better? We'll see shortly!
• The coefficients w0, …, wM are collectively denoted by the vector w
• y(x, w) is a nonlinear function of x, but a linear function of the coefficients w
• Such models are therefore called linear models
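A small sketch (not from the slides; the function name and coefficients are mine) of evaluating y(x, w), which makes the point that the model is just a weighted sum of the fixed basis functions 1, x, x², …, x^M and hence linear in w:

```python
import numpy as np

def poly(x, w):
    """y(x, w) = sum_j w[j] * x**j: nonlinear in x, linear in the coefficients w."""
    powers = np.arange(len(w))                        # exponents 0, 1, ..., M
    return np.sum(w * np.asarray(x)[..., None] ** powers, axis=-1)

w = np.array([0.5, -1.0, 2.0])                        # an arbitrary M = 2 example
print(poly(np.array([0.0, 0.5, 1.0]), w))             # -> [0.5, 0.5, 1.5]
```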

Sum-of-Squares Error Function
• Fit the polynomial by minimizing the sum-of-squares error between the predictions y(xn, w) and the target values tn:
  E(w) = (1/2) Σ n=1..N { y(xn, w) − tn }^2
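Because y(x, w) is linear in w, minimizing E(w) is a linear least-squares problem. A minimal sketch using a Vandermonde design matrix (function names are mine; with the x, t generated earlier, fit_polynomial(x, t, 3) should recover a good fit to sin(2πx)):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Return w* minimizing E(w) = 0.5 * sum_n (y(x_n, w) - t_n)**2."""
    Phi = np.vander(x, M + 1, increasing=True)        # design matrix: Phi[n, j] = x_n**j
    w_star, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution
    return w_star

def sum_of_squares_error(x, t, w):
    Phi = np.vander(x, len(w), increasing=True)
    return 0.5 * np.sum((Phi @ w - t) ** 2)
```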
Polynomial curve fitting
• Choice of M??
• Called model selection or model comparison

0th Order Polynomial

Poor representation of sin(2πx)


1st Order Polynomial

Poor representation of sin(2πx)


3rd Order Polynomial

Best Fit to sin(2πx)


9th Order Polynomial

Over-fit: poor representation of sin(2πx)


Polynomial Curve Fitting
• Good generalization is the objective
• How does generalization performance depend on M?
• Consider a separate test set of 100 points
• Calculate E(w*) for both the training data and the test data
• Choose the M that minimizes the error on the test data
• Root-Mean-Square (RMS) error:
  ERMS = √( 2 E(w*) / N )
  – It is often convenient to use ERMS because the division by N allows us to compare data sets of different sizes on an equal footing
  – The square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t
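A sketch of this model-selection experiment (reusing fit_polynomial and the training data x, t from the earlier sketches; the test-set seed and noise level are again arbitrary). Training RMS keeps falling as M grows, while test RMS rises once the model over-fits:

```python
import numpy as np

def rms_error(x, t, w):
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))       # equals sqrt(2 E(w*) / N)

rng = np.random.default_rng(2)
x_test = rng.uniform(0, 1, 100)                       # separate test set of 100 points
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 100)

for M in range(10):
    w_star = fit_polynomial(x, t, M)
    print(M, rms_error(x, t, w_star), rms_error(x_test, t_test, w_star))
```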

Flexibility & Model Complexity

• M = 0: very rigid!! Only 1 parameter to play with!


Flexibility & Model Complexity

• M = 1: not so rigid!! 2 parameters to play with!


French Curves – Optimum Flexibility
Flexibility & Model Complexity
• So what value of M is most suitable?

• Any answers???
Over-fitting
• For small M (0, 1, 2): too inflexible to capture the oscillations of sin(2πx)
• For M = 3–8: flexible enough to capture the oscillations of sin(2πx)
• For M = 9: too flexible!! Training error (TE) = 0, but generalization error (GE) is high
• Why is this happening?
Polynomial Coefficients

The table of fitted coefficients w* for increasing M (shown in Bishop) illustrates how the coefficient magnitudes grow dramatically as M increases
Data Set Size
• M = 9
• The larger the data set, the more complex the model we can afford to fit to the data
• Heuristic: the number of data points should be no less than 5–10 times the number of adaptive parameters in the model

Over-fitting Problem
• Should we limit the number of parameters according to the size of the available training set?
• The complexity of the model should depend only on the complexity of the problem!
• Least-squares error (LSE) minimization is a specific case of maximum likelihood
• Over-fitting is a general property of maximum likelihood
• The over-fitting problem can be avoided by adopting a Bayesian approach!
Over-fitting Problem
• In the Bayesian approach, the effective number of parameters adapts automatically to the size of the data set
• In the Bayesian approach, models can have more parameters than the number of data points

Regularization
• Penalize large coefficient values by adding a penalty term to the error function:
  Ẽ(w) = (1/2) Σ n=1..N { y(xn, w) − tn }^2 + (λ/2) ‖w‖^2
  where ‖w‖^2 = w0^2 + w1^2 + … + wM^2, and λ governs the relative importance of the regularization term
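The penalized error function still has a closed-form minimizer because it remains quadratic in w. A minimal sketch (function name is mine; reuses the training data x, t from the earlier sketches) of ridge-regularized polynomial fitting, which can tame the wildly oscillating M = 9 fit:

```python
import numpy as np

def fit_polynomial_ridge(x, t, M, lam):
    """Minimize 0.5*sum_n (y(x_n, w) - t_n)**2 + 0.5*lam*||w||**2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)             # normal equations with ridge penalty
    return np.linalg.solve(A, Phi.T @ t)

# e.g. M = 9 with a small amount of regularization (lambda chosen arbitrarily here)
w_reg = fit_polynomial_ridge(x, t, 9, lam=np.exp(-18))
```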

Regularization: ln λ = −∞ (i.e. λ = 0, no regularization)

M = 9

Regularization: ERMS vs. ln λ

As λ increases: flexibility → optimal flexibility → rigidness

Polynomial Coefficients

The table of fitted coefficients w* for M = 9 with different values of λ (shown in Bishop) illustrates how increasing λ shrinks the coefficient magnitudes
Takeaways from Polynomial Curve Fitting
• Concept of over-fitting
• Model Complexity & Flexibility
• Model Selection

We will keep revisiting them from time to time…


Mixture of Distributions/Gaussians
• Real data typically possess an underlying regularity,
which we wish to learn
– But the individual observations are corrupted by random
noise
• Data need not come from a single
distribution/Gaussian
• Read about Mixtures of Gaussians
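As a pointer for that reading, here is a tiny sketch (all mixture parameters are invented for illustration) of data drawn from a two-component Gaussian mixture rather than a single Gaussian:

```python
import numpy as np

rng = np.random.default_rng(3)
means   = np.array([-2.0, 3.0])                       # made-up component means
stds    = np.array([0.5, 1.0])                        # made-up component std devs
weights = np.array([0.3, 0.7])                        # mixing coefficients (sum to 1)

component = rng.choice(2, size=1000, p=weights)       # pick a component per sample
data = rng.normal(means[component], stds[component])  # sample from the chosen Gaussian
```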
