
M156 Lab assignment #1: Polynomial curve fitting

The goal of this first lab session is twofold: first, to get a grasp of linear regression problems and of the tools and mechanisms behind them, and second, to get used to programming in MATLAB.

The most useful MATLAB function is the help function. You can get information on the syntax and use of any function by simply typing help your_function, where your_function is the name of the function for which you want information.
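For example, to read the documentation of the linspace function used later in this session, type:

    help linspace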

Please include figures and plots in your reports when asked to plot a result or when you think it is necessary.
Don't forget to name the axes of your figures!

1 Introduction
We are interested in learning a curve from experimental data points $(x_1, t_1), (x_2, t_2), \dots, (x_N, t_N)$ (possibly corrupted with noise), where $N$ is the number of available data points, using a polynomial model. Here $x$ represents the abscissa, while $t$ represents the ordinate.

The goal is to be able to predict the value of t, given a new value of x. This is called a regression problem.
In the absence of information on the generating model of our data, we can use a flexible approach with a polynomial
model:
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j, \qquad (1)$$

where $\mathbf{w} = [w_0, w_1, \dots, w_M]^\top \in \mathbb{R}^{M+1}$, and $M$ is the degree of the polynomial. It turns out that any continuous function on a closed interval can be arbitrarily closely approximated by a polynomial (this is the Weierstrass approximation theorem), so the model seems reasonable.
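For illustration (the names and values below are arbitrary, not part of the assignment), such a polynomial can be evaluated in MATLAB with the built-in polyval, which expects the coefficients in descending order of degree:

    M = 3;                        % example degree (arbitrary choice)
    w = [1; -0.5; 0.2; 0.1];      % example coefficients [w_0; w_1; ...; w_M]
    x = linspace(-1, 1, 50);      % evaluation points
    y = polyval(flipud(w), x);    % flip to [w_M ... w_0], as polyval expects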

Now, the problem is to choose the degree of the polynomial we are going to use, and then to find its coefficients, such that the resulting model fits the data and represents the underlying process well enough.

In this lab session, we will see how we can find these coefficients and study the quality of the solutions depending
on a certain number of parameters.

2 Theory
In order to find the best coefficients for a given degree of our polynomial, we are going to solve an optimization
problem. The first thing to do is to define an appropriate cost function.

The only information we have is the data points, and we want our model to fit them well. It therefore seems natural that our cost function should depend on the distance between each observed value $t_n$ and the corresponding model prediction $y(x_n, \mathbf{w})$:
$$J(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left(y(x_n, \mathbf{w}) - t_n\right)^2 = \frac{1}{2}\sum_{n=1}^{N} \left(\mathbf{z}_n^\top \mathbf{w} - t_n\right)^2, \qquad (2)$$

where $\mathbf{z}_n = [1, x_n, x_n^2, \dots, x_n^M]^\top \in \mathbb{R}^{M+1}$.
This can be rewritten as:

$$J(\mathbf{w}) = \frac{1}{2} \|Z\mathbf{w} - \mathbf{t}\|_2^2, \qquad (3)$$

with $\mathbf{t} = [t_1, t_2, \dots, t_N]^\top \in \mathbb{R}^N$, and $Z = [\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N]^\top \in \mathbb{R}^{N \times (M+1)}$.
In the end, our cost function is the sum of squared differences between each of our data points and the corresponding model prediction, which can be written compactly as a squared Euclidean norm (Eq. (3)). The Euclidean norm is a natural choice, and the square is included for smoothness reasons. The $\frac{1}{2}$ factor is included for later convenience. Note that the motivation for this cost function is rather pragmatic here, but we will see later that theoretical justifications can be found for it.
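To make Eq. (3) concrete, here is a minimal MATLAB sketch of how $J(\mathbf{w})$ could be computed; the data and variable names here are arbitrary, not prescribed by the assignment:

    x = linspace(-pi, pi, 10)';   % example abscissae (column vector)
    t = sin(x);                   % example ordinates
    M = 3;                        % polynomial degree
    w = zeros(M+1, 1);            % some coefficient vector
    Z = x .^ (0:M);               % design matrix: n-th row is [1, x_n, ..., x_n^M]
                                  % (uses implicit expansion, MATLAB R2016b+)
    J = 0.5 * norm(Z*w - t)^2;    % cost function of Eq. (3)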

The only thing left to do in order to find our coefficients is to compute the minimizer of $J$:


$$\arg\min_{\mathbf{w}} \; \frac{1}{2} \|Z\mathbf{w} - \mathbf{t}\|_2^2. \qquad (4)$$
Questions:
1. Compute the gradient $\nabla J$ (also denoted as $\frac{dJ}{d\mathbf{w}}$) of $J$, that is, the vector whose entries are the partial derivatives of $J$ with respect to each of the coefficients of the polynomial.
2. Write down the equation $\nabla J = 0$, whose solutions will give us the minimizer. What kind of equation is this?
3. Compute the closed-form solution of the previous equation (assume that Z is a full rank matrix).

3 Experimental Part
3.1 Setup
1. Generate $N = 10$ data points equally spaced between $-\pi$ and $\pi$. Compute the values of $u$ using the following model:
$$u_n = \sin(x_n) \qquad (5)$$

Useful MATLAB function: linspace


2. Real-life datasets are very often corrupted by noise. The Signal to Noise Ratio (SNR, in dB) is defined as:
$$SNR = 10 \log_{10}\left(\frac{\mathrm{var}(U)}{\mathrm{var}(E)}\right) \qquad (6)$$

where $U$ denotes a random variable, $N$ realizations of which are given by the $u_n$. Similarly, $E$ is the random variable representing the noise values. $\mathrm{var}$ denotes the variance of the corresponding random variable, which is a measure of its energy.

An SNR of 0 dB means that the signal and the noise have equal energy. An SNR of 10 dB means that the energy of the signal is 10 times greater than that of the noise, an SNR of 20 dB means that it is 100 times greater, and so on.
Generate i.i.d. zero-mean Gaussian noise for the previously generated values such that the SNR is 20 dB (use the randn function). If you want to use the same noise realizations for the whole session, insert rng('default') in the preamble of your code. This will fix the seed of the random number generator.

3. Generate the actual observed signal $\mathbf{t}$ with $t_n = u_n + e_n$.


4. Plot the obtained signal values, as well as the true (noiseless) sine curve. The latter can simply be computed using many more samples than $N$ in the interval $[-\pi, \pi]$, which will give the impression that the curve is continuous. (A MATLAB sketch covering steps 1-4 is given after this list.)
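The following is one possible way to carry out steps 1-4; the variable names, and the way the noise is scaled to reach the target SNR, are a suggestion rather than a required implementation:

    rng('default');                 % fix the seed for reproducible noise (optional)

    N = 10;
    x = linspace(-pi, pi, N)';      % step 1: N equally spaced points in [-pi, pi]
    u = sin(x);                     % step 1: noiseless signal values

    SNRdB = 20;                     % step 2: target SNR in dB
    e = randn(N, 1);                % i.i.d. zero-mean Gaussian noise
    e = e * sqrt(var(u) / (10^(SNRdB/10) * var(e)));  % rescale so Eq. (6) gives 20 dB

    t = u + e;                      % step 3: observed (noisy) signal

    xf = linspace(-pi, pi, 1000);   % step 4: dense grid for a smooth-looking curve
    figure; plot(xf, sin(xf), '-', x, t, 'o');
    xlabel('x'); ylabel('t'); legend('true sine', 'noisy samples');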

3.2 Analysis
5. Code the solution to the polynomial curve fitting problem described in section 2. Observe the obtained fitting curve for various values of $M$ between 1 and $N-1$ (for a given noise realization). What seems to be the best value of $M$? (A sketch of steps 5-6 is given after this list.)

6. Compute the Root Mean Squared Error (RMSE) between the polynomial model and the data, for each value of $M$:
$$RMSE = \frac{1}{\sqrt{N}} \, \|Z\hat{\mathbf{w}} - \mathbf{t}\|_2 \qquad (7)$$
where $\hat{\mathbf{w}}$ represents the coefficients of the fitted polynomial (i.e. the solution of Problem (4)).

7. What happens when $M = N - 1$? How do you explain it theoretically? Is this a good solution?
8. The phenomenon happening for large values of $M$ ($M$ close to $N$ or larger) is called overfitting. Can you explain the name?
9. Now consider the case $N = 100$. Select 10 samples as training data, and learn the coefficients from this training set. Then compute the RMSE using the predicted values for all the remaining samples (referred to as test data). Conclude on the ability of this strategy to identify a good range of degree values to use (this is called model selection).
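A minimal sketch of steps 5-6, reusing the variables x, t and N from the setup sketch above; the backslash operator is MATLAB's built-in least-squares solve, which you should compare with your closed-form answer to question 3:

    % Fit polynomials of increasing degree and record the RMSE on the data
    rmse = zeros(N-1, 1);
    for M = 1:N-1
        Z       = x .^ (0:M);            % design matrix (implicit expansion, R2016b+)
        w_hat   = Z \ t;                 % least-squares solution of Problem (4)
        rmse(M) = norm(Z*w_hat - t) / sqrt(N);   % RMSE of Eq. (7)
    end
    figure; plot(1:N-1, rmse, 'o-');
    xlabel('M'); ylabel('RMSE');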
