Stéphane Mottelet
Model: y = f_θ(x)
- x ∈ R : independent variable
- y ∈ R : dependent variable (value found by observation)
- θ ∈ R^p : parameters
Regression problem
The least squares method measures the fit with the Sum of Squared Residuals (SSR)
$$S(\theta) = \sum_{i=1}^{n} \bigl(y_i - f_\theta(x_i)\bigr)^2,$$
Important issues
- statistical interpretation
- existence, uniqueness and practical determination of θ̂ (algorithms)
$$y_i = f_\theta(x_i) + \varepsilon_i, \quad i = 1, \dots, n,$$
Density of y_i: for each i,
$$\phi_{i\theta} : \mathbb{R} \to \mathbb{R}, \qquad y \mapsto \phi_{i\theta}(y) = g(y - f_\theta(x_i)),$$
where g is the density of the error. In the Gaussian case,
$$\phi_{i\theta}(y) = (\sigma\sqrt{2\pi})^{-1} \exp\left(-\frac{1}{2\sigma^2}\,(y - f_\theta(x_i))^2\right).$$
Joint density
When θ is given, as the (yi ) are independent, the density of the vector y = (y1 , . . . , yn ) is
$$\phi_\theta(y) = \prod_{i=1}^{n} \phi_{i\theta}(y_i) = \phi_{1\theta}(y_1)\,\phi_{2\theta}(y_2)\cdots\phi_{n\theta}(y_n).$$
Interpretation: for D ⊂ R^n,
$$\mathrm{Prob}(y \in D \mid \theta) = \int_D \phi_\theta(y)\, dy_1 \cdots dy_n.$$
Likelihood function
When a sample of y is given, $L_y(\theta) \stackrel{\text{def}}{=} \phi_\theta(y)$, seen as a function of θ, is called the likelihood of θ.
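For Gaussian errors the maximum likelihood estimator coincides with the least squares estimator; the following short derivation (standard, not reproduced in the original slides) makes the link explicit:
$$-\log L_y(\theta) = \sum_{i=1}^{n}\left(\log(\sigma\sqrt{2\pi}) + \frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right) = n\log(\sigma\sqrt{2\pi}) + \frac{S(\theta)}{2\sigma^2},$$
so, for fixed σ, maximizing L_y(θ) is exactly minimizing S(θ).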
[Figure: densities of Gaussian vs. Laplacian random variables (with zero mean and unit variance)]
Doing least squares regression means that you assume that the model error is Gaussian.
1. The nice theoretical and computational framework you get is worth making this assumption...
2. A posteriori goodness-of-fit tests can be used to assess the normality of the errors.
Identity matrix
$$I = \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix}$$
For x ∈ R^n, y ∈ R^n,
$$\langle x, y\rangle = x^\top y = \sum_{i=1}^{n} x_i y_i, \qquad \|x\|^2 = x^\top x.$$
For a matrix A with m columns, rank(A) = m ⟺ (Ax = 0 ⇒ x = 0), i.e. A has full column rank.
When A is square and invertible, the solution of Ax = y is obtained in Scilab/Matlab with x = A\y.
For f : R^n → R^m, write f(x) = (f_1(x), ..., f_m(x))^⊤.
f is differentiable at a ∈ R^n if there exists a linear map f'(a) (the Jacobian) such that f(a + h) = f(a) + f'(a)h + o(‖h‖).
Gradient: if f : R^n → R is differentiable at a, its gradient ∇f(a) is the vector such that f'(a)h = ⟨∇f(a), h⟩ for all h.
A global minimizer x̂ of f satisfies ∀x ∈ R^n, f(x̂) ≤ f(x). When f is differentiable, a necessary condition is ∇f(x̂) = 0, which can be found (or not) by Newton's method applied to the gradient, or approached by gradient descent: given x_0, for each k,
$$x_{k+1} = x_k - \rho_k \nabla f(x_k),$$
where ρ_k > 0 is a step size.
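A minimal Scilab sketch of such a gradient descent on a simple quadratic function (the test matrix, step size and iteration count are illustrative assumptions, not from the slides):

// Gradient descent on f(x) = ||A*x - b||^2 (assumed test problem)
A = [1 2; 3 4; 5 6]; b = [1; 2; 3];
x = zeros(2,1);          // starting point x0
rho = 0.01;              // fixed step size (assumption)
for k = 1:2000
    g = 2*A'*(A*x - b);  // gradient of f at x
    x = x - rho*g;       // gradient step
end
disp(x)                  // approaches the least squares solution A\b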
Examples
- $y = \sum_{j=1}^{p} \theta_j x^{j-1}$
- $y = \sum_{j=1}^{p} \theta_j \cos\!\left(\frac{(j-1)x}{T}\right)$, where $T = x_n - x_1$
- ...
General linear model: $f_\theta(x) = \sum_{j=1}^{p} \theta_j \varphi_j(x)$
$$S(\theta) = \sum_{i=1}^{n} \bigl(f_\theta(x_i) - y_i\bigr)^2 = \sum_{i=1}^{n} \Bigl(\sum_{j=1}^{p} \theta_j \varphi_j(x_i) - y_i\Bigr)^2 = \|r(\theta)\|^2,$$
for the whole residual vector r(θ) = Aθ − y, where A has size n × p and $a_{ij} = \varphi_j(x_i)$.
$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^p} S(\theta), \qquad S(\theta) = \|A\theta - y\|^2.$$
Theorem : a solution of the LLS problem is given by θ̂, solution of the “normal equations”
$$A^\top A\,\hat{\theta} = A^\top y,$$
Uniqueness: θ̂ is unique when A has full column rank (rank(A) = p), since A^⊤A is then invertible.
When A is not square and has full (column) rank, the command
x=A\y
computes x, the unique least squares solution, i.e. such that norm(A*x-y) is minimal.
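As a quick illustration, both the normal equations and the backslash operator return the same minimizer; the data below are made up for the example:

// Normal equations vs. backslash on a small assumed data set
A = [1 1; 1 2; 1 3; 1 4]; y = [2.1; 3.9; 6.2; 8.1];
theta_ne = (A'*A)\(A'*y)          // solve the normal equations
theta_bs = A\y                    // QR-based least squares solution
disp(norm(theta_ne - theta_bs))   // same minimizer, up to rounding errors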
Algebraic distance
$$d(a, b, R) = \sum_{i=1}^{n} \bigl((x_i - a)^2 + (y_i - b)^2 - R^2\bigr)^2 = \|r\|^2$$
Expanding the squares gives r_i = x_i² + y_i² − 2a x_i − 2b y_i − c with c = R² − a² − b², which is linear in θ = (a, b, c), so circle fitting is a LLS problem. In Scilab:
A=[2*x,2*y,ones(x)]                    // columns 2x_i, 2y_i, 1
z=x.^2+y.^2                            // right-hand side x_i^2 + y_i^2
theta=A\z                              // linear least squares solution (a, b, c)
a=theta(1)
b=theta(2)
R=sqrt(theta(3)+a^2+b^2)               // recover the radius from c = R^2 - a^2 - b^2
t=linspace(0,2*%pi,100)
plot(x,y,"o",a+R*cos(t),b+R*sin(t))    // data points and fitted circle
For an exponential model such as fθ(x) = exp(θ1 + θ2 x), one could take logarithms, i.e. do simple linear regression of (log yi) against (xi), but this violates a fundamental hypothesis: the errors are assumed additive and Gaussian on yi, so after the transformation log yi − (θ1 + θ2 xi) is neither Gaussian nor of constant variance.
Remember that
$$S(\theta) = \|r(\theta)\|^2, \qquad r_i(\theta) = f_\theta(x_i) - y_i.$$
Applying Newton's method to ∇S(θ) = 0:
- needs to compute the Jacobian of the gradient itself (do you really want to compute second derivatives?),
- does not guarantee convergence towards a minimum.
Use the spirit of Newton's method as follows: start with θ0 and for each k, replace r by its affine approximation around θk and take θk+1 such that its squared norm
$$\|r(\theta_k) + r'(\theta_k)(\theta_{k+1} - \theta_k)\|^2$$
is minimal. Finding θk+1 − θk is a LLS problem!
Problem: what can you do when r'(θk) does not have full column rank?
Modify the Gauss-Newton iteration: pick a λ > 0 and take θk+1 such that
$$\|r(\theta_k) + r'(\theta_k)(\theta_{k+1} - \theta_k)\|^2 + \lambda\,\|\theta_{k+1} - \theta_k\|^2$$
is minimal. Finding θk+1 − θk is still a LLS problem, and for any λ > 0 a unique solution exists!
$$\theta_{k+1} = \theta_k - \tfrac{1}{2}\bigl(r'(\theta_k)^\top r'(\theta_k) + \lambda I\bigr)^{-1}\nabla S(\theta_k) = \theta_k - \tfrac{1}{2\lambda}\Bigl(\tfrac{1}{\lambda}\,r'(\theta_k)^\top r'(\theta_k) + I\Bigr)^{-1}\nabla S(\theta_k).$$
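A minimal Scilab sketch of this damped iteration, using the exponential model studied below; the synthetic data, starting point, fixed λ and iteration count are assumptions made for illustration (in practice λ is adapted at each step):

// Damped Gauss-Newton iteration for f_theta(x) = exp(theta1 + theta2*x)
x = (0:0.1:1)';                               // assumed design points
y = exp(1 - 3*x) + 0.01*rand(x, "normal");    // assumed synthetic data, true theta = (1, -3)
theta = [0; 0];                               // starting point theta_0
lambda = 1;                                   // fixed damping parameter (assumption)
for k = 1:50
    e = exp(theta(1) + theta(2)*x);
    r = e - y;                                // residual vector r(theta_k)
    J = [e, x.*e];                            // Jacobian r'(theta_k)
    g = 2*J'*r;                               // gradient of S at theta_k
    theta = theta - 0.5*((J'*J + lambda*eye(2,2))\g);   // damped step
end
disp(theta)                                   // should be close to (1, -3)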
Consider data (xi, yi) to be fitted by the nonlinear model fθ(x) = exp(θ1 + θ2 x):
function j=jac(theta,n)
    // Jacobian of the residual: columns dr_i/dtheta_1 and dr_i/dtheta_2
    e=exp(theta(1)+theta(2)*x);
    j=[e x.*e];
endfunction
load data_exp.dat                               // provides the data vectors x and y
theta0=[0;0];                                   // starting point
theta=lsqrsolve(theta0,resid,length(x),jac);    // nonlinear least squares solver
plot(x,y,"ob", x,exp(theta(1)+theta(2)*x),"r")  // data and fitted curve
θ̂ = (0.981, -2.905)
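The residual function resid passed to lsqrsolve is not reproduced above; a minimal sketch consistent with the model and with jac (its exact form in the original code is an assumption) is:

function r=resid(theta,n)
    // residual r_i(theta) = f_theta(x_i) - y_i for the exponential model
    r=exp(theta(1)+theta(2)*x)-y;
endfunction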
Enzymatic kinetics
$$s'(t) = \theta_2\,\frac{s(t)}{s(t) + \theta_3}, \quad t > 0, \qquad s(0) = \theta_1,$$
with yi = measurement of s at time ti and
$$S(\theta) = \|r(\theta)\|^2, \qquad r_i(\theta) = \frac{y_i - s(t_i)}{\sigma_i}.$$
Individual weights σi make it possible to take into account different standard deviations of the measurements.
function dsdt=michaelis(t,s,theta)
    dsdt=theta(2)*s/(s+theta(3))
endfunction
function r=resid(theta,n)
    s=ode(theta(1),0,t,list(michaelis,theta))   // integrate the ODE from s(0)=theta(1), passing theta
    r=(s-y)./sigma                              // weighted residuals
endfunction
load michaelis_data.dat       // provides t, y and sigma
n=length(y);                  // number of measurements (if not already provided by the data file)
theta0=[y(1);20;80];          // starting point
theta=lsqrsolve(theta0,resid,n)
θ̂ = (887.9, 37.6, 97.7)
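A short plotting sketch to compare the fitted kinetics with the data (variable names follow the code above; this snippet is not from the original slides):

// Plot the measurements and the fitted solution s(t; theta_hat)
tt = linspace(0, max(t), 200);
ss = ode(theta(1), 0, tt, list(michaelis, theta));
plot(t, y, "ob", tt, ss, "r")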
If not provided, the Jacobian r'(θ) is approximated by finite differences (but supplying the true Jacobian always speeds up convergence).
Since the data (yi)i=1...n is a sample of random variables, θ̂ is a random variable too!
- Monte-Carlo method: allows estimating the distribution of θ̂, but needs thousands of resamplings.
- Linearized statistics: very fast, but can be very approximate for high levels of measurement error.
The Monte Carlo method is a resampling method, i.e. it works by generating new samples of synthetic measurements and redoing the estimation of θ̂. Here the model is
$$y = \theta_1 + \theta_2 x + \theta_3 x^2,$$
and the data are corrupted by noise with σ = 1/2.
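A minimal Scilab sketch of such a resampling loop for this quadratic model (the data vectors x, y and the number of resamplings are assumptions):

// Monte Carlo resampling for the model y = theta1 + theta2*x + theta3*x^2
A = [ones(x), x, x.^2];                        // design matrix of the linear model
theta_hat = A\y;                               // estimate from the original data
res = y - A*theta_hat;
sigma_hat = sqrt(sum(res.^2)/(length(y)-3));   // estimated noise level
N = 1000;                                      // number of resamplings (assumption)
T = zeros(3, N);
for k = 1:N
    ys = A*theta_hat + sigma_hat*rand(y, "normal");   // synthetic measurements
    T(:,k) = A\ys;                             // re-estimated parameters
end
// empirical 95% confidence interval for theta1 (same idea for the others)
s1 = gsort(T(1,:), "g", "i");
[s1(round(0.025*N)), s1(round(0.975*N))]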
[Figure: Monte Carlo histograms of θ1, θ2, θ3 with confidence intervals at level 95%]
For linearized statistics, the residuals are weighted,
$$r_i(\theta) = \frac{y_i - f_\theta(x_i)}{\sigma_i},$$
where σi is the standard deviation of yi.
The covariance matrix of θ̂ can be approximated by
$$V(\hat{\theta}) = F(\hat{\theta})^{-1}, \qquad F(\hat{\theta}) = r'(\hat{\theta})^\top r'(\hat{\theta}),$$
which, for a linear model with matrix A and constant standard deviation σ, gives $V(\hat{\theta}) = \sigma^2 (A^\top A)^{-1}$.
d=derivative(resid,theta)            // Jacobian of the weighted residuals at theta_hat
V=inv(d'*d)                          // approximate covariance matrix of theta_hat
sigma_theta=sqrt(diag(V))            // standard deviation of each parameter
t_alpha=cdft("T",m-3,0.975,0.025);   // Student quantile (m data points, 3 parameters)
thetamin=theta-t_alpha*sigma_theta   // lower bounds of the confidence intervals
thetamax=theta+t_alpha*sigma_theta   // upper bounds of the confidence intervals
On the previous slide data has been fitted with the model
$$y = \sum_{k=0}^{p} \theta_k x^k, \qquad p = 0, \dots, 8,$$
p S(θ̂)
0 3470.32
1 651.44
2 608.53
3 478.23
4 477.78
5 469.20
6 461.00
7 457.52
8 448.10
p SV(θ̂p) (criterion evaluated by validation)
0 11567.21
1 2533.41
2 2288.52
3 259.27
4 326.09
5 2077.03
6 6867.74
7 26595.40
8 195203.35
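A sketch of how such a table can be produced in Scilab; the split into a fitting sample (xf, yf) and a validation sample (xv, yv), and all variable names, are assumptions:

// Fit polynomials of degree p = 0..8 and score them on a validation sample
for p = 0:8
    Af = []; Av = [];
    for k = 0:p
        Af = [Af, xf.^k];            // design matrix on the fitting sample
        Av = [Av, xv.^k];            // design matrix on the validation sample
    end
    theta = Af\yf;                   // least squares estimate theta_hat_p
    S  = norm(Af*theta - yf)^2;      // SSR on the fitting data
    SV = norm(Av*theta - yv)^2;      // criterion on the validation data
    mprintf("p = %d  S = %10.2f  SV = %10.2f\n", p, S, SV);
end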
Always evaluate your models, either by computing confidence intervals for the parameters or by using validation.