
Introduction to Linear and Nonlinear Regression

thierry.chonavel@imt-atlantique.fr

FC IMT

18 June 2021
Regression: a probabilistic approach
Problem: in a probability space (Ω, A, P) we consider real-valued
random variables (RVs) X = X1, . . . , Xn and Y, and we look for the
best approximation (in some sense) of Y as a function of X.
Interest: for an experiment ω ∈ Ω and an observation
x1:n = X1:n(ω), one may be interested in finding an approximation
of the corresponding y = Y(ω).
Hypothesis: in what follows, the RVs Z involved are assumed to have
finite second-order moment: E[Z²] < ∞.
Remarks:
- Here, uppercase letters denote RVs and lowercase letters denote
  samples (realizations) of these RVs.
- Note, however, that in statistics lowercase letters are often used
  both for random variables and for their realizations, the
  distinction arising from the context.
- For possibly multivariate RVs (that is, random vectors), boldface
  notation will be used.

Regression: examples of application

- Predict the weight of a person from other measurements such as
  height, ...
- For a random process X = (Xn)n∈N, predict its value Y = Xn+1 at
  time n + 1 from the knowledge of its past values X1:n. This can be
  used for
  - weather forecast
  - data compression
  - ...
- Find the value taken by a random variable from the observation of
  a noisy version of it
- ...

Outline

1 Space L2(Ω, A, P)

2 Nonlinear regression

3 Linear regression

4 Implementation

5 Regression and Machine Learning

6 Exercise
Space L2(Ω, A, P)

Let C(X) denote the set of random variables that are equal to X almost
everywhere (a.e.): Y ∈ C(X) iff P(Y ≠ X) = 0.
Definition: L2(Ω, A, P) = {C(X); E[X²] < ∞}. In practice we
identify random variables with the class they belong to.
Then ‖X‖ = √(E[X²]) is a semi-norm for RVs on (Ω, A, P) with
finite second-order moment, but a norm in L2(Ω, A, P):

‖X‖ = 0 ⇔ C(X) = C(0), that is, X = 0 a.e.

Properties of L2(Ω, A, P)

Theorem (Riesz-Fischer theorem)

L2(Ω, A, P) is a Hilbert space, that is, a complete normed space where
the norm ‖X‖ = √(E[X²]) is induced by the scalar product ⟨X, Y⟩ = E[XY].

A nice feature of Hilbert spaces is that projection onto closed convex
sets or closed subspaces operates as in Euclidean spaces. A very useful
application of this property is that it allows one to solve approximation
problems.

Theorem (projection theorem in Hilbert spaces)

Let H denote a Hilbert space and F a closed subspace of H. Then, ∀X ∈ H,
∃! X̂ ∈ F, X̂ = arg min_{Z ∈ F} ‖X − Z‖².
In addition, X̂ is the only element of F such that ∀Z ∈ F, ⟨X − X̂, Z⟩ = 0.

Proofs will be supplied in a separate note.

Regression function (I)

Let Y ∈ L2(Ω, A, P) be a RV that we want to approximate from X, a
scalar- or vector-valued RV with entries in L2(Ω, A, P), which we write
X ∈ L2(Ω, A, P).
As we only know X, we look for an approximation of Y of the form
Ŷ = h(X), where h(X) ∈ L2(Ω, A, P).
To avoid too many measure-theoretic considerations, in what follows
we shall restrict our attention to the case where all RVs are
absolutely continuous (discrete-valued RVs are treated the same
way, replacing integrals by sums).
Then,

h(X) ∈ L2(Ω, A, P) ⇔ ∫_Ω [h(X(ω))]² P(dω) = ∫ h(x)² p(x) dx < ∞,

where p(x) denotes the pdf of X.

Thus, h(X) ∈ L2(Ω, A, P) ⇔ h ∈ L2(R, B(R), PX), where B(R) is
the Borel σ-algebra and PX the distribution of X.

Regression function (II)

{g(X); g ∈ L2(R, B(R), PX)} is a closed Hilbert subspace of
L2(Ω, A, P) and we can apply the projection theorem:
∃! h ∈ L2(R, B(R), PX), h = arg min_{g ∈ L2(R, B(R), PX)} E[(Y − g(X))²].
y = h(x) is called the regression function of Y knowing that X = x
and is denoted by h(x) = E[Y | X = x]. It is given by the following
result:

Theorem (conditional expectation in L2(Ω, A, P))
If (X, Y) ∈ L2(Ω, A, P) with pdf p(x, y), then

E[Y | X = x] = ∫ y p(x, y)/p(x) dy.

Remark: the theorem gives a projection interpretation of the mean of
the conditional pdf p(y|x) = p(x, y)/p(x).

Exercise: prove the theorem. (Hint: use the characterization
property, ∀g ∈ L2(R, B(R), PX), E[(Y − h(X))g(X)] = 0.)
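As an illustration of the theorem (not taken from the slides), the conditional mean can be checked numerically on a simple joint density. The standard bivariate Gaussian pair below is an arbitrary choice for which E[Y | X = x] is known in closed form (ρx when the correlation is ρ):

```python
import numpy as np

# Illustrative joint density: standard bivariate Gaussian with correlation rho.
rho = 0.6

def p_xy(x, y):
    # joint pdf of (X, Y)
    det = 1.0 - rho**2
    q = (x**2 - 2 * rho * x * y + y**2) / det
    return np.exp(-0.5 * q) / (2 * np.pi * np.sqrt(det))

x0 = 0.8
y = np.linspace(-8.0, 8.0, 20001)
dy = y[1] - y[0]

p_x = np.sum(p_xy(x0, y)) * dy                 # marginal p(x0) = ∫ p(x0, y) dy
h_x0 = np.sum(y * p_xy(x0, y)) * dy / p_x      # E[Y | X = x0] = ∫ y p(x0, y)/p(x0) dy

print(h_x0, rho * x0)   # the numerical value matches the closed form rho * x0
```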
Linear regression

For nonlinear regression, the computation of
E[Y | X = x] = ∫ y p(x, y)/p(x) dy requires knowledge of p(x, y) and
the computation of an integral.
Remarks:
(i) Here we assume Y(ω) ∈ R^p with p ≥ 1, and the expression of
E[Y | X = x] is a straightforward extension of the scalar case for Y;
(ii) for large p, computing E[Y | X = x] can be demanding.
Simpler than conditional nonlinear regression: linear regression
Ŷ = AX + b, with (A, b) = arg min_{C,d} E[‖Y − (CX + d)‖²].
Solution: Ŷ = E[Y] + cov(Y, X) cov(X)⁻¹ (X − E[X]).
When X(ω), Y(ω) ∈ R, we get the regression line of Y on X:

y = E[Y] + (σ_YX / σ_X²) (x − E[X]).

Linear regression: the Gaussian case

Theorem (Gaussian conditioning)

If (X, Y) is jointly Gaussian with mean (m_X, m_Y) and covariance matrix

[ Γ_X    Γ_XY
  Γ_YX   Γ_Y ],

then E[Y | X] = m_Y + Γ_YX Γ_X⁻¹ (X − m_X), and the conditional
distribution of Y knowing that X = x is of the form

Y | X = x ∼ N( m_Y + Γ_YX Γ_X⁻¹ (x − m_X), Γ_Y − Γ_YX Γ_X⁻¹ Γ_YX^T ).

Exercise: prove the theorem.
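A minimal numerical sketch of these formulas, with illustrative (made-up) parameter values; it simply evaluates the conditional mean and covariance stated in the theorem:

```python
import numpy as np

# Illustrative joint Gaussian parameters (made up for this sketch).
m_x, m_y = np.array([0.0]), np.array([1.0])
G_x  = np.array([[2.0]])          # Gamma_X
G_yx = np.array([[0.8]])          # Gamma_YX
G_y  = np.array([[1.5]])          # Gamma_Y

def conditional(x):
    """Mean and covariance of Y | X = x for a jointly Gaussian (X, Y)."""
    K = G_yx @ np.linalg.inv(G_x)         # regression matrix Gamma_YX Gamma_X^{-1}
    mean = m_y + K @ (x - m_x)            # conditional mean
    cov = G_y - K @ G_yx.T                # Gamma_Y - Gamma_YX Gamma_X^{-1} Gamma_YX^T
    return mean, cov

print(conditional(np.array([1.0])))
```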

Practical implementation of linear regression from data

Assume that the means, covariances, or distribution of (X, Y) ∈ R^p × R^q are
not known, but that a sample set S = {(x_k, y_k)}_{k=1:n}, consisting of
independent realizations of (X, Y), is available.
Let m̂_X = (1/n) Σ_{k=1:n} x_k, m̂_Y = (1/n) Σ_{k=1:n} y_k,

Ĉ_YX = (1/n) Σ_k (y_k − m̂_Y)(x_k − m̂_X)^T and

Ĉ_X = (1/n) Σ_k (x_k − m̂_X)(x_k − m̂_X)^T.

Then, the linear regression function of Y knowing that X = x is given by

y = m̂_Y + Ĉ_YX Ĉ_X⁻¹ (x − m̂_X).    (1)
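A minimal sketch of formula (1) on synthetic data (assuming numpy; the data-generating model below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample (illustrative): x in R^2, y in R, y = [1, -2] x + noise.
n = 1000
x = rng.normal(size=(n, 2))
y = (x @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n))[:, None]

m_x, m_y = x.mean(axis=0), y.mean(axis=0)
C_yx = (y - m_y).T @ (x - m_x) / n          # empirical C_YX
C_x  = (x - m_x).T @ (x - m_x) / n          # empirical C_X

def predict(x_new):
    """Empirical linear regression (1): y = m_Y + C_YX C_X^{-1} (x - m_X)."""
    return m_y + C_yx @ np.linalg.solve(C_x, x_new - m_x)

print(predict(np.array([0.5, -1.0])))       # close to 0.5*1 + (-1)*(-2) = 2.5
```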
Remark about the implementation of nonlinear regression from data:
- in the case where X and Y are discrete-valued random variables,

  E[Y | X = x] ≈ Σ_{y ∈ Y(Ω)} y × P̂(X = x, Y = y) / P̂(X = x),    (2)

  where P̂(X = x, Y = y) = n_xy/n, P̂(X = x) = n_x/n,
  n_xy = #{k; (x_k, y_k) = (x, y)}, n_x = #{k; x_k = x}
  (a minimal counting sketch is given after this list);
- the case of absolutely continuous random variables is addressed in
  the lesson about kernel methods.
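The counting estimate (2) reduces to averaging the y_k over the samples for which x_k = x; a minimal sketch on made-up discrete data:

```python
import numpy as np

# Illustrative discrete sample: estimate E[Y | X = x] by empirical averaging.
xk = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
yk = np.array([1, 3, 2, 2, 4, 5, 5, 1, 2, 6])

def cond_mean(x):
    """Counting estimate (2): sum_y y * n_xy / n_x = mean of y_k over {k : x_k = x}."""
    mask = (xk == x)
    return yk[mask].mean()

print(cond_mean(1))   # empirical estimate of E[Y | X = 1]
```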
Example

Study the regression of Y w.r.t. X, with

X = (π/2)(1 − 2Z1),  Y = (0.1X + cos X) × (1 + Z2)    (3)

where Z1 ∼ U[0,1] and Z2 ∼ N(0, 1) are independent.
Clearly, E[Y | X] = 0.1X + cos X.
Comparison with linear regression:
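A minimal simulation sketch of this comparison (assuming numpy; plotting is omitted and the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Model (3): X = (pi/2)(1 - 2 Z1), Y = (0.1 X + cos X)(1 + Z2).
z1 = rng.uniform(size=n)
z2 = rng.normal(size=n)
x = 0.5 * np.pi * (1.0 - 2.0 * z1)
y = (0.1 * x + np.cos(x)) * (1.0 + z2)

# Empirical linear regression of Y on X.
a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - a * x.mean()

xs = np.linspace(-np.pi / 2, np.pi / 2, 200)
nonlinear = 0.1 * xs + np.cos(xs)        # exact regression function E[Y | X = x]
linear = a * xs + b                      # fitted regression line

print(np.abs(nonlinear - linear).max())  # the line misses the cosine curvature
```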

Regression and Machine Learning

Not all approximation problems in machine learning boil down to
regression between random vectors.
Often, we look for some parameter θ involved in a distance d between
two quantities x and y; θ is obtained by minimizing this distance.
For example, we consider a transform Fθ (linear or nonlinear) such
that θ is estimated by θ̂ = arg min_θ d(Fθ(x), y).
d(Fθ(x), y) can be a very complex function of θ, and θ itself can be
high-dimensional. Iterative optimization techniques such as gradient
descent algorithms are then often used.
Example: to train neural networks, which are made of a succession of
linear and nonlinear transforms and whose linear-transform weights θ
must be learned, we can search for θ by minimizing
Σ_i ‖Fθ(x_i) − y_i‖², where the training pairs (x_i, y_i)_{i∈I}
represent input data x_i and the corresponding desired outputs y_i.
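A minimal gradient descent sketch on the criterion Σ_i ‖Fθ(x_i) − y_i‖², taking for simplicity a linear Fθ rather than a neural network (the data, step size, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data and a simple linear F_theta(x) = theta . x (assumption:
# any differentiable F_theta, e.g. a neural network, could replace it).
n, d = 200, 3
x = rng.normal(size=(n, d))
theta_true = np.array([1.0, -0.5, 2.0])
y = x @ theta_true + 0.05 * rng.normal(size=n)

theta = np.zeros(d)
lr = 0.1
for _ in range(500):
    residual = x @ theta - y                # F_theta(x_i) - y_i
    grad = 2.0 * x.T @ residual / n         # gradient of the averaged criterion
    theta -= lr * grad                      # gradient descent step

print(theta)    # close to theta_true; averaging over n has the same minimizer as the sum
```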

Exercise

A machine has to interrupt its operation at random times due to failures.

We assume that the time to the next failure of the machine is a random
variable X with exponential distribution. We also assume that distinct
machines have distinct mean times between failures and that the failure
intensity is a random variable Λ with uniform distribution over an
interval [a, b]. Thus, X | Λ = λ ∼ E(λ) and Λ ∼ U[a,b].
We want to predict the value of Λ knowing that X = x. Compute the
regression curve λ = E[Λ | X = x].
Solution: see the notebook supplied with the slides of the lesson about
regression on Moodle [1].

[1] https://moodle.imt-atlantique.fr/course/view.php?id=1162
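Independently of the notebook, a quick numerical check of the regression curve can be obtained by discretizing the posterior of Λ; the interval [a, b] = [1, 2] below is an arbitrary choice, not part of the exercise:

```python
import numpy as np

# Numerical sketch of the regression curve E[Lambda | X = x].
a, b = 1.0, 2.0
lam = np.linspace(a, b, 2001)

def cond_mean_lambda(x):
    """E[Lambda | X = x] for X | Lambda = lam ~ E(lam) and Lambda ~ U[a, b]."""
    # Posterior on [a, b] is proportional to the exponential likelihood
    # lam * exp(-lam * x) times the (constant) uniform prior.
    w = lam * np.exp(-lam * x)
    return np.sum(lam * w) / np.sum(w)

print(cond_mean_lambda(0.5), cond_mean_lambda(3.0))
```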


Conclusion

In this lesson, we have

- Recalled the projection theorem in Hilbert spaces.
- Defined the Hilbert space L2(Ω, A, P) of random variables with
  finite second-order moment.
- Established that the conditional expectation is the minimum mean
  squared error (MMSE) approximation in L2(Ω, A, P).
- Presented the linear regression formula.
- Described a simulation example.
