
MPC Course - Part 2

Regression : predicting a variable using other variable(s)

Simon Malinowski

M1 Miage, Univ. Rennes 1

Simon Malinowski (Univ. Rennes 1/IRISA) 2020/2021 1 / 39


1 Introduction

2 Linear regression



What is regression ?
From data to predictions



Regression example

House   Size (sq. feet)   Year built   Price (target, in k€)
1       80                1985         120
2       92                2010         180
3       75                2008         145
4       110               2015         220
5       85                2000         140
6       150               1975         225
7       55                1992         100
8       105               1999         ??
9       95                2018         ??



Other regression examples

How many people will retweet your tweet ? (y)
→ depends on x : number of followers, popularity, subject of the tweet, ...
What will be your next salary ? (y)
→ depends on x : your degree, your experience, ...
How many points will a football team have at the end of the season ? (y)
→ depends on x : statistics about the team performance (goals, shots,
possession, ...)



Different kinds of regression

Regression → target variable is numerical

Simple VS multiple
→ simple regression : one feature variable (X) to predict Y
→ multiple regression : several feature variables (X1 , . . . , Xp ) to
predict Y

Linear VS Non-linear



Simple linear regression in a nutshell

Size Price
80 120
92 180
75 145
110 220
85 140
150 225
55 100
105 ??
95 ??



Simple linear regression in a nutshell

Year Price
1985 120
2010 180
2008 145
2015 220
2000 140
1975 225
1992 100
1999 ??
2018 ??



What we’ll see about simple regression

How to find the model ? (i.e. coefficients of the regression line)


How to evaluate the performance of a model ?
→ will the model be good at predicting new inputs ?
→ if I have more than one variable, which one seems the best ?



Multiple linear regression in a nutshell

Size   Year   Price
80     1985   120
92     2010   180
75     2008   145
110    2015   220
85     2000   140
150    1975   225
55     1992   100
105    1999   ??
95     2018   ??

Model : P = −2872 + 1.65 × S + 1.44 × Y
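The coefficients above come from ordinary least squares with two explanatory variables. A minimal plain-Python sketch (no libraries) that fits the same kind of model by solving the normal equations (XᵀX)β = Xᵀy with Gaussian elimination; the printed values may differ slightly from the slide's rounded coefficients:

```python
# Multiple linear regression P = b0 + b1*S + b2*Y on the house data,
# solved via the normal equations (X^T X) beta = X^T y.
sizes  = [80, 92, 75, 110, 85, 150, 55]
years  = [1985, 2010, 2008, 2015, 2000, 1975, 1992]
prices = [120, 180, 145, 220, 140, 225, 100]

# Design matrix with an intercept column.
X = [[1.0, s, yr] for s, yr in zip(sizes, years)]
y = prices

def solve_normal_equations(X, y):
    """Solve (X^T X) beta = X^T y by Gaussian elimination with pivoting."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    for col in range(p):                      # forward elimination
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p                          # back substitution
    for i in range(p - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][j] * beta[j]
                              for j in range(i + 1, p))) / A[i][i]
    return beta

beta = solve_normal_equations(X, y)
print("P = {:.0f} + {:.2f}*S + {:.2f}*Y".format(*beta))
```

The fitted model can then predict the missing prices, e.g. `beta[0] + beta[1]*105 + beta[2]*1999` for house 8.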



Multiple linear regression for non-linear regression

Size   Size²   Price
80     6400    120
92     8464    180
75     5625    145
110    12100   220
85     7225    140
150    22500   225
55     3025    100
105    11025   ??
95     9025    ??

Model : P = −91.3 + 3.94 × S − 0.012 × S²
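A quick check that adding the Size² column really helps: evaluating the slide's quadratic model against the best simple linear fit, both on the seven known houses (the quadratic coefficients below are the slide's rounded ones, not refitted):

```python
# Compare the slide's quadratic model with the best simple linear fit:
# adding the Size^2 column lets a *linear* regression capture curvature.
sizes  = [80, 92, 75, 110, 85, 150, 55]
prices = [120, 180, 145, 220, 140, 225, 100]

n = len(sizes)
x_bar, y_bar = sum(sizes) / n, sum(prices) / n

# Best simple linear fit P = b0 + b1*S (least squares, closed form).
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices)) \
     / sum((x - x_bar) ** 2 for x in sizes)
b0 = y_bar - b1 * x_bar

def ssr(predict):
    """Sum of squared residuals of a prediction function on the data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(sizes, prices))

ssr_linear = ssr(lambda s: b0 + b1 * s)
ssr_quad   = ssr(lambda s: -91.3 + 3.94 * s - 0.012 * s ** 2)  # slide's model

print(f"linear SSR = {ssr_linear:.0f}, quadratic SSR = {ssr_quad:.0f}")
```

On these houses the quadratic model's residual sum of squares comes out clearly below the linear one's, which is what the extra Size² feature buys.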



What we’ll see about multiple regression

How to find the model ? (i.e. coefficients of the model)


How to evaluate the performance of a model ?
→ will the model be good at predicting new inputs ?
→ how to compare models with different numbers of variables ?
Non-linear regression with multiple regression
Variable selection
→ do I really need all the variables I have, or might a subset be better ?
→ how to perform variable selection ? (criteria, methods)



Regression with non-numerical variables

Size District Price


80 Center 120
92 Bourg-Lesveque 180
75 Center 145
110 Longchamps 220
85 Longchamps 140
150 Bourg-Lesveque 225
55 Center 100
105 Longchamps ??
95 Center ??



1 Introduction

2 Linear regression
Data, problem
Linear model
Estimation of the coefficients
Coefficient of determination (R-squared)
Statistical tests about regression models
Estimating the general performance of a model



Data, problem
Simple regression : 1 numerical target variable, 1 numerical explanatory variable
Individuals   X    Y
1             x1   y1
...
i             xi   yi
...
n             xn   yn

Multiple regression : 1 numerical target variable, several numerical
explanatory variables

Individuals   X1    ...   Xj    ...   Xp    Y
1             x11   ...   x1j   ...   x1p   y1
...
i             xi1   ...   xij   ...   xip   yi
...
n             xn1   ...   xnj   ...   xnp   yn



Quantity of information in the target variable

It = Σ_{i=1}^{n} (yi − ȳ)²



1 Introduction

2 Linear regression
Data, problem
Linear model
Estimation of the coefficients
Coefficient of determination (R-squared)
Statistical tests about regression models
Estimating the general performance of a model



Linear model

We assume that Y is a linear function of the explanatory variables Xi

Y ≈ β0 + β1 X1 + β2 X2 + · · · + βp Xp



1 Introduction

2 Linear regression
Data, problem
Linear model
Estimation of the coefficients
Coefficient of determination (R-squared)
Statistical tests about regression models
Estimating the general performance of a model



Estimation of the coefficients

β0, . . . , βp are the "optimal" parameters (that we would find if we knew
all the data about the problem)

Aim : find β̂0, . . . , β̂p, estimations of β0, . . . , βp that are as
accurate as possible

→ Least-squares regression



Least-squares regression (simple regression case)

→ Objective : minimize the sum of squared residuals

Ir(β0, β1) = Σ_{i=1}^{n} (ŷi − yi)²
           = Σ_{i=1}^{n} ((β0 + β1 xi) − yi)²

Setting the partial derivatives to 0 gives :

β̂1 = Σ_{i} (xi − x̄)(yi − ȳ) / Σ_{i} (xi − x̄)²
β̂0 = ȳ − β̂1 x̄

We can also compute σ̂β1 : the uncertainty about the value of β̂1
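The closed-form estimates can be checked on the house data from the earlier slides; a minimal sketch in plain Python (no libraries), using exactly the two formulas above:

```python
# Closed-form least-squares estimates for Price = b0 + b1*Size
# on the seven houses with known prices.
sizes  = [80, 92, 75, 110, 85, 150, 55]
prices = [120, 180, 145, 220, 140, 225, 100]

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n

# b1_hat = sum (xi - x_bar)(yi - y_bar) / sum (xi - x_bar)^2
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices)) \
     / sum((x - x_bar) ** 2 for x in sizes)
# b0_hat = y_bar - b1_hat * x_bar
b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.1f}, b1 = {b1:.3f}")   # roughly 28.9 and 1.434
```

The fitted line then predicts house 8 (105 sq. feet) at `b0 + b1 * 105`, about 179 k€.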



1 Introduction

2 Linear regression
Data, problem
Linear model
Estimation of the coefficients
Coefficient of determination (R-squared)
Statistical tests about regression models
Estimating the general performance of a model



Coefficient of determination

We have seen that the quantity of information in the target variable is :


It = Σ_{i=1}^{n} (yi − ȳ)²

This quantity of information can be split into two parts :


the quantity of information explained by the model Im
the quantity of information not explained by the model Ir
We have It = Im + Ir
The R-squared is defined as :

R² = Im / It
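The decomposition It = Im + Ir and the resulting R² can be computed directly for the simple fit of Price on Size (same house data as before, plain Python):

```python
# Splitting It into Im + Ir and computing R^2 for the simple fit on Size.
sizes  = [80, 92, 75, 110, 85, 150, 55]
prices = [120, 180, 145, 220, 140, 225, 100]

n = len(sizes)
x_bar, y_bar = sum(sizes) / n, sum(prices) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices)) \
     / sum((x - x_bar) ** 2 for x in sizes)
b0 = y_bar - b1 * x_bar

i_t = sum((y - y_bar) ** 2 for y in prices)                       # total info
i_r = sum((b0 + b1 * x - y) ** 2 for x, y in zip(sizes, prices))  # residual
i_m = i_t - i_r     # explained part; It = Im + Ir holds for least squares
r2  = i_m / i_t

print(f"It = {i_t:.0f}, Im = {i_m:.0f}, Ir = {i_r:.0f}, R2 = {r2:.3f}")
```

Here about 81 % of the information in the prices is explained by the size alone.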



1 Introduction

2 Linear regression
Data, problem
Linear model
Estimation of the coefficients
Coefficient of determination (R-squared)
Statistical tests about regression models
Estimating the general performance of a model



Simple regression : Test for the significance of β1
Hypothesis 0 (H0) : X does not influence Y → β1 = 0
Hypothesis 1 (H1) : X has an influence on Y → β1 ≠ 0
Under H0, β̂1/σ̂β1 follows a Student's t distribution with n − 2 degrees of
freedom

[Figure : density of the t distribution ; the central 95% lies between
−t_{0.025}^{(n−2)} and t_{0.025}^{(n−2)}]


Simple regression : Test for the significance of β1
Under H0, β̂1/σ̂β1 follows a Student's t distribution with n − 2 degrees of
freedom

[Figure : the two tails beyond ±|β̂1|/σ̂β1 each carry probability pc/2]

pc : probability of making a wrong decision when deciding H1

pc < 0.05 ⇒ decide H1
Multiple regression : Test of significance of one coefficient

Hypothesis H0 : βj = 0 (Xj is not significant in the presence of the other
variables)
Hypothesis H1 : βj ≠ 0 (Xj is significant in the presence of the other
variables)

Test value : T = β̂j / σ̂βj

Under H0, T follows a Student's t distribution with (n − p − 1) degrees of
freedom.

If |T| > t_{0.025}^{(n−p−1)}, we reject H0.



Examples of the interest of this test

[Slides 28-30 : worked examples ; figures not reproduced]


1 Introduction

2 Linear regression
Data, problem
Linear model
Estimation of the coefficients
Coefficient of determination (R-squared)
Statistical tests about regression models
Estimating the general performance of a model



Is a given model good to make predictions ?

Are the coefficient of determination, Ir , or the statistical tests suited to
telling whether a given model will make good predictions ?



Ir and R² are not good indicators

Size Price
80 120
92 180
75 145
110 220
85 140
150 225
55 100



Ir and R² are not good indicators

Size   Size²   Price
80     6400    120
92     8464    180
75     5625    145
110    12100   220
85     7225    140
150    22500   225
55     3025    100



Ir and R² are not good indicators

Size   Size²   Size³     Price
80     6400    512000    120
92     8464    778688    180
75     5625    421875    145
110    12100   1331000   220
85     7225    614125    140
150    22500   3375000   225
55     3025    166375    100



Ir and R² are not good indicators

Size   Size²   Size³     Size⁴       Size⁵         Price
80     6400    512000    40960000    3276800000    120
92     8464    778688    71639296    6590815232    180
75     5625    421875    31640625    2373046875    145
110    12100   1331000   146410000   16105100000   220
85     7225    614125    52200625    4437053125    140
150    22500   3375000   506250000   75937500000   225
55     3025    166375    9150625     503284375     100



Generalisation error of a model

Definition : it is the average error that a model would make when predicting
any new individual.

Problem : it is impossible to know exactly, as we don't have any new
individuals...

→ we will try to obtain a fair estimation of this value.


2 main methods :
1 Train/Test split
2 K-fold cross validation



Train/Test split

Individuals   X1       ...   Xj       ...   Xp       Y
1             x11      ...   x1j      ...   x1p      y1
...
i             xi1      ...   xij      ...   xip      yi
i+1           xi+1,1   ...   xi+1,j   ...   xi+1,p   yi+1
...
n             xn1      ...   xnj      ...   xnp      yn

The individuals in blue (1 to i) are used to learn a model (or different
models)
The model is used to predict y for all the individuals in green (i+1 to n)
The generalization error is estimated by the mean squared error of these
predictions.
Drawback ?
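The procedure can be sketched in plain Python on the seven labelled houses (the split proportions and the seed are illustrative choices, not from the slides):

```python
# Train/test split: fit on the training part, estimate the generalization
# error with the MSE on the held-out part.
import random

sizes  = [80, 92, 75, 110, 85, 150, 55]
prices = [120, 180, 145, 220, 140, 225, 100]

rng = random.Random(0)                    # fixed seed for reproducibility
idx = list(range(len(sizes)))
rng.shuffle(idx)
test_idx, train_idx = idx[:2], idx[2:]    # ~30% test, ~70% train

def fit(ids):
    """Least-squares simple regression on the selected individuals."""
    xs, ys = [sizes[i] for i in ids], [prices[i] for i in ids]
    xb, yb = sum(xs) / len(xs), sum(ys) / len(ys)
    b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) \
         / sum((x - xb) ** 2 for x in xs)
    return yb - b1 * xb, b1

b0, b1 = fit(train_idx)
mse = sum((b0 + b1 * sizes[i] - prices[i]) ** 2 for i in test_idx) \
      / len(test_idx)
print(f"estimated generalization error (MSE) = {mse:.1f}")
```

Re-running with a different seed can change the estimate noticeably, since it depends on which individuals land in the test set; this is the usual motivation for K-fold cross validation.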



K-fold cross validation
Principle : the dataset is split into K disjoint subsets

[Figure : K training ("Apprentissage") / test configurations ; each of the K
subsets is used once as the test set, yielding errors ê1, ê2, . . . , êK,
while the other K − 1 subsets are used for training]

The estimation of the generalization error is the average of the K measured
errors.
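The K configurations above can be sketched in plain Python on the house data; K = 7 here gives leave-one-out (each fold contains a single house), an illustrative choice since the dataset is tiny:

```python
# K-fold cross validation: each fold serves once as the test set,
# and the generalization error is the average of the K fold errors e_k.
sizes  = [80, 92, 75, 110, 85, 150, 55]
prices = [120, 180, 145, 220, 140, 225, 100]

K = 7   # leave-one-out: with only 7 individuals, each fold has one house

def fit(ids):
    """Least-squares simple regression on the selected individuals."""
    xs, ys = [sizes[i] for i in ids], [prices[i] for i in ids]
    xb, yb = sum(xs) / len(xs), sum(ys) / len(ys)
    b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) \
         / sum((x - xb) ** 2 for x in xs)
    return yb - b1 * xb, b1

fold_errors = []
for k in range(K):
    test_ids  = [i for i in range(len(sizes)) if i % K == k]
    train_ids = [i for i in range(len(sizes)) if i % K != k]
    b0, b1 = fit(train_ids)
    e_k = sum((b0 + b1 * sizes[i] - prices[i]) ** 2 for i in test_ids) \
          / len(test_ids)
    fold_errors.append(e_k)

estimate = sum(fold_errors) / K
print(f"K-fold estimate of the generalization error = {estimate:.1f}")
```

Unlike a single train/test split, every individual is used once for testing, so the estimate does not depend on one arbitrary split.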
