
Prediction methods & Machine learning

Session I

Pierre Michel
pierre.michel@univ-amu.fr

M2 EBDS

2021
1. Introduction

1.1 What is Machine Learning?



Machine Learning

“machine learning is a field that develops algorithms designed to be applied to datasets, with the main areas of focus being prediction (regression), classification, and clustering or grouping tasks.” — Susan Athey, 2018

A field at the crossroads of three disciplines:

• Statistics
• Computer Science
• Artificial Intelligence

“Machine Learning” in French is “Apprentissage Machine” or “Apprentissage Automatique”.


Figure 1: Machine Learning.


Some examples of applications

• Create groups of customers (“market segmentation”), in order to define the best commercial strategy;
• Recommend products to each customer, based on purchases from similar customers;
• Automatically detect transactions that appear to be fraudulent;
• Predict next year’s revenue, based on the company’s historical data;
• Automatically diagnose pathologies;
• Recognize patterns (images, sounds, texts, etc.).


Pattern Recognition in images

Figure 2: Source: MNIST dataset.



Automatic diagnosis

Figure 3: Source: NVidia.



Connected objects (e-Health)

Figure 4: Source: iRhythm.


1.2 Supervised and unsupervised learning

Supervised versus Unsupervised

[Figure: four example tasks: classification (supervised), clustering (unsupervised), regression (supervised), and density estimation (unsupervised).]


Supervised learning (classification)

Figure 5: Example of classification.


Supervised learning (regression)

Figure 6: Example of regression.


Unsupervised learning (clustering)

Figure 7: Example of clustering.

1.3 Advanced Machine Learning

Semi-supervised learning

Figure 8: Example of semi-supervised learning.


Reinforcement learning

1.4 Which tools to use?

Python and Machine Learning libraries

This course is based on the Python programming language and the following libraries:

• Scikit-Learn (http://scikit-learn.org): offers many Machine Learning algorithms (classification, regression, clustering...)
• NumPy (http://numpy.org): scientific computing, linear algebra, N-dimensional arrays...
• pandas (http://pandas.pydata.org): allows the handling and analysis of data.
• Matplotlib (http://matplotlib.org): allows plotting and visualizing data in graphical form.
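As a quick check that this stack is available (a minimal sketch; printing versions is just one way to verify the installation):

import sklearn
import numpy as np
import pandas as pd
import matplotlib

# Each of these libraries exposes its version string
print(sklearn.__version__, np.__version__, pd.__version__, matplotlib.__version__)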


Install Anaconda

Anaconda:

• free and open source distribution of the Python and R languages
• applications in data science and machine learning
• available on Windows, Mac, Linux
• Anaconda Navigator interface: JupyterLab, Jupyter Notebook, Anaconda Prompt, Spyder, RStudio...


Anaconda Prompt

First of all, you have to choose a directory in which you are going to work; open Anaconda Prompt:

cd my_folder              # change current directory
mkdir ml                  # create a new folder named ml
cd ml                     # ml becomes the current folder
conda list                # print installed packages
python                    # run python
>>> print("Hello World")  # print a message
>>> quit()                # quit python
anaconda-navigator        # run Anaconda Navigator
jupyter notebook          # run Jupyter Notebook

Note: All this can also be done manually.


Anaconda Navigator

Figure 10: Anaconda Navigator interface.


Jupyter Notebook

Figure 11: Jupyter Notebook interface.

2. Linear regression

2.1 Model representation



Context

• We talk about supervised learning when we know the “true response” for each observation.
• A regression problem consists in predicting a real value...
• ...unlike a classification problem, which consists in predicting a discrete value.


Example: housing prices (Portland)


[Figure: scatter plot of Price versus Size (Portland housing data).]


Regression line

[Figure: the same scatter plot with a fitted regression line.]


Prediction

[Figure: on the regression line, a size of 2500 corresponds to a predicted price of 406509.5.]


Learning sample

Size (x)    Price (y)
1600        329900
2400        369000
1416        232000
3000        539900
1985        299900
1534        314900
Notations:

• n: number of observations in the learning sample
• x: explanatory variable (or predictor)
• y: variable to explain (or target)
• (x, y): one observation (or example)
• (x^(i), y^(i)): the i-th observation


Learning

From the learning sample and a learning algorithm, we look for a function h which expresses y as a function of x. We represent it as a linear function:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where θ = (θ0, θ1) represents the parameters of the model.
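As a sketch, the model function can be written directly in Python (the parameter values below are purely illustrative):

def h(theta0, theta1, x):
    # Linear hypothesis h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

print(h(80000, 120, 2500))  # prediction for a size of 2500, with illustrative parameters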

2.2 Cost function



Parameters

From the data:

Size (x)    Price (y)
1600        329900
2400        369000
1416        232000
3000        539900
1985        299900
1534        314900

Considering the linear function hθ(x) = θ0 + θ1 x:
How to estimate the parameters θ0 and θ1?


Different choices of parameters


[Figure: three lines for different parameter choices, panels (a), (b), (c).]

• (a): θ0 = 1.5 and θ1 = 0
• (b): θ0 = 0 and θ1 = 0.5
• (c): θ0 = 1 and θ1 = 0.5

Idea

[Figure: scatter plot of observations with a candidate regression line.]

We choose θ0 and θ1 so that hθ(x) is as close as possible to the observations.

Optimization problem

We seek to minimize the cost function J(θ0, θ1) defined by:

$$J(\theta_0, \theta_1) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

The minimization problem can thus be written:

$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$

Note: up to the factor 1/2, this cost function is the residual sum of squares.
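A direct NumPy transcription of this cost function (a minimal sketch; the arrays are the Portland learning sample from the slides):

import numpy as np

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = 1/2 * sum of squared residuals
    residuals = (theta0 + theta1 * x) - y
    return 0.5 * np.sum(residuals ** 2)

x = np.array([1600, 2400, 1416, 3000, 1985, 1534], dtype=float)
y = np.array([329900, 369000, 232000, 539900, 299900, 314900], dtype=float)
print(cost(0.0, 200.0, x, y))  # cost for theta0 = 0, theta1 = 200 (illustrative values)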


Naive example (hθ(x) = θ1 x)

[Figure: left, the data and the line hθ(x) in the (x, y) plane; right, the cost J(θ1) as a function of θ1.]


Visualization of the cost function J(θ0, θ1)

[Figure: surface plot of the cost J(θ0, θ1).]
2.3 Gradient descent



Gradient descent algorithm

• Minimizes the cost function J.
• It is an iterative algorithm.
• It is of more general use: not only for linear regression.

Idea of the algorithm

We have a function J(θ0, θ1) that we want to minimize:

• we choose two initial values for θ0 and θ1
• we change the values of θ0 and θ1 iteratively in order to reduce J(θ0, θ1) and reach a minimum


Example (Himmelblau’s function)

[Figure: gradient descent path on the surface of Himmelblau’s function.]

Another example (different initial values)

[Figure: the same surface, with a descent path starting from different initial values.]


Gradient descent

At each iteration of the algorithm, a new value is assigned to θj (for j = 0 and j = 1); in other words, the following operation is repeated until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1)$$

α is the learning rate (or learning step): it controls the step length of the gradient descent.


Choice of the learning rate (too small)


If α is too small, the gradient descent may be slow:

[Figure: many small steps along the curve of J(θ).]


Choice of the learning rate (too large)


If α is too large, the gradient descent may diverge:

[Figure: overshooting steps along the curve of J(θ).]

2.4 Gradient descent: application to linear regression



Reminders

Simple linear regression model

$$h_\theta(x) = \theta_0 + \theta_1 x$$

$$J(\theta_0, \theta_1) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Gradient descent algorithm

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1)$$


Partial derivatives (gradient)

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$$

• for j = 0: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})$
• for j = 1: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$


Batch gradient descent

Repeat until convergence {

$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})$$

$$\theta_1 := \theta_1 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$$

}

Note: each iteration of the algorithm uses all observations.
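A minimal NumPy sketch of this loop (illustrative names; the tolerance-based stopping rule anticipates the convergence criterion discussed in section 3.4, and the very small alpha reflects the unscaled Portland data):

import numpy as np

def batch_gradient_descent(x, y, alpha=1e-7, tol=1e-3, max_iter=100000):
    # Batch gradient descent for h_theta(x) = theta0 + theta1 * x
    theta0, theta1 = 0.0, 0.0
    prev_cost = float("inf")
    for _ in range(max_iter):
        residuals = (theta0 + theta1 * x) - y    # h_theta(x^(i)) - y^(i) for all i
        theta0 -= alpha * np.sum(residuals)      # simultaneous update of both
        theta1 -= alpha * np.sum(residuals * x)  # parameters, using all n observations
        cost = 0.5 * np.sum(((theta0 + theta1 * x) - y) ** 2)
        if abs(prev_cost - cost) < tol:          # stop when J decreases by less than tol
            break
        prev_cost = cost
    return theta0, theta1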


Stochastic gradient descent

Repeat until convergence {

  for i = 1 to n {

$$\theta_0 := \theta_0 - \alpha (h_\theta(x^{(i)}) - y^{(i)})$$

$$\theta_1 := \theta_1 - \alpha (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$$

  }
}

Note: each iteration of the algorithm uses only one observation.
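The same idea with one update per observation (a sketch; shuffling the observations at each pass is a common refinement, not something the slide prescribes):

import numpy as np

def stochastic_gradient_descent(x, y, alpha=1e-7, n_epochs=100, seed=0):
    # Stochastic gradient descent: parameters updated after each single observation
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):        # visit observations in random order
            residual = (theta0 + theta1 * x[i]) - y[i]
            theta0 -= alpha * residual
            theta1 -= alpha * residual * x[i]
    return theta0, theta1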

3. Multivariate linear regression

3.1 Representation of the model



Multivariate data (p ≥ 2)

Size    Nb. rooms    Price
1600    3            329900
2400    3            369000
1416    2            232000
3000    4            539900
1985    4            299900
1534    3            314900
Notations:

• p: number of variables
• x^(i): values for the i-th observation
• x_j^(i): value of the j-th variable for the i-th observation


Learning

We are looking for a function h which expresses y as a function of x1, x2, ..., xp. We represent it as a linear function:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_p x_p$$

$$x = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} \in \mathbb{R}^{p+1} \quad \text{and} \quad \theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{pmatrix} \in \mathbb{R}^{p+1}$$

In matrix form (with x_0 = 1), this gives: hθ(x) = θ^T x
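In NumPy the matrix form is a single product (a sketch; rows of X are observations, with a leading column of ones playing the role of x_0):

import numpy as np

def predict(theta, X):
    # Vectorized h_theta for all observations: X has shape (n, p+1), with x_0 = 1
    return X @ theta

X = np.array([[1.0, 1600, 3],
              [1.0, 2400, 3]])               # leading 1 = x_0
theta = np.array([80000.0, 100.0, 5000.0])   # illustrative parameter values
print(predict(theta, X))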

3.2 Gradient descent



Reminders

• Model function: $h_\theta(x) = \theta^T x = \theta_0 + \sum_{j=1}^{p} \theta_j x_j$
• Parameters: θ ∈ R^(p+1)
• Cost function: $J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})^2$
• Gradient descent (batch): we repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (\text{for } j = 0, 1, ..., p)$$


The algorithm (p ≥ 2)

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$$

that is:

$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_0^{(i)}$$

$$\theta_1 := \theta_1 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_1^{(i)}$$

$$\theta_2 := \theta_2 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_2^{(i)}$$

...
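All p + 1 updates collapse into one vectorized step (a sketch under the same conventions: X of shape (n, p+1) with a leading column of ones):

import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iter=1000):
    # Batch gradient descent for multivariate linear regression
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        residuals = X @ theta - y              # h_theta(x^(i)) - y^(i), shape (n,)
        theta -= alpha * (X.T @ residuals)     # all theta_j updated simultaneously
    return theta

Note that alpha = 0.01 assumes scaled variables, which motivates the next subsection.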
3.3 Feature scaling

Motivation

• Algorithms can have difficulties when variables have different scales (convergence can be slow)
• Example: in the Portland data, the number of rooms and the size have very different scales.
• Two commonly used scaling methods (see the Scikit-Learn online help):
  - min-max¹
  - standardization²

1 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
2 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Min-max scaling

The values of the variables are transformed so as to be expressed on a scale of 0 to 1.
For each variable j = 1, ..., p:

$$x_j^{\text{min-max}} = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}$$
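In Scikit-Learn this is MinMaxScaler (a minimal sketch; the data array is illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
scaler = MinMaxScaler()              # maps each column to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)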


Standardization

Unlike min-max scaling, standardization does not restrict the values to a given interval; this method is less sensitive to outliers.
For each variable j = 1, ..., p:

$$x_j^{\text{std}} = \frac{x_j - \mu_j}{\sigma_j}$$

where μ_j is the empirical mean of the observations of variable j and σ_j their standard deviation.
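The Scikit-Learn counterpart is StandardScaler (same illustrative data as above):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
scaler = StandardScaler()            # centers each column and divides by its std
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1 per column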

3.4 The learning rate α

Verification of the gradient descent


$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (\text{for } j = 0, 1, ..., p)$$

Two important things:

• Making sure that the gradient descent works correctly (converges)
• Choosing the value of the learning rate α


Verification and convergence

• J(θ) should decrease with each iteration
• Stopping criterion: convergence occurs when J(θ) decreases by less than ε in one iteration, in other words:

$$|J(\theta^{t+1}) - J(\theta^{t})| < \epsilon$$

Note: in Scikit-Learn (in the SGDRegressor estimator of the linear_model module), ε = 0.001 (denoted tol) by default³.

3 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

Diagnosis

• If the gradient descent “does not work”, we reduce the value of α
• If α is small enough, J(θ) must decrease with each iteration
• If α is too small, convergence may take longer

Note: in Scikit-Learn (in the SGDRegressor estimator of the linear_model module), α = 0.01 (denoted eta0) by default⁴.

4 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
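Putting the two defaults together, a minimal usage sketch (the Portland data again, scaled first as recommended above; tol and eta0 are simply written out explicitly):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

X = np.array([[1600, 3], [2400, 3], [1416, 2],
              [3000, 4], [1985, 4], [1534, 3]], dtype=float)
y = np.array([329900, 369000, 232000, 539900, 299900, 314900], dtype=float)

X_scaled = StandardScaler().fit_transform(X)
model = SGDRegressor(tol=1e-3, eta0=0.01)  # default values, made explicit
model.fit(X_scaled, y)
print(model.intercept_, model.coef_)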
3.5 Polynomial regression

Introduction

• The choice of variables to include in a regression model impacts the model
• Polynomial regression allows non-linearities to be added to the linear regression model through basis functions.

Note: in Scikit-Learn (with the PolynomialFeatures class of the preprocessing module), you can generate polynomial variables⁵.

5 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
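A sketch of generating polynomial variables and fitting a linear model on them (degree 2 is an illustrative choice):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1600], [2400], [1416], [3000], [1985], [1534]], dtype=float)
y = np.array([329900, 369000, 232000, 539900, 299900, 314900], dtype=float)

poly = PolynomialFeatures(degree=2, include_bias=False)  # adds the x^2 column
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)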

Polynomial regression

In a polynomial regression, the function hθ takes the following form:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_d x^d$$

where d ∈ N is the degree of the polynomial.
Examples:

• d = 1 (linear): hθ(x) = θ0 + θ1 x
• d = 2 (quadratic): hθ(x) = θ0 + θ1 x + θ2 x²
• d = 3 (cubic): hθ(x) = θ0 + θ1 x + θ2 x² + θ3 x³


Example (Portland)

[Figure: polynomial fit of Price versus Size on the Portland data.]


Another example of non-linearity

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sqrt{x}$$

[Figure: the corresponding fit of Price versus Size on the Portland data.]
3.6 Normal equation

Minimizing J(θ)

• Gradient descent (batch or stochastic) gives one way to minimize J.
• The minimization can also be done explicitly, without an iterative algorithm.
• This method⁶ is based on linear algebra and matrix calculus notions, which we will not detail here⁷.
• It does not require any scaling of the variables.

6 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
7 http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes1.pdf

Explicit least squares solution

The value of θ that minimizes J(θ) is given by the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$

where θ ∈ R^(p+1), X ∈ R^(n×(p+1)), and y ∈ R^n.
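In NumPy this is a one-liner (a sketch; np.linalg.lstsq or Scikit-Learn's LinearRegression are numerically safer in practice):

import numpy as np

X = np.array([[1.0, 1600, 3],
              [1.0, 2400, 3],
              [1.0, 1416, 2],
              [1.0, 3000, 4]])               # leading column of ones for theta_0
y = np.array([329900, 369000, 232000, 539900], dtype=float)

theta = np.linalg.solve(X.T @ X, X.T @ y)    # solves the normal equation
print(theta)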


Problems with the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$

• The matrix X^T X can be non-invertible:
  - the variables are redundant (linearly dependent)
  - too many variables (n ≤ p)
• The computation time can be high (for very large p)

Solutions: variable selection, regularization.


Conclusion

• Gradient descent
  - Sensitive to the choice of the hyperparameter α
  - Iterative method
  - Works for very large p
• Normal equation
  - No hyperparameters needed
  - No iterations
  - Requires computing (X^T X)^(-1)
  - Slow for very large p (complexity O(p³))
