
Prediction methods & Machine learning

Session I

Pierre Michel
pierre.michel@univ-amu.fr

M2 EBDS

2021
1. Introduction

1.1 What is Machine Learning?



Machine Learning

“machine learning is a field that develops algorithms designed to be applied to datasets, with the main areas of focus being prediction (regression), classification, and clustering or grouping tasks.” — Susan Athey, 2018

A field at the crossroads of three disciplines:

• Statistics
• Computer Science
• Artificial Intelligence

“Machine Learning” in French is “Apprentissage Machine” or “Apprentissage Automatique”.


Figure 1: Machine Learning.


Some examples of applications

• Create groups of customers (“market segmentation”), in order to define the best commercial strategy;
• Recommend products to each customer, based on purchases from similar customers;
• Automatically detect transactions that appear to be fraudulent;
• Predict next year’s revenue, based on the company’s historical data;
• Automatically diagnose pathologies;
• Recognize patterns (images, sounds, texts, etc.).


Pattern Recognition in images

Figure 2: Source: MNIST dataset.



Automatic diagnosis

Figure 3: Source: NVidia.



Connected objects (e-Health)

Figure 4: Source: iRhythm.


1.2 Supervised and unsupervised learning

Supervised versus Unsupervised

[Figure: four example tasks: classification (supervised), clustering (unsupervised), regression (supervised), and density estimation (unsupervised).]


Supervised learning (classification)

Figure 5: Example of classification.


Supervised learning (regression)

Figure 6: Example of regression.


Unsupervised learning (clustering)

Figure 7: Example of clustering.

1.3 Advanced Machine Learning

Semi-supervised learning

Figure 8: Example of semi-supervised learning.


Reinforcement learning

1.4 Which tools to use?

Python and Machine Learning libraries

This course is based on the Python programming language and the following libraries:

• Scikit-Learn (http://scikit-learn.org): offers many Machine Learning algorithms (classification, regression, clustering...)
• NumPy (http://numpy.org): scientific computing, linear algebra, N-dimensional arrays...
• pandas (http://pandas.pydata.org): allows the handling and analysis of data.
• Matplotlib (http://matplotlib.org): allows plotting and visualizing data in graphical form.
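As a quick check that this stack is available (a minimal sketch; printing versions is just one way to verify the installation):

import sklearn
import numpy as np
import pandas as pd
import matplotlib

# Each of these libraries exposes its version string
print(sklearn.__version__, np.__version__, pd.__version__, matplotlib.__version__)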


Install Anaconda

Anaconda:

• free and open source distribution of the Python and R languages
• applications in data science and machine learning
• available on Windows, Mac, Linux
• Anaconda Navigator interface: JupyterLab, Jupyter Notebook, Anaconda Prompt, Spyder, RStudio...


Anaconda Prompt

First of all, you have to choose a directory in which you are going to work; open Anaconda Prompt:

cd my_folder              # change current directory
mkdir ml                  # create a new folder named ml
cd ml                     # ml becomes the current folder
conda list                # print installed packages
python                    # run python
>>> print("Hello World")  # print a message
>>> quit()                # quit python
anaconda-navigator        # run Anaconda Navigator
jupyter notebook          # run Jupyter Notebook

Note: All this can also be done manually.


Anaconda Navigator

Figure 10: Anaconda Navigator interface.


Jupyter Notebook

Figure 11: Jupyter Notebook interface.

2. Linear regression

2.1 Model representation



Context

• We talk about supervised learning when we know the “true response” for each observation.
• A regression problem consists in predicting a real value...
• ...unlike a classification problem, which consists in predicting a discrete value.


Example: housing prices (Portland)


[Figure: scatter plot of Price versus Size (Portland housing data).]


Regression line

[Figure: the same scatter plot with a fitted regression line.]


Prediction

[Figure: on the regression line, a size of 2500 corresponds to a predicted price of 406509.5.]


Learning sample

Size (x)    Price (y)
1600        329900
2400        369000
1416        232000
3000        539900
1985        299900
1534        314900
Notations:

• n: number of observations in the learning sample
• x: explanatory variable (or predictor)
• y: variable to explain (or target)
• (x, y): one observation (or example)
• (x^(i), y^(i)): the i-th observation


Learning

From the learning sample and a learning algorithm, we look for a function h which expresses y as a function of x. We represent it as a linear function:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where θ = (θ0, θ1) represents the parameters of the model.
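As a sketch, the model function can be written directly in Python (the parameter values below are purely illustrative):

def h(theta0, theta1, x):
    # Linear hypothesis h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

print(h(80000, 120, 2500))  # prediction for a size of 2500, with illustrative parameters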

2.2 Cost function



Parameters

From the data:

Size (x)    Price (y)
1600        329900
2400        369000
1416        232000
3000        539900
1985        299900
1534        314900

Considering the linear function hθ(x) = θ0 + θ1 x:
How to estimate the parameters θ0 and θ1?


Different choices of parameters


[Figure: three lines for different parameter choices, panels (a), (b), (c).]

• (a): θ0 = 1.5 and θ1 = 0
• (b): θ0 = 0 and θ1 = 0.5
• (c): θ0 = 1 and θ1 = 0.5

Idea

[Figure: scatter plot of observations with a candidate regression line.]

We choose θ0 and θ1 so that hθ(x) is as close as possible to the observations.

Optimization problem

We seek to minimize the cost function J(θ0, θ1) defined by:

$$J(\theta_0, \theta_1) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

The minimization problem can thus be written:

$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$

Note: up to the factor 1/2, this cost function is the residual sum of squares.
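A direct NumPy transcription of this cost function (a minimal sketch; the arrays are the Portland learning sample from the slides):

import numpy as np

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = 1/2 * sum of squared residuals
    residuals = (theta0 + theta1 * x) - y
    return 0.5 * np.sum(residuals ** 2)

x = np.array([1600, 2400, 1416, 3000, 1985, 1534], dtype=float)
y = np.array([329900, 369000, 232000, 539900, 299900, 314900], dtype=float)
print(cost(0.0, 200.0, x, y))  # cost for theta0 = 0, theta1 = 200 (illustrative values)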


Naive example (hθ(x) = θ1 x)

[Figure: left, the data and the line hθ(x) in the (x, y) plane; right, the cost J(θ1) as a function of θ1.]


Visualization of the cost function J(θ0, θ1)

[Figure: surface plot of the cost J(θ0, θ1).]
2.3 Gradient descent



Gradient descent algorithm

• Minimizes the cost function J.
• It is an iterative algorithm.
• It is of more general use: not only for linear regression.

Idea of the algorithm

We have a function J(θ0, θ1) that we want to minimize:

• we choose two initial values for θ0 and θ1
• we change the values of θ0 and θ1 iteratively in order to reduce J(θ0, θ1) and reach a minimum


Example (Himmelblau’s function)

[Figure: gradient descent path on the surface of Himmelblau’s function.]

Another example (different initial values)

[Figure: the same surface, with a descent path starting from different initial values.]


Gradient descent

At each iteration of the algorithm, a new value is assigned to θj (for j = 0 and j = 1); in other words, the following operation is repeated until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1)$$

α is the learning rate (or learning step): it controls the step length of the gradient descent.


Choice of the learning rate (too small)


If α is too small, the gradient descent may be slow:

[Figure: many small steps along the curve of J(θ).]


Choice of the learning rate (too large)


If α is too large, the gradient descent may diverge:

[Figure: overshooting steps along the curve of J(θ).]

2.4 Gradient descent: application to linear regression



Reminders

Simple linear regression model

$$h_\theta(x) = \theta_0 + \theta_1 x$$

$$J(\theta_0, \theta_1) = \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Gradient descent algorithm

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad (\text{for } j = 0 \text{ and } j = 1)$$


Partial derivatives (gradient)

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i=1}^{n} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$$

• for j = 0: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})$
• for j = 1: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$


Batch gradient descent

Repeat until convergence {

$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})$$

$$\theta_1 := \theta_1 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$$

}

Note: each iteration of the algorithm uses all observations.
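A minimal NumPy sketch of this loop (illustrative names; the tolerance-based stopping rule anticipates the convergence criterion discussed in section 3.4, and the very small alpha reflects the unscaled Portland data):

import numpy as np

def batch_gradient_descent(x, y, alpha=1e-7, tol=1e-3, max_iter=100000):
    # Batch gradient descent for h_theta(x) = theta0 + theta1 * x
    theta0, theta1 = 0.0, 0.0
    prev_cost = float("inf")
    for _ in range(max_iter):
        residuals = (theta0 + theta1 * x) - y    # h_theta(x^(i)) - y^(i) for all i
        theta0 -= alpha * np.sum(residuals)      # simultaneous update of both
        theta1 -= alpha * np.sum(residuals * x)  # parameters, using all n observations
        cost = 0.5 * np.sum(((theta0 + theta1 * x) - y) ** 2)
        if abs(prev_cost - cost) < tol:          # stop when J decreases by less than tol
            break
        prev_cost = cost
    return theta0, theta1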


Stochastic gradient descent

Repeat until convergence {

  for i = 1 to n {

$$\theta_0 := \theta_0 - \alpha (h_\theta(x^{(i)}) - y^{(i)})$$

$$\theta_1 := \theta_1 - \alpha (h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}$$

  }
}

Note: each iteration of the algorithm uses only one observation.
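The same idea with one update per observation (a sketch; shuffling the observations at each pass is a common refinement, not something the slide prescribes):

import numpy as np

def stochastic_gradient_descent(x, y, alpha=1e-7, n_epochs=100, seed=0):
    # Stochastic gradient descent: parameters updated after each single observation
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):        # visit observations in random order
            residual = (theta0 + theta1 * x[i]) - y[i]
            theta0 -= alpha * residual
            theta1 -= alpha * residual * x[i]
    return theta0, theta1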

3. Multivariate linear regression

3.1 Representation of the model



Multivariate data (p ≥ 2)

Size    Nb. rooms    Price
1600    3            329900
2400    3            369000
1416    2            232000
3000    4            539900
1985    4            299900
1534    3            314900
Notations:

• p: number of variables
• x^(i): values for the i-th observation
• x_j^(i): value of the j-th variable for the i-th observation


Learning

We are looking for a function h which expresses y as a function of x1, x2, ..., xp. We represent it as a linear function:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_p x_p$$

$$x = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix} \in \mathbb{R}^{p+1} \quad \text{and} \quad \theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{pmatrix} \in \mathbb{R}^{p+1}$$

In matrix form (with x_0 = 1), this gives: hθ(x) = θ^T x
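In NumPy the matrix form is a single product (a sketch; rows of X are observations, with a leading column of ones playing the role of x_0):

import numpy as np

def predict(theta, X):
    # Vectorized h_theta for all observations: X has shape (n, p+1), with x_0 = 1
    return X @ theta

X = np.array([[1.0, 1600, 3],
              [1.0, 2400, 3]])               # leading 1 = x_0
theta = np.array([80000.0, 100.0, 5000.0])   # illustrative parameter values
print(predict(theta, X))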

3.2 Gradient descent



Reminders

• Model function: $h_\theta(x) = \theta^T x = \theta_0 + \sum_{j=1}^{p} \theta_j x_j$
• Parameters: θ ∈ R^(p+1)
• Cost function: $J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})^2$
• Gradient descent (batch): we repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (\text{for } j = 0, 1, ..., p)$$


The algorithm (p ≥ 2)

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$$

that is:

$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_0^{(i)}$$

$$\theta_1 := \theta_1 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_1^{(i)}$$

$$\theta_2 := \theta_2 - \alpha \sum_{i=1}^{n} (h_\theta(x^{(i)}) - y^{(i)})\, x_2^{(i)}$$

...
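All p + 1 updates collapse into one vectorized step (a sketch under the same conventions: X of shape (n, p+1) with a leading column of ones):

import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iter=1000):
    # Batch gradient descent for multivariate linear regression
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        residuals = X @ theta - y              # h_theta(x^(i)) - y^(i), shape (n,)
        theta -= alpha * (X.T @ residuals)     # all theta_j updated simultaneously
    return theta

Note that alpha = 0.01 assumes scaled variables, which motivates the next subsection.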
3.3 Feature scaling

Motivation

• Algorithms can have difficulties when variables have different scales (convergence can be slow)
• Example: in the Portland data, the number of rooms and the size have very different scales.
• Two commonly used scaling methods (see the Scikit-Learn online help):
  - min-max¹
  - standardization²

1 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
2 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Min-max scaling

The values of the variables are transformed so as to be expressed on a scale of 0 to 1.
For each variable j = 1, ..., p:

$$x_j^{\text{min-max}} = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}$$
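In Scikit-Learn this is MinMaxScaler (a minimal sketch; the data array is illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
scaler = MinMaxScaler()              # maps each column to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)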


Standardization

Unlike min-max scaling, standardization does not restrict the values to a given interval; this method is less sensitive to outliers.
For each variable j = 1, ..., p:

$$x_j^{\text{std}} = \frac{x_j - \mu_j}{\sigma_j}$$

where μ_j is the empirical mean of the observations of variable j and σ_j their standard deviation.
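The Scikit-Learn counterpart is StandardScaler (same illustrative data as above):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
scaler = StandardScaler()            # centers each column and divides by its std
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1 per column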

3.4 The learning rate α

Verification of the gradient descent


$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad (\text{for } j = 0, 1, ..., p)$$

Two important things:

• Making sure that the gradient descent works correctly (converges)
• Choosing the value of the learning rate α


Verification and convergence

• J(θ) should decrease with each iteration
• Stopping criterion: convergence occurs when J(θ) decreases by less than ε in one iteration, in other words:

$$|J(\theta^{t+1}) - J(\theta^{t})| < \epsilon$$

Note: in Scikit-Learn (in the SGDRegressor estimator of the linear_model module), ε = 0.001 (denoted tol) by default³.

3 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

Diagnosis

• If the gradient descent “does not work”, we reduce the value of α
• If α is small enough, J(θ) must decrease with each iteration
• If α is too small, convergence may take longer

Note: in Scikit-Learn (in the SGDRegressor estimator of the linear_model module), α = 0.01 (denoted eta0) by default⁴.

4 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
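Putting the two defaults together, a minimal usage sketch (the Portland data again, scaled first as recommended above; tol and eta0 are simply written out explicitly):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

X = np.array([[1600, 3], [2400, 3], [1416, 2],
              [3000, 4], [1985, 4], [1534, 3]], dtype=float)
y = np.array([329900, 369000, 232000, 539900, 299900, 314900], dtype=float)

X_scaled = StandardScaler().fit_transform(X)
model = SGDRegressor(tol=1e-3, eta0=0.01)  # default values, made explicit
model.fit(X_scaled, y)
print(model.intercept_, model.coef_)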
3.5 Polynomial regression

Introduction

• The choice of variables to include in a regression model impacts the model
• Polynomial regression allows non-linearities to be added to the linear regression model through basis functions.

Note: in Scikit-Learn (with the PolynomialFeatures class of the preprocessing module), you can generate polynomial variables⁵.

5 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
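A sketch of generating polynomial variables and fitting a linear model on them (degree 2 is an illustrative choice):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1600], [2400], [1416], [3000], [1985], [1534]], dtype=float)
y = np.array([329900, 369000, 232000, 539900, 299900, 314900], dtype=float)

poly = PolynomialFeatures(degree=2, include_bias=False)  # adds the x^2 column
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)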

Polynomial regression

In a polynomial regression, the function hθ takes the following form:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_d x^d$$

where d ∈ N is the degree of the polynomial.
Examples:

• d = 1 (linear): hθ(x) = θ0 + θ1 x
• d = 2 (quadratic): hθ(x) = θ0 + θ1 x + θ2 x²
• d = 3 (cubic): hθ(x) = θ0 + θ1 x + θ2 x² + θ3 x³


Example (Portland)

[Figure: polynomial fit of Price versus Size on the Portland data.]


Another example of non-linearity

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sqrt{x}$$

[Figure: the corresponding fit of Price versus Size on the Portland data.]
3.6 Normal equation

Minimizing J(θ)

• Gradient descent (batch or stochastic) gives one way to minimize J.
• The minimization can also be done explicitly, without an iterative algorithm.
• This method⁶ is based on linear algebra and matrix calculus notions, which we will not detail here⁷.
• It does not require any scaling of the variables.

6 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
7 http://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes1.pdf

Explicit least squares solution

The value of θ that minimizes J(θ) is given by the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$

where θ ∈ R^(p+1), X ∈ R^(n×(p+1)), and y ∈ R^n.
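In NumPy this is a one-liner (a sketch; np.linalg.lstsq or Scikit-Learn's LinearRegression are numerically safer in practice):

import numpy as np

X = np.array([[1.0, 1600, 3],
              [1.0, 2400, 3],
              [1.0, 1416, 2],
              [1.0, 3000, 4]])               # leading column of ones for theta_0
y = np.array([329900, 369000, 232000, 539900], dtype=float)

theta = np.linalg.solve(X.T @ X, X.T @ y)    # solves the normal equation
print(theta)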


Problems with the normal equation:

$$\theta = (X^T X)^{-1} X^T y$$

• The matrix X^T X can be non-invertible:
  - the variables are redundant (linearly dependent)
  - too many variables (n ≤ p)
• The computation time can be high (for very large p)

Solutions: variable selection, regularization.


Conclusion

• Gradient descent
  - Sensitive to the choice of the hyperparameter α
  - Iterative method
  - Works for very large p
• Normal equation
  - No hyperparameters needed
  - No iterations
  - Requires computing (X^T X)^(-1)
  - Slow for very large p (complexity O(p³))
