
Introduction to Machine Learning

Gilles Gasso

INSA Rouen - ASI Department


LITIS Lab

September 5, 2019



Machine Learning: introduction

Machine Learning ≡ data-based programming


Ability of computers to learn how to perform tasks (classification, detection, translation, ...) without being explicitly programmed
Study of algorithms that improve their performance at some task based on experience



The rise

Data
Big Data : continuous increase in the amount of data generated
Twitter : 50M tweets /day (= 7 terabytes)
Facebook : 10 terabytes /day
Youtube : 50 hours of video uploaded /minute
2.9 million e-mails /second

Computing Power
Moore's law
Massively distributed computing

The value-adding process
Interest shifts from the product to the customers
Data Mining ≡ discovering patterns in large data sets



Historical perspective

Today's AI is Deep Learning (a technique of Machine Learning)


https://blog.alore.io/machine-learning-and-artificial-intelligence/



Applications: sentiment analysis

Classify stored reviews according to users' sentiment:

Score(x) > 0 → positive review
Score(x) < 0 → negative review
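
A minimal sketch of such a scorer, assuming a linear bag-of-words model; the vocabulary and weights below are invented for the illustration and are not from the course:

```python
# Hypothetical word weights for a linear sentiment model (illustrative only).
weights = {"great": 1.5, "excellent": 2.0, "poor": -1.5, "awful": -2.0}

def score(review: str) -> float:
    """Score(x): sum of the learned weights of the words in the review."""
    return sum(weights.get(word, 0.0) for word in review.lower().split())

print(score("great screen and excellent battery"))  # Score(x) > 0: positive
print(score("awful quality and poor support"))      # Score(x) < 0: negative
```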



Product recommendation

Matrix factorization : the purchase history of customers is factorized into user features and product features.
https://katbailey.github.io/images/matrix_factorization.png

[Figure: a purchased item and the products recommended from it]
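
A toy sketch of the idea, assuming a small invented ratings matrix and stochastic gradient descent on the observed entries (one common way to fit such a factorization):

```python
import numpy as np

# Toy purchase/rating matrix R (users x items); 0 means "not observed".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user features
V = rng.normal(scale=0.1, size=(n_items, k))   # product features

lr, reg = 0.01, 0.02
for _ in range(2000):
    for i, j in zip(*R.nonzero()):              # only observed entries
        err = R[i, j] - U[i] @ V[j]
        u_i = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])  # gradient steps
        V[j] += lr * (err * u_i - reg * V[j])

# U @ V.T fills in the unobserved entries: recommendation scores.
print(np.round(U @ V.T, 1))
```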



Image classification
Labeled training images → deep classification architecture
https://www.researchgate.net/profile/Y_Nikitin/publication/270163511/figure/download/fig5/AS:295194831409153@1447391340221/MPCNN-architecture-using-alternating-convolutional-and-max-pooling-layers-13.png

Input images and predicted category
https://adriancolyer.files.wordpress.com/2016/04/imagenet-fig4l.png?w=656
Medical diagnosis

Malignant melanoma detection


130 000 images including over 2 000 cases of cancer
Error rate 28 % (human 34 %)

Digital Mammography DREAM Challenge


640 000 mammograms (1209 participants)
false-positive rate decreased by 5 %

Heart rhythm analysis

500 000 ECGs
accuracy 92.6 % (human: 80.0 %), sensitivity 97 %



Object detection

https://www.mdpi.com/applsci/applsci-09-01128/article_deploy/html/images/applsci-09-01128-g004.pn

https://arxiv.org/pdf/1506.02640.pdf



Audio captioning

https://ars.els-cdn.com/content/image/1-s2.0-S1574954115000151-gr1.jpg



Chatbot

https://miro.medium.com/max/2058/1*LF5T9fsr4w2EqyFJkb-gng.png



Implementing a Machine Learning project

Real World → Data Gathering → Data Strengthening → Data engineering → Business

Real World : Sensors, Industries, Transactions, Insurance, Web-clicks/logs, Internet, Mobility, Social Networks, Open data, Medical, Docs, ...
Data Strengthening : Cleaning, Organizing, Aggregation, Management of errors, ...
Data engineering : Exploration, Representation, Display, Machine Learning, Data Mining, ...
Business : Interpretation, Pattern of interests, Strategic decision, Valuation, ...



Chain of the data engineering process
Data → Pre-processing → Train the model → Evaluate the model

1 Understand and specify the project goals
2 Pre-process/visualize/analyze the data
3 Which ML problem is it?
4 Design a solving approach
5 Evaluate its performance
6 Go to 2) if needed

Course Goal: Study the steps from 2 to 5


The data

Information (past experience) comes as examples described by attributes

Assume the data set consists of N samples

Attributes
An attribute is a property or characteristic of the phenomenon being observed. Also called a feature or a variable

Sample
A sample is an entity characterizing an object; it is made up of attributes. Synonyms : instance, point, vector (usually in R^d)



Data : visualization
Points x | citric acid | residual sugar | chlorides | sulfur dioxide
1  | 0    | 1.9 | 0.076 | 11
2  | 0    | 2.6 | 0.098 | 25
3  | 0.04 | 2.3 | 0.092 | 15
4  | 0.56 | 1.9 | 0.075 | 17   ← a point x ∈ R^4
5  | 0    | 1.9 | 0.076 | 11
6  | 0    | 1.8 | 0.075 | 13
7  | 0.06 | 1.6 | 0.069 | 15
8  | 0.02 | 2   | 0.073 | 9
9  | 0.36 | 2.8 | 0.071 | 17
10 | 0.08 | 1.8 | 0.097 | 15

[3-D scatter plot of the points along variable 2 (residual sugar), variable 3 (chlorides) and variable 4 (sulfur dioxide), with the mean of the points highlighted]
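
In code, such a data set is simply an N × d matrix; a NumPy view of the ten points above:

```python
import numpy as np

# The ten samples of the table: rows = points, columns =
# (citric acid, residual sugar, chlorides, sulfur dioxide).
X = np.array([[0.00, 1.9, 0.076, 11],
              [0.00, 2.6, 0.098, 25],
              [0.04, 2.3, 0.092, 15],
              [0.56, 1.9, 0.075, 17],
              [0.00, 1.9, 0.076, 11],
              [0.00, 1.8, 0.075, 13],
              [0.06, 1.6, 0.069, 15],
              [0.02, 2.0, 0.073,  9],
              [0.36, 2.8, 0.071, 17],
              [0.08, 1.8, 0.097, 15]])

print(X.shape)         # (10, 4): N = 10 points, d = 4 features
print(X[3])            # a single point x in R^4
print(X.mean(axis=0))  # the mean of the points shown in the plot
```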



Data types

Sensors → Quantitative and qualitative variables (ordinal, nominal)
Text → Strings
Speech → Time series
Images → 2D data
Videos → 2D data + time
Networks → Graphs
Stream → Logs, coupons, ...
Labels → Expected prediction output



Approaches of Machine Learning



Supervised learning
Principle
Given a set of N training examples {(x_i, y_i) ∈ X × Y, i = 1, · · · , N}, we want to estimate a prediction function y = f(x).
The supervision comes from the knowledge of the labels.

Examples
Image classification, object detection, stock price prediction . . .
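
A minimal sketch of the principle, with invented toy data and logistic regression standing in for the hypothesis class (scikit-learn assumed available):

```python
from sklearn.linear_model import LogisticRegression

# Training pairs (x_i, y_i): inputs in R^2 with their labels.
X = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y = [-1, -1, 1, 1]

f = LogisticRegression().fit(X, y)  # estimate the prediction function f
print(f.predict([[7.5, 8.5]]))      # predicted label y for a new input x
```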



Unsupervised learning
Principle
Only the {x_i ∈ X, i = 1, · · · , N} are available. We aim to describe how the data is organized and to extract homogeneous subsets from it.

Examples
Customer segmentation, image segmentation, data visualization, categorization of similar documents, ...
[Scatter plot: 2-D embedding of data points labeled 0 to 9; points with the same label form homogeneous clusters]
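
A minimal sketch of such a grouping on synthetic data, with k-means standing in for the clustering algorithm (scikit-learn assumed available):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: only the x_i are given, drawn here from two blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(6.0, 1.0, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # centers of the homogeneous subsets found
print(km.labels_[:5])       # cluster assigned to the first points
```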



Supervised learning : concept

Let X and Y be two sets, and let p(X, Y) be the joint probability distribution of (X, Y) ∈ X × Y.

Goal : find a prediction function f : X → Y which correctly estimates the output y corresponding to an input x.

f belongs to a space H called the hypothesis class. Example of H : the set of polynomial functions.



Supervised learning: principle

Loss function L(Y, f(X))

evaluates how "close" the prediction f(x) is to the true label y
it penalizes errors: L(y, f(x)) = 0 if y = f(x), and L(y, f(x)) ≥ 0 if y ≠ f(x)

True risk (expected prediction error)

R(f) = E_{(X,Y)}[L(Y, f(X))] = ∫_{X×Y} L(y, f(x)) p(x, y) dx dy

Objective
Identify the prediction function which minimizes the true risk, i.e.

f* = arg min_{f ∈ H} R(f)



Loss functions
Quadratic loss : L(Y, f(X)) = (Y − f(X))²
ℓ1 loss (absolute deviation) : L(Y, f(X)) = |Y − f(X)|
0−1 loss : L(y, f(x)) = 0 if y = f(x), 1 otherwise
Hinge loss : L(y, f(x)) = max(0, 1 − y f(x))
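
These losses translate directly into code; a vectorized NumPy sketch:

```python
import numpy as np

def quadratic(y, fx): return (y - fx) ** 2
def l1(y, fx):        return np.abs(y - fx)
def zero_one(y, fx):  return (y != fx).astype(float)        # 0 if equal, 1 otherwise
def hinge(y, fx):     return np.maximum(0.0, 1.0 - y * fx)  # labels y in {-1, +1}

y  = np.array([1.0, -1.0,  1.0])
fx = np.array([0.8,  0.3, -0.2])
print(quadratic(y, fx), l1(y, fx), hinge(y, fx))
```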



Some supervised learning problems

Regression

We talk about regression when Y is a subset of R^d.

Usual related loss function: the quadratic loss (y − f(x))²

[Figure: Support Vector Machine regression on noisy 1-D data]
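
A minimal regression sketch under the quadratic loss, using invented 1-D data and a straight-line model (not the SVM regression of the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
y = np.sin(x) + 0.1 * rng.normal(size=50)  # noisy real-valued targets

w, b = np.polyfit(x, y, deg=1)             # least-squares fit of f(x) = w*x + b
y_hat = w * x + b
print(np.mean((y - y_hat) ** 2))           # average quadratic loss
```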



Some supervised learning problems
Classification
Output space Y is an unordered discrete set
Binary classification: card(Y) = 2
Example: Y = {−1, 1}
Loss functions: 0-1 loss, hinge loss
Multiclass classification: card(Y) > 2
Example: Y = {1, 2, · · · , K }



From true risk to empirical risk

Minimizing the true risk is not doable (in almost all practical applications):

The joint distribution p(X, Y) is unknown!

Only a finite training set {(x_i, y_i) ∈ X × Y}_{i=1}^N is available

Empirical risk

R_emp(f) = (1/N) Σ_{i=1}^N L(y_i, f(x_i))

Empirical risk minimization

f̂ = arg min_{f ∈ H} R_emp(f)
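
In code, the empirical risk is just the average loss over the training set; a small sketch with a placeholder model f and the quadratic loss:

```python
import numpy as np

def f(x):            # placeholder prediction function (an assumption)
    return 2.0 * x

def empirical_risk(f, X, Y, loss=lambda y, fx: (y - fx) ** 2):
    """R_emp(f) = (1/N) * sum_i L(y_i, f(x_i))."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, Y)])

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([0.1, 2.2, 3.9, 6.1])
print(empirical_risk(f, X, Y))
```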



Overfitting
Overfitting
Empirical risk is not appropriate for model selection: if H is large enough, R_emp(f) → 0 while the generalization error (true risk) remains high.

[Figure: prediction error vs. model complexity (low → high); the training-set error keeps decreasing while the test-set error eventually rises]
Illustration of overfitting

[Figure: empirical risk on the training (App), validation (Val) and test sets, for decreasing values of k]
https://www.cs.princeton.edu/courses/archive/spring16/cos495/slides/ML_basics_lecture6_overfitting.pdf



Model selection

Find in H the best function f that, learned on the training set, will generalize well (low true risk)
Example : we look for the polynomial function f_α of degree α minimizing the risk R_emp(f_α) = (1/N) Σ_{i=1}^N (y_i − f_α(x_i))²
Goal :
1 propose a model estimation method in order to choose (approximately) the best model belonging to H
2 once the model is selected, estimate its generalization error



Model selection : basic approach
Case 1 : N is very large (large-scale D_N)

Available data: Training {X_app, Y_app} | Validation {X_val, Y_val} | Test {X_test, Y_test}

1 Randomly split D_N = D_train ∪ D_val ∪ D_test
2 For each α, train f_α on D_train
3 Evaluate its performance on D_val : R_val = (1/N_val) Σ_{i ∈ D_val} L(y_i, f_α(x_i))
4 Select the model with the best performance on D_val
5 Test the selected model on D_test

Note
Dtest is used once!
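
A sketch of the whole procedure on invented data, with the polynomial degree α as the model family and NumPy's polyfit standing in for the training step:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=300)
y = np.sin(x) + 0.2 * rng.normal(size=300)

idx = rng.permutation(300)                       # random split of D_N
tr, va, te = idx[:200], idx[200:250], idx[250:]  # D_train, D_val, D_test

def val_risk(alpha):
    coeffs = np.polyfit(x[tr], y[tr], deg=alpha)  # train f_alpha on D_train
    return np.mean((y[va] - np.polyval(coeffs, x[va])) ** 2)

best = min(range(1, 10), key=val_risk)            # select on D_val
coeffs = np.polyfit(x[tr], y[tr], deg=best)
test_risk = np.mean((y[te] - np.polyval(coeffs, x[te])) ** 2)
print(best, test_risk)                            # D_test is used once
```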



Model selection : Cross-validation
Case 2 : small or medium-scale D_N

Estimate the generalization error by re-sampling.


Principle
1 Split DN into K sets of equal size.
2 For each k = 1, · · · , K , train a model by using the K − 1 remaining
sets and evaluate the model on the k-th part.
3 Average the K error estimates obtained to have the cross-validation
error.

[Diagram: K-fold split of the data; each fold serves once as the validation set while the remaining folds form the training set, with a separate test set held out]
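
A minimal K-fold sketch on invented data, with ridge regression as a placeholder model (scikit-learn assumed available):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

errors = []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[tr], y[tr])  # train on the K-1 remaining folds
    errors.append(np.mean((y[va] - model.predict(X[va])) ** 2))

print(np.mean(errors))  # cross-validation estimate of the error
```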



Conclusions

To successfully carry out an automatic data processing project


Clearly identify and spell out the needs.
Create or obtain data representative of the problem
Identify the context of learning
Analyze and reduce data size
Choose an algorithm and/or a space of hypotheses
Choose a model by applying the algorithm to pre-processed data
Validate the performance of the method



In the end ...
Find your way. . .
