
Introduction to Machine Learning

Gilles Gasso

INSA Rouen - ASI Department


LITIS Lab

September 5, 2019



Machine Learning: introduction

Machine Learning ≡ data-based programming


Ability of computers to learn how to perform tasks (classification, detection, translation, ...) without being explicitly programmed
Study of algorithms that improve their performance at some task based on experience



The rise

Data
Big Data : continuous increase in the amount of data generated
Twitter : 50M tweets /day (= 7 terabytes)
Facebook : 10 terabytes /day
Youtube : 50 hours of video uploaded /minute
2.9 million e-mails /second

Computing Power
Moore's law
Massively distributed computing

The value-adding process
Interest shifts from the product to the customers
Data Mining ≡ discovering patterns in large data sets



Historical perspective

Today's AI is Deep Learning (a technique of Machine Learning)


https://blog.alore.io/machine-learning-and-artificial-intelligence/



Applications: sentiment analysis

Classify stored reviews according to users' sentiment:

Score(x) > 0 → positive review
Score(x) < 0 → negative review
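
A minimal sketch of such a scorer, assuming a linear bag-of-words model; the vocabulary and weights below are invented for the illustration and are not from the course:

```python
# Hypothetical word weights for a linear sentiment model (illustrative only).
weights = {"great": 1.5, "excellent": 2.0, "poor": -1.5, "awful": -2.0}

def score(review: str) -> float:
    """Score(x): sum of the learned weights of the words in the review."""
    return sum(weights.get(word, 0.0) for word in review.lower().split())

print(score("great screen and excellent battery"))  # Score(x) > 0: positive
print(score("awful quality and poor support"))      # Score(x) < 0: negative
```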



Product recommendation

Matrix factorization : the purchase history of customers is factorized into user features and product features.
https://katbailey.github.io/images/matrix_factorization.png

[Figure: a purchased item and the products recommended from it]
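
A toy sketch of the idea, assuming a small invented ratings matrix and stochastic gradient descent on the observed entries (one common way to fit such a factorization):

```python
import numpy as np

# Toy purchase/rating matrix R (users x items); 0 means "not observed".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
n_users, n_items, k = R.shape[0], R.shape[1], 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user features
V = rng.normal(scale=0.1, size=(n_items, k))   # product features

lr, reg = 0.01, 0.02
for _ in range(2000):
    for i, j in zip(*R.nonzero()):              # only observed entries
        err = R[i, j] - U[i] @ V[j]
        u_i = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])  # gradient steps
        V[j] += lr * (err * u_i - reg * V[j])

# U @ V.T fills in the unobserved entries: recommendation scores.
print(np.round(U @ V.T, 1))
```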



Image classification
Labeled training images → deep classification architecture
https://www.researchgate.net/profile/Y_Nikitin/publication/270163511/figure/download/fig5/AS:295194831409153@1447391340221/MPCNN-architecture-using-alternating-convolutional-and-max-pooling-layers-13.png

Input images and predicted category
https://adriancolyer.files.wordpress.com/2016/04/imagenet-fig4l.png?w=656
Medical diagnosis

Malignant melanoma detection


130 000 images including over 2 000 cases of cancer
Error rate 28 % (human 34 %)

Digital Mammography DREAM Challenge


640 000 mammograms (1209 participants)
false-positive rate decreased by 5 %

Heart rhythm analysis

500 000 ECGs
accuracy 92.6 % (human: 80.0 %), sensitivity 97 %



Object detection

https://www.mdpi.com/applsci/applsci-09-01128/article_deploy/html/images/applsci-09-01128-g004.pn

https://arxiv.org/pdf/1506.02640.pdf



Audio captioning

https://ars.els-cdn.com/content/image/1-s2.0-S1574954115000151-gr1.jpg



Chatbot

https://miro.medium.com/max/2058/1*LF5T9fsr4w2EqyFJkb-gng.png



Implementing a Machine Learning project

Real World → Data Gathering → Data Strengthening → Data engineering → Business

Real World : Sensors, Industries, Transactions, Insurance, Web-clicks/logs, Internet, Mobility, Social Networks, Open data, Medical, Docs, ...
Data Strengthening : Cleaning, Organizing, Aggregation, Management of errors, ...
Data engineering : Exploration, Representation, Display, Machine Learning, Data Mining, ...
Business : Interpretation, Pattern of interests, Strategic decision, Valuation, ...



Chain of the data engineering process
Data → Pre-processing → Train the model → Evaluate the model

1 Understand and specify the project goals
2 Pre-process/visualize/analyze the data
3 Which ML problem is it?
4 Design a solving approach
5 Evaluate its performance
6 Go to 2) if needed

Course Goal: Study the steps from 2 to 5


The data

Information (past experience) comes as examples described by attributes

Assume the data set consists of N samples

Attributes
An attribute is a property or characteristic of the phenomenon being observed. Also called a feature or a variable

Sample
A sample is an entity characterizing an object; it is made up of attributes. Synonyms : instance, point, vector (usually in R^d)



Data : visualization
Points x | citric acid | residual sugar | chlorides | sulfur dioxide
1  | 0    | 1.9 | 0.076 | 11
2  | 0    | 2.6 | 0.098 | 25
3  | 0.04 | 2.3 | 0.092 | 15
4  | 0.56 | 1.9 | 0.075 | 17   ← a point x ∈ R^4
5  | 0    | 1.9 | 0.076 | 11
6  | 0    | 1.8 | 0.075 | 13
7  | 0.06 | 1.6 | 0.069 | 15
8  | 0.02 | 2   | 0.073 | 9
9  | 0.36 | 2.8 | 0.071 | 17
10 | 0.08 | 1.8 | 0.097 | 15

[3-D scatter plot of the points along variable 2 (residual sugar), variable 3 (chlorides) and variable 4 (sulfur dioxide), with the mean of the points highlighted]
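
In code, such a data set is simply an N × d matrix; a NumPy view of the ten points above:

```python
import numpy as np

# The ten samples of the table: rows = points, columns =
# (citric acid, residual sugar, chlorides, sulfur dioxide).
X = np.array([[0.00, 1.9, 0.076, 11],
              [0.00, 2.6, 0.098, 25],
              [0.04, 2.3, 0.092, 15],
              [0.56, 1.9, 0.075, 17],
              [0.00, 1.9, 0.076, 11],
              [0.00, 1.8, 0.075, 13],
              [0.06, 1.6, 0.069, 15],
              [0.02, 2.0, 0.073,  9],
              [0.36, 2.8, 0.071, 17],
              [0.08, 1.8, 0.097, 15]])

print(X.shape)         # (10, 4): N = 10 points, d = 4 features
print(X[3])            # a single point x in R^4
print(X.mean(axis=0))  # the mean of the points shown in the plot
```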



Data types

Sensors → Quantitative and qualitative variables (ordinal, nominal)
Text → Strings
Speech → Time series
Images → 2D data
Videos → 2D data + time
Networks → Graphs
Stream → Logs, coupons, ...
Labels → Expected prediction output



Approaches of Machine Learning



Supervised learning
Principle
Given a set of N training examples {(x_i, y_i) ∈ X × Y, i = 1, · · · , N}, we want to estimate a prediction function y = f(x).
The supervision comes from the knowledge of the labels.

Examples
Image classification, object detection, stock price prediction . . .
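
A minimal sketch of the principle, with invented toy data and logistic regression standing in for the hypothesis class (scikit-learn assumed available):

```python
from sklearn.linear_model import LogisticRegression

# Training pairs (x_i, y_i): inputs in R^2 with their labels.
X = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y = [-1, -1, 1, 1]

f = LogisticRegression().fit(X, y)  # estimate the prediction function f
print(f.predict([[7.5, 8.5]]))      # predicted label y for a new input x
```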



Unsupervised learning
Principle
Only the {x_i ∈ X, i = 1, · · · , N} are available. We aim to describe how the data is organized and to extract homogeneous subsets from it.

Examples
Customer segmentation, image segmentation, data visualization, categorization of similar documents, ...
[Scatter plot: 2-D embedding of data points labeled 0 to 9; points with the same label form homogeneous clusters]
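
A minimal sketch of such a grouping on synthetic data, with k-means standing in for the clustering algorithm (scikit-learn assumed available):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: only the x_i are given, drawn here from two blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(6.0, 1.0, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # centers of the homogeneous subsets found
print(km.labels_[:5])       # cluster assigned to the first points
```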



Supervised learning : concept

Let X and Y be two sets, and let p(X, Y) be the joint probability distribution of (X, Y) ∈ X × Y.

Goal : find a prediction function f : X → Y which correctly estimates the output y corresponding to an input x.

f belongs to a space H called the hypothesis class. Example of H : the set of polynomial functions.



Supervised learning: principle

Loss function L(Y, f(X))

evaluates how "close" the prediction f(x) is to the true label y
it penalizes errors: L(y, f(x)) = 0 if y = f(x), and L(y, f(x)) ≥ 0 if y ≠ f(x)

True risk (expected prediction error)

R(f) = E_{(X,Y)}[L(Y, f(X))] = ∫_{X×Y} L(y, f(x)) p(x, y) dx dy

Objective
Identify the prediction function which minimizes the true risk, i.e.

f* = arg min_{f ∈ H} R(f)



Loss functions
Quadratic loss : L(Y, f(X)) = (Y − f(X))²
ℓ1 loss (absolute deviation) : L(Y, f(X)) = |Y − f(X)|
0−1 loss : L(y, f(x)) = 0 if y = f(x), 1 otherwise
Hinge loss : L(y, f(x)) = max(0, 1 − y f(x))
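
These losses translate directly into code; a vectorized NumPy sketch:

```python
import numpy as np

def quadratic(y, fx): return (y - fx) ** 2
def l1(y, fx):        return np.abs(y - fx)
def zero_one(y, fx):  return (y != fx).astype(float)        # 0 if equal, 1 otherwise
def hinge(y, fx):     return np.maximum(0.0, 1.0 - y * fx)  # labels y in {-1, +1}

y  = np.array([1.0, -1.0,  1.0])
fx = np.array([0.8,  0.3, -0.2])
print(quadratic(y, fx), l1(y, fx), hinge(y, fx))
```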



Some supervised learning problems

Regression

We talk about regression when Y is a subset of R^d.

Usual related loss function: the quadratic loss (y − f(x))²

[Figure: Support Vector Machine regression on noisy 1-D data]
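
A minimal regression sketch under the quadratic loss, using invented 1-D data and a straight-line model (not the SVM regression of the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
y = np.sin(x) + 0.1 * rng.normal(size=50)  # noisy real-valued targets

w, b = np.polyfit(x, y, deg=1)             # least-squares fit of f(x) = w*x + b
y_hat = w * x + b
print(np.mean((y - y_hat) ** 2))           # average quadratic loss
```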



Some supervised learning problems
Classification
Output space Y is an unordered discrete set
Binary classification: card(Y) = 2
Example: Y = {−1, 1}
Loss functions: 0-1 loss, hinge loss
Multiclass classification: card(Y) > 2
Example: Y = {1, 2, · · · , K }



From true risk to empirical risk

Minimizing the true risk is not doable (in almost all practical applications):

The joint distribution p(X, Y) is unknown!

Only a finite training set {(x_i, y_i) ∈ X × Y}_{i=1}^N is available

Empirical risk

R_emp(f) = (1/N) Σ_{i=1}^N L(y_i, f(x_i))

Empirical risk minimization

f̂ = arg min_{f ∈ H} R_emp(f)
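
In code, the empirical risk is just the average loss over the training set; a small sketch with a placeholder model f and the quadratic loss:

```python
import numpy as np

def f(x):            # placeholder prediction function (an assumption)
    return 2.0 * x

def empirical_risk(f, X, Y, loss=lambda y, fx: (y - fx) ** 2):
    """R_emp(f) = (1/N) * sum_i L(y_i, f(x_i))."""
    return np.mean([loss(yi, f(xi)) for xi, yi in zip(X, Y)])

X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([0.1, 2.2, 3.9, 6.1])
print(empirical_risk(f, X, Y))
```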



Overfitting
Overfitting
Empirical risk is not appropriate for model selection: if H is large enough, R_emp(f) → 0 while the generalization error (true risk) remains high.

[Figure: prediction error vs. model complexity (low → high); the training-set error keeps decreasing while the test-set error eventually rises]
Illustration of overfitting

[Figure: empirical risk on the training (App), validation (Val) and test sets, for decreasing values of k]
https://www.cs.princeton.edu/courses/archive/spring16/cos495/slides/ML_basics_lecture6_overfitting.pdf



Model selection

Find in H the best function f that, learned on the training set, will generalize well (low true risk)
Example : we look for the polynomial function f_α of degree α minimizing the risk R_emp(f_α) = (1/N) Σ_{i=1}^N (y_i − f_α(x_i))²
Goal :
1 propose a model estimation method in order to choose (approximately) the best model belonging to H
2 once the model is selected, estimate its generalization error



Model selection : basic approach
Case 1 : N is very large (large-scale D_N)

Available data: Training {X_app, Y_app} | Validation {X_val, Y_val} | Test {X_test, Y_test}

1 Randomly split D_N = D_train ∪ D_val ∪ D_test
2 For each α, train f_α on D_train
3 Evaluate its performance on D_val : R_val = (1/N_val) Σ_{i ∈ D_val} L(y_i, f_α(x_i))
4 Select the model with the best performance on D_val
5 Test the selected model on D_test

Note
Dtest is used once!
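
A sketch of the whole procedure on invented data, with the polynomial degree α as the model family and NumPy's polyfit standing in for the training step:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=300)
y = np.sin(x) + 0.2 * rng.normal(size=300)

idx = rng.permutation(300)                       # random split of D_N
tr, va, te = idx[:200], idx[200:250], idx[250:]  # D_train, D_val, D_test

def val_risk(alpha):
    coeffs = np.polyfit(x[tr], y[tr], deg=alpha)  # train f_alpha on D_train
    return np.mean((y[va] - np.polyval(coeffs, x[va])) ** 2)

best = min(range(1, 10), key=val_risk)            # select on D_val
coeffs = np.polyfit(x[tr], y[tr], deg=best)
test_risk = np.mean((y[te] - np.polyval(coeffs, x[te])) ** 2)
print(best, test_risk)                            # D_test is used once
```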



Model selection : Cross-validation
Case 2 : small or medium-scale D_N

Estimate the generalization error by re-sampling.


Principle
1 Split DN into K sets of equal size.
2 For each k = 1, · · · , K , train a model by using the K − 1 remaining
sets and evaluate the model on the k-th part.
3 Average the K error estimates obtained to have the cross-validation
error.

[Diagram: K-fold split of the data; each fold serves once as the validation set while the remaining folds form the training set, with a separate test set held out]
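
A minimal K-fold sketch on invented data, with ridge regression as a placeholder model (scikit-learn assumed available):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

errors = []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[tr], y[tr])  # train on the K-1 remaining folds
    errors.append(np.mean((y[va] - model.predict(X[va])) ** 2))

print(np.mean(errors))  # cross-validation estimate of the error
```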



Conclusions

To successfully carry out an automatic data processing project


Clearly identify and spell out the needs.
Create or obtain data representative of the problem
Identify the context of learning
Analyze and reduce data size
Choose an algorithm and/or a space of hypotheses
Choose a model by applying the algorithm to pre-processed data
Validate the performance of the method



In the end ...
Find your way. . .
