
Artificial Intelligence: models and technologies

- An introduction from the ICT perspective -

Prof Manuel Roveri – manuel.roveri@polimi.it

Lecture 1 - ICT4Dev Summer School


Prof. Manuel Roveri

• Full Professor
Dipartimento di Elettronica, Informazione e Bioingegneria
(DEIB), Politecnico di Milano, Italy
Email: manuel.roveri@polimi.it
Web: http://roveri.faculty.polimi.it
• Research interests: TinyML, IoT and edge computing, privacy-
preserving machine and deep learning
• Lecturer of «Computing Infrastructures» and «Hardware
Architectures for Embedded and Edge AI»
• Associate Editor of IEEE Trans. on Artificial Intelligence, Neural
Networks, IEEE Trans. on Emerging Topics in Computational
Intelligence, and IEEE Trans. on Neural Networks and Learning Systems
• Chair of the IEEE CIS Technical Activities strategic planning
committee and IEEE CIS Neural Network Technical Committee
• Co-Founder of DHIRIA, a Spin-Off of Politecnico di Milano

2
Summary
• Introduction to AI projects
• AI models and prediction
• The basics of learning
• Neural Networks
• Evaluating AI models
• Technologies for AI
• AI and Cloud
• Hardware accelerators
• AI and IoT
• Challenges and opportunities

3
Artificial Intelligence

Machine Learning

Deep Learning

4
5

Artificial Intelligence

Machine Learning

Deep Learning

[Timeline, 1945–today: pioneering works in the field of AI; statistical methods (Bayesian inference, k-nearest neighbour); backpropagation and neural networks; SVM; deep learning]

5
6

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors) alongside AI milestones (pioneering works in the field of AI; statistical methods — Bayesian inference, k-nearest neighbour; backpropagation and neural networks; SVM; deep learning)]

6
7

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors) providing growing computation and memory, alongside AI milestones (pioneering works, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

7
8
We live in the era of information abundance

Virtual Sensors
Real Sensors

Data is the new gold

8
The 4-layer AI model

Solutions

Capabilities

Methodologies

Technologies

9
The 4-layer AI model

Solutions

• Intelligent Data Processing,


• Virtual Assistant/Chatbot,
• Recommendation,
• Image Processing,
• Autonomous Vehicle,
• Intelligent Object,
• Language Processing,
• Autonomous Robot

10
The 4-layer AI model

Capabilities

• Natural Language Processing: Discourse and Dialogue, Information Extraction, Question Answering
• Image Processing: Signal Processing, Computer Vision, Face and Gesture Recognition, Image and Video Retrieval
• Learning: Computational Intelligence, Data Mining, Evolutionary Intelligence, Machine Learning, Optimization
• Reasoning and Planning: Knowledge Representation, Automated Reasoning, Ontologies, Knowledge Engineering, Replanning and Plan Repair, Temporal Planning, Scheduling
• Interacting Socially: Multi-Agent Systems, Game Theory, Coordination and Collaboration
• Interacting Physically: Localization, Mapping and Navigation, Behavior and Control, Motion and Path Planning, State Estimation

11
The 4-layer AI model

Methodologies

• Learning: Online learning, Regression, Reinforcement learning, Bandit algorithms, Supervised learning, Unsupervised learning, Neural networks, Deep learning, Genetic algorithms, Sequential decision, Classification
• Reasoning and Planning: Inference, Modal logics, Temporal planning, Linear planning, Mathematical programming, Integer programming, Approximation algorithms, Blackbox optimization, …
• Interacting Socially: Game theory, Mechanism design and auctions, Voting, Social choice theory, …

12
The 4-layer AI model

Technologies

13
The 4-layer AI model

Solutions

Capabilities

Methodologies

Technologies

14
From the 4-layer model to the design of AI projects

AI Engineer

15
From the 4-layer AI model
to the design of AI projects

Solutions

Capabilities AI Engineer

Methodologies

Technologies

16
Identify the class of solutions

Solutions

Capabilities AI Engineer

Methodologies

Technologies

17
Identify Capabilities and Methodologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

18
Identify Capabilities and Methodologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

19
Identify Capabilities and Methodologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

20
Identify the Technologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

21
Identify the Technologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

22
Design of an AI project

Solutions

Capabilities AI Engineer

Methodologies

Technologies

Model

23
The training set

Solutions

Capabilities AI Engineer

Methodologies

Technologies
Training
Data

24
The training set

Solutions

Capabilities AI Engineer

Methodologies

Technologies
Training

TRAINING
Data

25
What comes after training?

Training
TRAINING

Data

26
What comes after training? Validation

Training
TRAINING

Data
VALIDATION
Perturbed Regression Problem → BIAS
[Plots: the training data ZN, the target values y, and two fitted models f(θ, x)]

27
What comes after training? Validation

Training
TRAINING

Data
VALIDATION

28
What comes after training? Validation

Training
TRAINING

Data
VALIDATION

29
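To make the training/validation loop above concrete, here is a minimal sketch in Python (scikit-learn) on synthetic data; the library, the model and the split ratio are illustrative assumptions, not the course's own example.

```python
# Minimal sketch of the TRAINING / VALIDATION workflow on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))                        # inputs
y = np.sin(1.5 * x[:, 0]) + 0.1 * rng.standard_normal(200)   # noisy targets

# Split the available data into a training set and a validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(x_train, y_train)              # TRAINING
val_error = mean_squared_error(y_val, model.predict(x_val))   # VALIDATION
print(f"validation MSE: {val_error:.3f}")
```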
Ready to go!

Prediction

Input Data

30
Ready to go … but never-ending learning

Prediction

Input Data

Training Data

31
Ready to go … but never-ending learning

Prediction

Input Data

TRAINING/
ADAPTATION

Training Data

32
Ready to go … but never-ending learning

Prediction

Input Data

TRAINING/
ADAPTATION

Training Data
Solutions

Capabilities

Methodologies

Technologies
AI Engineer

33
34
Models and prediction

Regression Classification Clustering

Prediction Change Detection Adaptation

Application

Reference
concept

Detection
trigger

35
Models and prediction

Lecture 2 Lecture 3 Lecture 4


Regression Classification Clustering

Prediction Change Detection Adaptation

Application

Reference
concept

Detection
trigger

Lecture 5

Colab Notebook and examples

36
Models and prediction: Regression

Regression

37
Models and prediction: Regression

Regression
[Scatter plot: data points y — Interest vs. Age]

38
Models and prediction: Regression

Regression
[Scatter plot: the training set ZN — Interest vs. Age]

39
Models and prediction: Regression

Regression
[Plot: the fitted model f(θ, x) over the training set ZN — Interest vs. Age]

40
Models and prediction: Regression

Regression
[Plot: the fitted model f(θ, x), the training set ZN and the target values y — Interest vs. Age]

41
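As a purely illustrative counterpart of the regression pictures above, the sketch below fits a parameterized family f(θ, x) — here a cubic polynomial, an assumption of the example — to noisy (Age, Interest) pairs generated synthetically with NumPy.

```python
# Illustrative sketch: fit a parameterized family f(theta, x) to noisy data.
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(-2, 2, 100)                                    # x
interest = np.sin(1.7 * age) + 0.15 * rng.standard_normal(100)   # y = g(x) + eta

theta = np.polyfit(age, interest, deg=3)   # least-squares estimate of the parameters
f = np.poly1d(theta)                       # the fitted model f(theta_hat, x)
print("fitted coefficients:", theta)
print("prediction at x = 0.5:", f(0.5))
```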
Models and prediction: Classification

Classification

42
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

43
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

44
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

45
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

46
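A hedged sketch of the classification setting above: synthetic (Age, Income) points with binary labels and a k-nearest-neighbour classifier from scikit-learn; k-NN is just one of the techniques mentioned later (KNN, feedforward NN, SVM, …).

```python
# Sketch of a classifier on two features (e.g., Age and Income); synthetic data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                    # columns: age, income (standardized)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic class labels

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[0.2, -0.1]]))                # class of a new (age, income) point
```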
Models and prediction: Clustering

Clustering

47
Models and prediction: Clustering

Clustering

48
Models and prediction: Clustering

Clustering
[Plot: clusters with an OUTLIER highlighted]

49
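A minimal clustering sketch, assuming scikit-learn's k-means on synthetic 2-D data; the simple distance-to-centroid rule for flagging outliers is an illustrative choice, not the lecture's method.

```python
# Sketch of clustering with k-means; points far from every centroid may be outliers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print("possible outliers:", np.where(dist_to_centroid > 1.0)[0])
```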
Models and prediction: Prediction

Prediction

50
Models and prediction: Prediction

Prediction

Prediction Model

51
Models and prediction: Prediction

Prediction

Prediction Model

52
Models and prediction: Prediction

Prediction

Prediction Model

53
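A minimal sketch of a prediction model: an autoregressive predictor fitted by least squares that uses the last p samples of a series to predict the next one. The series, the order p and the fitting procedure are assumptions made for illustration.

```python
# One-step-ahead prediction with a least-squares autoregressive predictor.
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(300)
series = np.sin(0.1 * t) + 0.1 * rng.standard_normal(t.size)

p = 3                                                          # past samples used
X = np.column_stack([series[i:-(p - i)] for i in range(p)])    # lagged inputs
y = series[p:]                                                 # next value to predict
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

next_value = series[-p:] @ coeffs                              # one-step-ahead prediction
print(f"predicted next value: {next_value:.3f}")
```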
Models and prediction: Change Detection

Change Detection

54
Models and prediction: Change Detection

Change Detection

55
Models and prediction: Change Detection

Change Detection

Change Detection

56
Models and prediction: Change Detection

Change Detection

Change Detection

Application

Alarm

Analytics

57
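A simplified sketch of a change-detection test: compare the mean of a sliding window of the data stream against the mean learned on the initial (reference) concept and raise a detection trigger when the difference exceeds a threshold. Real change-detection tests from the literature are more principled; this toy version only illustrates the mechanism.

```python
# Toy change-detection test on a data stream (mean shift).
import numpy as np

rng = np.random.default_rng(5)
stream = np.concatenate([rng.normal(0.0, 1.0, 500),    # reference concept
                         rng.normal(1.5, 1.0, 500)])   # changed concept

ref_mean, ref_std = stream[:200].mean(), stream[:200].std()
window = 50
threshold = 3 * ref_std / np.sqrt(window)

for t in range(200 + window, len(stream)):
    if abs(stream[t - window:t].mean() - ref_mean) > threshold:
        print(f"change detected at sample {t}")        # detection trigger
        break
```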
Models and prediction: Adaptation

Adaptation

Application

Reference
concept

Detection
trigger

58
Models and prediction: Adaptation

Adaptation

Application

Reference
concept

Detection
trigger

59
The basics of Learning

60
A ”toy” example

??? — A very tough classification problem
Physical model? — I did not complete my PhD in Physics yet
Data-driven might be a good solution (brute force as well)

[Images labeled: Pass, Pass, NO Pass, Pass, NO Pass]

What is the learning goal here?

61
Data-processing and applications

[Diagram: the data generating process P, the model of the system, and the application]

62
Learning the system model

The data generating process can be described by the discrete-time model

x(k+1) = f(x(k), u(k)) + η(k)
y(k)   = h(x(k), u(k)) + d(k)

where x ∈ R^n is the state vector, y ∈ R^l is the output vector, u ∈ R^m is the input vector (which may consist of controlled inputs as well as uncontrolled inputs that can however be measured), η is the i.i.d. random variable describing the uncertainty affecting the state vector, and d is an independent and identically distributed (i.i.d.) random variable describing the noise affecting the output vector. The functions f and h are, in general, non-linear and are assumed to be unknown.

The output equation models the relationship among the output, the state and the input variables, while the state equation models the evolution of the state variables over time with respect to the inputs and the states. This discrete-time model is quite general and allows the modeling of a wide range of applications; it can be specialized to cover interesting cases: regression models, input-output models and the general state-space case.

Regression models. When the process has no internal states (i.e., the system has no dynamics), the output variables depend only on the input variables at time k and the model reduces to

y(k) = h(u(k)) + d(k).

If the relationship between y(k) and u(k) is linear, the system model simplifies to

y(k) = D u(k) + d(k),

where D is an l × m matrix.

Input-output models. When the output depends on a finite number of past outputs and inputs, the model, here considered in the SISO scenario, becomes

y(k) = h(y(k−1), y(k−2), ..., y(k−k_y), u(k), u(k−1), ..., u(k−k_u)) + d(k).

In the linear MIMO case the model assumes the general canonical form

A(z) y(k) = Σ_{i=1..m} [B_i(z) / F_i(z)] u_i(k) + [C(z) / D(z)] d(k),

where z is the time-shift operator, A(z), B_i(z), C(z), D(z) and F_i(z) are z-transform polynomials and u_i is the i-th input. From this canonical form the linear input-output models widely used in system identification (e.g., AR, ARX and OE) can be obtained. A priori information about the nature of the system can be exploited to build an effective model; after the system has been identified with a suitable model, the bias component of the residual error vanishes and the residual satisfies the i.i.d. hypothesis, which is useful for the subsequent statistical change-detection phase.

AR model. When the system can be expressed as a linear autoregressive (AR) model, the canonical form simplifies to a linear relationship between the output y(k) at time k and its previous values:

A(z) y(k) = d(k).

63
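As an illustration of the linear regression case y(k) = D u(k) + d(k) above, the sketch below generates synthetic input-output data and estimates D by least squares; the true matrix and the noise level are assumptions of the example.

```python
# Identify the linear regression model y(k) = D u(k) + d(k) by least squares.
import numpy as np

rng = np.random.default_rng(7)
D_true = np.array([[1.0, -0.5],
                   [0.3,  2.0]])                            # unknown l x m matrix
U = rng.standard_normal((500, 2))                           # inputs u(k)
Y = U @ D_true.T + 0.05 * rng.standard_normal((500, 2))     # outputs y(k)

D_hat, *_ = np.linalg.lstsq(U, Y, rcond=None)               # estimates D^T
print(np.round(D_hat.T, 2))                                 # close to D_true
```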
Supervised Learning: Statistical framework
Regression Classification

64
Statistical Learning: the approach

The training set
Let ZN = {(x1, y1), ..., (xN, yN)} be the set composed of N (input-output) couples.
The goal of machine learning is to build the simplest approximating model able to explain the past data ZN and the future instances that will be provided by the data generating process.
Consider then the situation where the process generating the data (the system model) is ruled by

y = g(x) + η

65
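A small sketch of this statistical framework with synthetic data: an (unknown) function g(x), a noise realization η, and the resulting training set ZN of N couples. The specific g and the noise level are arbitrary assumptions.

```python
# Build a synthetic training set ZN = {(x_i, y_i)} from y = g(x) + eta.
import numpy as np

rng = np.random.default_rng(6)
N = 100
g = lambda x: np.cos(2.0 * x)                 # the (unknown) function to learn
x = rng.uniform(-2, 2, N)
eta = 0.2 * rng.standard_normal(N)            # realization of the noise term
y = g(x) + eta
ZN = list(zip(x, y))                          # the training set
print(ZN[:3])
```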
Designing a classifier

Consider a bidimensional problem

[Scatter plot: labeled points in the (Feature 1, Feature 2) plane]

Designing a classifier requires identifying the function separating the labeled points

66
Some issues we need to focus on

• Linear vs. non-linear separability
• The number of points we would need vs. the points actually available
• Several techniques are available to design the classifier (KNN, feedforward NN, SVM, …)

67
Non-linear regression

[Plot: noisy samples (xi, yi) of an unknown function of x]

Given a set of n noise-affected couples (xi, yi), we wish to reconstruct the unknown function

68
Non-linear regression: statistical framework

The time-invariant process generating the data is ruled by

y = g(x) + η,

where η is a noise term modeling the existing uncertainty affecting the unknown non-linear function g(x), if any. Once the generic data instance xi is available, the process provides the value yi = g(xi) + ηi, ηi being a realization of the random variable η. Both inputs and outputs are quantities measurable through sensors.

We collect a set of couples (training set) ZN = {(x1, y1), ..., (xN, yN)} of instances provided by the process, and we wish to model the unknown process through the parameterized model family

f(θ, x),

parameterized in the parameter vector θ ∈ Θ ⊂ R^p. The ultimate goal of learning is to build the simplest approximation of g(x), based on the information present in the dataset ZN, able to explain the past data and the future instances provided by the data generating process.

The selection of a suitable family of models f(θ, x) can be driven by a priori information about the system model: if the data are likely to be generated by a linear model (or a linear model suffices), then that type of model should be considered, and learning relies on the vast results provided by the system identification theory, e.g., see [130]. The outcome of the learning procedure is the parameter configuration θ̂ and, hence, the model f(θ̂, x), whose quality/accuracy must be assessed. If the accuracy performance is not met, and margin for improvement exists, we have to select a new model family and reiterate the learning process.

Since the estimated parameter vector is a realization of a random variable centered on the optimal one, the model obtained from the available data can be seen as a perturbed model, induced by perturbations affecting the parameter vector.

69
Inherent, approximation and estimation risks

Rewrite the structural risk V̄(θ̂) associated with model f(θ̂, x), i.e., the performance of the obtained model, as

V̄(θ̂) = [V̄(θ̂) − V̄(θ°)] + [V̄(θ°) − VI] + VI
          estimation risk    approximation risk    inherent risk

The risk associated with the model is composed of three terms:

• The inherent risk VI depends only on the structure of the learning problem and, for this reason, can be reduced only by improving the problem itself, i.e., by acting on the process generating the data (e.g., by designing a more precise sensor architecture). It is the minimum risk we can have, and we reach it — implying optimal accuracy performance — when the other two sources of risk are null.
• The approximation risk V̄(θ°) − VI depends on how close the model family (also named hypothesis space) is to the process generating the data.
• The estimation risk V̄(θ̂) − V̄(θ°) depends on the ability of the learning algorithm to select a parameter vector θ̂ close to θ°. If we have an effective learning process, we hope to be able to get a θ̂ close to θ° so that its contribution to the model risk is negligible.

70
71
Approximation and estimation risks

Optimal Model

Selected Model
Best Reachable Model

Model Space

Target Space

71
72
Approximation and estimation risks

Optimal Model
Approx.
Error
Selected Model
Best Reachable Model
Estimation
Error

Model Space

Target Space

72
What about Neural Networks?

73
Modelling space and time

The model encompasses


the concept of space,
time and status of the
neuron

74
Neural computation

The scalar product:
evaluates the affinity between values

75
Neural computation

Activation function
• Heaviside
• Sigmoidal
• Linear

76
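A minimal sketch of the neural computation just described: the scalar product between inputs and weights plus a bias, followed by one of the listed activation functions (Heaviside, sigmoidal, linear). The numeric values are arbitrary.

```python
# A single artificial neuron: scalar product + activation function.
import numpy as np

def neuron(x, w, b, activation="sigmoid"):
    z = np.dot(w, x) + b                      # scalar product + bias
    if activation == "heaviside":
        return float(z >= 0)
    if activation == "linear":
        return z
    return 1.0 / (1.0 + np.exp(-z))           # sigmoidal activation

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8,  0.2, -0.4])
print(neuron(x, w, b=0.1))
```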
Multi-layer Neural Networks

[Diagram: a multi-layer feedforward network mapping the inputs xi to the output y(x)]

If the accuracy of a model f(θ̂, x) is not satisfactory and margin for improvement exists, a new model family has to be selected and learning reiterated. For instance, if the difference (residual) between the reconstructed value f(θ̂, x) and the measured value y(x) on a new data set is not a white noise (test procedure), then there is information that model f(θ̂, x) was not able to capture. A new, richer model family should be chosen and learning restarted. In this direction, feedforward neural networks have been shown to be universal function approximators [131], i.e., they can approximate any nonlinear function, and are ideal candidates to solve the above learning problem.

77
Why Neural Networks?

78
Universal approximation theorem

A feedforward network with a single hidden layer containing a finite
number of neurons can approximate any continuous function defined on
compact subsets

K. Hornik, "Approximation Capabilities of Multilayer Feedforward Networks",
Neural Networks, Vol. 4, No. 2, pp. 251–257, 1991

79
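To see the theorem in action on a toy case, the hedged sketch below fits a single-hidden-layer feedforward network to samples of a continuous function on a compact interval, using scikit-learn's MLPRegressor; the network size and the training settings are assumptions, not prescriptions.

```python
# A one-hidden-layer network approximating a continuous function on [-2, 2].
import numpy as np
from sklearn.neural_network import MLPRegressor

x = np.linspace(-2, 2, 400).reshape(-1, 1)     # compact subset of R
y = np.sin(3 * x).ravel()                      # continuous target function

net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   max_iter=5000, random_state=0).fit(x, y)
print("max absolute error:", np.max(np.abs(net.predict(x) - y)))
```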
80
Approximation and estimation risks

Optimal Model
Approx.
Error
Selected Model
Best Reachable Model
Estimation
Error

Model Space

Target Space

80
81
Approximation and estimation risks

Optimal Model

Selected Model

Estimation
Error

NN Model Space
Approx. Error = 0
Target Space

81
Quality assessment of the solution
«How good is your ‘good’?»

82
Two examples: how good is my good solution?

Confusion Matrix — Apparent Error Rate (overall accuracy 98.0%)

Output \ Target     1       2       3     per-class precision
1                  50       0       0          100%
2                   0      47       0          100%
3                   0       3      50         94.3%
per-class recall  100%    94.0%    100%

Confusion Matrix — Sample Partitioning (overall accuracy 94.7%)

Output \ Target     1       2       3     per-class precision
1                  23       1       0         95.8%
2                   0      25       2         92.6%
3                   0       1      23         95.8%
per-class recall  100%    92.6%   92.0%

83
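Confusion matrices like the ones above can be computed, for example, with scikit-learn; the labels below are made up for illustration (note that scikit-learn puts the true class on the rows, while the matrices above put the predicted class on the rows).

```python
# Computing a confusion matrix and the overall accuracy.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 1, 2, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 2, 3, 2, 3, 3, 2, 3]

print(confusion_matrix(y_true, y_pred))        # rows: true class, columns: predicted
print(f"overall accuracy: {accuracy_score(y_true, y_pred):.1%}")
```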
84
Assessing the performance
§ Apparent Error Rate (AER), or resubstitution: the whole
set ZN is used both to infer the model and to estimate its
error.
§ Sample Partitioning (SP): SD and SE are obtained by
randomly splitting ZN into two disjoint subsets. SD is used to
estimate the model and SE to estimate its accuracy.
§ Leaving-One-Out (LOO): SE contains one pattern of ZN,
and SD contains the remaining n − 1 patterns. The procedure
is iterated n times by holding out each pattern in
ZN, and the resulting n estimates are averaged.

84
85
Assessing the performance (2)
§ w-fold Cross-validation (wCV): ZN is randomly split into
w disjoint subsets of equal size. For each subset, the
remaining w − 1 subsets are merged to form SD and
the reserved subset is used as SE. The resulting w
estimates are averaged. This procedure can be
iterated and the results averaged when w ≪ n in order
to reduce the random resampling variance. This
estimate is a generalization of LOO (LOO corresponds to w = n).

85
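A hedged sketch of the four error-estimation schemes (AER, SP, LOO, wCV) using scikit-learn on a synthetic classification problem; the classifier, the data set and w = 5 are illustrative choices.

```python
# AER (resubstitution), Sample Partitioning, Leave-One-Out and w-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=4, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

aer = clf.fit(X, y).score(X, y)                               # AER (resubstitution)
Xd, Xe, yd, ye = train_test_split(X, y, test_size=0.3, random_state=0)
sp = clf.fit(Xd, yd).score(Xe, ye)                            # Sample Partitioning
loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()     # Leaving-One-Out
wcv = cross_val_score(clf, X, y, cv=5).mean()                 # w-fold CV (w = 5)
print(f"AER={aer:.2f}  SP={sp:.2f}  LOO={loo:.2f}  5-fold CV={wcv:.2f}")
```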
What about the technology for AI?

86
“Artificial Intelligence deals with the development of
hardware-and-software systems endowed with
human-like capabilities, able to autonomously pursue a
given goal by making decisions that, until that moment,
were usually assigned to humans”

Osservatorio Artificial Intelligence,


Politecnico di Milano, 2018

87
87
«… hardware-and-software systems endowed with human-like capabilities …»

AI SOFTWARE

AI HARDWARE

88
88
«… hardware-and-software systems endowed with human-like capabilities …»

AI SOFTWARE
(Application)

AI SOFTWARE
(Framework/Platform/Tool)

AI HARDWARE

89
89
An IT perspective on AI

AI SOFTWARE SOFTWARE
(Application) (Application)

AI SOFTWARE SOFTWARE
(Framework/Platform/Tool) (Environment)

AI HARDWARE HARDWARE

90
90
An IT perspective on AI

The application, i.e., the reason why


SOFTWARE
this system exists
(Application)

Programs and libraries to control the


physical resources and provide tools SOFTWARE
to build applications (Environment)

Physical resources of the system


(computation, storage, input/output, HARDWARE
etc..)

91
91
An IT perspective on AI

AI SOFTWARE The AI-based application running


(Application) into the IT system

Programs and libraries to control the


AI SOFTWARE physical resources and provide tools
(Framework/Platform/Tool) to build applications

Hardware to support AI:


AI HARDWARE datacenters, edge computing, IoT,
CPU, GPU, TPU, …

92
92
AI and Technology

[Diagram: computation and memory (RAM) speedup across hardware tiers — Data Center, Edge Computing Systems, PC, Embedded PCs, Embedded Devices, Internet of Things — with indicative relative speedups ranging from 100x–1000x down to 0.0001x–0.0005x, annotated with AI milestones: Deep Blue defeated Kasparov (1996); IBM Watson won the Jeopardy first prize (2011); better accuracy than traditional computer vision (2012) and than humans (2015); AlphaGo beat the world No. 1 ranked player (2016); Libratus beat four of the world's top players at Texas hold'em poker (2017); autonomous vehicles, autonomous robots and intelligent objects push AI towards embedded devices and the IoT]

93
93
AI Hardware
AI HARDWARE

GPU, TPU, FPGA


Data Center

Edge
Computing
Systems

PC

Embedded
PCs
Embedded
Devices
Internet of
Things

94
94
AI Hardware: from the data center to ML-as-a-service

AI APPLICATION – Machine-Learning / Deep-Learning as-a-Service
AI PLATFORM – ML and DL Solutions
AI HARDWARE – Data Center

95

95
Why Machine Learning in the Cloud?

• Cloud computing simplifies the access to ML capabilities for
  • designing a solution (without requiring a deep knowledge of ML)
  • setting up a project (managing demand increases and the IT solution)
• Amazon, Microsoft and Google provide, to support ML in the Cloud:
  • Solutions – ML Solutions as a service
  • Platforms – ML Platforms as a service
  • Infrastructures – ML Infrastructures as a service

Machine Learning as a Service
96
AI and off-the-shelf technological solutions

Solutions

Cloud Platforms

HW, SW, Libraries

97
Machine Learning Infrastructure as a Service

Design ease ↔ Flexibility

Examples: Amazon EC2 AMIs, Google Cloud, Azure Deep Learning VM, IBM GPU bare-metal servers

The user designs the HW, SW and the AI solution on top of the HW, SW and libraries layer → fully customized solutions

98
Machine Learning Platform as a Service

Design ease ↔ Flexibility

Examples: Amazon SageMaker, Azure ML service, Google Cloud ML Engine, IBM Watson

The user designs the AI solution on pre-designed cloud platforms (built on the HW, SW and libraries layer) → pre-designed solutions

These platforms provide pre-configured environments used by AI experts to train, tune and host models

99
ML software as a service
Amazon AI Services, Google Cloud AI, Microsoft
Cognitive Services and IBM Watson
Design ease ↔ Flexibility

The user picks ready-to-use solutions with pre-defined actions/activities, built on top of the cloud platforms and the HW, SW and libraries layers

E.g., Amazon Lex, Polly, Rekognition, MS Speech-to-Text, …

100
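As an example of invoking a ready-to-use ML Solution as a service, the sketch below calls Amazon Rekognition's label detection through boto3; it assumes valid AWS credentials, a region and a local image file, all of which are placeholders.

```python
# Calling an "ML Solution as a service" (Amazon Rekognition) via boto3.
import boto3

client = boto3.client("rekognition", region_name="eu-west-1")   # region: placeholder
with open("example.jpg", "rb") as f:                            # image path: placeholder
    response = client.detect_labels(Image={"Bytes": f.read()}, MaxLabels=5)

for label in response["Labels"]:
    print(label["Name"], label["Confidence"])
```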
AI and off-the-shelf technological solutions

Design ease (User) ↔ Flexibility (Developer)

• Ready-to-use Solutions → pre-defined actions/activities
• Design of the AI solution on Cloud Platforms → pre-designed solutions
• Design of HW, SW and AI solution on HW, SW, Libraries → fully customized solutions

101
102
A real world example: image classification

102
103
A real world example: image classification

Solutions

Capabilities AI Engineer

Methodologies

Technologies

103
104
A real world example: image classification on AWS

Solutions

Capabilities AI Engineer

Methodologies
Technologies
• ML Solutions as a service: «Rekognition»
• ML Platforms as a service: «SageMaker»
• ML Infrastructures as a service: «EC2»
Training

TRAINING
Data

104
The evolution of computation and memory

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors, CPU) providing growing computation and memory, alongside AI milestones (pioneering works in the field of AI, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

105
Deep Learning: brain-inspired architecture

106
Deep representation of knowledge

107
108
Increasing the complexity of deep learning models

Complexity doubles
every 3.5 months

108
109
Increasing the complexity of deep learning models

Complexity doubles
every 3.5 months

18 months for Moore’s


Law

109
Generations and seasons:
the evolution of computation and memory

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors, CPU) providing growing computation and memory, alongside AI milestones (pioneering works in the field of AI, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

110
Generations and seasons:
the evolution of computation and memory

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors, CPU, and now GPU, TPU, FPGA) providing growing computation and memory, alongside AI milestones (pioneering works in the field of AI, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

111
Enabling accelerator operations

GPU, TPU, FPGA


Data Center

Edge
Computing
Systems

PC

Embedded
PCs
Embedded
Devices

Internet of
Things

112
113
Graphical Processing Units (GPU)

• Data-parallel computations: the same program is executed on many data elements in parallel
• Scientific codes have to be mapped onto matrix operations
• High-level languages (such as CUDA and OpenCL) target the GPUs directly
• Up to 1000x faster than a CPU

113
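A minimal check of data-parallel execution on a GPU, assuming a TensorFlow runtime (e.g., a Colab notebook with a GPU accelerator); the matrix sizes are arbitrary.

```python
# Check GPU visibility and run a data-parallel matrix multiplication.
import tensorflow as tf

print("GPUs visible:", tf.config.list_physical_devices("GPU"))

a = tf.random.normal((2048, 2048))
b = tf.random.normal((2048, 2048))
c = tf.matmul(a, b)            # runs on the GPU automatically when one is present
print(c.shape)
```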
114
Tensor Processing Unit (TPU)

• Custom-built integrated circuit developed specifically for machine learning and tailored for TensorFlow
• Powering Google data centers since 2015, alongside CPUs and GPUs
• A Tensor is an n-dimensional matrix: this is the basic unit of operation in TensorFlow

114
115
Field-Programmable Gate Array (FPGA)

• Array of logic gates that can be programmed (“configured”) in the field, i.e., by the user of the device as
opposed to the people who designed it
• Array of carefully designed and interconnected digital subcircuits that efficiently implement common
functions offering very high levels of flexibility. The digital subcircuits are called configurable logic
blocks (CLBs)

✓ VHDL and Verilog are hardware description languages (HDLs): languages that allow one to “describe” hardware;
✓ HDL code is more like a schematic that uses text to introduce components and create interconnections.

115
116
CPU, GPU, TPU and FPGA: an AI comparison

CPU
  Advantages: easy to program and supports any programming framework; fast design-space exploration and quick to run your applications.
  Disadvantages: most suited for simple models that do not take long to train and for small models with small training sets.

GPU
  Advantages: ideal for applications in which data need to be processed in parallel, like the pixels of images or videos.
  Disadvantages: programmed in languages like CUDA and OpenCL, and therefore provides limited flexibility compared to CPUs.

TPU
  Advantages: very fast at performing dense vector and matrix computations; specialized in running very fast programs based on TensorFlow.
  Disadvantages: only for applications and models based on TensorFlow; lower flexibility compared to CPUs and GPUs.

FPGA
  Advantages: higher performance, lower cost and lower power consumption compared to other options like CPUs and GPUs.
  Disadvantages: programmed using OpenCL and High-Level Synthesis (HLS); limited flexibility compared to other platforms.

116
Machine-Learning-as-a-Service: pros and cons

Pros — cloud computing simplifies the access to ML/DL capabilities for:
  ✓ designing a solution (without requiring a deep knowledge of ML/DL)
  ✓ setting up a project (managing demand increases and the IT solution)

Cons:
  ✗ Internet connection required
  ✗ High power consumption
  ✗ Privacy and security concerns
  ✗ Latency in making decisions

117
117
From Cloud to IoT and Edge
Computing: new platforms for ML/DL

118
118
IoT, PC Embedded and Edge Computing

AI APPLICATION

AI PLATFORM
Move intelligent processing as close as possible to the data generation units

AI HARDWARE

Internet-of-Things    PC Embedded    Edge Computing

119
119
IoT, PC Embedded and Edge Computing

AI APPLICATION

MACHINE AND DEEP LEARNING PLATFORMS FOR IoT AND EDGE
AI PLATFORM (TensorFlow Lite, PyTorch 1.3 Quantization, OpenVINO,
Intelligent Edge HOL, …)

AI HARDWARE

Internet-of-Things    PC Embedded    Edge Computing

120
120
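A sketch of taking a model towards IoT/edge deployment with TensorFlow Lite: build (or load) a small Keras model, convert it and save the .tflite file; the toy architecture and the optimization flag are assumptions for illustration.

```python
# Convert a (placeholder) Keras model to TensorFlow Lite for edge deployment.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # e.g., post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```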
IoT, PC Embedded and Edge Computing

✓ Increase autonomy
✓ Reduce decision-making latency
✓ Reduce transmission bandwidth
AI APPLICATION ✓ Increase energy efficiency
✓ Security and privacy

AI PLATFORM MACHINE AND DEEP LEARNING PLATFORMS FOR IoT AND EDGE

AI HARDWARE

Internet-of-Things    PC Embedded    Edge Computing

121
121
122
Intelligent Internet-of-Things and Cyber-Physical Systems

[Diagram: Intelligent Processing of Sensing, Intelligent IoT and Cyber-Physical Systems, and Intelligent Cognitive Mechanisms for Actuation and Control, spanning the cyber domain and the physical domain; related properties: adaptation, self-awareness, reliability, self-diagnosis, self-healing]

122
123
Intelligent Internet-of-Things

[Diagram: an intelligent IoT unit — HW/sensors, application/service, environment, user — endowed with intelligent mechanisms: self-awareness, mechanisms to diagnose faults, self-healing mechanisms to repair faults, remote controllability, detection of changes in the users' behaviour, energy harvesting and smart energy management; open questions: which HW? which SW?]

123
124

Designing Intelligent IoT Systems: from centralized …

Scalability, Responsiveness — the application and the intelligent mechanisms run on the Cloud server; the EDGE and IoT units (sensors + communication) act as dumb units.

[Diagram: Cloud server (application, intelligent mechanisms) → EDGE nodes → IoT units (sensors, comm)]
124
125

Designing Intelligent Embedded Systems: … to distributed intelligent systems

Scalability, Responsiveness — the application and the intelligent mechanisms are distributed between the Cloud server and the EDGE and IoT units (sensors + communication), which become smart units.

[Diagram: Cloud server (application, intelligent mechanisms) ↔ smart EDGE nodes ↔ smart IoT units (sensors, comm)]
125
The last mile …

126
127

127
128

128
129

New York Times


May 26, 2016

129
FINDING ONE FACE IN A MILLION
A new benchmark test shows that even Google's facial recognition algorithm is far from perfect

Helen of Troy may have had the face that launched a thousand ships, but even the best facial recognition algorithms might have had trouble finding her in a crowd of a million strangers. The first public benchmark test based on 1 million faces has shown how facial recognition algorithms from Google and other research groups around the world still fall well short of perfection.

Facial recognition algorithms that had previously performed with more than 95 percent accuracy on a popular benchmark test involving 13,000 faces saw significant drops in accuracy when taking on the new MegaFace Challenge. The best performer, Google's FaceNet algorithm, dropped from near-perfect accuracy on the five-figure data set to 75 percent on the million-face test. Other top algorithms dropped from above 90 percent to below 60 percent. Some algorithms made the proper identification as seldom as 35 percent of the time.

"MegaFace's key idea is that algorithms should be evaluated at large scale," says Ira Kemelmacher-Shlizerman, an assistant professor of computer science at the University of Washington, in Seattle, and the project's principal investigator. "And we make a number of discoveries that are only possible when evaluating at scale."

The huge drops in accuracy when scanning a million faces matter because facial recognition algorithms inevitably face such challenges in the real world. People increasingly trust these algorithms to correctly identify them in security verification scenarios, and law enforcement may also rely on facial recognition to pick suspects out of the hundreds of thousands of faces captured on surveillance cameras.

The most popular benchmark until now has been the Labeled Faces in the Wild (LFW) test created in 2007. LFW includes 13,000 images of just 5,000 people. Many facial recognition algorithms have been fine-tuned to the point that they scored near-perfect accuracy when picking through the LFW images. Most researchers say that new benchmark challenges have been long overdue. "The big disadvantage is that [the field] is saturated—that is, there are many, many algorithms that perform above 95 percent on LFW," Kemelmacher-Shlizerman says. "This gives the impression that face recognition is solved and working perfectly."

With that in mind, University of Washington researchers raised the bar by creating the MegaFace Challenge using 1 million Flickr images of 690,000 unique faces that are publicly available under a Creative Commons license. The MegaFace Challenge forces facial recognition algorithms to do verification and identification, two separate but related tasks. Verification involves trying to correctly determine whether two faces presented to the facial recognition algorithm belong to the same person. Identification involves trying to find a matching photo of the same person among a million "distractor" faces. Initial results on algorithms developed by Google and four other research groups were presented at the IEEE Conference on Computer Vision and Pattern Recognition on 30 June. (One of MegaFace's developers also heads a computer vision team at Google's Seattle office.)

The results presented were a mix of the intriguing and the expected. Nobody was surprised that the algorithms' performances suffered as the number of distractor faces increased. And the fact that algorithms had trouble identifying the same person at different ages was a known problem. However, the results also showed that algorithms trained on relatively small data sets can compete with those trained on very large ones, such as Google's FaceNet, which was trained on more than 500 million photos of 10 million people.

For example, the FaceN algorithm from Russia's N-TechLab performed well on certain tasks in comparison with FaceNet, despite having trained on 18 million photos of 200,000 people. The SIAT MMLab algorithm, created by a Chinese team under the leadership of Yu Qiao, a professor with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, also performed well on certain tasks. Nevertheless, FaceNet has so far performed the best overall. It delivered the most consistent performance across all testing.

IEEE Spectrum News, August 2016

130
131
Thank you for your attention!

132
