
Artificial Intelligence: models and technologies

- An introduction from the ICT perspective -

Prof Manuel Roveri – manuel.roveri@polimi.it

Lecture 1 - ICT4Dev Summer School


Prof. Manuel Roveri

• Full Professor
Dipartimento di Elettronica, Informazione e Bioingegneria
(DEIB), Politecnico di Milano, Italy
Email: manuel.roveri@polimi.it
Web: http://roveri.faculty.polimi.it
• Research interests: TinyML, IoT and edge computing, privacy-
preserving machine and deep learning
• Lecturer of «Computing Infrastructures» and «Hardware
Architectures for Embedded and Edge AI»
• Associate Editor of IEEE Trans. on Artificial Intelligence, Neural
Networks, IEEE Trans. on Emerging Topics in Computational
Intelligence, and IEEE Trans. on Neural Networks and Learning Systems
• Chair of the IEEE CIS Technical Activities strategic planning
committee and IEEE CIS Neural Network Technical Committee
• Co-Founder of DHIRIA, a Spin-Off of Politecnico di Milano

2
Summary
• Introduction to AI projects
• AI models and prediction
• The basics of learning
• Neural Networks
• Evaluating AI models
• Technologies for AI
• AI and Cloud
• Hardware accelerators
• AI and IoT
• Challenges and opportunities

3
Artificial Intelligence

Machine Learning

Deep Learning

4
5

Artificial Intelligence

Machine Learning

Deep Learning

[Timeline, 1945–today: pioneering works in the field of AI; statistical methods (Bayesian inference, k-nearest neighbour); backpropagation and neural networks; SVM; deep learning]

5
6

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors) alongside AI milestones (pioneering works in the field of AI; statistical methods — Bayesian inference, k-nearest neighbour; backpropagation and neural networks; SVM; deep learning)]

6
7

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors) providing growing computation and memory, alongside AI milestones (pioneering works, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

7
8
We live in the era of information abundance

Virtual Sensors
Real Sensors

Data is the new gold

8
The 4-layer AI model

Solutions

Capabilities

Methodologies

Technologies

9
The 4-layer AI model

Solutions

• Intelligent Data Processing,


• Virtual Assistant/Chatbot,
• Recommendation,
• Image Processing,
• Autonomous Vehicle,
• Intelligent Object,
• Language Processing,
• Autonomous Robot

10
The 4-layer AI model

Capabilities

• Natural Language Processing: Discourse and Dialogue, Information Extraction, Question Answering
• Image Processing: Signal Processing, Computer Vision, Face and Gesture Recognition, Image and Video Retrieval
• Learning: Computational Intelligence, Data Mining, Evolutionary Intelligence, Machine Learning, Optimization
• Reasoning and Planning: Knowledge Representation, Automated Reasoning, Ontologies, Knowledge Engineering, Replanning and Plan Repair, Temporal Planning, Scheduling
• Interacting Socially: Multi-Agent Systems, Game Theory, Coordination and Collaboration
• Interacting Physically: Localization, Mapping and Navigation, Behavior and Control, Motion and Path Planning, State Estimation

11
The 4-layer AI model

Methodologies

• Learning: Online learning, Regression, Reinforcement learning, Bandit algorithms, Supervised learning, Unsupervised learning, Neural networks, Deep learning, Genetic algorithms, Sequential decision, Classification
• Reasoning and Planning: Inference, Modal logics, Temporal planning, Linear planning, Mathematical programming, Integer programming, Approximation algorithms, Blackbox optimization, …
• Interacting Socially: Game theory, Mechanism design and auctions, Voting, Social choice theory, …

12
The 4-layer AI model

Technologies

13
The 4-layer AI model

Solutions

Capabilities

Methodologies

Technologies

14
From the 4-layer model to the design of AI projects

AI Engineer

15
From the 4-layer AI model
to the design of AI projects

Solutions

Capabilities AI Engineer

Methodologies

Technologies

16
Identify the class of solutions

Solutions

Capabilities AI Engineer

Methodologies

Technologies

17
Identify Capabilities and Methodologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

18
Identify Capabilities and Methodologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

19
Identify Capabilities and Methodologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

20
Identify the Technologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

21
Identify the Technologies

Solutions

Capabilities AI Engineer

Methodologies

Technologies

22
Design of an AI project

Solutions

Capabilities AI Engineer

Methodologies

Technologies

Model

23
The training set

Solutions

Capabilities AI Engineer

Methodologies

Technologies
Training
Data

24
The training set

Solutions

Capabilities AI Engineer

Methodologies

Technologies
Training

TRAINING
Data

25
What comes after training?

Training
TRAINING

Data

26
What comes after training? Validation

Training
TRAINING

Data
VALIDATION
Perturbed Regression Problem → BIAS
[Plots: the training data ZN, the target values y, and two fitted models f(θ, x)]

27
What comes after training? Validation

Training
TRAINING

Data
VALIDATION

28
What comes after training? Validation

Training
TRAINING

Data
VALIDATION

29
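To make the training/validation loop above concrete, here is a minimal sketch in Python (scikit-learn) on synthetic data; the library, the model and the split ratio are illustrative assumptions, not the course's own example.

```python
# Minimal sketch of the TRAINING / VALIDATION workflow on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))                        # inputs
y = np.sin(1.5 * x[:, 0]) + 0.1 * rng.standard_normal(200)   # noisy targets

# Split the available data into a training set and a validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(x_train, y_train)              # TRAINING
val_error = mean_squared_error(y_val, model.predict(x_val))   # VALIDATION
print(f"validation MSE: {val_error:.3f}")
```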
Ready to go!

Prediction

Input Data

30
Ready to go … but never-ending learning

Prediction

Input Data

Training Data

31
Ready to go … but never-ending learning

Prediction

Input Data

TRAINING/
ADAPTATION

Training Data

32
Ready to go … but never-ending learning

Prediction

Input Data

TRAINING/
ADAPTATION

Training Data
Solutions

Capabilities

Methodologies

Technologies
AI Engineer

33
34
Models and prediction

Regression Classification Clustering

Prediction Change Detection Adaptation

Application

Reference
concept

Detection
trigger

35
Models and prediction

Lecture 2 Lecture 3 Lecture 4


Regression Classification Clustering

Prediction Change Detection Adaptation

Application

Reference
concept

Detection
trigger

Lecture 5

Colab Notebook and examples

36
Models and prediction: Regression

Regression

37
Models and prediction: Regression

Regression
[Scatter plot: data points y — Interest vs. Age]

38
Models and prediction: Regression

Regression
[Scatter plot: the training set ZN — Interest vs. Age]

39
Models and prediction: Regression

Regression
[Plot: the fitted model f(θ, x) over the training set ZN — Interest vs. Age]

40
Models and prediction: Regression

Regression
[Plot: the fitted model f(θ, x), the training set ZN and the target values y — Interest vs. Age]

41
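As a purely illustrative counterpart of the regression pictures above, the sketch below fits a parameterized family f(θ, x) — here a cubic polynomial, an assumption of the example — to noisy (Age, Interest) pairs generated synthetically with NumPy.

```python
# Illustrative sketch: fit a parameterized family f(theta, x) to noisy data.
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(-2, 2, 100)                                    # x
interest = np.sin(1.7 * age) + 0.15 * rng.standard_normal(100)   # y = g(x) + eta

theta = np.polyfit(age, interest, deg=3)   # least-squares estimate of the parameters
f = np.poly1d(theta)                       # the fitted model f(theta_hat, x)
print("fitted coefficients:", theta)
print("prediction at x = 0.5:", f(0.5))
```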
Models and prediction: Classification

Classification

42
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

43
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

44
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

45
Models and prediction: Classification

Classification

[Scatter plot: Income vs. Age]

46
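A hedged sketch of the classification setting above: synthetic (Age, Income) points with binary labels and a k-nearest-neighbour classifier from scikit-learn; k-NN is just one of the techniques mentioned later (KNN, feedforward NN, SVM, …).

```python
# Sketch of a classifier on two features (e.g., Age and Income); synthetic data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                    # columns: age, income (standardized)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic class labels

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[0.2, -0.1]]))                # class of a new (age, income) point
```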
Models and prediction: Clustering

Clustering

47
Models and prediction: Clustering

Clustering

48
Models and prediction: Clustering

Clustering
[Plot: clusters with an OUTLIER highlighted]

49
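A minimal clustering sketch, assuming scikit-learn's k-means on synthetic 2-D data; the simple distance-to-centroid rule for flagging outliers is an illustrative choice, not the lecture's method.

```python
# Sketch of clustering with k-means; points far from every centroid may be outliers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print("possible outliers:", np.where(dist_to_centroid > 1.0)[0])
```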
Models and prediction: Prediction

Prediction

50
Models and prediction: Prediction

Prediction

Prediction Model

51
Models and prediction: Prediction

Prediction

Prediction Model

52
Models and prediction: Prediction

Prediction

Prediction Model

53
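A minimal sketch of a prediction model: an autoregressive predictor fitted by least squares that uses the last p samples of a series to predict the next one. The series, the order p and the fitting procedure are assumptions made for illustration.

```python
# One-step-ahead prediction with a least-squares autoregressive predictor.
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(300)
series = np.sin(0.1 * t) + 0.1 * rng.standard_normal(t.size)

p = 3                                                          # past samples used
X = np.column_stack([series[i:-(p - i)] for i in range(p)])    # lagged inputs
y = series[p:]                                                 # next value to predict
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

next_value = series[-p:] @ coeffs                              # one-step-ahead prediction
print(f"predicted next value: {next_value:.3f}")
```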
Models and prediction: Change Detection

Change Detection

54
Models and prediction: Change Detection

Change Detection

55
Models and prediction: Change Detection

Change Detection

Change Detection

56
Models and prediction: Change Detection

Change Detection

Change Detection

Application

Alarm

Analytics

57
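A simplified sketch of a change-detection test: compare the mean of a sliding window of the data stream against the mean learned on the initial (reference) concept and raise a detection trigger when the difference exceeds a threshold. Real change-detection tests from the literature are more principled; this toy version only illustrates the mechanism.

```python
# Toy change-detection test on a data stream (mean shift).
import numpy as np

rng = np.random.default_rng(5)
stream = np.concatenate([rng.normal(0.0, 1.0, 500),    # reference concept
                         rng.normal(1.5, 1.0, 500)])   # changed concept

ref_mean, ref_std = stream[:200].mean(), stream[:200].std()
window = 50
threshold = 3 * ref_std / np.sqrt(window)

for t in range(200 + window, len(stream)):
    if abs(stream[t - window:t].mean() - ref_mean) > threshold:
        print(f"change detected at sample {t}")        # detection trigger
        break
```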
Models and prediction: Adaptation

Adaptation

Application

Reference
concept

Detection
trigger

58
Models and prediction: Adaptation

Adaptation

Application

Reference
concept

Detection
trigger

59
The basics of Learning

60
A ”toy” example

??? — A very tough classification problem
Physical model? — I did not complete my PhD in Physics yet
Data-driven might be a good solution (brute force as well)

[Images labeled: Pass, Pass, NO Pass, Pass, NO Pass]

What is the learning goal here?

61
Data-processing and applications

[Diagram: the data generating process P, the model of the system, and the application]

62
Learning the system model

The data generating process can be described by the discrete-time model

x(k+1) = f(x(k), u(k)) + η(k)
y(k)   = h(x(k), u(k)) + d(k)

where x ∈ R^n is the state vector, y ∈ R^l is the output vector, u ∈ R^m is the input vector (which may consist of controlled inputs as well as uncontrolled inputs that can however be measured), η is the i.i.d. random variable describing the uncertainty affecting the state vector, and d is an independent and identically distributed (i.i.d.) random variable describing the noise affecting the output vector. The functions f and h are, in general, non-linear and are assumed to be unknown.

The output equation models the relationship among the output, the state and the input variables, while the state equation models the evolution of the state variables over time with respect to the inputs and the states. This discrete-time model is quite general and allows the modeling of a wide range of applications; it can be specialized to cover interesting cases: regression models, input-output models and the general state-space case.

Regression models. When the process has no internal states (i.e., the system has no dynamics), the output variables depend only on the input variables at time k and the model reduces to

y(k) = h(u(k)) + d(k).

If the relationship between y(k) and u(k) is linear, the system model simplifies to

y(k) = D u(k) + d(k),

where D is an l × m matrix.

Input-output models. When the output depends on a finite number of past outputs and inputs, the model, here considered in the SISO scenario, becomes

y(k) = h(y(k−1), y(k−2), ..., y(k−k_y), u(k), u(k−1), ..., u(k−k_u)) + d(k).

In the linear MIMO case the model assumes the general canonical form

A(z) y(k) = Σ_{i=1..m} [B_i(z) / F_i(z)] u_i(k) + [C(z) / D(z)] d(k),

where z is the time-shift operator, A(z), B_i(z), C(z), D(z) and F_i(z) are z-transform polynomials and u_i is the i-th input. From this canonical form the linear input-output models widely used in system identification (e.g., AR, ARX and OE) can be obtained. A priori information about the nature of the system can be exploited to build an effective model; after the system has been identified with a suitable model, the bias component of the residual error vanishes and the residual satisfies the i.i.d. hypothesis, which is useful for the subsequent statistical change-detection phase.

AR model. When the system can be expressed as a linear autoregressive (AR) model, the canonical form simplifies to a linear relationship between the output y(k) at time k and its previous values:

A(z) y(k) = d(k).

63
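As an illustration of the linear regression case y(k) = D u(k) + d(k) above, the sketch below generates synthetic input-output data and estimates D by least squares; the true matrix and the noise level are assumptions of the example.

```python
# Identify the linear regression model y(k) = D u(k) + d(k) by least squares.
import numpy as np

rng = np.random.default_rng(7)
D_true = np.array([[1.0, -0.5],
                   [0.3,  2.0]])                            # unknown l x m matrix
U = rng.standard_normal((500, 2))                           # inputs u(k)
Y = U @ D_true.T + 0.05 * rng.standard_normal((500, 2))     # outputs y(k)

D_hat, *_ = np.linalg.lstsq(U, Y, rcond=None)               # estimates D^T
print(np.round(D_hat.T, 2))                                 # close to D_true
```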
Supervised Learning: Statistical framework
Regression Classification

64
Statistical Learning: the approach

The training set
Let ZN = {(x1, y1), ..., (xN, yN)} be the set composed of N (input-output) couples.
The goal of machine learning is to build the simplest approximating model able to explain the past data ZN and the future instances that will be provided by the data generating process.
Consider then the situation where the process generating the data (the system model) is ruled by

y = g(x) + η

65
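A small sketch of this statistical framework with synthetic data: an (unknown) function g(x), a noise realization η, and the resulting training set ZN of N couples. The specific g and the noise level are arbitrary assumptions.

```python
# Build a synthetic training set ZN = {(x_i, y_i)} from y = g(x) + eta.
import numpy as np

rng = np.random.default_rng(6)
N = 100
g = lambda x: np.cos(2.0 * x)                 # the (unknown) function to learn
x = rng.uniform(-2, 2, N)
eta = 0.2 * rng.standard_normal(N)            # realization of the noise term
y = g(x) + eta
ZN = list(zip(x, y))                          # the training set
print(ZN[:3])
```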
Designing a classifier

Consider a bidimensional problem

[Scatter plot: labeled points in the (Feature 1, Feature 2) plane]

Designing a classifier requires identifying the function separating the labeled points

66
Some issues we need to focus on

• Linear vs. non-linear separability
• The number of points we would need vs. the points actually available
• Several techniques are available to design the classifier (KNN, feedforward NN, SVM, …)

67
Non-linear regression

[Plot: noisy samples (xi, yi) of an unknown function of x]

Given a set of n noise-affected couples (xi, yi), we wish to reconstruct the unknown function

68
Non-linear regression: statistical framework

The time-invariant process generating the data is ruled by

y = g(x) + η,

where η is a noise term modeling the existing uncertainty affecting the unknown non-linear function g(x), if any. Once the generic data instance xi is available, the process provides the value yi = g(xi) + ηi, ηi being a realization of the random variable η. Both inputs and outputs are quantities measurable through sensors.

We collect a set of couples (training set) ZN = {(x1, y1), ..., (xN, yN)} of instances provided by the process, and we wish to model the unknown process through the parameterized model family

f(θ, x),

parameterized in the parameter vector θ ∈ Θ ⊂ R^p. The ultimate goal of learning is to build the simplest approximation of g(x), based on the information present in the dataset ZN, able to explain the past data and the future instances provided by the data generating process.

The selection of a suitable family of models f(θ, x) can be driven by a priori information about the system model: if the data are likely to be generated by a linear model (or a linear model suffices), then that type of model should be considered, and learning relies on the vast results provided by the system identification theory, e.g., see [130]. The outcome of the learning procedure is the parameter configuration θ̂ and, hence, the model f(θ̂, x), whose quality/accuracy must be assessed. If the accuracy performance is not met, and margin for improvement exists, we have to select a new model family and reiterate the learning process.

Since the estimated parameter vector is a realization of a random variable centered on the optimal one, the model obtained from the available data can be seen as a perturbed model, induced by perturbations affecting the parameter vector.

69
Inherent, approximation and estimation risks

Rewrite the structural risk V̄(θ̂) associated with model f(θ̂, x), i.e., the performance of the obtained model, as

V̄(θ̂) = [V̄(θ̂) − V̄(θ°)] + [V̄(θ°) − VI] + VI
          estimation risk    approximation risk    inherent risk

The risk associated with the model is composed of three terms:

• The inherent risk VI depends only on the structure of the learning problem and, for this reason, can be reduced only by improving the problem itself, i.e., by acting on the process generating the data (e.g., by designing a more precise sensor architecture). It is the minimum risk we can have, and we reach it — implying optimal accuracy performance — when the other two sources of risk are null.
• The approximation risk V̄(θ°) − VI depends on how close the model family (also named hypothesis space) is to the process generating the data.
• The estimation risk V̄(θ̂) − V̄(θ°) depends on the ability of the learning algorithm to select a parameter vector θ̂ close to θ°. If we have an effective learning process, we hope to be able to get a θ̂ close to θ° so that its contribution to the model risk is negligible.

70
71
Approximation and estimation risks

Optimal Model

Selected Model
Best Reachable Model

Model Space

Target Space

71
72
Approximation and estimation risks

Optimal Model
Approx.
Error
Selected Model
Best Reachable Model
Estimation
Error

Model Space

Target Space

72
What about Neural Networks?

73
Modelling space and time

The model encompasses


the concept of space,
time and status of the
neuron

74
Neural computation

The scalar product:
evaluates the affinity between values

75
Neural computation

Activation function
• Heaviside
• Sigmoidal
• Linear

76
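A minimal sketch of the neural computation just described: the scalar product between inputs and weights plus a bias, followed by one of the listed activation functions (Heaviside, sigmoidal, linear). The numeric values are arbitrary.

```python
# A single artificial neuron: scalar product + activation function.
import numpy as np

def neuron(x, w, b, activation="sigmoid"):
    z = np.dot(w, x) + b                      # scalar product + bias
    if activation == "heaviside":
        return float(z >= 0)
    if activation == "linear":
        return z
    return 1.0 / (1.0 + np.exp(-z))           # sigmoidal activation

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8,  0.2, -0.4])
print(neuron(x, w, b=0.1))
```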
Multi-layer Neural Networks

[Diagram: a multi-layer feedforward network mapping the inputs xi to the output y(x)]

If the accuracy of a model f(θ̂, x) is not satisfactory and margin for improvement exists, a new model family has to be selected and learning reiterated. For instance, if the difference (residual) between the reconstructed value f(θ̂, x) and the measured value y(x) on a new data set is not a white noise (test procedure), then there is information that model f(θ̂, x) was not able to capture. A new, richer model family should be chosen and learning restarted. In this direction, feedforward neural networks have been shown to be universal function approximators [131], i.e., they can approximate any nonlinear function, and are ideal candidates to solve the above learning problem.

77
Why Neural Networks?

78
Universal approximation theorem

A feedforward network with a single hidden layer containing a finite
number of neurons can approximate any continuous function defined on
compact subsets

K. Hornik, "Approximation Capabilities of Multilayer Feedforward Networks",
Neural Networks, Vol. 4, No. 2, pp. 251–257, 1991

79
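To see the theorem in action on a toy case, the hedged sketch below fits a single-hidden-layer feedforward network to samples of a continuous function on a compact interval, using scikit-learn's MLPRegressor; the network size and the training settings are assumptions, not prescriptions.

```python
# A one-hidden-layer network approximating a continuous function on [-2, 2].
import numpy as np
from sklearn.neural_network import MLPRegressor

x = np.linspace(-2, 2, 400).reshape(-1, 1)     # compact subset of R
y = np.sin(3 * x).ravel()                      # continuous target function

net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   max_iter=5000, random_state=0).fit(x, y)
print("max absolute error:", np.max(np.abs(net.predict(x) - y)))
```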
80
Approximation and estimation risks

Optimal Model
Approx.
Error
Selected Model
Best Reachable Model
Estimation
Error

Model Space

Target Space

80
81
Approximation and estimation risks

Optimal Model

Selected Model

Estimation
Error

NN Model Space
Approx. Error = 0
Target Space

81
Quality assessment of the solution
«How good is your ‘good’?»

82
Two examples: how good is my good solution?

Confusion Matrix — Apparent Error Rate (overall accuracy 98.0%)

Output \ Target     1       2       3     per-class precision
1                  50       0       0          100%
2                   0      47       0          100%
3                   0       3      50         94.3%
per-class recall  100%    94.0%    100%

Confusion Matrix — Sample Partitioning (overall accuracy 94.7%)

Output \ Target     1       2       3     per-class precision
1                  23       1       0         95.8%
2                   0      25       2         92.6%
3                   0       1      23         95.8%
per-class recall  100%    92.6%   92.0%

83
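Confusion matrices like the ones above can be computed, for example, with scikit-learn; the labels below are made up for illustration (note that scikit-learn puts the true class on the rows, while the matrices above put the predicted class on the rows).

```python
# Computing a confusion matrix and the overall accuracy.
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 1, 2, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 2, 3, 2, 3, 3, 2, 3]

print(confusion_matrix(y_true, y_pred))        # rows: true class, columns: predicted
print(f"overall accuracy: {accuracy_score(y_true, y_pred):.1%}")
```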
84
Assessing the performance
§ Apparent Error Rate (AER), or resubstitution: the whole
set ZN is used both to infer the model and to estimate its
error.
§ Sample Partitioning (SP): SD and SE are obtained by
randomly splitting ZN into two disjoint subsets. SD is used to
estimate the model and SE to estimate its accuracy.
§ Leaving-One-Out (LOO): SE contains one pattern of ZN,
and SD contains the remaining n − 1 patterns. The procedure
is iterated n times by holding out each pattern in
ZN, and the resulting n estimates are averaged.

84
85
Assessing the performance (2)
§ w-fold Cross-validation (wCV): ZN is randomly split into
w disjoint subsets of equal size. For each subset, the
remaining w − 1 subsets are merged to form SD and
the reserved subset is used as SE. The resulting w
estimates are averaged. This procedure can be
iterated and the results averaged when w ≪ n in order
to reduce the random resampling variance. This
estimate is a generalization of LOO (LOO corresponds to w = n).

85
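A hedged sketch of the four error-estimation schemes (AER, SP, LOO, wCV) using scikit-learn on a synthetic classification problem; the classifier, the data set and w = 5 are illustrative choices.

```python
# AER (resubstitution), Sample Partitioning, Leave-One-Out and w-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=4, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

aer = clf.fit(X, y).score(X, y)                               # AER (resubstitution)
Xd, Xe, yd, ye = train_test_split(X, y, test_size=0.3, random_state=0)
sp = clf.fit(Xd, yd).score(Xe, ye)                            # Sample Partitioning
loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()     # Leaving-One-Out
wcv = cross_val_score(clf, X, y, cv=5).mean()                 # w-fold CV (w = 5)
print(f"AER={aer:.2f}  SP={sp:.2f}  LOO={loo:.2f}  5-fold CV={wcv:.2f}")
```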
What about the technology for AI?

86
“Artificial Intelligence deals with the development of
hardware-and-software systems endowed with
human-like capabilities, able to autonomously pursue a
given goal by making decisions that, until that moment,
were usually assigned to humans”

Osservatorio Artificial Intelligence,


Politecnico di Milano, 2018

87
87
«… hardware-and-software systems endowed with human-like capabilities …»

AI SOFTWARE

AI HARDWARE

88
88
«… hardware-and-software systems endowed with human-like capabilities …»

AI SOFTWARE
(Application)

AI SOFTWARE
(Framework/Platform/Tool)

AI HARDWARE

89
89
An IT perspective on AI

AI SOFTWARE SOFTWARE
(Application) (Application)

AI SOFTWARE SOFTWARE
(Framework/Platform/Tool) (Environment)

AI HARDWARE HARDWARE

90
90
An IT perspective on AI

The application, i.e., the reason why


SOFTWARE
this system exists
(Application)

Programs and libraries to control the


physical resources and provide tools SOFTWARE
to build applications (Environment)

Physical resources of the system


(computation, storage, input/output, HARDWARE
etc..)

91
91
An IT perspective on AI

AI SOFTWARE The AI-based application running


(Application) into the IT system

Programs and libraries to control the


AI SOFTWARE physical resources and provide tools
(Framework/Platform/Tool) to build applications

Hardware to support AI:


AI HARDWARE datacenters, edge computing, IoT,
CPU, GPU, TPU, …

92
92
AI and Technology

[Diagram: computation and memory (RAM) speedup across hardware tiers — Data Center, Edge Computing Systems, PC, Embedded PCs, Embedded Devices, Internet of Things — with indicative relative speedups ranging from 100x–1000x down to 0.0001x–0.0005x, annotated with AI milestones: Deep Blue defeated Kasparov (1996); IBM Watson won the Jeopardy first prize (2011); better accuracy than traditional computer vision (2012) and than humans (2015); AlphaGo beat the world No. 1 ranked player (2016); Libratus beat four of the world's top players at Texas hold'em poker (2017); autonomous vehicles, autonomous robots and intelligent objects push AI towards embedded devices and the IoT]

93
93
AI Hardware
AI HARDWARE

GPU, TPU, FPGA


Data Center

Edge
Computing
Systems

PC

Embedded
PCs
Embedded
Devices
Internet of
Things

94
94
AI Hardware: from the data center to ML-as-a-service

AI APPLICATION – Machine-Learning / Deep-Learning as-a-Service
AI PLATFORM – ML and DL Solutions
AI HARDWARE – Data Center

95

95
Why Machine Learning in the Cloud?

• Cloud computing simplifies the access to ML capabilities for
  • designing a solution (without requiring a deep knowledge of ML)
  • setting up a project (managing demand increases and the IT solution)
• Amazon, Microsoft and Google provide, to support ML in the Cloud:
  • Solutions – ML Solutions as a service
  • Platforms – ML Platforms as a service
  • Infrastructures – ML Infrastructures as a service

Machine Learning as a Service
96
AI and off-the-shelf technological solutions

Solutions

Cloud Platforms

HW, SW, Libraries

97
Machine Learning Infrastructure as a Service

Design ease ↔ Flexibility

Examples: Amazon EC2 AMIs, Google Cloud, Azure Deep Learning VM, IBM GPU bare-metal servers

The user designs the HW, SW and the AI solution on top of the HW, SW and libraries layer → fully customized solutions

98
Machine Learning Platform as a Service

Design ease ↔ Flexibility

Examples: Amazon SageMaker, Azure ML service, Google Cloud ML Engine, IBM Watson

The user designs the AI solution on pre-designed cloud platforms (built on the HW, SW and libraries layer) → pre-designed solutions

These platforms provide pre-configured environments used by AI experts to train, tune and host models

99
ML software as a service
Amazon AI Services, Google Cloud AI, Microsoft
Cognitive Services and IBM Watson
Design ease ↔ Flexibility

The user picks ready-to-use solutions with pre-defined actions/activities, built on top of the cloud platforms and the HW, SW and libraries layers

E.g., Amazon Lex, Polly, Rekognition, MS Speech-to-Text, …

100
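As an example of invoking a ready-to-use ML Solution as a service, the sketch below calls Amazon Rekognition's label detection through boto3; it assumes valid AWS credentials, a region and a local image file, all of which are placeholders.

```python
# Calling an "ML Solution as a service" (Amazon Rekognition) via boto3.
import boto3

client = boto3.client("rekognition", region_name="eu-west-1")   # region: placeholder
with open("example.jpg", "rb") as f:                            # image path: placeholder
    response = client.detect_labels(Image={"Bytes": f.read()}, MaxLabels=5)

for label in response["Labels"]:
    print(label["Name"], label["Confidence"])
```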
AI and off-the-shelf technological solutions

Design ease (User) ↔ Flexibility (Developer)

• Ready-to-use Solutions → pre-defined actions/activities
• Design of the AI solution on Cloud Platforms → pre-designed solutions
• Design of HW, SW and AI solution on HW, SW, Libraries → fully customized solutions

101
102
A real world example: image classification

102
103
A real world example: image classification

Solutions

Capabilities AI Engineer

Methodologies

Technologies

103
104
A real world example: image classification on AWS

Solutions

Capabilities AI Engineer

Methodologies
Technologies
• ML Solutions as a service: «Rekognition»
• ML Platforms as a service: «SageMaker»
• ML Infrastructures as a service: «EC2»
Training

TRAINING
Data

104
The evolution of computation and memory

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors, CPU) providing growing computation and memory, alongside AI milestones (pioneering works in the field of AI, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

105
Deep Learning: brain-inspired architecture

106
Deep representation of knowledge

107
108
Increasing the complexity of deep learning models

Complexity doubles
every 3.5 months

108
109
Increasing the complexity of deep learning models

Complexity doubles
every 3.5 months

18 months for Moore’s


Law

109
Generations and seasons:
the evolution of computation and memory

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors, CPU) providing growing computation and memory, alongside AI milestones (pioneering works in the field of AI, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

110
Generations and seasons:
the evolution of computation and memory

[Timeline, 1945–today: computing milestones (ENIAC, Univac, transistor, integrated circuits, microprocessors, CPU, and now GPU, TPU, FPGA) providing growing computation and memory, alongside AI milestones (pioneering works in the field of AI, statistical methods — Bayesian inference, k-nearest neighbour — backpropagation and neural networks, SVM, deep learning) with growing computational needs and memory requirements]

111
Enabling accelerator operations

GPU, TPU, FPGA


Data Center

Edge
Computing
Systems

PC

Embedded
PCs
Embedded
Devices

Internet of
Things

112
113
Graphical Processing Units (GPU)

• Data-parallel computations: the same program is executed on many data elements in parallel
• Scientific codes have to be mapped onto matrix operations
• High-level languages (such as CUDA and OpenCL) target the GPUs directly
• Up to 1000x faster than a CPU

113
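A minimal check of data-parallel execution on a GPU, assuming a TensorFlow runtime (e.g., a Colab notebook with a GPU accelerator); the matrix sizes are arbitrary.

```python
# Check GPU visibility and run a data-parallel matrix multiplication.
import tensorflow as tf

print("GPUs visible:", tf.config.list_physical_devices("GPU"))

a = tf.random.normal((2048, 2048))
b = tf.random.normal((2048, 2048))
c = tf.matmul(a, b)            # runs on the GPU automatically when one is present
print(c.shape)
```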
114
Tensor Processing Unit (TPU)

• Custom-built integrated circuit developed specifically for machine learning and tailored for TensorFlow
• Powering Google data centers since 2015, alongside CPUs and GPUs
• A Tensor is an n-dimensional matrix: this is the basic unit of operation in TensorFlow

114
115
Field-Programmable Gate Array (FPGA)

• Array of logic gates that can be programmed (“configured”) in the field, i.e., by the user of the device as
opposed to the people who designed it
• Array of carefully designed and interconnected digital subcircuits that efficiently implement common
functions offering very high levels of flexibility. The digital subcircuits are called configurable logic
blocks (CLBs)

✓ VHDL and Verilog are hardware description languages (HDLs): languages that allow one to “describe” hardware;
✓ HDL code is more like a schematic that uses text to introduce components and create interconnections.

115
116
CPU, GPU, TPU and FPGA: an AI comparison

CPU
  Advantages: easy to program and supports any programming framework; fast design-space exploration and quick to run your applications.
  Disadvantages: most suited for simple models that do not take long to train and for small models with small training sets.

GPU
  Advantages: ideal for applications in which data need to be processed in parallel, like the pixels of images or videos.
  Disadvantages: programmed in languages like CUDA and OpenCL, and therefore provides limited flexibility compared to CPUs.

TPU
  Advantages: very fast at performing dense vector and matrix computations; specialized in running very fast programs based on TensorFlow.
  Disadvantages: only for applications and models based on TensorFlow; lower flexibility compared to CPUs and GPUs.

FPGA
  Advantages: higher performance, lower cost and lower power consumption compared to other options like CPUs and GPUs.
  Disadvantages: programmed using OpenCL and High-Level Synthesis (HLS); limited flexibility compared to other platforms.

116
Machine-Learning-as-a-Service: pros and cons

Pros — cloud computing simplifies the access to ML/DL capabilities for:
  ✓ designing a solution (without requiring a deep knowledge of ML/DL)
  ✓ setting up a project (managing demand increases and the IT solution)

Cons:
  ✗ Internet connection required
  ✗ High power consumption
  ✗ Privacy and security concerns
  ✗ Latency in making decisions

117
117
From Cloud to IoT and Edge
Computing: new platforms for ML/DL

118
118
IoT, PC Embedded and Edge Computing

AI APPLICATION

AI PLATFORM
Move intelligent processing as close as possible to the data generation units

AI HARDWARE

Internet-of-Things    PC Embedded    Edge Computing

119
119
IoT, PC Embedded and Edge Computing

AI APPLICATION

MACHINE AND DEEP LEARNING PLATFORMS FOR IoT AND EDGE
AI PLATFORM (TensorFlow Lite, PyTorch 1.3 Quantization, OpenVINO,
Intelligent Edge HOL, …)

AI HARDWARE

Internet-of-Things    PC Embedded    Edge Computing

120
120
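A sketch of taking a model towards IoT/edge deployment with TensorFlow Lite: build (or load) a small Keras model, convert it and save the .tflite file; the toy architecture and the optimization flag are assumptions for illustration.

```python
# Convert a (placeholder) Keras model to TensorFlow Lite for edge deployment.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # e.g., post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```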
IoT, PC Embedded and Edge Computing

✓ Increase autonomy
✓ Reduce decision-making latency
✓ Reduce transmission bandwidth
AI APPLICATION ✓ Increase energy efficiency
✓ Security and privacy

AI PLATFORM MACHINE AND DEEP LEARNING PLATFORMS FOR IoT AND EDGE

AI HARDWARE

Internet-of-Things    PC Embedded    Edge Computing

121
121
122
Intelligent Internet-of-Things and Cyber-Physical Systems

[Diagram: Intelligent Processing of Sensing, Intelligent IoT and Cyber-Physical Systems, and Intelligent Cognitive Mechanisms for Actuation and Control, spanning the cyber domain and the physical domain; related properties: adaptation, self-awareness, reliability, self-diagnosis, self-healing]

122
123
Intelligent Internet-of-Things

[Diagram: an intelligent IoT unit — HW/sensors, application/service, environment, user — endowed with intelligent mechanisms: self-awareness, mechanisms to diagnose faults, self-healing mechanisms to repair faults, remote controllability, detection of changes in the users' behaviour, energy harvesting and smart energy management; open questions: which HW? which SW?]

123
124

Designing Intelligent IoT Systems: from centralized …

Scalability, Responsiveness — the application and the intelligent mechanisms run on the Cloud server; the EDGE and IoT units (sensors + communication) act as dumb units.

[Diagram: Cloud server (application, intelligent mechanisms) → EDGE nodes → IoT units (sensors, comm)]
124
125

Designing Intelligent Embedded Systems: … to distributed intelligent systems

Scalability, Responsiveness — the application and the intelligent mechanisms are distributed between the Cloud server and the EDGE and IoT units (sensors + communication), which become smart units.

[Diagram: Cloud server (application, intelligent mechanisms) ↔ smart EDGE nodes ↔ smart IoT units (sensors, comm)]
125
The last mile …

126
127

127
128

128
129

New York Times


May 26, 2016

129
FINDING ONE FACE IN A MILLION
A new benchmark test shows that even Google's facial recognition algorithm is far from perfect

Helen of Troy may have had the face that launched a thousand ships, but even the best facial recognition algorithms might have had trouble finding her in a crowd of a million strangers. The first public benchmark test based on 1 million faces has shown how facial recognition algorithms from Google and other research groups around the world still fall well short of perfection.

Facial recognition algorithms that had previously performed with more than 95 percent accuracy on a popular benchmark test involving 13,000 faces saw significant drops in accuracy when taking on the new MegaFace Challenge. The best performer, Google's FaceNet algorithm, dropped from near-perfect accuracy on the five-figure data set to 75 percent on the million-face test. Other top algorithms dropped from above 90 percent to below 60 percent. Some algorithms made the proper identification as seldom as 35 percent of the time.

"MegaFace's key idea is that algorithms should be evaluated at large scale," says Ira Kemelmacher-Shlizerman, an assistant professor of computer science at the University of Washington, in Seattle, and the project's principal investigator. "And we make a number of discoveries that are only possible when evaluating at scale."

The huge drops in accuracy when scanning a million faces matter because facial recognition algorithms inevitably face such challenges in the real world. People increasingly trust these algorithms to correctly identify them in security verification scenarios, and law enforcement may also rely on facial recognition to pick suspects out of the hundreds of thousands of faces captured on surveillance cameras.

The most popular benchmark until now has been the Labeled Faces in the Wild (LFW) test created in 2007. LFW includes 13,000 images of just 5,000 people. Many facial recognition algorithms have been fine-tuned to the point that they scored near-perfect accuracy when picking through the LFW images. Most researchers say that new benchmark challenges have been long overdue. "The big disadvantage is that [the field] is saturated—that is, there are many, many algorithms that perform above 95 percent on LFW," Kemelmacher-Shlizerman says. "This gives the impression that face recognition is solved and working perfectly."

With that in mind, University of Washington researchers raised the bar by creating the MegaFace Challenge using 1 million Flickr images of 690,000 unique faces that are publicly available under a Creative Commons license. The MegaFace Challenge forces facial recognition algorithms to do verification and identification, two separate but related tasks. Verification involves trying to correctly determine whether two faces presented to the facial recognition algorithm belong to the same person. Identification involves trying to find a matching photo of the same person among a million "distractor" faces. Initial results on algorithms developed by Google and four other research groups were presented at the IEEE Conference on Computer Vision and Pattern Recognition on 30 June. (One of MegaFace's developers also heads a computer vision team at Google's Seattle office.)

The results presented were a mix of the intriguing and the expected. Nobody was surprised that the algorithms' performances suffered as the number of distractor faces increased. And the fact that algorithms had trouble identifying the same person at different ages was a known problem. However, the results also showed that algorithms trained on relatively small data sets can compete with those trained on very large ones, such as Google's FaceNet, which was trained on more than 500 million photos of 10 million people.

For example, the FaceN algorithm from Russia's N-TechLab performed well on certain tasks in comparison with FaceNet, despite having trained on 18 million photos of 200,000 people. The SIAT MMLab algorithm, created by a Chinese team under the leadership of Yu Qiao, a professor with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, also performed well on certain tasks. Nevertheless, FaceNet has so far performed the best overall. It delivered the most consistent performance across all testing.

IEEE Spectrum News, August 2016

130
131
Thank you for your attention!

132
