Stefano Carrazza
NNPDF/N3PDF meeting,
16-19 September 2018, Palazzo Feltrinelli, Gargnano
European Organization for Nuclear Research (CERN)
Acknowledgement: This project has received funding from the HICCUP ERC Consolidator
grant (614577) and from the European Union's Horizon 2020 research and innovation
programme under grant agreement no. 740006.
N3PDF
Machine Learning • PDFs • QCD
Why talk about machine learning?
because
1
Outline
Topics:
• A.I. and M.L. overview
• Non-linear models
• From physics to ML
2
Artificial Intelligence
Artificial intelligence timeline
3
Defining A.I.
4
Defining A.I.
Machine learning
Computer vision
Speech
Planning
Robotics
5
A.I. and humans
5
A.I. technologies
Solution:
The A.I. system needs to acquire its own knowledge.
This capability is known as machine learning (ML).
→ e.g. write a program which learns the task.
6
Machine Learning
Machine learning definition
7
ML applications in our “day life”
8
Machine learning examples
9
ML applications in HEP
10
ML in experimental HEP
11
ML in theoretical HEP
• Supervised learning:
• The structure of the proton at the LHC: parton distribution functions
• Theoretical prediction and combination: Monte Carlo reweighting techniques
• neural network Sudakov
[Figure: NNPDF3.1 NNLO parton distributions xf(x, µ² = 10 GeV²) and xf(x, µ² = 10⁴ GeV²), and rapidity distributions y(t).]
12
Social experiment:
ML applied to NNPDF meetings (JI & SC)
13
ML for Gargnano 2018
Goal:
• Extract research topics from participants
• Classify and cluster people
• Propose matches
How:
• Unsupervised learning and natural language processing
14
ML for Gargnano 2018
• username: nnpdf
• password: 2018
15
http://apfel.mi.infn.it:7000, usr=nnpdf, psw=2018
Topic determination:
Apply Latent Dirichlet Allocation (LDA) to the bag of words:
topic 2: u’0.020*”higg” + 0.016*”boson” + 0.012*”qcd”’...
topic 4: u’0.016*”theori” + 0.012*”lattic” + 0.012*”quantum”...
topic 6: u’0.018*”pdf” + 0.014*”parton” + 0.012*”determin”...
Found a total of 8 topics!
16
http://apfel.mi.infn.it:7000, usr=nnpdf, psw=2018
• Repeat this procedure for all participants and compute the Euclidean
distance between participants.
17
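The distance step can be sketched directly; the topic-weight vectors below are made-up numbers, one row per participant:

```python
# Pairwise Euclidean distances between participants represented by
# (hypothetical) LDA topic-weight vectors.
import numpy as np

topics = np.array([
    [0.7, 0.2, 0.1],   # participant A
    [0.1, 0.8, 0.1],   # participant B
    [0.6, 0.3, 0.1],   # participant C
])

# d[i, j] = ||topics[i] - topics[j]||
d = np.linalg.norm(topics[:, None, :] - topics[None, :, :], axis=-1)
print(d.round(3))
```

Here A and C are the closest pair, so they would be proposed as a match.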
Machine learning algorithms
Machine learning algorithms:
• Supervised learning: regression, classification, ...
(input data → processing → output, compared to the desired output by a supervisor)
• Unsupervised learning: clustering, dim-reduction, ...
(input data → processing → output, discovering an interpretation from features)
• Reinforcement learning: real-time decisions, ...
(an agent selects the best action in an environment and receives a reward)
23
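The first two paradigms can be illustrated with toy numpy examples (made-up data; reinforcement learning is omitted for brevity):

```python
# Supervised vs. unsupervised learning on toy data.
import numpy as np

rng = np.random.default_rng(0)

# Supervised learning (regression): fit y = a*x + b from labeled pairs.
x = rng.uniform(0, 1, 50)
y = 2 * x + 1
A = np.vstack([x, np.ones_like(x)]).T
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(coef.round(3))          # ~ [2. 1.]

# Unsupervised learning (clustering): 2-means on unlabeled 1-D points.
pts = np.concatenate([rng.normal(0, 0.1, 20), rng.normal(3, 0.1, 20)])
centers = np.array([pts.min(), pts.max()])
for _ in range(10):
    labels = np.abs(pts[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([pts[labels == k].mean() for k in (0, 1)])
print(centers.round(1))       # ~ [0. 3.]
```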
Machine learning algorithms
Data
Model
Optimizer
25
Model representation trade-offs
As a rule of thumb, interpretability decreases and accuracy increases moving from
Linear Regression → Decision Tree → K-Nearest Neighbors → Random Forest → Neural Nets.
26
ML in practice
Grid/random search
27
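A grid search can be sketched by hand; this minimal example tunes the ridge penalty λ on hypothetical data, using the closed form w = (XᵀX + λI)⁻¹Xᵀy and a held-out validation split:

```python
# Grid search over the ridge penalty on synthetic linear data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

X_tr, y_tr = X[:80], y[:80]
X_va, y_va = X[80:], y[80:]

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lam*I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

grid = [0.01, 0.1, 1.0, 10.0]
scores = {lam: np.mean((X_va @ ridge_fit(X_tr, y_tr, lam) - y_va) ** 2)
          for lam in grid}
best = min(scores, key=scores.get)
print("best lambda:", best)
```

A random search replaces the fixed `grid` with values drawn from a distribution, which often covers the hyperparameter space more efficiently.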
Most popular public ML frameworks
29
Limitations of linear models
Each image has 28x28 pixels = 784 features (x3 if including RGB colors).
Considering a quadratic feature map, the number of terms grows as O(n²), so linear models become impractical.
Solution: use non-linear models.
29
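The feature counts can be checked directly:

```python
# Why quadratic feature maps explode: feature counts for a 28x28 image.
n = 28 * 28                      # 784 raw pixel features (grayscale)
n_rgb = 3 * n                    # 2352 with RGB channels
quad = n * (n + 1) // 2          # distinct quadratic terms x_i * x_j, i <= j
print(n, n_rgb, quad)            # 784 2352 307720
```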
Non-linear models timeline
1958 Perceptron · 1980 Neocognitron, SOMs · 1985 Boltzmann Machine · 1990 LeNet · 2006 Deep BMs, Deep Belief NNs · 2012 Dropout
[Diagram: biological neuron (dendrite, soma, nucleus, axon, myelin sheath, Schwann cell, node of Ranvier, axon terminal) compared to a logical unit with input and output.]
32
Neuron model
Schematically:
where
• each node has associated weights and bias w and inputs x,
• the output is modulated by an activation function, g.
Some examples of activation functions: sigmoid, tanh, linear, ...
g_w(x) = 1 / (1 + e^{-w^T x}),  tanh(w^T x),  x.
33
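The three activation choices can be checked numerically on a single neuron; the weights and inputs below are illustrative:

```python
# One neuron: output = g(w.x + b) for the three activations on the slide.
import numpy as np

def neuron(x, w, b, g):
    return g(w @ x + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0

print(neuron(x, w, b, sigmoid))        # sigmoid(0.0) = 0.5
print(neuron(x, w, b, np.tanh))        # tanh(0.0) = 0.0
print(neuron(x, w, b, lambda z: z))    # linear: 0.0
```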
Neural networks
where
• a_i^{(l)} is the activation of unit i in layer l,
• w_ij^{(l)} is the weight between nodes i, j from layers l, l + 1 respectively.
34
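The notation above translates into a short forward pass; the weights here are random, for illustration only:

```python
# Forward pass a^(l+1) = g(W^(l) a^(l) + b^(l)) through a 2-3-1 network.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

sizes = [2, 3, 1]                 # input, hidden, output units
W = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

a = np.array([0.5, -1.0])         # a^(1): the input
for Wl, bl in zip(W, b):
    a = sigmoid(Wl @ a + bl)      # activations of the next layer
print(a)
```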
Neural networks
35
Neural networks
36
Artificial neural networks architectures
37
Artificial neural networks architectures
38
Artificial neural networks architectures
39
Artificial neural networks architectures
40
From physics to ML
Introduction
41
Boltzmann machine
Graphical representation:
42
Boltzmann machine
“Visible” sector
42
Boltzmann machine
“Hidden” sector
“Visible” sector
42
Boltzmann machine
Binary valued
states {0, 1}
42
Boltzmann machine
43
Boltzmann machine
P(v, h) = e^{-E(v,h)} / Z
with marginalization:
P(v) = e^{-F(v)} / Z,   where F(v) is the free energy.
44
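The marginalization can be verified by brute force on a tiny binary machine; the couplings are arbitrary, and the energy is RBM-like (no biases) for brevity:

```python
# Brute-force check that P(v) = sum_h P(v,h) = e^{-F(v)} / Z,
# with F(v) = -log sum_h e^{-E(v,h)}, on 2 visible + 2 hidden binary units.
import itertools
import numpy as np

W = np.array([[0.5, -0.3], [0.2, 0.1]])   # visible-hidden couplings

def energy(v, h):
    return -v @ W @ h

states = list(itertools.product([0, 1], repeat=2))
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in states for h in states)

def free_energy(v):
    return -np.log(sum(np.exp(-energy(np.array(v), np.array(h)))
                       for h in states))

v = np.array([1, 0])
p_marginal = sum(np.exp(-energy(v, np.array(h))) for h in states) / Z
p_free = np.exp(-free_energy(v)) / Z
print(p_marginal, p_free)                  # identical by construction
```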
Boltzmann machine
45
Riemann-Theta Boltzmann machine
How to change the status quo? [Krefl, S.C., Haghighat, Kahlen ‘17]
Keep the inner sector couplings non-trivial, but the machine solvable?
Continuous
valued ∈ R
Continuous
valued ∈ R
46
Riemann-Theta Boltzmann machine
How to change the status quo? [Krefl, S.C., Haghighat, Kahlen ‘17]
Keep the inner sector couplings non-trivial, but the machine solvable?
“Quantized”
∈Z
Continuous
valued ∈ R
47
Riemann-Theta Boltzmann machine
How to change the status quo? [Krefl, S.C., Haghighat, Kahlen ‘17]
Keep the inner sector couplings non-trivial, but the machine solvable?
“Quantized”
∈Z
Continuous
valued ∈ R
P(v) ≡ sqrt( det T / (2π)^{N_v} ) e^{ -½ v^t T v − B_v^t v − ½ B_v^t T^{-1} B_v } · θ̃(B_h^t + v^t W | Q) / θ̃(B_h^t − B_v^t T^{-1} W | Q − W^t T^{-1} W)

Under the affine transformation w = Av + b, with w ∼ P_{A,b}(v):
T^{-1} → A T^{-1} A^t,   B_v → (A^+)^t (B_v − T b),
W → (A^+)^t W,   B_h → B_h − W^t b,
where A^+ = (A^t A)^{-1} A^t.
49
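These transformation rules mirror the familiar affine reparametrization of a multivariate Gaussian. As a numerical sanity check of that simpler analogue (not of the RTBM formulas themselves), assuming only numpy:

```python
# If v ~ N(mu, Sigma), then w = Av + b ~ N(A mu + b, A Sigma A^t);
# checked here by sampling.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
A = np.array([[2.0, 0.0], [1.0, 1.0]])
b = np.array([0.5, 0.5])

v = rng.multivariate_normal(mu, Sigma, size=200_000)
w = v @ A.T + b                       # affine transform of the samples

print(np.allclose(w.mean(axis=0), A @ mu + b, atol=0.03))
print(np.allclose(np.cov(w.T), A @ Sigma @ A.T, atol=0.08))
```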
RTBM Applications
Riemann-Theta Boltzmann machine
• Probability determination
• Data classification
• Data regression
• Sampling
50
Riemann-Theta Boltzmann machine
[Figure: example probability densities P(v) as a function of v.]
51
Riemann-Theta Boltzmann machine
Expectation:
As long as the density is sufficiently well
behaved at the boundaries, it can be
learned by an RTBM mixture model.
52
Riemann-Theta Boltzmann machine
[Figure: two-dimensional sampling examples with marginals P(v1) and P(v2).]
New:
Conditional expectations of hidden
states after training
[Figure: conditional expectations E of the hidden states as a function of v.]
54
Feature detector example - jet classification
55
Riemann-Theta Boltzmann machine
Key point:
smaller networks are needed, but the Riemann-Theta evaluation is expensive.
Example (1:3-3-2:1):
[Figure: regression results y(t) and the learned expectation E(v) for a 1:3-3-2:1 network.]
p = ε(R) / (θ̃_n + ε(R))
is the probability that a point is sampled outside the
ellipsoid of radius R, while
P([h](R)) = θ̃_n / (θ̃_n + ε(R)) ≈ 1,
i.e. the sum over the lattice points inside the ellipsoid.
[Figure: sampled hidden states (h1, h2) and the bounding ellipsoid.]
58
Sampling distance estimators
59
Sampling examples with affine transformation
RTBM P (v) sampling with affine transformation: [S.C. and Krefl ‘18]
[Figure: sampled points in the (v1, v2) plane with marginals P(v1) and P(v2), before and after the affine transformation.]
60
Conclusion
Summary and outlook
In summary:
61
Backup
61
Latent Dirichlet Allocation
62
Affinity propagation by message passing
The algorithm sets both matrices to zero and performs the following updates
iteratively:

r(i, k) ← s(i, k) − max_{k′ ≠ k} { a(i, k′) + s(i, k′) }

a(i, k) ← min( 0, r(k, k) + Σ_{i′ ∉ {i, k}} max(0, r(i′, k)) )   for i ≠ k

a(k, k) ← Σ_{i′ ≠ k} max(0, r(i′, k))
63
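The updates above can be transcribed directly into numpy; this sketch adds the standard damping factor for stability and runs on toy 1-D data (the similarity matrix and preference values are made up):

```python
# Affinity propagation: responsibility/availability updates on 5 points.
import numpy as np

x = np.array([0.0, 0.5, 1.0, 5.0, 5.5])
n = len(x)
s = -(x[:, None] - x[None, :]) ** 2      # similarity: negative squared distance
np.fill_diagonal(s, -2.0)                # preference of each point to be an exemplar

r = np.zeros((n, n))
a = np.zeros((n, n))
damp = 0.5
for _ in range(100):
    # r(i,k) <- s(i,k) - max_{k' != k} [ a(i,k') + s(i,k') ]
    tmp = a + s
    first = tmp.max(axis=1)
    tmp2 = tmp.copy()
    tmp2[np.arange(n), tmp.argmax(axis=1)] = -np.inf
    second = tmp2.max(axis=1)
    r_new = s - np.where(tmp == first[:, None], second[:, None], first[:, None])
    r = damp * r + (1 - damp) * r_new

    # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
    # a(k,k) <- sum_{i' != k} max(0, r(i',k))
    rp = np.maximum(r, 0)
    np.fill_diagonal(rp, r.diagonal())
    a_new = rp.sum(axis=0)[None, :] - rp
    diag = a_new.diagonal().copy()
    a_new = np.minimum(a_new, 0)
    np.fill_diagonal(a_new, diag)
    a = damp * a + (1 - damp) * a_new

labels = (a + r).argmax(axis=1)          # each point's chosen exemplar
print(labels)                            # {0,1,2} share one exemplar, {3,4} another
```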
Affinity propagation by message passing
64