
DeepFlavour and TensorFlow in CMSSW

Mauro Verzetti (CERN and FWO)
on behalf of the CMS Collaboration
The problem - Jet flavour classification



HF tagging @ CMS — JINST 13 (2018) no.05, P05011

[Diagram: the jet and the primary vertex (PV) feed a track selection. Tracks go to a track-based tagger, secondary vertices (SV) to an SV-based tagger, and e/µ candidates to a soft-lepton (SL) based tagger; their outputs are merged into combined and “super combined” taggers.]


Particle-based NN architecture

- Charged (16 features) × 25 → 1×1 conv. 64/32/32/8 → RNN 150
- Neutral (8 features) × 25 → 1×1 conv. 32/16/4 → RNN 50
- Secondary Vtx (12 features) × 4 → 1×1 conv. 64/32/32/8 → RNN 50
- Global variables (6 features)

All branches feed dense layers (200 nodes × 1, 100 nodes × 6) with outputs b / bb / c / l.


Convolutional layers progressively learn a more compact feature representation (automatic feature engineering).

The recurrent layers (LSTM) build a “summary” of the information contained in each set of feature types. (A Keras sketch of this structure follows.)
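A minimal Keras sketch of the branch structure above, with layer sizes taken from the diagram; the layer names, activations, and training setup are illustrative assumptions, not the actual DeepFlavour training code:

from tensorflow.keras import layers, models

def conv_branch(x, filters):
    # 1x1 convolutions: per-particle feature compression ("automatic feature engineering")
    for f in filters:
        x = layers.Conv1D(f, kernel_size=1, activation="relu")(x)
    return x

charged = layers.Input(shape=(25, 16))   # up to 25 charged candidates, 16 features each
neutral = layers.Input(shape=(25, 8))    # up to 25 neutral candidates, 8 features each
sv      = layers.Input(shape=(4, 12))    # up to 4 secondary vertices, 12 features each
glob    = layers.Input(shape=(6,))       # 6 global jet variables

# Each particle branch: 1x1 convolutions followed by an LSTM "summary"
c = layers.LSTM(150)(conv_branch(charged, [64, 32, 32, 8]))
n = layers.LSTM(50)(conv_branch(neutral, [32, 16, 4]))
v = layers.LSTM(50)(conv_branch(sv, [64, 32, 32, 8]))

# Concatenate the branch summaries with the global variables and classify
x = layers.Concatenate()([c, n, v, glob])
x = layers.Dense(200, activation="relu")(x)
for _ in range(6):
    x = layers.Dense(100, activation="relu")(x)
out = layers.Dense(4, activation="softmax")(x)  # b / bb / c / light

model = models.Model([charged, neutral, sv, glob], out)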


[ROC curves: misid. probability (udsg and c) vs b jet efficiency for DeepFlavour phase 1, DeepCSV phase 1, and DeepCSV phase 0. tt̄ events, AK4 jets with pT > 30 GeV (left) and pT > 90 GeV (right), √s = 13 TeV, CMS Simulation Preliminary. CMS DP-2018/033]


Similar performance to simpler, dedicated binary taggers, but with full multi-class power. Significantly better performance in regions with different quark composition.

[ROC curves: misid. probability vs light-quark efficiency for the full DeepJet multi-classification algorithm and its recurrent and convolutional variants. QCD events with pT = 30-50 GeV, jet pT > 30 GeV, √s = 13 TeV, CMS Simulation Preliminary. CMS-DP-2017-027]
DeepAK8: for boosted resonances

- Inclusive particles × 100 (10 features, pT ordered) → P-CNN layers × 14
- Charged particles × 60 (30 features, sIP ordered) → P-CNN layers × 14
- Secondary Vtx × 5 (14 features, displacement ordered) → P-CNN layers × 14
→ Dense (521 nodes × 1) → Output: flavour

- Significantly larger number of candidates used, to accommodate 90% of the fat jets
- Need to learn substructure from both charged and neutral candidates
- RNNs become computationally too expensive to train
- Use particle-level convolutional layers (P-CNN), where each feature is treated as a “colour” (see the sketch after this list)
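A minimal sketch of the “feature as colour” idea in Keras; the kernel size, filter count, and final pooling are illustrative assumptions not given on the slide:

from tensorflow.keras import layers

# The sorted particle list is the 1D "spatial" axis; the per-particle features
# act as the colour channels of an image, so Conv1D can learn substructure.
inclusive = layers.Input(shape=(100, 10))   # 100 pT-ordered candidates x 10 features
x = inclusive
for _ in range(14):                          # "P-CNN layers x14" from the diagram
    x = layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu")(x)
summary = layers.GlobalAveragePooling1D()(x) # collapse the particle axis (assumed)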
Performance

- Flavour information largely improves jet tagging
- Large improvement w.r.t. the BDT approach
- Introduces mass sculpting, which is not necessarily a bad thing

[Figure 1: comparison of the performance of the two BDT taggers and the CNN taggers in terms of ROC curves in MC simulated events, for top jets as signal and QCD jets as background; AK8 jets with 1000 < pT < 1400 GeV. CMS-DP-2017-049]
GRU = Gated Recurrent Unit = recurrent network used to reduce the dimensionality of the output from the Conv1D layers

DeepDoubleB: (60×32, 5×32) → (50, 50)

- Track features (60, 8) → Conv1D (2 layers, 32+32 units, dropout = 0.1) → (60, 32) → GRU (50 units, dropout = 0.1) → (50) → BN
- SV features (5, 2) → Conv1D (2 layers, 32+32 units, dropout = 0.1) → (5, 32) → GRU (50 units, dropout = 0.1) → (50) → BN
- Double-b features (27) → BN
→ Fully connected (1 layer, 100 units, dropout = 0.1) → (100) → Output: Higgs / QCD

(A Keras sketch of the track branch follows.)
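A minimal Keras sketch of one branch as read off the diagram; the 1×1 kernel size is an assumption, since the slide only gives layer counts and unit sizes:

from tensorflow.keras import layers

tracks = layers.Input(shape=(60, 8))               # 60 tracks x 8 features
x = layers.Conv1D(32, 1, activation="relu")(tracks)  # kernel size 1 assumed
x = layers.Conv1D(32, 1, activation="relu")(x)       # -> (60, 32)
x = layers.GRU(50, dropout=0.1)(x)                   # -> (50): sequence summary
x = layers.BatchNormalization()(x)                   # BN before the fully connected head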

Significantly better than the current BDT approach, with some residual mass sculpting.

[ROC: mistagging rate (QCD) vs tagging efficiency (H→bb̄): DeepDoubleBvL, AUC = 97.3%, vs double-b, AUC = 91.3%. 300 < jet pT < 2000 GeV, 40 < jet mSD < 200 GeV. CMS Simulation Preliminary, 2016 (13 TeV).]

[Figure 4: effect on the jet soft-drop mass distribution of events misidentified by the DeepDoubleBvL algorithm, demonstrating the degree to which the algorithm depends on the jet mass. Histograms obtained at a fixed overall mistagging rate from a QCD sample.]

CMS DP-2018/046


Removing mass correlation

Per-batch penalty term proportional to the Kullback-Leibler (KL) divergence (sketched below).

- Minimal loss in performance
- Mass sculpting gone

[Figure 6: effect on the jet soft-drop mass distribution of events misidentified by the DeepDoubleBvL algorithm after it has been decorrelated from the jet mass. Histograms obtained at a fixed overall mistagging rate from a QCD sample.]

CMS DP-2018/046
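A sketch of how such a per-batch penalty could enter the loss, assuming the KL divergence is taken between the soft-drop-mass distribution of jets passing the tagger and a reference distribution; the exact construction is not spelled out on the slide:

import tensorflow as tf

def kl_mass_penalty(hist_pass, hist_ref, eps=1e-8):
    # Normalise the per-batch mass histograms to probability distributions
    p = hist_pass / (tf.reduce_sum(hist_pass) + eps)
    q = hist_ref / (tf.reduce_sum(hist_ref) + eps)
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return tf.reduce_sum(p * tf.math.log((p + eps) / (q + eps)))

# total_loss = classification_loss + lambda_kl * kl_mass_penalty(hist_pass, hist_ref)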


DeepDoubleC!

[Figure 3: ROC comparison of the double-b, DeepDoubleBvL, and DeepDoubleCvB quark-antiquark-pair jet identification algorithms: probability of misidentifying H→bb̄ jets as a function of the tagging efficiency, obtained from a combined sample of H→bb̄ and H→cc̄. CMS DP-2018/046]
From training to practice

Two worlds colliding

Training / Analysis:
- Keras + TensorFlow
- Python-based
- Private productions
- Minimal interaction with ROOT
- Few processes, single threads
- Few memory constraints
- Expendable jobs

Production:
- Custom framework
- C++ based (speed!)
- Mostly ROOT-centric (at least the I/O)
- Many processes, multiple threads
- Many other concurrent activities → memory constraints
- Processes cannot die (e.g. the trigger)


Integration of DeepFlavour into CMSSW. PR #19893



Integration of DeepJet (AK4) into CMSSW. PR #19893



Backend choice

✗ Interface based on the TF Python API:
- Uses the Python C API and a pre-built TF package
- Large overhead and no handle on memory/threading

✗ Interface based on the TF C API:
- Low level and not very convenient
- Lots of customisation and ad-hoc handling needed

✓ Interface based on the TF C++ API:
- Access to all the internals needed for production usage, with minimal need for custom code
- Shallow interface to connect TF to the CMSSW internals (e.g. logging)


Issue 1: Multithreading

- TF starts lots of threads in its own thread pool for:
  - faster loading of data
  - parallelism between operations (inter_op_parallelism_threads)
  - parallelism within operations (intra_op_parallelism_threads)
- Normally a good thing, but it has a critical impact on memory consumption in HEP frameworks, which have their own threading schemes/pools (CMSSW uses TBB)
- Solved with the implementation of two custom sessions:
  - without any threading (NTSession)
  - sharing the thread pool with the rest of the framework (TBBSession)

(The Python-API analogues of these threading knobs are sketched below.)
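For illustration, these are the corresponding thread-pool settings in the TF 1.x Python API; the CMSSW solution is the custom C++ NTSession/TBBSession classes, not this Python configuration:

import tensorflow as tf

# Sketch: restrict TF's own thread pools (TF 1.x API, matching the era of this talk)
config = tf.ConfigProto(
    inter_op_parallelism_threads=1,  # parallelism between independent operations
    intra_op_parallelism_threads=1,  # parallelism within a single operation
)
session = tf.Session(config=config)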
Issue 2: Memory footprint

- Initially the DeepJet graph was large (~150 MB): not feasible for production operations
- Weights were stored as Variables, which need more memory than Constants
- By default Keras stores a lot of ancillary information on top of the model (operations and tensors used for training, optimiser status, etc.)
- Reduction of O(10-100) by removing things not needed for inference and converting Variables to Constants (see the sketch below)
- Further reduction: one single computation graph loaded and shared across threads, with multiple sessions computing the inference
- In the future: AOT compilation?
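A minimal sketch of the variables-to-constants conversion in the TF 1.x Python API; the output node name is hypothetical and the actual CMSSW tooling may differ:

import tensorflow as tf
from tensorflow.python.framework import graph_util

# Sketch (TF 1.x): freeze a trained graph for inference by turning Variables
# into Constants and keeping only the nodes needed to compute the outputs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # or restore the trained weights here
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def,
        output_node_names=["ID_pred/Softmax"],   # hypothetical output node name
    )
with tf.gfile.GFile("deepjet_frozen.pb", "wb") as f:
    f.write(frozen.SerializeToString())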


A new kid in town: MXNet

- Used to train DeepAK8
- Found to be 2-3x faster for training that type of model
- Model exported with the Gluon API (see the sketch below):
  - no post-processing needed
  - outputs a JSON file describing the network architecture and a binary containing the weights
- “Simple” inference engine: one .h + one .so file
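For illustration, a hybridized Gluon model exports exactly such a JSON + weights pair (a sketch, not the DeepAK8 training code):

import mxnet as mx
from mxnet.gluon import nn

# Sketch: any HybridBlock can be exported after hybridization and one forward pass
net = nn.HybridSequential()
net.add(nn.Dense(10))
net.initialize()
net.hybridize()
net(mx.nd.zeros((1, 4)))       # run once so the symbolic graph is built
net.export("model", epoch=0)   # writes model-symbol.json + model-0000.params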


MXNet integration



MXNet issues - thread safety

Problem: the simple “NaiveEngine” is not thread safe; it is meant to run as a static singleton and only allows synchronous operations.

- The other engine type, “ThreadedEngine”, is thread safe, but spawns its own thread pool

Solution: make the engine “thread_local”

- Only 2 lines of code changed
- Not without potential dangers


MXNet issues - thread safety

Problem: once the previous problem was fixed, the resulting output was still sometimes inconsistent — a race condition

- For each thread, MXNet creates a workspace to store temporary variables (partially computed results)
- If multiple graphs are run simultaneously there might be a race condition, but this should not happen, since each thread runs one graph at a time…
- …unless the object computing the graph is reattached to a different thread every time, as in a worker-pool design!

Solution: re-attach the workspace every time before running the computation graph (MXNet patch)

- Only 2 lines of code changed
- An additional mutex was added when initialising the Executor, as helgrind reported race conditions in the initialisation (minimal cost)

[Diagram: Executors 1 and 2 migrating between Thread 1 and Thread 2, each thread owning its own workspace.]




MXNet vs. TensorFlow

MXNet:
- Lighter dependencies (only a BLAS library)
- Simple model export for inference
- Official ONNX integration
- New: Keras compatible

TensorFlow:
- Better thread safety
- Significantly better operator support and coverage (always cutting edge)
- Native Keras support


Summary

- Jet tagging is of paramount importance for the CMS physics program
- Lots of development in the last ~1.5 years to apply modern machine-learning techniques to this field
- Large improvements in performance
- Still some room for new developments, especially in the boosted regime
- Flavour tagging is not only fancy algorithms, but also solid and performant computing infrastructure
- Model deployment can take a significant amount of time


Backup

Optimistic MC simulation vs. realistic MC simulation

[ROC curves: misid. probability (udsg and c) vs b jet efficiency for DeepFlavour phase 1, DeepCSV phase 1, and DeepCSV phase 0. tt̄ events, AK4 jets with pT > 30 GeV, √s = 13 TeV, CMS Simulation Preliminary.]
DeepDoubleC - mass sculpting

[Figure: effect on the jet soft-drop mass distribution of events misidentified by the DeepDoubleCvL identification algorithm, demonstrating the degree to which the algorithm depends on the jet mass. Histograms obtained at a fixed overall mistagging rate from a QCD sample. CMS-DP-2018-XXX]
Particle-based NN architecture

- Charged (16 features) × 25 → 1×1 conv. 64/32/32/8 → RNN 150
- Neutral (8 features) × 25 → 1×1 conv. 32/16/4 → RNN 50
- Secondary Vtx (17 features) × 4 → 1×1 conv. 64/32/32/8 → RNN 50
- Global variables (6 features)
→ Dense (200 nodes × 1, 100 nodes × 6) → outputs b / bb / c / l

[Figure 5: performance of the b jet identification algorithms, including DeepCSV retrained for the Phase 1 detector. CMS-DP-2017-013]
P-CNNs

Input: particles (sorted) × features

$z_{a,m} = \sum_{\alpha}\sum_{j} k_{a,\alpha,j}\, x_{\alpha,(m+j-1)}$

- The index m sweeps over the elements (particles), while j loops over the contiguous elements covered by the kernel
- Multiple features (“colours”, index α) are accounted for when computing the transformation
- Different filters/kernels (index a) learn different transformations

(The sum is written out with explicit loops in the sketch below.)
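The same sum written out with explicit loops (pure NumPy, zero-based indices, illustrative only):

import numpy as np

def pcnn_layer(x, k):
    # x: (n_channels, n_particles)  -- features alpha x sorted-particle position
    # k: (n_filters, n_channels, kernel_size)
    n_filters, n_channels, ksize = k.shape
    n_out = x.shape[1] - ksize + 1            # "valid" convolution
    z = np.zeros((n_filters, n_out))
    for a in range(n_filters):                # different filters learn different transformations
        for m in range(n_out):                # sweep over the particles
            for alpha in range(n_channels):   # sum over the feature "colours"
                for j in range(ksize):        # contiguous elements of the kernel
                    z[a, m] += k[a, alpha, j] * x[alpha, m + j]  # m+j: zero-based form of m+j-1
    return z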
