
DeepFlavour and TensorFlow in CMSSW

Mauro Verzetti (CERN and FWO)
on behalf of the CMS Collaboration
The problem - Jet flavour classification



HF tagging @ CMS — JINST 13 (2018) no.05, P05011

[Diagram: the jet and the primary vertex (PV) feed a track selection. Tracks go to a track-based tagger, secondary vertices (SV) to an SV-based tagger, and e/µ candidates to a soft-lepton (SL) based tagger; their outputs are merged into combined and “super combined” taggers.]


Particle-based NN architecture

- Charged (16 features) × 25 → 1×1 conv. 64/32/32/8 → RNN 150
- Neutral (8 features) × 25 → 1×1 conv. 32/16/4 → RNN 50
- Secondary Vtx (12 features) × 4 → 1×1 conv. 64/32/32/8 → RNN 50
- Global variables (6 features)

All branches feed dense layers (200 nodes × 1, 100 nodes × 6) with outputs b / bb / c / l.


Convolutional layers progressively learn a more compact feature representation (automatic feature engineering).

The recurrent layers (LSTM) build a “summary” of the information contained in each set of feature types. (A Keras sketch of this structure follows.)
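A minimal Keras sketch of the branch structure above, with layer sizes taken from the diagram; the layer names, activations, and training setup are illustrative assumptions, not the actual DeepFlavour training code:

from tensorflow.keras import layers, models

def conv_branch(x, filters):
    # 1x1 convolutions: per-particle feature compression ("automatic feature engineering")
    for f in filters:
        x = layers.Conv1D(f, kernel_size=1, activation="relu")(x)
    return x

charged = layers.Input(shape=(25, 16))   # up to 25 charged candidates, 16 features each
neutral = layers.Input(shape=(25, 8))    # up to 25 neutral candidates, 8 features each
sv      = layers.Input(shape=(4, 12))    # up to 4 secondary vertices, 12 features each
glob    = layers.Input(shape=(6,))       # 6 global jet variables

# Each particle branch: 1x1 convolutions followed by an LSTM "summary"
c = layers.LSTM(150)(conv_branch(charged, [64, 32, 32, 8]))
n = layers.LSTM(50)(conv_branch(neutral, [32, 16, 4]))
v = layers.LSTM(50)(conv_branch(sv, [64, 32, 32, 8]))

# Concatenate the branch summaries with the global variables and classify
x = layers.Concatenate()([c, n, v, glob])
x = layers.Dense(200, activation="relu")(x)
for _ in range(6):
    x = layers.Dense(100, activation="relu")(x)
out = layers.Dense(4, activation="softmax")(x)  # b / bb / c / light

model = models.Model([charged, neutral, sv, glob], out)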


[ROC curves: misid. probability (udsg and c) vs b jet efficiency for DeepFlavour phase 1, DeepCSV phase 1, and DeepCSV phase 0. tt̄ events, AK4 jets with pT > 30 GeV (left) and pT > 90 GeV (right), √s = 13 TeV, CMS Simulation Preliminary. CMS DP-2018/033]


Similar performance to simpler, dedicated binary taggers, but with full multi-class power. Significantly better performance in regions with different quark composition.

[ROC curves: misid. probability vs light-quark efficiency for the full DeepJet multi-classification algorithm and its recurrent and convolutional variants. QCD events with pT = 30-50 GeV, jet pT > 30 GeV, √s = 13 TeV, CMS Simulation Preliminary. CMS-DP-2017-027]
DeepAK8: for boosted resonances

- Inclusive particles × 100 (10 features, pT ordered) → P-CNN layers × 14
- Charged particles × 60 (30 features, sIP ordered) → P-CNN layers × 14
- Secondary Vtx × 5 (14 features, displacement ordered) → P-CNN layers × 14
→ Dense (521 nodes × 1) → Output: flavour

- Significantly larger number of candidates used, to accommodate 90% of the fat jets
- Need to learn substructure from both charged and neutral candidates
- RNNs become computationally too expensive to train
- Use particle-level convolutional layers (P-CNN), where each feature is treated as a “colour” (see the sketch after this list)
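A minimal sketch of the “feature as colour” idea in Keras; the kernel size, filter count, and final pooling are illustrative assumptions not given on the slide:

from tensorflow.keras import layers

# The sorted particle list is the 1D "spatial" axis; the per-particle features
# act as the colour channels of an image, so Conv1D can learn substructure.
inclusive = layers.Input(shape=(100, 10))   # 100 pT-ordered candidates x 10 features
x = inclusive
for _ in range(14):                          # "P-CNN layers x14" from the diagram
    x = layers.Conv1D(filters=64, kernel_size=3, padding="same", activation="relu")(x)
summary = layers.GlobalAveragePooling1D()(x) # collapse the particle axis (assumed)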
Performance

- Flavour information largely improves jet tagging
- Large improvement w.r.t. the BDT approach
- Introduces mass sculpting, which is not necessarily a bad thing

[Figure 1: comparison of the performance of the two BDT taggers and the CNN taggers in terms of ROC curves in MC simulated events, for top jets as signal and QCD jets as background; AK8 jets with 1000 < pT < 1400 GeV. CMS-DP-2017-049]
GRU = Gated Recurrent Unit = recurrent network used to reduce the dimensionality of the output from the Conv1D layers

DeepDoubleB: (60×32, 5×32) → (50, 50)

- Track features (60, 8) → Conv1D (2 layers, 32+32 units, dropout = 0.1) → (60, 32) → GRU (50 units, dropout = 0.1) → (50) → BN
- SV features (5, 2) → Conv1D (2 layers, 32+32 units, dropout = 0.1) → (5, 32) → GRU (50 units, dropout = 0.1) → (50) → BN
- Double-b features (27) → BN
→ Fully connected (1 layer, 100 units, dropout = 0.1) → (100) → Output: Higgs / QCD

(A Keras sketch of the track branch follows.)
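A minimal Keras sketch of one branch as read off the diagram; the 1×1 kernel size is an assumption, since the slide only gives layer counts and unit sizes:

from tensorflow.keras import layers

tracks = layers.Input(shape=(60, 8))               # 60 tracks x 8 features
x = layers.Conv1D(32, 1, activation="relu")(tracks)  # kernel size 1 assumed
x = layers.Conv1D(32, 1, activation="relu")(x)       # -> (60, 32)
x = layers.GRU(50, dropout=0.1)(x)                   # -> (50): sequence summary
x = layers.BatchNormalization()(x)                   # BN before the fully connected head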

Significantly better than the current BDT approach, with some residual mass sculpting.

[ROC: mistagging rate (QCD) vs tagging efficiency (H→bb̄): DeepDoubleBvL, AUC = 97.3%, vs double-b, AUC = 91.3%. 300 < jet pT < 2000 GeV, 40 < jet mSD < 200 GeV. CMS Simulation Preliminary, 2016 (13 TeV).]

[Figure 4: effect on the jet soft-drop mass distribution of events misidentified by the DeepDoubleBvL algorithm, demonstrating the degree to which the algorithm depends on the jet mass. Histograms obtained at a fixed overall mistagging rate from a QCD sample.]

CMS DP-2018/046


Removing mass correlation

Per-batch penalty term proportional to the Kullback-Leibler (KL) divergence (sketched below).

- Minimal loss in performance
- Mass sculpting gone

[Figure 6: effect on the jet soft-drop mass distribution of events misidentified by the DeepDoubleBvL algorithm after it has been decorrelated from the jet mass. Histograms obtained at a fixed overall mistagging rate from a QCD sample.]

CMS DP-2018/046
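A sketch of how such a per-batch penalty could enter the loss, assuming the KL divergence is taken between the soft-drop-mass distribution of jets passing the tagger and a reference distribution; the exact construction is not spelled out on the slide:

import tensorflow as tf

def kl_mass_penalty(hist_pass, hist_ref, eps=1e-8):
    # Normalise the per-batch mass histograms to probability distributions
    p = hist_pass / (tf.reduce_sum(hist_pass) + eps)
    q = hist_ref / (tf.reduce_sum(hist_ref) + eps)
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return tf.reduce_sum(p * tf.math.log((p + eps) / (q + eps)))

# total_loss = classification_loss + lambda_kl * kl_mass_penalty(hist_pass, hist_ref)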


DeepDoubleC!

[Figure 3: ROC comparison of the double-b, DeepDoubleBvL, and DeepDoubleCvB quark-antiquark-pair jet identification algorithms: probability of misidentifying H→bb̄ jets as a function of the tagging efficiency, obtained from a combined sample of H→bb̄ and H→cc̄. CMS DP-2018/046]
From training to practice

Two worlds colliding

Training / Analysis:
- Keras + TensorFlow
- Python-based
- Private productions
- Minimal interaction with ROOT
- Few processes, single threads
- Few memory constraints
- Expendable jobs

Production:
- Custom framework
- C++ based (speed!)
- Mostly ROOT-centric (at least the I/O)
- Many processes, multiple threads
- Many other concurrent activities → memory constraints
- Processes cannot die (e.g. the trigger)


Integration of DeepFlavour into CMSSW. PR #19893



Integration of DeepJet (AK4) into CMSSW. PR #19893



Backend choice

✗ Interface based on the TF Python API:
- Uses the Python C API and a pre-built TF package
- Large overhead and no handle on memory/threading

✗ Interface based on the TF C API:
- Low level and not very convenient
- Lots of customisation and ad-hoc handling needed

✓ Interface based on the TF C++ API:
- Access to all the internals needed for production usage, with minimal need for custom code
- Shallow interface to connect TF to the CMSSW internals (e.g. logging)


Issue 1: Multithreading

- TF starts lots of threads in its own thread pool for:
  - faster loading of data
  - parallelism between operations (inter_op_parallelism_threads)
  - parallelism within operations (intra_op_parallelism_threads)
- Normally a good thing, but it has a critical impact on memory consumption in HEP frameworks, which have their own threading schemes/pools (CMSSW uses TBB)
- Solved with the implementation of two custom sessions:
  - without any threading (NTSession)
  - sharing the thread pool with the rest of the framework (TBBSession)

(The Python-API analogues of these threading knobs are sketched below.)
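For illustration, these are the corresponding thread-pool settings in the TF 1.x Python API; the CMSSW solution is the custom C++ NTSession/TBBSession classes, not this Python configuration:

import tensorflow as tf

# Sketch: restrict TF's own thread pools (TF 1.x API, matching the era of this talk)
config = tf.ConfigProto(
    inter_op_parallelism_threads=1,  # parallelism between independent operations
    intra_op_parallelism_threads=1,  # parallelism within a single operation
)
session = tf.Session(config=config)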
Issue 2: Memory footprint

- Initially the DeepJet graph was large (~150 MB): not feasible for production operations
- Weights were stored as Variables, which need more memory than Constants
- By default Keras stores a lot of ancillary information on top of the model (operations and tensors used for training, optimiser status, etc.)
- Reduction of O(10-100) by removing things not needed for inference and converting Variables to Constants (see the sketch below)
- Further reduction: one single computation graph loaded and shared across threads, with multiple sessions computing the inference
- In the future: AOT compilation?
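A minimal sketch of the variables-to-constants conversion in the TF 1.x Python API; the output node name is hypothetical and the actual CMSSW tooling may differ:

import tensorflow as tf
from tensorflow.python.framework import graph_util

# Sketch (TF 1.x): freeze a trained graph for inference by turning Variables
# into Constants and keeping only the nodes needed to compute the outputs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # or restore the trained weights here
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def,
        output_node_names=["ID_pred/Softmax"],   # hypothetical output node name
    )
with tf.gfile.GFile("deepjet_frozen.pb", "wb") as f:
    f.write(frozen.SerializeToString())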


A new kid in town: MXNet

- Used to train DeepAK8
- Found to be 2-3x faster for training that type of model
- Model exported with the Gluon API (see the sketch below):
  - no post-processing needed
  - outputs a JSON file describing the network architecture and a binary containing the weights
- “Simple” inference engine: one .h + one .so file
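For illustration, a hybridized Gluon model exports exactly such a JSON + weights pair (a sketch, not the DeepAK8 training code):

import mxnet as mx
from mxnet.gluon import nn

# Sketch: any HybridBlock can be exported after hybridization and one forward pass
net = nn.HybridSequential()
net.add(nn.Dense(10))
net.initialize()
net.hybridize()
net(mx.nd.zeros((1, 4)))       # run once so the symbolic graph is built
net.export("model", epoch=0)   # writes model-symbol.json + model-0000.params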


MXNet integration



MXNet issues - thread safety

Problem: the simple “NaiveEngine” is not thread safe; it is meant to run as a static singleton and only allows synchronous operations.

- The other engine type, “ThreadedEngine”, is thread safe, but spawns its own thread pool

Solution: make the engine “thread_local”

- Only 2 lines of code changed
- Not without potential dangers


MXNet issues - thread safety

Problem: once the previous problem was fixed, the resulting output was still sometimes inconsistent — a race condition

- For each thread, MXNet creates a workspace to store temporary variables (partially computed results)
- If multiple graphs are run simultaneously there might be a race condition, but this should not happen, since each thread runs one graph at a time…
- …unless the object computing the graph is reattached to a different thread every time, as in a worker-pool design!

Solution: re-attach the workspace every time before running the computation graph (MXNet patch)

- Only 2 lines of code changed
- An additional mutex was added when initialising the Executor, as helgrind reported race conditions in the initialisation (minimal cost)

[Diagram: Executors 1 and 2 migrating between Thread 1 and Thread 2, each thread owning its own workspace.]




MXNet vs. TensorFlow

MXNet:
- Lighter dependencies (only a BLAS library)
- Simple model export for inference
- Official ONNX integration
- New: Keras compatible

TensorFlow:
- Better thread safety
- Significantly better operator support and coverage (always cutting edge)
- Native Keras support


Summary

- Jet tagging is of paramount importance for the CMS physics program
- Lots of development in the last ~1.5 years to apply modern machine-learning techniques to this field
- Large improvements in performance
- Still some room for new developments, especially in the boosted regime
- Flavour tagging is not only fancy algorithms, but also solid and performant computing infrastructure
- Model deployment can take a significant amount of time


Backup

Optimistic MC simulation vs. realistic MC simulation

[ROC curves: misid. probability (udsg and c) vs b jet efficiency for DeepFlavour phase 1, DeepCSV phase 1, and DeepCSV phase 0. tt̄ events, AK4 jets with pT > 30 GeV, √s = 13 TeV, CMS Simulation Preliminary.]
DeepDoubleC - mass sculpting

[Figure: effect on the jet soft-drop mass distribution of events misidentified by the DeepDoubleCvL identification algorithm, demonstrating the degree to which the algorithm depends on the jet mass. Histograms obtained at a fixed overall mistagging rate from a QCD sample. CMS-DP-2018-XXX]
Particle-based NN architecture

- Charged (16 features) × 25 → 1×1 conv. 64/32/32/8 → RNN 150
- Neutral (8 features) × 25 → 1×1 conv. 32/16/4 → RNN 50
- Secondary Vtx (17 features) × 4 → 1×1 conv. 64/32/32/8 → RNN 50
- Global variables (6 features)
→ Dense (200 nodes × 1, 100 nodes × 6) → outputs b / bb / c / l

[Figure 5: performance of the b jet identification algorithms, including DeepCSV retrained for the Phase 1 detector. CMS-DP-2017-013]
P-CNNs

Input: particles (sorted) × features

$z_{a,m} = \sum_{\alpha}\sum_{j} k_{a,\alpha,j}\, x_{\alpha,(m+j-1)}$

- The index m sweeps over the elements (particles), while j loops over the contiguous elements covered by the kernel
- Multiple features (“colours”, index α) are accounted for when computing the transformation
- Different filters/kernels (index a) learn different transformations

(The sum is written out with explicit loops in the sketch below.)
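The same sum written out with explicit loops (pure NumPy, zero-based indices, illustrative only):

import numpy as np

def pcnn_layer(x, k):
    # x: (n_channels, n_particles)  -- features alpha x sorted-particle position
    # k: (n_filters, n_channels, kernel_size)
    n_filters, n_channels, ksize = k.shape
    n_out = x.shape[1] - ksize + 1            # "valid" convolution
    z = np.zeros((n_filters, n_out))
    for a in range(n_filters):                # different filters learn different transformations
        for m in range(n_out):                # sweep over the particles
            for alpha in range(n_channels):   # sum over the feature "colours"
                for j in range(ksize):        # contiguous elements of the kernel
                    z[a, m] += k[a, alpha, j] * x[alpha, m + j]  # m+j: zero-based form of m+j-1
    return z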
