
CMSPOS 2019 E/gamma long exercise

Aachen
Exercise 5: MVA training

Sam Harper, Swagata Mukherjee


Task: draw a decision boundary

[Figure: scatter plots of signal (S) and background (B) events, separated by a decision boundary]

How to find the best decision boundary?


Decision Tree
• Task: divide events into signal and background.
• Need Monte Carlo samples of each. Divide each Monte Carlo sample into two parts: a training sample and a test sample.
• Decide on some ID variables useful for distinguishing between signal and background.
• For each ID variable, order the events by the value of the variable.
• Pick the 1st variable, pick a cut value, and see what happens if the training sample is split into two parts depending on that cut on that variable. (Repeat for other cut values.)
• Pick the cut which gives the best separation: one side having mostly signal and the other mostly background.
• Then repeat this for each variable.

[Diagram: a decision tree with a root node splitting into branch nodes and non-overlapping leaves]
Decision Tree
Define purity as P = S / (S + B), where S (B) is the number of signal (background) events on a branch.

Note that P * (1 - P) is 0 if the sample is pure signal or pure background.

For a given branch, construct Gini = n * P * (1 - P), where n is the number of events on that branch.

Pick the split that minimizes (Gini_left + Gini_right).

Boosting

• If an event is misclassified, i.e., a signal event lands on a background leaf or a background event lands on a signal leaf, then the weight of that event is increased (boosted).
• A second tree is built using the new weights.
• Again the misclassified events have their weights boosted, and the procedure is repeated.
• Typically one may build a few thousand trees this way, giving a forest of boosted decision trees. (Note this is not the same as a Random Forest, which is built by bagging rather than boosting.)
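The weight-boosting step can be sketched as follows. This is an AdaBoost-style update; the function name and the choice to renormalise are illustrative assumptions:

```python
import numpy as np

def boost_weights(weights, misclassified, alpha):
    """Multiply the weights of misclassified events by e^alpha (the 'boost'),
    leave correctly classified events unchanged, then renormalise so the
    weights again sum to 1. The next tree is trained with these weights."""
    w = np.asarray(weights, dtype=float) * np.where(misclassified,
                                                    np.exp(alpha), 1.0)
    return w / w.sum()
```

Each round, alpha is chosen from the error rate of the current tree, so badly misclassified samples get progressively more attention.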
Different types of Boosting

• Gradient Boosting
• XGBoost: eXtreme Gradient Boosting
• The XGBoost algorithm has recently been dominating applied machine learning. Why?
  • Parallel computing: when you run XGBoost, by default it uses all the cores of your machine.
  • Regularization: a technique used to avoid overfitting.
  • Flexibility: supports user-defined evaluation metrics.
  • Availability: currently available for programming languages such as R, Python, Java, etc.
• We will use it to train the electron MVA ID.
The next slides are technical and will be explained as we take you through the tutorial.

Any questions?
Start the exercise by following the instructions here:
https://github.com/guitargeek/ElectronMVATutorial

The first step is running the ntuplizer. Use slc7:


ssh -XY your_username@lxplus7.cern.ch
export SCRAM_ARCH=slc7_amd64_gcc700

Take the code from git and remember to run

scram b
voms-proxy-init -voms cms

Use a Run 3 ROOT file (choose a reasonable maxEvents):

/store/mc/Run3Summer19MiniAOD/DYJets_incl_MLL-50_TuneCP5_14TeV-madgraphMLM-pythia8/MINIAODSIM/2023Scenario_106X_mcRun3_2023_realistic_v3-v1/270000/222889F5-1E13-F34C-B312-B9B102119CBB.root

Once you have run the ntuplizing step, open the ROOT tree and have a look at what’s in there.

Find variable definitions/explanations in:


ElectronIdentification/data/ElectronIDVariables.txt

Now, open the SWAN notebook


Notebook
• Select a cell and then press Shift+Return to run it.
• In[*] means the cell is running; In[some_number] means the run is complete.
• You can insert a new cell from the Insert menu.
• Insert your own username where asked.
• Follow the in-line instructions.

At the end, check the variable importance.
More things to try
• Change hyper-parameters and see if performance improves; for example, you can change the learning rate.
• Consult this: https://xgboost.readthedocs.io/en/latest/parameter.html
