20th International Conference on Computing in High Energy and Nuclear Physics (CHEP2013) IOP Publishing
Journal of Physics: Conference Series 513 (2014) 022004 doi:10.1088/1742-6596/513/2/022004

HEPDOOP: High-Energy Physics Analysis using Hadoop

W. Bhimji, T. Bristow, A. Washbrook
SUPA School of Physics and Astronomy, University of Edinburgh, Edinburgh, UK
E-mail: wbhimji@staffmail.ed.ac.uk

Abstract. We perform a LHC data analysis workflow using tools and data formats that
are commonly used in the “Big Data” community outside High Energy Physics (HEP). These
include Apache Avro for serialisation to binary files, Pig and Hadoop for mass data processing
and Python Scikit-Learn for multi-variate analysis. Comparison is made with the same analysis
performed with current HEP tools in ROOT.

1. Introduction
“Big Data” has now moved beyond buzzword to business-as-usual in the private sector, including
big names such as Yahoo, Twitter, Facebook, Amazon and Google. High-Energy Physics (HEP)
is often cited as the archetypal Big Data use case; however, it currently shares very little of
the toolkit used in the private sector, which is dominated by the Apache Hadoop ecosystem.
We use those tools to perform aspects of a HEP analysis as part of a longer-term goal of
interaction between big data communities. For this study we use established tools with large
user communities (see below). We aim for performance that is good enough (i.e. of the same
order as HEP tools rather than seeking to exceed it) and also make ease-of-use an important
consideration.

1.1. Technologies used


Hadoop is a framework “for the distributed processing of large data sets across clusters of
computers using simple programming models”[1]. It provides easy execution of MapReduce
programs [2] and the HDFS distributed filesystem [3].
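The map/reduce model itself is simple enough to sketch in a few lines of plain python. This toy (not Hadoop itself, and with invented event records) shows the three phases that Hadoop executes at scale: map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user mapper to every record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, as Hadoop does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user reducer to each key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy job: count events per (invented) trigger stream.
events = [{"stream": "Egamma"}, {"stream": "Muons"}, {"stream": "Egamma"}]
pairs = map_phase(events, lambda e: [(e["stream"], 1)])
counts = reduce_phase(shuffle(pairs), lambda key, values: sum(values))
```

In Hadoop the same mapper and reducer run distributed over HDFS blocks, with the shuffle performed by the framework across the network.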
Pig produces MapReduce programs for Hadoop using a query language called Pig Latin [4] which
offers [5] “ease of programming” for parallel execution of tasks, “optimisation opportunities” as
task execution is optimised automatically, and “extensibility” through User Defined Functions
(UDF)s.
Avro [6] is a file-based, binary-format data serialization system. It relies on schemas, but these
are stored in the file, which is thus self-describing. Files are compressed, with a choice of
codecs, and tools for reading/writing are available in several languages, including for Pig. It is
not a columnar store, though such formats are moving in that direction following Google’s Dremel
paper [7]. Parquet [8] is perhaps the best documented of the column-oriented products but was
still considered too immature for the initial stage of this project.
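As an illustration of such an embedded schema (the field names here are hypothetical, not the actual D3PD branch list), a slimmed event record might be declared as:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "RunNumber", "type": "int"},
    {"name": "EventNumber", "type": "long"},
    {"name": "MET_RefFinal_et", "type": "float"},
    {"name": "jet_AntiKt4TopoEM_pt", "type": {"type": "array", "items": "float"}}
  ]
}
```

The schema is written into the file header, so any reader can decode the records without external metadata.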
Scikit-Learn [9] is a mature, well-documented and fully-featured machine learning package

for python built on NumPy and matplotlib. The latter are also used for visualisation and
histogramming in this project.
ROOT [10][11] is the most commonly used data analysis technology in HEP, both for data
structures and the analysis toolkit. A number of tools have been developed for ROOT, including
TMVA [12][13] for multi-variate analysis.

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd

2. HEP Analysis Use-case
An important workflow within HEP analysis is illustrated in figure 1. This illustration uses both
terms employed in HEP and more generally applicable titles, for comparison with other
disciplines. These stages are discussed in the sections that follow.

Data Filtering:        Data Mining:               Data Visualisation:
Skimming/Slimming      Cut-optimisation / MVA     Histogramming

Figure 1. Illustrative HEP analysis workflow.


3. Data Filtering
We use data from the ATLAS experiment in the popular, centrally-produced, SMWZ D3PD
format, which contains ROOT TTrees with 5860 branches of simple types and vectors with a
total size of ≈65 KB/event. Due to the large file sizes, an analysis group will often conduct a
“Filtering” stage to reduce the volume of data used in further analysis. The output is in the
same structure but with a selection of events and a reduction of branches. We use ≈1 TB of
input data which is converted into zlib-compressed Avro files using a simple conversion script
based on the Avro python interface. We apply a selection motivated by an analysis of Higgs
Boson decays to two b-quarks (H→bb) which has an event selection efficiency of 5% and selects
109 branches, resulting in a file size around 0.3% of the original. Illustrative Pig code is shown
in figure 2. Filtering is done with simple comparisons (such as MET_RefFinal_et in this example)
and more complex selections via a python UDF (myfuncs.passesJetSelection in this example).
For the UDFs, the same code can be reused in both Pig and ROOT.

REGISTER 'MySelections.py' using jython as myfuncs;
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage;
input = LOAD '$input' USING AvroStorage();
data = FOREACH input GENERATE $branches;
outputdata = FILTER data BY MET_RefFinal_et > 45.0
    AND myfuncs.passesJetSelection(jet_AntiKt4TopoEM)
    AND ...
STORE outputdata INTO './results' USING AvroStorage();
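The reuse works because a jython/python UDF is plain python. A hypothetical sketch of what such a function in MySelections.py might look like (the cut values and the (pt, eta)-per-jet layout are illustrative assumptions, not the analysis selection or the actual D3PD schema):

```python
# Hypothetical selection UDF, callable from Pig (registered via jython)
# or from a ROOT-based python analysis. Under Pig the function would
# additionally be annotated with @outputSchema("passes:boolean").

def passesJetSelection(jets):
    """Return True if at least two jets pass the kinematic cuts.

    `jets` is a sequence of (pt [GeV], eta) pairs, e.g. unpacked from
    per-jet vector branches.
    """
    good = [(pt, eta) for pt, eta in jets if pt > 45.0 and abs(eta) < 2.5]
    return len(good) >= 2
```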
input = LOAD ’$input’ USING AvroStorage();
data = FOREACH input GENERATE $branches;
Figure 2. Illustrative Pig code for Data Filtering.

3.1. Performance
The Hadoop processing time versus the size of the input dataset, when run on a 500 core Hadoop
installation at CERN, is shown in figure 3. The ROOT-based tool centrally available on ATLAS
for this purpose (filter-and-merge) has a similar 270s run time on a single 760 MB file. However,
with custom ROOT code, reading only the branches used, we were able to achieve a single-file
run time of 30s, and the application could be run in parallel on a cluster of this sort, with a batch
system or using PROOF, possibly with better scaling than seen here for Hadoop. We do,
however, also observe an improvement of performance in Hadoop/Avro when some large, unused
vector branches are removed to leave 2300 “branches” at 2.5 KB/event.


Figure 3. Time taken to perform filtering in Pig for files with all branches from the D3PD (blue
squares) or when certain very large vector branches are removed (red triangles) together with
that for ROOT running on a single 760 MB file using the filter-and-merge tool (green circle) or
custom ROOT code reading only the branches of interest (yellow triangle).

4. Data Mining
Filtered data is then mined by applying further linear selections, which we implement in Pig, or
a Multi-Variate Analysis (MVA) such as a Boosted Decision Tree (BDT), which we implement in
scikit-learn. This is run on the complete input data for the ATLAS H→bb analysis. Illustrative
code is shown in figure 4. In that code, trainingData, trainingClasses and testingData are NumPy
arrays that can either be read from ROOT TTrees (using rootpy [14]) or filled from Avro using
the python interface.
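A minimal sketch of the Avro-to-NumPy route, with purely illustrative field names and values (the real records carry the 109 selected branches):

```python
import numpy as np

# Records as they might be decoded from an Avro file with the python
# interface; field names and values here are hypothetical.
records = [
    {"MET_RefFinal_et": 55.2, "jet_pt_lead": 80.1, "isSignal": 1},
    {"MET_RefFinal_et": 47.9, "jet_pt_lead": 62.3, "isSignal": 0},
]

features = ["MET_RefFinal_et", "jet_pt_lead"]
trainingData = np.array([[rec[f] for f in features] for rec in records])
trainingClasses = np.array([rec["isSignal"] for rec in records])
```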

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Boosted Decision Tree: 400 shallow trees combined with AdaBoost (SAMME).
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=4, min_samples_split=2, min_samples_leaf=100),
    n_estimators=400, learning_rate=0.5, algorithm="SAMME")
ada.fit(trainingData, trainingClasses)
scores = ada.decision_function(testingData)

Figure 4. Illustrative python MVA code


We use IPython [15] to enable parallel training on samples and achieve an execution time in
Scikit-Learn similar to that of TMVA: for the training of 4 BDTs, Scikit-Learn takes a combined
time of around 180 ± 5 s while TMVA takes around 200 ± 5 s. Outputs of the BDT are shown
in figure 5. We can recreate the same results for the analysis as obtained with ROOT/TMVA,
but the strength of this package is that it also allows for a wide range of alternative approaches.
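As a stand-in for that IPython-parallel setup, the fan-out over samples can be sketched with the standard library alone; `train_bdt` here is a hypothetical placeholder for the per-sample `fit` call, not the analysis code.

```python
from concurrent.futures import ThreadPoolExecutor

def train_bdt(sample):
    # Placeholder: in the real analysis this would call
    # ada.fit(trainingData[sample], trainingClasses[sample]).
    return "trained:" + sample

samples = ["sample0", "sample1", "sample2", "sample3"]
with ThreadPoolExecutor(max_workers=4) as pool:
    trained = list(pool.map(train_bdt, samples))
```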

Figure 5. Outputs from the scikit-learn MVA

5. Conclusions
Key components of a real HEP analysis have been run using out-of-the-box Big Data tools in
Hadoop and python. One advantage of this approach is that it can enable the use of wider
resources such as Amazon Elastic MapReduce, as well as benefitting from large user and
developer communities and further off-the-shelf tools for use in, for example, provisioning,
monitoring and scheduling. The performance seen here for filtering is, however, significantly
inferior to our ROOT-based analysis, particularly in the case of large, complex nested structures.
To attempt to address this issue, we are exploring the use of the columnar storage format
Parquet [8].

Acknowledgments
Thanks to the CERN IT-DSS group for use of Hadoop resources.

References
[1] URL http://hadoop.apache.org/ (Accessed October 2013)
[2] Dean, J. and Ghemawat, S. 2008 MapReduce: simplified data processing on large clusters Communications
of the ACM 51(1) 107-113
[3] Shvachko, K. et al. 2010 The hadoop distributed file system Proceedings of IEEE 26th Symposium on Mass
Storage Systems and Technologies (MSST) 1-10
[4] Olston, C., et al. 2008 Pig latin: a not-so-foreign language for data processing Proceedings of the 2008 ACM
SIGMOD international conference on Management of data 1099-1110
[5] URL http://pig.apache.org/ (Accessed October 2013)
[6] URL http://avro.apache.org/ (Accessed October 2013)
[7] Melnik, S. et al. 2010 Dremel: interactive analysis of web-scale datasets Proceedings of the VLDB Endowment
3(1-2) 330-339
[8] URL http://parquet.io/ (Accessed October 2013)
[9] URL http://scikit-learn.org/ (Accessed October 2013)
[10] Brun R and Rademakers F 1997 Nucl. Instrum. Meth. A 389 81
[11] URL http://root.cern.ch/ (Accessed October 2013)
[12] Hoecker, A. et al. 2007 TMVA - Toolkit for Multivariate Data Analysis arXiv:physics/0703039


[13] URL http://tmva.sourceforge.net/ (Accessed October 2013)


[14] URL http://www.rootpy.org/ (Accessed October 2013)
[15] Perez, F., and Granger, B. E. 2007 IPython: a system for interactive scientific computing Computing in
Science and Engineering 9(3) 21-29
