
TME - Classification of SUSY data set

Adam Teixidó Bonfill · 30/01/21

Introduction
We attempt to classify the dataset on SUSY Monte Carlo simulations in [1]. The challenge
is to distinguish between a signal process which produces supersymmetric particles and a
background process which does not.

Steps followed
First, we read the associated paper [2]. We take into account the distinction between low-level and high-level features: the first kind is measured experimentally, while the second kind is computed from the low-level features to aid classification. We will try to see whether neural networks can avoid the need for high-level features, given the big effort theoretical physicists have to make to find them. Moreover, the article indicates the hyper-parameters of the neural networks that its authors used, which we will also take into account.

Exploring the data set At first, handling the data set seems complicated because of the 2 GB size of its 5 million examples. Several ideas to open the data set were:

1. Split it into multiple smaller files using CSV-splitting software [3].

2. Read it line by line, without storing former lines in memory.

3. Fully load it with Pandas or NumPy. Even though they took between one and a few minutes to open the data set, they handle it well (see the sketch below).
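A minimal sketch of the third option, loading the full file with Pandas; the uncompressed file name SUSY.csv and the column names (taken from the UCI description [1]) are assumptions:

    import pandas as pd

    # Column names: class label first, then the 18 features (8 low-level
    # kinematic features followed by 10 high-level ones), as listed in the
    # UCI description [1].
    columns = ["label",
               "lepton 1 pT", "lepton 1 eta", "lepton 1 phi",
               "lepton 2 pT", "lepton 2 eta", "lepton 2 phi",
               "missing energy magnitude", "missing energy phi",
               "MET_rel", "axial MET", "M_R", "M_TR_2", "R", "MT2",
               "S_R", "M_Delta_R", "dPhi_r_b", "cos(theta_r1)"]

    # The full file (about 2 GB, 5 million rows) loads in roughly a minute.
    df = pd.read_csv("SUSY.csv", header=None, names=columns)
    print(df.shape)  # (5000000, 19)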

Using Pandas we were able to plot the distributions of the features, as seen in Figure 1. The 18 features are of two kinds, as the data set description indicates [1]: “The first 8 features are kinematic properties measured by the particle detectors in the accelerator. The last ten features are functions of the first 8 features; these are high-level features derived by physicists to help discriminate between the two classes.” As said before, we are going to take the distinction between kinematic and high-level features into account during the training of the models.

We checked that the percentage of SUSY processes is 46%. We also checked that input
features are standardized over the entire set with mean zero and variance one, except for
features that are always positive, which are standardized to have mean one.
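Both checks can be done directly with Pandas; a minimal sketch, assuming the data frame df from above:

    # Fraction of signal (SUSY) examples: about 0.46.
    print(df["label"].mean())

    # Per-feature statistics: means close to 0 (or to 1 for the always-
    # positive features) and standard deviations close to 1.
    features = df.drop(columns="label")
    print(features.mean().round(2))
    print(features.std().round(2))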


Figure 1: Distribution of all features for SUSY signal (orange) and background (blue)
processes. The variables with uniform distribution are measures of the azimuthal angle φ of
particles or missing energy.

We split the data set into a training set and a held-out set. As in the original paper [2], we use a validation set with 500,000 examples (10% of the data set), picked at random.
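A sketch of this split using scikit-learn's train_test_split; the fixed random seed is our own choice, added for reproducibility:

    from sklearn.model_selection import train_test_split

    X = df.drop(columns="label").values
    y = df["label"].values

    # Hold out 500,000 randomly chosen examples (10%) for validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=500_000, random_state=0)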

Dimensionality reduction First, we try to reduce the dimensionality with PCA. Figure 2 shows the results for the two principal components.

(a) Features standardized as in the original paper. (b) Features standardized to mean zero and variance one.

Figure 2: Principal components of PCA for SUSY signal (orange) and background (blue)
processes.


The distributions completely overlap, which tells us that using just the two principal components will not allow us to classify. Moreover, we see little difference between the two methods of standardization, whether it is the one followed in the original paper or fully standardizing to mean zero and variance one. The first principal components explain a larger percentage of the variance when we fully standardize the features, obtaining 27%, 21% and 8% for the first three components, respectively. Additionally, we need as many as 11 components to explain at least 95% of the data variance, more than the number of kinematic features.
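A minimal sketch of this analysis with scikit-learn's PCA; the exact numbers quoted in the comments are what we observed in our run:

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Fully standardize to mean zero and variance one before PCA.
    X_std = StandardScaler().fit_transform(X_train)

    # Asking for 0.95 keeps as many components as needed to explain
    # at least 95% of the variance (11 in our run).
    pca = PCA(n_components=0.95)
    X_pca = pca.fit_transform(X_std)
    print(pca.n_components_)
    print(pca.explained_variance_ratio_[:3])  # roughly 0.27, 0.21, 0.08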

Upon seeing that two-component PCA does not distinguish between the processes, we tried a non-linear dimensionality reduction technique, specifically scikit-learn's multidimensional scaling (MDS) algorithm, manifold.MDS. With default settings, fully standardized features, and reducing to two dimensions, we get Figure 3. There, we can see a better separation between signal and background, which is promising for classification. However, MDS is much slower than PCA when using a large number of examples, which is why we applied the reduction to just 1,000 examples.

Figure 3: Non-linear reduction to 2D of the features of SUSY signal (orange) and background (blue) processes. Using 1,000 examples.
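A sketch of the reduction; taking the first 1,000 examples rather than a random sample is an assumption made for illustration:

    from sklearn.manifold import MDS

    # MDS works on pairwise distances, so it scales poorly with the
    # number of examples; we therefore reduce only 1,000 of them.
    X_small, y_small = X_std[:1000], y_train[:1000]

    mds = MDS(n_components=2, random_state=0)
    X_2d = mds.fit_transform(X_small)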


Trials with neural networks


We used Keras to classify with neural networks. For all networks we use cross entropy as the loss function, the stochastic-gradient-descent variant Adam as the optimizer, and ReLU as the activation function. We also collect the classification accuracy on the training and validation sets during training. We run the networks for 200 epochs, the low end of the range used in the original paper: at 200 epochs the smaller networks that we use already show only a very slow increase in accuracy, and fewer epochs also mean less over-fitting. Finally, we compute the gradient with mini-batches of size 100.

With these specifications, the neural networks take a long time to train. So, when we explore different possibilities for each network type, we use a fraction of the data set with 100,000 examples.

A Small neural network


We start with a neural network with the following layers (sketched in code after the list):

• A first hidden layer with 20 neurons.

• A second hidden layer with 10 neurons.

• Output layer with sigmoid activation function.
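A minimal Keras sketch of this small network with the training setup described above:

    from tensorflow.keras import layers, models

    # Small network: two hidden ReLU layers and a sigmoid output.
    model = models.Sequential([
        layers.Dense(20, activation="relu", input_shape=(18,)),
        layers.Dense(10, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

    # history.history stores train/validation accuracy for every epoch.
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=200, batch_size=100)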

When training over the whole data set we obtained a final accuracy on the validation set of 80%. The accuracy during training evolved as shown in the plot below.

After an initial increase in the percentage of correct classifications, we reach a slowly increasing plateau. Also, we do not see considerably worse performance on the validation set with respect to the training set, which makes us think that the network is not over-fitting.

We also trained this network with just the kinematic or just the high-level features, obtaining accuracies of 78% and 78.5%, respectively. This indicates that both types of features are equally useful for this network. Also, the network performs a bit better when it has all the features available, including the help from the high-level ones. The accuracy evolution during training is shown in the following graphs:

(a) Training with only kinematic features. (b) Training with only high-level features.

Here we see that, after an initial increase in the percentage of correct classifications, we again reach a slowly increasing plateau, and again validation performance stays close to training performance, so the network is not over-fitting. We also see that accuracy increases faster with the kinematic variables, indicating that the network still has room for improvement there; this could be because the kinematic variables contain the same information as the high-level ones, or more. Altogether, the best classification accuracy with this network is 80%.

B Shallow neural network


We now work with a neural network that has the following layers (sketched in code after the list):

• A single hidden layer with 1000 neurons.

• Output layer with sigmoid activation function.
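A sketch of the model definition, which is the only part that changes with respect to the small network above:

    # Shallow network: one wide hidden layer. Replacing "relu" by "tanh"
    # gives the hyperbolic-tangent variant tried later in this section.
    shallow = models.Sequential([
        layers.Dense(1000, activation="relu", input_shape=(18,)),
        layers.Dense(1, activation="sigmoid"),
    ])
    shallow.compile(optimizer="adam", loss="binary_crossentropy",
                    metrics=["accuracy"])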

When training with the combined features we obtained a final accuracy on the validation set of 78%. The accuracy during training evolved as shown in the plot below.


We clearly see that after 50 epochs the accuracy increases steadily on the training set and decreases steadily on the validation set. This means the network is using the large number of neurons in its single hidden layer to memorize the training set while getting worse on the validation set: it is over-fitting.

We also trained this network with just the kinematic or just the high-level features, obtaining accuracies of 78.2% and 78.5%, respectively. As for the small network, both kinds of features are equally useful for this shallow network. The accuracy evolution during training is shown in the following graphs:

(a) Training with only kinematic features. (b) Training with only high-level features.

With this network we also tried a different activation function, the hyperbolic tangent, obtaining the result shown below. Here we see that the network avoids over-fitting. For the rest of the training runs we used ReLU, because it is widely used nowadays. However, in the present case of a shallow network, the hyperbolic tangent activation function worked better, obtaining the best classification accuracy for this network: 79.5%.

C Deep neural network


We finally considered a deep neural network, smaller than in the original paper because otherwise it would take too long to train. Its layers are (sketched in code after the list):


• A 1st hidden layer with 20 neurons.

• A 2nd hidden layer with 10 neurons.

• A 3rd hidden layer with 10 neurons.

• A 4th hidden layer with 8 neurons.

• A 5th hidden layer with 8 neurons.

• Output layer with sigmoid activation function.
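A sketch of the model definition:

    # Deep network: five hidden ReLU layers of decreasing width.
    deep = models.Sequential([
        layers.Dense(20, activation="relu", input_shape=(18,)),
        layers.Dense(10, activation="relu"),
        layers.Dense(10, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    deep.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=["accuracy"])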

When training with the combined features we obtained a final accuracy on the validation set of 79.5%. The accuracy during training evolved as shown in the plot below.

As in the small network that we first implemented, we see an initial increase in the percentage of correct classifications followed by a slowly increasing plateau. Likewise, we do not see over-fitting, since the accuracies on the validation and training sets stay close. We then changed the network, performing dropout on the last hidden layer as done in the original paper. The results are shown in the following graphs:

(a) Training with 50% dropout in last layer. (b) Training with 20% dropout in last layer.

For 50% dropout we can see that the accuracy on the training set was only slightly reduced, but the accuracy on the validation set decreased considerably, making the network worse. This could be because of the much smaller size of the last hidden layer of this network: 8 neurons instead of the 300 used in the original paper. For 20% dropout we see neither negative nor positive effects, retaining a 79.5% accuracy on the validation set.
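In Keras, this dropout variant can be sketched by inserting a Dropout layer right after the last hidden layer:

    # Dropout variant: randomly drop outputs of the last hidden layer
    # during training (rate 0.5 for the 50% run, 0.2 for the 20% run).
    deep_dropout = models.Sequential([
        layers.Dense(20, activation="relu", input_shape=(18,)),
        layers.Dense(10, activation="relu"),
        layers.Dense(10, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])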

We also trained this network with just the kinematic or just the high-level features, obtaining accuracies of 78% and 78.5%, respectively. As for the small network, both kinds of features are equally useful for this deep network. Also, the network performs a bit better when it has all the features available. The accuracy evolution during training is shown in the following graphs:

(a) Training with only kinematic features. (b) Training with only high-level features.

These show an evolution quite similar to that of the small network. Altogether, the best classification accuracy we got with this network is 79.5%.

Conclusion
The first that we observed in this work is that having large data sets make computations long.
To handle them hHigh computing power is really useful. PCA and dimensionality reduction
were useful to gauge the difficulty of classifying correctly SUSY signal processes. Nevertheless,
we got a high percentage of classification with neural networks of 80%. We could not reach
the accuracy of the original paper, 88%, neither their neural networks’ size. Again we see
that computing power is key. But, even with small computational resources, neural networks
are powerful, they were able to distinguish the patterns in high-dimensional data. Moreover,
with the smallest network we made we were able to reach similar classification accuracy as
with deeper and wider ones. Finally, it was really useful to have an API of neural networks
as Keras to implement them easily and quickly, as there is a lot of trial and error. As in the
original paper, the networks performed similarly independently of using low-level, high-level
or combined features. So we can not conclude much about whether they are able to produce
relevant high-level features for themselves to better classify the SUSY signal processes.

Possible improvements for this work would be to search for better hyper-parameters to reduce training time and/or increase accuracy. Also, with more computing power we could perform k-fold cross-validation to obtain a better estimate of the accuracy and its variance, as sketched below.
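A hypothetical sketch of such a cross-validation loop; build_model stands for a function, not defined here, returning a freshly compiled copy of any of the architectures above:

    import numpy as np
    from sklearn.model_selection import KFold

    # Estimate the mean and spread of the accuracy with 5-fold CV,
    # rebuilding the network from scratch for every fold.
    accuracies = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                    random_state=0).split(X):
        model = build_model()  # hypothetical model factory
        model.fit(X[train_idx], y[train_idx],
                  epochs=200, batch_size=100, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        accuracies.append(acc)

    print(np.mean(accuracies), np.std(accuracies))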


Altogether, it has been an interesting way to learn hands-on how to use PCA, dimensionality reduction and simple neural networks.

References
[1] SUSY Data Set. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/SUSY

[2] Baldi, P., P. Sadowski, and D. Whiteson. “Searching for Exotic Particles in High-energy Physics with Deep Learning.” Nature Communications 5 (July 2, 2014).

[3] Free Huge CSV Splitter. SourceForge: https://sourceforge.net/projects/splitcsv/files/latest/download
