

Automatic log analysis with NLP for the CMS workflow handling
CHEP 2019

Daniel Robert Abercrombie1, Hamed Bakhshiansohi2, Sharad Agarwal3, Jennifer Adelman-McCarthy4, Andres Vargas Hernandez5, Weinan Si6, Lukas Layer7, Jean-Roch Vlimant8

on behalf of the CMS collaboration

1 Massachusetts Institute of Technology, 2 DESY, 3 CERN, 4 FNAL, 5 Catholic University of America, 6 University of California Riverside, 7 Università e sezione INFN di Napoli, 8 California Institute of Technology
Automation of failing workflow handling
• Central MC production in CMS utilizes the LHC computing grid
• Thousands of workflow tasks, each with thousands of jobs, run on more than 100 sites worldwide
• A certain fraction of the jobs fails and has to be handled manually by computing operators: e.g. resubmit, kill, change memory, change splitting, etc.
• Common errors are missing/corrupt input files, high memory usage, etc.

Operational Intelligence: automate the failure handling using Machine Learning (CMS Tools & Integration group)
Strategy and Dataset
• Information on ~33,000 failing workflow tasks collected since 2017
• The actions of the operators are stored
• For each task we know on which sites an error code was thrown and how many times (e.g. Site: T1_DE_KIT, Error code: 85, Counts: 2)
• Build a sparse matrix for each workflow (see the sketch below)

Goal: Predict the operator's action
• Challenges: small data, sparse input, class imbalance
• Tried: feed-forward NN, CNN, embedding models → similar results for the different models
• Poster at CHEP 2018: https://indico.cern.ch/event/587955/contributions/2937424/
• Dev: https://github.com/CMSCompOps/AIErrorHandling

Idea for further improvement: addition of the error logs using Natural Language Processing
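To make the input format concrete, here is a minimal SciPy sketch of one task's sparse (error code × site) count matrix; apart from the T1_DE_KIT example above, the error codes, site list and counts are hypothetical.

```python
# Minimal sketch of the sparse (error code x site) count matrix for one
# workflow task; codes, sites and counts besides the example are hypothetical.
from scipy.sparse import dok_matrix

error_codes = [85, 99109, 50660]                    # rows (hypothetical list)
sites = ["T1_DE_KIT", "T2_CH_CERN", "T2_US_MIT"]    # columns (hypothetical list)

# e.g. error 85 thrown twice at T1_DE_KIT, as in the example above
task_errors = {(85, "T1_DE_KIT"): 2, (50660, "T2_CH_CERN"): 7}

matrix = dok_matrix((len(error_codes), len(sites)), dtype=int)
for (code, site), count in task_errors.items():
    matrix[error_codes.index(code), sites.index(site)] = count

X = matrix.tocsr()   # compressed sparse representation of this task's counts
```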
Pipeline
[Diagram, built up over four slides: the WTC Operator Console stores the operator actions; the framework job reports (FWJRs) sit in a large-scale store and are reduced to error logs by filter + map steps; on CERN SWAN (a platform for interactive analysis in the cloud) the errors + sites counts, the actions and the error logs are joined, aggregated and preprocessed; training runs on the Caltech GPU cluster.]
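The filter/map/join/aggregate steps in the diagram suggest a Spark-style job; below is a minimal PySpark sketch under assumed storage paths and FWJR field names (the real schema and locations may differ).

```python
# Minimal PySpark sketch of the log-collection pipeline; the paths and the
# FWJR field names (task, site, exit code, log snippet) are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("AIErrorLogAnalysis").getOrCreate()

fwjr = spark.read.json("hdfs:///cms/wmarchive/fwjr/*.json")      # job reports
actions = spark.read.json("hdfs:///cms/ops/actions/*.json")      # operator actions

# Filter failing jobs and map each report to (task, site, error code, log)
errors = (fwjr
          .where(F.col("exitCode") != 0)
          .select("task", "site", "exitCode", "errorLog"))

# Aggregate: per task, count errors per (site, exit code) and keep the logs
counts = errors.groupBy("task", "site", "exitCode").agg(
    F.count("*").alias("count"),
    F.collect_list("errorLog").alias("logs"))

# Join with the stored operator actions to obtain the training labels
dataset = counts.join(actions, on="task")
dataset.write.parquet("hdfs:///cms/ops/ai_error_log_dataset")
```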


NLP Workflow
Log snippet selection → Tokenization + Cleaning → Indexing → Word Embeddings + Message Representation

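A minimal sketch of the tokenization, cleaning and indexing steps in Python; the regex filters and the vocabulary cut are illustrative assumptions, not the project's actual rules.

```python
# Sketch of tokenization + cleaning + indexing; filters are illustrative.
import re
from collections import Counter

def clean_and_tokenize(log_snippet):
    text = log_snippet.lower()
    text = re.sub(r"[0-9a-f]{8,}", " ", text)   # drop long hex IDs / hashes
    text = re.sub(r"[^a-z_./ ]", " ", text)     # keep word-like tokens
    return [tok for tok in text.split() if len(tok) > 1]

def build_index(messages, max_vocab=30000):
    """Map the most frequent tokens to integer indices (0 reserved for padding)."""
    counts = Counter(tok for msg in messages for tok in msg)
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(max_vocab))}
```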


Unsupervised word embeddings
• Map similar words to nearby points in a high-dimensional vector space → capture relations and reduce dimensionality
• Train with the word2vec skip-gram algorithm: the model predicts the context given a word
• Average the word vectors in each message → input for the ML model

[Figure: t-SNE of 5000 averaged error message vectors]

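A sketch of the skip-gram training and message averaging with gensim's word2vec; the embedding dimension of 50 matches one of the values quoted in the backup slides, the other parameters are illustrative.

```python
# Sketch of the unsupervised embedding step: word2vec skip-gram + averaging.
import numpy as np
from gensim.models import Word2Vec

# `tokenized_logs` is a list of token lists, e.g. from clean_and_tokenize above
model = Word2Vec(sentences=tokenized_logs, vector_size=50, sg=1,  # sg=1: skip-gram
                 window=5, min_count=5, workers=4)

def message_vector(tokens, model):
    """Average the word vectors of a message to obtain a fixed-size input."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)
```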


Supervised word embeddings: RNN
• Averaging does not exploit the sequential nature of language → use an RNN (LSTM / GRU)
• The GRU encodes the embedding vectors of the error log words into a single vector

[Figure: the message is padded with zeros and the zeros are masked; each word is mapped to its corresponding word vector; the final state of the GRU represents the message]

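A minimal Keras sketch of this message encoder (the prototype model was written in Keras); the layer sizes are illustrative, the maximum of 400 words per message follows the backup slides.

```python
# Sketch of the GRU message encoder with zero-padding and masking.
from tensorflow import keras
from tensorflow.keras import layers

MAX_WORDS, VOCAB, EMB_DIM = 400, 30000, 50

message = keras.Input(shape=(MAX_WORDS,), dtype="int32")
x = layers.Embedding(VOCAB + 1, EMB_DIM, mask_zero=True)(message)  # mask the zeros
encoded = layers.GRU(64)(x)               # final GRU state encodes the message
encoder = keras.Model(message, encoded)
```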


NLP model for error logs
[Diagram, built up over three slides: for each workflow task the error logs form a matrix indexed by errors × sites, each entry a sequence of words; an Embedding + GRU is applied to each error log, giving an encoded error log per cell. The encoded logs are concatenated with the matrix of error counts (errors × sites); the combined information passes through layers of arbitrary depth with dimensionality reduction (e.g. a shared dense layer) into a dense layer for binary classification.]
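One possible Keras sketch of the combined model, assuming the errors × sites grid is flattened into a fixed number of cells; the flattening and all dimensions are assumptions, not the actual architecture.

```python
# Sketch of the combined counts + error-log model; dimensions illustrative.
from tensorflow import keras
from tensorflow.keras import layers

N_CELLS, MAX_WORDS, VOCAB, EMB_DIM = 50, 400, 30000, 50

# Shared Embedding + GRU encoder, applied to the log of each (error, site) cell
log_in = keras.Input(shape=(MAX_WORDS,), dtype="int32")
enc = layers.GRU(32)(layers.Embedding(VOCAB + 1, EMB_DIM, mask_zero=True)(log_in))
log_encoder = keras.Model(log_in, enc)

logs = keras.Input(shape=(N_CELLS, MAX_WORDS), dtype="int32")   # matrix of logs
counts = keras.Input(shape=(N_CELLS,))                          # matrix of counts

encoded_logs = layers.TimeDistributed(log_encoder)(logs)        # (N_CELLS, 32)
x = layers.Concatenate()([layers.Flatten()(encoded_logs), counts])
x = layers.Dense(64, activation="relu")(x)     # shared dense layer / dim. reduction
out = layers.Dense(1, activation="sigmoid")(x) # binary classification

model = keras.Model([logs, counts], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```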
Training and target
• ~33,000 samples
• Target: "acdc w.o. modification" (retry only the failed jobs, with no modification of memory or splitting) vs. all other actions; the classes split 89% vs. 11%
• Simplest target → future: multiclass classification for more actions
• Small data + imbalanced classes → cross-validation (3 folds)
• Sparse input: batch size = 4 → O(1 h) / epoch on an NVIDIA GeForce GTX 1080

Hyper-parameter Optimization
• Bayesian optimization with scikit-optimize (see the sketch below)
• 2-8 GPUs on the Caltech GPU cluster with NVIDIA GeForce GTX 1080 / Titan X
• Experimental: distributed training with the NNLO framework (→ Talk at CHEP): https://github.com/vlimant/NNLO
• Experimental: training models in parallel with spark_sklearn on SWAN with the help of CERN IT

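A minimal sketch of the Bayesian search with scikit-optimize; the search space and the build_and_train objective are hypothetical placeholders, not the actual configuration.

```python
# Sketch of Bayesian hyper-parameter optimization with scikit-optimize.
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [Integer(16, 128, name="gru_units"),
         Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate")]

def objective(params):
    gru_units, learning_rate = params
    # build_and_train (hypothetical) would build the model above and return
    # the mean validation ROC AUC over the 3 cross-validation folds
    auc = build_and_train(gru_units=gru_units, learning_rate=learning_rate)
    return -auc  # gp_minimize minimizes, so negate the score

result = gp_minimize(objective, space, n_calls=30, random_state=0)
print("best parameters:", result.x)
```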


Results
[Figures: ROC AUC for "acdc w.o. modification" vs. other, and ROC AUC as a function of the fraction of the total data set used for training. Models compared: Baseline = FF with counts only; AVG = FF for averaged w2v vectors + counts; RNN = RNN for embeddings + counts.]

Successful training of the first complex NLP models

Similar results as the baseline; performance improves with more data

Work ongoing → full potential not yet exploited


Summary
• Implementation of a pipeline for the data acquisition (DAQ) and machine learning (ML) of error logs using big data analysis tools → https://github.com/llayer/AIErrorLogAnalysis
• Development of a prototype NLP model in Keras

Outlook
• Further investigation of data and model → reimplement in PyTorch / TensorFlow and exploit the NNLO framework
• Get involved with the NLP community and learn from their experiences
• RUCIO (→ Talk at CHEP): development of a common system to collect and categorize errors, provide shifters with actions and collect feedback on the suggestions → could also contain error log snippets



BACKUP

Word embeddings for log messages
• Vocabulary: ~30,000 words after filtering
• Unsupervised training with word2vec
• Standard parameters with embedding dimensions 10, 20, 50
• Visualization: non-linear dimensionality reduction with the t-SNE algorithm in 2D

[Figure: 2D t-SNE map of the word vectors; clusters correspond e.g. to sites]

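A sketch of the t-SNE visualization with scikit-learn, reusing the word2vec model from the earlier sketch; the perplexity and the size of the plotted subset are illustrative choices.

```python
# Sketch of the 2D t-SNE visualization of the word2vec vectors.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = list(model.wv.index_to_key)[:5000]     # `model` from the word2vec sketch
vectors = model.wv[words]

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.title("t-SNE of word2vec vectors for log messages")
plt.show()
```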
Message representation: RNN + Attention
• Not all words are equally important
• Return all RNN states and pass them through a shallow network that decides the weight corresponding to each state vector
• The weighted sum of the state vectors embodies the meaning of those vectors combined

[Figure: the GRU returns all states, which are combined by the weighted sum]

https://github.com/Hsankesara/DeepResearch/tree/master/Hierarchical_Attention_Network
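A minimal Keras sketch of this attention mechanism; the dimensions are illustrative and the masking of padded positions is omitted for brevity.

```python
# Sketch of attention over GRU states: a shallow network scores each state,
# and the states are combined by a softmax-weighted sum.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

MAX_WORDS, VOCAB, EMB_DIM = 400, 30000, 50

message = keras.Input(shape=(MAX_WORDS,), dtype="int32")
x = layers.Embedding(VOCAB + 1, EMB_DIM)(message)
states = layers.GRU(64, return_sequences=True)(x)      # return all GRU states

scores = layers.Dense(1, activation="tanh")(states)    # shallow scoring network
weights = layers.Softmax(axis=1)(scores)               # one weight per state
summary = tf.reduce_sum(weights * states, axis=1)      # weighted sum of states

attn_encoder = keras.Model(message, summary)
```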
Model for averaged w2v vectors
[Figure: architecture of the model taking the averaged word2vec vectors as input]

Best hyperparameters for the RNN model
• Batch size = 4
• Maximum number of words = 400

Actions
[Figure: the set of operator actions]
