Automatic log analysis with NLP for the CMS workflow handling
Lukas Layer | CHEP 2019
• Actions of the operators are stored
• For each task we know on which sites and how many times an error code was thrown (e.g. Site: T1_DE_KIT, Error code: 85, Counts: 2); a hypothetical record is sketched below
• Development: https://github.com/CMSCompOps/AIErrorHandling
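A minimal, hypothetical example of what one such stored record could look like (field and key names are illustrative assumptions; the real schema is defined by the CMS workflow tools and the AIErrorHandling repository):

task_record = {
    "task": "example_task",
    "errors": {                      # site -> {error code: count}
        "T1_DE_KIT": {"85": 2},
    },
    "action": "acdc",                # operator action stored for this task
}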
[Pipeline diagram: the framework job reports (FWJRs) and the stored operator actions are joined, aggregated and preprocessed at large scale into the "errors + sites counts" and "actions" inputs; analytics and training run on the Caltech GPU cluster.]
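In spirit, the join/aggregate/preprocess steps of the diagram could look like the following pandas sketch (toy data and column names are assumptions, not the actual large-scale pipeline):

import pandas as pd

# Toy framework job reports (FWJRs): one row per error occurrence.
fwjr = pd.DataFrame({
    "task": ["t1", "t1", "t2"],
    "site": ["T1_DE_KIT", "T1_DE_KIT", "T2_CH_CERN"],
    "error_code": [85, 85, 8001],
})
# Stored operator actions, one per task.
actions = pd.DataFrame({"task": ["t1", "t2"], "action": ["acdc", "other"]})

# Aggregate: how often each error code was thrown on each site per task.
counts = (fwjr.groupby(["task", "site", "error_code"])
               .size().reset_index(name="count"))
# Join the counts with the stored actions to form the training dataset.
dataset = counts.merge(actions, on="task", how="inner")
print(dataset)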
[Diagram: log snippet selection → cleaning → tokenization + indexing → word embeddings → message representation]
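A rough illustration of the cleaning, tokenization, indexing and zero-padding steps (the actual snippet selection and cleaning rules used in the project are not shown here; the rules below are assumptions):

import re

MAX_WORDS = 400                      # maximum number of words per message (see hyper-parameters)

def clean(snippet):
    """Lower-case the log snippet and strip punctuation (illustrative rule)."""
    return re.sub(r"[^a-z0-9_\s]", " ", snippet.lower()).split()

def to_indices(tokens, vocab):
    """Map tokens to integer indices and pad the message with zeros."""
    ids = [vocab.get(t, 0) for t in tokens][:MAX_WORDS]   # 0 = unknown / padding
    return ids + [0] * (MAX_WORDS - len(ids))

vocab = {"error": 1, "exit": 2, "code": 3, "85": 4}
print(to_indices(clean("Error: exit code 85"), vocab)[:6])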
• Train with the Word2vec Skip-Gram algorithm: the model learns to predict the context given a word (sketched below)
• A GRU encodes the embedding vectors of the error-log words into a single vector
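A minimal sketch of the Skip-Gram training step with gensim (corpus and hyper-parameters are placeholders, not the values used in the project):

from gensim.models import Word2Vec

# sg=1 selects the Skip-Gram algorithm: predict the context words given a word.
corpus = [
    ["error", "exit", "code", "85", "job", "failed"],
    ["stageout", "failure", "at", "t1_de_kit"],
]
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, sg=1, min_count=1)
print(w2v.wv["error"][:5])           # first entries of the learned embedding vector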
[Model diagram: each error log message is padded with zeros, its words are embedded, and a GRU encodes them; the final GRU states of all error logs are concatenated and passed through a dense layer for classification.]
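One possible Keras reading of the per-message encoder in the diagram (vocabulary size and layer dimensions are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

MAX_WORDS, VOCAB_SIZE, EMB_DIM, GRU_DIM = 400, 20000, 100, 64

log_in = keras.Input(shape=(MAX_WORDS,), dtype="int32", name="error_log")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(log_in)   # zero padding is masked
encoded = layers.GRU(GRU_DIM)(x)     # final GRU state = fixed-size representation of the log
encoder = keras.Model(log_in, encoded, name="log_encoder")
encoder.summary()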
NLP model for error logs
[Model diagram: the encoded error logs and the combined information are fed into a network of arbitrary depth with dimensionality reduction, e.g. a shared dense layer, ending in a binary classification.]
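One possible reading of this combination step, sketched with the Keras functional API (the number of logs, the count features and all layer sizes are assumptions):

from tensorflow import keras
from tensorflow.keras import layers

N_LOGS, GRU_DIM, N_COUNT_FEATURES = 10, 64, 500

encoded_logs = keras.Input(shape=(N_LOGS, GRU_DIM), name="encoded_logs")    # outputs of the log encoder
counts = keras.Input(shape=(N_COUNT_FEATURES,), name="error_site_counts")   # combined information

# A dense layer shared across the encoded logs reduces the dimensionality.
shared = layers.TimeDistributed(layers.Dense(16, activation="relu"))(encoded_logs)
x = layers.Concatenate()([layers.Flatten()(shared), counts])
x = layers.Dense(32, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid", name="acdc_vs_other")(x)     # binary classification

model = keras.Model([encoded_logs, counts], output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])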
Training and target
• ~33,000 samples
• Target classes: "acdc without modification" (retry only the failed jobs, no modification of memory or splitting) vs. all other actions ("Other": 11% of the samples)
• Sparse input: batch size = 4 → O(1 h) per epoch on an NVIDIA GeForce GTX 1080
Hyper-parameter Optimization
• Bayesian optimization with scikit-optimize (a sketch follows this list)
• 2-8 GPUs on the Caltech GPU cluster with NVIDIA GeForce GTX 1080 / Titan X
• Experimental: training models in parallel with spark_sklearn on SWAN, with the help of CERN IT
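A small sketch of how such a Bayesian search could be driven with scikit-optimize (the search space and the dummy objective are placeholders; the actual objective would train the model and return the validation loss):

from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(1e-5, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(16, 256, name="gru_units"),
]

def objective(params):
    learning_rate, gru_units = params
    # ... build and train the model with these hyper-parameters, return validation loss ...
    return (learning_rate - 1e-3) ** 2 + (gru_units - 64) ** 2 * 1e-6   # dummy stand-in

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print(result.x, result.fun)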
Outlook
• Further investigation of data and model → reimplement in PyTorch / TensorFlow and exploit the NNLO framework
• Get involved with the NLP community and learn from their experience
Word embeddings for log messages
• Visualization: non-linear dimensionality reduction of the word embeddings to 2D with the t-SNE algorithm (sketched below)
[Plot: 2D t-SNE map of the word embeddings; labeled group: sites]
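The 2D projection can be produced with scikit-learn's t-SNE; here random vectors stand in for the trained word2vec embeddings:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = ["error", "exit", "code", "stageout", "t1_de_kit", "t2_ch_cern"]
vectors = np.random.rand(len(words), 100)        # placeholder for the trained word vectors

coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()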
Message representation: RNN + Attention
• Not all words are equally important → the GRU returns all states and the message representation is an attention-weighted sum of them (sketched below)
• Reference: https://github.com/Hsankesara/DeepResearch/tree/master/Hierarchical_Attention_Network
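A minimal Keras sketch of this idea (not the reference implementation linked above): a small dense layer scores every GRU state and the message representation is the softmax-weighted sum of the states; all dimensions are illustrative.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class AttentionPool(layers.Layer):
    """Attention pooling: weighted sum of the GRU states."""
    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.score = layers.Dense(units, activation="tanh")
        self.context = layers.Dense(1, use_bias=False)

    def call(self, states):                                   # states: (batch, words, dim)
        weights = tf.nn.softmax(self.context(self.score(states)), axis=1)
        return tf.reduce_sum(weights * states, axis=1)        # (batch, dim)

inputs = keras.Input(shape=(400,), dtype="int32")
x = layers.Embedding(20000, 100)(inputs)
states = layers.GRU(64, return_sequences=True)(x)             # GRU returns all states
message = AttentionPool()(states)
model = keras.Model(inputs, message)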
Model for averaged w2v vectors
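A rough sketch of the idea behind this title: represent each message by the average of its word2vec vectors before a dense classifier (the classifier itself is omitted; the word vectors below are random stand-ins):

import numpy as np

DIM = 100
word_vectors = {"error": np.random.rand(DIM), "exit": np.random.rand(DIM), "code": np.random.rand(DIM)}

def average_w2v(tokens, vectors, dim=DIM):
    """Average the word vectors of the known tokens; zero vector if none are known."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(average_w2v(["error", "exit", "code", "85"], word_vectors).shape)   # (100,)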
Best hyperparameters for the RNN model
• Batch size = 4
• Maximum number of words = 400
Actions