
Seminar

Internal Evaluation Phase - 1 & 2


Automated Log Parsing for Large-Scale Log Data Analysis

Shivaraj Abbigeri
1RV17SCS16
Dr. H K Krishnappa

Department of Computer Science and Engineering


R V College of Engineering, Bengaluru
Introduction
• Logs are widely used in system management for dependability assurance because they are often the only data available that record detailed system runtime behaviors in production.

• As the size of logs constantly increases, developers (and operators) intend to automate their analysis by applying data mining methods, which require structured input data (e.g., matrices).

Continued...
• This has triggered a number of studies on log parsing, which aims to transform free-text log messages into structured events.

• In general, logs are unstructured text generated by logging statements (e.g., printf(), Console.WriteLine()) in system source code.

Continued...
• The traditional method of log analysis, which relies largely on manual inspection and is labor-intensive and error-prone, has been complemented by automated log analysis techniques.

• Typical examples of log analysis techniques include:
– Anomaly detection
– Program verification
– Problem diagnosis
– Security assurance
Review of Literature
Sl.No 1
Title: Detecting Large-Scale System Problems by Mining Console Logs
Authors: W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan
Year: 2009
Description: Discusses logs, which consist of a voluminous intermixing of messages from many software components, and proposes a general methodology to mine this rich source of information to automatically detect system runtime problems.

Sl.No 2
Title: Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis
Authors: Q. Fu, J. Lou, Y. Wang, and J. Li
Year: 2009
Description: Proposes an unstructured log analysis technique for anomaly detection, together with a novel algorithm to convert free-form text messages in log files to log keys.

Review of Literature
Sl.No 3
Title: Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems
Authors: K. Nagaraj, C. Killian, and J. Neville
Year: 2012
Description: Proposes a tool that uses machine learning techniques to compare system behaviors extracted from the logs and automatically infer the strongest associations between system components and performance.

Sl.No 4
Title: Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data
Authors: A. Oprea, Z. Li, T. Yen, S. H. Chin, and S. Alrwais
Year: 2015
Description: Discusses the importance of detecting infections at an early stage and proposes a framework to detect anomalies.
Technical Relevance

• Log analysis techniques comprise three steps:
– Log parsing
– Matrix generation
– Log mining

• The goal of log parsing is to transform raw log messages into a sequence of structured events, which facilitates subsequent matrix generation and log mining.
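As an illustration, the three steps above can be sketched in a few lines of Python. The log messages and regex rules below are invented for illustration, not the paper's actual rules:

```python
import re
from collections import Counter

# Hypothetical raw log messages.
raw_logs = [
    "Connection from 10.0.0.5 closed",
    "Connection from 10.0.0.9 closed",
    "Block blk_4023 served to 10.0.0.5",
]

def parse(msg):
    # Log parsing: mask variable parts with wildcards to obtain a structured event.
    msg = re.sub(r"blk_\d+", "*", msg)              # block identifiers
    msg = re.sub(r"\d+\.\d+\.\d+\.\d+", "*", msg)   # IP addresses
    return msg

events = [parse(m) for m in raw_logs]

# Matrix generation: an event-count vector for this batch of logs,
# which downstream log mining (e.g., anomaly detection) can consume.
counts = Counter(events)
print(counts["Connection from * closed"])  # -> 2
```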
Continued...
Parallel Log Processing (POP)
• A good log parsing method should fulfill the following requirements:
– Accuracy. The parsing accuracy should be high.
– Efficiency. The running time of a log parser should be as short as possible.
– Robustness. The parser should perform consistently across different types of logs.

• POP is designed to fulfill the above requirements.
Continued...
POP processes the logs in five steps.
• Step 1: Preprocess by Domain Knowledge
– Simple preprocessing using domain knowledge can improve parsing accuracy, so raw logs are preprocessed in this step.
– In this step, the variable parts and the constant parts of the logs are separated using simple regular expressions.
• Step 2: Partition by Log Message Length
– In this step, POP puts logs with the same log message length into the same group. Log message length is the number of tokens in a log message.
– This heuristic is based on the assumption that logs with the same log event will likely have the same log message length.
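A minimal sketch of Steps 1 and 2, assuming invented regex rules and sample messages:

```python
import re
from collections import defaultdict

def preprocess(msg):
    # Step 1: use domain-knowledge regexes (assumed here) to mask variable parts.
    msg = re.sub(r"\d+\.\d+\.\d+\.\d+", "*", msg)  # IP addresses
    msg = re.sub(r"\b\d+\b", "*", msg)             # bare numbers
    return msg

def partition_by_length(messages):
    # Step 2: group log messages by token count (their log message length).
    groups = defaultdict(list)
    for msg in messages:
        tokens = preprocess(msg).split()
        groups[len(tokens)].append(tokens)
    return groups

logs = [
    "Opened session 41 for user root",
    "Opened session 77 for user alice",
    "Connection reset by 10.0.0.2",
]
groups = partition_by_length(logs)
print(sorted(groups))  # -> [4, 6]
```

The two "Opened session ..." messages land in the same group because they share a token count of six, while the four-token message is kept apart.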
Continued...
• Step 3: Recursively Partition by Token Position
– In step 3, each group is recursively partitioned into subgroups, where each subgroup contains logs with the same log event (i.e., the same constant parts).
– This step assumes that if the logs in a group have the same log event, the tokens in some token positions should be the same.
• Step 4: Generate Log Events
– In this step, POP scans all the logs in each group and generates the corresponding log event, which is a line of text containing constant parts and variable parts.
– The constant parts are represented by tokens and the variable parts are represented by wildcards.
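Steps 3 and 4 can be sketched as follows. The splitting threshold and sample tokens are illustrative assumptions, not POP's exact rules:

```python
from collections import defaultdict

def split_group(group, max_distinct=2):
    # Step 3 (one level of the recursion): find a token position with a small
    # number of distinct values and partition the group by that token.
    length = len(group[0])
    for pos in range(length):
        distinct = {tokens[pos] for tokens in group}
        if 1 < len(distinct) <= max_distinct:
            subgroups = defaultdict(list)
            for tokens in group:
                subgroups[tokens[pos]].append(tokens)
            return list(subgroups.values())
    return [group]

def generate_event(group):
    # Step 4: positions where all logs agree become constants; the rest
    # become wildcards.
    return " ".join(col[0] if len(set(col)) == 1 else "*" for col in zip(*group))

group = [["Opened", "session", "41"], ["Opened", "session", "77"],
         ["Closed", "session", "13"], ["Closed", "session", "99"]]
for sub in split_group(group):
    print(generate_event(sub))
```

Here the first token position splits the group into "Opened" and "Closed" subgroups, each of which then yields one log event with a wildcard for the session number.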
Continued...

• Step 5: Merge Groups by Log Event
– In this step, POP employs hierarchical clustering to cluster similar groups based on their log events. The groups in the same cluster are merged, and a new log event is generated by calculating the Longest Common Subsequence (LCS) of the original log events.
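The LCS computation at the heart of the merge can be sketched with the standard dynamic program over tokens (the sample events are invented):

```python
def lcs(a, b):
    # Classic dynamic-programming Longest Common Subsequence over token lists.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover the common subsequence itself.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def merge_events(e1, e2):
    # Merge two log events into the new event for the merged group.
    return " ".join(lcs(e1.split(), e2.split()))

print(merge_events("Failed to open file *", "Failed to open socket *"))
# -> "Failed to open *"
```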

Overview of POP
Implementation

Experimental Results
• F-measure is used as the evaluation metric for parsing accuracy.
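For reference, the F-measure is the harmonic mean of precision and recall. The counts below are made-up illustration values, not the paper's results:

```python
def f_measure(tp, fp, fn):
    # tp: correctly grouped pairs, fp: wrongly grouped, fn: wrongly separated.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustration only: precision = 0.9, recall = 0.75.
print(round(f_measure(tp=90, fp=10, fn=30), 3))  # -> 0.818
```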

Tools Used
• Commercial Tools
– Splunk
– Logentries
– Logmatic
• Open Source Tools
– Graylog
– Logstash
– Logz.io

Sustainability
When logs grow to a large scale (e.g., 200 million log messages), which is common in practice, traditional parsers are not efficient enough to handle such data on a single computer. This limitation is overcome by implementing a parallel log parser (namely POP) on top of Spark.
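Spark itself is not sketched here; the same data-parallel idea (partition the logs, parse each partition independently, then combine the results) can be illustrated with Python's multiprocessing. The log format and regex are invented:

```python
import re
from multiprocessing import Pool

def parse(msg):
    # Mask bare numbers with a wildcard (an assumed, illustrative rule).
    return re.sub(r"\b\d+\b", "*", msg)

def parse_partition(partition):
    # Each worker parses one partition of the log data independently.
    return [parse(m) for m in partition]

if __name__ == "__main__":
    logs = [f"served block {i} to client {i % 7}" for i in range(1000)]
    chunks = [logs[i::4] for i in range(4)]  # 4 partitions
    with Pool(4) as pool:
        parsed = [m for part in pool.map(parse_partition, chunks) for m in part]
    print(len(parsed))  # -> 1000
```

In the Spark version of this idea, the partitions and the per-partition map would be handled by the cluster rather than by local worker processes.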

Conclusion
Automated log parsing for the large-scale log analysis of modern systems is faster and more reliable than manual processing of the logs.

The parallel log parsing method (POP) employs specially designed heuristic rules and a hierarchical clustering algorithm built on top of Spark, which performs accurately and efficiently on large-scale log data.

REFERENCES

[1] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, "Towards automated log parsing for large-scale log data analysis," IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 6, Nov.-Dec. 2018.

[2] W. Xu, L. Huang, A. Fox, D. Patterson, and M. Jordan, "Detecting large-scale system problems by mining console logs," in SOSP'09: Proc. of the ACM Symposium on Operating Systems Principles, 2009.

[3] Q. Fu, J. Lou, Y. Wang, and J. Li, "Execution anomaly detection in distributed systems through unstructured log analysis," in ICDM'09: Proc. of the International Conference on Data Mining, 2009.

[4] K. Nagaraj, C. Killian, and J. Neville, "Structured comparative analysis of systems logs to diagnose performance problems," in NSDI'12: Proc. of the 9th USENIX Symposium on Networked Systems Design and Implementation, 2012.

[5] A. Oprea, Z. Li, T. Yen, S. Chin, and S. Alrwais, "Detection of early-stage enterprise infection by mining large-scale log data," in DSN'15: Proc. of the IEEE/IFIP International Conference on Dependable Systems and Networks, 2015.
