
© 2019 SPLUNK INC.

Splunk 4 Ninjas - ML
Hands on Intro to Splunk Machine Learning Toolkit

14 October 2020

Forward-Looking Statements

During the course of this presentation, we may make forward-looking statements regarding
future events or plans of the company. We caution you that such statements reflect our
current expectations and estimates based on factors currently known to us and that actual
events or results may differ materially. The forward-looking statements made in this
presentation are being made as of the time and date of its live presentation. If reviewed after
its live presentation, it may not contain current or accurate information. We do not assume
any obligation to update any forward-looking statements made herein.

In addition, any information about our roadmap outlines our general product direction and is
subject to change at any time without notice. It is for informational purposes only, and shall
not be incorporated into any contract or other commitment. Splunk undertakes no obligation
either to develop the features or functionalities described or to include any such feature or
functionality in a future release.

Splunk, Splunk>, Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud, Splunk
Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States
and other countries. All other brand names, product names, or trademarks belong to their
respective owners. © 2019 Splunk Inc. All rights reserved.

Agenda

• Welcome / Introduction
• Intro Machine Learning @ Splunk
• Demo Machine Learning Toolkit with Q&A
• Intro to the Trackday Dataset
• Four Different Challenges (~ 30min each)
• Challenge 1
– Explore the track_day.csv Dataset
• Challenge 2
– Detect Numeric Outliers
• Challenge 3
– Supervised Learning: Predict Categorical Fields
• Challenge 4
– Unsupervised Learning: Clustering

• Wrap Up, Discussion and Feedback



Who are we?

• Tanzil Kazi (Host)
• Jenny Seow (Panelist)
• Dean Moreton (Panelist)

Disclaimer
What this session is not about and what it is about

• NO replacement for a PhD in machine learning, data science or AI


• NO replacement for Splunk’s Education class for Data Science
• NO comprehensive lecture about all possible concepts and algorithms in ML … but,

• YES first introduction to Machine Learning @ Splunk
• YES getting to know Splunk’s Machine Learning Toolkit
• YES guided hands-on challenges to explore a few typical ML tasks
Start your own Splunk
instance!
http://splunk4ninjas.com/5110/self_register/


Machine Learning Tour



Splunk customers want answers from their data


Anomaly Detection
► Deviation from past behavior
► Deviation from peers (aka Multivariate AD or Cohesive AD)
► Unusual change in features
► ITSI Metric Anomaly Detection

Predictive Analytics
► Predict Service Health Score/Churn
► Predicting Events
► Trend Forecasting
► Detecting influencing entities
► Early warning of failure

Clustering
► Identify peer groups
► Event Correlation
► Reduce alert noise
► Behavioral Analytics
► ITSI Event Analytics

Joined late? Register on http://splunk4ninjas.com/5110/self_register/ to set up your own Splunk instance.


Environment takes 5-10 minutes to spin up. Credentials are admin/changeme, available for 24 hours.

Types of Machine Learning


Supervised Learning (labeled data)
► Regression
► Classification

Unsupervised Learning (unlabeled data)
► Clustering
► Anomaly Detection

Semi-Supervised Learning (with reinforcement or feedback)
► Human in the Loop
► Autonomous Systems


Overview of Machine Learning at Splunk

• Core Platform Search
• Packaged Premium Solutions
• Machine Learning Toolkit

Platform for Operational Intelligence


Skill Areas for Machine Learning @ Splunk


Domain Expertise (IT, Security…)
• Identify use cases
• Drive decisions
• Understanding of business

Splunk Expertise
• Searching
• Reporting
• Alerting
• Workflow

Data Science Expertise
• Statistics/math background
• Algorithm selection
• Model building

Premium solutions (ITSI, UBA) provide out of the box ML capabilities.
Splunk ML Toolkit (MLTK) facilitates and simplifies via examples & guidance.


What Data Scientists Really Do


Data Preparation accounts for about 80% of the work of data scientists

“Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes Mar 23, 2016


Custom ML with the Splunk Platform


Ecosystem: Splunk’s App Ecosystem contains thousands of free add-ons for getting data in,
applying structure and visualizing your data, giving you faster time to value.
MLTK: The Machine Learning Toolkit delivers new SPL commands, custom
visualizations, assistants, and examples to explore a variety of ML concepts.
Splunk: Splunk Enterprise is the mission-critical platform for indexing, searching,
analyzing, alerting and visualizing machine data.

Operationalized Data Science Pipeline

1. Collect Data
2. Clean & Munge
3. Search & Explore
4. Pre-processing, Feature Selection
5. Choose Algorithm
6. Build, Test, Improve Models
7. Operationalize (Monitor, Alert)
8. Visualize & Share

Ecosystem and MLTK cover individual stages; Splunk underlies all of them.

Platform for Operational Intelligence



Continuous Data Ingest at Scale


[Diagram: data from Industrial Assets, Consumer and Mobile Devices, OT, and IT is ingested in real time so that Engineers, Data Analysts, Security Analysts, and Business Users can Search, Alert, Visualize, Predict, and Develop.]

Ingest options shown:
• Industrial Data: SCADA, AMI, Meter Reads
• Native Inputs: TCP, UDP, Logs, Scripts, Wire, Mobile
• Modular Inputs: MQTT, AMQP, COAP, REST, JMS
• HTTP Event Collector: Token Authenticated Events
• Technology Partnerships: Kepware, AWS IoT, Cisco, Palo Alto
• External Lookups/Enrichment: Asset Info, Maintenance Info, Data Stores


Sense and Respond


Every Search Can Use Machine Learning

[Diagram: data from Industrial Assets, Consumer and Mobile Devices, OT, and IT flows through Real Time, Search, and Alert, which can trigger the following responses.]

• Flash lights
• Send an email (Email)
• File a ticket (Tickets)
• Trigger process flow (Third-Party Applications)
• Send a text (Smartphones and Devices)


MLTK + Python for Scientific Computing


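| fit y from x* into "model"
| apply "model"

[Diagram: data from Industrial Assets, Consumer and Mobile Devices, OT, and IT flows through Real Time, Search, Alert, and Visualize. The fit command uses Python for Scientific Computing to produce a persisted model, which the apply command then uses.]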


Splunk Machine Learning Toolkit (MLTK)


Extends Splunk platform functions and provides a guided modeling environment

Built for the Citizen Data Scientist


• Experiments and Assistants: Guided model building,
testing, and deployment for common objectives
• Algorithms: 80+ standard algorithms (supervised &
unsupervised)

Extensible to operationalize any use case


• Python for Scientific Computing Library:
Access to 300+ open source algorithms
• Deep Learning Toolkit: Supports neural networks (NN) and
GPU-accelerated machine learning
• ML-SPL API: Import any open-source or proprietary
algorithm


Example: MLTK powered DGA App for Splunk


Detect Malicious Domain Names using Machine Learning


Overview of ML including DL at Splunk


(not covered in this workshop)

• Core Platform Search
• Packaged Premium Solutions
• Machine Learning Toolkit + Deep Learning Toolkit

Platform for Operational Intelligence


DLTK for Splunk


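| fit y from x* into "model"
| apply "model"

[Diagram: data from Industrial Assets, Consumer and Mobile Devices, OT, and IT flows through Real Time, Search, Alert, and Visualize. The fit command produces a persisted model, which the apply command then uses.]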


Hyatt & Splunk machine learning


1. Improve online check-in experience by using ML to determine
potential issues before they occur.

2. Predictive analytics used for hotel room occupancy.


3. Forecasting the likely Wi-Fi logins based on each property, day of
week, and local and global holidays, two days into the future, to
show our expectations to executives.

4. Anomaly detection for security purposes.



Demo:
Machine Learning Toolkit

Before we get started…

> Follow along for the labs using the ‘handrail’ guide:
https://bit.ly/Splunk_ML_Lab
> Username: admin
> Password: changeme

> Quick reference guide for future reference, if you want to use the MLTK on
your own: https://bit.ly/MLTK_guide

Hands-on Challenges

Fun Facts about the Track Day dataset


A popular private event in the early days among Splunkers with an affinity for racing and sports cars.

Simple concept: go on a race track, have fun and collect some car data to get insights about driving behavior etc.

A subset of this data is available in MLTK!

Image Source: https://www.youtube.com/watch?v=meBjI-ay9-U


Today’s Challenges

We are going to create four dashboards:

1. Explore the Dataset: Create a sample dataset and explore it using SPL and different types of visualizations

2. Detect Numeric Outliers: Explore the MLTK showcase and adapt it to start a new experiment with your own dataset

3. Use a Classification Model: Create a classification model and use it to predict vehicle types from your sensor data

4. Use a Clustering Model: Create a clustering model and use it to analyze your dataset

We’re aiming for a dashboard like this!

Workshop Goals

• Getting to know Splunk in the context of Machine Learning

• Prepare and analyze a dataset and summarize results on 4 dashboards



Challenge 1: Explore the dataset


Create a Sample Dataset

1. Access the Splunk Machine Learning Toolkit
2. Change to the Search Tab
3. Insert your search query

? What’s the benefit of renaming variables?
Use Fieldsummary to Explore your Dataset

Eliminate unwanted fields with
| fields - values

? What’s going on with the engine coolant temperature?
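To make concrete what a per-field summary can reveal, here is a minimal plain-Python sketch of the idea. It is not Splunk's fieldsummary implementation; the function name `field_summary` and the three sample rows are hypothetical, and only the field names are borrowed from the workshop dataset.

```python
from statistics import mean


def field_summary(rows):
    """Summarize each field: non-null count, distinct count, numeric min/max/mean."""
    fields = sorted({name for row in rows for name in row})
    summary = {}
    for name in fields:
        # Keep only values that are actually present (None marks a missing value).
        values = [row[name] for row in rows if row.get(name) is not None]
        numeric = [v for v in values if isinstance(v, (int, float))]
        summary[name] = {
            "count": len(values),
            "distinct_count": len(set(values)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "mean": mean(numeric) if numeric else None,
        }
    return summary


# Hypothetical sample rows, not taken from track_day.csv.
rows = [
    {"x_batteryVoltage": 13.1, "x_engineCoolantTemperature": 88.0},
    {"x_batteryVoltage": 13.4, "x_engineCoolantTemperature": None},
    {"x_batteryVoltage": 12.8, "x_engineCoolantTemperature": 91.0},
]
summary = field_summary(rows)
```

A field whose count is lower than the number of rows has missing values, which is the kind of hint such a summary gives before any modeling starts.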
Explore your Dataset with Visualizations

1
Using Splunk MLTK’s Histogram Macro

Check Macro in Settings
OR
Check Macro with Cmd + Shift + E (Mac) or Ctrl + Shift + E (Windows)
3 Adjusting the Histogram Macro

? How can we get from the top to the bottom histogram?
3 Adjust the Macro to Split by Vehicle Type

| stats count by x_batteryVoltage

x_batteryVoltage   count
12.78-12.79        1
13.16-13.17        3
13.46-13.47        1

| chart count over x_batteryVoltage by y_vehicleType

x_batteryVoltage   Ferrari   Audi   BMW   Chevrolet   Ford
13                 0         0      0     0           1
14                 0         0      0     1           1
15                 1         0      1     0           0
16                 1         1      1     0           0
17                 1         0      0     0           1
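The difference between the two result shapes above can be sketched in plain Python: counting by one field gives a single count column, while counting over one field split by a second gives one column per value of the second field. The helper names `count_by` and `chart_count` are hypothetical; the sample rows reproduce the first three rows of the split table.

```python
from collections import Counter, defaultdict


def count_by(rows, field):
    """One count per distinct value of `field` (a single count column)."""
    return Counter(row[field] for row in rows)


def chart_count(rows, over, by):
    """One row per `over` value and one column per `by` value (a pivot of counts)."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[over]][row[by]] += 1
    return table


# Rows matching the 13, 14 and 15 volt lines of the split table.
rows = [
    {"x_batteryVoltage": 13, "y_vehicleType": "Ford"},
    {"x_batteryVoltage": 14, "y_vehicleType": "Chevrolet"},
    {"x_batteryVoltage": 14, "y_vehicleType": "Ford"},
    {"x_batteryVoltage": 15, "y_vehicleType": "Ferrari"},
    {"x_batteryVoltage": 15, "y_vehicleType": "BMW"},
]
totals = count_by(rows, "x_batteryVoltage")
pivot = chart_count(rows, "x_batteryVoltage", "y_vehicleType")
```

The pivot keeps the same bins but shows which vehicle type contributes to each bin, which is what makes the histogram comparable across cars.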
Working with the Boxplot Macro

? How can this query be improved?

Hints:
> Scale numeric values using the fit command with the StandardScaler
Explore the Dataset with Box Plots

> Standardized data fields have a mean of 0 and a standard deviation of 1
> The box plots are less stretched and can be analyzed more easily
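The effect of standardization can be checked with a few lines of plain Python. This is a sketch of the arithmetic only (subtract the mean, divide by the standard deviation), not the MLTK or scikit-learn code; the function name `standardize` and the sample voltages are hypothetical. The population standard deviation is used here, which is what scikit-learn's StandardScaler uses as well.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant field carries no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical battery voltage readings.
voltages = [12.8, 13.1, 13.4, 14.0]
scaled = standardize(voltages)
```

Because every field ends up on the same scale, fields with large raw ranges no longer dominate a shared box plot axis.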

15 minute break

Challenge 2: Detect Numeric Outliers


Detect Numeric Outliers:

Explore the MLTK showcase and adapt it to start a new experiment with your own dataset

> Explore the Outlier Detection Showcases
> Start your own Outlier Detection Experiment
> Optionally try to compare different outlier detection approaches
Explore the Outlier Detection Showcases

> Switch to the Showcase tab of the MLTK and explore the assistant to detect outliers in server response time
> We are now going to use statistics to detect the outliers
Explore the Outlier Detection Showcases

> Pick an appropriate threshold method (e.g. standard deviation +/- 3)
> View the SPL query corresponding to the assistant’s settings
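The standard deviation threshold method can be sketched in plain Python as follows. This illustrates the rule only (flag anything further than a multiple of the standard deviation from the mean) and is not the SPL the assistant generates; the function name `stddev_outliers` and the response times are hypothetical.

```python
from statistics import mean, pstdev


def stddev_outliers(values, multiplier=3.0):
    """Return the values lying outside mean +/- multiplier * standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    lower = mu - multiplier * sigma
    upper = mu + multiplier * sigma
    return [v for v in values if v < lower or v > upper]


# Hypothetical server response times in milliseconds with one obvious spike.
response_times = [100, 102, 98, 101, 99] * 4 + [1000]
outliers = stddev_outliers(response_times)
```

Note that the spike itself inflates both the mean and the standard deviation, so with very few data points an extreme value can hide behind the threshold it helped widen.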
Detecting Outliers with the Density Function

> Switch to the Experiments tab of the MLTK and create a new experiment
> Instead of an approach based on statistics
we are now going to use the density function to detect outliers
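As a rough illustration of the density idea, the sketch below fits a normal distribution to the data and flags values that fall in its low-probability tails. This is an assumption-laden simplification: how the MLTK chooses a distribution and defines its threshold is not reproduced here, and the function name `density_outliers`, the two-sided tail definition, and the sample data are all hypothetical.

```python
from statistics import NormalDist


def density_outliers(values, threshold=0.01):
    """Flag values whose two-sided tail probability under a fitted normal is below threshold."""
    dist = NormalDist.from_samples(values)
    if dist.stdev == 0:
        # All values identical: nothing can be called unlikely.
        return []
    flagged = []
    for v in values:
        left_tail = dist.cdf(v)
        two_sided = 2 * min(left_tail, 1 - left_tail)
        if two_sided < threshold:
            flagged.append(v)
    return flagged


# Hypothetical server response times in milliseconds with one obvious spike.
response_times = [100, 102, 98, 101, 99] * 4 + [1000]
flagged = density_outliers(response_times)
```

Compared with a fixed multiple of the standard deviation, the threshold here is expressed as a probability, which is easier to reason about when the shape of the data changes.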
Create Your Own Smart Outlier Experiment

2
Click here to get to the
next step

Look up the dataset you want to work with


Create Your Own Smart Outlier Experiment

Use these settings to get the result on the right

Challenge 3: Use a classification model



SPL for MLTK: The fit and apply Commands


3
Examples:
<your search> | fit <algorithm> <fields> into <model name>
<your search> | apply <model name>

> The fit command produces a machine learning model based on the
behaviour of a set of events. It applies the model to the current search
results in the search pipeline
> The apply command applies the machine learning model
that was learned using the fit command
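The split between learning a model and applying it later can be sketched in plain Python. The example uses a deliberately simple nearest-centroid classifier as a stand-in for whatever algorithm is chosen; it is not how the MLTK stores or applies models. The function names, the feature names `speed` and `rpm`, and the training rows are hypothetical.

```python
from statistics import mean


def fit(rows, target, features):
    """Learn one centroid per class and return it as a plain-dict 'model'."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[target], []).append([row[f] for f in features])
    centroids = {
        label: [mean(column) for column in zip(*points)]
        for label, points in grouped.items()
    }
    return {"features": list(features), "centroids": centroids}


def apply(model, row):
    """Predict the class whose centroid is closest to the row."""
    point = [row[f] for f in model["features"]]

    def squared_distance(centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))

    return min(model["centroids"], key=lambda label: squared_distance(model["centroids"][label]))


# Hypothetical training events.
training = [
    {"vehicle": "Ferrari", "speed": 200, "rpm": 7000},
    {"vehicle": "Ferrari", "speed": 210, "rpm": 7200},
    {"vehicle": "Ford", "speed": 120, "rpm": 3000},
    {"vehicle": "Ford", "speed": 130, "rpm": 3200},
]
model = fit(training, target="vehicle", features=["speed", "rpm"])
prediction = apply(model, {"speed": 205, "rpm": 7100})
```

The point of the separation is that the model returned by `fit` can be kept and reused on new events without retraining.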

SPL for MLTK: Adapting fit and apply


3
Examples:
<your search> | fit StandardScaler <fields> into <model name>
<your search> | apply <model name> | `<macro name>`
<your search> | fit SVM "X X X" from "XXX" "XXX" kfold_cv=3

Check out the confusion matrix and classification statistics macros!

> The StandardScaler algorithm uses the scikit-learn StandardScaler algorithm to standardize data fields
> Splunk’s MLTK allows you to cross-validate your models right from the search queries that train them. Simply specify the number of cross-validation folds you want by setting the fit command’s parameter kfold_cv
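What a fold count of 3 means can be sketched in plain Python: the data is split into three parts, and each part takes one turn as the test set while the other two are used for training. This shows the general k-fold idea only; whether the MLTK shuffles or stratifies its folds is not covered here, and the function name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# Ten events split into three folds.
folds = list(kfold_indices(10, 3))
```

Every event is used for testing exactly once, so the averaged score is less dependent on one lucky or unlucky split.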
Use a Classification Model:

Create a classification model and use it to predict vehicle types from your sensor data

> Explore the Classification Assistant
> Put your Algorithm into Practice
> Optionally find a way to deal with model overfitting
Explore the Classification Assistant

3
Option 1 – Create New Experiment

Option 2 – Use the showcase


‘Showcase → Predict Fields → Predict Categorical Fields → Predict Vehicle Make and Model’
Explore the Classification Assistant

? Why is SVM doing so badly?



3 Normalising data using pre-processing

Now run the same query again using SVM, this time with the SS_* fields as predictors, and you should see much better results!
Alternatively, you can use the ‘RandomForestClassifier’ algorithm.
Save your Classification Model

Publish your model in the app of your choice
3 Apply your Classification Model

3
3 Which Car Gets Classified Worst?

3
? How can you find out where your model is off?
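One way to approach this question is a confusion matrix: count every (actual, predicted) pair and look at which classes are most often mistaken for another. The plain-Python sketch below shows the idea; it is not the MLTK confusion matrix macro, and the helper names and the labels in the sample data are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def per_class_recall(actual, predicted):
    """Share of each actual class that was predicted correctly."""
    matrix = confusion_matrix(actual, predicted)
    totals = Counter(actual)
    return {label: matrix[(label, label)] / totals[label] for label in totals}


# Hypothetical labels for seven events.
actual = ["Audi", "Audi", "Audi", "BMW", "BMW", "Ford", "Ford"]
predicted = ["Audi", "BMW", "BMW", "BMW", "BMW", "Ford", "Audi"]

matrix = confusion_matrix(actual, predicted)
recall = per_class_recall(actual, predicted)
worst = min(recall, key=recall.get)
```

The off-diagonal cells show not only that a class is misclassified but also what it is confused with, which often points at the fields that fail to separate the two.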

Challenge 4: Use a clustering model


Use a Clustering Model:

Create a clustering model and use it to analyze your dataset

> Explore the Clustering Assistant
> Cluster Analysis of the mytrackdata-Dataset
> Optionally try and detect outliers
Explore the Cluster Showcases

> Switch to the Showcase tab of the MLTK and explore the assistant to identify clusters of events

The MLTK Comes with Many Different Algorithms


4
Example:
<your search> | fit PCA k=<int> <fields>

> Dimensionality reduction with an algorithm such as PCA can reduce the number of variables one must deal with
> The k parameter specifies the number of features (principal components) to be extracted from the data
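To show what extracting a component means, the sketch below finds only the first principal component (the k=1 case) with power iteration on the covariance matrix and projects the data onto it. This is a teaching simplification in plain Python, not the algorithm the MLTK runs; the function names and the sample points are hypothetical, and power iteration can fail to converge if the starting vector happens to be orthogonal to the leading direction.

```python
from statistics import mean


def first_principal_component(rows, iterations=100):
    """Return (means, unit direction of greatest variance) via power iteration."""
    n_features = len(rows[0])
    means = [mean(column) for column in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Population covariance matrix; its scaling does not affect the direction.
    cov = [
        [sum(r[i] * r[j] for r in centered) / len(rows) for j in range(n_features)]
        for i in range(n_features)
    ]
    vec = [1.0 / n_features ** 0.5] * n_features
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(n_features)) for i in range(n_features)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            break  # No variance at all: keep the starting direction.
        vec = [x / norm for x in nxt]
    return means, vec


def project(rows, means, component):
    """Replace each row by its single coordinate along the component."""
    return [sum((v - m) * c for v, m, c in zip(row, means, component)) for row in rows]


# Hypothetical two-field data lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = first_principal_component(points)
scores = project(points, means, component)
```

Here two correlated fields collapse into one score per event without losing information, which is the effect that makes a 3D scatter plot of many sensor fields possible.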

? Why is there a cluster with "clusterId: null" ?



The MLTK Comes with Many Different Algorithms


4

> We have missing values in "x_engineCoolantTemperature" that we didn't fix/impute in mytrackdata.csv

Let’s impute some values


4

| inputlookup mytrackdata.csv
| fit Imputer x_engineCoolantTemperature strategy="median"
| rename Imputed_* as *
| apply car_clustering_StandardScaler_0
| apply car_clustering
| table c* y_* SS_* *
| fit PCA k=3 SS_*
| rename y_vehicleType as clusterId, PC_1 as x, PC_2 as y, PC_3 as z
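The imputation step at the start of this search can be sketched in plain Python: compute the median of the values that are present and use it wherever the field is missing. This illustrates the median strategy only and is not the MLTK Imputer code; the function name `impute_median` and the sample rows are hypothetical.

```python
from statistics import median


def impute_median(rows, field):
    """Return copies of the rows with missing `field` values replaced by the median."""
    observed = [row[field] for row in rows if row.get(field) is not None]
    fill = median(observed)
    return [
        {**row, field: fill if row.get(field) is None else row[field]}
        for row in rows
    ]


# Hypothetical events; None marks a missing coolant temperature.
rows = [
    {"x_engineCoolantTemperature": 88.0},
    {"x_engineCoolantTemperature": None},
    {"x_engineCoolantTemperature": 91.0},
    {"x_engineCoolantTemperature": 95.0},
]
imputed = impute_median(rows, "x_engineCoolantTemperature")
```

Once every event has a value, it can be scaled and clustered like the others instead of ending up without a cluster assignment.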

Wrap Up

Wrap Up

• Don’t boil the ocean: start small or modify existing showcase examples for
some quick wins.

• Docs are your friend: in case you need help, the documentation is pretty
comprehensive. Also conf.splunk.com has 100+ sessions on ML.

You want to learn more about Splunk’s Machine Learning?


► Check our latest Splunk Blogs around Machine Learning
► Watch videos from Splunk Machine Learning YouTube Channel
► Take the Splunk Education Class for Data Science and Advanced Analytics
► Learn more about Splunk’s Machine Learning Advisory Program

Thank You

Additional Information

Login:

► Username: admin
► PW: changeme

Challenge Solution Examples:

We created a dashboard for each challenge with example solutions in the hidden
app “Splunk 4 Ninjas Machine Learning”. Use this app for preparation, debriefing
after the challenges or as assistance for inexperienced attendees.
► http://{your-host}:8000/en-GB/app/s4n_ml/splunk_4_ninjas_ml
► or click button next to “Splunk 4 Ninjas Machine Learning” on top of Home Dashboard
