S4N Machine Learning Lab Guide

© 2019 SPLUNK INC.
Hands-on Challenges
© 2019 SPLUNK INC.
Fun Facts about the Track Day dataset

A popular private event of racing and sportscar aficionado Splunkers in the early days.
Simple concept
Go on a race track, have

fun and collect some car
data to get insights about
driving behavior etc.
A subset of this data is

available in MLTK!
Image Source: https://www.youtube.com/watch?v=meBjI-ay9-U

Today’s challenges
© 2019 SPLUNK INC.
We are going to create four dashboards:
Explore the Dataset : Create a sample dataset and

1 explore it using different types of visualizations such as
SPL
Detect Numeric Outliers: Explore the MLTK showcase

2 and adapt it to start a new experiment with your own
dataset
3 Use a Classification Model: Create a classification

model and use it to predict vehicle types from your
sensor data
4 Use a Clustering Model: Create a clustering model and

and use it to analyze your dataset We’re aiming for a
dashboard like this!
© 2019 SPLUNK INC.
Workshop Goals
• Getting to know Splunk in the context of Machine Learning
• Prepare and analyze a dataset and summarize results on 4 dashboards

© 2019 SPLUNK INC.
Challenge 1: Explore the dataset

Create a Sample Dataset
© 2019 SPLUNK INC.
1
2. Change to the
Search Tab
1. Access the Splunk

Machine Learning 3. Insert your
Toolkit search query
?
What’s the benefit of
renaming variables?
Use fieldsummary to explore your dataset
© 2019 SPLUNK INC.
1
Eliminate unwanted
fields with
| fields - values
?
What’s going on
with the engine
coolant
temperature?
Explore your dataset with visualizations
© 2019 SPLUNK INC.
1
Using Splunk MLTK’s Histogram Macro
© 2019 SPLUNK INC.
OR
Check Macro in Check Macro with

Settings Cmd + Shift + E (Mac) or
Ctrl + Shift + E (Windows)
3
1 Adjusting the Histogram Macro
© 2019 SPLUNK INC.
How can we get from

? the top to the bottom
histogram?
3 Adjust the Macro to Split by Vehicle Type
© 2019 SPLUNK INC.
1
| stats count by x_batteryVoltage
x_batteryVoltage count
| chart count over x_batteryVoltage by y_vehicleType
12.78-12.79 1 x_batteryVoltage Ferrari Audi BMW Chevrolet Ford
13.16-13.17 3 13 0 0 0 0 1
13.46-13.47 1 14 0 0 0 1 1
15 1 0 1 0 0
16 1 1 1 0 0
17 1 0 0 0 1
Working with the Boxplot Macro
© 2019 SPLUNK INC.
1
? How can this query be improved?
Hints:
> Scale numeric values using
the fit command with the
StandardScaler
Explore the Dataset with Box Plots
© 2019 SPLUNK INC.
> Standardized data fields

have a mean of 0 and a
standard deviation of 1
> The box plots are less
stretched and can be
analysed more easily
© 2019 SPLUNK INC.
15 minute break
© 2019 SPLUNK INC.
Challenge 2: Detect Numeric Outliers

Detect Numeric Outliers:
© 2019 SPLUNK INC.
2 Explore the MLTK showcase and adapt it

to start a new experiment with your own dataset
> Explore the Outlier > Start your own Outlier > Optionally try to compare
Detection Showcases Detection Experiment different outlier detection
approaches
Explore the Outlier Detection Showcases
© 2019 SPLUNK INC.
> Switch to the Showcase tab

of the MLTK and explore the
assistant to detect outliers in
server response time
> We are now going to use
statistics to detect the
outliers
Explore the Outlier Detection Showcases
© 2019 SPLUNK INC.
Pick an appropriate threshold method (E.g.

View the corresponding SPL
Standard Deviation +/- 3)
query to the assistant’s
settings
Detecting Outliers with the Density Function
© 2019 SPLUNK INC.
> Switch to the Experiments tab of the MLTK and create a new experiment
> Instead of an approach based on statistics we are now going to use the
density function to detect outliers.
Create Your Own Smart Outlier Experiment
© 2019 SPLUNK INC.
2
Click here to get to the
next step
Look up the dataset you want to work

with
Create Your Own Smart Outlier Experiment
© 2019 SPLUNK INC.
Use these settings to get the result on the

right
© 2019 SPLUNK INC.
Challenge 3: Use a classification model

SPL for MLTK: The fit and apply commands
© 2019 SPLUNK INC.
3
Examples:
<your search> | fit <model name>
<your search> | apply <model name>
> The fit command produces a machine learning model based on the
behaviour of a set of events. It applies the model to the current
search results in the search pipeline
> The apply command applies the machine learning model
that was learned using the fit command
SPL for MLTK: Adapting fit and apply
© 2019 SPLUNK INC.
3
Examples:
<your search> | fit StandardScaler <fields> into <model name>
<your search> | apply <model name> | `<macro name>`
<your search> | fit SVM “X X X" from “XXX" “XXX" kfold_cv=3
Check out the confusion
matrix and classification
statistics macros!
> The StandardScaler algorithm uses the scikit-learn
StandardScaler algorithm to standardize data fields
> Splunk’s MLTK allows you to cross-validate your models
right from the search queries that train them. Simply
specify the number of cross-validation folds you want
by setting the fit command’s parameter kfold_cv
Use a Classification Model:
© 2019 SPLUNK INC.
3 Create a classification model and use it

to predict vehicle types from your sensor data
> Explore the Classification > Put your Algorithm > Optionally find a way to deal
Assistant into Practice with model overfitting
Explore the Classification Assistant
© 2019 SPLUNK INC.
3
Option 1 – Create New Experiment (Predict Categorical Fields)
Option 2 – Use the showcase

‘Showcase à Predict Fields à Predict Categorical Fields à Predict Vehicle Make and Model’
Explore the Classification Assistant
© 2019 SPLUNK INC.
? Why is SVM doing so bad?

© 2019 SPLUNK INC.
3 Normalising data using pre-processing
No need to add the y_vehicleType as a preprocess field, select only the x_* fields.
Now run the same query again using SVM, use the SS_* fields for predicting and you should see much better results!
Alternatively, you can use another algorithm, like ‘RandomForestClassifier’.
Save your Classification Model
© 2019 SPLUNK INC.
Publish your model in the

app of your choice
3
3 Apply your Classification Model
© 2019 SPLUNK INC.
3
3 Which Car Gets Classified Worst?
© 2019 SPLUNK INC.
? How can you find out where your model is off?

© 2019 SPLUNK INC.
Challenge 4: Use a clustering model

Use a Clustering Model:
© 2019 SPLUNK INC.
4 Create a clustering model and

use it to analyze your dataset
> Explore the Clustering > Cluster Analysis of the > Optionally try and detect
Assistant mytrackdata-Dataset outliers
Explore the Cluster Showcases
© 2019 SPLUNK INC.
> Switch to the Showcase tab

of the MLTK and explore the
assistant to identify clusters
of events
The MLTK Comes with Many Different Algorithms
© 2019 SPLUNK INC.
4
Example:
<your search> | fit PCA k=<int> <fields>
> Factor analysis with an algorithm such as PCA can

reduce the number of variables one must deal with
> The k parameter specifies the number of features
to be extracted from the data
? Why is there a cluster with "clusterId: null" ?

The MLTK Comes with Many Different Algorithms
© 2019 SPLUNK INC.
> We have missing values in

"x_engineCoolantTemperature”
for that we didn't fix/impute in
mytrackdata.csv
Let’s fix that and end on a high
© 2019 SPLUNK INC.
| inputlookup mytrackdata.csv
| fit Imputer x_engineCoolantTemperature strategy="median"
| rename Imputed_* as *
| apply car_clustering_StandardScaler_0
| apply car_clustering
| table c* y_* SS_* *
| fit PCA k=3 SS_*
| rename y_vehicleType as clusterId, PC_1 as x, PC_2 as y, PC_3 as z

S4N Machine Learning Lab Guide

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

S4N Machine Learning Lab Guide

Uploaded by

Copyright:

Available Formats

© 2019 SPLUNK INC.

Fun Facts about the Track Day dataset

Go on a race track, have

A subset of this data is

Image Source: https://www.youtube.com/watch?v=meBjI-ay9-U

We are going to create four dashboards:

Explore the Dataset : Create a sample dataset and

Detect Numeric Outliers: Explore the MLTK showcase

3 Use a Classification Model: Create a classification

4 Use a Clustering Model: Create a clustering model and

• Getting to know Splunk in the context of Machine Learning

• Prepare and analyze a dataset and summarize results on 4 dashboards

Challenge 1: Explore the dataset

1. Access the Splunk

Check Macro in Check Macro with

How can we get from

12.78-12.79 1 x_batteryVoltage Ferrari Audi BMW Chevrolet Ford

> Standardized data fields

Challenge 2: Detect Numeric Outliers

2 Explore the MLTK showcase and adapt it

> Switch to the Showcase tab

Pick an appropriate threshold method (E.g.

Look up the dataset you want to work

Use these settings to get the result on the

Challenge 3: Use a classification model

3 Create a classification model and use it

Option 2 – Use the showcase

? Why is SVM doing so bad?

3 Normalising data using pre-processing

Publish your model in the

? How can you find out where your model is off?

Challenge 4: Use a clustering model

4 Create a clustering model and

> Switch to the Showcase tab

> Factor analysis with an algorithm such as PCA can

? Why is there a cluster with "clusterId: null" ?

> We have missing values in

You might also like