
© 2019 SPLUNK INC.


Turning ITSI Up to 11!

Peter Thomas, Manager of Application Monitoring @ Fiserv


9/12/2019 | 1.0

Forward-Looking Statements

During the course of this presentation, we may make forward-looking statements regarding
future events or plans of the company. We caution you that such statements reflect our
current expectations and estimates based on factors currently known to us and that actual
events or results may differ materially. The forward-looking statements made in this
presentation are being made as of the time and date of its live presentation. If reviewed after
its live presentation, it may not contain current or accurate information. We do not assume
any obligation to update any forward-looking statements made herein.

In addition, any information about our roadmap outlines our general product direction and is
subject to change at any time without notice. It is for informational purposes only, and shall
not be incorporated into any contract or other commitment. Splunk undertakes no obligation
either to develop the features or functionalities described or to include any such feature or
functionality in a future release.

Splunk, Splunk>, Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud, Splunk
Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States
and other countries. All other brand names, product names, or trademarks belong to their
respective owners. © 2019 Splunk Inc. All rights reserved.

What Will We Learn Today?

• Strategies and specifics on how to get even more from ITSI!


• Creating powerful KPIs
• Optimizing ITSI performance, reducing stress on Splunk
• Decreasing Maintenance Effort
• Custom Alerting
– Rich email alerts
– Supporting “2 strikes” rule before alerting
– Supporting reoccurring maintenance windows
– Alerting and throttling by “product”

About Peter
20 years in Financial Technology industry
Software Developer/Manager 15 years

Service Operations for last 5 years


Responsible for monitoring strategy for web and mobile banking

From the Pacific Northwest, currently living outside Atlanta with wife and 2 kids

What Can We Do With ITSI?

• Eliminate the time spent assessing platform health
• Reduce dependency on subject matter experts
• Consolidate multiple granular alerts into a single holistic alert
• Provide a model for implementing large numbers of alertable metrics without
being overwhelmed by information or maintenance
• Create a “single pane of glass” where output from other monitors is fed into Splunk to
be processed by ITSI
• We currently have ~400 ITSI services and 3250+ KPIs

Reminder of the Building Blocks…



Let’s turn ITSI up to 11!



Beyond CPU monitoring

• Don’t limit ITSI entities to just being servers


• Entities can be dynamic and not preconfigured
• You can “split by entity” on a different field than you use to “filter by entity”

host=* index=perfmon collection=freediskspace counter="Free Megabytes" instance!=_Total
| eval drivehost=host + "-" + instance

What other entities might you split on?

• Per outbound url/domain your server makes an HTTP call to


• Per action/transaction type your application supports
• Per suspicious IP
• Per response/status code
• Per client/tenant/account

Perfect for any time the list of entities is dynamic and you don’t know in advance what the
complete list of entities will be

Taking It To The Next Level

• There isn’t a constraint on what kinds of Splunk queries can become an ITSI KPI
• Feel free to use stats, eventstats, lookups, bin, etc. for complex queries

For example: Is my web traffic balanced across all servers?

host=blue* sourcetype=iis
| stats count as RequestVolume by host
| eventstats mean(RequestVolume) as client_mean
| eval PercentVariance=round(((abs(RequestVolume-client_mean)/client_mean)*100), 0)
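
To make the arithmetic in the query above concrete, here is a minimal Python sketch of the same calculation outside Splunk: take each host's request count, compare it to the mean across hosts, and express the absolute difference as a whole percent. The host names and counts are made up for illustration and do not come from the presentation.

```python
from statistics import mean


def percent_variance(request_volume):
    """Return each host's absolute deviation from the mean request volume, as a whole percent."""
    client_mean = mean(request_volume.values())
    return {
        host: round(abs(count - client_mean) / client_mean * 100)
        for host, count in request_volume.items()
    }


# Hypothetical request counts per host (illustrative values only).
volumes = {"blue01": 900, "blue02": 1000, "blue03": 1100}
variance = percent_variance(volumes)
print(variance)
```

A perfectly balanced farm would score 0 on every host, so a KPI threshold on PercentVariance flags any server that drifts away from its peers.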

The Better You Monitor, the Harder It Becomes

Filtering Unwanted Data Via Lookups

NOT [| inputlookup Filtered_ip.csv | format]

host=blue* sourcetype=iis
NOT [| inputlookup Filtered_ip.csv | format]
| stats count as RequestVolume by host
| eventstats mean(RequestVolume) as client_mean
| eval PercentVariance=round(((abs(RequestVolume-client_mean)/client_mean)*100), 0)
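
The effect of the NOT [| inputlookup ... | format] clause can be sketched in Python as a simple exclusion against the rows of the lookup file: any event whose field value appears in the lookup is dropped before the statistics are computed. The column name c_ip and the sample addresses are assumptions for illustration; the real filter uses whatever field names the lookup's columns carry.

```python
import csv
import io

# Hypothetical contents of Filtered_ip.csv. The column name "c_ip" is an assumption;
# the subsearch filters on whatever field names the lookup's columns carry.
FILTERED_IP_CSV = "c_ip\n10.0.0.5\n10.0.0.9\n"


def load_filtered_values(csv_text, column):
    """Read one column of the lookup into a set for fast membership checks."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


def drop_filtered(events, filtered, field):
    """Keep only events whose field value is not listed in the lookup."""
    return [event for event in events if event.get(field) not in filtered]


# Illustrative events only.
events = [
    {"host": "blue01", "c_ip": "10.0.0.5"},
    {"host": "blue01", "c_ip": "192.168.1.20"},
    {"host": "blue02", "c_ip": "10.0.0.9"},
]
kept = drop_filtered(events, load_filtered_values(FILTERED_IP_CSV, "c_ip"), "c_ip")
print(kept)
```

Keeping the exclusions in a lookup means the list can be maintained without editing the KPI search itself.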

Scaling Out ITSI

• Design to limit maintenance effort and maximize Splunk performance


– Use base searches
– Use service templates
• Service templates let you create a default setup for services that share similar KPIs and
push changes to all services that use the same template, while still allowing you to
override any configuration within a service.
• Candidate service templates might be web farms, database servers, clients, cloud-hosted
services, etc.

Converting ad-hoc Queries to Base Search Queries

Ad-hoc queries that are specific to a set of servers like this…


host=blue* sourcetype=iis
| stats count as RequestVolume by host
| eventstats mean(RequestVolume) as client_mean
| eval PercentVariance=round(((abs(RequestVolume-client_mean)/client_mean)*100), 0)

Can be enhanced to support more servers and converted into base searches like this…

host=* sourcetype=iis
| stats count as RequestVolume by host
| eval TierName=substr(host,1,len(host) - 2 )
| eventstats mean(RequestVolume) as client_mean by TierName
| eval PercentVariance=round(((abs(RequestVolume-client_mean)/client_mean)*100), 0)
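
The key change in the base search version is that the mean is computed per tier rather than across every host. A minimal Python sketch of that grouping follows, assuming (as the substr expression does) that the tier name is the host name with its last two characters removed. The host names and counts are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def tier_name(host):
    """Mirror substr(host, 1, len(host) - 2): drop the last two characters of the host name."""
    return host[:-2]


def percent_variance_by_tier(request_volume):
    """Compare each host to the mean of its own tier instead of the mean of all hosts."""
    counts_by_tier = defaultdict(list)
    for host, count in request_volume.items():
        counts_by_tier[tier_name(host)].append(count)
    tier_mean = {tier: mean(counts) for tier, counts in counts_by_tier.items()}
    return {
        host: round(abs(count - tier_mean[tier_name(host)]) / tier_mean[tier_name(host)] * 100)
        for host, count in request_volume.items()
    }


# Hypothetical hosts from two tiers with very different traffic levels.
volumes = {"blue01": 900, "blue02": 1100, "green01": 50, "green02": 150}
by_tier = percent_variance_by_tier(volumes)
print(by_tier)
```

Because one search now covers every tier, each ITSI service only needs to filter the shared results down to its own hosts.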

Service Templates

• ITSI feature that allows services that have similar KPIs to “inherit” from a template
• Templates can be updated in one place and updates can optionally be pushed to all
services that inherit from the template
• Use cases include multiple services that all represent web servers, database servers,
application instances, firewalls, load balancers, clients, etc.

Example Service Template for Clients

• KPIs based on data from disjointed systems


– Web traffic centric KPIs based on web logs
– Transaction volume and success rates based on database records
– Login status and execution time from monitoring logs
• Create base searches that expose that data for each source
• Create a service template with the KPIs that leverage those base searches

Base Search Configuration



Entity Configuration

Custom Alerting

• If you want 100% control over alerting logic and display logic, you can implement your
own alerts
• ITSI’s KPI and service health results are all searchable in Splunk!
• Implement a Splunk Enterprise alert against the ITSI data
• ITSI data contains:
– Service ID (Built in lookups available to map to service name)
– KPI name
– KPI value for each entity and the aggregate
– KPI state (e.g. normal, low, medium, high)
– And much more

Example ITSI Event Entry


index=itsi_summary

06/24/2019 22:53:03 -0400, is_entity_in_maintenance=0,
search_name="Indicator - 52fe3d6657468e4e47e38481 - ITSI Search", is_entity_defined=1,
search_now=1561431180.000, entity_key="54900d8f-46c6-44db-add8-d09f999b4a61",
info_min_time=1558839180.000, is_service_in_maintenance=0,
info_max_time=1561431183.533, is_filled_gap_event=0, alert_color="#99D18B",
info_search_time=1561431182.897, alert_level=2, qf="", alert_value=12.7, kpi="CPU%",
itsi_kpi_id=52fe3d6657468e4e47e38481, kpiid=52fe3d6657468e4e47e38481,
is_service_max_severity_event=0, sec_grp=default_itsi_security_group, alert_severity=normal,
urgency=11, alert_period=5, serviceid="326e2338-a579-41c1-b930-b75adcf2c525",
entity_title=Green01, itsi_service_id="326e2338-a579-41c1-b930-b75adcf2c525",
is_service_aggregate=0

Outputting Current KPI Status


index=itsi_summary kpi!="ServiceHealthScore" alert_severity!=disabled
| stats latest(alert_value) as CurrentKPIValue, latest(alert_severity) as
CurrentKPIAlertSeverity by itsi_service_id, kpi, entity_title
| lookup service_kpi_lookup _key as itsi_service_id OUTPUT title as ServiceName

Email Results

Require consecutive failing KPIs before alerting?

Including the Previous Value

index=itsi_summary kpi!="ServiceHealthScore" alert_severity!=disabled
| addinfo
| eval Previous_alert_value=if(_time<=relative_time(info_max_time,"-6m"), alert_value,
Previous_alert_value)
| eval Previous_alert_level=if(_time<=relative_time(info_max_time,"-6m"), alert_level,
Previous_alert_level)
| stats latest(Previous_alert_value) as PreviousKPIValue, latest(Previous_alert_level) as
PreviousKPIAlertLevel, latest(alert_value) as CurrentKPIValue, latest(alert_level) as
CurrentKPIAlertLevel by itsi_service_id, kpi, entity_title
| lookup service_kpi_lookup _key as itsi_service_id OUTPUT title as ServiceName
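
The query above splits each KPI's results into a "previous" reading (anything at least 6 minutes older than the end of the search window) and a "current" reading (the latest overall). A minimal Python sketch of that split, plus a "2 strikes" check on top of it, is shown below. The sample tuples are made up, and the level threshold of 5 assumes the mapping used in the appendix query, where 5 is High and 6 is Critical.

```python
def latest(samples):
    """Return the most recent (time, value, level) sample, or None when there are none."""
    return max(samples, key=lambda sample: sample[0]) if samples else None


def two_strikes(samples, now, cutoff_seconds=360, alert_level=5):
    """True only when both the previous and the current reading are at or above alert_level."""
    previous = latest([s for s in samples if s[0] <= now - cutoff_seconds])
    current = latest(samples)
    if previous is None or current is None:
        return False
    return previous[2] >= alert_level and current[2] >= alert_level


# Hypothetical samples as (epoch seconds, alert_value, alert_level).
now = 1000
still_failing = [(400, 91.0, 5), (700, 95.0, 6)]
first_failure = [(400, 40.0, 2), (700, 95.0, 6)]
print(two_strikes(still_failing, now), two_strikes(first_failure, now))
```

The 6 minute cutoff is slightly longer than a 5 minute KPI period, so the previous reading comes from the prior KPI run rather than the current one.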

Email Results

Suppressing Alerts During Maintenance

• Certain services may be impacted by regularly occurring maintenance


• ITSI does not natively support recurring maintenance windows
• Lookup tables can be used to store maintenance windows
• The alert query joins the windows to the services and calculates whether the service is currently in
maintenance

• Output all the maintenance window entries


• Determine if the current time is between the start and stop window
• Determine if today is a maintenance day
• Create an IsInMaintenance field
• Set it to 1 if maintenance is today and we are in the window
• Join the results back to your ITSI output, joining on the service name
• Keep only results where IsInMaintenance is 0 or null (null means no maintenance window is defined for the service)
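
The steps above can be sketched in Python as follows. The column names (ITSIServiceName, DailyWindowStart, DailyWindowStop, and one column per weekday) follow the lookup used in the appendix query; the row contents and service name are hypothetical. As in those steps, a window is assumed to start and stop on the same day, so windows that cross midnight are not handled.

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def minutes_after_midnight(hhmm):
    """Convert an 'HH:MM' string into minutes since midnight."""
    hour, minute = hhmm.split(":")
    return int(hour) * 60 + int(minute)


def is_in_maintenance(windows, service_name, now):
    """True when any window for the service is enabled today and covers the current time."""
    today = DAYS[now.weekday()]
    current = now.hour * 60 + now.minute
    for window in windows:
        if window["ITSIServiceName"] != service_name:
            continue
        start = minutes_after_midnight(window["DailyWindowStart"])
        stop = minutes_after_midnight(window["DailyWindowStop"])
        # The day column holds "1" when the window applies on that weekday.
        if window.get(today) == "1" and start <= current < stop:
            return True
    return False


# Hypothetical lookup row: a Thursday-only window from 02:00 to 03:00.
windows = [{
    "ITSIServiceName": "Web Farm Blue",
    "DailyWindowStart": "02:00",
    "DailyWindowStop": "03:00",
    "Thursday": "1",
}]
print(is_in_maintenance(windows, "Web Farm Blue", datetime(2019, 9, 12, 2, 30)))
```

Unlike the Splunk join, which leaves IsInMaintenance null for services with no window, this sketch simply returns False for them; both outcomes mean the alert is not suppressed.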

Alerting by Product
It’s silly but…

| where like(service_name, "%platformName%")

Possibilities…

• ITSI doesn’t limit what you can do with your Splunk queries
• Make the most out of entities, especially when they are dynamic
• You can both improve performance and reduce maintenance effort by use of base
searches and service templates
• Custom alerts can be created to support whatever your alert and incident policies and
procedures are

Thank You!

Go to the .conf19 mobile app to RATE THIS SESSION

Appendix

ITSI Alert Query


index=itsi_summary alert_value>-3 kpi!="ServiceHealthScore"
`comment("Check lookup table to see if the alerts for this platform should be suppressed")`
| eval Platform="MoRASP"
| lookup alarm_supression.csv Platform
| where Suppression=0
`comment("If the entry is an aggregate value, as opposed to an entity value, give it a better title")`
| eval entity_title=if(entity_title = "service_aggregate", "Overall", entity_title)
`comment("When there is no data the aggregate's entity_title will be null which is a problem later so give it a value")`
| eval entity_title=if(isnull(entity_title), "Overall", entity_title)
`comment("If there is no data for some reason, the value will be null so update the value to 'Data Stopped' for clarity")`
| eval alert_value=if(isnum(alert_value), round(alert_value, 1), "Data Stopped")
`comment("Divide the values between ones that are older than 7 minutes ago from the recent ones.")`
| addinfo
| eval Previous_alert_value=if(_time<=relative_time(info_max_time,"-7m"), alert_value, Previous_alert_value)
| eval Previous_alert_level=if(_time<=relative_time(info_max_time,"-7m"), alert_level, Previous_alert_level)
| stats latest(Previous_alert_value) as PreviousKPIValue, latest(Previous_alert_level) as PreviousKPIAlertLevel, latest(alert_value) as CurrentKPIValue, latest(alert_level) as CurrentKPIAlertLevel by itsi_service_id, kpi, entity_title
`comment("Some KPIs have entities and some don't. ITSI tracks the KPI aggregate like a null entity but we renamed it to an entity called 'Overall'. If a KPI has an entity over threshold, like server CPU, we really don't want to see the aggregate value. However, sometimes a KPI doesn't have entities and the aggregate value is the only one. To enforce that logic, for each KPI, calculate how many entities there are. If there is more than 1 entity, so more than just the aggregate entity, and we are on the aggregate entity, filter it out. If there is only 1 entity, the aggregate value, always show it.")`
| eventstats dc(entity_title) as DistinctEntities by itsi_service_id, kpi
| where NOT (DistinctEntities>1 AND entity_title="Overall")
| lookup service_kpi_lookup _key as itsi_service_id OUTPUT title
| rename title as service_name
`comment("Update the wildcard search below to one that will include all the ITSI services you want to alert together.")`
| where like(service_name, "%")
| lookup alarm_supression.csv service_name
| where IsNull(Service_Suppression) OR Service_Suppression=0
| eval CurrentKPIState=case(CurrentKPIAlertLevel=1, "Info", CurrentKPIAlertLevel=2, "Good", CurrentKPIAlertLevel=3, "Low", CurrentKPIAlertLevel=4, "Medium", CurrentKPIAlertLevel=5, "High", CurrentKPIAlertLevel=6, "Critical")
| eval PreviousKPIState=case(PreviousKPIAlertLevel=1, "Info", PreviousKPIAlertLevel=2, "Good", PreviousKPIAlertLevel=3, "Low", PreviousKPIAlertLevel=4, "Medium", PreviousKPIAlertLevel=5, "High", PreviousKPIAlertLevel=6, "Critical")
| where 'PreviousKPIState'="Critical" OR 'PreviousKPIState'="High" OR 'CurrentKPIState'="Critical" OR 'CurrentKPIState'="High"
| join type=left
[| inputlookup DC_ITSI_Service_Maintenance_Windows.csv
| addinfo
| rex field=DailyWindowStart "(?P<StartHour>\d+)\:(?P<StartMinute>\d+)"
| rex field=DailyWindowStop "(?P<StopHour>\d+)\:(?P<StopMinute>\d+)"
| eval currentHour=strftime(info_max_time, "%H")
| eval currentMinute=strftime(info_max_time, "%M")
| eval currentDay=strftime(info_max_time, "%A")
| eval minutesAfterMidnight=(currentHour*60)+currentMinute
| eval startMinutesAfterMidnight=(StartHour*60)+StartMinute
| eval stopMinutesAfterMidnight=(StopHour*60)+StopMinute
| eval todayWindowStatus=case(currentDay="Monday", Monday, currentDay="Tuesday", Tuesday, currentDay="Wednesday", Wednesday, currentDay="Thursday", Thursday, currentDay="Friday", Friday, currentDay="Saturday", Saturday, currentDay="Sunday", Sunday)
| eval matchedWindow=if(todayWindowStatus = 1 AND minutesAfterMidnight >= startMinutesAfterMidnight AND minutesAfterMidnight < stopMinutesAfterMidnight, 1, 0)
| stats sum(matchedWindow) as MatchedWindows by ITSIServiceName
| eval IsInMaintenance=if(MatchedWindows > 0, 1, 0)
| rename ITSIServiceName as service_name
| fields - MatchedWindows]
| where IsNull(IsInMaintenance) OR IsInMaintenance=0
`comment("We only want to alert on services that had KPIs in high or critical state 5 minutes ago and have KPIs in high or critical state now. Count up how many KPIs were over threshold for each time period by service and filter out the services that don't meet the criteria.")`
| eventstats count(eval(PreviousKPIState="High" OR PreviousKPIState="Critical")) as PreviousOverThresholdCount, count(eval(CurrentKPIState="High" OR CurrentKPIState="Critical")) as CurrentOverThresholdCount by service_name
| where (PreviousOverThresholdCount>0 AND CurrentOverThresholdCount>0)
`comment("To generate a hash of the services that are alerting we will collapse all the services into a single field we will call DistinctServices.")`
`comment("Then we will remove any duplicates, put them in alphabetical order, concatenate into a single string, then generate a hash of that string.")`
| eventstats values(service_name) as DistinctServices
| eval ServiceHash=md5(mvjoin(mvsort(mvdedup(DistinctServices)),","))
| table service_name, kpi, entity_title, PreviousKPIValue, PreviousKPIState, CurrentKPIValue, CurrentKPIState, ServiceHash
| stats list(service_name) as "Service Name", list(kpi) as "KPI Name", list(entity_title) as "Entity Name", list(PreviousKPIValue) as "Previous Value", list(PreviousKPIState) as "Previous State", list(CurrentKPIValue) as "Current Value", list(CurrentKPIState) as "Current State", values(ServiceHash) as ServiceHash
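
The ServiceHash step at the end of the query can be illustrated with a minimal Python sketch: deduplicate the alerting service names, sort them, join them with commas, and take the MD5 of the result. The service names below are hypothetical. The hash is the same whenever the same set of services is alerting, which presumably lets the alert be throttled on that value so a new email goes out only when the set of affected services changes.

```python
import hashlib


def service_hash(service_names):
    """MD5 of the distinct service names, sorted and comma-joined."""
    distinct_sorted = sorted(set(service_names))
    return hashlib.md5(",".join(distinct_sorted).encode("utf-8")).hexdigest()


# Hypothetical alerting services; order and duplicates should not change the hash.
first_run = service_hash(["Web Farm Blue", "Database Green", "Web Farm Blue"])
second_run = service_hash(["Database Green", "Web Farm Blue"])
third_run = service_hash(["Database Green", "Web Farm Blue", "Login Service"])
print(first_run == second_run, first_run == third_run)
```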
