Introduction to SLOs
https://www.linkedin.com/in/nduytg
SLI - Service Level Indicator
An SLI is a service level indicator—a carefully defined quantitative measure
of some aspect of the level of service that is provided.
Examples:
● QPS
● Uptime
● Error rate
● Latency
● HTTP/gRPC request success rate
● CPU/RAM/Disk used
So again, what are SLIs?
Simply put, you can think of SLIs as Prometheus metrics.

SLI - Collecting Indicators
While many numbers can function as an SLI, we recommend treating the SLI as the
ratio of two numbers: the number of good events divided by the total number
of events.
SLI = (number of good events) / (total number of events)
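A minimal sketch of the good-events ratio, expressed as a percentage. The function name `sli_ratio` and the sample counts are hypothetical, and treating an empty window as 100% is an assumption, not something the slides specify.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Return the SLI as the percentage of good events among all events."""
    if total_events == 0:
        # Assumption: with no events there is nothing to judge, so report 100%.
        return 100.0
    return 100.0 * good_events / total_events


# Hypothetical counts: 9,990 good requests out of 10,000.
print(f"SLI: {sli_ratio(9_990, 10_000):.2f}%")
```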

SLI - Standardize Indicators
Common definitions for SLI (should be agreed upon):
● Aggregation intervals: “Averaged over 1 week”
● Aggregation regions: “All the tasks in a cluster”
● How frequently measurements are made: “Every 15 seconds”
● Which requests are included: “HTTP GETs from black-box monitoring jobs”
● How the data is acquired: “Through our monitoring, measured at the server”
● Data-access latency: “Time to last byte”
SLO - Service Level Objective
Service level objectives (SLOs) specify a target level for the reliability of your
service.
Because SLOs are key to making data-driven decisions about reliability,
they’re at the core of SRE practices.
Example:
● API service uptime must be greater than 99.99%

SLO - Selecting a Target
Target SLO (uptime)    Allowable downtime (per 30 days)    Remarks
100%                   0 seconds                           Impossible
99.999% (5 nines)      0.43 minutes
99.99% (4 nines)       4.32 minutes
99.95% (3.5 nines)     21.6 minutes                        Possible
99.9% (3 nines)        43.2 minutes                        Possible
99.5% (2.5 nines)      216 minutes (~3.6 hours)
99% (2 nines)          432 minutes (~7 hours)
Calculate your uptime here: https://uptime.is/
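The downtime figures for each target follow from simple arithmetic on a 30-day window (43,200 minutes). The sketch below reproduces them; the function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window without missing the SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.999, 99.99, 99.95, 99.9, 99.5, 99.0):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes")
```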

SLO - Selecting a Target SLO
Is an uptime SLO enough for us?
No! We need to set SLOs for performance too.
Example:
● 99.9% of all API calls should be completed in under 100ms

SLO - Selecting a Target SLO
In addition, making all of your SLIs follow a consistent style allows you to take
better advantage of tooling: you can write alerting logic, SLO analysis tools,
error budget calculation, and reports to expect the same inputs: numerator,
denominator, and threshold.
Simplification is a bonus here.
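As an illustration of why a consistent style helps, the sketch below evaluates two different SLIs with the same code because both share the numerator, denominator, and threshold shape. The class `SliReading`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliReading:
    name: str
    numerator: int    # number of good events
    denominator: int  # total number of events
    threshold: float  # SLO target, in percent

    def value(self) -> float:
        """SLI as a percentage; an empty window is assumed to count as 100%."""
        if self.denominator == 0:
            return 100.0
        return 100.0 * self.numerator / self.denominator

    def meets_slo(self) -> bool:
        return self.value() >= self.threshold


# Hypothetical readings: the same logic handles availability and latency.
readings = [
    SliReading("availability", 9_995, 10_000, 99.9),
    SliReading("requests under 100ms", 9_950, 10_000, 99.9),
]
for r in readings:
    status = "OK" if r.meets_slo() else "MISSED"
    print(f"{r.name}: {r.value():.2f}% (target {r.threshold}%) {status}")
```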

SLA - Service Level Agreement
SLAs are service level agreements: an explicit or implicit contract with your
users that includes consequences of meeting (or missing) the SLOs they
contain.
SLA vs SLO
Most people really mean SLO when they say “SLA.”
One giveaway: if somebody talks about an “SLA violation,” they are almost
always talking about a missed SLO.
A real SLA violation might trigger a court case for breach of contract.
SLA vs SLO
SLA violation usually means a breach of contract!

Decision-Making Tool
SLIs and SLOs are crucial elements in the control loops used to manage
systems:
1. Monitor and measure the system’s SLIs.
2. Compare the SLIs to the SLOs, and decide whether or not action is
needed.
3. If action is needed, figure out what needs to happen in order to meet the
target.
4. Take that action.
Example situation:
1. Monitor and measure the system’s SLIs.
   a. We see that the latency of service A is increasing.
2. Compare the SLIs to the SLOs, and decide whether or not action is needed.
   a. Case A: A short burst of user traffic that does not violate our SLO => no action needed.
   b. Case B: Latency stays high for too long and may violate our SLO => we need to act.
3. If action is needed, figure out what needs to happen in order to meet the target.
   a. Add more nodes to the cluster to spread the load.
4. Take that action.
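The compare-and-decide step in the example above can be sketched as a small function. This is only an illustration: `decide_action`, the request counts, and the 99.9% target are hypothetical, and a real control loop would read the SLI from the monitoring system rather than from hard-coded numbers.

```python
def decide_action(slow_requests: int, total_requests: int, slo_percent: float) -> str:
    """Compare the latency SLI for one window against the SLO target."""
    if total_requests == 0:
        return "no action"
    sli = 100.0 * (total_requests - slow_requests) / total_requests
    return "no action" if sli >= slo_percent else "add capacity"


# Case A: a short burst, only 50 slow requests in the window (SLI = 99.95%).
print(decide_action(slow_requests=50, total_requests=100_000, slo_percent=99.9))

# Case B: sustained high latency, 5,000 slow requests (SLI = 95%).
print(decide_action(slow_requests=5_000, total_requests=100_000, slo_percent=99.9))
```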
When we cannot meet our SLOs, what should we do?
92%
● Give top priority to bugs relating to reliability issues
● The development team focuses exclusively on reliability issues until the
system is within SLO
● Freeze new releases if necessary
When we meet our SLOs, what should we do?
99.95%
● Tighten our SLOs
● Standardize our workflow and process
● Spend our error budget on new feature releases or service maintenance
● Give priority to other, more important projects (but still keep an eye on
this one)
Example SLOs
Two basic types of SLOs apply to any kind of service:
1. Availability
2. Performance
Example SLOs
Some availability metrics:
● How many nodes are up?
● Number of failed requests?
Example SLOs
How do we measure our service availability?
There are 2 kinds of availability we can measure:
● uptime availability = (uptime) / (uptime + downtime)
● success_rate availability = (successful requests) / (total requests)
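A minimal sketch of the two availability formulas above, returning a ratio between 0 and 1. The function names and sample inputs are hypothetical; returning 1.0 for an empty window is an assumption.

```python
def uptime_availability(uptime: float, downtime: float) -> float:
    """uptime / (uptime + downtime); both arguments use the same time unit."""
    total = uptime + downtime
    return uptime / total if total > 0 else 1.0


def success_rate_availability(successful_requests: int, total_requests: int) -> float:
    """successful requests / total requests."""
    return successful_requests / total_requests if total_requests > 0 else 1.0


# Hypothetical 30-day window (720 hours) with 1 hour of downtime.
print(f"uptime availability: {uptime_availability(719, 1):.4%}")

# Hypothetical traffic: 980 successful requests out of 1,000.
print(f"success-rate availability: {success_rate_availability(980, 1_000):.2%}")
```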
Example SLOs
Proposed SLOs for Service A
SLO type                Objective
Availability            99.5% uptime
Request success rate    98% success

SLOs for service Foo
Calculate Availability SLO for Service Foo
Uptime:
  sum(avg_over_time(up_time{job="serviceFoo"}[1w])) / [number_of_nodes] * 100.0
Request success rate:
  sum(rate(grpc_server_handled_total{grpc_code="OK"}[1w])) / sum(rate(grpc_server_started_total[1w])) * 100.0
Example SLOs
How do we measure the performance of our Service A cluster?
● Request response time
● Low number of failed requests
● Low latency (disk, network)
● etcd_server_slow_apply_total
● Disk write performance
That is a lot of metrics. Which should we choose?

Example SLOs
Again, we should keep our SLOs simple.
We should choose the metrics that affect our users the most.

Example SLOs
Proposed SLOs for Service A
SLO type           Objective
Request latency    90% of requests < 50ms
                   99% of requests < 100ms
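One way to check the two latency objectives above against a batch of measurements is sketched below. The helper names and the sample latencies are hypothetical; in practice these fractions would more likely come from histogram metrics than from raw samples.

```python
from typing import List, Sequence, Tuple

# (required fraction of requests, latency limit in ms), taken from the proposed SLOs.
OBJECTIVES: List[Tuple[float, float]] = [(0.90, 50.0), (0.99, 100.0)]


def fraction_under(latencies_ms: Sequence[float], limit_ms: float) -> float:
    """Fraction of requests that completed strictly under the limit."""
    if not latencies_ms:
        return 1.0  # Assumption: no traffic counts as meeting the objective.
    return sum(1 for latency in latencies_ms if latency < limit_ms) / len(latencies_ms)


def meets_latency_slos(latencies_ms: Sequence[float]) -> bool:
    """True only if every objective in OBJECTIVES is satisfied."""
    return all(
        fraction_under(latencies_ms, limit) >= required
        for required, limit in OBJECTIVES
    )


# Hypothetical samples: 95 fast requests and 5 slower ones.
healthy = [20.0] * 95 + [70.0] * 5
# Hypothetical samples: too many requests above 50ms.
degraded = [20.0] * 85 + [70.0] * 15

print(meets_latency_slos(healthy))
print(meets_latency_slos(degraded))
```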

Error Budget
What is an error budget?
Humans can more quickly understand the impact of “time consumed” and
“time remaining” than adherence to a displayed percentage.
An error budget represents a time* budget available for pushing risky changes or
performing maintenance on our services.
(*) It doesn’t need to be time; it can be based on other performance metrics.
Error Budget
Image from: https://www.datadoghq.com/

Error Budget
Error budget is 100% minus the SLO
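A short sketch of that definition, converted into time for a 30-day window. The function names, the 99.9% target, and the 10 minutes of downtime already consumed are hypothetical values chosen for illustration.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget is 100% minus the SLO."""
    return 100.0 - slo_percent


def remaining_budget_minutes(
    slo_percent: float, downtime_so_far_minutes: float, window_days: int = 30
) -> float:
    """Minutes of error budget left in the window (negative if overspent)."""
    window_minutes = window_days * 24 * 60
    budget_minutes = window_minutes * error_budget_percent(slo_percent) / 100.0
    return budget_minutes - downtime_so_far_minutes


# Hypothetical: a 99.9% SLO with 10 minutes of downtime so far this window.
print(f"budget: {error_budget_percent(99.9):.2f}%")
print(f"remaining: {remaining_budget_minutes(99.9, 10.0):.1f} minutes")
```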
SLO Reporting Tool
A tool for tracking your SLOs:
● Register SLOs for your services
● Review SLOs and error budgets
● Decision-making tool
Reminder - How to set SLOs
In summary, here are a few principles for setting our SLOs:
● Keep it simple
● Avoid 100%!!!
● Have as few SLOs as possible
Summary
To summarize:
● If you want to have a reliable service, you must first define “reliability.”
● If you want to know how reliable your service is, you must be able to measure the
rates of successful and unsuccessful queries; these will form the basis of your SLIs.
● The more reliable the service, the more it costs to operate.
● Without an SLO, your team and your stakeholders cannot make principled
judgements about whether your service needs to be made more reliable
(increasing cost and slowing development) or less reliable (allowing greater velocity
of development).
Reference Books
https://landing.google.com/sre/books/
Thank You
Q&A
