https://www.linkedin.com/in/nduytg
SLI - Service Level Indicator
An SLI is a service level indicator—a carefully defined quantitative measure
of some aspect of the level of service that is provided.
Examples:
● QPS
● Uptime
● Error rate
● Latency
● HTTP/gRPC request success rate
● CPU/RAM/Disk used
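To make a few of these indicators concrete, here is a minimal sketch of how request success rate, error rate, and a latency percentile could be computed from raw request records. The `Request` record, the function names, and the sample data are all hypothetical and only illustrate the arithmetic; a real service would normally read these values from its monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    ok: bool            # True if the request succeeded
    latency_ms: float   # observed latency in milliseconds


def success_rate(requests):
    """Percentage of requests that succeeded."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.ok)
    return 100.0 * good / len(requests)


def error_rate(requests):
    """Percentage of requests that failed."""
    return 100.0 - success_rate(requests)


def latency_percentile(requests, pct):
    """Nearest-rank latency percentile, in milliseconds."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    values = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100.0 * len(values)))
    return values[rank - 1]


# Made-up sample window: three successes and one slow failure.
sample = [
    Request(ok=True, latency_ms=12.0),
    Request(ok=True, latency_ms=20.0),
    Request(ok=False, latency_ms=250.0),
    Request(ok=True, latency_ms=15.0),
]
```

With this sample, `success_rate(sample)` is 75.0 and `latency_percentile(sample, 50)` is 15.0.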
So again, what are SLIs?
One giveaway: if somebody talks about an “SLA violation,” they are almost
always talking about a missed SLO.
A real SLA violation might trigger a court case for breach of contract.
SLA vs SLO
1. Availability
2. Performance
Example SLOs
Some availability metrics:
Uptime (%):
    sum(avg_over_time(up_time{job="serviceFoo"}[1w])) / [number_of_nodes] * 100.0
Request success rate (%):
    sum(rate(grpc_server_handled_total{grpc_code="OK"}[1w])) / sum(rate(grpc_server_started_total[1w])) * 100.0
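The arithmetic behind these two queries can be sketched in plain Python. This is only an illustration under assumptions: the node names, sample values, counter totals, and the target passed to `meets_slo` are made up, and the functions mirror the shape of the queries rather than call any monitoring API.

```python
def uptime_percent(up_samples_by_node):
    """Average each node's 0/1 'up' samples over the window, sum the
    per-node averages, then divide by the node count and scale to percent."""
    if not up_samples_by_node:
        raise ValueError("no nodes given")
    per_node_avg = []
    for node, samples in up_samples_by_node.items():
        if not samples:
            raise ValueError(f"no samples for node {node!r}")
        per_node_avg.append(sum(samples) / len(samples))
    return sum(per_node_avg) / len(up_samples_by_node) * 100.0


def request_success_percent(handled_ok, started):
    """Requests handled with an OK status as a percentage of requests started."""
    if started <= 0:
        raise ValueError("no requests started in the window")
    return 100.0 * handled_ok / started


def meets_slo(measured_percent, target_percent):
    """True when the measured indicator is at or above the SLO target."""
    return measured_percent >= target_percent


# Hypothetical one-week window for two nodes, four samples each.
nodes = {
    "node-a": [1, 1, 1, 1],
    "node-b": [1, 0, 1, 1],
}
```

Here `uptime_percent(nodes)` is 87.5, which would miss a hypothetical 99% uptime target when passed to `meets_slo`.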
Example SLOs
How do we measure our Service A cluster performance?
Humans can more quickly understand the impact of “time consumed” and
“time remaining” instead of adherence to a displayed percentage number.
An error budget represents a time* budget available for pushing risky changes or
performing maintenance on our services.
(*) It doesn’t need to be time; it can be expressed in other performance metrics.
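As a worked example of a time-based error budget, the sketch below turns an availability target into "time allowed" and "time remaining". The function names, the 99.9% target, the 30-day window, and the 10 minutes of consumed downtime are all assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo_percent, window):
    """Total downtime allowed in the window for a given availability SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    return window * (1 - slo_percent / 100.0)


def budget_remaining(slo_percent, window, downtime_so_far):
    """Budget left after subtracting the downtime already consumed.
    A negative result means the budget has been overspent."""
    return error_budget(slo_percent, window) - downtime_so_far


# Hypothetical numbers: a 99.9% SLO over 30 days, with 10 minutes already spent.
window = timedelta(days=30)
budget = error_budget(99.9, window)
remaining = budget_remaining(99.9, window, timedelta(minutes=10))
```

For a 99.9% target over 30 days the budget works out to 0.1% of 43,200 minutes, or about 43 minutes 12 seconds; after 10 minutes of downtime roughly 33 minutes remain for risky changes or maintenance.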
Error Budget
● Keep it simple
● Avoid 100%!!!
● Have as few SLOs as possible
Summary
To summarize:
● If you want to have a reliable service, you must first define “reliability.”
● If you want to know how reliable your service is, you must be able to measure the
rates of successful and unsuccessful queries; these will form the basis of your SLIs.
● The more reliable the service, the more it costs to operate.
● Without an SLO, your team and your stakeholders cannot make principled
judgements about whether your service needs to be made more reliable
(increasing cost and slowing development) or less reliable (allowing greater velocity
of development).
Reference Books
https://landing.google.com/sre/books/
Thank You
Q&A