How Chaos Engineering Assures The Resilience of Your Services

Elder Moraes
Developer Advocate
@elderjava
Chaos Engineering is the discipline of
experimenting on a system in order to
build confidence in the system’s
capability to withstand turbulent
conditions in production.
http://principlesofchaos.org/
Resilience is the adaptability of a
system when facing changes, failures
and anomalies
Create failures on purpose before they
happen unexpectedly
Find weaknesses and fix them
youtube.eldermoraes.com
Developer Advocate at Red Hat
Board Member at SouJava
Author of Jakarta EE Cookbook
Helps developers to build and deliver awesome
applications
Examples of where to inject chaos
• Application
• Cache
• Database
• Cloud/Infra
Chaos Engineering phases
• Steady state
• Hypothesis
• Design & Execution
• Lessons learned
• Fix
Steady state
The usual behaviour of a service, as measured by a business metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
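A steady-state check can be sketched as a comparison between the current value of a business metric and its usual value. The following is only a minimal illustration: the function name `within_steady_state`, the sample numbers and the 10% tolerance are assumptions, not values from the talk.

```python
from statistics import mean


def within_steady_state(baseline, current, tolerance=0.10):
    """Return True if `current` is within `tolerance` (relative) of the baseline mean.

    `baseline` is a list of past samples of a business metric
    (hypothetical numbers, e.g. stream starts per second).
    """
    expected = mean(baseline)
    if expected == 0:
        return current == 0
    return abs(current - expected) / expected <= tolerance


# Hypothetical samples of a business metric during normal operation.
baseline_samples = [1000, 1020, 980, 1010, 990]

print(within_steady_state(baseline_samples, 1005))  # close to the usual value
print(within_steady_state(baseline_samples, 700))   # a clear deviation
```

A real system would likely compare against the same time of day or week, since business metrics tend to follow a daily pattern.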
Speaking of metrics…
A metric is a measure used to evaluate,
control and/or select, quantitatively,
a person, an event or an institution
Metrics & Health Check
RockBalboaService
• Punch power: 100%            • Punch power: 0.5%
• Bleeding: 0 ml/s             • Bleeding: 100 ml/s
• Eyesight: 100%               • Eyesight: -10%
• Sweating: 1 ml/s             • Sweating: 500 ml/s
• Confidence: 10               • Confidence: 0
• Pronunciation: “Adrian!”     • Pronunciation: “Anhnaaamnannn…”
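The two sets of readings above suggest how metrics feed a health check: each metric is compared with a limit, and the service is reported as up only when all of them pass. The sketch below assumes hypothetical thresholds and metric key names; the slide gives only the readings, not the limits.

```python
# Hypothetical thresholds: the slide gives only the two sets of readings,
# not the limits that separate "healthy" from "unhealthy".
THRESHOLDS = {
    "punch_power_pct": lambda v: v >= 50,
    "bleeding_ml_s": lambda v: v <= 10,
    "eyesight_pct": lambda v: v >= 50,
    "sweating_ml_s": lambda v: v <= 100,
    "confidence": lambda v: v >= 5,
}


def health_check(metrics):
    """Return ("UP" or "DOWN", list of failing metric names)."""
    failing = [name for name, ok in THRESHOLDS.items() if not ok(metrics[name])]
    return ("UP" if not failing else "DOWN", failing)


# Readings taken from the left and right columns of the slide.
healthy = {"punch_power_pct": 100, "bleeding_ml_s": 0, "eyesight_pct": 100,
           "sweating_ml_s": 1, "confidence": 10}
beaten = {"punch_power_pct": 0.5, "bleeding_ml_s": 100, "eyesight_pct": -10,
          "sweating_ml_s": 500, "confidence": 0}

print(health_check(healthy))
print(health_check(beaten))
```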
Back to chaos
Hypothesis
What if:
• A service returns a 404
• A database stops working
• The number of requests spikes
• Latency grows by 100%
• A container is killed
• A port becomes inaccessible
• Etc…
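Some of the "what if" questions above (a failing service, growing latency) can be explored in code by wrapping a call so that it sometimes fails or slows down. This is a minimal sketch, not a production tool: `with_chaos`, `InjectedFault` and `get_price` are hypothetical names introduced for illustration.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure (e.g. an HTTP 404 or a dead database)."""


def with_chaos(func, error_rate=0.0, extra_latency_s=0.0, rng=random.random):
    """Wrap `func` so that calls become slower and sometimes fail."""
    def wrapper(*args, **kwargs):
        if extra_latency_s > 0:
            time.sleep(extra_latency_s)      # simulate added latency
        if rng() < error_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def get_price(item):
    # Stand-in for a real service call.
    return {"item": item, "price": 42}


flaky_get_price = with_chaos(get_price, error_rate=0.5, extra_latency_s=0.01)

failures = 0
for _ in range(20):
    try:
        flaky_get_price("book")
    except InjectedFault:
        failures += 1
print(f"{failures} of 20 calls failed")
```

The number of failures varies from run to run because the faults are injected at random.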
Design & Execution
Best practices:
• Start small (baby steps)
• Stay as close as possible to the production environment
• Minimize impact as much as possible
• Have an emergency button
• Automate
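The "emergency button" and "automate" practices above can be combined in a small experiment runner that injects a fault, watches a metric, and rolls back as soon as the steady state is lost. This is a sketch under assumptions: `run_experiment` and its callback parameters are hypothetical, and the metric readings are made up.

```python
def run_experiment(inject, rollback, read_metric, is_steady, max_checks=10):
    """Inject a fault, watch the metric, and stop at the first sign of trouble.

    Returns "completed" if the steady state held for every check,
    or "aborted" if the emergency stop was triggered.
    """
    inject()
    try:
        for _ in range(max_checks):
            if not is_steady(read_metric()):
                return "aborted"      # emergency button: stop right away
        return "completed"
    finally:
        rollback()                    # always undo the injected fault


# Hypothetical metric readings; the fourth one falls below the limit.
readings = iter([100, 98, 97, 60, 99])
state = {"fault_active": False}

result = run_experiment(
    inject=lambda: state.update(fault_active=True),
    rollback=lambda: state.update(fault_active=False),
    read_metric=lambda: next(readings),
    is_steady=lambda value: value >= 90,
    max_checks=5,
)
print(result, state)
```

Placing the rollback in a `finally` clause means the fault is removed even if reading the metric itself raises an error.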
Design & Execution
Users → Load Balancer, which splits the traffic into:
• Steady state: 98%
• Control group: 1%
• Chaos group: 1%
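One way to obtain a 98% / 1% / 1% split like the one above is to hash a stable user identifier into 100 buckets, so the same user always lands in the same group. The sketch below is an assumption about how such a split could be done, not a description of any particular load balancer; `assign_group` and the `user-N` identifiers are hypothetical.

```python
import hashlib


def assign_group(user_id):
    """Map a user to a group: 1% control, 1% chaos, 98% untouched steady state."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket 0..99
    if bucket == 0:
        return "control"
    if bucket == 1:
        return "chaos"
    return "steady"


counts = {"steady": 0, "control": 0, "chaos": 0}
for n in range(100_000):
    counts[assign_group(f"user-{n}")] += 1
print(counts)
```

Comparing the chaos group with a control group of the same size makes it easier to tell the effect of the injected failure from normal variation.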
Lessons learned
Ideas for discussion:
• How long did it take for the failure to be detected?
• Did someone get a notification? How long did it take?
• Was there graceful degradation? How long until it started?
• How long until auto-recovery (partial and full)?
• Was manual intervention needed?
• How long until the system was back to the steady state?
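Most of the questions above are about elapsed time, so they can be answered from a timeline of timestamped events. The following sketch assumes a hypothetical set of event names and made-up timestamps; `durations_since_injection` is an illustrative helper, not part of any tool mentioned in the talk.

```python
from datetime import datetime


def durations_since_injection(events):
    """Return seconds elapsed from "fault_injected" to every other event."""
    start = events["fault_injected"]
    return {
        name: (moment - start).total_seconds()
        for name, moment in events.items()
        if name != "fault_injected"
    }


# Hypothetical timeline of one experiment.
timeline = {
    "fault_injected": datetime(2024, 1, 1, 10, 0, 0),
    "failure_detected": datetime(2024, 1, 1, 10, 0, 45),
    "notification_sent": datetime(2024, 1, 1, 10, 1, 30),
    "steady_state_restored": datetime(2024, 1, 1, 10, 6, 0),
}

for name, seconds in durations_since_injection(timeline).items():
    print(f"{name}: {seconds:.0f} s")
```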
Fix
“We learn from failure, not
from success”
Dracula, Bram Stoker
Kubernetes &
Chaos Engineering
Kubernetes is perfect for Chaos Engineering
• Native features for resilience
• Pods are restarted/recreated based on health check/status
• Pods can be distributed across different regions
Some tools for Chaos with Kubernetes
• Istio
• Chaos Monkey
• Chaos Kong
• Kube Monkey
Istio
https://istio.io/docs/tasks/traffic-management/fault-injection/
Chaos Monkey
• Randomly kills virtual machines and containers
• Natively integrated with Spinnaker (spinnaker.io)
• Supports AWS, GCP, Azure, Cloud Foundry and Kubernetes
• https://github.com/netflix/chaosmonkey
Chaos Kong
• Kills an entire cloud region (through simulation, of course)
• Though rare, it happens…
• https://medium.com/netflix-techblog/tagged/chaos-kong
Kube Monkey
https://github.com/asobti/kube-monkey
“Chaos doesn’t cause
problems. It reveals them.”
Nora Jones, Ex-Senior Chaos Engineer at Netflix
developer.redhat.com
Thank you!