
How chaos engineering assures the resilience of your services

Elder Moraes
Developer Advocate
@elderjava
Chaos Engineering is the discipline of
experimenting on a system in order to
build confidence in the system’s
capability to withstand turbulent
conditions in production.
http://principlesofchaos.org/

Resilience is the adaptability of a
system when facing changes, failures
and anomalies

Create failures on purpose before they
happen unexpectedly

Find weaknesses and fix them

youtube.eldermoraes.com

Developer Advocate at Red Hat

Board Member at SouJava

Author of Jakarta EE Cookbook

Helps developers build and deliver awesome applications

Examples of where to inject chaos

• Application
• Cache
• Database
• Cloud/Infra

Chaos Engineering phases

• Steady State
• Hypothesis
• Design & Execution
• Lessons Learned
• Fix

Steady state
The usual behaviour of a service, measured through a business metric

https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
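
As a rough sketch of how a steady state could be turned into an automated check, the snippet below derives a tolerance band from past samples of a business metric and tests new readings against it. The metric (orders per minute), the sample values and the three-standard-deviation tolerance are illustrative assumptions, not figures from the talk.

```python
from statistics import mean, stdev


def steady_state_band(samples, tolerance=3.0):
    """Return (low, high): the mean plus or minus `tolerance` standard deviations."""
    centre = mean(samples)
    spread = stdev(samples)
    return centre - tolerance * spread, centre + tolerance * spread


def within_steady_state(value, band):
    """True when a new reading falls inside the steady-state band."""
    low, high = band
    return low <= value <= high


# Hypothetical history of a business metric, e.g. orders per minute.
history = [100, 102, 98, 101, 99, 100, 103, 97]
band = steady_state_band(history)

print(within_steady_state(101, band))  # True: a normal reading
print(within_steady_state(40, band))   # False: far below the usual behaviour
```

A real service would read the samples from its monitoring system and would likely account for daily or weekly seasonality, which this sketch ignores.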

Speaking of metrics…

A metric is a measure used to
quantitatively evaluate, control and/or
select a person, an event or an
institution

Metrics & Health Check

RockBalboaService

• Punch power: 100%              • Punch power: 0.5%
• Bleeding: 0 ml/s               • Bleeding: 100 ml/s
• Eyesight: 100%                 • Eyesight: -10%
• Sweating: 1 ml/s               • Sweating: 500 ml/s
• Confidence: 10                 • Confidence: 0
• Pronunciation: “Adrian!”       • Pronunciation: “Anhnaaamnannn…”
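
One way to read the two sets of readings above is as input to a health check: metrics are compared against thresholds and collapsed into a single UP/DOWN status. The sketch below assumes that reading. The threshold values, the metric keys and the labels `healthy` and `beaten` are hypothetical; the slide gives only the raw numbers.

```python
# Hypothetical thresholds: each metric maps to a rule that returns True when OK.
THRESHOLDS = {
    "punch_power_pct": lambda value: value >= 50,
    "bleeding_ml_s": lambda value: value <= 5,
    "confidence": lambda value: value >= 5,
}


def health_check(metrics):
    """Collapse individual metrics into one status plus the list of failed checks."""
    failed = sorted(
        name for name, is_ok in THRESHOLDS.items() if not is_ok(metrics[name])
    )
    return {"status": "DOWN" if failed else "UP", "failed": failed}


# Readings taken from the two columns of the slide (labels are assumptions).
healthy = {"punch_power_pct": 100, "bleeding_ml_s": 0, "confidence": 10}
beaten = {"punch_power_pct": 0.5, "bleeding_ml_s": 100, "confidence": 0}

print(health_check(healthy))  # {'status': 'UP', 'failed': []}
print(health_check(beaten))   # status DOWN, with all three checks listed as failed
```

Metrics say how the service is doing; the health check answers only whether it can still serve.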

Back to chaos

Hypothesis

What if:
• A service returns a 404
• A database stops working
• The number of requests spikes
• Latency grows by 100%
• A container is killed
• A port becomes inaccessible
• Etc…
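
Some of these hypotheses can be rehearsed at the application level before touching infrastructure. The sketch below wraps a call so that it can be delayed or made to fail on purpose. `with_chaos`, `InjectedFault` and `get_price` are hypothetical names invented for this example, not part of any tool mentioned in the talk.

```python
import random
import time


class InjectedFault(Exception):
    """Raised instead of a real response when a failure is injected."""


def with_chaos(func, failure_rate=0.0, extra_latency_s=0.0, rng=None):
    """Wrap `func` so calls can be slowed down or fail with a given probability."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if extra_latency_s > 0:
            time.sleep(extra_latency_s)  # "latency grows" hypothesis
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")  # "service fails" hypothesis
        return func(*args, **kwargs)

    return wrapper


def get_price(item):
    """Hypothetical service call used only for this example."""
    return {"item": item, "price": 42}


flaky_get_price = with_chaos(get_price, failure_rate=1.0)

try:
    flaky_get_price("gloves")
except InjectedFault as error:
    print(f"caller must cope with: {error}")
```

With `failure_rate=0.0` and no extra latency the wrapper is transparent, so the same code path can stay in place between experiments.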

Design & Execution

Best practices:
• Start small (baby steps)
• Run as close as possible to the production environment
• Minimize impact as much as possible
• Have an emergency button
• Automate
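
A minimal sketch of how these practices could fit together: the experiment starts with a small blast radius, grows it step by step, and stops automatically as soon as the steady state is lost (the "emergency button"). The function names, the simulated system and its numbers are assumptions made for illustration.

```python
def run_experiment(blast_radii, inject, is_steady, stop):
    """Grow the blast radius step by step; abort when the steady state is lost."""
    applied = []
    for pct in blast_radii:
        inject(pct)
        applied.append(pct)
        if not is_steady():
            stop()  # automated emergency button
            return {"aborted": True, "applied": applied}
    stop()  # always clean up, even after a successful run
    return {"aborted": False, "applied": applied}


# Hypothetical system: the error rate grows with the share of traffic affected.
state = {"error_rate": 0.0}


def inject(pct):
    state["error_rate"] = pct * 0.02


def is_steady():
    return state["error_rate"] < 0.05


def stop():
    state["error_rate"] = 0.0


result = run_experiment([1, 2, 5, 10], inject, is_steady, stop)
print(result)  # {'aborted': True, 'applied': [1, 2, 5]}
```

In this simulated run the experiment never reaches the 10% step, because the abort condition trips at 5%.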

Design & Execution

Traffic split (Users → Load Balancer):
• Steady state: 98%
• Control group: 1%
• Chaos group: 1%
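
The split above is normally done by the load balancer. As an illustration of the idea only, the sketch below assigns each user to a group by hashing the user id, so that a given user always lands in the same group. The function name and the hashing approach are assumptions, not how any particular load balancer works.

```python
import hashlib
from collections import Counter


def assign_group(user_id, chaos_pct=1, control_pct=1):
    """Map a user id to 'chaos', 'control' or 'steady' using a stable hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < chaos_pct:
        return "chaos"
    if bucket < chaos_pct + control_pct:
        return "control"
    return "steady"


# Hypothetical user ids; the counts should land near 98% / 1% / 1%.
counts = Counter(assign_group(f"user-{n}") for n in range(10_000))
print(counts)
```

Comparing the chaos group against a control group of the same size, rather than against all traffic, makes differences in the business metric easier to attribute to the experiment.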

Lessons learned

Ideas for discussion:


• How long did it take for the failure to be detected?
• Did someone get a notification? How long did it take?
• Was there graceful degradation? How long until it started?
• How long until auto-recovery (partial and full)?
• Was manual intervention needed?
• How long until the system was back to the steady state?
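
Most of these questions come down to durations measured from the moment the failure was injected. A small sketch, assuming the experiment recorded a timestamp for each event; the event names and times below are invented for illustration.

```python
from datetime import datetime, timedelta


def seconds_since_injection(timeline):
    """Return, for each recorded event, the seconds elapsed since the injection."""
    start = timeline["failure_injected"]
    return {
        event: (moment - start).total_seconds()
        for event, moment in timeline.items()
        if event != "failure_injected"
    }


# Hypothetical timeline of one experiment.
start = datetime(2024, 1, 1, 12, 0, 0)
timeline = {
    "failure_injected": start,
    "failure_detected": start + timedelta(seconds=30),
    "team_notified": start + timedelta(seconds=45),
    "degradation_started": start + timedelta(seconds=50),
    "steady_state_restored": start + timedelta(minutes=5),
}

for event, seconds in seconds_since_injection(timeline).items():
    print(f"{event}: {seconds:.0f}s after injection")
```

Tracking the same durations across repeated experiments shows whether the fixes are actually shortening detection and recovery.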

Fix

“We learn from failure, not
from success”
Dracula, Bram Stoker

Kubernetes & Chaos Engineering

Kubernetes is perfect for Chaos Engineering

• Native features for resilience
• Pods are restarted/recreated based on health check/status
• Pods can be distributed across different regions

Some tools for Chaos with Kubernetes

• Istio
• Chaos Monkey
• Chaos Kong
• Kube Monkey

Istio

https://istio.io/docs/tasks/traffic-management/fault-injection/

Chaos Monkey

• Randomly kills virtual machines and containers
• Natively integrated with Spinnaker (spinnaker.io)
• Supports AWS, GCP, Azure, Cloud Foundry and Kubernetes
• https://github.com/netflix/chaosmonkey

Chaos Kong

• Kills an entire cloud region (through simulation, of course)
• Though rare, it happens…
• https://medium.com/netflix-techblog/tagged/chaos-kong

Kube Monkey

https://github.com/asobti/kube-monkey

“Chaos doesn’t cause
problems. It reveals them.”
Nora Jones, Ex-Senior Chaos Engineer at Netflix

developer.redhat.com

Thank you!
