
The Practice of Chaos Engineering
Ana Medina
Chaos Engineer at Gremlin
@ana_m_medina
#reactive18 @ana_m_medina

@ana_m_medina

Gremlin

Uber
SFEFCU
Google
Quicken Loans
Stanford University
Miami Dade College

How many of you have heard of Chaos Engineering?

How many of you have run a Chaos Engineering experiment?

What is Chaos Engineering?

Chaos Engineering
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Chaos Engineering
“Inject something harmful to build an immunity.”
-@KoltonAndrus
Gremlin Founder and CEO

Why?
● Microservices
● Systems are scaling fast
● Downtime is really expensive
● Our dependencies will fail
● Pager fatigue and burnout really hurt

Use Cases:
● Outage reproduction
● On-call training
● Strengthen new products
● Battle test new infrastructure and services

What do you need before doing Chaos Engineering?
● Monitoring/Observability
● On-Call and Incident Management
● Cost of Downtime Per Hour

Chaos Engineering is not:
● Unexpected or unmonitored experiments
● Creating outages

“Chaos Engineering Without Observability ... Is Just Chaos”
-@mipsytipsy
Charity Majors
CEO of Honeycomb

Minimize the blast radius

Level 0: THE BEGINNING
Chaos Monkey

VALUE PROVIDED: Prepare for host failures in the cloud
APPROACH TAKEN: Random
MATURITY REQUIRED: Low

Level 1: THE FIRST STEP
Infrastructure Failures

VALUE PROVIDED: Prepare for host-level failures
APPROACH TAKEN: Disciplined
MATURITY REQUIRED: Basic Operations

Level 1.5: INTERMEDIATE
Network Failures

VALUE PROVIDED: Prepare for high impact events
APPROACH TAKEN: Gameday
MATURITY REQUIRED: Networking expertise

Level 2: THE NEXT STEP
Application Failures

VALUE PROVIDED: Safely validate the user experience
APPROACH TAKEN: Precision Experiments
MATURITY REQUIRED: Advanced

● Latency: added to 50% of Android traffic
● Exceptions: 50% of Android traffic failed
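
The two Level 2 examples above (latency or exceptions on 50% of Android traffic) can be sketched as a small application-level fault injector. This is a minimal illustration, not how Gremlin or any specific tool implements it: the names `Attack`, `in_blast_radius`, `handle`, and `InjectedFault` are hypothetical, and hashing the request ID to choose the impacted half is an assumption made so that the selection is repeatable.

```python
import hashlib
import time
from dataclasses import dataclass


class InjectedFault(Exception):
    """Raised instead of a normal response when the exception attack fires."""


@dataclass(frozen=True)
class Attack:
    platform: str            # only traffic from this platform is eligible
    percent: int             # share of eligible requests to impact (0-100)
    latency_s: float = 0.0   # seconds of delay to add; 0 disables latency
    fail: bool = False       # if True, raise InjectedFault for impacted requests


def in_blast_radius(attack: Attack, platform: str, request_id: str) -> bool:
    """Deterministically place a request inside or outside the experiment."""
    if platform != attack.platform:
        return False
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < attack.percent


def handle(attack: Attack, platform: str, request_id: str, sleep=time.sleep) -> str:
    """Hypothetical request handler with the fault injected in front of it."""
    if in_blast_radius(attack, platform, request_id):
        if attack.latency_s > 0:
            sleep(attack.latency_s)
        if attack.fail:
            raise InjectedFault(f"injected failure for request {request_id}")
    return "ok"
```

Under these assumptions, `Attack(platform="android", percent=50, latency_s=0.2)` delays roughly half of Android requests and leaves all other traffic untouched. Setting `fail=True` turns the same selection into the exception experiment.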

You can and should inject chaos at every layer of your stack

● Application
● API
● Caching
● Database
● Hardware
● Cloud Infrastructure / Bare metal

Top places to inject chaos



https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/

Getting Started:

● Identify your top 5 critical systems
● Choose one system
● Whiteboard the system
● Determine what experiment you want to run (resource, state, network)
● Determine the blast radius
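
One way to make the checklist above concrete is to write the plan down as data before running anything. The sketch below is only an illustration: `ExperimentPlan`, its fields, and the host names are hypothetical, and expressing the blast radius as a percentage of hosts is an assumption. The three categories come from the slide.

```python
import math
from dataclasses import dataclass

# Experiment categories named on the slide.
CATEGORIES = ("resource", "state", "network")


@dataclass(frozen=True)
class ExperimentPlan:
    system: str                 # the one critical system chosen
    category: str               # resource, state, or network
    attack: str                 # free-text description of the experiment
    blast_radius_percent: int   # share of hosts the experiment may touch

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"category must be one of {CATEGORIES}")
        if not 0 < self.blast_radius_percent <= 100:
            raise ValueError("blast radius must be between 1 and 100 percent")

    def targets(self, hosts):
        """Return the subset of hosts inside the blast radius (at least one)."""
        count = max(1, math.ceil(len(hosts) * self.blast_radius_percent / 100))
        return sorted(hosts)[:count]


plan = ExperimentPlan(
    system="checkout-api",            # hypothetical system name
    category="network",
    attack="add 100 ms of latency",   # hypothetical attack
    blast_radius_percent=10,
)
hosts = [f"host-{i:02d}" for i in range(20)]
print(plan.targets(hosts))
```

Starting with a small percentage and widening it only after the system behaves as expected keeps the experiment within the blast radius you chose.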

Companies doing Chaos Engineering

Chaos Days

Chaos Days: A dedicated day for your entire company to focus on building resilience instead of new products.

https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/

“What could go wrong?”

“Do we know what will happen if this breaks?”

Chaos Day Crew:


VP Engineering / CTO / COO
Executive Assistant
Engineering Director / Manager
Senior Engineer
New Grad / Intern Engineer

What experiments can you run?
● Reproduce outage conditions
● Unpredictable circumstances
● Large traffic spikes
● Race conditions
● Datacenter failure
● Time travel: system clocks out of sync
● Network errors
● CPU overloads
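
As an illustration of the "time travel" experiment in the list above, the sketch below skews a clock inside the application and checks how a time-dependent code path reacts. It is a toy example under stated assumptions: `SkewedClock` and `token_is_valid` are hypothetical names, and real tooling would typically shift the host's system clock rather than an in-process one.

```python
from datetime import datetime, timedelta, timezone


class SkewedClock:
    """A clock that reports the base time shifted by a fixed skew."""

    def __init__(self, skew: timedelta, base=None):
        self._skew = skew
        # base is injectable so the experiment is repeatable in tests
        self._base = base or (lambda: datetime.now(timezone.utc))

    def now(self) -> datetime:
        return self._base() + self._skew


def token_is_valid(issued_at: datetime, ttl: timedelta, clock: SkewedClock) -> bool:
    """Hypothetical check: a token is valid from issue time until issue time + ttl."""
    age = clock.now() - issued_at
    return timedelta(0) <= age <= ttl


# A token issued five minutes ago with a one-hour lifetime.
issued = datetime.now(timezone.utc) - timedelta(minutes=5)
ttl = timedelta(hours=1)

for skew in (timedelta(0), timedelta(hours=2), timedelta(minutes=-10)):
    print(skew, token_is_valid(issued, ttl, SkewedClock(skew)))
```

With no skew the token is accepted. A clock running two hours fast treats it as expired, and a clock running ten minutes slow sees it as issued in the future. Both are failure modes worth knowing about before they happen in production.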

gremlin.com/chaos-monkey/

Learn more:
Join the Chaos, Join Slack:

bit.ly/chaos-eng-slack
1,900+ members across the world

THANKS!

ana@gremlin.com
@ana_m_medina
