Reliability Pillar
AWS Well-Architected Framework
September 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Notices
This document is provided for informational purposes only. It represents AWS’s
current product offerings and practices as of the date of issue of this document,
which are subject to change without notice. Customers are responsible for
making their own independent assessment of the information in this document
and any use of AWS’s products or services, each of which is provided “as is”
without warranty of any kind, whether express or implied. This document does
not create any warranties, representations, contractual commitments,
conditions or assurances from AWS, its affiliates, suppliers or licensors. The
responsibilities and liabilities of AWS to its customers are controlled by AWS
agreements, and this document is not part of, nor does it modify, any agreement
between AWS and its customers.
Contents
Introduction
Reliability
Design Principles
Definition
Foundation - Limit Management
Foundation - Networking
Application Design for High Availability
Understanding Availability Needs
Application Design for Availability
Operational Considerations for Availability
Example Implementations for Availability Goals
Dependency Selection
Single Region Scenarios
Multi-Region Scenarios
Conclusion
Contributors
Document Revisions
Appendix A: Designed-For Availability for Select AWS Services
Abstract
The focus of this paper is the reliability pillar of the AWS Well-Architected
Framework. It provides guidance to help you apply best practices in the design,
delivery, and maintenance of Amazon Web Services (AWS) environments.
Introduction
The AWS Well-Architected Framework helps you understand the pros and cons
of decisions you make while building systems on AWS. By using the Framework,
you will learn architectural best practices for designing and operating reliable,
secure, efficient, and cost-effective systems in the cloud. It provides a way to
consistently measure your architectures against best practices and identify areas
for improvement. We believe that having well-architected systems greatly
increases the likelihood of business success. The Framework is based on five
pillars:
• Operational Excellence
• Security
• Reliability
• Performance Efficiency
• Cost Optimization
This paper focuses on the reliability pillar and how to apply it to your solutions.
Achieving reliability can be challenging in traditional on-premises
environments due to single points of failure, lack of automation, and lack of
elasticity. By adopting the practices in this paper you will build architectures
that have strong foundations, consistent change management, and proven
failure recovery processes.
This paper is intended for those in technology roles, such as chief technology
officers (CTOs), architects, developers, and operations team members. After
reading this paper, you will understand AWS best practices and strategies to use
when designing cloud architectures for reliability. This paper includes high-level
implementation details and architectural patterns, as well as references to
additional resources.
Reliability
The reliability pillar encompasses the ability of a system to recover from
infrastructure or service disruptions, dynamically acquire computing resources
to meet demand, and mitigate disruptions such as misconfigurations or
transient network issues.
Design Principles
In the cloud, a number of design principles can help you increase reliability.
Definition
Service availability is commonly defined as the percentage of time that an
application is operating normally. That is, it’s the percentage of time that it’s
correctly performing the operations expected of it. This percentage is calculated
over periods of time, such as a month, year, or trailing 3 years. Applying the
strictest possible interpretation, availability is reduced any time the application
isn’t operating normally, including both scheduled and unscheduled
interruptions. We define availability using the following criteria:
Availability | Maximum Unavailability (per year) | Application Categories
99% | 3 days 15 hours | Batch processing, data extraction, transfer, and load jobs
For example, if the MTBF is 150 days and the MTTR is 1 hour, the availability
estimate is 99.97%.
For additional details, see the referenced document on calculating your availability.
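As a quick check of the 99.97% figure above, the standard relationship between availability, mean time between failures (MTBF), and mean time to repair (MTTR) gives:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} = \frac{150 \times 24\ \text{hours}}{150 \times 24\ \text{hours} + 1\ \text{hour}} = \frac{3600}{3601} \approx 0.99972 \approx 99.97\%$$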
Another way that costs escalate in systems that operate with higher availability
design goals is in the selection of dependencies. At these higher goals, the set of
software or services that can be chosen as dependencies will diminish based on
which of these services have had the deep investments we previously described.
As the availability design goal increases, it’s typical to find fewer multi-purpose
services (such as a relational database) and more purpose-built services. This is
because the latter are easier to evaluate, test, and automate, and have a reduced
potential for surprise interactions with included but unused functionality.
Foundation - Limit Management
An important part of limit management is alerting and reporting, which enable
you to know when you hit a limit or are about to hit a limit, and then you can
react accordingly.
The default limits for cloud resources created by AWS services are documented
for each service. These limits are tracked per account, so if you use multiple
accounts, you need to know what the limits are in each account. Other limits
may be based on your configuration. Examples of these limits are number of
instances in an Auto Scaling group, provisioned IOPS, RDS storage allocated,
EBS volume allocations, network IO, available IP addresses in a subnet or VPC,
etc.
Limits are enforced per AWS Region and per AWS account. If you are planning
to deploy into multiple Regions or AWS accounts, then you should ensure that
you increase limits in the Regions and accounts that you are using. Additionally,
ensure you have sufficient buffer accounted for, such that an Availability Zone
event will not cause you to exceed your limits when requesting additional
resources while the unhealthy resources are being terminated.
AWS provides a list of some service limits via AWS Trusted Advisor, and others
are available from the AWS Management Console. The default service limits
are documented in the Service Limits documentation; if you have not tracked
your limit increase requests, you can contact AWS Support to obtain the current
limits for the services you are using. For rate limits on throttled APIs, the SDKs
provide mechanisms (retry, exponential backoff) to handle throttled responses.
You should evaluate your use cases to decide which
scheme works better for you. If you have a use case where the throttling limits
impact your application’s performance, then contact AWS Support to see if
there are mitigations or if the limit can be increased.
Ideally, limit tracking is automated. You can store what your current service
limits are in a persistent data store like Amazon DynamoDB. If you integrate
your Configuration Management Database (CMDB) or ticketing system with
AWS Support APIs, you can automate the tracking of limit increase requests
and current limits. If you integrate with a CMDB, then it is likely that you can
store the service limits within that system.
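If you choose to automate this with the AWS Support API (which requires a Business or Enterprise support plan), a minimal sketch might look like the following. The DynamoDB table name, region, and the metadata column order are assumptions for illustration, not part of the whitepaper; verify the column order against the check's metadata description before relying on it.

```python
# Minimal sketch (not the whitepaper's implementation): snapshot the Trusted
# Advisor "Service Limits" check into a DynamoDB table so limit usage can be
# tracked over time. Table name and column order are assumptions.
import boto3

support = boto3.client("support", region_name="us-east-1")  # Support API endpoint is in us-east-1
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("service-limit-snapshots")  # hypothetical table


def snapshot_service_limits():
    # Find the Trusted Advisor "Service Limits" check by name.
    checks = support.describe_trusted_advisor_checks(language="en")["checks"]
    limits_check = next(c for c in checks if c["name"] == "Service Limits")

    result = support.describe_trusted_advisor_check_result(
        checkId=limits_check["id"], language="en"
    )["result"]

    # Each flagged resource row describes one limit; the assumed column order here
    # is region, service, limit name, limit value, current usage.
    for resource in result.get("flaggedResources", []):
        region, service, limit_name, limit_value, usage = resource["metadata"][:5]
        table.put_item(Item={
            "limit_key": f"{region}#{service}#{limit_name}",
            "limit_value": limit_value,
            "current_usage": usage,
            "status": resource.get("status", "unknown"),
        })


if __name__ == "__main__":
    snapshot_service_limits()
```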
The following services can help you monitor and alarm on limits:
• Amazon CloudWatch: You can set alarms to indicate when you are
getting close to limits in network I/O, provisioned IOPS, EBS and
ephemeral volume capacity (through custom metrics), and so on. You can
also set alarms for when you are approaching the maximum capacity of
Auto Scaling groups.
• Amazon CloudWatch Logs: Metric filters can be used to search for and
extract patterns in log events. Log entries are converted to numeric
metrics, and alarms can be applied.
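As one hedged example of the CloudWatch alarm described above, the sketch below alarms when an Auto Scaling group is running near its configured maximum; it assumes group metrics collection is enabled for the group, and the group name, threshold, and SNS topic ARN are placeholders.

```python
# Sketch: alarm when an Auto Scaling group runs at or above 90% of an assumed
# maximum of 20 instances. Names, threshold, and topic ARN are illustrative only.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="asg-approaching-max-capacity",            # hypothetical name
    Namespace="AWS/AutoScaling",
    MetricName="GroupInServiceInstances",                # requires ASG group metrics enabled
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-tier-asg"}],
    Statistic="Average",
    Period=300,                                          # evaluate over 5 minutes
    EvaluationPeriods=1,
    Threshold=18,                                        # 90% of an assumed max of 20
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # placeholder topic
)
```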
Resources
Refer to the following resources to learn more about AWS best practices for
identifying limits and managing limits, and see AWS Answers for an example of
automated limit monitoring:
Video
• How do I manage my AWS service limits?
Documentation
• AWS Service Limits
• Service Limit Reports Blog Post
• Trusted Advisor FAQs
Foundation - Networking
When architecting systems using IP-address-based networks you need to plan
network topology and addressing in anticipation of future growth and
integration with other systems and their networks. You might find that your
infrastructure is limited if you don’t plan for growth, or you might have
difficulties integrating incompatible addressing structures.
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a private,
isolated section of the AWS Cloud where you can launch AWS resources in a
virtual network.
When you plan your network topology, the first step is to define the IP address
space itself. Following RFC 1918 guidelines, Classless Inter-Domain Routing
(CIDR) blocks should be allocated for each VPC. Consider doing the following
things as part of this process:
• Allow IP address space for more than one VPC per Region.
• Consider cross-account connections. For example, each line of business
might have a unique account and VPCs. These accounts should be able
to connect back to shared services.
• Within a VPC, allow space for multiple subnets that span multiple
Availability Zones.
A best practice is to always use private address ranges as identified by RFC 1918
for your VPC CIDR blocks. The range you pick should not overlap with your
existing address ranges, or with any ranges you plan to connect to using VPC
peering or VPN. In general, make sure your allocated range includes sufficient
address space for the number of subnets you need to deploy, the potential size
of Elastic Load Balancing (ELB) load balancers, the number of concurrent
Lambda invocations within your VPC, and your servers (including machine
learning servers) and containers deployed within your subnets. In general, you
should plan on deploying large VPC CIDR blocks. Note that VPC CIDR blocks
can be expanded after they are deployed by adding secondary blocks, but
allocating a large CIDR range for your VPC up front makes it easier to manage
in the long term. Subnet CIDRs cannot be changed. Keep in mind that deploying
the largest VPC possible (/16) results in over 65,000 IP addresses. The base
10.x.x.x address space means that you can use over 16,000,000 IP addresses.
You should err on the side of too large instead of too small for all these
decisions.
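To make the sizing guidance concrete, here is a minimal sketch (not from the whitepaper) that carves a /16 VPC into one /20 subnet per Availability Zone; the CIDR ranges, region, and zone names are assumptions for illustration.

```python
# Sketch: a /16 VPC with one /20 subnet per Availability Zone (4,096 addresses each),
# leaving plenty of unallocated space for future subnets, load balancers, and Lambda ENIs.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is illustrative

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

subnet_plan = {
    "us-east-1a": "10.0.0.0/20",
    "us-east-1b": "10.0.16.0/20",
}

for az, cidr in subnet_plan.items():
    ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)
```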
An additional option for inter-VPC connectivity is VPC endpoint services (AWS
PrivateLink). This enables you to use a Network Load Balancer as a private
entry point from another VPC.
You should consider the resiliency and bandwidth requirements that you need
when you select the vendor and instance size on which you need to run the
appliance. If you use a VPN appliance that is not resilient in its implementation,
then you should have a redundant connection through a second appliance. For
all these scenarios, you need to define acceptable time to recovery (TTR) and
test to ensure you can meet those requirements.
If you choose to connect your VPC to your data center via an AWS Direct
Connect connection and you need this connection to be highly available, have a
redundant connection fallback through a second AWS Direct Connect
connection from another location. If you have multiple data centers, then
ensure the connections terminate at different locations.
For maximum resiliency, you should have redundant AWS Direct Connect
connections from each data center.
If you choose to fail over to VPN over the internet using AWS Managed VPN, it
is important to understand that it supports up to 1.25 Gbps throughput per VPN
tunnel, but does not support Equal Cost Multi Path (ECMP) for egress in the
case of multiple AWS Managed VPN tunnels terminating on the same VGW. We
do not recommend customers use AWS Managed VPN as a backup for AWS
Direct Connect connections with speeds greater than 1 Gbps. Indeed, even with
multiple AWS Direct Connect connections, you need to ensure that the failure of
one network connection does not overwhelm and degrade the redundant
connections.
You should use existing standards for protecting your resources within this
private address space. A subnet or set of subnets (one per Availability Zone)
should be used as a barrier between the internet and your applications. In an
on-premises environment, you often have firewalls with features to deflect
common web attacks, and you often have load balancers or edge routers to
deflect DoS attacks, such as SYN floods. AWS provides many services that can
provide this functionality, such as AWS Shield and AWS Shield Advanced, an
integrated web application firewall (AWS WAF) deployed on Amazon
CloudFront and on ELB, ELB itself, and features of AWS virtual networking like
VPC security groups and network access control lists (ACLs). You can augment
these features with virtual appliances from AWS Partners and the AWS
Marketplace to meet your needs.
Videos
• Advanced VPC Design and New Capabilities for Amazon VPC (NET305)
Application Design for High Availability
Note: If the topic of Reliability is new to you, or if you’re new to AWS, check out
the Automate Deployments to Eliminate Impact (Change Management) and
Recovery Oriented Computing (Failure Management) sections later in this
whitepaper. These sections will cover some concepts that will be helpful as you
read this section.
When designing a new application, it’s common to assume that it must be “five
nines” (99.999%) available without appreciating the true cost to build and
operate applications at that level of availability. Doing so requires that every
piece of networking and infrastructure from end customers to the service
application, data stores, and all other components must be built and operated to
achieve 99.999%. As just one example, most internet service providers aren’t
built to achieve five nines of availability. Therefore, multiple service providers
(with no common point of failure) are required for the application to be
99.999% available to a specific end customer.
In addition, the application and all of its dependencies need to be built and
tested to this level of availability, also avoiding single points of failure. This will
require extensive custom development, because many software libraries and
systems are not built to five nines availability. The whole system will require
exhaustive testing for failure triggers. Because 99.999% availability provides for
less than 5 minutes of downtime per year, every operation performed to the
system in production will need to be automated and tested with the same level
of care. With a 5 minute per year budget, human judgment and action is
completely off the table for failure recovery. The system must automatically
recover under every situation. Therefore, the production environment will, by
necessity, be operated almost entirely through automation.
Applications that truly require 99.999% availability can be built on AWS, as the
following example illustrates.
The devices themselves are not available all of the time, nor is the network
connectivity of any single device, so you deploy a large number of them to
enable a customer to easily use a different device if one is down or lacks
connectivity. In “availability-speak”, they are redundant and fail independently.
The host processor is actually at least two computers and storage that are
deployed across independent AWS Regions, with synchronous replication
between the Regions. The host processors have redundant connections to the
merchant bank and banking networks, and the host processors have standby
copies in an independent location. When cash is required, the host processor
requests an Electronic Funds Transfer to take the money from the customer’s
account and put it in the host processor account. After it has the funds
transferred, it will send a signal to the ATM to dispense the money. It then
initiates an Automated Clearing House (ACH) transfer of those funds to the
merchant bank’s account over the redundant connectivity to the merchant bank.
This will reimburse the merchant bank for the funds dispensed. If a problem
occurs with the connectivity between the ATM and the host processor, or with
the ATM itself, after the funds have been transferred to the host processor, the
host processor can’t tell the ATM to dispense the cash, so it transfers the funds
back to the customer’s account. The ATM will time out the transaction and then
mark itself out of service.
Multiple AWS Direct Connect connections will be established from the host
processors on AWS to the merchant bank. The connectivity from the ATM
machines will also be run through redundant AWS Direct Connect providers to
the host processor. If the redundant connectivity to the merchant bank is
severed from both host processors, they will durably store the request for the
ACH transfer until the connection is restored, while marking the ATMs that it
operates out of service.
Category | Description
Data induced ("poison pill") | An input or entry is accepted by the system that it can't process
Identifier exhaustion | Exceeding available capacity, a throttling limit was hit, an ID ran out, or a resource that is vended to customers is no longer available
Upon deep analysis, the work involved in achieving and maintaining high
availability applications seems daunting. That often leads to a more refined
definition and prioritization of requirements:
• What are the most valuable transactions to your customers and to your
business?
• Can you predict when certain geographies have the greatest demand?
• What times of day, week, month, quarter, or year are your peak times?
The good news is that AWS provides numerous services that abstract the work
required to do many of these things, provided the right services are chosen to
meet your availability design goals. Defining the right design goal for availability
will start you down the path of how to implement your application and identify
what the additional costs are to achieve that availability. The remainder of this
whitepaper will help you think about selecting the right design goal, and then
choosing AWS services, building software, and constructing operational
practices that meet your goals.
Recommendation
Critically evaluate the unique aspects of your applications and, where appropriate,
differentiate the availability design goals to reflect the needs of your business.
Within AWS, we commonly divide services into the “data plane” and the
“control plane.” The data plane is responsible for delivering real time service
while control planes are used to configure the environment. For example,
Amazon EC2 instances, Amazon RDS databases, and Amazon DynamoDB table
read/write operations are all data plane operations. In contrast, launching new
EC2 instances or RDS databases, or adding or changing table metadata in
DynamoDB, are all considered control plane operations. While high levels of
availability are important for all of these capabilities, the data planes typically
have higher availability design goals than the control planes.
Availability design goals are then tailored to the different aspects, and the
appropriate work efforts are executed to engineer the system. AWS has
significant experience engineering applications with a range of availability
design goals, including services with 99.999% or greater availability. AWS
Solution Architects (SAs) can help you design appropriately for your availability
goals. Involving AWS early in your design process improves our ability to help
you meet your availability goals. Planning for availability is not only done before
your workload launches. It is done continuously to refine your design as you
gain operational experience, learn from real world events, and endure failures of
different types. You can then apply the appropriate work effort to improve upon
your implementation.
AWS also employs the fault isolation construct of Availability Zones (AZs). AWS
Regions are composed of two or more Availability Zones that are designed to be
independent. Each Availability Zone is separated by a large physical distance
from other zones to avoid correlated failure scenarios due to environmental
hazards like fires, floods, and tornadoes. Each Availability Zone has
independent physical infrastructure: dedicated connections to utility power,
standalone backup power sources, independent mechanical services, and
independent network connectivity within and beyond the Availability Zone.
Despite being geographically separated, Availability Zones are located in the
same regional area. This enables synchronous data replication (for example,
between databases) without undue impact on application latency. This allows
customers to use Availability Zones in an active/active or active/standby
configuration.
While AWS control planes typically provide the ability to manage resources
within the entire Region (multiple Availability Zones), certain control planes
(including Amazon EC2 and Amazon EBS) have the ability to filter results to a
single Availability Zone. When this is done, the request is processed only in the
specified Availability Zone, reducing exposure to disruption in other Availability
Zones.
Recommendation
When your application relies on the availability of control plane APIs during a disruption of
one Availability Zone, use API filters to request results for a single Availability Zone with
each API request (for example, with DescribeInstances.)
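As a hedged illustration of the recommendation above, the following sketch scopes a DescribeInstances call to a single Availability Zone using the standard availability-zone filter; the region and zone name are placeholders.

```python
# Sketch: restrict DescribeInstances to one Availability Zone so the request does
# not depend on results from other zones during a disruption.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is illustrative

response = ec2.describe_instances(
    Filters=[{"Name": "availability-zone", "Values": ["us-east-1a"]}]  # placeholder AZ
)

for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])
```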
The most pervasive fault isolation construct is that of the AWS Region. Regions
are designed to be autonomous, with dedicated copies of services deployed in
each Region. Regional AWS services internally use multiple Availability Zones
in an active/active configuration to achieve the availability design goals we
establish.
Redundant Components
One of the bedrock principles for service design in AWS is the avoidance of
single points of failure in underlying physical infrastructure. This motivates us
to build software and systems that use multiple Availability Zones and are
resilient to failure of a single zone. Similarly, systems are built to be resilient to
failure of a single compute node, single storage volume, or single instance of a
database.
Micro-Service Architecture
At AWS, we have built our systems using a concept called micro-services. While
micro-services have several attractive qualities, the most important benefit for
availability is that micro-services are smaller and simpler. They allow you to
differentiate the availability required of different services, and thereby focus
investments more specifically to the micro-services that have the greatest
availability needs. For example, to deliver product information pages on
Amazon.com (“detail pages”), hundreds of micro-services are invoked to build
discrete portions of the page. While there are a few services that must be
available to provide the price and the product details, the vast majority of
content on the page can simply be excluded if the service isn’t available. Even
such things as photos and reviews are actually not required to provide an
experience where a customer can buy a product.
Micro-services commonly have the following characteristics:
• The service has a concise business problem to be served and a small team
that owns the business problem. This allows for better organizational
scaling.
• The team can deploy at any time as long as they meet their API and other
“contract” requirements.
• The team can use any technology stack they want to as long as they meet
their API and other “contract” requirements.
Recommendation
Isolate discrete functionality into services with a “contract” (API and performance
expectations).
Recovery-Oriented Computing
Complementing the AWS focus on building fault isolation zones and avoiding
single points of failure, we also work to minimize the disruption time when
failures do occur. Since impact duration is a primary input to calculating
availability, reducing recovery time has a direct impact on improving
availability. Recovery-Oriented Computing (ROC) is the term applied to
systematic approaches to improving recovery.
Applying ROC, we would use the same recovery approach for both a network
timeout and a dependency failure where the dependency returns an error. Both
events have a similar effect on the system, so rather than attempting to make
either event a “special case”, ROC would apply a similar strategy of retrying
with exponential backoff. Another example is the use of Auto Scaling with Amazon
Elastic Compute Cloud (Amazon EC2) to manage fleet capacity. An instance
may fail due to hardware failure, operating system bug, memory leak, or other
causes. Rather than building custom remediation for each, treat any as an
instance failure, terminate the instance, and allow Auto Scaling to replace the
instance.
A pattern to avoid is developing recovery paths that are rarely executed. For
example, you might have a secondary data store that is used for read-only
queries. When you write to a data store and the primary fails, you might want to
fail over to the secondary data store. If you don’t frequently test this failover,
you might find that your assumptions about the capabilities of the secondary are
incorrect. The capacity of the secondary data store, which might have been
sufficient when you last tested, may no longer be able to tolerate the load
under this scenario. Our experience has shown that the only error recovery that
works is the path you test frequently. This is why having a small number of
recovery paths is best. You can establish recovery patterns and regularly test
them. If you have a complex or critical recovery path, you still need to regularly
exercise that failure in production to convince yourself that the recovery path
works. In the example we just discussed, you should fail over to the standby
regularly, regardless of need.
AWS approaches the design of our services with fault recovery in mind. We
design services to minimize the time to recover from failures and impact on
data. Our services primarily use data stores that acknowledge requests only
after they are durably stored across multiple replicas. These services and
resources include Amazon Aurora, Amazon Relational Database Service
(Amazon RDS) Multi-AZ DB instances, Amazon S3, Amazon DynamoDB,
Amazon Simple Queue Service (Amazon SQS), and Amazon Elastic File System
(Amazon EFS). They are constructed to use cell-based isolation and use the
independence of Availability Zones. We use automation extensively in our
operational procedures. We also optimize our replace-and-restart functionality
to recover quickly from interruptions.
Retry with exponential backoff: This is the invoking side of the throttling
pattern we just discussed. AWS SDKs implement this by default, and it can be
configured. The pattern is to pause and then retry at a later time. If the call fails
again, pause longer and retry. This increase in pause time is often called
“backing off.” After a configured number of attempts or amount of elapsed time,
the client quits retrying and returns a failure.
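A minimal, library-agnostic sketch of this pattern is shown below; the retry counts, base delay, and jitter are illustrative choices, not values from the whitepaper.

```python
# Sketch: retry a call with exponential backoff and jitter, giving up after a
# fixed number of attempts. The operation here is a stand-in for any throttled call.
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=20.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff with full jitter: sleep a random amount up to the cap.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))


# Example usage with a hypothetical flaky operation:
if __name__ == "__main__":
    def flaky():
        if random.random() < 0.7:
            raise RuntimeError("throttled")
        return "ok"

    print(call_with_backoff(flaky))
```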
Fail fast: Simply return an error as soon as possible. This will allow releasing
of resources associated with requests and can often allow a service to recover if
it is running out of resources. It’s preferable to fail fast rather than allowing
requests to be queued. Queues can be created at multiple levels of a system, and
can seriously impede the ability to recover quickly. Be very mindful of places
where queues exist (they often hide in workflows or in work that’s recorded to a
database).
Constant work: Systems can fail when there are rapid changes in load. If you
know your service needs to process 100 units of work per second at peak,
then you should design and tune your system for 100 units of work. If there is
work to be done, it takes one of the slots. If not, you put “filler” work in the
slot so you know you’re always doing 100 units per second. An example is a
video player that plays data in a buffer. If you have no work to perform because
no new data has come in, you may be able to process the same data you last
received again and effectively render the same video frame, performing the
same work.
configuration state of the entire system. This would add unexpected load to
another component, which might then cause it to fail, triggering other
unexpected consequences. We refer to this as “bi-modal” behavior, because the
system has different behavior under normal and failure modes. To counteract
this behavior, we prefer building systems that are statically stable and operate in
only one mode. They maintain enough internal state to continue operating as
they were before the failure without adding additional load to the system. These
systems may end up performing less work during certain failures (which is
desirable). Another example of this type of system is one that uses Amazon EC2
for instance capacity. Systems often assume that if an instance or Availability
Zone fails, they will respond by simply launching new instances. However, this
approach means that during failure, the system will be doing significantly
different work than usual. Instead, we recommend using Elastic Load Balancing
or Amazon Route 53 health checks to shift load away from failed instances, and
using Auto Scaling to asynchronously replace them.
Testing is an important part of the delivery pipeline. Aside from common unit
tests and functional tests that are performed at component levels, it is
important to perform sustained load testing. Load tests should discover the
breaking point of your workload, test performance, and perform fault injection
testing. In addition, your monitoring service must be able to add or remove
monitoring of capabilities that are added or deprecated. AWS finds it useful to
perform operational readiness reviews that evaluate the completeness of the
testing, the ability to monitor, and, importantly, the ability to audit the
application’s performance against its SLAs and to provide data in the event of
an interruption or other operational anomaly.
Recommendation
Although conventional wisdom suggests that you keep humans in the loop for the most
difficult operational procedures, we suggest that you automate the most difficult procedures
for that very reason.
• Canary deployment
• Blue-Green deployment
• Feature toggles
• Failure isolation zone deployments
Blue-Green deployments are similar to the canary deployment, except that a full
fleet of the application is deployed in parallel. You alternate your deployments
across the two stacks (blue and green). Once again, you can send traffic to the
new version and fail back to the old version if you see problems with the
deployment. You can also send a fraction of your traffic to each version to dial up
the adoption of the new version. AWS CodeDeploy can be configured with a
deployment configuration that enables a blue-green deployment.
Feature toggles are configuration options on an application. You can deploy the
software with a feature turned off, so that customers don’t see the feature. You
can then turn the feature on, as you’d do for a canary deployment, or you can set
the change pace to 100% to see the effect. If the deployment has problems, you
can simply turn the feature back off without rolling back.
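A minimal sketch of the feature-toggle idea follows; the flag store and flag name are assumptions for illustration (in practice the flag might live in a configuration service, a parameter store, or an environment variable).

```python
# Sketch: gate a new code path behind a feature flag so it can be enabled gradually
# and turned off instantly without a redeployment.
import os


def feature_enabled(flag_name: str) -> bool:
    # Reads flags from environment variables for simplicity; a real system might
    # read from a configuration service so flags can change without restarts.
    return os.environ.get(flag_name, "off").lower() == "on"


def render_product_page(product_id: str) -> str:
    if feature_enabled("NEW_RECOMMENDATIONS_WIDGET"):   # hypothetical flag name
        return f"product {product_id} with new recommendations widget"
    return f"product {product_id} with classic layout"


if __name__ == "__main__":
    print(render_product_page("B000123"))
```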
One of the most important rules AWS has established for its own deployments
is to avoid touching multiple Availability Zones within a Region at the same
time. This is critical to ensuring that Availability Zones are independent for
purposes of our availability calculations. We recommend that you use similar
considerations in your deployments.
Testing
The testing effort should be commensurate with your availability goals. Your
application’s resiliency to transient failures of dependencies should be tested for
durations that may last from less than a second to hours. Testing to ensure that
you can meet your availability goals is the only way you can have confidence
that you will meet those goals. Our experience is that canary testing that can run
constantly and simulate customer behavior is among the most important testing
processes. You should unit test, load test, performance test, and simulate your
failure modes while under these tests. Don’t forget to test for external
dependency unavailability, and deployment failures. Achieving very high
availability requires implementing fault tolerant software patterns, and testing
that they are effective.
Netflix has provided some example open source software that can be a basis for
this type of testing. You can use their software or develop your own for
simulating failure modes. For simulating conditions that might produce
brownouts, you can use extensions to common proxies to introduce latency,
dropped messages, etc., or you can create your own.
Monitoring consists of five distinct phases:
1. Generation
2. Aggregation
3. Real-time processing and alarming
4. Storage
5. Analytics
Generation
First, determine which services and/or applications require monitoring, define
important metrics and how to extract them from log entries if necessary, and
finally create thresholds and corresponding alarm events. AWS makes an
abundance of monitoring and log information available for consumption, which
can be used to define change-in-demand processes. The following is just a
partial list of services and features that generate log and metric data.
• Amazon ECS and AWS Lambda stream log data to CloudWatch Logs.
• Amazon Machine Learning (Amazon ML), Amazon Rekognition,
Amazon Lex, and Amazon Polly provide metrics for successful and
unsuccessful requests.
• AWS IoT provides metrics for number of rule executions as well as
specific success and failure metrics around the rules.
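Tying the generation step back to the extraction of metrics from log entries described above, here is a hedged sketch that turns application error log lines into a CloudWatch metric and alarms on it; the log group, namespace, and threshold are placeholders, not values from the whitepaper.

```python
# Sketch: create a metric filter that counts "ERROR" lines in a log group, then
# alarm when errors exceed a threshold. All names and values are illustrative.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

logs.put_metric_filter(
    logGroupName="/my-app/application",                  # hypothetical log group
    filterName="application-errors",
    filterPattern="ERROR",
    metricTransformations=[{
        "metricName": "ApplicationErrors",
        "metricNamespace": "MyApp",                       # hypothetical namespace
        "metricValue": "1",
    }],
)

cloudwatch.put_metric_alarm(
    AlarmName="application-error-rate",
    Namespace="MyApp",
    MetricName="ApplicationErrors",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,                                         # illustrative threshold
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",                      # no errors logged means healthy
)
```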
There are a number of tools provided by partners and third parties that allow
for aggregation, processing, storage, and analytics. Some of these tools are New
Relic, Splunk, Loggly, Logstash, CloudHealth, and Nagios. However, generation
outside of system and application logs is unique to each cloud provider, and
often unique to each service.
• Amazon S3: Acts as the storage layer, and allows for lifecycle policies
and data management.
• Amazon EMR: Use this service to gain further insight into log and
metric data.
• Amazon Athena: Use this service to gain further insight into data that
is stored in supported formats.
Recommendation
Conduct an Operational Readiness Review (ORR) for applications prior to initial production
use, and periodically thereafter.
Auditing
Auditing your monitoring will ensure that you know when an application is
meeting its availability goals. Root Cause Analyses require the ability to discover
what happened when failures occur. AWS provides services that allow you to
track the state of your services during an incident:
• Amazon CloudWatch Logs: You can store your logs in this service
and inspect their contents.
• AWS Config: You can see what AWS infrastructure was in use at points
in time.
• AWS CloudTrail: You can see which AWS APIs were invoked at what
time and by what principal.
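As a hedged illustration of the CloudTrail point above, the sketch below looks up recent RunInstances calls to see who launched instances during an incident window; the event name and time window are placeholders.

```python
# Sketch: query CloudTrail for recent RunInstances API calls and print who made them.
from datetime import datetime, timedelta

import boto3

cloudtrail = boto3.client("cloudtrail")

response = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "RunInstances"}],
    StartTime=datetime.utcnow() - timedelta(hours=24),   # illustrative window
    EndTime=datetime.utcnow(),
)

for event in response["Events"]:
    print(event["EventTime"], event.get("Username", "unknown"), event["EventName"])
```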
At AWS we conduct a weekly meeting to review operational performance and to
share learnings between teams. Establishing a regular cadence for operational
performance reviews and knowledge sharing will enhance your ability to achieve
higher performance from your operational teams.
• Use monitoring
• Deploy changes
• Back up data
• Implement resiliency
• Test resiliency
Dependency Selection
We have chosen to use Amazon EC2 for our applications. We will show how
using Amazon RDS and multiple Availability Zones improves the availability of
our applications. We will use Amazon Route 53 for DNS. When we use multiple
Availability Zones, we will use Elastic Load Balancing. Amazon S3 is used for
backups and static content. As we design for higher reliability, we can only
adopt services with higher availability themselves. See the Appendix for the
design goals for the respective services.
Single Region Scenarios
2 9s (99%) Scenario
These applications can be deployed with one Region and one Availability Zone.
We will deploy the software, including the database, to a single instance. We will
use a vendor or purpose built backup solution to send encrypted backup data to
Amazon S3 using a runbook. We will test that the backups work by restoring
and ensuring the ability to use them on a regular basis using a runbook. We will
configure versioning on our Amazon S3 objects and remove permissions for
deletion of the backups. We will use an Amazon S3 bucket lifecycle policy to
archive or permanently delete according to our requirements. We will use AWS
CloudFormation to define our infrastructure as code, and specifically to speed
up reconstruction in the event of a failure. During failures we will wait for the
failure to finish, optionally routing requests to a static website using DNS
modification via a runbook. The recovery time for this will be determined by the
speed at which the infrastructure can be deployed and the database can be
restored to the most recent backup. This deployment can either be into the same
Availability Zone, or into a different Availability Zone in the event of an
Availability Zone failure using a runbook. The deployment pipeline of new
software is scheduled, with some unit testing, but mostly white-box/black-box
testing of the assembled system.
Let’s look at the implications of this recovery time for availability. We take 30
minutes to understand the situation and decide to execute recovery, deploy the
whole stack with AWS CloudFormation in 10 minutes, assume that we deploy to
a new Availability Zone, and assume the database can be restored in 30 minutes.
This implies that it takes about 70 minutes to recover from a failure. Assuming
one failure per quarter, our estimated impact time for the year is 280 minutes,
or four hours and 40 minutes.
This means the upper limit on availability is 99.9%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we require the application to be
offline for updates (estimating 24 hours per year: four hours per change, six
times per year), plus actual events. So referring to the table on application
availability earlier in the whitepaper we see that our availability design goal
is 99%.
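To make the arithmetic behind these two numbers explicit, here is a worked sketch; it assumes a 365-day year (525,600 minutes) and maps the resulting downtime back onto the availability bands described earlier, so the specific percentages below are computed from the scenario's estimates rather than taken from the whitepaper's table.

$$
\begin{aligned}
\text{Unplanned downtime} &= 4 \times 70\ \text{min} = 280\ \text{min/year} \\
\text{Availability ceiling} &= 1 - \tfrac{280}{525{,}600} \approx 99.947\% \quad (\text{short of } 99.95\%, \text{ so the } 99.9\% \text{ tier}) \\
\text{Total with planned downtime} &= 280 + (6 \times 240) = 1{,}720\ \text{min/year} \\
\text{Resulting availability} &= 1 - \tfrac{1{,}720}{525{,}600} \approx 99.67\% \quad (\text{so the design goal is set at } 99\%)
\end{aligned}
$$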
3 9s (99.9%) Scenario
The next availability goal is for applications for which it is important to be
highly available, but they can tolerate short periods of unavailability. This type
of application is typically used for internal operational systems that have an
effect on employees when they are down. This type of application can also be
used for customer-facing systems that are not high revenue for the business and
can tolerate a longer recovery time or recovery point. Such applications include
an administrative system for account or information management.
We can improve availability for applications by using two Availability Zones for
our deployment and by separating the applications to separate tiers. We will use
services that work across multiple Availability Zones, such as Elastic Load
Balancing, Auto Scaling and Amazon RDS Multi-AZ with encrypted storage via
AWS Key Management Service. This will ensure tolerance to failures on the
resource level and on the Availability Zone level. Backup and restore can be
done using Amazon RDS. It will be executed regularly using a runbook to ensure
that we can meet recovery requirements.
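As a hedged sketch of the Multi-AZ, encrypted database described above (instance identifier, class, engine, sizes, and credential handling are all placeholders, not values from the whitepaper):

```python
# Sketch: create an RDS instance with Multi-AZ failover and encrypted storage.
# Identifiers, sizes, and credential handling are illustrative only.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",               # hypothetical identifier
    Engine="mysql",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,                         # GiB
    MultiAZ=True,                                 # standby replica in a second AZ
    StorageEncrypted=True,                        # uses the default KMS key unless KmsKeyId is set
    MasterUsername="admin",
    MasterUserPassword="replace-with-a-secret",   # fetch from a secrets manager in practice
    BackupRetentionPeriod=7,                      # daily automated backups kept for 7 days
)
```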
The infrastructure deployment technologies remain the same. The load balancer
will only route traffic to healthy application instances. The health check needs to
be at the data plane/application layer, indicating the capability of the application
on the instance. This check should not be against the control plane. A health
check URL for the web application will be present and configured for use by the
load balancer and Auto Scaling, so that instances that fail are removed and
replaced. Amazon RDS will fail the active database engine over to the second
Availability Zone if the instance in the primary Availability Zone fails, and then
repair it to restore the same resiliency.
Runbooks exist for total system recovery and common reporting. We will have
playbooks for common database problems, security-related incidents, failed
deployments, and establishing root cause of problems. After the root cause has
been identified, the correction for the error will be identified by a combination
of the operations and development teams. The correction will be deployed when
the fix is developed.
Let’s look at the implications of this recovery time for availability. We assume
that at least some failures will require a manual decision to execute recovery.
However, with greater automation in this scenario, we assume only two events
per year will require this decision. We take 30 minutes to decide to execute
recovery, and assume recovery is completed within 30 minutes. This implies 60
minutes to recover from a failure. Assuming two incidents per year, our
estimated impact time for the year is 120 minutes.
This means the upper limit on availability is 99.95%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we require the application to be
briefly offline for updates, but these updates are automated. We estimate 150
minutes per year for this: 15 minutes per change, 10 times per year. This adds
up to 270 minutes per year when the service is not available, so our
availability design goal is 99.9%.
4 9s (99.99%) Scenario
This availability goal requires the application to be highly available and tolerant
to component failures. The application must be able to absorb failures without
needing to procure additional resources. This availability goal is for mission
critical applications that are main or significant revenue drivers for a
corporation, such as an e-commerce site, a business-to-business web service, or
a high-traffic content/media site.
causing the trend. Playbooks will exist for undiscovered failure modes and
security incidents. Playbooks will also exist for establishing the root cause of
failures. We will practice our failure recovery procedures constantly through
game days, using runbooks to ensure that we can perform the tasks and not
deviate from the procedures. The team that builds the website also operates the
website. That team will identify the correction of error of any unexpected failure
and prioritize the fix to be deployed after it is implemented. We will also engage
with AWS Support for the Infrastructure Event Management offering.
Let’s look at the implications of this recovery time for availability. We assume
that at least some failures will require a manual decision to execute recovery.
However, with greater automation in this scenario, we assume only two events
per year will require this decision, and the recovery actions will be rapid. We
take 10 minutes to decide to execute recovery, and assume recovery is
completed within five minutes. This implies 15 minutes to recover from a
failure. Assuming two failures per year, our estimated impact time for the year
is 30 minutes.
This means the upper limit on availability is 99.99%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we assume the application is
online continuously through updates. Based on this, our availability design
goal is 99.99%.
Multi-Region Scenarios
Implementing our application in multiple AWS Regions will increase the cost of
operation, partly because we isolate regions to maintain their independence. It
should be a very thoughtful decision to pursue this path. That said, regions
provide a very strong isolation boundary and we take great pains to avoid
correlated failures across regions. Using multiple regions will give you greater
control over your recovery time in the event of a hard dependency failure on a
regional AWS service. In this section, we’ll discuss various implementation
patterns and their typical availability.
Let’s look at the implications of this recovery time for availability. We assume
that at least some failures will require a manual decision to execute recovery.
However, with good automation in this scenario, we assume only two events
per year will require this decision. We take 20 minutes to decide to execute
recovery, and assume recovery is completed within 10 minutes. This implies 30
minutes to recover from a failure. Assuming two failures per year, our estimated
impact time for the year is 60 minutes.
This means the upper limit on availability is 99.95%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we assume the application is
online continuously through updates. Based on this, our availability design
goal is 99.95%.
data replication layers. Amazon Route 53 offers health checks at 10-second
intervals and also offers TTLs on your record sets as low as one second.
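As a hedged sketch of those Route 53 capabilities (the domain name, hosted zone ID, health check path, and IP address are placeholders, and only the PRIMARY record is shown; a matching SECONDARY record in the other Region would be defined the same way):

```python
# Sketch: a fast (10-second) health check plus a low-TTL PRIMARY failover record,
# so traffic can shift to the secondary Region quickly when the primary fails.
import uuid

import boto3

route53 = boto3.client("route53")

health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "app.example.com",   # placeholder endpoint
        "ResourcePath": "/healthz",                      # placeholder health URL
        "RequestInterval": 10,                           # fast health checks
        "FailureThreshold": 3,
    },
)["HealthCheck"]

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",                          # placeholder hosted zone
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com.",
            "Type": "A",
            "SetIdentifier": "primary-region",
            "Failover": "PRIMARY",
            "TTL": 1,                                    # very low TTL for fast failover
            "HealthCheckId": health_check["Id"],
            "ResourceRecords": [{"Value": "203.0.113.10"}],  # placeholder IP
        },
    }]},
)
```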
The deployment pipeline will have a full test suite, including performance, load,
and failure injection testing. We will deploy updates using canary or blue-green
deployments to one isolation zone at a time, in one Region before starting in the
other. During the deployment, the old versions will be kept running to facilitate
a faster rollback. Deployments are fully automated, including a rollback if KPIs
indicate a problem. Monitoring will include success metrics as well as alerting
when problems occur.
Let’s look at the implications of this recovery time for availability. We assume
that heavy investments are made to automate all recovery, and that recovery can
be completed within one minute. We assume no manually triggered recoveries,
but up to one automated recovery action per quarter. This implies four minutes
of impact per year.
Conclusion
Whether you are new to the topics of availability and reliability or a seasoned
veteran seeking insights to maximize your mission critical service’s availability,
we hope this section has triggered your thinking, offered a new idea, or
introduced a new line of questioning. We hope this leads to a deeper
understanding of the right level of availability based on the needs of your
business. We encourage you to take advantage of the design, operational, and
recovery-oriented recommendations offered here as well as the knowledge and
experience of our AWS Solution Architects. We’d love to hear from you –
especially about your success stories achieving high levels of availability on
AWS. Contact your account team or use Contact Us via our website.
Resources
Refer to the following resources to learn more about AWS best practices in this
area.
Documentation:
• AMI Copies
• Canary Release
• Feature Toggles
• Microservice Trade-Offs
Contributors
The following individuals and organizations contributed to this document:
Document Revisions
Date | Description
September 2018 | Added specific AWS Direct Connect networking recommendations and additional service design goals.
June 2018 | Added Design Principles and Limit Management sections. Updated links, removed ambiguity of upstream/downstream terminology, and added explicit references to the remaining Reliability Pillar topics in the availability scenarios.
March 2018 | Changed DynamoDB Cross-Region solution to DynamoDB Global Tables; added service design goals.
November 2017 | Updated to provide guidance on high availability designs, including concepts, best practices, and example implementations.
Appendix A: Designed-For Availability for Select AWS Services
NOTE: The numbers in the table below do not refer to durability (long-term
retention of data); they are availability numbers (access to data or functions).