
Well Architected Framework - Whitepaper

Security
● Design Principles
○ Implement principle of least privilege and enforce separation of duties. Centralize
privilege management and reduce / eliminate reliance on long term credentials.
○ Apply security at all layers, e.g. VPC, subnet, load balancer, ACLs, etc.
○ Enable traceability. Monitor, alert, audit actions and changes to your environment
in real time.
○ Protect data at rest and in transit.
○ Keep people away from data. Reduce or eliminate need for direct access or
manual processing of data.
○ Automate responses to security events
○ Focus on securing your system
○ Automate security best practices. Improve your ability to securely scale more
rapidly and cost-effectively. Create secure architectures.
● Shared responsibility model
○ Customer is responsible for security in the cloud
■ Platforms, applications, identity and access management
■ OS, network & firewall
○ AWS is responsible for security of the cloud
● Definition
○ Security consists of six areas
■ Identity Access Management
● Ensure only authorized and authenticated users are able to
access resources, only in manner you intend. Define principles,
build out policies aligned with these principles, and implement
strong credential management.
● Implement least privilege access system
● Credentials must not be shared.
● How do you manage credentials and authentications?
○ Includes passwords, tokens and keys that grant access
directly or indirectly. Protect credentials with appropriate
mechanisms.
○ Define identity and access management requirements
○ Secure AWS root user - use MFA and no access keys
○ Enforce use of multi-factor authentication
○ Automate enforcement of access controls
○ Integrate with centralized federation provider - reduces
requirement for multiple credentials and reduces
management complexity
○ Enforce password requirements
○ Rotate credentials regularly
○ Audit credentials periodically (see the sketch below)
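Credential auditing can be automated with the AWS SDK. Below is a minimal sketch in Python (boto3) that flags active IAM access keys older than an assumed 90-day rotation window; pagination and automated disabling are left out for brevity.

```python
# Sketch: flag IAM access keys overdue for rotation (assumes a 90-day policy).
from datetime import datetime, timezone

import boto3

MAX_AGE_DAYS = 90  # assumed rotation window
iam = boto3.client("iam")

for user in iam.list_users()["Users"]:  # pagination omitted for brevity
    for key in iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]:
        age = (datetime.now(timezone.utc) - key["CreateDate"]).days
        if key["Status"] == "Active" and age > MAX_AGE_DAYS:
            print(f"{user['UserName']}: key {key['AccessKeyId']} is {age} days old")
```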
● How do you control human access?
○ Implement controls inline with defined business
requirements, reduce risk and lower impact of
unauthorized access.
○ Define human access requirements - define access
requirements for users based on job function
○ Grant least privileges
○ Allocate unique credentials for each individual
○ Manage credentials based on user lifecycles
○ Automate credential management - automate management
to enforce least privilege and disable unused credentials
○ Grant access through roles or federation - use IAM roles
instead of users
● How do you control programmatic access?
○ Control access with appropriately defined, limited and
segregated access to reduce risk of unauthorized access.
Programmatic access includes access that is internal to
your workload.
○ Define programmatic access requirements
○ Grant least privileges
○ Automate credential management
○ Allocate unique credentials for each component - credentials are not
shared between components, e.g. different IAM roles for Lambda and EC2
○ Grant access through roles or federation
○ Implement dynamic authentication (see the sketch below)
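Dynamic authentication for components usually means exchanging a role for short-lived credentials instead of embedding long-term keys. A minimal boto3/STS sketch, assuming a hypothetical role ARN:

```python
# Sketch: obtain short-lived credentials by assuming a dedicated IAM role.
import boto3

sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-worker",  # hypothetical role
    RoleSessionName="worker-session",
    DurationSeconds=900,  # shortest allowed session lifetime
)
creds = resp["Credentials"]  # temporary key, secret, and session token

# Use the temporary credentials for scoped access to a service.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```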
● IAM enables you to securely control access to AWS services and
resources. MFA adds additional layer of protection on user
access. AWS Organizations lets you centrally manage and
enforce policies for multiple accounts.
■ Data protection
● You should organize and classify your data into segments, such as
public or private, accessible only to members of the organization or
to certain members.
● Categorize organizational data based on levels of sensitivity, and
encryption protects data by way of rendering it unintelligible to
unauthorized access.
● Following practices facilitate protection of data:
○ Customer maintains full control over data
○ AWS makes it easier to encrypt data and manage keys,
including key rotation which can be automated
○ Detailed logging that contains important content is available
○ AWS has designed storage for exceptional resilience.
○ Versioning protects data from accidental deletion
○ AWS never initiates movement of data between Regions unless you
grant permission.
● Encrypt everything where possible, at rest or in transit
● How do you classify your data?
○ Classification helps you determine appropriate protection
and retention controls
○ Define data classification requirements
○ Define data protection controls
○ Implement data identification - classify data with identifiable
indicators, e.g. tags and objects that classify the data
○ Automate identification and classification
○ Identify the types of data
● How are you encrypting your data at rest?
○ Protect data at rest by defining requirements and
implementing controls
○ Define data management and protection at rest
requirements - such as encryption and data retention to
meet legal and compliance
○ Implement secure key management - encryption keys must be stored
securely and rotated with strict access control, e.g. using AWS KMS.
Consider using different keys for segregation of different data
classifications
○ Enforce encryption at rest (see the sketch after this list)
○ Enforce access control
○ Provide mechanisms to keep people away from data - e.g. provide a
dashboard instead of direct access to a data store, and provide
tools to indirectly manage the data
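One way to enforce encryption at rest is to give an S3 bucket default SSE-KMS encryption, so new objects are encrypted even when the uploader does not request it. A minimal boto3 sketch; the bucket name and key alias are placeholders:

```python
# Sketch: set SSE-KMS as the default encryption for a bucket.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="example-data-bucket",  # hypothetical bucket
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "alias/example-data-key",  # hypothetical KMS alias
            }
        }]
    },
)
```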
● How are you encrypting your data in transit?
○ Protect data in transit by defining requirements and
implementing controls
○ Define data protection in transit requirements - best
practice is to encrypt and authenticate all traffic, and to
enforce latest standards and ciphers
○ Implement secure key and certificate management
○ Enforce encryption in transit (see the sketch after this list)
○ Automate detection of data leaks - e.g. detect a database
system that is copying data to an unknown host
○ Authenticate network communications - use protocols like
Transport Layer Security (TLS) or IPsec to reduce the risk of data
tampering
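Encryption in transit can be enforced on S3 with a bucket policy that denies any request not made over TLS, via the aws:SecureTransport condition key. A minimal boto3 sketch with a placeholder bucket name:

```python
# Sketch: deny all S3 requests that do not arrive over TLS.
import json

import boto3

bucket = "example-data-bucket"  # hypothetical bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```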
● S3 supports server-side encryption (SSE). You can also arrange for
HTTPS encryption and decryption (SSL/TLS) to be handled by your
ELB
● ELB, EBS, S3, and RDS include encryption capabilities to protect
your data in transit and at rest. Amazon Macie automatically discovers,
classifies, and protects sensitive data, while KMS makes it easy to create
and control keys used for encryption.
■ Privilege management
● Ensures that only authorized and authenticated users are able to
access your resources, and only in manner that is intended
● ACL, role based access, password management
● How are you protecting access to and use of the AWS root account?
● How are you defining roles and responsibilities of system
users?
● How are you limiting automated access such as from
services and applications to AWS resources?
● How are you managing keys and credentials?
■ Infrastructure protection
● How you protect your data center.
● Implement stateful and stateless packet inspection, using native
technologies or partner products.
● Use VPC to create private, secure and scalable environment to
define your topology - including gateways, routing tables, public
and private subnets.
● RFID controls, security, lockable cabinets, CCTV etc.
● For AWS, this is talking more about VPC level as AWS takes care
of physical infrastructure
● Multiple layers of defense is advisable.
● How do you protect your networks
○ Public and private networks require multiple layers of
defense
○ Define network protection requirements
○ Limit exposure (see the sketch after this list)
○ Automate configuration management
○ Automate network protection
○ Implement inspection and protection - inspect and filter
traffic at the application level, e.g. using a web application firewall
○ Control traffic at all layers - control both ingress and egress
traffic, including data loss prevention
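Limiting exposure often starts with security groups that admit only the traffic you intend. A minimal boto3 sketch that allows HTTPS from one known CIDR range; the group ID and range are placeholders:

```python
# Sketch: permit HTTPS ingress only from a known CIDR range.
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "corporate range"}],
    }],
)
```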
● How do you protect your compute resources?
○ Resources in workload require multiple layers of defense
to protect from internal and external threats. Compute
resources like ec2 instances, containers, lambda functions,
etc.
○ Define compute protection requirements
○ Scan for and patch vulnerabilities
○ Automate configuration management
○ Automate compute protection
○ Reduce attack surface
○ Implement managed services - use services that manage
resources for you, e.g. Lambda, RDS, and ECS
● VPC enables you to launch resources into a virtual network.
CloudFront is a global content delivery network that securely
delivers data, videos, apps, and APIs to your viewers, and
integrates with AWS Shield for DDoS protection. AWS WAF is a
web application firewall that is deployed on either CloudFront or an
Application Load Balancer to help protect against common web exploits.
■ Detective controls
● Detect / identify security breach
● Conducting an inventory of assets and their detailed attributes promotes
effective decision making and helps establish operational baselines.
● Use internal auditing, examination of controls related to
information systems, to ensure practices meet policies and
requirements, and that you have set correct automated alerting
notifications based on conditions.
● CloudTrail logs AWS API calls, CloudWatch provides
monitoring of metrics with alarming, and AWS Config provides configuration
history. GuardDuty is a managed threat detection service that
monitors for malicious or unauthorized behavior to protect AWS
accounts. Also use S3 server access logging to log access requests.
● Log management is important. Critical you analyze logs and
respond to them to identify potential incidents.
● How do you detect and investigate security events?
○ Capture and analyze events from logs and metrics. Take
action on security events and potential threats
○ Define requirements for logs
○ Define requirements for metrics
○ Define requirements for alerts
○ Configure service and application logging - logging
throughout workload, application logs, AWS services logs,
resource logs
○ Analyze logs centrally
○ Automate alerting on key indicators
○ Develop investigation processes (see the sketch below)
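An investigation process can start from CloudTrail's event history. A minimal boto3 sketch that pulls the last day of console-login events as a starting point for review:

```python
# Sketch: list recent console-login events from CloudTrail.
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail")
resp = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "ConsoleLogin"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
)
for event in resp["Events"]:
    print(event["EventTime"], event.get("Username"), event["EventName"])
```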
● How do you defend against emerging security threats?
○ Stay up to date with AWS and industry best practices and
threat intelligence
○ Keep up to date with organizational, legal and compliance
requirements
○ Keep up to date with security best practices
○ Keep up to date with security threats
○ Evaluate new security services and features regularly
○ Define and prioritize risks using threat model - use threat
model to identify and maintain up to date register of
potential threats
○ Implement new security services and features
● CloudTrail records API calls, and AWS Config provides a detailed
inventory of AWS resources and configurations. GuardDuty is a
managed threat detection service that continuously monitors for
malicious or unauthorized behavior. CloudWatch is a monitoring
service for resources which can trigger events to automate
security responses.
■ Incident response
● Put processes in place to respond to and mitigate the potential impact of
security incidents.
● Architecture of your workload affects ability of team to operate
effectively during crisis, isolate or contain systems, and restore
operations to known good state.
● Best practices:
○ Detailed logging
○ Events automatically processed and trigger tools that
automate response via AWS API
○ Pre-provision tooling and a “clean room” using
CloudFormation. This allows you to carry out forensics in a safe,
isolated environment.
● How do you respond to an incident
○ Preparation is critical to timely investigation and response
to security incidents to help minimize potential disruption to
organization
○ Identify key personnel and external resources
○ Identify tooling
○ Develop incident response plans
○ Automate containment capability
○ Identify forensic capabilities
○ Pre-provision access
○ Pre-deploy tools
○ Run game days - practice incident response game days
regularly
● IAM used to grant appropriate authorization to incident response
teams and response tools. CloudFormation used to create
trusted environment or clean room for conducting investigations.
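Launching the pre-provisioned clean room can then be a single API call against a prepared template. A minimal boto3 sketch; the stack name and template URL are placeholders:

```python
# Sketch: launch a pre-built "clean room" stack for forensics on demand.
import boto3

cfn = boto3.client("cloudformation")
cfn.create_stack(
    StackName="ir-clean-room",  # hypothetical stack name
    TemplateURL="https://example-bucket.s3.amazonaws.com/clean-room.yaml",  # hypothetical
    Capabilities=["CAPABILITY_NAMED_IAM"],  # template creates investigation roles
)
```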

Reliability
● Ability of a system to recover from service or infrastructure outages as well as ability to
dynamically acquire resources to meet demand
● Design principles
○ Test recovery procedures - test how your system fails, and validate your
recovery procedures. Use automation to simulate different failures or re-create
scenarios that led to failures before.
○ Automatically recover from failure - monitoring system for key performance
indicators (KPI), you can trigger automation when threshold is breached. Allows
for automatic notification and tracking of failures, and for automated recovery
process that work around or repair the failure. More sophisticated automation,
possible to anticipate and remediate failures before they occur.
○ Scale horizontally to increase aggregate system availability - e.g. replace one
large resource with many small resources, distributing requests across the
small resources to avoid a single point of failure
○ Stop guessing capacity - common cause of failure is resource saturation, when
demands placed on system exceed capacity of that system. Monitor demand and
system utilization, automate addition or removal of resources to maintain the
optimal level to satisfy demand without over or under provisioning.
● Foundations
○ Foundational requirements that influence reliability should be in place IE you
must have sufficient network bandwidth to your data center.
○ Generally, it is the responsibility of AWS to satisfy requirements of sufficient
networking and compute capacity.
○ How do you manage service limits?
■ Be aware of the limits. AWS Direct Connect, for example, has limits on the
amount of data you can transfer on each connection.
■ Aware of limits but not tracking them - you know limits exist but do not
track your current usage against them
■ Monitor and manage limits - evaluate potential usage, increase regional
limits appropriately
■ Use automated monitoring and management of limits (see the sketch after
this list)
■ Accommodate fixed service limits through architecture
■ Ensure sufficient gap between current service limit and maximum usage
to accommodate failover - when resource fails, it may still be counted
against limits until successfully terminated. Ensure limits cover the
overlap of all failed resources with replacements, before failed resource
are terminated.
■ Manage service limits across all relevant accounts and Regions - ensure
you request the same limits in all environments in which you run your
production workloads
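Automated monitoring of limits can be built on the Service Quotas API. A minimal boto3 sketch; the quota code shown is assumed to be the EC2 On-Demand standard-instance vCPU quota, so verify the codes that matter for your account:

```python
# Sketch: read a service quota so current usage can be compared against it.
import boto3

quotas = boto3.client("service-quotas")
quota = quotas.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",  # assumed: On-Demand standard-instance vCPU quota
)
print(quota["Quota"]["QuotaName"], quota["Quota"]["Value"])
```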
○ How do you manage network topology?
■ Applications can exist in one or more environments, e.g. existing
infrastructure, public cloud, or private cloud. Network considerations such
as intra- and inter-system connectivity, public IP address management,
private address management, and name resolution are fundamental
■ Use highly available connectivity between private and public clouds and
on-premises environments - use multiple AWS Direct Connect (DX) circuits and
multiple VPN tunnels between separately deployed private IP address spaces.
Use multiple DX locations for high availability. If you use multiple Regions,
use multiple DX locations in at least two Regions. Evaluate AWS
Marketplace appliances that terminate VPNs.
■ Use highly available network connectivity for users of the workload - use
highly available DNS, CloudFront, API Gateway, load balancing, or a reverse
proxy as the public-facing endpoint of the application.
■ Enforce non-overlapping private IP address ranges in multiple private
address spaces - the IP range of each VPC must not conflict if they are
peered.
■ Ensure IP subnet allocation accounts for expansion and availability - an
individual VPC's IP address range must be large enough to accommodate
application requirements, factoring in future expansion and allocation of IP
addresses to subnets across AZs; this includes load balancers, Lambda, EC2,
and container-based apps
● Change Management
○ Aware of how change affects a system so you can plan around it. Monitoring
allows you to detect any changes to your environment and react.
○ Monitor behavior of your system and automate responses to the KPI, for example
by adding additional servers as system gains more users
○ How does your system adapt to changes in demand?
■ Scalable system provides elasticity to add and remove resources
automatically so that they closely match current demand
■ Procure resources upon detection of lack of service within workload -
scaling of resources is performed manually
■ Procure resources manually upon detection that more resources may be
needed soon
■ Procure resources automatically when scaling workload up or down -
use services that auto scale, like S3, CloudFront, Auto Scaling, and Lambda
(see the sketch after this list)
■ Load test the workload - adopt load testing methodology to measure if
scaling will meet requirements
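Automatic procurement of resources is typically implemented as a target tracking policy on an Auto Scaling group. A minimal boto3 sketch with a placeholder group name:

```python
# Sketch: scale an Auto Scaling group around a target CPU utilization.
import boto3

autoscaling = boto3.client("autoscaling")
autoscaling.put_scaling_policy(
    AutoScalingGroupName="example-web-asg",  # hypothetical group
    PolicyName="target-50-percent-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,  # add/remove instances to hold ~50% average CPU
    },
)
```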
○ How do you monitor your resources
■ Via logs and metrics. Configure workload to monitor logs and metrics and
send notification when thresholds are crossed or events occur.
■ Monitor workload in all tiers
■ Send notification based on monitoring (see the sketch after this list)
■ Perform automated responses on events
■ Conduct reviews regularly
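Notifications based on monitoring usually take the form of a CloudWatch alarm with an SNS action. A minimal boto3 sketch; the instance ID and topic ARN are placeholders:

```python
# Sketch: alarm on sustained high CPU and notify an SNS topic.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-example-instance",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical
    Statistic="Average",
    Period=300,               # 5-minute periods
    EvaluationPeriods=3,      # breach must persist for 15 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical topic
)
```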
○ How do you implement change
■ Uncontrolled changes make it difficult to predict effects of change.
Controlled changes to provisioned resources and workloads are
necessary to ensure that workloads and operating environment are
running known software and can be patched or replaced.
■ Deploy changes in planned manner
■ Deploy changes with automation
● Failure Management
○ Architect your system with assumption that failure will occur.
○ Take advantage of automation to react to monitoring data. When particular metric
crosses threshold, trigger automated action to remedy problem.
○ How do you back up data?
■ Back up data, applications, and OS environments to meet requirements for
mean time to recovery (MTTR) and recovery point objectives (RPO)
■ Identify all data that needs to be backed up and perform backups, or
reproduce the data from source - e.g. back up important data using S3 or EBS
snapshots
■ Perform backup automatically (see the sketch after this list)
■ Perform periodic recovery of data to verify backup integrity - validate
backup process meets recovery time objective and recovery point
objective
■ Secure and encrypt backup or ensure data is available from secure
source
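Automatic backups can be as simple as a scheduled job that snapshots EBS volumes and tags the result for later verification and lifecycle management. A minimal boto3 sketch with a placeholder volume ID:

```python
# Sketch: create a tagged EBS snapshot (suitable for a scheduled Lambda or cron job).
import boto3

ec2 = boto3.client("ec2")
ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # hypothetical volume
    Description="nightly backup",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "backup-policy", "Value": "nightly"}],
    }],
)
```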
○ How does your system withstand component failures?
■ If your workloads have requirements for high availability and low mean
time to recovery (MTTR), architect your workloads for resilience and
distribute them to withstand outages
■ Monitor all layers of workload to detect failures
■ Implement loosely coupled dependencies - such as queueing systems,
streaming systems, load balancers
■ Implement graceful degradation to transform applicable hard
dependencies into soft dependencies
■ Automate complete recovery because technology constraints exist in
parts or all of workload requiring single location - some elements of
workload can only run in one AZ or one data center
■ Deploy workload to multiple locations
■ Automate healing on all layers
■ Send notifications upon availability impacting events
○ How do you test resilience
■ Testing resilience helps your workload to find latent bugs that only surface
in production
■ Use playbooks for unanticipated failures
■ Conduct root cause analysis and share results
■ Inject failures to test resiliency
■ Conduct game days regularly
○ How do you plan for disaster recovery?
■ Disaster recovery (DR) is critical should restoration of data from backups
be required.
■ Define recovery objectives for downtime and data loss
■ Use defined recovery strategies to meet recovery objectives
■ Test disaster recovery implementation to validate - regularly test recovery
to ensure RTO and RPO are met
■ Manage configuration drift on all changes - ensure AMI and system
configuration state are up to date at the DR site or region
■ Automate recovery
○ Regularly backup your data and test your backup files to ensure you can recover
from both logical and physical errors. Key is frequent and automated testing of
system to cause failure, and observe how they recover.
○ Actively track KPIs, like RTO and RPO, to assess the system's resiliency.
Tracking KPIs will help identify and mitigate single points of failure. The
objective is to thoroughly test your system recovery processes.

Performance Efficiency
● How to use computing resources efficiently to meet your requirements and how to
maintain efficiency as demand changes and technology changes
● The essential service is CloudWatch
● Design Principles
○ Democratize advanced technologies - consume new technology as a service,
instead of spending time to learn the complexities of the technology
○ Go global in minutes - use CloudFormation templates to easily deploy your
system in multiple Regions around the world. Provides lower latency and a
better experience for your customers at minimal cost.
○ Use serverless architectures - A Cloud Guru, for example, doesn’t use EC2
instances; it is completely serverless, using Lambda to pull videos from S3.
Serverless architecture removes the need to run and maintain servers. Example:
■ Storage services can act as static websites, and event services can host the
code for you.
○ Experiment more often
○ Mechanical sympathy - use technology approach that aligns best with what you
are trying to achieve.
● Definition - take data-driven approach to selecting high-performance architecture.
Gather data on all aspects of architecture. By reviewing choices on cyclical basis, you
will ensure that you are taking advantage of continually evolving AWS cloud. Monitoring
ensures you are aware of any deviance from expected performance. Your architecture
can make trade-offs to improve performance, e.g. using compression or caching.
● Selection
○ In AWS, resources are virtualized and available in a number of different types
and configurations, which makes it easier to find an approach that matches your
needs, e.g. DynamoDB provides a fully managed NoSQL database with single-digit
millisecond latency at any scale.
○ How do you select best performing architecture
■ Multiple approaches are required to get optimal performance across
workload. Well architected system uses multiple solutions and enable
different features to improve performance.
■ Understand available services and resources
■ Define process for architectural choices
■ Factor cost or budget into decisions
■ Use policies or reference architectures
■ Use guidance from AWS or APN partner
■ Benchmark existing workloads
■ Load test your workload - deploy latest version of system on AWS using
different resource types and sizes, monitoring to capture performance
metrics that identify bottlenecks or excess capacity
○ Use data driven approach for the most optimal solution.
● Compute
○ Optimal compute solution for system may vary based on application design,
usage patterns and configuration settings.
○ Compute is available in three forms:
■ Instances - virtualized servers, comes in different families and sizes, and
offer wide variety of capabilities, including SSD and GPU
■ Containers - method of OS virtualization that allow you to run application
and its dependencies in resource-isolated processes
■ Functions - abstract the execution environment from the code you want to
execute, e.g. Lambda lets you execute code without provisioning instances
○ How do you select your compute solution
■ Optimal solution depends on the application design, usage patterns and
configuration settings.
■ Evaluate available compute options
■ Understand the available compute configuration options
■ Collect compute related metrics
■ Determine the required configuration by right-sizing - analyze various
performance characteristics of workload, and how they relate to memory,
network and CPU usage
■ Use available elasticity of resources
■ Re-evaluate compute needs based on metrics - use system-level metrics
to identify behavior and requirements over time. Evaluate needs by
comparing available resources with these requirements and make
changes.
○ Important to choose the right kind of server. Some apps require heavy CPU,
others heavy memory.
○ Auto scaling is key to ensuring you have enough instances to meet demand
● Storage
○ Optimal solution varies based on kind of access method (block, file, object).
○ Optimal storage solutions depend on:
■ Access method - block, file, object
■ Patterns of access - random or sequential
■ Throughput required
■ Frequency of access - online, offline, archival
■ Frequency of update - WORM (write once, read many) or dynamic
■ Availability and durability constraints
○ How do you select your storage solution
■ Understand storage characteristics and requirements - understand
different characteristics such as shareability, file size, cache size, access
patterns, latency, throughput, and persistence of data
■ Evaluate available configuration options
■ Make decisions based on access patterns and metrics - choose the
storage system by considering how the workload accesses data.
○ EBS, S3
● Database
○ Optimal database solution depends on factors: consistency, availability, relational
type, partition tolerance, latency, query capability
○ Many systems use different solutions for various subsystems
○ What are you using the database for?
○ How do you select your database solution?
■ Remember to choose the best solution for your architecture. Depending
on factors mentioned above
■ Understand data characteristics - determine if workload requires
transactions, how it interacts with data, what its performance demands
are etc.
■ Evaluate available options
■ Collect and record database performance metrics - measure transactions
per second, slow queries, system latency when accessing database
■ Choose data storage based on access patterns - e.g. use a relational
database for workloads requiring transactions, or a key-value store that
provides higher throughput
■ Optimize data storage based on access patterns and metrics - measure
how optimizations such as indexing, key distribution, data warehouse
design, or caching strategies affect system performance
○ It is critical to consider the access patterns of your workloads, and to consider
whether other non-database solutions could solve the problem more efficiently,
e.g. a search engine or data warehouse
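For the key-value, high-throughput access pattern mentioned above, a DynamoDB table is often the natural fit. A minimal boto3 sketch with placeholder names, using on-demand billing so capacity does not need to be guessed:

```python
# Sketch: key-value table for a high-throughput access pattern.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="example-sessions",  # hypothetical table
    AttributeDefinitions=[{"AttributeName": "session_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "session_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # scales with demand, no capacity planning
)
```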
● Network
○ Optimal solution varies based on latency, throughput requirements
○ Physical constraints such as user or on-premise resources will drive location
options, which can be offset using edge techniques or resource placement
○ AWS offers services such as enhanced networking, EBS-optimized instances,
S3 Transfer Acceleration, and dynamic content delivery with CloudFront to
optimize network traffic
○ AWS offers services such as Route 53 latency-based routing, VPC endpoints,
and Direct Connect to reduce network distance or jitter
○ How do you configure your networking solution
■ Understand how networking impacts performance - e.g. network latency
impacts user experience, and using the wrong protocols can starve network
capacity through overhead
■ Understand available product options
■ Evaluate available networking features
■ Use minimal network ACLs - design the network to minimize the number of
ACLs while meeting requirements. Too many ACLs can negatively impact
network performance
■ Leverage encryption offloading and load balancing - use a load balancer to
offload encryption (TLS termination) to improve performance, and to manage
and route traffic. Distribute traffic across multiple resources to let your
workload take advantage of elasticity
■ Choose network protocols to improve performance
■ Choose location based on network requirements
■ Optimize network configuration based on metrics
○ When selecting solution, you need to consider location. Take advantage of
regions, placement groups, and edge locations to significantly improve
performance
● Review
○ Over time new technologies and approaches become available that could
improve performance of existing architecture
○ Take advantage of continual innovation
○ How do you evolve your workload to take advantage of new releases
■ Keep up to date on new resources and services
■ Define a process to improve workload performance - e.g. run existing
performance tests on new instance offerings to determine which
performance improvements would be gained by using them
■ Evolve workload performance over time
○ Understand where your architecture is performance constrained
● Monitoring
○ Monitoring metrics should be used to raise alarms when thresholds are
breached and trigger automated actions to work around the issue
○ Use automation to work around performance issues via automated triggers
through Kinesis, SQS, and Lambda
○ Ensuring that you do not get too many false positives, or become overwhelmed
with data, is key to an effective monitoring solution.
○ Plan for game days where you can conduct simulations in production
environment to test your alarm solution and response.
○ How do you monitor your resources to ensure they are performing as
expected?
■ Record performance-related metrics - use CloudWatch to record
performance-related metrics, e.g. DB transactions, slow queries, I/O latency,
HTTP request throughput, service latency (see the sketch after this list)
■ Analyze metrics when events or incidents occur
■ Establish KPI to measure workload performance
■ Use monitoring to generate alarm based notifications
■ Review metrics at regular intervals
■ Monitor and alarm proactively
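Recording and reviewing performance-related metrics can be done with the CloudWatch API. A minimal boto3 sketch that pulls an hour of load balancer response times; the LoadBalancer dimension value is a placeholder:

```python
# Sketch: retrieve a performance metric for periodic review.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/example-alb/0123456789abcdef"}],  # hypothetical
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 3))
```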
● Tradeoff
○ Think about tradeoffs so you can select optimal approach.
○ Often you can improve performance by trading consistency, durability and space
for time and latency
○ Tradeoffs can increase complexity of your architecture and require load testing to
ensure measurable benefit is obtained
○ How do you use trade-offs to improve performance?
■ Understand areas where performance is most critical
■ Learn about design patterns and services
■ Identify how tradeoffs impact customers and efficiency
■ Use various performance related strategies

Cost Optimization
● Reduce your costs to minimum
● Pay the lowest price while achieving your business objectives
● Key service is Cost Explorer, which gives you detailed visibility and insight into your
usage and provides recommendations on instances.
● Design principles
○ Consumption model - pay only for resources that you require, and increase or
decrease usage depending on business requirements.
○ Measure overall efficiency - measure business output of workload and costs
associated with delivering it.
○ Analyze and attribute expenditure - cloud makes it easy to accurately identify
usage and cost of systems, which allows transparent attribution of IT costs. Helps
measure return on investment, and gives workload owners opportunity to
optimize resources and reduce costs.
○ Transparently attribute expenditure
○ Use managed services to reduce cost of ownership - managed and
application level services remove operational burden of maintaining servers.
○ Stop spending money on data center operations
● Definition
○ Matched supply and demand
■ Align supply with demand. Don't over or under provision. There needs to
be sufficient extra supply to allow for provisioning time and individual
resource failures
■ Use CloudWatch to help you track what your demand is
■ Example is auto scaling or lambda, which you only pay when the function
executes / request comes in
■ Anticipating changes in demand, you can save more money and ensure
resources match workload needs
■ How do you match supply of resources with demand?
● Perform analysis on workload demand
● Provision resources reactively or unplanned
● Provision resources dynamically
■ When designing to match supply against demand, think about pattern of
usage and time it takes to provision new resources
○ Cost effective resources
■ Using correct instance type
■ Appropriate service selection can reduce usage and costs, such as
CloudFront to minimize data transfer, or completely eliminate costs, such as
utilizing Aurora to remove expensive database licensing costs.
■ How do you evaluate cost when selecting services
● Identify organization requirements for cost - define balance
between all other pillars
● Analyze components of workload - ensure every workload
component is analyzed
● Perform thorough analysis of each component
● Select components of this workload to optimize cost in line with
organization priorities - use application-level and managed
services, e.g. RDS, DynamoDB, SNS, and SES, to reduce overall
organization cost. Minimize license costs by using open-source
solutions, e.g. Amazon Linux for compute workloads, or by migrating
databases to Aurora
● Perform cost analysis for different usage over time
■ How do you meet cost targets when selecting resource type and size
● Perform cost modeling
● Select resource type and size based on estimates
● Select resource type and size based on metrics
■ How do you use pricing models to reduce cost
● Perform pricing model analysis
● Implement different pricing models, with low coverage
● Implement pricing models for all components of workload - short-term
capacity is configured for Spot Instances, Spot blocks, or Spot
Fleet. On-Demand is used only for short-term workloads that
cannot be interrupted and do not run long enough for reserved
capacity, typically 25 to 75 percent of the year
● Implement regions based on cost
■ How do you plan for data transfer charges
● Perform data transfer modeling
● Select components to optimize data transfer cost
● Implement services to reduce data transfer cost
○ Expenditure awareness
■ Many organizations have different teams with their own AWS accounts. Be
aware of what each team is spending. You can use cost allocation tags and
billing alerts to track this
■ Accurate cost attribution lets you know which products are profitable, lets
you make informed decisions about where to allocate budget
■ Cost Explorer lets you track your spend and gain insights into exactly where
you spend. With AWS Budgets you can send notifications if your usage or
costs are not in line with your forecasts.
■ How do you govern usage?
● Develop policies based on organization requirements - policies
should cover aspects of resources and workloads, including
creation, modification, decommission of resource. Develop cost
targets and goals for workloads
● Implement an account structure
● Implement groups and roles
● Implement cost controls
● Track project lifecycle
■ How do you monitor usage and cost
● Configure AWS cost and usage report
● Identify cost attribution categories
● Establish organization metrics
● Define and implement tagging - implement tagging across all
resources
● Configure billing and cost management tools
● Report and notify cost optimization - configure AWS budgets to
provide notifications on cost and usage.
● Monitor cost proactively
● Allocate cost based on workload metrics - implement a process to
analyze AWS Cost and Usage Reports with Amazon Athena, which
provides insight and chargeback capability
■ How do you decommission resources
● Track resources over their lifetime
● Implement decommissioning process
● Decommission resources in unplanned manner - typically
triggered by events such as periodic audits
● Decommission resources automatically
■ You can use cost allocation tags to categorize and track your usage and
costs. When you apply tags, AWS generates a cost and usage report with
your usage and your tags. Apply tags that represent organization
categories, e.g. cost centers, workload names, and owners (see the sketch
below)
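Once tags are activated as cost allocation tags, spend can be broken down per tag with the Cost Explorer API. A minimal boto3 sketch; the tag key and dates are placeholders:

```python
# Sketch: monthly cost grouped by a cost allocation tag.
import boto3

ce = boto3.client("ce")  # Cost Explorer
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-01-01", "End": "2020-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "cost-center"}],  # hypothetical tag key
)
for group in resp["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```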
○ Optimizing over time
■ AWS moves fast.
■ Keep track of changes made to AWS and re-evaluate your existing
architecture; a newer service may be a better fit.
■ How do you evaluate new services?
● Establish cost optimization function - regularly review costs and
usage
● Develop workload review process
● Review and implement services in unplanned way
● Review and analyze workload regularly
● Keep up to date with new service releases

Operational Excellence
● Operational practices and procedures to manage production workloads. How planned
changes are executed, and responses to unexpected operational events. Ability to run
and monitor systems to deliver business value and to continually improve
supporting processes and procedures.
● Design principles
○ Perform operations as code - define entire workloads as code and update
them with code. Implement operations procedures as code and automate their
execution by triggering them in response to events. This limits human error and
enables consistent responses to events.
○ Annotate documentation - align operations processes to business objectives
and collect metrics that indicate those objectives. Automate the creation of
annotated documentation after every build, and use annotations as input to your
operations code.
○ Make regular, small incremental changes - design workloads to allow
components to update regularly. Make small changes that can be reversed.
○ Test for responses to unexpected events - perform pre-mortem exercises to
identify potential sources of failure. Test response procedures to failures to
ensure they are effective.
○ Learn from operational events and failures - drive improvement through
lessons learned from operational events and failures.
○ Keep operations procedures current - look for opportunities for improvement.
Evolve your procedures as you evolve your workload. Set up regular game days
to review and validate all procedures are effective.
● Definition
○ Preparation
■ Effective preparation is required to drive operational excellence
■ Common standards simplify workload design and management, enabling
operational success.
■ Operational readiness is validated through checklists to ensure workload
meets defined standards, and required procedures are adequately
captured in runbooks and playbooks.
■ Workloads should have
● Runbooks - operations guidance that operations teams can refer
to so they can perform normal daily tasks
● Playbooks - guidance for responding to unexpected operational
events.
■ CloudFormation can be used to ensure that environments contain all
required resources when deployed in production, and that environments are
based on tested practices
■ Auto Scaling provides automated, demand-driven scaling mechanisms
■ AWS Config creates mechanisms to automatically track and respond to
changes in your AWS workloads and environments
■ How do you determine what your priorities are?
● Evaluate external customer needs
● Evaluate internal customer needs
● Evaluate compliance requirements - factors such as regulatory
compliance requirements and industry standards, ensure you are
aware of guidelines or obligations that mandate specific focus
● Evaluate threat landscape
● Evaluate tradeoffs - for example, accelerating speed to market for
new features may be emphasized over cost optimization
● Manage benefits and risks - for example, may be beneficial to
deploy system with unresolved issues so new features can be
made available to customers
■ How do you design workload so that you can understand its state?
● Implement application telemetry
● Implement workload telemetry
● Implement user activity telemetry
● Implement dependency telemetry
● Implement transaction telemetry
■ How do you reduce defects, ease remediation, and improve flow into
production?
● Use version control
● Test and validate changes
● Use configuration management systems
● Use build and deployment management systems
● Perform patch management
● Share design standards
● Implement practices to improve code quality
● Use multiple environments
● Make frequent, small, reversible changes
● Fully automate integration and deployment
■ How do you mitigate deployment risks?
● Adopt approach that provides fast feedback on quality and enable
rapid recovery from changes that do not have desired outcomes.
● Plan for unsuccessful changes - so you can revert to a known good state
● Test and validate changes
● Use deployment management systems
● Test using limited deployments
● Deploy using parallel environments
● Deploy frequent, small, reversible changes
● Fully automate integration and deployment
● Automate testing and rollback
■ How do you know that you are ready to support a workload?
● Evaluate operational readiness of your workload, process and
procedures, and personnel to understand risks related to
workload.
● Ensure personnel capability
● Ensure consistent review of operational readiness
● Use runbooks to perform procedures - runbooks are documented
procedures to achieve specific outcomes.
● Use playbooks to identify issues - playbooks are documented
processes to investigate issues
● Make informed decisions to deploy systems and changes
■ Implement a minimum number of architecture standards for your workloads.
Reduce the number of supported standards to reduce the chance that
lower-than-acceptable standards will be applied in error. Invest in
implementing operations activities as code to maximize the productivity of
operations personnel, minimize error rates, and enable automated
responses. Adopt practices that take advantage of the elasticity of the cloud.
■ AWS Config and AWS Config Rules used to create standards for
workloads and determine if environments are compliant with those
standards
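A tagging standard, for example, can be enforced with the managed REQUIRED_TAGS rule. A minimal boto3 sketch (assumes the AWS Config recorder is already enabled; the tag key is a placeholder):

```python
# Sketch: flag resources that are missing a required tag via AWS Config.
import boto3

config = boto3.client("config")
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "required-tags-example",
        "Source": {"Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS"},
        "InputParameters": '{"tag1Key": "cost-center"}',  # hypothetical tag key
    }
)
```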
○ Operation
■ Successful operation of workload is measured by achievement of
business and customer outcomes.
■ Establish baselines from which improvement or degradation of operations
will be identified, collect and analyze metrics, validate your understanding
of operations success and how it changes.
■ Use established runbooks for well understood events, and playbooks to
aid in resolution of other events.
■ Ensure that if alert is raised in response to event, there is associated
process to be executed.
■ Should be standardized and manageable on routine basis. Focus should
be on automation, small changes, regular QA testing.
■ How do you understand the health of your workload?
● Define, capture, analyze workload metrics to gain visibility to
workload events so you can take appropriate action
● Identify key performance indicators
● Define workload metrics - measure the achievement of KPIs.
Measure the health of the workload and evaluate metrics to see if
desired outcomes are being achieved
● Collect and analyze workload metrics
● Establish workload metrics baseline - baseline for metrics to
provide expected values as basis for comparison and identification
of under and over performing components
● Learn expected patterns of activity for workload - determine when
behavior is outside of expected values
● Alert when workload outcomes are at risk
● Alert when workload anomalies are detected
● Validate achievement of outcomes and effectiveness of KPI and
metrics - create business level view of workload operations to
determine if needs are satisfied and identify areas that need
improvement to reach business goal. Validate effectiveness of KPI
and metrics
■ How do you understand health of your operations?
● Define, capture and analyze operation metrics
● Identify key performance indicators
● Define operations metrics - metrics to measure achievement of
KPI
● Collect and analyze operations metrics
● Establish operations metrics baseline
● Learn expected patterns of activity for operations
● Alert when operations outcomes are at risk
● Alert when operations anomalies are detected
● Validate achievement of outcomes and effectiveness of KPI and
metrics
■ How do you manage workload and operation events?
● Use processes for event, incident and problem management - use
processes to mitigate impact of events on the business by
ensuring timely and appropriate responses
● Use process for root cause analysis - process to identify and
document root cause of event
● Have process per alert
● Prioritize operational events based on business impact - when
multiple events require intervention, those that are most significant
to the business are addressed first.
● Define escalation path - define escalation paths in runbooks and
playbooks, including what triggers escalation and the procedures to
follow. Identify an owner for each action
● Enable push notification
● Communicate status through dashboards - provide dashboards
tailored to target audiences
● Automate responses to events
■ Routine operations, as well as responses to unplanned events should be
automated.
■ AWS CloudWatch to monitor operational health of workload
○ Evolution
■ Responses to unexpected events should be automated
■ Alerts should be timely and invoke escalations when responses are not
adequate to mitigate impact of events
■ Dedicate work cycles to make incremental improvements
■ With AWS developer tools, you can implement continuous delivery build, test,
and deployment activities that work with a variety of source code, build,
testing, and deployment tools from AWS.
■ How do you evolve operations?
● Dedicate time and resources for incremental improvements
● Have process for continuous improvements
● Implement feedback loops
● Define drivers for improvement
● Validate insights
● Perform operation metrics reviews
● Document and share lessons learned
● Allocate time to make improvements
■ Successful evolution of operations revolves around small, frequent
improvements; providing safe environments and time to experiment; and
environments in which learning from failure is encouraged.
■ Amazon Elasticsearch Service allows you to analyze your log data to gain
actionable insights quickly and securely
● Key Services - CloudFormation
○ Create templates of best practices. Provision resources in an orderly and
consistent fashion from your development through production environments.
