June 2020
Document revision history
Version Name Date Revision
1.0 Nick Matahen 4-20-2020 Initial Version
1.11 Nick Matahen 5-8-2020 Added content
1.2 Nick Matahen 5-26-2020 Incorporated reviewers' feedback
1.21 Nick Matahen 5-30-2020 Incorporated reviewers' feedback
1.3 Nick Matahen 6-04-2020 Incorporated reviewers' feedback
1.31 Nick Matahen 6-14-2020 Incorporated reviewers' feedback
1.4 Nick Matahen 6-19-2020 Incorporated reviewers' feedback
These reviews will identify and prioritize changes that will support and improve the
global deployment of infrastructure and Member Firms.
KPMG initiated a project for the above assessment and reviews, called the One
Platform Architecture and Automation Assessment, code-named Omega.
2. Overview – Azure Cloud Resiliency
The primary goal of Project Omega is to ensure that One Platform is well-architected to
maximize the advantages of cloud for KPMG’s internal and external users and services.
The Omega framework consists of five workstreams, and each workstream has multiple
tracks. One of the workstreams is Cloud Architecture and it includes a track called
Cloud Resiliency.
Azure Cloud Resiliency helps Omega achieve one of its objectives by providing
guidelines and recommendations for deploying and enabling highly available and
resilient cloud services.
2.2 Purpose
The purpose of this document is to provide guidelines and recommendations for
deploying resilient IaaS-based infrastructure services. These services are restricted to
One Platform Cloud services (Microsoft Azure Cloud). This document is not a design or
architecture document, but is meant to aid in the decision-making process of deploying
resilient Azure cloud infrastructure services. With the ever-changing nature of cloud
computing, this is a point-in-time document at the time of writing (June 2020) and may
require updates in the future to stay current with cloud changes.
2.3 Objectives
There are two objectives for One Platform Cloud Resiliency. The first is a business
objective: ensure services adapt automatically to crisis situations and disruptions by
gracefully recovering from failures, allowing continued functionality with minimal
downtime or data loss. The second is a technical objective: build resilient cloud
architecture that delivers high-performing cloud services and seamlessly supports
computing services and end users.
3. Azure Cloud Resiliency
3.1 Resiliency Definition
Resiliency is the ability of a system to recover from failures and continue to function. It is
not only about avoiding failures, but also responding to failures in a way that minimizes
downtime or data loss.
• Blast Radius
• The radius of protection for applications and data. For example, Availability Sets
protect applications within a datacenter, and Availability Zones protect
applications and data in an Azure region.
• High Availability
• High Availability is the ability to maintain an acceptable level of uptime despite
temporary failures in services or hardware, or fluctuations in load. High Availability
also refers to a system that is designed to prevent loss of service by managing
failures and minimizing planned downtime.
• Disaster Recovery
• Disaster Recovery is the ability to protect against the loss of an entire region.
Disaster Recovery also means architecting cloud infrastructure to recover from an
unforeseen event that could put the organization at risk, and to ensure business
continuity.
• Backup
• Backup is the ability to replicate VMs and data to one or more regions. Azure
backup also means to back up (or protect) and restore your data due to an
unforeseen event.
• Data Residency
• Data Residency is the ability to have two regions that share the same regulatory
requirements for data replication, storage, and residency for the country or region
in which they operate.
• Data Sovereignty
• Data Sovereignty is the ability to keep data within the physical borders of a
particular country or geo-political area.
3.3 Resiliency Blast Radius
Resiliency represents a set of platform-native technical resilience features
facilitating/supporting business continuity, providing high availability, disaster recovery,
and backup to protect mission-critical applications and data. The image below shows
the resiliency features and capabilities. For more information, see Azure Resiliency.
Failures can vary in the scope of their impact. There are many different failure types,
including hardware, software, regional, or transient failures, as well as dependency
service, heavy load, accidental data deletion/corruption, and application deployment
failures. The first step toward achieving resilience is to avoid failures in the first place.
This approach to increasing reliability involves improving the platform’s capability to
minimize impacts during planned maintenance events and giving customers control
over the experience during these events. For more information, see Resiliency Design.
3.5 Resiliency Responsibility
Resiliency is a shared responsibility between the various customer teams and the cloud
provider. Additionally, resiliency depends on the cloud service model being used: IaaS,
PaaS, or SaaS. Being resilient in the event of any failure is a shared responsibility, as
shown in the table below. For more information, see Shared Responsibility.
3.6 Availability Requirements
It is also important to define what it means for the application to be “resilient.” Resiliency
design requires an understanding of the applicable availability requirements and asking
the following, based on the business and technical requirements:
• How much unplanned downtime and/or data loss is acceptable?
• How much will potential unplanned downtime and/or data loss cost the business?
• What means (other than increasing technical resilience) are available to the business
for dealing with unplanned downtime and/or data loss? (e.g. adjusted business
processes, etc.)
• How much money and time can the business realistically invest in making the
application more resilient?
The following table shows the potential cumulative downtime for various SLA levels:
SLA Downtime per week Downtime per month Downtime per year
99% 1.68 hours 7.2 hours 3.65 days
99.9% 10.1 minutes 43.2 minutes 8.76 hours
99.95% 5 minutes 21.6 minutes 4.38 hours
99.99% 1.01 minutes 4.32 minutes 52.56 minutes
99.999% 6 seconds 25.9 seconds 5.26 minutes
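The downtime figures above follow directly from the SLA percentage, with a month
approximated as 30 days and a year as 365 days. The short Python sketch below is
purely illustrative (it is not part of any One Platform tooling) and reproduces the
arithmetic behind the table:

```python
# Illustrative only: convert an SLA percentage into the downtime budgets shown above.

def downtime_budget(sla_percent):
    """Allowed downtime implied by an SLA, using a 30-day month and 365-day year."""
    unavailable = 1 - sla_percent / 100.0
    return {
        "per_week_minutes": unavailable * 7 * 24 * 60,
        "per_month_minutes": unavailable * 30 * 24 * 60,
        "per_year_hours": unavailable * 365 * 24,
    }

if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
        b = downtime_budget(sla)
        print(f"{sla}%: {b['per_week_minutes']:.2f} min/week, "
              f"{b['per_month_minutes']:.1f} min/month, "
              f"{b['per_year_hours']:.2f} h/year")
```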
3.8 Resiliency Metrics
The following metrics are used to measure resilience:
• Recovery time objective (RTO) is the maximum acceptable time that an application
can be unavailable after an incident.
• Recovery point objective (RPO) is the maximum duration of data loss that is
acceptable during a disaster.
• Service Level Agreement (SLA) describes Microsoft’s commitments regarding
uptime and connectivity. If the SLA for a particular service is 99.9%, that means
customers can expect the service to (on average) be available 99.9% of the time. If
Microsoft does not achieve and maintain the Service Levels for each Service as
described in their SLA, then you may be eligible for a credit towards a portion of your
monthly service fees.
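These metrics compound across dependent services: when an application requires
several services to be available at the same time, its effective (composite) SLA is the
product of the individual SLAs and is therefore lower than any single one. A minimal
illustrative calculation follows; the services and SLA values named in it are examples
only, not One Platform commitments:

```python
# Illustrative composite-SLA calculation for serially dependent services.

def composite_sla(slas_percent):
    """Composite SLA of services that must all be available at once."""
    composite = 1.0
    for sla in slas_percent:
        composite *= sla / 100.0
    return composite * 100.0

# Example only: an app behind Application Gateway (99.95%) using VMs in an
# Availability Set (99.95%) and a storage account (99.9%).
print(f"Composite SLA: {composite_sla([99.95, 99.95, 99.9]):.3f}%")
# Roughly 99.80%, i.e. a larger downtime budget than any single service allows.
```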
4. Omega Resiliency Scope
4.1. Azure Cloud
Azure has more than 55 regions worldwide and is available in 140 countries. Many
regions offer Availability Zones, providing the scale needed to bring infrastructure and
applications closer to users, preserving data residency, and offering comprehensive
compliance and resiliency options for customers. This document is concerned with One
Platform infrastructure resiliency.
• Geographies
• A geography is a discrete physical location containing two or more regions
that are fault-tolerant to withstand complete region failure through their
connection to Microsoft's dedicated high-capacity networking infrastructure.
• Regions
• A region is a set of datacenters deployed within a latency-defined perimeter
and connected through a dedicated regional low-latency network. It also
preserves data residency and compliance boundaries.
• Availability Zones
• Availability Zones are physically separate locations within an Azure region.
They are designed to run mission-critical applications with high availability and
low-latency replication.
5. One Platform Infrastructure – Cloud Services
The resiliency scope is limited to One Platform infrastructure-related services.
Infrastructure includes the components below. This list is not exhaustive, and it is
applicable to the time of writing this document (June 2020). The following sections will
provide detailed information, guidance, and recommendations for deploying resilient
cloud services (KPMG One Platform-approved services). These services are:
— Front Door is an application delivery network that provides global load balancing and
site acceleration for web applications. It offers Layer 7 capabilities such as TLS
offload, path-based routing, fast failover, and caching to improve the performance
and high availability of your applications. Currently, this is not in use, but it is
planned to be implemented in One Platform in the future.
— Traffic Manager is a DNS-based traffic load balancer that enables you to distribute
traffic optimally to services across global Azure regions, while providing high
availability and responsiveness. Because Traffic Manager is a DNS-based load-
balancing service, it load balances only at the domain level. For that reason, it
cannot fail over as quickly as Front Door, because of common challenges around
DNS caching and systems not honoring DNS TTLs. This service is currently in use in
One Platform. For more information, see Section 6.1.1.
When selecting the load-balancing options, here are some factors to consider:
• Global vs. regional: Do you need to load balance VMs or containers within a
virtual network, or load balance scale unit/deployments across regions, or both?
• Availability: What is the service SLA?
• Cost: In addition to the cost of the service itself, consider the operations cost for
managing a solution built on that service. For more information, see Azure
pricing.
• Features and limits: What are the overall limitations of each service? For more
information, see Service limits.
The following flowchart will help you to choose a load-balancing solution for your
application. The flowchart guides you through a set of key decision criteria to reach a
recommendation.
Treat this flowchart as a starting point, as every application has unique requirements.
Then perform a more detailed evaluation.
• Internet facing: Applications that are publicly accessible from the internet. As a
best practice, application owners apply restrictive access policies or protect the
application by setting up offerings like web application firewall and DDoS
protection.
• Global: End users or clients are located beyond a small geographical area. For
example, users across multiple continents, across countries within a continent, or
even across multiple metropolitan areas within a larger country.
Overview
Azure Traffic Manager is a DNS-based load balancer that distributes traffic optimally
across global Azure regions while providing high availability and responsiveness.
Traffic Manager uses endpoint health and the configured traffic-routing method to direct
client requests to the optimal service endpoint, hosted inside or outside of Azure, with a
range of endpoint monitoring and traffic-shaping options that can be configured to meet
application requirements and high-availability/failover models. Microsoft guarantees that
DNS queries will receive a valid response from at least one of Microsoft's Azure Traffic
Manager name server clusters at least 99.99% of the time.
• Continuous monitoring of endpoint health and automatic failover when endpoints fail.
Most importantly, Traffic Manager operates at the DNS level to direct clients to a
specific service endpoint based on the configured traffic-routing method; it does not
operate as a proxy or gateway. Clients connect directly to the service endpoint, and
traffic flows are not analyzed.
Deployment Options
o Source IP lookup and comparison with an internet latency table can be
used to optimize the performance of traffic distribution by tracking round-trip
times between the client and service endpoints and directing traffic to the
optimal location. The latency table is regularly updated to account for
changes in the global internet; however, application performance depends on
real-time load across the internet, so performance degradation is possible.
• More advanced traffic routing scenarios can be implemented with nested Traffic
Manager profiles, which combine the routing methods listed above and override
the default behavior to support applications with larger and more complex
deployments.
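As a concrete illustration of these deployment options, the hedged sketch below creates
a Traffic Manager profile with Priority (failover) routing between two endpoints using the
azure-mgmt-trafficmanager package. All names, endpoints, and the health-probe path
are placeholder values, not One Platform configuration, and exact SDK signatures may
vary between package versions:

```python
# Hedged sketch: Traffic Manager profile with Priority (failover) routing.
from azure.identity import DefaultAzureCredential
from azure.mgmt.trafficmanager import TrafficManagerManagementClient
from azure.mgmt.trafficmanager.models import (
    Profile, DnsConfig, MonitorConfig, Endpoint,
)

subscription_id = "<subscription-id>"          # placeholder
client = TrafficManagerManagementClient(DefaultAzureCredential(), subscription_id)

profile = client.profiles.create_or_update(
    "rg-oneplatform-network",                  # placeholder resource group
    "tm-webapp-failover",                      # placeholder profile name
    Profile(
        location="global",
        traffic_routing_method="Priority",     # fail over between endpoints by priority
        dns_config=DnsConfig(relative_name="tm-webapp-failover", ttl=30),
        monitor_config=MonitorConfig(protocol="HTTPS", port=443, path="/health"),
        endpoints=[
            Endpoint(
                name="primary",
                type="Microsoft.Network/trafficManagerProfiles/externalEndpoints",
                target="webapp-eastus2.example.com",
                priority=1,
            ),
            Endpoint(
                name="secondary",
                type="Microsoft.Network/trafficManagerProfiles/externalEndpoints",
                target="webapp-centralus.example.com",
                priority=2,
            ),
        ],
    ),
)
print(profile.dns_config.fqdn)  # <relative_name>.trafficmanager.net
```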
High Availability
Traffic Manager is resilient to failure, including the failure of an entire Azure region.
However, beyond the resilience inherited from the underlying Azure platform, the
following HA architecture models should be reviewed to add redundancy to an
application or service endpoint:
• In the Active-passive with cold standby solution, VMs and other appliances
running in the standby region remain inactive until a failover is initiated. This
approach uses ARM templates, backups, or VM images to replicate resources to
a secondary region; it should be noted that this method extends the time needed
to complete failover and recovery.
• In the Active-passive with warm standby solution, the secondary region is pre-
warmed and runs in parallel, ready to service the base load because all instances
remain online with autoscaling enabled. It should be noted that this solution is not
scaled to carry the full production load, but because all services are already online
and active it can provide functionality, albeit at reduced performance, until it
scales out.
Disaster Recovery
Azure Traffic Manager should leverage the following technical aspects to develop the
DR strategy:
• In conjunction with the Active-passive with cold standby and Active-passive with
pilot light HA solutions, Azure DNS manual failover uses standard DNS
mechanisms to fail over to a backup site. The following assumptions are made (a
sketch follows this list):
o Static IPs are deployed for both the primary and secondary service
endpoints.
o The primary and secondary sites have corresponding Azure DNS zones.
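Under those assumptions, the manual failover amounts to repointing a DNS record at
the secondary site's static IP. The hedged sketch below uses the azure-mgmt-dns
package; the zone, record, and IP values are placeholders and the exact SDK
signatures may vary between versions:

```python
# Hedged sketch of the manual DNS failover described above: repoint an A record
# in an Azure DNS zone from the primary site's static IP to the secondary site's.
from azure.identity import DefaultAzureCredential
from azure.mgmt.dns import DnsManagementClient
from azure.mgmt.dns.models import RecordSet, ARecord

subscription_id = "<subscription-id>"          # placeholder
dns_client = DnsManagementClient(DefaultAzureCredential(), subscription_id)

SECONDARY_STATIC_IP = "203.0.113.20"           # placeholder secondary endpoint IP

# Keep the TTL short so caches expire quickly, then update the record during
# failover to direct clients to the secondary site.
dns_client.record_sets.create_or_update(
    resource_group_name="rg-oneplatform-dns",  # placeholder
    zone_name="apps.example.com",              # placeholder Azure DNS zone
    relative_record_set_name="webapp",         # resolves webapp.apps.example.com
    record_type="A",
    parameters=RecordSet(ttl=60, a_records=[ARecord(ipv4_address=SECONDARY_STATIC_IP)]),
)
```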
Backup
Due to the nature of the service offering and lack of traditional backup operations for the
Azure Traffic Manager, the following method can be leveraged to store existing
configuration templates for redeployment:
Availability Sets
Azure Traffic Manager does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones
Due to the standard replication operations within the Azure Service Fabric,
Availability Zone configurations are not available as a function of the service
offering.
Regions
Note: Due to the unique nature of the Traffic Manager offering, its presence is largely
non-region-specific in order to maintain the highest levels of availability and durability.
Deployment Guidelines and Recommendations
The following are some guidelines and best practices to consider when leveraging
Azure Traffic Manager:
Overview
Azure Application Gateway is a highly scalable and robust solution for publishing web-
based applications. The Azure Application Gateway functions as a reverse proxy,
terminating inbound sessions from end users. It performs SSL decryption and
encryption for the sessions between client and server and protects the communication
with built-in Web Application Firewall technology leveraging the industry-standard
OWASP rule set.
Traditional load balancers (LB) operate at the transport layer (OSI Layer 4, TCP and
UDP) and route traffic from a source IP address and port to a destination IP address
and port. The Application Gateway operates at Layer 7, which is referred to as
application-layer load balancing. The difference between a traditional LB and an
Application Gateway is the ability to make routing decisions based on header attributes
of the HTTP request, such as host header details or the path of a URI. For example, if a
web request has /images in the URL path, the traffic can be routed to a specific backend
server pool configured to host image files; if the URL contains /video, the traffic
can be routed to a separate backend server pool optimized for video streaming.
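To make the path-based routing example concrete, the fragment below shows, in
simplified form, how an Application Gateway URL path map could express the /images
and /video rules described above. It is an illustrative sketch only; the names are
placeholders and this is not a complete gateway configuration:

```python
# Simplified, illustrative fragment of an Application Gateway URL path map.
# Names are placeholders; a real deployment also needs listeners, backend HTTP
# settings, and the rest of the gateway definition.
url_path_map = {
    "name": "pathmap-media",
    "defaultBackendAddressPool": "pool-web-default",
    "pathRules": [
        {
            "name": "rule-images",
            "paths": ["/images/*"],
            "backendAddressPool": "pool-image-servers",    # pool tuned for static files
        },
        {
            "name": "rule-video",
            "paths": ["/video/*"],
            "backendAddressPool": "pool-video-streaming",  # pool tuned for streaming
        },
    ],
}
```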
Deployment Options
As per Microsoft’s recommendation, Application Gateway v2 will be used for all new
deployments.
The following are the standard deployment patterns for Application Gateways:
• Shared end point: The One Platform team reuses the existing end
points/certificates to onboard the applications. The onboarding will be
quicker, as they will be reusing existing Public IP and shared Palo Alto
configurations.
• Dedicated end point: The application is hosted in a One Platform-
managed Azure Application Gateway, but with an application-specific
end point created, including a dedicated Public IP address and related
Palo Alto configuration for Internet-facing applications.
High Availability
Application Gateway WAF_v2 supports autoscaling (a minimum of 2 instances and a
maximum determined by customer needs, scaling up or down based on changing
traffic load patterns) and zone redundancy, and is guaranteed to be available at least
99.95% of the time.
Disaster Recovery
Your tolerance for reduced functionality during a disaster is a business decision that
varies from one application to the next. It might be acceptable for some applications to
be unavailable or to be partially available with reduced functionality or delayed
processing for a period of time. For other applications, any reduced functionality is
unacceptable.
A WAF_v2 Application Gateway can span multiple Availability Zones, offering better
fault resiliency and removing the need to provision separate Application Gateways in
each zone.
Customers looking for DR functionality leveraging OPCS Azure paired regions have to
configure the Application Gateway in the paired region (it is not an inbuilt capability). It
can be configured upfront as a passive setup or deployed as part of the DR process.
Backup
Due to the nature of the service offering and lack of traditional backup operations for
Application Gateways, the following method can be leveraged to store existing
configuration templates for redeployment:
Availability Sets
Application Gateways now offer the Autoscale feature, in addition to manual
configuration of the Scale Unit, with the WAF_v2 SKU. Like Virtual Machine Scale Sets
and App Service scale-out features, you can now define a minimum (2 instances) and a
maximum number of Scale Units (compute resources), so the Application Gateway does
not have to run at the peak capacity provisioned for the anticipated maximum workload,
enabling better elasticity for application workload requirements.
Availability Zones
Application Gateway now offers Availability Zones with the WAF_v2 SKU.
To configure Availability Zones for Application Gateway instances, the following steps
should be executed:
1. Autoscaling:
Regions
• Exception management accountability lies with IPG, but all involved teams
(Application/One Platform) are responsible for maintaining the security standards
throughout the application lifecycle.
• Custom rules/exclusions must be the preferred option over disabling WAF rules.
• If using both HTTP/HTTPS protocols on any sub-site, the default website should
be listening on both 80 and 443. Unless an application specific requirement
exists, all endpoints must be published via HTTPS/443 or redirect from HTTP/80
to HTTPS/443. This would be handled under an exception request with a business
justification.
• The default site should have the TLS certificate bound that will primarily be used
by the Application Gateway to authenticate against the server.
Overview
Operating at Layer 4 of the OSI model, the Azure Load Balancer distributes inbound
flows arriving at the load balancer's frontend to backend pools, as dictated by load-
balancing rules and health probes. Azure VMs or instances in a Virtual Machine Scale
Set (VMSS) can comprise backend pools to further enhance the HA of applications and
services.
Outbound connections for VMs inside the VNet are enabled by NATing their private IPs
to a Public load balancer’s public IP to balance internet traffic to the backend VMs,
whereas private load balancers utilize private IPs on the front end to balance internal
VNet traffic only.
One item of note is a load balancer frontend can be accessed from an on-premises
network in a hybrid scenario.
Applications can be scaled to create highly available services, supporting both ingress
and egress scenarios with low latency and high throughput, scaling up to millions of
flows for all TCP- and UDP-based applications when utilizing a Standard Load
Balancer.
• Distribute resources across, and within, zones to increase application and service
availability.
• Access VMs in a VNet via port forwarding on the load balancer’s public IP.
• Azure Monitor can ingest multi-dimensional metrics that can be filtered, grouped
and further extrapolated for a given dimension from a Standard Load Balancer to
provide current and historic performance and health insights of a service.
• Services hosted on multiple ports and/or multiple IPs can be load balanced.
• Simultaneous TCP and UDP flow load balancing on all ports utilizing HA ports.
Deployment Options
Azure Load Balancers contain the following items and options when deploying into an
environment, but first there are two SKUs to consider, as illustrated in the table below:
• The Frontend of the Load Balancer can be configured with two types of IPs for
access, and the selection determines the type of load balancer created. The two
types, available in both Basic and Standard SKUs, are:
• The Health Probe defines the unhealthy thresholds for backend pool
instances; when a probe response fails, the load balancer stops sending new
flows to the unhealthy node. Probe failures do not affect existing connections,
which continue until the application terminates the flow, an idle timeout occurs,
or the VM shuts down. Health probe types include TCP, HTTP, and HTTPS
endpoints for Standard Load Balancers; HTTPS probes are not supported for
Basic Load Balancers.
• Load Balancing Rules configure the load balancer's frontend IP and application-
specific port requirements against multiple backend IPs and ports; application and
service requirements should dictate the design for optimal performance.
• Inbound NAT rules employ the same hash-based distribution as load balancing
to implement port forwarding for frontend IP traffic to backend pool’s instances
while Outbound Rules configure and handle outbound NATing for all backend
pool instances.
High Availability
An internal Standard Load Balancer assists with load balancing TCP and UDP flows on
all ports simultaneously. An HA ports rule, a variant of a load-balancing rule, simplifies
balancing all TCP and UDP flows by providing a single rule configuration; load-balancing
decisions are executed per flow, and the feature is supported in all global Azure regions.
The decision algorithm is based on a five-tuple connection:
1. Source IP
2. Source Port
3. Destination IP
4. Destination Port
5. Protocol
Note: While a 5-tuple hash might result in the best possible load distribution among
the members in the pool, some scenarios might require source IP address affinity,
where a 2-tuple (source IP and destination IP) algorithm would be the best fit.
If a large number of ports must be load balanced, or critical scenarios such as HA and
scale for network virtual appliances inside VNets must be deployed, you should
implement an HA ports load-balancing rule by configuring the frontend and backend
ports to '0' and the protocol to 'All', so the load balancer balances all flows regardless
of port number (an illustrative rule fragment follows the configuration examples below).
Applications requiring load balancing of large numbers of ports are ideal for HA port
rules; deploying an internal Standard Load Balancer with an HA ports rule simplifies the
implementation compared to configuring multiple load-balancing rules, one per port.
The following are examples of supported HA port rule configurations to assist with
identifying applications and services ideal for use and further raising availability:
o Note: A Public Standard Load Balancer for backend instance set can be
configured in addition to this HA ports rule configuration.
• An internal Load Balancer HA ports rule configured with Direct Server Return
(Floating IP) allows you to add more floating-IP load-balanced rules and/or a
Public Load Balancer. However, you cannot combine this configuration with the
previous non-floating-IP configuration.
o Deploy multiple load balancing rules where each rule has a single, unique,
frontend IP address assigned.
o For all load balancing rules, select HA ports option and set Floating IPs to
Enabled.
• Internal Standard Load Balancers are the only supported configuration available
for HA ports rules.
• Non-HA ports and HA ports load balancing rules pointing to the same backend
pool IPs is not supported.
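The illustrative fragment below shows the shape of such an HA ports rule on an internal
Standard Load Balancer: frontend and backend ports set to 0 and the protocol set to All,
so every TCP and UDP flow is balanced by a single rule. Resource IDs and names are
placeholders, not One Platform values:

```python
# Illustrative fragment of an HA ports load-balancing rule on an internal
# Standard Load Balancer. IDs and names are placeholders.
ha_ports_rule = {
    "name": "rule-ha-ports",
    "protocol": "All",          # all protocols (TCP and UDP)
    "frontendPort": 0,          # 0 = all ports
    "backendPort": 0,           # 0 = all ports
    "frontendIPConfiguration": {"id": "<frontend-ip-configuration-resource-id>"},
    "backendAddressPool": {"id": "<backend-pool-resource-id>"},
    "probe": {"id": "<health-probe-resource-id>"},
    "enableFloatingIP": False,  # set True for the Direct Server Return variant
    "idleTimeoutInMinutes": 4,
}
```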
Additionally, load balancing services on multiple ports and/or multiple IPs is supported
on Public and Internal load balancers to balance flows across a set of VMs. A frontend
and a backend pool configuration are connected when defining Azure Load Balancer
rules and the health probe referenced by the rule determines how new flows are sent to
a backend pool node. The frontend virtual IP (VIP) is defined by a 3-tuple comprising a
Public or Internal IP, a transport protocol and port number from the load balancing rule
whereas the backend pool is the collection of VM IP configurations attached to the VM
NIC resource, which reference the Load Balancer backend pool.
Flexibility is provided when defining the load balancing rules; a rule declares IP/port
mapping on the frontend to destination IP/port on the backend and the type of rule
dictates whether backend ports are reused across rules. There are two types of rules:
Azure Load Balancers support mixing both rule types, using them simultaneously for a
given VM, or in any combination, as long as the constraints of each rule are followed.
Application requirements and support complexity will dictate the rule type selected, so
each option should be evaluated.
The following are limitations when utilizing multiple frontends to enhance HA in the
environment:
• Only IaaS VMs are supported when utilizing multiple frontend configurations.
• An application must use the primary IP configuration for outbound SNAT flows
when using the Floating IP rule; if the application binds to the frontend IP
configured on the guest OS loopback interface Azure’s outbound SNAT is
unavailable to rewrite the outbound flow, causing failure.
Disaster Recovery
The Azure Load Balancer offering has DR natively built in via the Azure Service
Fabric. Additional considerations should reference the Backups and Availability
Zones sections below. An Azure Load Balancer should also be created in the ASR
target region as part of DR.
Backups
Due to the nature of the service offering and lack of traditional backup operations for the
Azure Load Balancer service, the following method can be leveraged to store existing
configuration templates for redeployment:
Availability Sets
Azure Load Balancer does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones
Public and Internal Load Balancers can support zone redundant and zonal
configurations and can direct traffic across zones, as needed (cross zone load
balancing). The core components of Load Balancers should include the following design
considerations:
• Frontends for Standard Load Balancers can be zone redundant, meaning all
inbound and outbound flows are served simultaneously by multiple Azure
Availability Zones in a region using a single IP. The frontend can survive zone
failure and can be used to reach all non-impacted backend pool members,
regardless of zone. One or more Availability Zones can fail and the data path will
survive as long as one zone in the region remains healthy, because the frontend's
single IP is served simultaneously by multiple, independent infrastructure
deployments in multiple zones. While this does not imply a hitless data path, any
retries should succeed in other non-impacted zones.
• A zone-redundant frontend causes the Load Balancer to expand its internal
health model to independently probe the reachability of a VM from each
Availability Zone and to shut down failed cross-zone paths without manual
intervention. During failure events, each zone may therefore see a slightly
different distribution of new flows, which protects the overall health of the service.
Regions
The following are some guidelines and best practices to consider when leveraging
Azure Load Balancers:
• Zone-redundant flows can leverage any zone, and the Load Balancer will utilize
all healthy zones in a region, so in the event of a zone failure, traffic flows in
healthy zones are unimpacted. Traffic flows in the failed zone may be impacted,
but applications can recover, so implementing zone redundancy helps ensure
traffic reaches its intended destinations.
• Analyze applications that can support and implement HA Port rules on Load
Balancers where possible to add reliability and high availability.
• Implement VM Availability Sets and Virtual Machine Scale Sets, where possible,
for backend pools to increase both instances and scaling of VM presence during
heavy workloads.
• When designing zone failure resiliency, applications that consist of multiple
components should be carefully analyzed to consider the survival of sufficient
critical components and impacts of zone failure. How an application converges
after a zone failure and restoration should also be taken into consideration when
designing resiliency in regard to all components, including Load Balancing rules.
6.2. Azure Application Insights
6.2.1. Overview
Application Insights, now integrated natively to Azure Monitor, is an extensible
Application Performance Management (APM) service for Developers and DevOps
professionals. It can be leveraged to monitor live applications and will automatically
detect performance anomalies. Application Insights includes powerful analytics tools to
help you diagnose issues and to help better understand how users consume your
applications for continuous improvement of performance and usability. It works for apps
on a wide variety of platforms, whether hosted on-premises, in hybrid environments, or
in any cloud. It integrates with your DevOps process and has connection points to a
variety of development tools.
For example, Application Insights can monitor and analyze telemetry from PaaS web
apps, from IaaS web services via custom agent installation, and from mobile apps by
integrating with Visual Studio App Center, instrumenting not only the web service
application but also any background components.
Application Insights is aimed at the development and support teams to help you
understand how your app is performing and how it is being used. It monitors:
• Request rates, response times, and failure rates: Find out which pages are
most popular, at what times of day, and where your users are. See which pages
perform best. If your response times and failure rates go high when there are more
requests, then perhaps you have a resourcing problem.
• Dependency rates, response times, and failure rates: Find out whether
external services are slowing you down.
• Exceptions: Analyze the aggregated statistics, or pick specific instances and drill
into the stack trace and related requests. Both server and browser exceptions are
reported.
• AJAX calls from web pages – rates, response times, and failure rates.
• Diagnostic trace logs from your app - so that you can correlate trace events with
requests.
• Custom events and metrics that you write yourself in the client or server code, to
track business events such as items sold or games won.
• At run time: instrument your web app on the server – Ideal for applications
currently deployed; avoids any update to the code.
o Java
o Node.js
o Python
o Additional platforms
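As a hedged illustration of the custom events and metrics mentioned above, the sketch
below sends telemetry from Python using the applicationinsights package. The
instrumentation key and the event, metric, and trace names are placeholders, not One
Platform values:

```python
# Hedged sketch: custom events, metrics, and traces via the applicationinsights package.
from applicationinsights import TelemetryClient

tc = TelemetryClient("<instrumentation-key>")   # placeholder key

# Custom business event, e.g. an item sold, with optional properties.
tc.track_event("ItemSold", {"sku": "ABC-123", "region": "eastus2"})

# Custom metric, e.g. the duration of a background job in seconds.
tc.track_metric("BackgroundJobDurationSeconds", 42.5)

# Diagnostic trace that can be correlated with requests in the portal.
tc.track_trace("Nightly export completed")

tc.flush()  # telemetry is batched; flush before the process exits
```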
6.2.5. Backup
Due to the nature of the service offering and lack of traditional backup operations for
Application Insights, the following method can be leveraged to store existing
configuration templates for redeployment:
6.2.8. Regions
Please refer to the Appendix for regional availability.
6.3.1. Overview
Azure Automation delivers a cloud-based automation and configuration service enabling
consistent management across Azure and non-Azure environments. Azure Automation
comprises process automation, configuration management, update management,
shared capabilities, and heterogeneous features. Automation gives you complete
control during deployment, operations, and decommissioning of workloads and
resources. The automation account provides the following capabilities:
Currently, in One Platform, Automation accounts are not used for Configuration
management and Update management.
• Configure VMs: Assess and configure Windows and Linux machines with
configurations for the infrastructure and application.
• Find changes: Identify changes that can cause misconfiguration and improve
operational compliance.
• Monitor: Isolate machine changes that are causing issues and remediate or
escalate them to management systems.
6.3.5. Backup
Due to the nature of the service offering and lack of traditional backup operations for
Azure Automation accounts, the following method can be leveraged to store existing
configuration templates for redeployment:
6.3.8. Regions
Please refer to the Appendix for regional availability.
6.3.9. Deployment Guidelines and Recommendations
The following are some guidelines and best practices to consider when implementing
Azure Automation processes:
• Utilize tags to organize and track workflow to collect and report on metadata from
a range of Azure resources.
• Use Azure Automation assets and never hardcode values, especially secure
information.
6.4. Azure Recovery and Backup Services
6.4.1. Azure Backup Service
Overview
The Azure Backup service provides simple, secure, and cost-effective solutions to back
up your data and recover it from the Microsoft Azure cloud. Azure Backup supports
backup of Azure VMs utilizing Azure Disk Encryption (ADE), which leverages BitLocker
and dm-crypt features. ADE is integrated with Key Vault, managing disk encryption
keys/secrets and can also leverage Key Vault Key Encryption Keys (KEKs) to add an
additional security layer by encrypting encryption secrets prior to Key Vault writes. The
following are some limitations when backing up encrypted VMs:
• Backup and restore operations on encrypted VMs can only be executed within the
same subscription and region as the Recovery Services Backup vault.
• Standalone keys are supported with Azure Backup; certificate key pairs used to
encrypt VMs are currently unsupported.
• Encrypted VMs cannot be restored at the file/folder level; the entire VM needs to
be restored for file/folder recovery.
• When restoring a VM, the replace existing VM option cannot be used for
encrypted VMs, as this option is only available for unencrypted managed disks.
• On-premises: Back up files, folders, system state using the Microsoft Azure
Recovery Services (MARS) agent. Or use the DPM or Azure Backup Server
(MABS) agent to protect on-premises VMs (Hyper-V and VMWare) and other on-
premises workloads
• SQL Server in Azure VMs: Back up SQL Server databases running on Azure
VMs
• SAP HANA databases in Azure VMs: Back up SAP HANA databases running
on Azure VMs
Deployment Options
Azure Backup stores backed up data in Azure Recovery Services vaults, a feature
deeply integrated into the Azure Service Fabric. There are multiple backup options,
however, one main caveat is the backed-up item must reside in the same region as the
Recovery Services Vault. For Azure VMs, data is encrypted-at-rest using Storage
Service Encryption (SSE). Azure Data Encryption-at-Rest is detailed here:
• https://docs.microsoft.com/en-us/azure/security/fundamentals/encryption-atrest
• https://docs.microsoft.com/en-us/azure/security/fundamentals/encryption-
overview
Feature Details
Vaults in subscription: Up to 500 vaults in a single subscription.
Machines in a vault: Up to 1,000 Azure VMs and up to 50 MABS (Microsoft Azure Backup Server) servers in a single vault.
Data sources: Maximum size of an individual data source is 54 TB; this does not apply to Azure VM backups.
Backups to vault: Azure VMs: once a day. Machines protected by MABS/DPM: twice a day. Machines backed up directly via the MARS (Microsoft Azure Recovery Services) agent: three times a day.
Backups between vaults: Backup is within a region. A vault is needed in every Azure region that contains VMs targeted for backup; cross-region backups are not allowed.
Move data between vaults: Moving backed-up data between vaults is not supported.
Modify vault storage type: A vault is created using geo-redundant storage (GRS) by default. If locally redundant storage (LRS) is desired, a new vault and manual configuration change are required, as the replication type cannot be modified after backups begin.
High Availability
Microsoft guarantees at least 99.9% availability of the backup and restore functionality
of the Azure Backup service.
Disaster Recovery
The Azure Backup service offering has DR natively built in via the Azure Service Fabric
and is configured by default to use geo-redundant storage. If locally redundant storage
is required, the vault should be modified before beginning backup activities. Cross-
region restore is disabled by default, so it should also be configured prior to initial
backups.
The Recovery Services Vault can be configured to utilize Cross Region Restore to
restore Azure VMs in a secondary region, which is an Azure paired region. The vault
must be configured to utilize GRS storage and will auto pick the secondary region for
replication. This option allows:
Backup
Due to the nature of the service offering and lack of traditional backup operations for the
Azure Backup service, the following method can be leveraged to store existing
configuration templates for redeployment:
1. Export templates of each Recovery Services Vault.
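A hedged sketch of that export step is shown below, using the azure-mgmt-resource
package to export the ARM template of the resource group containing a Recovery
Services vault so the configuration can be stored and redeployed. The resource group
name is a placeholder, and the exact SDK method names may differ between package
versions:

```python
# Hedged sketch: export the ARM template of the resource group holding a
# Recovery Services vault and save it for redeployment.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<subscription-id>"              # placeholder
resource_client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Export every resource in the group; narrow the list to the vault's resource ID
# if only the vault definition should be captured.
poller = resource_client.resource_groups.begin_export_template(
    "rg-oneplatform-backup",                       # placeholder resource group
    {"resources": ["*"], "options": "IncludeParameterDefaultValue"},
)
exported = poller.result()

with open("recovery-vault-template.json", "w") as f:
    json.dump(exported.template, f, indent=2)
```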
Availability Sets
Azure Backup does not offer the Availability Set feature as an available function
of the service offering.
Availability Zones
Azure Backup does not offer Availability Zones as an available function of the
service offering.
Regions
The following are some guidelines and best practices to consider when implementing
Azure Backups:
• Modify the default schedule times that are set in a policy. For example, if the
default time in the policy is 12:00 AM, increment the timing by several minutes so
that resources are optimally used.
• For backup of VMs that are using premium storage with Instant Restore, it is
recommended to allocate 50% free space of the total allocated storage space;
this is required only for the initial backup. The 50% free space is not a
requirement for backups after the first backup is complete.
• The limit on the number of disks per storage account is relative to how heavily
the disks are being accessed by applications that are running on an Azure VM.
As a general rule, if 5 to 10 disks or more are present on a single storage
account, balance the load by moving some disks to separate storage accounts.
6.4.2. Azure Site Recovery
Overview
Azure Site Recovery Services enables the adoption of a business continuity and
disaster recovery (BCDR) strategy. The organization’s strategy should provide data,
application and workload resiliency to ensure availability and recovery when both
scheduled and unscheduled outages occur.
• Backup service: The Azure Backup service keeps your data safe and
recoverable.
Supported Details
Replication scenarios Replicate Azure VMs from one Azure region to another.
Deployment Options
Azure Site Recovery configurations are stored in Azure Recovery Services vaults, along
with Azure Backup policies. Azure Site Recovery allows the replication of VMs to a
supported geo-cluster region by creating a Recovery Plan and selecting the source and
target regions. The following items need to be configured for a Recovery Plan:
2. Select a source and target based on the machines targeted for the plan.
a. Virtual Machines are added to a default group (Group 1) in the plan. After
failover all machines in this group start at the same time.
b. Only Virtual machines in the source and target locations specified can be
selected.
• A Recovery Plan can only be used for failover from source to target; it cannot be
used for failback.
• The source location machines are required to be enabled for failover and
recovery.
• A recovery plan can contain machines with the same source and target.
A group can be created within a plan to specify behaviors on a group-by-group basis so
a specific group can be identified to start up in a desired sequence to facilitate required
service access during startup.
High Availability
Disaster Recovery
The Azure Site Recovery offering has DR natively built in via the Azure Service
Fabric.
Backup
Due to the nature of the service offering and lack of traditional backup operations for the
Azure Site Recovery service, the following method can be leveraged to store existing
configuration templates for redeployment:
Availability Sets
Azure Site Recovery does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones
Azure Site Recovery is a core feature of the Azure Service Fabric so in lieu of a
traditional Availability Zone configuration the following SLA applies to Azure Site
Recovery Azure-to-Azure Failover.
Regions
The following are some guidelines and best practices to consider when implementing
Azure Site Recovery:
• Ensure VMs, and subsequent NSGs, are configured to allow appropriate service
endpoint access.
• Site Recovery can scale to thousands of VMs; however, Recovery Plans should
model applications that fail over as a group, and limiting a plan to around 50 VMs
helps reduce overall downtime during failover.
• Develop and define Recovery Time Objectives (RTO) and Recovery Point
Objectives (RPO) for each application.
Overview
Azure Blob Storage is an object storage solution service in the Azure Service Fabric.
Blob Storage is optimized for storing large amounts of unstructured data that does not
align to a specific model or definition, such as text or binary data.
• Types:
o Block blobs contain text and binary data up to a maximum of 4.7 TB and are
composed of individually managed blocks of data.
o Page blobs contain random-access files utilized by VHDs, which serve as
disks for Azure VMs, up to a maximum of 8 TB in size.
• Usage
Blob storage objects can be accessed over HTTP or HTTPS; however, KPMG policy
requires HTTPS URLs for user or client applications via the Azure Storage REST API,
Azure PowerShell, Azure CLI, or an Azure Storage client library such as:
• .NET
• Java
• Node.js
• Python
• PHP
• Ruby
Deployment Options
The following are deployment configurations available and corresponding metrics for
Azure Blob Storage:
o Cool is targeted and optimized for infrequently accessed data stored for at
least 30 days.
o Archive is targeted and optimized for rarely accessed data stored for at
least 180 days with flexible latency requirements.
o Hot and Cool access tiers can be set at the account level, whereas Archive
is not available at the account level.
o Hot, Cool, and Archive tiers can be set during or after upload at the blob
level.
o Cool access tier data can tolerate slightly lower availability but still requires
durability, retrieval latency, and throughput similar to Hot tier data. For this
reason, the Cool tier carries a slightly lower SLA and higher access costs as
a counterbalance to its lower storage costs relative to the Hot tier.
o The Archive access tier stores data offline with the lowest storage costs,
offset by the highest access and data rehydration costs.
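The hedged sketch below shows how a blob can be moved between the access tiers
just described using the azure-storage-blob package. The connection string, container,
and blob names are placeholders:

```python
# Hedged sketch: change a blob's access tier with azure-storage-blob.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="reports", blob="2020/06/archive-report.pdf")

# Demote an infrequently accessed blob to the Cool tier (at least 30 days),
# or pass "Archive" for rarely accessed data (at least 180 days, offline rehydration).
blob.set_standard_blob_tier("Cool")
```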
High Availability
Data in an Azure Storage account is always replicated three times in the primary region
and offers the following options for data replication within primary and secondary
regions:
o GRS data is synchronously copied three times within a single region and
physical location utilizing LRS, then asynchronously copies data to the
secondary region’s single physical location.
Disaster Recovery
Although unplanned service outages can occur, Microsoft's goal is to ensure the Azure
Storage account is always available. If an application or service requires resiliency, Microsoft
recommends utilizing, at a minimum, GRS to replicate data to a second region. Paired
with a DR plan to handle regional service outages, failover to a secondary endpoint, in
the event the primary becomes unavailable, is also an important facet of application and
service continuity.
Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
generally available. With account failover, you can initiate the failover process for an
account if the primary endpoint becomes unavailable. Once a failover is completed, the
secondary endpoint becomes the active primary endpoint for the account and clients
can resume writing.
One item of note is the risk of data loss: because data is replicated asynchronously to
the secondary region, there is an inherent delay between a write in the primary region
and its replication to the secondary region. This applies to standard geo-redundant
storage replication and during failover operations, so it is important to understand the
implications of initiating an account failover.
An alternate to the storage account failover is manual data copy operations. If the
account is configured for RA-GRS, then read access to data via secondary endpoint is
available. In the event of an outage in the primary region you can use AzCopy, Azure
PowerShell or the Azure Data Movement Library tools to copy data from the secondary
storage account region to another account in an unaffected region. Once complete, the
newly copied data can be accessed for read and write operations.
In rare catastrophic disaster scenarios, Microsoft may initiate a regional failover. In this
event, you will not have write access to the storage accounts until the Microsoft-
managed failover is complete; however, no action is required. Also, if RA-GRS is
configured for an account, applications can read from a secondary region.
Backup
Due to the nature of the service offering and lack of traditional backup operations for the
Azure Blob Storage Account service, the following method can be leveraged to store
existing configuration templates for redeployment:
Availability Sets
Azure Blob Storage does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones
Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.
Regions
The following are some guidelines and best practices to consider when implementing
Azure Blob Storage:
• Soft delete for Blob storage should be enabled to prevent accidental or malicious
deletion as a first line function of DR related activities.
• Validate application and service requirements when selecting blob types (Block
and Page) to improve overall performance.
• The Azure Storage Account will automatically improve availability by not requiring
user configured scaling work when utilizing blob storage for static website/data.
• When utilizing GRS/RA-GRS/GZRS/RA-GZRS the Last Sync Time property
indicates the last successful replication timestamp from primary to secondary
region to evaluate discrepancies. This is helpful when designing an application to
switch seamlessly to reading from the secondary region if the primary becomes
unresponsive.
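Building on the Last Sync Time guideline above, the hedged sketch below reads the
replication statistics of an RA-GRS account and then reads a blob from the account's
secondary endpoint, which is the read-only fallback if the primary region becomes
unresponsive. The account, key, container, and blob names are placeholders:

```python
# Hedged sketch: Last Sync Time and secondary-endpoint reads on an RA-GRS account.
from azure.storage.blob import BlobServiceClient

account = "<storage-account-name>"
credential = "<storage-account-key>"

# Replication statistics (including last_sync_time) are served from the
# secondary location of an RA-GRS / RA-GZRS account.
primary = BlobServiceClient(f"https://{account}.blob.core.windows.net", credential=credential)
stats = primary.get_service_stats()
print("Last sync time:", stats["geo_replication"]["last_sync_time"])

# Read-only fallback: point a client at the "-secondary" endpoint of the account.
secondary = BlobServiceClient(f"https://{account}-secondary.blob.core.windows.net",
                              credential=credential)
data = secondary.get_blob_client("reports", "2020/06/archive-report.pdf").download_blob().readall()
```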
Overview
Azure Files are SMB accessible file shares that are fully managed. Azure File shares
can be seamlessly integrated with Windows and can be cached on Windows Servers
utilizing Azure File Sync for fast access close to user and data consumption points.
• Azure File Shares leverage the SMB protocol to seamlessly transition from on-
premises file shares without requiring additional compatibility configurations for
applications.
• Applications running in Azure can leverage the file system I/O APIs, Azure
Storage Client Libraries and Azure Storage REST API to allow streamlined
migration of existing applications to Azure File Shares. PowerShell and Azure
CLI can also be utilized to administer Azure applications while the Azure Portal
and Azure Storage Explorer can manage file shares.
Deployment Options
The following are deployment configurations available and corresponding metrics for
Azure File Storage:
• Premium storage is recommended for I/O-intensive workloads requiring file share
semantics along with significantly high throughput and low latency, by utilizing
high-performance SSD-based storage. Premium file shares are only available with
the LRS replication option.

Storage account types and capabilities

Account type | Service availability | Performance tier support | Access tier support | Replication options | Deployment methods
General-purpose V2 | Blob, File, Queue, Table, Disk, Data Lake Gen2 | Standard, Premium | Hot, Cool, Archive | LRS, GRS, RA-GRS, ZRS, GZRS (preview), RA-GZRS (preview) | ARM
General-purpose V1 | Blob, File, Queue, Table, Disk | Standard, Premium | N/A | LRS, GRS, RA-GRS | ARM, Classic
BlockBlobStorage | Blob (block blobs and append blobs only) | Premium | N/A | LRS, ZRS | ARM
FileStorage | File only | Premium | N/A | LRS, ZRS | ARM
BlobStorage | Blob (block blobs and append blobs only) | Standard | Hot, Cool, Archive | LRS, GRS, RA-GRS | ARM
High Availability
Data in an Azure Storage account is always replicated three times in the primary region
and offers the following options for data replication within primary and secondary
regions:
o GRS data is synchronously copied three times within a single region and
physical location utilizing LRS, then asynchronously copies data to the
secondary region’s single physical location.
o User access and applications that require read access to the secondary
region should enable Read Access Geo Redundant Storage (RA-GRS).
o User access and applications that require read access to the secondary
region should enable Read Access Geo Zone Redundant Storage (RA-
GZRS).
Disaster Recovery
Although unplanned service outages can occur, Microsoft's goal is to ensure the Azure
Service Fabric is always available. If an application or service requires resiliency, Microsoft
recommends utilizing, at a minimum, GRS to replicate data to a second region. Paired
with a DR plan to handle regional service outages, failover to a secondary endpoint, in
the event the primary becomes unavailable, is also an important facet of application and
service continuity.
Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
generally available. With account failover, you can initiate the failover process for an
account if the primary endpoint becomes unavailable. Once a failover is completed, the
secondary endpoint becomes the active primary endpoint for the account and clients
can resume writing.
One item of note is the risk of data loss: because data is replicated asynchronously to
the secondary region, there is an inherent delay between a write in the primary region
and its replication to the secondary region. This applies to standard geo-redundant
storage replication and during failover operations, so it is important to understand the
implications of initiating an account failover.
Azure File Share also provides snapshot capabilities to capture the state of a share at a
point in time for restore operations. If an application is misconfigured, unintended code
is deployed, or accidental deletion or block damage occurs, the share can be restored
from a snapshot to the last known good state.
Backup
As previously mentioned, Azure File Share provides snapshot capabilities to capture the
share state at a point in time for restore operations. The Azure Backup service also
provides ability to create a schedule for backup operations. The Azure Backup policy
creates an Azure File Share snapshot that can be viewed by both REST API and SMB.
The same capabilities are available in the client library, Azure CLI and the Azure Portal.
A snapshot of the file share can be utilized for data backup, auditing requirements or
DR activities. After a share snapshot is created either manually or via backup policy it
can be read, copied or deleted but not modified. The entire snapshot object also cannot
be copied to another storage account; however, the contents can be copied via AzCopy
or other copy mechanisms. The current snapshot limit per share is 200; after reaching
200, older share snapshots must be deleted in order to create new ones.
Share snapshots are incremental: only data changed since the most recent snapshot is
saved, which minimizes both the time required to create a snapshot and storage costs.
Although saved incrementally, only the most recent snapshot needs to be retained,
because each snapshot contains all the information needed to browse and restore data
from the time it was taken. Snapshots do not count toward the 5 TB share limit; there is
no limit on how much space they occupy in total, but storage account limits still apply.
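The hedged sketch below creates a share snapshot and browses it using the
azure-storage-file-share package, as one way to exercise the snapshot behavior
described above. The connection string and share name are placeholders:

```python
# Hedged sketch: create and browse an Azure file share snapshot.
from azure.storage.fileshare import ShareClient

conn_str = "<storage-connection-string>"
share = ShareClient.from_connection_string(conn_str, share_name="projects")

# Incremental, read-only snapshot of the current share state.
snapshot = share.create_snapshot()
print("Snapshot taken at:", snapshot["snapshot"])

# Browse the snapshot (read-only); contents can be copied out with AzCopy or
# the SDK, but the snapshot itself cannot be modified.
snapshot_view = ShareClient.from_connection_string(
    conn_str, share_name="projects", snapshot=snapshot["snapshot"]
)
for item in snapshot_view.list_directories_and_files():
    print(item.name)
```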
Availability Sets
Azure File Storage does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones
Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.
Regions
The following are some guidelines and best practices to consider when implementing
Azure File Storage:
• Automate backups for data recovery where possible, as automated activities are
more reliable than manual processes and help improve data protection and
recoverability.
• Azure File Share snapshots only provide file level protection and do not prevent
file share or storage account deletions. To protect a storage account against
accidental deletions a resource group or storage account lock should be
implemented along with RBAC based controls.
• General purpose storage account utilization from other storage objects affects
Azure File shares in the same storage account so available resources should be
monitored to prevent space constraints.
• Azure File Storage accounts can only be created utilizing the premium
performance tier so the costs should be considered when provisioning and
replication options are selected.
6.5.3. Azure Table Storage
Overview
Azure Table storage is a storage service with a schemaless design that stores
structured NoSQL data as a key/attribute store. The design allows data to adapt easily
as application requirements mature. Table storage data access is fast and cost effective
for many applications, and is traditionally lower in cost than standard SQL for
comparable volumes of data.
• Storing datasets that do not require complex joins, stored procedures, or foreign
keys and can be de-normalized for fast, highly efficient access.
• OData and LINQ queries with WCF Data Service library Access.
Deployment Options
Azure Table Storage employs multiple deployment options to meet organization and
application specific requirements.
The following are deployment configurations available and corresponding metrics for
Azure Table Storage:
o Azure Cosmos DB Table API is the premium offering for table storage,
featuring throughput-optimized tables, global distribution, and automatic
secondary indexes. The following table outlines differences between
Azure Table storage and the Azure Cosmos DB Table API:
High Availability
Data in an Azure Table Storage is always replicated three times in the primary region
and offers the following options for data replication within primary and secondary
regions:
o GRS data is synchronously copied three times within a single region and
physical location utilizing LRS, then asynchronously copies data to the
secondary region’s single physical location.
Disaster Recovery
Although unplanned service outages can occur, Microsoft's goal is to ensure the Azure
Service Fabric is always available. If an application or service requires resiliency, Microsoft
recommends utilizing, at a minimum, GRS to replicate data to a second region. Paired
with a DR plan to handle regional service outages, failover to a secondary endpoint, in
the event the primary becomes unavailable, is also an important facet of application and
service continuity.
Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
generally available. With account failover, you can initiate the failover process if the
primary endpoint becomes unavailable. Once a failover is completed, the secondary
endpoint becomes the active primary endpoint for the account and clients can resume
writing.
One item of note is the risk of data loss caused by the inherent delay between the
primary region write and secondary region replication when data is asynchronously
replicated. This applies to standard geo-redundant storage replication and during
failover operations, so it is important to understand the implications of initiating
account failover.
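As an illustration of the failover process described above, the following is a minimal
sketch using the azure-mgmt-storage and azure-identity Python SDKs; the subscription
ID, resource group, and storage account names are placeholders and should be replaced
with One Platform values.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholder subscription ID, resource group, and account name.
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Check the last sync time first to gauge potential data loss from
# asynchronous geo-replication before committing to a failover.
account = client.storage_accounts.get_properties(
    "rg-oneplatform", "stoneplatformtables", expand="geoReplicationStats"
)
print("Last sync time:", account.geo_replication_stats.last_sync_time)

# Initiate the account failover; once complete, the secondary endpoint
# becomes the new primary and clients can resume writing.
client.storage_accounts.begin_failover("rg-oneplatform", "stoneplatformtables").result()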
An alternative to storage account failover is a manual data copy operation. Azure
Tables data can be exported via AzCopy and imported to another storage account in a
different region.
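AzCopy table support varies by AzCopy version, so as a hedged alternative the manual
copy can also be scripted with the azure-data-tables SDK; the connection strings and
table name below are placeholders.

from azure.data.tables import TableServiceClient

# Placeholder connection strings for the source and target storage accounts.
src = TableServiceClient.from_connection_string("<source-connection-string>")
dst = TableServiceClient.from_connection_string("<target-connection-string>")

table_name = "customers"  # hypothetical table name
src_table = src.get_table_client(table_name)
dst.create_table_if_not_exists(table_name)
dst_table = dst.get_table_client(table_name)

# Copy every entity to the storage account in the other region.
for entity in src_table.list_entities():
    dst_table.upsert_entity(entity)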
Backup
Due to the nature of the service offering, there are no traditional backup operations for
the Azure Table Storage service; existing configuration templates can be stored for
redeployment. However, a combination of the storage account type and replication
options referenced above should be leveraged to enable backup concepts for Table
storage.
Availability Sets
Azure Table Storage does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones
Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.
Regions
The following are some guidelines and best practices to consider when implementing
Azure Table Storage:
• Table storage is low cost, so consider storing the same entity multiple times, with
separate keys, to enable HA and more efficient operations.
• Distribute data by locating table data in a storage account close to the end users.
This technique, along with leveraging a geo and zone redundant replication
option will enable both resiliency and efficiency of table operations.
• Select keys that enable the distribution of both requests and partitions at any
point in time to also assist with HA.
• Use the newest version of the Table Service Client Library to improve
performance and assist with overall Table uptime and responsiveness.
• If the Storage Account failover process is employed for DR, ensure the full scope
of time constraints, potential data loss, and replication configuration continuity is
understood.
• If the Table Storage account is configured for RA-GRS, native read access is
enabled to the secondary endpoint. Leveraging AzCopy will allow you to copy
data to a storage account in an unaffected primary region so applications can
then be updated to leverage both read and write operations.
6.5.4. Azure Queue Storage
Overview
Azure Queue Storage stores large numbers of messages that can be accessed from
anywhere in the world via authenticated HTTP or HTTPS calls; however, KPMG policy
enforces the use of HTTPS. A queue message can be up to 64 KB in size, and a queue
can contain millions of messages, up to the storage account capacity limit. A queue is
commonly utilized to develop a backlog of work to process asynchronously.
• https://<<app>>.queue.core.windows.net/<<container>>
• https://contosoqueue.queue.core.windows.net/images-to-download
• Queue access, along with other storage types, is executed through the
sponsoring storage account.
• A queue contains a set of messages; the queue name must be all lowercase.
• The maximum message size is 64 KB, regardless of format. With the newer
service version (2017-07-29 and later), the time-to-live can be set to any positive
number, or to -1 so the message never expires. If omitted, the default TTL is
7 days (see the sketch following this list).
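To illustrate the TTL behavior noted above, the following is a minimal sketch using the
azure-storage-queue SDK; the connection string and queue name are placeholders.

from azure.storage.queue import QueueClient

# Placeholder connection string; queue names must be all lowercase.
queue = QueueClient.from_connection_string("<connection-string>", "images-to-download")
queue.create_queue()

# time_to_live is expressed in seconds; if omitted, the default TTL is 7 days.
queue.send_message("download image 42", time_to_live=3600)

# With service version 2017-07-29 and later, -1 means the message never expires.
queue.send_message("retain until processed", time_to_live=-1)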
Deployment Options
Azure Queue Storage employs multiple deployment options to meet organization and
application specific requirements.
The following are deployment configurations available and corresponding metrics for
Azure Queue Storage:
High Availability
Data in Azure Queue Storage is always replicated three times in the primary region.
The service offers the following options for data replication within the primary and
secondary regions:
o GRS: data is synchronously copied three times within a single physical
location in the primary region utilizing LRS, then asynchronously copied to a
single physical location in the secondary region.
Disaster Recovery
While unplanned service outages occur, Microsoft’s goal is to ensure the Azure Service
Fabric is always available. If an application or service requires resiliency, Microsoft
recommends utilizing, at a minimum, GRS to replicate data to a second region. Paired
with a DR plan to handle regional service outages, failover to a secondary endpoint, in
the event the primary becomes unavailable, is also an important facet of application and
service continuity.
Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
generally available. With account failover, you can initiate the failover process if the
primary endpoint becomes unavailable. Once a failover is completed, the secondary
endpoint becomes the active primary endpoint for the account and clients can resume writing.
One item of note is the risk of data loss due to asynchronous replication of data to the
secondary region due to the inherent delay between primary region write and secondary
region replication. This is valid for standard geo redundant storage replication and
during failover operations so it’s important to understand the implications of initiating
account failover.
Backup
Due to the nature of the service offering and the lack of traditional backup operations
for the Azure Queue Storage service, existing configuration templates can be stored for
redeployment; the storage account replication options referenced above should be
leveraged to provide data redundancy.
Availability Sets
Azure Queue Storage does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones
Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.
Regions
The following are some guidelines and best practices to consider when implementing
Azure Queue Storage:
• Always test the queue service for performance requirements and where
possible ensure traffic is distributed well across partitions to avoid sudden
spikes in traffic rates, improving overall availability.
6.5.5. Azure Managed Disks
Overview
Azure Managed Disks are utilized by Azure VMs to employ block-level storage volumes.
Equivalent to physical disks mounted to on-premises physical or virtual machines,
Azure Managed Disks are virtual disks mounted to Azure VMs, similar to VMware
VMDKs and Hyper-V VHD/VHDX files. Azure manages the disk once the size and type
are specified during provisioning (a provisioning sketch follows the list below).
• Availability and durability are significant design requirements, enabling
five-nines (99.999%) availability.
• Up to 50,000 VM disks of a given type are supported per region, per
subscription, enabling the creation of thousands of VMs in a subscription.
Virtual Machine Scale Sets (VMSS), paired with Managed Disks, allow further
scaling and HA.
• Managed disks are integrated with availability sets to ensure that the disks of
VMs in an availability set are sufficiently isolated from each other to avoid a
single point of failure. Disks are automatically placed in different storage scale
units (stamps). If a stamp fails due to hardware or software failure, only the
VM instances with disks on those stamps fail.
• Azure Backup can be utilized to protect against Azure Regional outages and
disasters by leveraging custom, or built in, policy configurations for time
based backup and retention of VM or Managed Disk restoration of disk sizes
up to 32 TiB.
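As referenced in the overview above, the following is a minimal provisioning sketch using
the azure-mgmt-compute SDK that creates an empty Premium SSD managed disk pinned
to an Availability Zone; the subscription ID, resource group, region, and disk name are
placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create an empty 256 GiB Premium SSD in Availability Zone 1.
disk = compute.disks.begin_create_or_update(
    "rg-oneplatform",
    "disk-data-01",
    {
        "location": "westeurope",
        "zones": ["1"],  # pin the disk to a specific Availability Zone
        "sku": {"name": "Premium_LRS"},
        "disk_size_gb": 256,
        "creation_data": {"create_option": "Empty"},
    },
).result()
print(disk.provisioning_state)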
Deployment Options
Azure Disk Management employs multiple deployment options to meet organization and
application specific requirements.
The following are deployment configurations available and corresponding metrics for
Azure Managed Disks:
• Storage Optimized VMs offer high disk throughput and IO, ideal for Big Data,
SQL, NoSQL databases, data warehousing, and large transactional databases.
o LSv2 Series offers high throughput, low latency and direct mapped
NVMe storage running on the AMD EPYC platform with an all core
max boost of 3.0GHz and are offered in sizes from 8 to 80 vCPU,
multithreading configuration with 8GiB RAM per vCPU and one
1.92TB NVMe SSD M.2 device per 8 vCPUs to a max of 19.2TB.
Note: Managed Disks do not have a standalone SLA; instead, the SLA is aggregated
from the SLAs of the underlying storage and attached VMs.
High Availability
Azure Managed Disks are always replicated three times in the primary region based on
the following:
Managed Disks should be paired with VMs configured with Availability Sets to provide
better reliability by ensuring sufficient isolation between VMs to avoid single points of
failure. This is achieved by automatically placing disks in separate storage fault
domains, which are aligned with the VM fault domain.
Disaster Recovery
While unplanned service outages occur, Microsoft’s goal is to ensure the Azure Service
Fabric is always available. If an application or service requires resiliency, Microsoft
recommends utilizing, at a minimum, GRS to replicate data to a second region. Paired
with a DR plan to handle regional service outages, failover to a secondary endpoint, in
the event the primary becomes unavailable, is also an important facet of application and
service continuity. Currently managed disks are not supported with the GZRS and RA-
GZRS replication options.
Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
generally available. With account failover, you can initiate the failover process if the
primary endpoint becomes unavailable. Once a failover is completed, the secondary
endpoint becomes the active primary endpoint for the account, and clients can resume writing.
One item of note is the risk of data loss due to asynchronous replication of data to the
secondary region due to the inherent delay between the primary region write and
secondary region replication. This is valid for standard geo redundant storage
replication and during failover operations, so it’s important to understand the
implications of initiating account failover.
Due to the nature of the Managed Disk service offering, it is intimately tied to VM DR
features and strategies. In addition to IaaS resiliency, Azure Backup enables
redundancy and recovery from major disaster incidents. For disk DR, Azure Backup
should be configured to store in a different geographic region from the primary site. This
methodology ensures backed up data is not affected by the same events affecting the
disks and VMs to which they are attached.
DR considerations should include the following aspects:
Backup
Azure Managed Disks have multiple avenues of backup/restore and data preservation.
The following are methods for disk backup:
• A disk snapshot is a full read-only copy of a VMs VHD that can be used on OS or
Data disks for backup or troubleshooting of VM issues. If the restore of a VM with
a snapshot is a part of the backup/restore strategy, the VM should be cleanly
shut down to clear out running processes before executing.
• Incremental snapshots are point-in-time backups that consist only of changes
made since the previous snapshot; the full VHD is provided when the snapshot is
downloaded or used. One difference between regular and incremental snapshots
is that incremental snapshots always use standard HDD storage, regardless of
the managed disk storage type, while standard snapshots can use premium
SSDs. The following restrictions should be noted (a scripted example follows this list):
o A max of seven incremental snapshots per disk can be created every five
minutes.
• A disk export directly from the VM configuration generates a SAS URI of the disk
object to download and store in a storage account in a different region for restore
operations. In the event of catastrophic failure of a VM the VHD can be utilized to
deploy a new system from the exported disk.
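As referenced in the incremental snapshot item above, the following is a minimal sketch
using the azure-mgmt-compute SDK; the subscription ID, resource group, and disk names
are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

source_disk = compute.disks.get("rg-oneplatform", "disk-data-01")

# Incremental snapshots store only the blocks changed since the previous snapshot.
snapshot = compute.snapshots.begin_create_or_update(
    "rg-oneplatform",
    "disk-data-01-snap-001",
    {
        "location": source_disk.location,
        "incremental": True,
        "creation_data": {
            "create_option": "Copy",
            "source_resource_id": source_disk.id,
        },
    },
).result()
print(snapshot.provisioning_state)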
Availability Sets
Azure Managed Disks are intimately tied to Azure VM Availability Sets to provide VM
availability and redundancy by update domain and fault domain assignment, which is
handled by the Azure Service Fabric. Utilization of Managed Disks is highly
recommended when configuring Availability Sets to ensure the disks are sufficiently
isolated to prevent single points of failure. Please refer to the Azure Virtual Machine
service section for further information.
Availability Zones
Azure Managed Disks are intimately tied to Azure VM Availability Zones and expand the
level of control to maintain the availability of applications and data. Each zone
comprises one or more datacenters within an Azure region, with independent critical
resources (power, cooling, and networking). To ensure resiliency, a minimum of three
separate zones are configured within enabled regions. The physical separation of the
zones within a region protects applications and data from datacenter failures by utilizing
zone redundant services to replicate across Availability Zones. Please refer to the Azure
Virtual Machine service section for further information.
Regions
The following are some guidelines and best practices to consider when leveraging
Azure Managed Disks:
• Utilize a combination of all BCDR service offerings to tailor a strategy that meets
all business requirements across all IaaS tiers.
6.6. Azure ExpressRoute
6.6.1. Overview
• Access between on-premises networks and Cloud Services (Azure Service Fabric
and M365) is provided via Layer 3 connectivity from a connectivity provider, with
flexible configuration options.
• Dynamic scaling and multiple bandwidth options allow circuit bandwidth to be
increased from 50 Mbps up to 10 Gbps without connection re-provisioning.
The following items are key considerations for ensuring HA principles on ExpressRoute
connections:
• When opting for a zone-redundant Azure IaaS deployment, configure zone-redundant
VNet Gateways that terminate ExpressRoute private peering; combined with fault
domain and update domain placement, this enables Availability Zone aware
ExpressRoute VNet Gateways.
6.6.4. Disaster Recovery
While ExpressRoute is designed for HA using the above principles, there are additional
methods to enable connectivity when outages cannot be addressed using a single
ExpressRoute circuit. The following items should be considered when designing the
overall BCDR strategy to build out robust backend network connectivity, addressing DR
by utilizing geo redundant ExpressRoute circuits.
6.6.5. Backup
Due to the nature of the service offering and the lack of traditional backup operations
for the Azure ExpressRoute service, existing configuration templates can be stored for
redeployment.
Azure ExpressRoute does not offer the Availability Set feature as an available
function of the service offering.
6.6.8. Regions
The following are some guidelines and best practices to consider when leveraging
Azure ExpressRoute:
• Utilize the unique BGP Community values assigned to each Azure region to
optimize routing over ExpressRoute circuits. You can use BGP Local Preference
to influence routing by assigning a higher local preference value to the regional
network prefix, ensuring that when multiple paths to Microsoft are available,
traffic prefers the route to the Azure region closest to the users' current location.
The two ExpressRoute circuits are configured in an Active – Active setup, whereby each
circuit carries the traffic for its particular hosting location (Azure region). BGP AS path
prepending and BGP weight are configured to allow for failover in the scenario where
one ExpressRoute circuit completely fails.
The ExpressRoute circuits are configured for Azure Private Peering for Virtual Networks
(optional Microsoft Peering and/or Public Peering is not used).
6.7. Azure Key Vault
6.7.1. Overview
Azure Key Vault allows the centralization of application secrets, stores the
corresponding keys and secrets, enables access and use monitoring and natively
integrates with Azure Service Fabric by providing the following solutions:
• Access tokens, passwords, certificates, API keys, and other secrets are closely
controlled and stored securely in Key Vault.
• Creating and controlling encryption keys utilized for data encryption can leverage
the Key Vault as a Key Management System.
• Software or FIPS 140-2 Level 2 Hardware Security Modules (HSMs) can protect
secrets and keys.
Applications and users can leverage Key Vault to store and use several types of
secret/key data:
• Algorithms and multiple key types are supported, and high value objects can be
protected by HSMs.
• PKI/X509 certificates.
• Key Vault keys to encrypt the data at rest for PaaS services like Azure SQL,
Storage, Service Bus etc.
Azure Key Vault deployments are deeply integrated into the Azure Service Fabric. Due
to this operational integration, it natively supports and integrates with Azure Storage
Accounts, Event Hubs, and Log Analytics.
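As a concrete illustration of storing and retrieving a secret, the following is a minimal
sketch using the azure-keyvault-secrets SDK; the vault URL and secret name are
placeholders, and the calling identity is assumed to have been granted access to the vault.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Placeholder vault URL; the calling identity needs an access policy on the vault.
client = SecretClient("https://kv-oneplatform.vault.azure.net", DefaultAzureCredential())

client.set_secret("sql-connection-string", "Server=...;Password=...")
secret = client.get_secret("sql-connection-string")
print(secret.name, secret.properties.version)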
In the event service components within the Key Vault fail, alternate regional
components are initiated to service requests to ensure minimal to no degradation of
functionality. These activities are autonomous and do not require manual intervention.
In the rare event of an Azure region failure, Key Vault requests are automatically
routed (failed over) to an available secondary region; when primary region access is
restored, the requests are routed back (failed back). Again, this process is autonomous,
so no manual intervention is required.
Key Vault does not require downtime for maintenance via the HA design, but the
following caveats should be noted:
• During a regional failover process, Key Vault requests could fail for a few minutes
during failover execution due to service endpoint updates.
• A Key Vault is available in read-only mode post failover and supports the following
request types:
o List/Get secrets
o List keys
o Encrypt/Decrypt
o Wrap/Unwrap
o Verify
o Sign
o Backup
• A Key Vault becomes available in read and write mode post failback.
The Azure Key Vault offering has DR natively built in via the Azure Service Fabric.
Additionally, Key Vault includes a soft-delete feature, supporting the recovery of deleted
vaults and objects which encompass the following:
After enabling soft-delete, Key Vault objects marked as deleted are retained for a
customizable timeframe of up to 90 days (the default configuration), further providing a
simple object recovery mechanism. One item of note is that an object name cannot be
reused until the soft-delete retention period has expired.
Purge Protection is an optional Key Vault behavior, disabled by default, and can only be
enabled once soft-delete is functional. When Purge Protection is enabled, a vault or vault
object in the deleted state cannot be purged until the retention period has expired,
ensuring deleted vaults and vault objects remain recoverable by enforcing the retention
period configuration.
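The soft-delete behavior above can be exercised with the azure-keyvault-secrets SDK;
this is a hedged sketch with placeholder names, and it assumes soft-delete is enabled on
the vault.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient("https://kv-oneplatform.vault.azure.net", DefaultAzureCredential())

# Delete a secret, then recover it while it is still inside the soft-delete
# retention window (up to 90 days by default).
client.begin_delete_secret("sql-connection-string").wait()
client.begin_recover_deleted_secret("sql-connection-string").wait()

# Back up the secret to an encrypted blob for archival as part of the BCDR strategy.
backup_blob = client.backup_secret("sql-connection-string")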
The following items should be considered when leveraging soft-delete and purge
protection:
• When a vault is deleted, the Key Vault service creates a proxy resource under the
subscription with the corresponding metadata for recovery; this stored object is
available in the same location as the deleted vault.
• The Key Vault service places a deleted vault object in a deleted state, rendering
it inaccessible to retrieval operations; it can only be listed, recovered, or
permanently/forcefully deleted. In parallel, the underlying data corresponding to
the deleted vault or vault object is scheduled for deletion in accordance with the
retention policy.
6.7.5. Backup
Due to the nature of the service offering and the lack of traditional backup operations
for the Azure Key Vault service, existing configuration templates can be stored for
redeployment. Additionally, individual vault objects (keys, secrets, certificates) can be
backed up using the native Key Vault backup features referenced in the guidance below.
Azure Key Vault does not offer the Availability Set feature as an available function
of the service offering.
Due to the standard replication operations within the Azure Service Fabric, the
HA configuration options referenced above should be leveraged to further
expand data replication and availability.
6.7.8. Regions
The following are some guidelines and best practices to consider when leveraging
Azure Key Vault:
• Due to the nature of a Key Vault, access to vaults and vault objects should be
strictly controlled, and objects should be closely monitored and managed.
The following items are highly recommended to secure access to a vault:
o Generate and utilize Access Policies for each vault and vault object.
• Ensure soft-delete and Purge Protection are enabled to strictly control delete
operations and support quick restores.
• Delineate Key Vaults per application and per environment (Dev/Test, Production)
to avoid sharing secrets across enclaves and reduce breach vector footprints.
• Integrate the vault object (Keys, Secrets, Certificates) backup features into the
BCDR strategy for archival and restore operations.
• The staging and production Key Vaults should be deployed with the Premium tier,
which allows the creation of RSA-HSM keys.
6.8. Azure Network Security Group
6.8.1. Overview
Azure Network Security Groups contain security rules that allow or deny inbound and
outbound network traffic to and from Azure resources in an Azure VNet. Source and
destination ports and protocols can be specified, per rule, for multiple Azure resource
types utilizing default and custom rules to create and customize secured access.
Network security group security rules are evaluated by priority using the 5-tuple
information (source, source port, destination, destination port, and protocol) to allow or
deny the traffic. A flow record is created for existing connections. Communication is
allowed or denied based on the connection state of the flow record.
A network security group contains zero, or as many rules as desired, within Azure
subscription limits. Each rule specifies the following properties:
Name: A unique name within the network security group.
Priority: A number between 100 and 4096. Rules are processed in priority order, with
lower numbers processed before higher numbers, because lower numbers have higher
priority. Once traffic matches a rule, processing stops. As a result, any rules that exist
with lower priorities (higher numbers) that have the same attributes as rules with
higher priorities are not processed.
Source or destination: Any, or an individual IP address, classless inter-domain routing
(CIDR) block (10.0.0.0/24, for example), service tag, or application security group. If
you specify an address for an Azure resource, specify the private IP address assigned
to the resource. Network security groups are processed after Azure translates a public
IP address to a private IP address for inbound traffic, and before Azure translates a
private IP address to a public IP address for outbound traffic. Specifying a range, a
service tag, or an application security group enables you to create fewer security rules.
The ability to specify multiple individual IP addresses and ranges (you cannot specify
multiple service tags or application groups) in a rule is referred to as augmented
security rules. Augmented security rules can only be created in network security
groups created through the Resource Manager deployment model. You cannot specify
multiple IP addresses and IP address ranges in network security groups created
through the classic deployment model.
Protocol: TCP, UDP, ICMP, or Any.
Direction: Whether the rule applies to inbound or outbound traffic.
Port range: You can specify an individual port or a range of ports. For example, you
could specify 80 or 10000-10005. Specifying ranges enables you to create fewer
security rules. Augmented security rules can only be created in network security groups
created through the Resource Manager deployment model. You cannot specify multiple
ports or port ranges in the same security rule in network security groups created
through the classic deployment model.
Action: Allow or deny.
All rules are evaluated based on their priority using the following five types of
information: source, source port, destination, destination port, and protocol. A scripted
example of these rule properties follows below.
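The rule properties above map directly onto the NSG resource model; the following is a
minimal sketch using the azure-mgmt-network SDK that creates an NSG with a single
inbound HTTPS rule. All names, prefixes, and the subscription ID are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

nsg = network.network_security_groups.begin_create_or_update(
    "rg-oneplatform",
    "nsg-web",
    {
        "location": "westeurope",
        "security_rules": [
            {
                "name": "allow-https-inbound",
                "priority": 100,  # lower number = higher priority
                "direction": "Inbound",
                "access": "Allow",
                "protocol": "Tcp",
                "source_address_prefix": "VirtualNetwork",  # service tag
                "source_port_range": "*",
                "destination_address_prefix": "10.0.1.0/24",
                "destination_port_range": "443",
            }
        ],
    },
).result()
print(len(nsg.security_rules), "rule(s) deployed")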
The Azure Network Security Group service is intimately tied to Azure VMs and Azure
Virtual Networks so the HA principles of those service offerings should be reviewed for
further information.
6.8.5. Disaster Recovery
The Azure Network Security Group service is intimately tied to Azure VMs and Azure
Virtual Networks so the DR principles of those service offerings must be reviewed for
further information.
Note: Ensure Site Recovery configurations are in place, or that explicit NSGs are
created for the secondary regions.
6.8.6. Backup
Due to the nature of the service offering and the lack of traditional backup operations
for the Azure Network Security Group, existing configuration templates can be stored
for redeployment. Additionally, the Network Security Group configuration is captured
in Azure VM backups and replicated when utilizing Azure Site Recovery replication
policies, so please refer to both of those sections for further information.
Azure Network Security Group does not offer the Availability Set feature as an
available function of the service offering.
The Azure Network Security group service is intimately tied to Azure VMs and Azure
Virtual Networks so Availability Zone configuration options for those service offerings
should be reviewed for further information.
6.8.9. Regions
The following are some guidelines and best practices to consider when leveraging
Azure Network Security Groups:
• Enable Diagnostic logs and Network Security Group flow logs to monitor
connection attempts and to also provide additional insight for both
troubleshooting and service performance.
6.9. Azure Network Watcher
6.9.1. Overview
Azure Network Watcher provides tools to monitor, diagnose, view metrics, and manage
logs for infrastructure objects in an Azure Virtual Network, and can help repair the
network health of Azure infrastructure products including Virtual Machines, Virtual
Networks, Application Gateways, Load Balancers, etc.
Note: It is not intended for and will not work for PaaS monitoring or Web analytics.
Monitoring
Diagnostics
• Determine relative latencies between Azure regions and internet service providers
Metrics, Logs
• Analyze traffic to or from a network security group – currently in use in One Platform
Enabling network diagnostic and visualization tools, Network Watcher assists with
diagnosing, gaining insight into, and understanding an Azure network. Network Watcher is enabled
with all subscriptions that have Virtual Networks. The following items are options for
how to deploy and integrate into an environment:
• Network Watcher can be utilized to assist with diagnosis of VM networking issues
by installation of the VM agent, which is a requirement for Network Watcher
functionality such as on-demand traffic capture and diagnosis of VM routing
issues. The Next Hop feature retrieves the next hop type and IP address to help
determine whether traffic is directed to the intended destination or is being dropped
(a scripted example follows this list).
• Network Security Group flow logging allows the IP traffic information flows to be
logged in an Azure Storage account as well as SIEM or IDS platforms. This
feature captures and transmits source of truth for network activity that
ingress/egresses the VM’s NIC or VNet’s subnet, depending on association.
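The Next Hop check referenced above can be scripted with the azure-mgmt-network SDK;
the following is a minimal sketch that assumes the default per-region Network Watcher
instance (NetworkWatcherRG / NetworkWatcher_westeurope) and uses placeholder resource
IDs and IP addresses.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The default Network Watcher instance for a region normally lives in the
# NetworkWatcherRG resource group; the VM ID and IP addresses are placeholders.
result = network.network_watchers.begin_get_next_hop(
    "NetworkWatcherRG",
    "NetworkWatcher_westeurope",
    {
        "target_resource_id": (
            "/subscriptions/<subscription-id>/resourceGroups/rg-oneplatform"
            "/providers/Microsoft.Compute/virtualMachines/vm-web-01"
        ),
        "source_ip_address": "10.0.1.4",
        "destination_ip_address": "10.0.2.4",
    },
).result()
print(result.next_hop_type, result.next_hop_ip_address)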
The Azure Network Watcher offering has DR natively built in via the Azure Service
Fabric.
6.9.5. Backup
Due to the nature of the service offering and the lack of traditional backup operations
for Azure Network Watcher, there are no methods to export configuration templates, as
the feature set is largely ephemeral in its scope of execution.
Azure Network Watcher does not offer the Availability Set feature as an available
function of the service offering.
Due to the standard replication operations within the Azure Service Fabric,
Availability Zone configurations are not available as a function of the service
offering.
6.9.8. Regions
The following are some guidelines and best practices to consider when leveraging
Azure Network Watcher:
• Utilize the Network Watcher Topology builder feature to both analyze and
troubleshoot network related configurations and to also develop a visual
architecture for both documentation and potential rebuild of high level VNet
hierarchy.
6.10. Azure Virtual Machines
6.10.1. Overview
Azure Virtual Machines (VM) is one of several types of on-demand, scalable computing
resources that Azure offers. Typically, you choose a VM when you need more control
over the computing environment than the other choices offer. This article gives you
information about what you should consider before you create a VM, how you create it,
and how you manage it.
An Azure VM gives you the flexibility of virtualization without having to buy and maintain
the physical hardware that runs it. However, you still need to maintain the VM by
performing tasks, such as configuring, patching, and installing the software that runs on
it.
The number of VMs that your application uses can scale up and out to whatever is
required to meet your needs.
Azure Virtual Machine deployment options cover numerous use cases to meet an
organization’s application and workload requirements for both Windows and Linux VMs.
Please refer to the Appendix for a link to the current SKU availability.
Gen2 VMs are now supported, leveraging UEFI boot architectures vs traditional BIOS
architecture utilized by Gen1 VMs. Gen2 VMs have improved boot and installation times
as well.
Additionally, Azure Virtual Machines can leverage the Virtual Machine Scale Set
(VMSS) feature to create and manage a group of identical, load balanced VMs. The
VMSS enables autoscaling to increase and decrease VM instances in response to
demand or schedule-based configurations, providing high availability to applications
and facilitating central management, configuration, and updating of these scaled instances. The
following table identifies scenario-based benefits of deploying VMSS vs individual VMs:
Scenario: Add additional VM instances
  Manual group of VMs: Manual process to create, configure, and ensure compliance.
  VMSS: Automatically spawn from central configuration.
Scenario: Traffic balancing and distribution
  Manual group of VMs: Manual process to create and configure Azure Load Balancer or Application Gateway.
  VMSS: Automatically create and integrate with Azure Load Balancer or Application Gateway.
Scenario: High availability and redundancy
  Manual group of VMs: Manually create an Availability Set, or distribute and track VMs across Availability Zones.
  VMSS: Automatic distribution of VM instances across Availability Zones or Availability Sets.
Scenario: Scaling of VMs
  Manual group of VMs: Manual monitoring and Azure Automation.
  VMSS: Autoscale based on host or in-guest metrics, Application Insights, or schedule.
Note: No additional cost is incurred with scale sets; underlying compute resources such
as VM instances, load balancer or Managed Disk storage are standard billing metrics.
• A fault domain is a logical grouping of hardware that shares a common power source and network switch.
Azure Virtual Machines can also leverage multiple avenues for DR operations and the
overall strategy to keep applications and workloads online when planned and unplanned
outages occur. We must prepare for entire region outages due to widespread service
interruption or a major natural disaster, even though they are rare occurrences, by
designing DR into applications and services, as well as the underlying/supporting
features.
The following features of the Azure Recovery Services vault are key pieces of the DR
strategy:
• Azure Backup keeps copies of your data safe for future recovery.
Azure Site Recovery provides the following features to supplement the organization’s
BCDR strategy:
Simple BCDR solution: Azure Site Recovery enables replication, failover, and failback
from a single management point.
Azure VM replication: Azure VM DR can be configured from a primary to a secondary region.
On-prem VM replication: Both physical and virtual machines hosted on-prem can be
replicated to an Azure region.
Workload replication: Supported Azure VMs can have workloads replicated.
Data resilience: Orchestrating replication without intercepting application data, Azure
Site Recovery stores data in Azure Storage, which provides resilience; when failover
occurs, Azure VMs are provisioned based on the replicated data.
RTO and RPO targets: Maintain organizational recovery time objective (RTO) and
recovery point objective (RPO) limits through continuous replication via Site Recovery,
and further reduce RTO by integrating with Azure Traffic Manager.
Keep applications consistent during failover: Recovery points with application-consistent
snapshots can be leveraged for replication, capturing disk data, data in memory, and
in-process transactions.
Testing without disruption: Quick and nonintrusive DR drills can be executed without
affecting production replication.
Flexible failovers: Execute planned failovers for expected outages with zero data loss,
or unplanned failovers with minimal data loss (depending on replication frequency) for
unexpected outages.
Customized recovery plans: Customize and sequence failover and recovery plans of
multi-tier applications spanning multiple VM instances by grouping machines in a plan
and optionally adding scripts and manual actions, which can be integrated with Azure
Automation runbooks.
BCDR integration: Integrating with other technologies, Site Recovery can supplement
SQL AlwaysOn services by protecting SQL Server backends supporting corporate workloads.
Azure Automation integration: The Azure Automation library provides application-specific
and production-ready scripts that can be downloaded and integrated into Site Recovery.
Network integration: Application network management is integrated directly with Site
Recovery to reserve IPs, configure load balancers, and leverage Traffic Manager for
efficient transitions.
An example of supported workloads that can be replicated with Site Recovery are:
• AD DS and DNS
• SharePoint Farms
The Azure Backups and Azure Site Recovery service sections should also be
referenced for additional information and constraints regarding those service offerings.
6.10.5. Backup
Azure Backup is another key feature of the overarching Virtual Machine BCDR strategy,
creating recovery points stored in geo-redundant recovery vaults that can restore entire
VMs or specific files. When a backup job is triggered by the Azure Backup service, the
VM extension takes a point-in-time snapshot while the VM is running; if the VM is not
running at the time of backup, a snapshot of the underlying storage is taken, since no
application writes are occurring. The Azure Backup service coordinates with the Volume
Shadow Copy Service (VSS) to take consistent snapshots of the VM's disks, then
transfers the data to the vault; to maximize efficiency, the service identifies and
transfers only the blocks of data changed since the previous backup. After the transfer
is complete, the recovery point is created and the snapshot is removed.
The following are supported backup scenarios when utilizing the Azure Backup service:
Scenario: Direct backup of Azure VMs
  Backup: Back up the entire VM.
  Agent: No additional agent is needed on the Azure VM; Azure Backup installs and
  utilizes the Backup extension via the existing VM agent.
  Restore: Restore as follows:
    Create a basic VM – useful if the VM has no special configurations, such as multiple IPs.
    Restore VM disk – restore the disk, then attach it to an existing VM or create a new
    VM from the disk by PowerShell.
An Availability Set is a logical grouping capability for isolating VM resources from each
other when deployed; Availability Sets ensure the VMs span multiple physical servers,
compute racks, storage units, and network switches to reduce the impact to both
individual VMs and the overall solution in the event of hardware or software failure, and
are essential to building reliable cloud-based solutions.
For example, for an Azure VM-based solution containing web front-end and back-end
VMs, you should define two Availability Sets before deploying the VMs: one for the web
tier and one for the back-end tier. When creating a new VM, the Availability Set should
be specified as a parameter to ensure the Azure Service Fabric isolates the VMs across
the spanning parameters to prevent interruption of services in the event of a subsystem
error. Additionally, Azure Advisor can provide recommendations on improving VM
availability by analyzing configuration and usage telemetry.
Lastly, when implementing multi-tier applications, ensure each tier is deployed using
separate Availability Sets or Availability Zones to guarantee at least one machine per
tier is available during planned or unplanned outages. Leverage Azure Load Balancers
along with Availability Sets or Zones to distribute traffic between the VMs and provide
additional application resiliency.
6.10.8. Regions
The following are some guidelines and best practices to consider when leveraging
Azure Virtual Machines:
• Deploy all VM sizes in a single template; to avoid landing on hardware that does
not support all the VM SKUs and sizes you require, include all of the application
tiers in a single template so that they will deploy at the same time.
• Separate and configure each application tier into separate Availability Zones
and/or Availability Sets along with load balancer services to provide the highest
resiliency and availability.
• When reusing an existing placement group from which VMs were deleted, wait
for the deletion to fully complete before adding VMs to it.
• If latency is a priority, put VMs in a proximity placement group and the entire
solution in an availability zone. However, if resiliency is a higher priority, spread
your instances across multiple availability zones (a single proximity placement
group cannot span zones).
• Use Availability Zones and Fault Domains together for even greater fault isolation
by specifying the zone and fault domain count if utilizing Dedicated Hosts.
6.11. Azure Virtual Network
VNet Concepts:
• Address space: When creating a VNet, you must specify a custom private IP
address space using public and private (RFC 1918) addresses. Azure assigns
resources in a virtual network a private IP address from the address space that you
assign.
• Subnets: Subnets enable you to segment the virtual network into one or more sub-
networks and allocate a portion of the virtual network's address space to each
subnet. You can then deploy Azure resources in a specific subnet. Just like in a
traditional network, subnets allow you to segment your VNet address space into
segments that are appropriate for the organization's internal network. This also
improves address allocation efficiency. You can secure resources within subnets
using Network Security Groups. For more information, see Security groups. A
provisioning sketch follows this list.
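The address space and subnet concepts above can be expressed as a single deployment
call; the following is a minimal sketch using the azure-mgmt-network SDK with placeholder
names, prefixes, and subscription ID, including a service endpoint on one subnet.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

vnet = network.virtual_networks.begin_create_or_update(
    "rg-oneplatform",
    "vnet-oneplatform-weu",
    {
        "location": "westeurope",
        "address_space": {"address_prefixes": ["10.1.0.0/16"]},  # RFC 1918 space
        "subnets": [
            {"name": "snet-app", "address_prefix": "10.1.1.0/24"},
            {
                "name": "snet-data",
                "address_prefix": "10.1.2.0/24",
                # Keep storage traffic on the Azure backbone via a service endpoint.
                "service_endpoints": [{"service": "Microsoft.Storage"}],
            },
        ],
    },
).result()
print([s.name for s in vnet.subnets])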
With planning, deployment of Azure Virtual Networks and connectivity can be effectively
and efficiently managed to support production requirements. The following deployment
options and recommendations should be considered with Virtual Network deployment
and management:
• Virtual Network names must be unique within a resource group but can be
duplicated within a subscription or Azure region, so ensure a standardized
naming convention is developed and adhered to; this assists with overall
management and aligns with the overall governance framework.
• A resource connected to a Virtual Network must exist in the same region and
subscription as the Virtual Network; however, Virtual Networks can be
connected, or peered, so the following considerations should be reviewed to
effectively architect your Virtual Network infrastructure:
o Not all Azure regions support Availability Zones so ensure current, and
future, resources are deployed in regions and resiliency designs address
availability zone capabilities.
• When creating and deploying Virtual Networks the following design questions
should be reviewed:
o Are there any organizational security requirements for traffic isolation into
separate Virtual Networks? When connecting Virtual Networks, Network
Virtual Appliances can control the flow of traffic between the networks.
o How many network interfaces and private IPs are required in a Virtual
Network? Review the limitations of each object per-Virtual Network to
understand and design effective segmentation.
As a logical representation of your network in the cloud, a Virtual Network lets you
define the IP address space and segment the network into subnets while acting as a
trust boundary for compute resources such as Azure VMs and Azure App Services. The
Virtual Network allows direct, private IP communication between attached resources,
but a Virtual Network is created within the scope of a region; while you can create
Virtual Networks with the same address space in two separate regions, you cannot
connect them due to the requirement for non-overlapping address spaces.
In the event of Azure regional outages you should utilize Azure Site Recovery services
to both replicate primary region resource configurations (VMs, Disks, VNets, NSGs) to a
secondary region and also to execute failover and recovery plans to ensure continuity of
operations during the primary region outage. This failover is supported by the
underlying replication of resources so the testing functionality should be employed to
validate replicated object fidelity as a part of your BCDR strategy.
6.11.5. Backups
Due to the nature of the service offering and the lack of traditional backup operations
for the Azure Virtual Network, existing configuration templates can be stored for
redeployment.
Azure Virtual Network does not offer the Availability Set feature as an available
function of the service offering.
Due to the standard replication operations within the Azure Service Fabric,
Availability Zone configurations are not available as a function of the service
offering.
6.11.8. Regions
The following are some guidelines and best practices to consider when leveraging
Azure Virtual Networks:
• Develop, implement, and test replication utilizing Azure Site Recovery to validate
your BCDR strategy.
• Restrict broad range assignment for access rules when attaching Network
Security Groups to subnet objects and Network Interfaces.
• Avoid small Virtual Networks to ensure simplicity and flexibility for growth.
• Utilize Azure Virtual Network service endpoints to extend the identity of the
Virtual Network and its private address space to Azure services. Service
endpoints secure critical Azure service resources by keeping traffic on the Azure
backbone network.
6.12. Non-Azure Services
6.12.1. Palo Alto – KNET-Facing
Overview
The KNET-facing Palo Altos have been deployed in all OPCS Regional and Satellite
instances. The Regional and Satellite instances are based on the concept of Hub and
Spoke topologies and built in specific Azure regions, whereby the Hub VNET is part of
the Core Services subscription and the Spoke VNETs are part of the Customer
subscriptions. Placement of the KNET-facing Palo Altos is in the Core Services
subscription, the Hub of the topology.
Regional and Satellite instances often consist of two hosting locations, referred to as the
Primary and a Secondary (or DR) hosting location. These hosting locations leverage the
Azure paired regions concept. One Hub and Spoke topology exists on the primary
location/region, while the other Hub and Spoke topology exists on the secondary
location/region. Connectivity between both locations/regions is facilitated through the
Express Route layer.
The typical deployment pattern for KNET-facing Palo Altos is referred to as “Active-
Active One-armed model” and placed behind an Azure ILB (Internal Load Balancer) for
resiliency and horizontal scaling. This setup allows for running multiple active firewalls in
a single pool of VMs, without having the need to configure SNAT (Source Network
Address Translation).
• Traffic between the Hub and the Spoke VNets (within a single hosting
location/region).
Configuration of UDRs (User Defined Routes) is the mechanism used to place the Palo
Alto firewalls in the data path. This configuration is required in both directions of the
traffic flows.
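A minimal sketch of such a UDR, using the azure-mgmt-network SDK, is shown below;
the subscription ID, resource group, route table name, and the ILB front-end IP of the
Palo Alto pool are placeholders, and the route table must still be associated with the
relevant subnets.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Send all outbound traffic to the firewall ILB front-end (placeholder IP);
# associate this route table with the Spoke subnets to complete the data path.
route_table = network.route_tables.begin_create_or_update(
    "rg-core-services",
    "rt-spoke-to-firewall",
    {
        "location": "westeurope",
        "routes": [
            {
                "name": "default-via-paloalto",
                "address_prefix": "0.0.0.0/0",
                "next_hop_type": "VirtualAppliance",
                "next_hop_ip_address": "10.1.0.4",
            }
        ],
    },
).result()
print(route_table.provisioning_state)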
The diagram below shows a logical representation of the deployment in a single hosting
location/region:
• BYOL (Bring Your Own license) instead of PAYG (Pay As You Go).
• PANOS version 9.1 (or later) allowing for Accelerated Networking (= higher
throughput).
• One-Armed setup (single data Interface and Zone) for resiliency and horizontal
scaling.
• Standard ILB to load balance HA-ports (all TCP and UDP ports).
6.12.2. Palo Alto – Internet-Facing
Overview
The Internet-facing Palo Altos have been deployed in all OPCS Regional and Satellite
instances. The Regional and Satellite instances are based on the concept of Hub and
Spoke topologies and built in specific Azure regions, whereby the Hub VNET is part of
the Core Services subscription, and the Spoke VNETs are part of the Customer
subscriptions. Placement of the Internet-facing Palo Altos is in Core Services
subscription, the Hub of the topology.
Regional and Satellite instances often consist of two hosting locations, referred to as the
Primary and a Secondary (or DR) hosting location. These hosting locations leverage the
Azure paired regions concept. One Hub and Spoke topology exists on the primary
location/region, while the other Hub and Spoke topology exists on the secondary
location/region. Connectivity between both locations/regions is facilitated through the
Express Route layer.
The typical deployment pattern for Internet-facing Palo Altos is referred to as “Two-
armed model”, and placed behind an Azure ELB (External Load Balancer) for resiliency
and horizontal scaling. This setup allows for running multiple active firewalls in a single
pool of VMs. This setup requires the configuration of SNAT (Source Network Address
Translation) for Inbound traffic, in order to have the return traffic for individual sessions
running through the same firewall (symmetric traffic patterns).
DNS (Domain Name Services) in combination with Public IP addresses on the ELB
(External Load Balancer) is the mechanism to have the Palo Alto firewalls placed in the
data path for Inbound traffic. Configuration of UDRs (User Defined Routes) is the
mechanism to have the Palo Alto firewalls placed in the data path for Outbound initiated
sessions.
The diagram below shows a logical representation of the deployment in a single hosting
location/region for Inbound traffic:
Deployment Guidelines and Recommendations
• BYOL (Bring Your Own license) instead of PAYG (Pay As You Go)
• PANOS version 9.1 (or later) allowing for Accelerated Networking (= higher
throughput)
• Two-Armed setup (two data Interfaces and Zones) for resiliency and horizontal
scaling
• Standard ILB (Internal Load Balancer) to load balance Outbound initiated traffic
6.12.3. Panorama
Overview
All OPCS Palo Altos are managed through GSOC Panorama. GSOC Panorama is
located in On-premises DXC data centers in Amsterdam and Atlanta.
[Diagram: GSOC Panorama management path – Panorama servers go2azrapp074 (primary)
and go2azrapp072 (secondary), reachable as imsspanorama01.kworld.kpmg.com and
imsspanorama02.kworld.kpmg.com, connect to the PAN firewalls (subnets 10.82.100.0/25
and 10.178.26.160/27) over KNet and ExpressRoute.]
6.12.4. Layer 7 - CA API Gateways
Overview
The CA API Gateway (Layer7) is an enterprise security management solution that
provides centralized management and access control over web applications and web
services, as well as related resources. The gateway is designed to protect web traffic
and mediate communications between Service Oriented Architecture (SOA) clients and
endpoints. KPMG is using CA API Gateways for web application proxy and single sign
on support to secure application access from the internet and intranet. In KPMG, Layer7
gateway provides services including but not limited to: Global Single Sign On, Web
Application Firewall, Multi-Factor Authentication, secure API and API management,
authentication and authorization, secure token service, secure session management
and other related services.
API Gateway engineers have implemented many applications on the gateway, and most
of them are very similar except for minor differences in authentication and authorization
controls, application security controls, etc. This consumes a lot of time and effort and
creates a large backlog for application delivery. The team has designed a policy
framework for the KPMG Enterprise Application Platform that supports application
registration and enforces security on the gateways. This framework has some caveats
to support some additional features on the gateways.
Keeping in view all of these challenges and the re-use of the existing framework, the
Dynamic Application Security Framework is designed to support many different use
cases and applications, and can expand when needed for new applications. It allows
applications to define pre-configured policy sets, assign them to services, and control
behavior on the fly. This policy framework is useful for both application proxy and
federation services on the API Gateways.
[Diagram: API Gateway authentication flow – external users access applications from the
Internet (RATW); frontend authentication options include Kerberos, SAML, username,
header, RSA, RADIUS, and PIN+Token against Active Directory; backend authentication
is performed via LDAP bind to Active Directory.]
Deployment Options
• URL / Base64 encode any special characters that are considered unsafe.
• Consider hosting static content in CDN and share across multiple applications to
reduce initial load time of the pages.
• Do not block any web request for more than 10 seconds; start responding to the
request within 10 seconds.
• Consider using Asynchronous API for long running transactions rather than
blocking the threads.
Factors to consider when integrating with APIs hosted within KPMG:
o Define proper error codes and descriptions for different failure conditions
for better user experience and automation.
Factors to consider when integrating with APIs hosted outside KPMG:
• Access APIs through API Gateways for better security and central API
management.
• Use Kerberos OR basic over TLS for access control to simplify API access from
internal network.
• Let API gateways handle authentication with 3rd party APIs server and mediate
authentication.
Appendix
References
Overview of the resiliency pillar:
https://docs.microsoft.com/en-
us/azure/architecture/framework/resiliency/overview
Backup and disaster recovery for Azure applications:
https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/backup-
and-recovery
Using business metrics to design resilient Azure applications:
https://docs.microsoft.com/en-
us/azure/architecture/framework/Resiliency/business-metrics#workload-
availability-targets
Resiliency checklist for specific Azure services:
https://docs.microsoft.com/en-us/azure/architecture/checklist/resiliency-per-
service
SLA for Virtual Machines:
https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/v1_9/
Manage the availability of Windows virtual machines in Azure:
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/manage-
availability
What are Availability Zones in Azure?
https://docs.microsoft.com/en-us/azure/availability-zones/az-overview
Set up disaster recovery to a secondary Azure region for an Azure VM:
https://docs.microsoft.com/en-us/azure/site-recovery/azure-to-azure-quickstart
Business continuity and disaster recovery (BCDR): Azure Paired Regions:
https://docs.microsoft.com/en-us/azure/best-practices-availability-paired-regions
Azure Global Infrastructure:
https://azure.microsoft.com/en-us/global-infrastructure/regions/
Azure Application Gateway:
https://docs.microsoft.com/en-us/azure/application-gateway/overview
Azure Monitor:
https://docs.microsoft.com/en-us/azure/azure-monitor/
Azure Automation:
https://docs.microsoft.com/en-us/azure/automation/automation-intro
Azure Backup:
https://docs.microsoft.com/en-us/azure/backup/backup-overview
Azure Site Recovery:
https://docs.microsoft.com/en-us/azure/site-recovery/site-recovery-overview
Core Azure Storage services:
https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction
Azure Blob Storage:
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction
Azure Table Storage:
https://docs.microsoft.com/en-us/azure/storage/tables/table-storage-overview
Azure Queue Storage:
https://docs.microsoft.com/en-us/azure/storage/queues/storage-queues-introduction
Azure Managed Disk Storage:
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/managed-disks-overview
Azure File Storage:
https://docs.microsoft.com/en-us/azure/storage/files/storage-files-introduction
Azure ExpressRoute:
https://docs.microsoft.com/en-us/azure/expressroute/expressroute-introduction
Azure Key Vault:
https://docs.microsoft.com/en-us/azure/key-vault/general/overview
Azure Load Balancer:
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview
Azure Network Security Group:
https://docs.microsoft.com/en-us/azure/virtual-network/security-overview
Azure Network Watcher:
https://docs.microsoft.com/en-us/azure/network-watcher/network-watcher-monitoring-overview
Azure Traffic Manager:
https://docs.microsoft.com/en-us/azure/traffic-manager/traffic-manager-overview
Azure Virtual Machines – VM and available SKUs:
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/overview
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
Azure Virtual Network - VNet:
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview
Azure Web Apps Services:
https://docs.microsoft.com/en-us/azure/app-service/
Azure SLA:
https://azure.microsoft.com/en-us/support/legal/sla/summary/
Azure Product by Region:
https://azure.microsoft.com/en-us/global-infrastructure/services/
© 2020 KPMG International Cooperative (“KPMG
International”), a Swiss entity. Member firms of the
KPMG network of independent firms are affiliated
with KPMG International. KPMG International
provides no client services. No member firm has any
authority to obligate or bind KPMG International or
any other member firm vis-à-vis third parties, nor
does KPMG International have any such authority to
obligate or bind any member firm. All rights
reserved.