
One Platform

OMEGA: Azure Cloud Architecture


Resiliency Guidelines & Recommendations
Version 1.4

June 2020
Document revision history

Version Name Date Revision
1.0 Nick Matahen 4-20-2020 Initial version
1.11 Nick Matahen 5-8-2020 Added content
1.2 Nick Matahen 5-26-2020 Incorporated reviewers' feedback
1.21 Nick Matahen 5-30-2020 Incorporated reviewers' feedback
1.3 Nick Matahen 6-04-2020 Incorporated reviewers' feedback
1.31 Nick Matahen 6-14-2020 Incorporated reviewers' feedback
1.4 Nick Matahen 6-19-2020 Incorporated reviewers' feedback

This document has been reviewed/approved by


Reviewer Date reviewed Date approved
Cees Schmeitink 5/22/2020 5/24/2020
Nirav Kumar Shah 5/14/2020 5/18/2020
Subrata Ghosh 5/26/2020 6/5/2020
Jacques Bron 6/5/2020 6/19/2020
Defeng Hou 5/20/2020 6/16/2020
Damian Zwamborn 5/20/2020
Giles French 5/20/2020 6/5/2020
Anoop Manganahalli 5/22/2020
Reena Parekh See Reed Feedback
Erik de Rooij 5/19/2020 6/4/2020
Shawn Sutherland 5/22/2020 6/4/2020, 6/17/2020
Jennifer Galligan 5/22/2020
Debbie Mulcahy 6/19/2020
Ilya Gerber 5/28/2020 6/5/2020
Somasundaram, Sathish 5/9/2020 5/22/2020
Ralston, Matthew 5/19/2020
Taylor, Reede 5/20/2020
Sathi, Manikanta 5/22/2020
Pelgrim, Nikki 5/20/2020 6/5/2020
Peter Brown
Table of Contents
1. Introduction – One Platform ....................................................................................................... 7
2. Overview – Azure Cloud Resiliency .......................................................................................... 8
2.1 Intended Audience................................................................................................................. 8
2.2 Purpose .................................................................................................................................. 8
2.3 Objectives .............................................................................................................................. 8
3. Azure Cloud Resiliency .............................................................................................................. 9
3.1 Resiliency Definition............................................................................................................. 9
3.2 Resiliency Features ............................................................................................................... 9
3.3 Resiliency Blast Radius....................................................................................................... 10
3.4 Azure Resiliency Services ................................................................................................... 10
3.5 Resiliency Responsibility .................................................................................................... 13
3.6 Availability Requirements................................................................................................... 14
3.7 Service Level Agreement (SLA) ......................................................................................... 14
3.8 Resiliency Metrics ............................................................................................................... 15
4. Omega Resiliency Scope .......................................................................................................... 16
4.1. Azure Cloud ....................................................................................................................... 16
4.2. Azure Global Infrastructure ............................................................................................... 16
5. One Platform Infrastructure – Cloud Services .......................................................................... 17
5.1. Azure Services.................................................................................................................... 17
5.2. Non-Azure Services ........................................................................................................... 17
6. Deployment Guidelines & Recommendations.......................................................................... 18
6.1. Azure Load Balancing Services ......................................................................................... 18
6.1.1. Azure Traffic Manager ................................................................................................ 21
6.1.2. Azure Application Gateway ........................................................................................ 25
6.1.3. Azure Load Balancer ................................................................................................... 30
6.2. Azure Application Insights................................................................................................. 40
6.2.1. Overview ..................................................................................................................... 40
6.2.2. Deployment Options .................................................................................................... 41
6.2.3. High Availability ......................................................................................................... 42
6.2.4. Disaster Recovery ........................................................................................................ 42
6.2.5. Backup ......................................................................................................................... 42
6.2.6. Availability Sets .......................................................................................................... 43
6.2.7. Availability Zones ....................................................................................................... 43
6.2.8. Regions ........................................................................................................................ 43
6.2.9. Deployment Guidelines and Recommendations.......................................................... 43
6.3. Azure Automation (Account) ............................................................................................. 45
6.3.1. Overview ..................................................................................................................... 45
6.3.2. Deployment Options .................................................................................................... 46
6.3.3. High Availability ......................................................................................................... 46
6.3.4. Disaster Recovery ........................................................................................................ 47
6.3.5. Backup ......................................................................................................................... 47
6.3.6. Availability Sets .......................................................................................................... 47
6.3.7. Availability Zones ....................................................................................................... 47
6.3.8. Regions ........................................................................................................................ 47
6.3.9. Deployment Guidelines and Recommendations.......................................................... 48
6.4. Azure Recovery and Backup Services ............................................................................... 49
6.4.1. Azure Backup Service ................................................................................................. 49
6.4.2. Azure Site Recovery .................................................................................................... 54
6.5. Azure Storage Services ...................................................................................................... 58
6.5.1. Azure Blob Storage (Block/Append/Page).................................................................. 58
6.5.2. Azure File Storage ....................................................................................................... 65
6.5.3. Azure Table Storage .................................................................................................... 70
6.5.4. Azure Queue Storage ................................................................................................... 75
6.5.5. Azure Managed Disks ................................................................................................. 80
6.6. Azure ExpressRoute ........................................................................................................... 87
6.6.1. Overview ..................................................................................................................... 87
6.6.2. Deployment Options .................................................................................................... 88
6.6.3. High Availability ......................................................................................................... 89
6.6.4. Disaster Recovery ........................................................................................................ 90
6.6.5. Backup ......................................................................................................................... 91
6.6.6. Availability Sets .......................................................................................................... 91
6.6.7. Availability Zones ....................................................................................................... 91
6.6.8. Regions ........................................................................................................................ 91
6.6.9. Deployment Guidelines and Recommendations.......................................................... 92
6.7. Azure Key Vault................................................................................................................. 94
6.7.1. Overview ..................................................................................................................... 94
6.7.2. Deployment Options .................................................................................................... 94
6.7.3. High Availability ......................................................................................................... 95
6.7.4. Disaster Recovery ........................................................................................................ 96
6.7.5. Backup ......................................................................................................................... 97
6.7.6. Availability Sets .......................................................................................................... 97
6.7.7. Availability Zones ....................................................................................................... 97
6.7.8. Regions ........................................................................................................................ 98
6.7.9. Deployment Guidelines and Recommendations.......................................................... 98
6.8. Azure Network Security Group.......................................................................................... 99
6.8.1. Overview ..................................................................................................................... 99
6.8.2. One Platform NSG Standards .................................................................................... 100
6.8.3. Deployment Options .................................................................................................. 101
6.8.4. High Availability ....................................................................................................... 101
6.8.5. Disaster Recovery ...................................................................................................... 102
6.8.6. Backup ....................................................................................................................... 102
6.8.7. Availability Sets ........................................................................................................ 102
6.8.8. Availability Zones ..................................................................................................... 102
6.8.9. Regions ...................................................................................................................... 102
6.8.10. Deployment Guidelines and Recommendations...................................................... 103
6.9. Azure Network Watcher................................................................................................... 104
6.9.1. Overview ................................................................................................................... 104
6.9.2. Deployment Options .................................................................................................. 104
6.9.3. High Availability ....................................................................................................... 105
6.9.4. Disaster Recovery ...................................................................................................... 105
6.9.5. Backup ....................................................................................................................... 106
6.9.6. Availability Sets ........................................................................................................ 106
6.9.7. Availability Zones ..................................................................................................... 106
6.9.8. Regions ...................................................................................................................... 106
6.9.9. Deployment Guidelines and Recommendations........................................................ 106
6.10. Azure Virtual Machines ................................................................................................. 107
6.10.1. Overview ................................................................................................................. 107
6.10.2. Deployment Options ................................................................................................ 107
6.10.3. High Availability ..................................................................................................... 108
6.10.4. Disaster Recovery .................................................................................................... 109
6.10.5. Backup ..................................................................................................................... 111
6.10.6. Availability Sets ...................................................................................................... 113
6.10.7. Availability Zones ................................................................................................... 114
6.10.8. Regions .................................................................................................................... 114
6.10.9. Deployment Guidelines and Recommendations...................................................... 114
6.11. Azure Virtual Network – VNet ...................................................................................... 116
6.11.1. Overview ................................................................................................................. 116
6.11.2. Deployment Options ................................................................................................ 116
6.11.3. High Availability ..................................................................................................... 118
6.11.4. Disaster Recovery .................................................................................................... 118
6.11.5. Backups ................................................................................................................... 118
6.11.6. Availability Sets ...................................................................................................... 119
6.11.7. Availability Zones ................................................................................................... 119
6.11.8. Regions .................................................................................................................... 119
6.11.9. Deployment Guidelines and Recommendations...................................................... 119
6.12. Non-Azure Services ....................................................................................................... 121
6.12.1. Palo Alto – KNET-Facing ....................................................................................... 121
6.12.2. Palo Alto – Internet-Facing ..................................................................................... 124
6.12.3. Palo Alto – Network (Panorama) Firewall Management ........................................ 127
6.12.4. Layer 7 - CA API Gateways .................................................................................... 129
Appendix ..................................................................................................................................... 133
References ................................................................................................................................... 133
1. Introduction – One Platform
KPMG is assessing the current One Platform cloud environment to ensure that One
Platform has been designed to maximize the advantages of cloud for KPMG’s internal
and external applications. Furthermore, KPMG is identifying areas of improvement and
reviewing existing cloud legacy services that could be replaced with cloud native
options. Additionally, KPMG will review the existing automation framework, tools, and
activities to ensure they are prioritized correctly for customer satisfaction, operational
efficiency, and multicloud architecture while recommending changes and updates where
needed based on business requirements. Project Omega will drive this assessment as
part of its scope; however, the overall goal and intention is to support One Platform.

These reviews will identify and prioritize changes that will support and improve the
global deployment of infrastructure and Member Firms.

KPMG initiated a project for the above assessment and reviews, called the One
Platform Architecture and Automation Assessment and code-named Omega.
2. Overview – Azure Cloud Resiliency
The primary goal of Project Omega is to ensure that One Platform is well-architected to
maximize the advantages of cloud for KPMG’s internal and external users and services.
The Omega framework consists of five workstreams, and each workstream has multiple
tracks. One of the workstreams is Cloud Architecture and it includes a track called
Cloud Resiliency.

Azure Cloud Resiliency helps Omega achieve one of its objectives by providing
guidelines and recommendations for deploying and enabling highly available and
resilient cloud services.

2.1 Intended Audience


The intended audience for this document is anyone involved in the One Platform cloud
environment from a technical perspective. This includes (but is not limited to) architects,
security, engineering, operations, and support personnel. This document is Azure-
oriented and geared toward One Platform customers.

2.2 Purpose
The purpose of this document is to provide guidelines and recommendations for
deploying resilient IaaS-based infrastructure services. These services are restricted to
One Platform Cloud services (Microsoft Azure Cloud). This document is not a design or
architecture document, but is meant to aid in the decision-making process of deploying
resilient Azure cloud infrastructure services. With the ever-changing nature of cloud
computing, this is a point-in-time document at the time of writing (June 2020) and may
require updates in the future to stay current with cloud changes.

2.3 Objectives
There are two objectives for One Platform Cloud Resiliency: the first is a business
objective to ensure automatic re-adaptation to crisis situations/disruptions of cloud
services by gracefully recovering from failures, allowing for continued functionality with
minimal downtime or data loss. The second is a technical objective to ensure creating
high-performing cloud services by building resilient cloud architecture in order to
seamlessly support computing services and end users.
3. Azure Cloud Resiliency
3.1 Resiliency Definition
Resiliency is the ability of a system to recover from failures and continue to function. It is
not only about avoiding failures, but also responding to failures in a way that minimizes
downtime or data loss.

3.2 Resiliency Features


Azure resiliency comprises multiple features: a set of native business continuity,
high availability, disaster recovery, and backup capabilities for cloud environments.

• Blast Radius
• The radius of protection for applications and data. For example, Availability Sets
protect applications within a datacenter, and Availability Zones protect
applications and data in an Azure region.
• High Availability
• High Availability is the ability to maintain an acceptable level of uptime despite
temporary failures in services or hardware, or fluctuations in load. High Availability
also refers to a system designed to prevent loss of service by managing failures
and minimizing planned downtime.
• Disaster Recovery
• Disaster Recovery is the ability to protect against the loss of an entire region.
Disaster Recovery also means architecting cloud infrastructure to recover from an
unforeseen event that could put the organization at risk, and to ensure business
continuity.
• Backup
• Backup is the ability to replicate VMs and data to one or more regions. Azure
Backup also means backing up (protecting) and restoring your data after an
unforeseen event.
• Data Residency
• Data Residency is the ability to have two regions that share the same regulatory
requirements for data replication, storage, and residency for the country or region
in which they operate.
• Data Sovereignty
• Data Sovereignty is the ability to keep data within the physical borders of a
particular country or geo-political area.
3.3 Resiliency Blast Radius
Resiliency represents a set of platform-native technical resilience features
facilitating/supporting business continuity, providing high availability, disaster recovery,
and backup to protect mission-critical applications and data. The image below shows
the resiliency features and capabilities. For more information, see Azure Resiliency.

3.4 Azure Resiliency Services


Azure resiliency levels start at the single virtual machine (VM) level and end with paired
regions. Protecting against failures requires the cloud provider to have a comprehensive
set of built-in resilience services that customers can easily enable and control based on
individual business needs. Azure offers solutions, including the following:
• Single VM – Standalone: Azure provides an SLA for a single virtual machine. A
single-instance virtual machine must use Premium SSD or Ultra Disk for all
operating system and data disks for Microsoft to guarantee virtual machine
connectivity of at least 99.9%. A single VM does not provide high availability or
fault tolerance; it only provides a 99.9% uptime SLA.
• Availability sets: To protect against localized hardware failures, such as a disk or
network switch failure, deploy two or more VMs in an availability set. An availability
set distributes VMs across two or more fault domains, each of which is a group of
hardware that shares a common power source and network switch. This ensures
that the virtual machines (VMs) deployed on Azure are distributed across multiple
isolated hardware nodes in a cluster. For more information, see Availability sets.
• Scale sets: Azure virtual machine scale sets let you create and manage a group of
identical, load balanced VMs. The number of VM instances can automatically
increase or decrease in response to demand/load or a defined schedule. Azure
virtual machine scale sets are used to provide redundancy and improved
performance.
• Availability zones: An Availability Zone is a physically separate location within an
Azure region. Availability Zones provide a combination of low latency and high
availability through the strategic physical location separation within an Azure region.
Each Availability Zone has independent physical infrastructure with a distinct power
source, network, and cooling system. It also protects customers’ applications and
data from datacenter failures by enabling distribution of VMs belonging to one tier
across multiple physical locations within a region. For more information, see
Availability Zones.
• Azure Site Recovery: Azure Site Recovery helps replicate Azure VMs to another
Azure region for business continuity and disaster recovery. It is recommended to
conduct periodic disaster recovery (DR) drills to ensure compliance requirements
are met. For more information, see Azure Site Recovery.
• Paired regions: To protect an application in case of a regional outage, deploy the
application across multiple regions using Azure Traffic Manager to distribute internet
traffic to different regions. In this scenario, each Azure region is paired with another
region. For more information, see Azure Paired Regions.
• Azure Load Balancer: Distributes inbound traffic, according to rules and health
probes.
• Azure Traffic Manager: Enables distribution of traffic optimally to services across
global Azure regions, while providing high availability and responsiveness.
• Azure Backup serves as a general-purpose backup solution for cloud and on-
premises workflows that run on VMs or physical servers.
• Geo-Replication for Azure SQL Database allows an application to perform quick
disaster recovery of individual databases in case of a regional disaster or large-scale
outage.
• Locally redundant storage (LRS) provides object durability by replicating customer
data three times within a single storage scale unit.
• Zone redundant storage (ZRS) replicates customer data synchronously across
three storage clusters in a single region.
• Geo-redundant storage (GRS) provides durability of objects over a given year by
replicating customer data to a secondary region that is hundreds of miles away from
the primary region.
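
For the storage redundancy options above, the replication model is selected through
the storage account SKU. The following is a minimal sketch, assuming the
azure-identity and azure-mgmt-storage Python packages and placeholder subscription,
resource group, and account names:

```python
# Sketch: pick a storage redundancy level (LRS/ZRS/GRS) via the SKU
# when creating a storage account. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    "my-resource-group",
    "mystorageaccount",  # must be globally unique, 3-24 lowercase characters
    {
        "location": "westeurope",
        "kind": "StorageV2",
        # Standard_LRS = local, Standard_ZRS = zonal, Standard_GRS = geo-redundant
        "sku": {"name": "Standard_GRS"},
    },
)
print(poller.result().sku.name)
```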

Failures can vary in the scope of their impact. There are many different failure types,
including hardware, software, regional, or transient failures, as well as dependency
service, heavy load, accidental data deletion/corruption, and application deployment
failures. The first step toward achieving resilience is to avoid failures in the first place.
This approach to increasing reliability involves improving the platform’s capability to
minimize impacts during planned maintenance events and giving customers control
over the experience during these events. For more information, see Resiliency Design.
3.5 Resiliency Responsibility
Resiliency is a shared responsibility between the various customer teams and the cloud
provider. Additionally, resiliency depends on the cloud service model being used: IaaS,
PaaS, or SaaS. Being resilient in the event of any failure is a shared responsibility, as
shown in the table below. For more information, see Shared Responsibility.
3.6 Availability Requirements
It is also important to define what it means for the application to be “resilient.” Resiliency
design requires an understanding of the applicable availability requirements and asking
the following, based on the business and technical requirements:
• How much unplanned downtime and/or data loss is acceptable?
• How much will potential unplanned downtime and/or data loss cost the business?
• What means (other than increasing technical resilience) are available to the business
for dealing with unplanned downtime and/or data loss? (e.g. adjusted business
processes, etc.)
• How much money and time can the business realistically invest in making the
application more resilient?

3.7 Service Level Agreement (SLA)


In Azure, the Service Level Agreement (SLA) describes Microsoft’s commitments
regarding uptime and connectivity. If the SLA for a service is 99.9%, customers can
expect the service to be available 99.9% of the time. Azure customers should define
their own target SLAs for each workload in their solutions. An SLA makes it possible to
evaluate whether the architecture meets the business requirements. For example, if a
workload requires 99.99% uptime, but depends on a service with a 99.9% SLA, that
service cannot be a single point of failure in the system. If Microsoft does not achieve
and maintain the Service Levels for each Cloud Service as described in the SLA, then
customers may be eligible for a credit towards a portion of their monthly service fees.
No SLA is provided for the Free tier of Azure services. For more information, see SLA
Levels.

The following table shows the potential cumulative downtime for various SLA levels:

SLA Downtime per week Downtime per month Downtime per year
99% 1.68 hours 7.2 hours 3.65 days
99.9% 10.1 minutes 43.2 minutes 8.76 hours
99.95% 5 minutes 21.6 minutes 4.38 hours
99.99% 1.01 minutes 4.32 minutes 52.56 minutes
99.999% 6 seconds 25.9 seconds 5.26 minutes
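
The downtime figures above follow directly from the SLA percentage, and SLAs for
serially dependent services multiply, as noted in Section 3.7. A short, dependency-free
Python sketch of both calculations:

```python
# Sketch: derive allowed downtime from an SLA percentage, and compose
# SLAs for a workload that depends on two services in series.
def downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Maximum downtime in minutes permitted by an SLA over a period."""
    return period_hours * 60 * (1 - sla_percent / 100)

print(downtime_minutes(99.9, 30 * 24))   # per 30-day month: 43.2 minutes
print(downtime_minutes(99.99, 7 * 24))   # per week: ~1.01 minutes

# A 99.99% workload cannot be built on a single 99.9% dependency:
composite = (99.99 / 100) * (99.9 / 100)
print(f"{composite:.4%}")                # 99.8900% -- below the 99.99% target
```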
3.8 Resiliency Metrics
The following metrics are used to measure resilience:

• Recovery time objective (RTO) is the maximum acceptable time that an application
can be unavailable after an incident.
• Recovery point objective (RPO) is the maximum duration of data loss that is
acceptable during a disaster.
• Service Level Agreement (SLA) describes Microsoft’s commitments regarding
uptime and connectivity. If the SLA for a particular service is 99.9%, that means
customers can expect the service to (on average) be available 99.9% of the time. If
Microsoft does not achieve and maintain the Service Levels for each Service as
described in their SLA, then you may be eligible for a credit towards a portion of your
monthly service fees.
4. Omega Resiliency Scope
4.1. Azure Cloud
Azure has about 55 global regions, available in 140 countries. Each region has one or more
Availability Zones offering the scale needed to bring infrastructure and applications
closer to users’ geography, preserving data residency and offering comprehensive
compliance and resiliency options for customers. This document is concerned with One
Platform infrastructure resiliency.

4.2. Azure Global Infrastructure


Cloud Services resiliency varies by region, so it is critical to know in advance the region
in which resiliency is available. Azure regions, geographies, and Availability Zones form
the foundation of our global infrastructure—providing customers high availability,
disaster recovery, and backup:

• Geographies
• A geography is a discrete physical location containing two or more regions
that are fault-tolerant to withstand complete region failure through their
connection to our dedicated high-capacity networking infrastructure.
• Regions
• A region is a set of datacenters deployed within a latency-defined perimeter
and connected through a dedicated regional low-latency network. It also
preserves data residency and compliance boundaries.
• Availability Zones
• Availability Zones are physically separate locations within an Azure region.
They are designed to run mission-critical applications with high availability and
low-latency replication.
5. One Platform Infrastructure – Cloud Services
The resiliency scope is limited to One Platform infrastructure-related services.
Infrastructure includes the components below. This list is not exhaustive and is
applicable as of the time of writing (June 2020). The following sections provide
detailed information, guidance, and recommendations for deploying resilient
cloud services (KPMG One Platform-approved services). These services are:

5.1. Azure Services


• Azure Load Balancing Services
o Azure Traffic Manager
o Azure Application Gateway
o Azure Load Balancer
• Azure Application Insights
• Azure Automation (Account)
• Azure Recovery and Backup Services
o Azure Backup Service
o Azure Site Recovery
• Azure Blob Storage (Block/Page/Append)
• Azure File Storage
• Azure Table Storage
• Azure Queue Storage
• Azure Managed Disks
• Azure ExpressRoute
• Azure Key Vault
• Azure Network Security Group
• Azure Network Watcher
• Azure Virtual Machines - VM
• Azure Virtual Network - VNet

5.2. Non-Azure Services


• Palo Alto – KNET Facing
• Palo Alto – Internet Facing
• Palo Alto – Network (Panorama) Firewall Management
• Layer 7 – CA API Gateway
6. Deployment Guidelines & Recommendations
This section provides details regarding the deployment of Azure Cloud Services that
are already implemented in One Platform.

6.1. Azure Load Balancing Services


Here are the main load-balancing services currently available in Azure:

— Front Door is an application delivery network that provides global load balancing and
site acceleration for web applications. It offers Layer 7 capabilities for your
application, such as TLS offload, path-based routing, fast failover, and caching, to
improve the performance and high availability of your applications. Currently, this is
not in use, but it is planned to be implemented in One Platform in the future.

— Traffic Manager is a DNS-based traffic load balancer that enables you to distribute
traffic optimally to services across global Azure regions, while providing high
availability and responsiveness. Because Traffic Manager is a DNS-based load-
balancing service, it load balances only at the domain level. For that reason, it
cannot fail over as quickly as Front Door, because of common challenges around
DNS caching and systems not honoring DNS TTLs. This service is currently in use in
One Platform. For more information, see Section 6.1.1.

— Application Gateway provides application delivery controller (ADC) as a service,
offering various Layer 7 load-balancing capabilities. Use it to optimize web farm
productivity by offloading CPU-intensive TLS termination to the gateway. This
service is currently in use in One Platform. For more information, see Section 6.1.2.

— Azure Load Balancer is a high-performance, low-latency Layer 4 load-balancing
service (inbound and outbound) for all UDP and TCP protocols. It is built to handle
millions of requests per second while ensuring your solution is highly available.
Azure Load Balancer is zone-redundant, ensuring high availability across Availability
Zones. This service is currently in use in One Platform. For more information, see
Section 6.1.3.

Decision tree for load balancing in Azure

When selecting the load-balancing options, here are some factors to consider:

• Traffic type: Is it a web (HTTP/HTTPS) application? Is it public-facing or a
private application?

• Global vs. regional: Do you need to load balance VMs or containers within a
virtual network, or load balance scale unit/deployments across regions, or both?
• Availability: What is the service SLA?

• Cost: In addition to the cost of the service itself, consider the operations cost for
managing a solution built on that service. For more information, see Azure
pricing.

• Features and limits: What are the overall limitations of each service? For more
information, see Service limits.

The following flowchart will help you to choose a load-balancing solution for your
application. The flowchart guides you through a set of key decision criteria to reach a
recommendation.

Treat this flowchart as a starting point, as every application has unique requirements.
Then perform a more detailed evaluation.

If your application consists of multiple workloads, evaluate each workload separately. A
complete solution may incorporate two or more load-balancing solutions.
Definitions

• Internet facing: Applications that are publicly accessible from the internet. As a
best practice, application owners apply restrictive access policies or protect the
application by setting up offerings like web application firewall and DDoS
protection.

• Global: End users or clients are located beyond a small geographical area. For
example, users across multiple continents, across countries within a continent, or
even across multiple metropolitan areas within a larger country.

• PaaS: Platform as a service (PaaS) services provide a managed hosting
environment, where you can deploy your application without needing to manage
VMs or networking resources. In this case, PaaS refers to services that provide
integrated load balancing within a region. For more information, see Choosing a
compute service – Scalability.
• IaaS: Infrastructure as a service (IaaS) is a computing option where you
provision the VMs that you need, along with associated network and storage
components. IaaS applications require internal load balancing within a virtual
network, using Azure Load Balancer.

• Application-layer processing: Refers to special routing within a virtual network.
For example, path-based routing within the virtual network across VMs or virtual
machine scale sets.

6.1.1. Azure Traffic Manager

Overview

Azure Traffic Manager is a DNS-based load balancer that allows optimal distribution of
traffic across global Azure regions while inherently providing high availability and
responsiveness. Traffic Manager uses endpoint health and the configured traffic-routing
method to direct client requests to the optimal service endpoint, hosted inside or
outside of Azure, with endpoint monitoring and traffic shaping configured to meet
application requirements and high availability/failover models. Microsoft guarantees
that DNS queries will receive a valid response from at least one of its Azure Traffic
Manager name server clusters at least 99.99% of the time.

Traffic Manager provides two key benefits:

• Distribution of traffic according to one of several traffic-routing methods.

• Continuous monitoring of endpoint health and automatic failover when endpoints fail.

Most importantly, Traffic Manager operates at the DNS level to direct clients to a
specific service endpoint based on traffic routing rule methodology and does not
operate as a proxy or gateway; clients connect directly to the service endpoint and
traffic flows are not analyzed.

Deployment Options

Azure Traffic Manager can be deployed to address multiple DNS-based traffic
optimization scenarios (a profile creation sketch follows this list):

• Six traffic-routing methods are supported by Traffic Manager to determine
pathing for client requests:
o Priority: a prioritized list of service endpoints is maintained and leveraged
during the negotiation process, based on a combination of the endpoint
priority assignment and availability/health from constant monitoring and
updating of an endpoint profile.

o Weighted: pre-defined weights for service endpoints, together with
availability/health, are used in response to client DNS queries to direct
and shape traffic between managed endpoints. One item of note: DNS
query responses are cached by clients and by recursive DNS servers,
which can impact traffic distribution depending on the number of clients
and recursive DNS servers.

o Performance: source IP lookup and comparison with the internet latency
table are used to optimize traffic distribution by tracking round-trip times
between the client and service endpoints, directing traffic to the optimal
location. The latency table is regularly updated to account for changes in
the global internet; however, application performance varies with real-time
load across the internet, so performance degradation is possible.

o Geographic: Traffic Manager assigns service endpoints based on the
geographic origination of the client DNS query, empowering customers to
shape traffic based on both Azure regional proximity and possible
compliance/data sovereignty requirements for service endpoint access.

o Multivalue: Traffic Manager profiles that can only host IPv4/IPv6
addresses as endpoints should use the Multivalue option during
configuration, so a query response contains all healthy endpoints.

o Subnet: mapping client IP ranges to specific endpoints should use the
Subnet method, so a received request is served by the service endpoint
mapped to the source IP.

• More advanced traffic-routing scenarios can be implemented with Nested Traffic
Manager profiles, combining the routing methods listed above and overriding
default behavior to support larger and more complex application deployments.
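
To make the options above concrete, here is a minimal sketch of creating a
Priority-routed profile with two external endpoints via the azure-mgmt-trafficmanager
Python SDK. The names, targets, and monitor path are placeholders, and exact model
shapes can vary by SDK version:

```python
# Sketch: a Traffic Manager profile using the Priority routing method,
# failing over from a primary to a secondary external endpoint.
from azure.identity import DefaultAzureCredential
from azure.mgmt.trafficmanager import TrafficManagerManagementClient

client = TrafficManagerManagementClient(DefaultAzureCredential(), "<subscription-id>")

profile = client.profiles.create_or_update(
    "my-resource-group",
    "my-tm-profile",
    {
        "location": "global",                  # Traffic Manager is non-regional
        "traffic_routing_method": "Priority",  # or Weighted/Performance/Geographic/Multivalue/Subnet
        "dns_config": {"relative_name": "my-tm-profile", "ttl": 30},
        "monitor_config": {"protocol": "HTTPS", "port": 443, "path": "/health"},
        "endpoints": [
            {
                "name": "primary",
                "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
                "target": "primary.example.com",
                "priority": 1,                 # preferred endpoint
            },
            {
                "name": "secondary",
                "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
                "target": "secondary.example.com",
                "priority": 2,                 # used when the primary is unhealthy
            },
        ],
    },
)
print(profile.profile_status)
```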

High Availability

Traffic Manager is resilient to failure, including the failure of an entire Azure region.
However, beyond the platform's built-in resilience, the following HA architecture models
should be reviewed to build redundancy into an application or service endpoint:

• In the Active/Passive with cold standby solution, VMs and other appliances
running in the standby region remain inactive until a failover is initiated. This
methodology uses ARM templates, backups, or VM images to replicate
resources to a secondary region; it should be noted that this method extends the
failover/recovery completion time.

• In the Active/Passive with pilot light solution, the secondary/standby environment
is configured to support only a critical subset of applications, providing minimal
functionality; however, it can be scaled up and out to spawn additional services
to handle the bulk of a production load post-failover.

• In the Active/Passive with warm standby solution, the secondary region is pre-
warmed and running in parallel, ready to begin servicing the base load because
all instances remain actively online with autoscaling enabled. It should be noted
that this solution is not scaled to carry the full production load, but it can provide
functionality, albeit at reduced performance, because all services are already
online and active.

Disaster Recovery

Azure Traffic Manager should leverage the following technical aspects to develop the
DR strategy:

• Natively handled via Azure Site Recovery by a replication deployment
mechanism between the primary and standby environments.

• Diverting network/web traffic from primary to secondary/standby sites via a
custom solution comprising Azure Traffic Manager and Azure DNS load
balancing.

• In conjunction with the Active/Passive with cold standby and Active/Passive with
pilot light HA solutions, Azure DNS manual failover uses the standard DNS
mechanisms to fail over to a backup site (see the sketch after this list). The
following assumptions are made:

o Static IPs are deployed for both primary and secondary service
endpoints.

o The primary and secondary sites have corresponding Azure DNS zones.

o TTLs for the zones or records are configured at or below the Recovery
Time Objective (RTO) SLA thresholds.
• In conjunction with the Active/Passive with warm standby solution, Azure Traffic
Manager can be configured to validate the health of service endpoints and route
client requests to the healthy endpoint. In this scenario, the secondary region
becomes active only when the primary region’s service endpoint experiences a
disruption and the Traffic Manager’s health checks reflect the status change,
prompting all new network requests to be routed accordingly. The option to
configure additional regional failover endpoints can further expand DR
operations during a rare, multi-regional service interruption.
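
For the Azure DNS manual failover approach referenced above, the failover step
amounts to repointing a record at the secondary site's static IP. A minimal sketch,
assuming the azure-mgmt-dns package; the zone, record name, TTL, and address are
placeholders:

```python
# Sketch: manual DNS failover -- repoint app.example.com at the
# secondary site. Keep the TTL at or below the RTO so cached answers
# expire within the recovery window.
from azure.identity import DefaultAzureCredential
from azure.mgmt.dns import DnsManagementClient

client = DnsManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.record_sets.create_or_update(
    "my-resource-group",
    "example.com",     # Azure DNS zone
    "app",             # relative record name -> app.example.com
    "A",
    {
        "ttl": 30,
        "arecords": [{"ipv4_address": "203.0.113.20"}],  # secondary static IP
    },
)
```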

Backup

Due to the nature of the service offering and the lack of traditional backup operations
for Azure Traffic Manager, the following method can be leveraged to store existing
configuration templates for redeployment (a sketch of steps 1 and 2 follows the list):

1. Export templates of each Traffic Manager object.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.
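
A minimal sketch of steps 1 and 2, assuming the azure-mgmt-resource and
azure-storage-blob packages; the resource group, storage account, and container
names are placeholders:

```python
# Sketch: export the ARM template for the resource group holding the
# Traffic Manager objects, then store it in a ZRS/GRS blob container.
import json
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
resources = ResourceManagementClient(credential, "<subscription-id>")

# Step 1: export a template covering every resource in the group.
template = resources.resource_groups.begin_export_template(
    "my-tm-resource-group", {"resources": ["*"]}
).result().template

# Step 2: store the exported template in a redundant storage account.
blobs = BlobServiceClient(
    "https://mybackupaccount.blob.core.windows.net", credential=credential
)
blobs.get_blob_client("config-backups", "traffic-manager.json").upload_blob(
    json.dumps(template), overwrite=True
)
```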

Availability Sets

Azure Traffic Manager does not offer Availability Sets as a function of the service
offering.

Availability Zones

Due to the standard replication operations within the Azure Service Fabric,
Availability Zone configurations are not available as a function of the service
offering.

Regions

Please refer to the Appendix for regional availability.

Note: Due to the unique nature of the Traffic Manager offering, its presence is mostly
non-region-specific to maintain the highest levels of availability and durability.
Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Traffic Manager:

• To provide the highest availability to service endpoints, ensure the applications
and services are geographically distributed to utilize both geographic
prioritization and service endpoint redundancy.

• Implement Nested Traffic Manager profiles to both meet application
requirements and design larger, more complex application deployments.

• Utilize a combination of Azure Traffic Manager, Application Gateway, and Load
Balancer to build the optimal solution.

• For deploying a new version of software, the Blue-Green deployment is a
software rollout method that can reduce the impact of interruptions caused by
issues in the newly deployed version; a weighted-endpoint sketch follows this
list. For more information, see Blue-Green deployments.
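
For the Blue-Green rollout mentioned above, Traffic Manager's Weighted routing
method offers one way to shift traffic gradually between the old ("blue") and new
("green") deployments. A minimal sketch, assuming azure-mgmt-trafficmanager and
placeholder profile/endpoint names:

```python
# Sketch: blue-green cutover by updating Weighted endpoint weights
# (valid range 1-1000) on an existing Traffic Manager profile.
from azure.identity import DefaultAzureCredential
from azure.mgmt.trafficmanager import TrafficManagerManagementClient

client = TrafficManagerManagementClient(DefaultAzureCredential(), "<subscription-id>")

def set_weight(endpoint: str, weight: int) -> None:
    client.endpoints.create_or_update(
        "my-resource-group",
        "my-tm-profile",
        "ExternalEndpoints",   # endpoint type segment of the resource path
        endpoint,
        {
            "type": "Microsoft.Network/trafficManagerProfiles/externalEndpoints",
            "target": f"{endpoint}.example.com",
            "weight": weight,
        },
    )

# Canary 10% to green, then full cutover; validate health between steps.
for blue, green in [(1000, 1), (900, 100), (1, 1000)]:
    set_weight("blue", blue)
    set_weight("green", green)
```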

6.1.2. Azure Application Gateway

Overview
Azure Application Gateway is a highly scalable and robust solution for publishing web-
based applications. The Application Gateway functions as a reverse proxy, terminating
inbound sessions from end users. It performs SSL decryption and encryption for the
sessions between client and server and protects the communication with built-in Web
Application Firewall technology leveraging the industry-standard OWASP rule set.

Traditional load balancers (LB) operate at the transport layer (OSI layer 4 - TCP and
UDP) and route traffic based on source IP address and port, to a destination IP address
and port whereas the Application Gateway operates at layer 7 and is referred to as
application layer load balancing. The difference between a traditional LB and an
Application Gateway is the ability to make routing decisions based on header attributes
of the HTTP request such as host header details or the path of a URI. For example, if a
web request has /images in the URL path the traffic can be routed to a specific back
end server pool configured to host image files or if the URL contains /video the traffic
can be routed to a separate back end server pool optimized for video streaming.
Deployment Options

Application Gateway must be deployed using WAF_V2. Microsoft does offer
Standard_V2, but it is not used in KPMG Global.

As per Microsoft's recommendation, Application Gateway v2 will be used for all new
deployments.

The following are the standard deployment patterns for Application Gateways:

• Dedicated App Gateway: The Azure Application Gateway is owned by the
Project team. The One Platform team helps with the configuration and technical
assistance.

• Shared App Gateway: Shared Application Gateways are deployed in One
Platform-owned subscriptions (core services) and are managed by the One
Platform team.

• Shared end point: The One Platform team reuses the existing end
points/certificates to onboard the applications. Onboarding will be quicker, as it
reuses the existing public IP and shared Palo Alto configurations.
• Dedicated end point: The application is hosted in a One Platform-managed
Azure Application Gateway but creates an application-specific end point,
including a dedicated public IP address and related Palo Alto configuration for
Internet-facing applications.

High Availability
Application Gateway WAF_V2 supports autoscaling (a minimum of two instances and a
maximum defined by customer needs, scaling out or in based on changing traffic load
patterns) and zone redundancy, and is guaranteed to be available at least 99.95% of
the time.

Disaster Recovery

Your tolerance for reduced functionality during a disaster is a business decision that
varies from one application to the next. It might be acceptable for some applications to
be unavailable or to be partially available with reduced functionality or delayed
processing for a period of time. For other applications, any reduced functionality is
unacceptable.

A WAF_v2 Application Gateway can span multiple Availability Zones, offering better
fault resiliency and removing the need to provision separate Application Gateways in
each zone.

Customers looking for DR functionality leveraging OPCS Azure paired regions have to
configure the Application Gateway in the paired region (it is not an inbuilt capability). It
can be configured upfront as a passive setup or can be deployed as part of the DR
process.

Note: For a complete picture of the overall architecture, the disaster recovery strategy
must include all components of the business system/application, not individual services
in isolation.

Backup

Due to the nature of the service offering and lack of traditional backup operations for
Application Gateways, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Application Gateway.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in a dev environment to validate export fidelity.

(The export-and-store sketch in Section 6.1.1 applies here as well, substituting the
Application Gateway's resource group.)

Availability Sets

Application Gateways now offer the Autoscale feature, in addition to manual
configuration of the scale unit count, with the WAF_v2 SKU. As with the Virtual Machine
Scale Sets and App Service scale-out features, you can now set a minimum (two
instances) and maximum number of scale units (compute resources) so the Application
Gateway does not have to run at peak provisioned capacity for the anticipated
maximum workload, enabling better elasticity for application workload requirements.

Availability Zones

Application Gateway now offers Availability Zones with the WAF_v2 SKU.

To configure Availability Zones for Application Gateway instances, the following steps
should be executed (a sketch of the same settings via the SDK follows these steps):

1. Autoscaling:

a. Navigate to the Application Gateway, then Configuration.

b. Select Autoscale as the Capacity type, choose the minimum and
maximum scale units, then save the configuration.

2. Zone Redundancy Configuration Options:

a. Deploy new Application Gateways, configure zone availability, and
migrate configurations.

b. Redeploy the existing Application Gateway and add zone configurations.
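
A minimal sketch of the same settings applied through the azure-mgmt-network Python
SDK; gateway and group names are placeholders, and note that zones can only be set
when the gateway is (re)deployed:

```python
# Sketch: enable autoscaling and zone redundancy on a WAF_v2
# Application Gateway by updating and redeploying its configuration.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import (
    ApplicationGatewayAutoscaleConfiguration,
    ApplicationGatewaySku,
)

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

appgw = client.application_gateways.get("my-resource-group", "my-appgw")
appgw.sku = ApplicationGatewaySku(name="WAF_v2", tier="WAF_v2")
appgw.autoscale_configuration = ApplicationGatewayAutoscaleConfiguration(
    min_capacity=2, max_capacity=10   # scale units, not VM instances
)
appgw.zones = ["1", "2", "3"]         # zone-redundant across three zones

client.application_gateways.begin_create_or_update(
    "my-resource-group", "my-appgw", appgw
).result()
```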

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

Microsoft Application Gateway deployments should leverage the following guidelines:


• It is always recommended to deploy Application Gateway WAF in prevention
mode. The only scenario where Application Gateway WAF can be deployed in
detection mode is when the Application Gateway sits behind a Layer 7 gateway.

o Note: All Azure Application Gateways must be configured behind Palo
Alto Internet-facing appliances and use private IPs for communication.
External endpoints and DNS records are configured on the outermost
public load balancer fronting the Palo Alto appliances.

• Exception management accountability lies with IPG, but all involved teams
(Application/One Platform) are responsible for maintaining the security standards
throughout the application lifecycle.

• Application Gateway v2 should be preferred over v1, as it offers many benefits,
such as header response hardening, per-site WAF policies, autoscaling, zone
redundancy, and many new features.

• Custom rules/exclusions must be the preferred options over disabling WAF rules.

• When an exception must be implemented on a shared Application Gateway or a
dedicated Application Gateway with multiple applications, per-site WAF policies
must be created with the exception and assigned to the particular application
listener.

• If both HTTP and HTTPS are used on any sub-site, the default website should
listen on both 80 and 443. Unless an application-specific requirement exists, all
endpoints must be published via HTTPS/443 or redirect from HTTP/80 to
HTTPS/443; any exception is handled via an exception request with a business
justification.

• The default site should have the TLS certificate bound that will primarily be used
by the Application Gateway to authenticate against the server.

• If NTLM or Windows Authentication is used on any published site (except Forms-
Based Authentication), a WAF_v1 SKU is required, along with a site/page using
anonymous authentication for the Application Gateway custom probe. WAF_v2
does not yet support NTLM authentication.

• Leverage IP addresses for backend pool members rather than FQDNs.

• Disable SNI on default website.


• Depending on the required instance count, a minimum subnet size of /28 is
recommended; because Azure reserves five addresses per subnet, a /28 (16
addresses) provides 11 usable IPs. If the application requires more than 10
instances, a subnet size of /27 or /26 should be considered.

• Per KPMG policy, HTTPS implementations are required to utilize TLS
termination or end-to-end TLS encryption, as HTTP does not encrypt traffic
between the client and the Application Gateway.

• A subnet can host multiple Application Gateways; however, a mixed-SKU
(WAF_v1 and WAF_v2) deployment is not supported.

6.1.3. Azure Load Balancer

Overview

Operating at Layer 4 of the OSI model, the Azure Load Balancer distributes inbound
flows arriving at the load balancer's frontend to backend pools, as dictated by load
balancing rules and health probes. Azure VMs or instances in a Virtual Machine Scale
Set (VMSS) can comprise backend pools to further enhance HA for applications and
services.

Outbound connections for VMs inside the VNet are enabled by NATing their private IPs
to a public load balancer's public IP, which balances internet traffic to the backend
VMs, whereas internal load balancers use private frontend IPs to balance internal
VNet traffic only.

One item of note: a load balancer frontend can be accessed from an on-premises
network in a hybrid scenario.

Applications can be scaled to create highly available services supporting both ingress
and egress scenarios, with low latency and high throughput, scaling up to millions of
flows for all TCP and UDP based applications when utilizing a Standard Load Balancer.

A Standard Load Balancer can address the following key scenarios:

• Internal and external traffic flows to Azure VMs.

• Distribute resources across, and within, zones to increase application and service
availability.

• Enable outbound connectivity for Azure VMs.

• Monitor load balanced resources via custom or default health probes.

• Access VMs in a VNet via port forwarding on the load balancer’s public IP.

• IPv6 load balancing support.

• Azure Monitor can ingest multi-dimensional metrics that can be filtered, grouped
and further extrapolated for a given dimension from a Standard Load Balancer to
provide current and historic performance and health insights of a service.

• Services hosted on multiple ports and/or multiple IPs can be load balanced.

• Simultaneous TCP and UDP flow load balancing on all ports utilizing HA ports.
Deployment Options
Azure Load Balancers contain the following items and options when deploying into an
environment, but first there are two SKUs to consider, as illustrated in the table below:

Standard Load Balancer vs. Basic Load Balancer

• Backend pool size: Standard supports up to 1,000 instances; Basic supports up to
300 instances.
• Backend pool endpoints: Standard supports any virtual machines or virtual
machine scale sets in a single virtual network; Basic supports virtual machines in
a single availability set or virtual machine scale set.
• Health probes: Standard supports TCP, HTTP, and HTTPS; Basic supports TCP
and HTTP.
• Health probe down behavior: With Standard, TCP connections stay alive on an
instance probe down and on all probes down. With Basic, TCP connections stay
alive on an instance probe down, but all TCP connections terminate when all
probes are down.
• Availability Zones: Standard offers zone-redundant and zonal frontends for
inbound and outbound traffic; not available with Basic.
• Diagnostics: Standard provides Azure Monitor multi-dimensional metrics; Basic
provides Azure Monitor logs.
• HA Ports: Available for internal Standard Load Balancers; not available with
Basic.
• Secure by default: Standard is closed to inbound flows unless allowed by a
network security group (note that internal traffic from the VNet to the internal load
balancer is allowed); Basic is open by default, with a network security group
optional.
• Outbound Rules: Standard supports declarative outbound NAT configuration; not
available with Basic.
• TCP Reset on Idle: Available on any rule with Standard; not available with Basic.
• Multiple frontends: Standard supports inbound and outbound; Basic supports
inbound only.
• Management operations: Most Standard operations complete in under 30
seconds; 60-90+ seconds is typical for Basic.
• SLA: 99.99% for Standard; no SLA for Basic.

• The Frontend of the Load Balancer can be configured with two types of IPs for
access, and the selection determines the type of load balancer created. The two
types, available in both Basic and Standard SKUs, are:

o Public Load Balancer maps the public IP/port of inbound traffic to the
internal/private IP/port of the backend pool, and maps traffic in reverse for
return traffic to the client. Load balancing rules can distribute specific
types of traffic spanning multiple VMs or services, such as spreading web
request traffic loads across multiple web servers.

o Internal Load Balancer restricts frontend IP access to a VNet by
distributing traffic to resources inside the VNet and never directly
exposing it to an internet endpoint. Internal LOB applications run
within Azure and are accessed in Azure or from on-prem resources.

• The Backend pool comprises a group of VMs in a VNet or instances in
a Virtual Machine Scale Set (VMSS) that serve inbound requests and can meet
high volumes of traffic by scaling backend instances up or out.

o Scaling instances up or down and scaling out or in automatically
reconfigures the Load Balancer without any additional operational
requirements.

o To optimize the length of management operations, design for the smallest
number of backend pool resources that meets requirements; data plane
performance and scale are indistinguishable regardless of pool size.

• The Health Probe defines the unhealthy thresholds for backend pool
instances; when a probe response fails, the load balancer ceases sending new
flows to the unhealthy node. Health probe failures do not affect existing
connections, which continue until the application terminates the flow, a
timeout occurs, or the VM shuts down. Health probe types include TCP, HTTP and
HTTPS endpoints for Standard Load Balancers; HTTPS probes are not
supported on Basic Load Balancers.

• Load Balancing Rules map the load balancer’s frontend IP and application-
specific port requirements to multiple backend IPs and ports, so application and
service requirements should dictate the design for optimal performance.

• Inbound NAT rules employ the same hash-based distribution as load balancing
to implement port forwarding of frontend IP traffic to backend pool instances,
while Outbound Rules configure and handle outbound NATing for all backend
pool instances.
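
To illustrate how these items fit together, the following is a minimal Azure PowerShell sketch of a Standard public Load Balancer with a frontend, backend pool, health probe and load balancing rule. The resource group, region and resource names are hypothetical and shown only for orientation:

# Hypothetical resource group and region
$rg  = "rg-omega-demo"
$loc = "westeurope"

# Zone-redundant Standard public IP to serve as the frontend
$pip = New-AzPublicIpAddress -ResourceGroupName $rg -Name "pip-lb-demo" -Location $loc -Sku Standard -AllocationMethod Static

# Frontend, backend pool, health probe and load balancing rule definitions
$fe    = New-AzLoadBalancerFrontendIpConfig -Name "fe-web" -PublicIpAddress $pip
$pool  = New-AzLoadBalancerBackendAddressPoolConfig -Name "pool-web"
$probe = New-AzLoadBalancerProbeConfig -Name "probe-http" -Protocol Http -Port 80 -RequestPath "/" -IntervalInSeconds 15 -ProbeCount 2
$rule  = New-AzLoadBalancerRuleConfig -Name "rule-http" -FrontendIpConfiguration $fe -BackendAddressPool $pool -Probe $probe -Protocol Tcp -FrontendPort 80 -BackendPort 80

# Create the Standard Load Balancer; VMs or VMSS instances are joined to the pool separately
New-AzLoadBalancer -ResourceGroupName $rg -Name "lb-web-demo" -Location $loc -Sku Standard -FrontendIpConfiguration $fe -BackendAddressPool $pool -Probe $probe -LoadBalancingRule $rule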

High Availability

An internal Standard Load Balancer assists with load balancing TCP and UDP flows on
all ports simultaneously. A variant of a load balancing rule, an HA ports rule simplifies
balancing all TCP and UDP flows by providing a single rule configuration; load
balancing decisions are executed per flow, and the feature is supported in all global
Azure regions. The decision algorithm is based on a five-tuple connection:

1. Source IP

2. Source Port

3. Destination IP

4. Destination Port

5. Protocol

Note: While a 5-tuple algorithm may yield the best possible load distribution
amongst the members in the pool, some scenarios require source IP address
affinity, where a 2-tuple algorithm is the better fit.

If a large number of ports must be load balanced, or critical scenarios such as HA and
scale for network virtual appliances inside VNets must be supported, implement an HA
ports load balancing rule by configuring the frontend and backend ports to ‘0’ and the
protocol to ‘All’ so the load balancer balances all flows regardless of port number.
Applications requiring load balancing of large numbers of ports are ideal for HA ports
rules, which simplify the implementation by deploying an internal Standard Load
Balancer with one HA ports rule versus multiple load balancing rules configured per
port, as sketched below.
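
As a sketch of the configuration just described, an HA ports rule is an ordinary load balancing rule on an internal Standard Load Balancer with frontend and backend ports set to 0 and the protocol set to All. The VNet and subnet names are hypothetical, and the backend pool and probe objects are reused from the earlier sketch:

# Internal frontend on a hypothetical NVA subnet
$vnet   = Get-AzVirtualNetwork -ResourceGroupName $rg -Name "vnet-hub"
$subnet = Get-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "snet-nva"
$fe     = New-AzLoadBalancerFrontendIpConfig -Name "fe-nva" -Subnet $subnet

# HA ports rule: port 0 / protocol All balances every TCP and UDP flow per the five-tuple
$haRule = New-AzLoadBalancerRuleConfig -Name "rule-haports" -FrontendIpConfiguration $fe -BackendAddressPool $pool -Probe $probe -Protocol All -FrontendPort 0 -BackendPort 0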

The following are examples of supported HA port rule configurations to assist with
identifying applications and services ideal for use and further raising availability:

• Non-Direct Server Return (non-floating IP) HA ports Internal Load Balancer
configuration, the basic HA ports configuration, does not allow any other rule
configuration on the load balancer resource and does not allow any other internal
load balancer resource configuration for the backend instance set. This includes
the following:

o Select HA ports check box in Load Balancer rule configuration in a


Standard Load Balancer.

o Disable Floating IP.

o Note: A Public Standard Load Balancer can be configured for the backend
instance set in addition to this HA ports rule configuration.

• Direct Server Return (floating IP) HA ports rule Internal Load Balancer
configuration allows you to add more floating IP load balanced rules and/or a
Public Load Balancer. However, you cannot combine this configuration with the
previous non-floating IP configuration.

• A multiple-HA-ports-frontends Internal Load Balancer configuration requires
configuring more than one HA ports frontend for the same backend pool
and should implement the following options:

o For a single Internal Standard Load Balancer resource, configure more


than one frontend private IP.

o Deploy multiple load balancing rules where each rule has a single, unique,
frontend IP address assigned.

o For all load balancing rules, select HA ports option and set Floating IPs to
Enabled.

The following limitations should be reviewed to ensure an optimal and supported HA
ports rule configuration:

• Internal Standard Load Balancers are the only supported configuration available
for HA ports rules.

• Non-HA ports and HA ports load balancing rules pointing to the same backend
pool IPs is not supported.

• IP fragmentation of UDP or TCP packets is not supported; existing IP fragments
will be forwarded by HA ports rules to the same destination as the initial packet.

• Primarily relevant for Network Virtual Appliances, flow symmetry is supported
only for a backend instance with a single NIC when using HA ports rules, and is
not provided in any other scenario. The implication of this design decision is that
two or more Load Balancers, and their respective rules, execute independent
decisions and are never coordinated.

Additionally, load balancing services on multiple ports and/or multiple IPs is supported
on Public and Internal load balancers to balance flows across a set of VMs. A frontend
and a backend pool configuration are connected when defining Azure Load Balancer
rules and the health probe referenced by the rule determines how new flows are sent to
a backend pool node. The frontend virtual IP (VIP) is defined by a 3-tuple comprising a
Public or Internal IP, a transport protocol and port number from the load balancing rule
whereas the backend pool is the collection of VM IP configurations attached to the VM
NIC resource, which reference the Load Balancer backend pool.

Flexibility is provided when defining the load balancing rules; a rule declares IP/port
mapping on the frontend to destination IP/port on the backend and the type of rule
dictates whether backend ports are reused across rules. There are two types of rules:

• No backend port reuse default rule.

• Backend port reuse via Floating IP.

Azure Load Balancers support mixing both rule types, using them simultaneously for a
given VM, or in any combination, provided the constraints of each rule are followed.
Application requirements and support complexity will dictate the rule type selected, so
each option should be evaluated.

The following are limitations when utilizing multiple frontends to enhance HA in the
environment:

• Multiple frontend configurations are supported on IaaS VMs only.

• An application must use the primary IP configuration for outbound SNAT flows
when using a Floating IP rule; if the application binds to the frontend IP
configured on the guest OS loopback interface, Azure’s outbound SNAT is
unavailable to rewrite the outbound flow, and the flow fails.

Disaster Recovery

The Azure Load Balancer offering has DR natively built in via the Azure Service
Fabric. Additional considerations should reference the Backups and Availability
Zones sections below. For DR, an Azure Load Balancer should also be created in
the ASR target region.

Backups

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Load Balancer service, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Azure Load Balancer.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.


3. Test restore activities in dev environment to validate export fidelity.
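
A minimal sketch of this export-and-store pattern in Azure PowerShell follows; the resource group, load balancer, storage account and container names are hypothetical:

# 1. Export the Load Balancer's ARM template to a local file
$lb = Get-AzLoadBalancer -ResourceGroupName "rg-omega-demo" -Name "lb-web-demo"
Export-AzResourceGroup -ResourceGroupName "rg-omega-demo" -Resource $lb.Id -Path "C:\templates\lb-web-demo.json"

# 2. Upload the exported template to a geo-redundant storage account
$ctx = (Get-AzStorageAccount -ResourceGroupName "rg-omega-demo" -Name "sttemplatesgrs").Context
Set-AzStorageBlobContent -Context $ctx -Container "templates" -File "C:\templates\lb-web-demo.json" -Blob "lb-web-demo.json"

# 3. Validate in dev by redeploying the exported template
New-AzResourceGroupDeployment -ResourceGroupName "rg-omega-dev" -TemplateFile "C:\templates\lb-web-demo.json"

The same pattern applies to the other services in this document that rely on template export in lieu of traditional backups.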

Availability Sets

Azure Load Balancer does not offer the Availability Set feature as an available
function of the service offering.

Availability Zones

Azure Standard Load Balancers support optimizing availability in an end-to-end
scenario by aligning resources with zones and distributing them across zones. The
Load Balancer resource itself is regional, never zonal, and zone behavior is configured
at the granularity of each frontend, rule and backend pool definition. Load Balancer
behavior and properties are described as:

• Zone redundant, meaning multiple zones.

• Zonal, single zone service isolation.

Public and Internal Load Balancers can support zone redundant and zonal
configurations and can direct traffic across zones, as needed (cross zone load
balancing). The core components of Load Balancers should include the following design
considerations:

• Frontends for Standard Load Balancers can be zone redundant, meaning all
inbound or outbound flows are served by multiple Azure Availability Zones in a
region simultaneously utilizing a single IP, can survive zone failure and can be
utilized to reach all non-impacted backend pool members regardless of the zone.
One or more Availability Zones can fail and the data path will survive as long as
one zone in the region remains healthy by enabling the frontend’s single IP to be
served simultaneously by multiple, independent, infrastructure deployments in
multiple zones. While this does not imply a hitless data path, any retries should
succeed in other non-impacted zones.

• Backends comprise standalone Availability Sets or Virtual Machine Scale Sets


(VMSSs), so addressing VMs across multiple zones requires placing VMs from
multiple zones into the same backend pool. VMSSs can be placed into the same
backend pool with each VMSS residing in a single, or multiple zones.
• Outbound connections also utilize zone redundant and zonal principles where a
zone redundant public IP used for outbound connection is served by all zones
while a zonal public IP is served only by the zone which it is guaranteed. With
zone redundant outbound connections, SNAT port allocations survive zone
failures and the design will continue to provide outbound SNAT connectivity if
unaffected by zone failure.

• Zone redundant frontends cause the Load Balancer to expand its internal health
model to independently probe the reachability of a VM from each Availability
Zone and shut down failed paths across zones without manual intervention.
During failure events it is therefore possible for each zone to have slightly
differing distributions of new flows while the overall health of the service is
protected.
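
Zone behavior for a public frontend is driven by the public IP it references. A brief sketch, assuming the Az PowerShell module and hypothetical names:

# Zone-redundant Standard public IP (served by all zones; the default for Standard SKU in zone-enabled regions)
$zrPip = New-AzPublicIpAddress -ResourceGroupName "rg-omega-demo" -Name "pip-zr" -Location "westeurope" -Sku Standard -AllocationMethod Static

# Zonal Standard public IP pinned to Availability Zone 1 (served only by that zone)
$zPip = New-AzPublicIpAddress -ResourceGroupName "rg-omega-demo" -Name "pip-z1" -Location "westeurope" -Sku Standard -AllocationMethod Static -Zone 1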

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Load Balancers:

• Use standard Load Balancer SKU.

• Zone redundant flows can leverage any zone, and Load Balancer flows will
utilize all healthy zones in a region, so in the event of a zone failure, traffic
flowing through healthy zones is unimpacted. Traffic flowing through the failed
zone may be impacted but applications can recover, so implementing zone
redundancy helps ensure traffic reaches its intended destinations.

• Analyze applications that can support and implement HA Port rules on Load
Balancers where possible to add reliability and high availability.

• Utilize Standard Load Balancers vs Basic Load Balancers where possible to
expand the scope of backend pool resources; Basic is scoped to an availability
set, supporting up to 300 instances, and Standard is scoped to an entire VNet,
supporting up to 1,000 instances.

• Implement VM Availability Sets and Virtual Machine Scale Sets, where possible,
for backend pools to increase both instances and scaling of VM presence during
heavy workloads.
• When designing zone failure resiliency, applications that consist of multiple
components should be carefully analyzed to consider the survival of sufficient
critical components and impacts of zone failure. How an application converges
after a zone failure and restoration should also be taken into consideration when
designing resiliency in regard to all components, including Load Balancing rules.
6.2. Azure Application Insights

6.2.1. Overview
Application Insights, now integrated natively to Azure Monitor, is an extensible
Application Performance Management (APM) service for Developers and DevOps
professionals. It can be leveraged to monitor live applications and will automatically
detect performance anomalies. Application Insights includes powerful analytics tools to
help you diagnose issues and to help better understand how users consume your
applications for continuous improvement of performance and usability. It works for apps
on a wide variety of platforms, hosted on-premises or hybrid, and is cloud agnostic. It
integrates with your DevOps process and has connection points to a variety of
development tools.

For example, Application Insights can monitor and analyze telemetry from PaaS web
apps, IaaS web services via custom agent installation, and mobile apps by integrating
with Visual Studio App Center, instrumenting not only the web service application but
also any background components.

Application Insights is aimed at the development and support teams to help you
understand how your app is performing and how it is being used. It monitors:

• Request rates, response times, and failure rates: Find out which pages are
most popular, at what times of day, and where your users are. See which pages
perform best. If your response times and failure rates rise when there are more
requests, then perhaps you have a resourcing problem.

• Dependency rates, response times, and failure rates: Find out whether
external services are slowing you down.

• Exceptions: Analyze the aggregated statistics, or pick specific instances and drill
into the stack trace and related requests. Both server and browser exceptions are
reported.

• Page views and load performance: reported by your users' browsers.

• AJAX calls from web pages – rates, response times, and failure rates.

• User and session counts.

• Performance counters from your Windows or Linux server machines, such as


CPU, memory, and network usage.

• Host diagnostics from Docker or Azure.

• Diagnostic trace logs from your app - so that you can correlate trace events with
requests.

• Custom events and metrics that you write yourself in the client or server code, to
track business events such as items sold or games won.

6.2.2. Deployment Options


Application Insights is now part of the Azure Monitor service offering; Azure Monitor,
Log Analytics, and Application Insights now form a single service that provides end-
to-end monitoring of applications and the components they rely on. There are several
methods to deploy:

• At run time: instrument your web app on the server – Ideal for applications
currently deployed; avoids any update to the code.

o ASP.NET or ASP.NET Core applications hosted on Azure Web Apps.

o ASP.NET applications hosted in IIS on Azure VM or Azure VMSS.

o ASP.NET applications hosted in on-premises IIS VMs.

• At development time: add Application Insights to your code – Allows


customization of telemetry collection and send additional telemetry.
o ASP.NET Applications

o ASP.NET Core Applications

o .NET Console Applications

o Java

o Node.js

o Python

o Additional platforms
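
Regardless of the instrumentation method, an Application Insights resource must exist to receive telemetry. A minimal sketch, with hypothetical names:

# Create the Application Insights resource and read the instrumentation key the app will use
$ai = New-AzApplicationInsights -ResourceGroupName "rg-omega-demo" -Name "appi-demo" -Location "westeurope"
$ai.InstrumentationKey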

6.2.3. High Availability


Azure Monitor and Application Insights allow you to collect granular performance and
utilization data, activity and diagnostics logs, and to define alerts and notifications from
your Azure resources in a consistent manner. Microsoft guarantees that 99.9% of the
time, Azure Monitor will execute alert rules and trigger and deliver notifications.

6.2.4. Disaster Recovery


The Application Insights, and subsequently Azure Monitor service offering has
DR natively built in via the Azure Service Fabric.

6.2.5. Backup
Due to the nature of the service offering and lack of traditional backup operations for
Application Insights, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Application Insights configuration.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.


6.2.6. Availability Sets
Azure Application Insights and Azure Monitor do not offer the Availability Set
feature as an available function of the service offering.

6.2.7. Availability Zones


Azure Application Insights, now integrated directly into Azure Monitor, is a core feature
of the Azure Service Fabric, so in lieu of a traditional Availability Zone configuration the
following SLA applies to Azure Monitor Alerts and Notification Delivery:

Monthly Uptime Percentage Service Credit


< 99.9% 10%
< 99% 25%

6.2.8. Regions
Please refer to the Appendix for regional availability.

6.2.9. Deployment Guidelines and Recommendations


The following are guidelines and best practices to consider when monitoring
applications:
• Create multiple Application Insights resources to split telemetry for different
environments, rather than a single resource with custom dimensions to tag the
data source. This provides:
o Better separation of telemetry: alerts, work item configuration and RBAC
permissions.
o Spreading of limits such as web test count, throttling and data allowance.
o Cross-resource queries.
• Utilize the Status Monitor tool if your application is instrumented with the SDK
and runs on .NET 4.5, or on .NET 4.6 where additional detailed dependency
information is required. The Status Monitor tool is being deprecated in favor of
the Azure Monitor Application Insights Agent (Public Preview), which should be
considered when GA.
• Ensure a capping limit is configured for dev/test environments to prevent
production data loss in the event a set limit is surpassed. Application Insights
samples data heavily and may retain as little as 5% of actual telemetry, so tuning
is highly recommended for more accurate data analysis.
• To ensure application data, such as dependencies and performance counters, is
successfully collected you should validate:
o Firewall and Proxy rules
o iKey configuration
o User/Service account running IIS has access
o Periodically call the Flush method
• Configure alerts only for relevant resources to avoid notification oversaturation.
• Configure Application Insights resource permissions via RBAC to control access
to telemetry data.
6.3. Azure Automation (Account)

6.3.1. Overview
Azure Automation delivers a cloud-based automation and configuration service enabling
consistent management across Azure and non-Azure environments. Azure Automation
comprises process automation, configuration management, update management,
shared capabilities, and heterogeneous features. Automation gives you complete
control during deployment, operations, and decommissioning of workloads and
resources. Note that, currently, in One Platform, Automation accounts are not used for
Configuration Management or Update Management. The Automation account provides
the following capabilities:

• Process Automation in Azure Automation allows you to automate frequent,


time-consuming, and error-prone cloud management tasks.

• Configuration Management is a cloud-based solution for PowerShell desired


state configuration (DSC) to help you diagnose unwanted changes and raise
alerts.

• Update Management for Windows and Linux systems across hybrid


environments.

o Note: Update Management runs have failed in the past due to the
vastness of the environment.

• Shared Capabilities, including shared resources, role-based access control,


flexible scheduling, source control integration, auditing, and tagging.

• Heterogeneous support Automation is designed to work across hybrid cloud


environment and also for Windows and Linux systems. It delivers a consistent
way to automate and configure deployed workloads and the operating systems
that run them.

Common scenarios for Automation

Azure Automation supports management throughout the lifecycle of your infrastructure


and applications. Common scenarios include:

• Write runbooks: Author PowerShell, PowerShell Workflow, graphical, Python 2,


and DSC runbooks in common languages.
• Build and deploy resources: Deploy virtual machines across a hybrid
environment using runbooks and Azure Resource Manager templates. Integrate
into development tools, such as Jenkins and Azure DevOps.

• Configure VMs: Assess and configure Windows and Linux machines with
configurations for the infrastructure and application.

• Share knowledge: Transfer knowledge into the system on how your


organization delivers and maintains workloads.

• Retrieve inventory: Get a complete inventory of deployed resources for


targeting, reporting, and compliance.

• Find changes: Identify changes that can cause misconfiguration and improve
operational compliance.

• Monitor: Isolate machine changes that are causing issues and remediate or
escalate them to management systems.

• Protect: Quarantine machines if security alerts are raised. Set in-guest


requirements.

• Govern: Set up RBAC for teams. Recover unused resources.

6.3.2. Deployment Options


The Azure Automation Account service utilizes a highly scalable and reliable workflow
execution engine native to the Azure Service Fabric, enabling deep integration into
both cloud and on-premises environments. An Automation account can be deployed
using an Azure Resource Manager template or as a standalone Azure Automation
account, as sketched below.
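
A minimal sketch of both options, with hypothetical names:

# Standalone Automation account
New-AzAutomationAccount -ResourceGroupName "rg-omega-demo" -Name "aa-omega-demo" -Location "westeurope"

# Or deploy via an ARM template (the template file path is hypothetical)
New-AzResourceGroupDeployment -ResourceGroupName "rg-omega-demo" -TemplateFile "C:\templates\automation-account.json"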

6.3.3. High Availability


Microsoft guarantees that at least 99.9% of runbook jobs will start within 30 minutes of
their planned start times, and at least 99.9% availability of the Azure Automation DSC
agent service. No SLA is provided for the Free tier of Azure Automation.
6.3.4. Disaster Recovery
The Azure Automation Account service offering has DR natively built in via the
Azure Service Fabric.

6.3.5. Backup
Due to the nature of the service offering and lack of traditional backup operations for
Azure Automation accounts, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Azure Automation account.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.

6.3.6. Availability Sets


Azure Automation does not offer the Availability Set feature as an available
function of the service offering.

6.3.7. Availability Zones


Azure Automation is a core feature of the Azure Service Fabric so in lieu of a traditional
Availability Zone configuration the following SLA applies to Azure Automation Service -
DSC and Process Automation:

Monthly Uptime Percentage Service Credit


< 99.9% 10%
< 99% 25%

6.3.8. Regions
Please refer to the Appendix for regional availability.
6.3.9. Deployment Guidelines and Recommendations
The following are some guidelines and best practices to consider when implementing
Azure Automation processes:

• Utilize tags to organize and track workflows and to collect and report on metadata
from a range of Azure resources.

• Understand and deploy serverless options, where possible, to utilize and


implement a layer of abstraction, offloading maintenance of underlying
infrastructure to the Azure Service Fabric.

• Implement infrastructure-as-code templates in ARM to clone groups of related


resources and utilize automation script in the Azure Portal to download the entire
declared group as a JSON file. One caveat is the templates do not replicate data
and not all configuration items will be ported over, such as Azure Key Vault
access policies, so additional manual declaration of resources will be required.

• Use Azure Automation assets and never hardcode values, especially secure
information; see the sketch below.
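
A brief sketch of the assets recommendation inside a runbook; the asset names are hypothetical, and the Get-Automation* cmdlets are available only within the runbook execution context:

# Retrieve shared assets at run time instead of hardcoding values in the runbook
$endpoint = Get-AutomationVariable -Name "OmegaApiEndpoint"
$cred     = Get-AutomationPSCredential -Name "OmegaServiceAccount"

Credentials and encrypted variables stay in the Automation account, so rotating a secret never requires editing runbook code.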
6.4. Azure Recovery and Backup Services
6.4.1. Azure Backup Service

Overview

The Azure Backup service provides simple, secure, and cost-effective solutions to back
up your data and recover it from the Microsoft Azure cloud. Azure Backup supports
backup of Azure VMs utilizing Azure Disk Encryption (ADE) leveraging Bitlocker and
dm-crypt features. ADE is integrated with Key Vault, managing disk encryption
keys/secrets and can also leverage Key Vault Key Encryption Keys (KEKs) to add an
additional security layer by encrypting encryption secrets prior to Key Vault writes. The
following are some limitations when backing up encrypted VMs:

• Backup and restore operations on VMs can be executed within the same
subscription and region as the Recovery Services Backup Vault.

• Standalone keys are supported with Azure Backup; certificate key pairs used to
encrypt VMs are currently unsupported.

• Encrypted VMs cannot be restored at the file/folder level; the entire VM needs to
be restored for file/folder recovery.

• When restoring a VM, the replace existing VM option cannot be used for
encrypted VMs, as this option is only available for unencrypted managed disks.

• Azure Backup requires access to Key Vaults to back up encrypted VMs, so


permissions need to be set before successful backup.

What can I back up?

• On-premises: Back up files, folders, system state using the Microsoft Azure
Recovery Services (MARS) agent. Or use the DPM or Azure Backup Server
(MABS) agent to protect on-premises VMs (Hyper-V and VMWare) and other on-
premises workloads

• Azure VMs: Back up entire Windows/Linux VMs (using backup extensions) or


back up files, folders, and system state using the MARS agent.

• Azure Files shares: Back up Azure File shares to a storage account

• SQL Server in Azure VMs: Back up SQL Server databases running on Azure
VMs
• SAP HANA databases in Azure VMs: Back up SAP HANA databases running
on Azure VMs

Deployment Options

Azure Backup stores backed up data in Azure Recovery Services vaults, a feature
deeply integrated into the Azure Service Fabric. There are multiple backup options,
however, one main caveat is the backed-up item must reside in the same region as the
Recovery Services Vault. For Azure VMs, data is encrypted-at-rest using Storage
Service Encryption (SSE). Azure Data Encryption-at-Rest is detailed here:

• https://docs.microsoft.com/en-us/azure/security/fundamentals/encryption-atrest

• https://docs.microsoft.com/en-us/azure/security/fundamentals/encryption-
overview

The following are vault-supported features:

• Vaults in subscription: Up to 500 vaults in a single subscription.
• Machines in a vault: Up to 1,000 Azure VMs in a single vault; up to 50 MABS
(Microsoft Azure Backup Server) servers.
• Data sources: The max size of an individual data source is 54TB; this does not
apply to Azure VM backups.
• Backups to vault: Azure VMs: once a day. Machines protected by MABS/DPM:
twice daily. Machines backed up directly via the MARS (Microsoft Azure
Recovery Services) agent: three times daily.
• Backups between vaults: Backup is within a region. A vault is needed in every
Azure region that contains VMs targeted for backup; cross-region backups are
not allowed.
• Move data between vaults: Moving backed-up data between vaults is not
supported.
• Modify vault storage type: A vault is created utilizing Geo-redundant storage
(GRS) by default. If Locally redundant storage (LRS) is desired, a new vault and
a manual configuration change are required, as the replication type cannot be
modified after backups begin.

High Availability

Microsoft guarantees at least 99.9% availability of the backup and restore functionality
of the Azure Backup service.

Disaster Recovery

The Azure Backup service offering has DR natively built in via the Azure Service Fabric
and is configured by default to use Geo-redundant storage. If Locally Redundant
Storage is required, the vault should be modified before beginning backup activities
(see the sketch below). Cross Region Restore is disabled by default, so it should also
be configured prior to initial backups.
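
A minimal sketch of switching a new vault to LRS before any items are protected; names are hypothetical, and the setting cannot be changed once backups begin:

$vault = Get-AzRecoveryServicesVault -ResourceGroupName "rg-omega-demo" -Name "rsv-demo"
Set-AzRecoveryServicesBackupProperty -Vault $vault -BackupStorageRedundancy LocallyRedundant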

The Recovery Services Vault can be configured to utilize Cross Region Restore to
restore Azure VMs in a secondary region, which is the Azure paired region. The vault
must be configured to utilize GRS storage and will automatically select the secondary
region for replication. This option allows:

• Conducting of drills to meet an audit or compliance requirement.

• Restoring the VM or its disk if the primary region is experiencing a disaster.

This feature must be enabled to onboard a subscription by executing the following:


Register-AzProviderFeature -FeatureName CrossRegionRestore -ProviderNamespace Microsoft.RecoveryServices

Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Backup service, the following method can be leveraged to store existing
configuration templates for redeployment:
1. Export templates of each Recovery Services Vault.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.

Availability Sets

Azure Backup does not offer the Availability Set feature as an available function
of the service offering.

Availability Zones

Azure Backup does not offer Availbility Zones as an available function of the
service offering.

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when implementing
Azure Backups:

• Modify the default schedule times that are set in a policy. For example, if the
default time in the policy is 12:00 AM, increment the timing by several minutes so
that resources are optimally used.

• If you're restoring VMs from a single vault, it is recommended to utilize
different general-purpose v2 storage accounts to ensure that no single target
storage account is negatively impacted. For example, each VM should use a
different storage account; if 10 VMs are restored, use 10 different storage
accounts (see the sketch after this list).

• For backup of VMs that use premium storage with Instant Restore, it is
recommended to allocate 50% free space of the total allocated storage space;
this is required only for the initial backup. The 50% free space is not a
requirement for backups after the first backup is complete.
• The limit on the number of disks per storage account is relative to how heavily
the disks are being accessed by applications that are running on an Azure VM.
As a general rule, if 5 to 10 disks or more are present on a single storage
account, balance the load by moving some disks to separate storage accounts.
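
As a sketch of the storage account guidance above, a restore of a VM recovery point as disks can target a dedicated staging storage account per VM; all names are hypothetical:

$vault = Get-AzRecoveryServicesVault -ResourceGroupName "rg-omega-demo" -Name "rsv-demo"
$item  = Get-AzRecoveryServicesBackupItem -BackupManagementType AzureVM -WorkloadType AzureVM -Name "vm-app01" -VaultId $vault.ID
$rp    = Get-AzRecoveryServicesBackupRecoveryPoint -Item $item -VaultId $vault.ID | Select-Object -First 1

# Target a storage account dedicated to this VM's restore to avoid contention
Restore-AzRecoveryServicesBackupItem -RecoveryPoint $rp -StorageAccountName "strestorevm01" -StorageAccountResourceGroupName "rg-omega-demo" -VaultId $vault.ID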
6.4.2. Azure Site Recovery

Overview

Azure Site Recovery Services enables the adoption of a business continuity and
disaster recovery (BCDR) strategy. The organization’s strategy should provide data,
application and workload resiliency to ensure availability and recovery when both
scheduled and unscheduled outages occur.

Azure Site Recovery contributes to your BCDR strategy through:

• Site Recovery service: Site Recovery replicates workloads running on physical
and virtual machines (VMs) from a primary site to a secondary location, so that if
an outage occurs at your primary site, you fail over to, and access apps from, the
secondary location. After the primary location is running again, you can fail back
to it. A replica of the source VM and data is provisioned in the target region, and
an app-consistent snapshot is executed at a minimum of once an hour.

• Backup service: The Azure Backup service keeps your data safe and
recoverable.

Site Recovery can manage replication for:

• Azure VMs replicating between Azure regions.

• On-premises VMs, Azure Stack VMs, and physical servers.

Site Recovery is not used for DR scenarios for on-premises VMs in
KPMG. It is used for migration purposes only.

What can be replicated:

• Replication scenarios: Replicate Azure VMs from one Azure region to another;
replicate on-premises VMware VMs, Hyper-V VMs, physical servers (Windows
and Linux) and Azure Stack VMs to Azure; replicate on-premises VMware VMs,
Hyper-V VMs managed by System Center VMM, and physical servers to a
secondary site.
• Regions: Review supported regions for Site Recovery.
• Replicated machines: Review the replication requirements for Azure VM
replication, on-premises VMware VMs and physical servers, and on-premises
Hyper-V VMs.
• Workloads: You can replicate any workload running on a machine that's
supported for replication. The Site Recovery team has also executed app-specific
tests for a number of apps.

Deployment Options

Azure Site Recovery configurations are stored in Azure Recovery Services vaults, along
with Azure Backup policies. Azure Site Recovery allows the replication of VMs to a
supported geo-cluster region by creating a Recovery Plan and selecting the source and
target regions. The following items need to be configured for a Recovery Plan:

1. Create a Recovery Plan in Azure Recovery Services Vault.

2. Select a source and target based on the machines targeted for the plan.

• Azure to Azure: the source is the current VM region; the target is the selected
target region.
• On premises to Azure: the source is the selected on-prem vCenter; the target is
Azure.
3. Select the machines to be added to the plan.

a. Virtual Machines are added to a default group (Group 1) in the plan. After
failover all machines in this group start at the same time.

b. Only Virtual machines in the source and target locations specified can be
selected.

The following items should be noted:

• A Recovery Plan can only be used for failover from source to target; it cannot be
used for failback.

• The source location machines are required to be enabled for failover and
recovery.

• A recovery plan can contain machines with the same source and target.
A group can be created within a plan to specify behaviors on a group-by-group basis so
a specific group can be identified to start up in a desired sequence to facilitate required
service access during startup.

High Availability

Microsoft guarantees at least 99.9% availability for On-Premises-to-On-Premises
Failover, and provides a 100% service credit if On-Premises-to-Azure Failover or
Azure-to-Azure Failover recovery time exceeds 2 hours.

Note: The entire VM state is replicated to include applications/services.

Disaster Recovery

The Azure Site Recovery offering has DR natively built in via the Azure Service
Fabric.

Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Site Recovery service, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Recovery Services Vault.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in development environment to validate export fidelity. It is


strictly forbidden to have PROD data in DEV or QA.

Availability Sets

Azure Site Recovery does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones

Azure Site Recovery is a core feature of the Azure Service Fabric, so in lieu of a
traditional Availability Zone configuration, the SLA described under High Availability
above applies to Azure Site Recovery Azure-to-Azure Failover.

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when implementing
Azure Site Recovery:

• Ensure VMs, and subsequent NSGs, are configured to allow appropriate service
endpoint access.

• Execute a risk assessment for each application to validate app-specific


requirements. Mission critical applications require additional scrutiny to ensure
service continuity.

• Site Recovery can scale to thousands of VMs, however, Recovery Plans should
be designed as a model for applications that fail over as a group so a limit of 50
VMs per plan helps reduce overall downtime during failover.

• Develop and define Recovery Time Objectives (RTO) and Recovery Point
Objectives (RPO) for each application.

• Implement multi-site DR and Backup strategies that span outages ranging from
an individual service up to an entire region.

• Develop and execute scheduled DR simulations, validating expected failover


timeframes and data integrity to further refine strategies.

• For On-premises to Azure replication, Orange Team involvement is required for


replication traffic prioritization to avoid negative impacts on production
applications.
6.5. Azure Storage Services
6.5.1. Azure Blob Storage (Block/Append/Page)

Overview

Azure Blob Storage is an object storage solution service in the Azure Service Fabric.
Blob Storage is optimized for storing large amounts of unstructured data that does not
align to a specific model or definition, such as text or binary data.

Blob storage supports three types of blobs and is designed for:

• Types:

o Block blobs contain text and binary data up to a max of 4.7TB and are
comprised of individually managed blocks of data.

o Append blobs are comprised of blocks and optimized for append
operations, such as logging data from VMs.

o Page blobs contain random-access files utilized by VHDs, which serve as
disks for Azure VMs, up to a max of 8TB in size.

• Usage

o Image or document hosting accessed from a browser.

o Distributed file storage access.

o Audio and video streaming.

o Backup/Restore, DR and archive data storage.

o Azure hosted service analysis data storage.

o VM disk storage and management.

Blob storage objects are accessible over HTTP or HTTPS; however, KPMG policy
requires HTTPS URLs for user or client applications via the Azure Storage REST API,
Azure PowerShell, Azure CLI or an Azure Storage client library such as:

• .NET

• Java

• Node.js

• Python
• PHP

• Ruby

Deployment Options

Azure Storage employs multiple deployment options to meet organization and


application specific requirements.

The following are deployment configurations available and corresponding metrics for
Azure Blob Storage:

• Storage Performance Tiers

o Standard v2 storage is recommended for block and append blobs and
provides access to up-to-date Azure Storage features. The Hot tier is
targeted at most workloads, while the Cool and Archive tiers are best
suited for cool/cold data optimized for cost efficiency.

o Premium storage is recommended for block and append blobs requiring
consistent, low-latency data access optimized for high transaction
rates.

• Access Tiers and considerations

o Hot is targeted and optimized for frequently accessed data storage.

o Cool is targeted and optimized for infrequently accessed data stored for at
least 30 days.

o Archive is targeted and optimized for rarely accessed data stored for at
least 180 days with flexible latency requirements.

o Hot and Cool access tiers are only set at the account level whereas
Archive is not available at the account level.

o Hot, Cool and Archive tiers can be set during or after upload at the blob
level.

o Cool access tier data can tolerate slightly lower availability but still
requires durability, retrieval latency and throughput similar to Hot tier
data. For this reason, a slightly lower SLA and higher access costs
counterbalance the lower storage costs relative to Hot storage.

o Archive access tier data is stored offline with the lowest storage costs,
offset by the highest access and data rehydration costs.

• Storage account types and capabilities:

o General-purpose V2 — Services: Blob, File, Queue, Table, Disk, Data
Lake Gen2; Performance tiers: Standard, Premium; Access tiers: Hot,
Cool, Archive; Replication options: LRS, GRS, RA-GRS, ZRS, GZRS
(Preview), RA-GZRS (Preview); Deployment methods: ARM.

o General-purpose V1 — Services: Blob, File, Queue, Table, Disk;
Performance tiers: Standard, Premium; Access tiers: N/A; Replication
options: LRS, GRS, RA-GRS; Deployment methods: ARM, Classic.

o BlockBlobStorage — Services: Blob (block and append blobs only);
Performance tier: Premium; Access tiers: N/A; Replication options: LRS,
ZRS; Deployment methods: ARM.

o FileStorage — Services: File only; Performance tier: Premium; Access
tiers: N/A; Replication options: LRS, ZRS; Deployment methods: ARM.

o BlobStorage — Services: Blob (block and append blobs only);
Performance tier: Standard; Access tiers: Hot, Cool, Archive; Replication
options: LRS, GRS, RA-GRS; Deployment methods: ARM.
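
A minimal sketch of provisioning an account against these options; the account name is hypothetical, and GZRS availability depends on region (Preview at the time of writing):

# General-purpose v2 account, Hot default access tier, geo-zone-redundant replication
New-AzStorageAccount -ResourceGroupName "rg-omega-demo" -Name "stomegademo01" -Location "westeurope" -Kind StorageV2 -SkuName Standard_GZRS -AccessTier Hot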

High Availability

Data in an Azure Storage account is always replicated three times in the primary region
and offers the following options for data replication within primary and secondary
regions:

• Locally Redundant Storage (LRS)

o LRS data is synchronously copied three times within a single physical


location in the primary region and while the least expensive replication
option, it is not recommended for applications requiring HA.
o An application is best suited for LRS if it stores easily reconstructed data,
in the event of data loss, or it is restricted by governance requirements to
not traverse country or region.

• Zone Redundant Storage (ZRS)

o ZRS data is synchronously copied across three Availability Zones in the


primary region and is highly recommended for high availability application
requirements by leveraging ZRS and replicating to a secondary region.

o Applications and scenarios best suited for ZRS require consistency,
durability and HA, as ZRS provides low latency and resiliency for data
should access become temporarily unavailable. Application designs
targeting ZRS should, where possible, include transient fault handling,
such as a retry policy with exponential back-off.

• Geo Redundant Storage (GRS)

o GRS data is synchronously copied three times within a single region and
physical location utilizing LRS, then asynchronously copies data to the
secondary region’s single physical location.

o Applications and scenarios that require read access to the secondary


region should enable Read Access Geo Redundant Storage (RA-GRS).

• Geo Zone Redundant Storage (GZRS)

o GZRS data is synchronously copied across three Azure Availability Zones


in the primary region utilizing ZRS, then asynchronously copies to
secondary region’s single physical location.

o Applications and scenarios requiring the highest levels of performance,


availability, consistency and resiliency for DR should utilize GZRS as you
can continue to read/write data if an Availability Zone becomes
unavailable or is unrecoverable.

o Applications and scenarios that require read access to the secondary


region should enable Read Access Geo Zone Redundant Storage (RA-
GZRS).

Disaster Recovery

Although unplanned service outages do occur, Microsoft’s goal is to ensure the Azure
Storage Account is always available. If an application or service requires resiliency, Microsoft
recommends utilizing, at a minimum, GRS to replicate data to a second region. Paired
with a DR plan to handle regional service outages, failover to a secondary endpoint, in
the event the primary becomes unavailable, is also an important facet of application and
service continuity.

Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
GA. You can initiate the failover process for an account if the primary endpoint becomes
unavailable. Once a failover completes, the secondary endpoint becomes the active
primary endpoint for the account and clients can resume writing. One item of note is the
risk of data loss: data is replicated to the secondary region asynchronously, so there is
an inherent delay between a write to the primary region and its replication to the
secondary. This applies to standard geo-redundant storage replication and during
failover operations, so it is important to understand the implications of initiating an
account failover.

An alternate to the storage account failover is manual data copy operations. If the
account is configured for RA-GRS, then read access to data via secondary endpoint is
available. In the event of an outage in the primary region you can use AzCopy, Azure
PowerShell or the Azure Data Movement Library tools to copy data from the secondary
storage account region to another account in an unaffected region. Once complete, the
newly copied data can be accessed for read and write operations.

In rare catastrophic disaster scenarios, Microsoft may initiate a regional failover. In this
event, you will not have write access to the storage accounts until the Microsoft-
managed failover is complete; however, no action is required. Also, if RA-GRS is
configured for an account, applications can read from a secondary region.
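
A brief sketch of evaluating replication lag and initiating a customer-managed failover; names are hypothetical, and failover should be treated as a disruptive, last-resort operation:

# Check how far the secondary region lags behind the primary
$acct = Get-AzStorageAccount -ResourceGroupName "rg-omega-demo" -Name "stomegademo01" -IncludeGeoReplicationStats
$acct.GeoReplicationStats.LastSyncTime

# Initiate account failover; the secondary endpoint becomes the new primary
Invoke-AzStorageAccountFailover -ResourceGroupName "rg-omega-demo" -Name "stomegademo01"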

Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Blob Storage Account service, the following method can be leveraged to store
existing configuration templates for redeployment:

1. Export templates of each Azure Storage Account.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities/processes in development environment, utilizing non-


KPMG Confidential data, to validate export fidelity.
However, a combination of the storage account types and replication options referenced
above, together with blob snapshots, should be leveraged to enable backup concepts
for blob storage.

Availability Sets

Azure Blob Storage does not offer the Availability Set feature as an available
function of the service offering.

Availability Zones

Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when implementing
Azure Blob Storage:

• Soft delete for Blob storage should be enabled to prevent accidental or malicious
deletion as a first-line function of DR-related activities.

• To improve availability and caching, configure blob snapshots to separate read
and write operations by assigning the corresponding snapshot to each
usage/activity. Snapshots allow caching of blob-level data, increasing the
availability of the content.

• Validate application and service requirements when selecting blob types (Block
and Page) to improve overall performance.

• When utilizing blob storage for static websites/data, the Azure Storage Account
automatically improves availability without requiring user-configured scaling work.
• When utilizing GRS/RA-GRS/GZRS/RA-GZRS the Last Sync Time property
indicates the last successful replication timestamp from primary to secondary
region to evaluate discrepancies. This is helpful when designing an application to
switch seamlessly to reading from the secondary region if the primary becomes
unresponsive.

• To improve availability, define the HTTP cache-control header to enable client-
side caching, decreasing the number of transactions executed against the
storage account and reducing server traffic loads.

• Enabling a content delivery/distribution network (CDN) caches a copy of the blob
closer to the client, improving availability and latency.
6.5.2. Azure File Storage

Overview

Azure Files are SMB accessible file shares that are fully managed. Azure File shares
can be seamlessly integrated with Windows and can be cached on Windows Servers
utilizing Azure File Sync for fast access close to user and data consumption points.

The following are key advantages when implementing:

• Azure File Shares leverage the SMB protocol to seamlessly transition from on-
premises file shares without requiring additional compatibility configurations for
applications.

• Azure File Shares enable serverless data consumption by directly integrating
into the Azure Storage service and Azure Service Fabric to provide built-in
resiliency.

• Applications running in Azure can leverage the file system I/O APIs, Azure
Storage Client Libraries and Azure Storage REST API to allow streamlined
migration of existing applications to Azure File Shares. PowerShell and Azure
CLI can also be utilized to administer Azure applications while the Azure Portal
and Azure Storage Explorer can manage file shares.

Deployment Options

Azure Storage employs multiple deployment options to meet organization and


application specific requirements.

The following are deployment configurations available and corresponding metrics for
Azure File Storage:

• Storage Performance Tiers

o Standard v2 storage is a shared pool of storage that provides access to
up-to-date Azure Storage features, in which file shares, as well as blob or
queue resources, can be deployed with pricing optimized for the lowest
GiB costs. Standard file shares are available on LRS, ZRS and GRS.

o Premium storage is recommended for I/O-intensive workloads requiring
file share semantics along with significantly higher throughput and low
latency, utilizing high-performance SSD-based storage. Premium file
shares are only available on the LRS replication option.

• Storage account types and capabilities:

o General-purpose V2 — Services: Blob, File, Queue, Table, Disk, Data
Lake Gen2; Performance tiers: Standard, Premium; Access tiers: Hot,
Cool, Archive; Replication options: LRS, GRS, RA-GRS, ZRS, GZRS
(Preview), RA-GZRS (Preview); Deployment methods: ARM.

o General-purpose V1 — Services: Blob, File, Queue, Table, Disk;
Performance tiers: Standard, Premium; Access tiers: N/A; Replication
options: LRS, GRS, RA-GRS; Deployment methods: ARM, Classic.

o BlockBlobStorage — Services: Blob (block and append blobs only);
Performance tier: Premium; Access tiers: N/A; Replication options: LRS,
ZRS; Deployment methods: ARM.

o FileStorage — Services: File only; Performance tier: Premium; Access
tiers: N/A; Replication options: LRS, ZRS; Deployment methods: ARM.

o BlobStorage — Services: Blob (block and append blobs only);
Performance tier: Standard; Access tiers: Hot, Cool, Archive; Replication
options: LRS, GRS, RA-GRS; Deployment methods: ARM.

High Availability

Data in an Azure Storage account is always replicated three times in the primary region
and offers the following options for data replication within primary and secondary
regions:

• Locally Redundant Storage (LRS)

o LRS data is synchronously copied three times within a single physical


location in the primary region and while the least expensive replication
option, it is not recommended for user access or applications requiring
HA.

• Zone Redundant Storage (ZRS)

o ZRS data is synchronously copied across three Availability Zones in the


primary region and is highly recommended for high availability user
access and application requirements by leveraging ZRS and replicating to
a secondary region.
• Geo Redundant Storage (GRS)

o GRS data is synchronously copied three times within a single region and
physical location utilizing LRS, then asynchronously copies data to the
secondary region’s single physical location.

o User access and applications that require read access to the secondary
region should enable Read Access Geo Redundant Storage (RA-GRS).

• Geo Zone Redundant Storage (GZRS)

o GZRS data is synchronously copied across three Azure Availability Zones


in the primary region utilizing ZRS, then asynchronously copies to
secondary region’s single physical location.

o User access and applications requiring the highest levels of performance,


availability, consistency and resiliency for DR should utilize GZRS as you
can continue to read/write data if an Availability Zone becomes
unavailable or is unrecoverable.

o User access and applications that require read access to the secondary
region should enable Read Access Geo Zone Redundant Storage (RA-
GZRS).

Disaster Recovery

Although unplanned service outages do occur, Microsoft’s goal is to ensure the Azure
Service Fabric is always available. If an application or service requires resiliency, Microsoft
recommends utilizing, at a minimum, GRS to replicate data to a second region. Paired
with a DR plan to handle regional service outages, failover to a secondary endpoint, in
the event the primary becomes unavailable, is also an important facet of application and
service continuity.

Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
GA. You can initiate the failover process for an account if the primary endpoint becomes
unavailable. Once a failover completes, the secondary endpoint becomes the active
primary endpoint for the account and clients can resume writing. One item of note is the
risk of data loss: data is replicated to the secondary region asynchronously, so there is
an inherent delay between a write to the primary region and its replication to the
secondary. This applies to standard geo-redundant storage replication and during
failover operations, so it is important to understand the implications of initiating an
account failover.
Azure File Share also provides snapshot capabilities to capture a share’s state at a point
in time for restore operations. If an application is misconfigured, unintended code is
deployed, or accidental deletion or block damage occurs, a share snapshot can be
restored to the last known good state.

Backup

As previously mentioned, Azure File Share provides snapshot capabilities to capture the
share state at a point in time for restore operations. The Azure Backup service also
provides ability to create a schedule for backup operations. The Azure Backup policy
creates an Azure File Share snapshot that can be viewed by both REST API and SMB.
The same capabilities are available in the client library, Azure CLI and the Azure Portal.

A snapshot of the file share can be utilized for data backup, auditing requirements or
DR activities. After a share snapshot is created either manually or via backup policy it
can be read, copied or deleted but not modified. The entire snapshot object also cannot
be copied to another storage account; however, the contents can be copied via AzCopy
or other copy mechanisms. The current snapshot limit per share is 200; after 200 older
share snapshots must be deleted in order to create new ones.

Share snapshots are incremental, so only data changed since the most recent snapshot
is saved, minimizing creation time and storage costs. Although saved incrementally,
only the most recent snapshot needs to be retained, because each snapshot contains
all the information needed to browse and restore data from the time it was taken.
Snapshots do not count towards the 5TB share limit; there is no limit on how much
space they occupy in total, but storage account limits still apply. A sketch of taking a
manual share snapshot follows.
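
A minimal sketch of taking a manual share snapshot via the data plane; names are hypothetical, and the exact object surface may vary across Az.Storage versions:

$ctx   = (Get-AzStorageAccount -ResourceGroupName "rg-omega-demo" -Name "stomegademo01").Context
$share = Get-AzStorageShare -Context $ctx -Name "projects"

# Snapshot() creates a read-only, incremental point-in-time copy of the share
$snapshot = $share.CloudFileShare.Snapshot()
$snapshot.SnapshotTime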

Availability Sets

Azure File Storage does not offer the Availability Set feature as an available
function of the service offering.

Availability Zones

Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.
Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when implementing
Azure File Storage:

• Automate backups for data recovery where possible, as automated activities are
more reliable than manual processes, improving data protection and
recoverability.

• Azure File Share snapshots only provide file level protection and do not prevent
file share or storage account deletions. To protect a storage account against
accidental deletions a resource group or storage account lock should be
implemented along with RBAC based controls.

• A general-purpose v2 storage account is recommended for most scenarios. An
existing general-purpose v1 or Azure Blob storage account can be upgraded
with no downtime and no need to copy the existing data.

• General purpose storage account utilization from other storage objects affects
Azure File shares in the same storage account so available resources should be
monitored to prevent space constraints.

• Azure FileStorage accounts can only be created utilizing the premium
performance tier, so costs should be considered when provisioning and when
selecting replication options.
6.5.3. Azure Table Storage

Overview

Azure Table storage is a storage service providing a schemaless design that stores
structured NoSQL data providing a key/attribute store. The design allows easy
adaptation of data as requirements for applications mature. Table storage data access
is fast and cost effective for many applications and is traditionally lower in cost than
standard SQL for comparable volumes of data.

Azure Tables are commonly used for:

• Structured data capable of serving scaled web applications up to TB storage


sizes.

• Dataset storage not requiring complex joins, stored procedures or foreign keys,
which can be denormalized for highly efficient access.

• Querying data utilizing a clustered index at extremely fast speeds.

• OData and LINQ queries with WCF Data Service library Access.
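
As a brief illustration of the key/attribute model, the following sketch creates a table,
inserts an entity keyed by PartitionKey/RowKey, and runs a point query. It assumes the
azure-data-tables SDK (v12); the connection string, table name, and entity values are
placeholders.

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string(
    conn_str="<storage-account-connection-string>"
)
table = service.create_table_if_not_exists(table_name="orders")

# PartitionKey and RowKey together form the clustered index.
table.create_entity({
    "PartitionKey": "customer-001",
    "RowKey": "order-1001",
    "Status": "shipped",
})

# Key-based queries are the fastest access path in Table storage.
for entity in table.query_entities("PartitionKey eq 'customer-001'"):
    print(entity["RowKey"], entity["Status"])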
Deployment Options

Azure Table Storage employs multiple deployment options to meet organization and
application specific requirements.

The following are deployment configurations available and corresponding metrics for
Azure Table Storage:

• Storage Performance Tiers

o Standard v2 storage is a shared pool of storage that provides access to
up-to-date Azure Storage features, in which table, blob, queue, and file
shares can be deployed with pricing optimized for the lowest GiB costs.
Standard file shares are available on LRS, ZRS, and GRS.

o Azure Cosmos DB Table API is the premium offering for table storage,
featuring throughput-optimized tables, global distribution, and automatic
secondary indexes.

• Storage account types

o General purpose v2 storage is recommended for block and append blobs,
files, queues, and tables; it provides access to up-to-date Azure Storage
features and the lowest per-GiB capacity prices for Azure Storage.

o General purpose v1 storage is not generally recommended; it is the legacy
account type for block and append blobs, files, queues, and tables and may
not have the lowest per-GiB pricing or the latest features.

High Availability

Data in an Azure Table Storage account is always replicated three times in the primary
region. The following options are available for data replication within primary and
secondary regions:

• Locally Redundant Storage (LRS)

o LRS data is synchronously copied three times within a single physical
location in the primary region; while it is the least expensive replication
option, it is not recommended for applications requiring HA.

o An application is best suited for LRS if it stores data that can easily be
reconstructed in the event of data loss, or if governance requirements
restrict data from traversing country or region boundaries.

• Zone Redundant Storage (ZRS)

o ZRS data is synchronously copied across three Availability Zones in the
primary region and is highly recommended for high availability application
requirements, particularly when paired with replication to a secondary
region.

o Applications and scenarios best suited for ZRS require consistency,
durability, and HA, with low latency and resiliency should data access
become temporarily unavailable. Applications designed for ZRS should,
where possible, include transient fault handling, including retry policies
with exponential back-off.

• Geo Redundant Storage (GRS)

o GRS data is synchronously copied three times within a single physical
location in the primary region utilizing LRS, then asynchronously copied
to a single physical location in the secondary region.

o Applications and scenarios that require read access to the secondary
region should enable Read Access Geo Redundant Storage (RA-GRS).

• Geo Zone Redundant Storage (GZRS)

o GZRS data is synchronously copied across three Azure Availability Zones
in the primary region utilizing ZRS, then asynchronously copied to a single
physical location in the secondary region.

o Applications and scenarios requiring the highest levels of performance,
availability, consistency, and resiliency for DR should utilize GZRS, as
reads and writes can continue even if an Availability Zone becomes
unavailable or unrecoverable.

o Applications and scenarios that require read access to the secondary
region should enable Read Access Geo Zone Redundant Storage
(RA-GZRS).

Disaster Recovery

While unplanned service outages do occur, Microsoft's goal is to ensure the Azure
Service Fabric is always available. If an application or service requires resiliency,
Microsoft recommends utilizing, at a minimum, GRS to replicate data to a second region.
Paired with a DR plan to handle regional service outages, failover to a secondary
endpoint in the event the primary becomes unavailable is also an important facet of
application and service continuity.

Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
GA. With account failover, you can initiate the failover process for an account if the
primary endpoint becomes unavailable. Once a failover is completed, the secondary
endpoint becomes the active primary endpoint for the account and clients can resume
writing. One item of note is the risk of data loss: replication to the secondary region is
asynchronous, so there is an inherent delay between a write in the primary region and
its replication to the secondary region. This applies to standard geo redundant storage
replication as well as during failover operations, so it is important to understand the
implications of initiating an account failover.
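
If customer-initiated failover is part of the DR runbook, it can be scripted. The following
is a minimal sketch, assuming the azure-mgmt-storage and azure-identity packages; the
subscription, resource group, and account names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Long-running operation: promotes the secondary endpoint to primary.
# Note the potential data-loss window described above.
poller = client.storage_accounts.begin_failover(
    resource_group_name="<resource-group>",
    account_name="<storage-account>",
)
poller.result()  # blocks until the failover completes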

An alternative to storage account failover is a manual data copy: Azure Tables data can
be exported via AzCopy and imported into another storage account in a different region.

Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Table Storage Account service, the following method can be leveraged to store
existing configuration templates for redeployment:

1. Export templates of each Azure Storage Account.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.
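
Step 1 can be scripted against Azure Resource Manager. The following is a minimal
sketch, assuming the azure-mgmt-resource and azure-identity packages; names are
placeholders, and the resulting JSON would then be written to a zone/geo-redundant
account per step 2.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Export an ARM template covering every resource in the group that
# holds the storage account.
poller = client.resource_groups.begin_export_template(
    "<resource-group>",
    {"resources": ["*"], "options": "IncludeParameterDefaultValue"},
)
template = poller.result().template  # JSON-serializable template body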

However, a combination of the storage account type and replication options referenced
above should be leveraged to enable backup concepts for Table storage.

Availability Sets

Azure Table Storage does not offer the Availability Set feature as an available
function of the service offering.
Availability Zones

Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when implementing
Azure Table Storage:

• Table storage is low cost, so consider storing the same entity multiple times, with
separate keys, to enable HA and more efficient operations.

• Distribute data by locating table data in a storage account close to the end users.
This technique, along with a geo and zone redundant replication option, enables
both resiliency and efficiency of table operations.

• Select keys that distribute both requests and partitions at any point in time, which
also assists with HA.

• Use the newest version of the Table service client library to improve performance
and overall Table availability and responsiveness.

• If the Storage Account failover process is employed for DR, ensure the full scope
of time constraints, potential data loss, and replication configuration continuity is
understood.

• If the Table Storage account is configured for RA-GRS, native read access to the
secondary endpoint is enabled. Leveraging AzCopy allows data to be copied to a
storage account in an unaffected region, so applications can then be updated to
leverage both read and write operations.
6.5.4. Azure Queue Storage

Overview

Azure Queue Storage stores large numbers of messages that can be accessed
anywhere in the world via authenticated calls over HTTP or HTTPS; KPMG policy,
however, enforces the use of HTTPS. A queue message can be up to 64 KB in size,
and a queue can contain millions of messages, up to the capacity limit of the storage
account. A queue is commonly utilized to build a backlog of work to process
asynchronously.

The following components comprise the Queue service:

• Queues are addressed using the following URL format:

o https://<<app>>.queue.core.windows.net/<<container>>

o For example: https://contosoqueue.queue.core.windows.net/images-to-download

• Queue access, along with other storage types, is executed through the
sponsoring storage account.

• A queue contains a set of messages; the queue name must be all lowercase.

• The maximum message size is 64 KB, regardless of format. With service version
2017-07-29 and later, the time-to-live can be any positive number, or -1 to indicate
that the message never expires. If omitted, the default TTL is 7 days.
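
The following minimal sketch creates a queue, enqueues a never-expiring message,
and processes the backlog. It assumes the azure-storage-queue SDK (v12); the
connection string and queue name are placeholders.

from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    queue_name="images-to-download",  # queue names must be all lowercase
)
queue.create_queue()  # raises ResourceExistsError if the queue already exists

# time_to_live is in seconds; -1 means the message never expires
# (service version 2017-07-29 and later). The default is 7 days.
queue.send_message("https://example.com/image1.png", time_to_live=-1)

# Drain the backlog of work queued above.
for message in queue.receive_messages():
    print(message.content)
    queue.delete_message(message)  # remove after successful processing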
Deployment Options

Azure Queue Storage employs multiple deployment options to meet organization and
application specific requirements.

The following are deployment configurations available and corresponding metrics for
Azure Queue Storage:

• Storage Performance Tiers

o Standard v2 storage is a shared pool of storage that provides access to
up-to-date Azure Storage features, in which table, blob, queue, and file
shares can be deployed with pricing optimized for the lowest GiB costs.
Standard file shares are available on LRS, ZRS, and GRS. If a more robust
messaging service is required, Service Bus should be considered.

• Storage account types

o General purpose v2 storage is recommended for block and append blobs,
files, queues, and tables; it provides access to up-to-date Azure Storage
features and the lowest per-GiB capacity prices for Azure Storage.

o General purpose v1 storage is not generally recommended; it is the legacy
account type for block and append blobs, files, queues, and tables and may
not have the lowest per-GiB pricing or the latest features.

High Availability

Data in an Azure Queue Storage account is always replicated three times in the primary
region. The following options are available for data replication within primary and
secondary regions:

• Locally Redundant Storage (LRS)

o LRS data is synchronously copied three times within a single physical
location in the primary region; while it is the least expensive replication
option, it is not recommended for applications requiring HA.

o An application is best suited for LRS if it stores data that can easily be
reconstructed in the event of data loss, or if governance requirements
restrict data from traversing country or region boundaries.

• Zone Redundant Storage (ZRS)

o ZRS data is synchronously copied across three Availability Zones in the
primary region and is highly recommended for high availability application
requirements, particularly when paired with replication to a secondary
region.

o Applications and scenarios best suited for ZRS require consistency,
durability, and HA, with low latency and resiliency should data access
become temporarily unavailable. Applications designed for ZRS should,
where possible, include transient fault handling, including retry policies
with exponential back-off.

• Geo Redundant Storage (GRS)

o GRS data is synchronously copied three times within a single physical
location in the primary region utilizing LRS, then asynchronously copied
to a single physical location in the secondary region.

o Applications and scenarios that require read access to the secondary
region should enable Read Access Geo Redundant Storage (RA-GRS).

Disaster Recovery

While unplanned service outages do occur, Microsoft's goal is to ensure the Azure
Service Fabric is always available. If an application or service requires resiliency,
Microsoft recommends utilizing, at a minimum, GRS to replicate data to a second region.
Paired with a DR plan to handle regional service outages, failover to a secondary
endpoint in the event the primary becomes unavailable is also an important facet of
application and service continuity.

Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
GA. With account failover, you can initiate the failover process for an account if the
primary endpoint becomes unavailable. Once a failover is completed, the secondary
endpoint becomes the active primary endpoint for the account and clients can resume
writing. One item of note is the risk of data loss: replication to the secondary region is
asynchronous, so there is an inherent delay between a write in the primary region and
its replication to the secondary region. This applies to standard geo redundant storage
replication as well as during failover operations, so it is important to understand the
implications of initiating an account failover.
Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Queue Storage Account service, the following method can be leveraged to store
existing configuration templates for redeployment:

1. Export templates of each Azure Storage Account.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.

Availability Sets

Azure Queue Storage does not offer the Availability Set feature as an available
function of the service offering.

Availability Zones

Due to the standard replication operations across all storage options the HA
configuration options referenced above should be leveraged to further expand
data replication and availability.

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when implementing
Azure Queue Storage:

• Always test the queue service against performance requirements and, where
possible, ensure traffic is well distributed across partitions to avoid sudden
spikes in traffic rates, improving overall availability.

• Design applications to utilize exponential back-off policies for retries to
address workload limitations, avoiding Server Busy (503) or Operation
Timeout (500) errors and improving HA (see the sketch after this list).

• Build HA into applications by utilizing RA-GRS Queue storage replication;
design the application to handle transient or extended-length issues by
reading from the secondary region while primary region access/read
operations are inhibited.

• Understand the account failover activities and service continuity caveats to
further impart HA and DR principles into applications and queue data access.
The account failover service enables paired-region storage account access
to improve BCDR strategies.
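
The following is a minimal, SDK-agnostic sketch of the exponential back-off pattern
recommended above; the helper name and parameters are illustrative, and jitter is
added to avoid synchronized retries.

import random
import time

def with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry `operation` on failure with exponential back-off and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))

# Usage: with_backoff(lambda: queue.send_message("payload"))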
6.5.5. Azure Managed Disks

Overview

Azure Managed Disks are block-level storage volumes utilized by Azure VMs.
Equivalent to physical disks mounted on on-premises physical or virtual machines,
Azure Managed Disks are virtual disks mounted on Azure VMs, similar to VMware
VMDKs and Hyper-V VHD/VHDX files. Azure manages the disk once size and type are
specified during provisioning.

The following are Azure Managed Disk benefits:

• Availability and durability are significant design requirements, providing
five-nines (99.999%) availability.

• 50,000 VM disks of a given type are supported per region per subscription,
enabling the creation of thousands of VMs in a subscription. Virtual Machine
Scale Sets (VMSS), paired with Managed Disks, allow further scaling and HA.

• Managed disks are integrated with availability sets to ensure that the disks of
VMs in an availability set are sufficiently isolated from each other to avoid a
single point of failure. Disks are automatically placed in different storage scale
units (stamps). If a stamp fails due to hardware or software failure, only the
VM instances with disks on those stamps fail.

• Availability Zones are natively supported with Managed Disks to provide HA
during datacenter outages by pairing multiple datacenters within an Azure
Region, providing a 99.99% VM uptime SLA.

• Azure managed disks automatically encrypt your data by default when
persisting it to the cloud. Server-side encryption (SSE) protects your data and
helps you meet your organizational security and compliance commitments.
Data in Azure managed disks is encrypted transparently using 256-bit AES
encryption, one of the strongest block ciphers available, and is FIPS 140-2
compliant. Encryption does not impact the performance of managed disks,
and there is no additional cost for the encryption. You can choose to manage
encryption at the level of each managed disk with your own keys; server-side
encryption for managed disks with customer-managed keys offers an
integrated experience with Azure Key Vault.

• Azure Backup can be utilized to protect against Azure regional outages and
disasters by leveraging custom or built-in policy configurations for time-based
backup and retention, restoring VMs or Managed Disks with disk sizes up to
32 TiB.

Deployment Options

Azure Disk Management employs multiple deployment options to meet organization and
application specific requirements.

The following are deployment configurations available and corresponding metrics for
Azure Managed Disks:

• Storage Performance Tiers

o Standard disks are ideal for cost-effective dev/test workloads, backed by
performant HDDs.

o Premium disks are ideal for production workload VMs, backed by
high-performance, low-latency SSDs, supporting DS, DSv2, GS, and FS
series SKUs.

• Disk performance can be found in the following table:

Premium SSD size                  P1      P2      P3      P4      P6      P10
Disk size (GiB)                   4       8       16      32      64      128
Provisioned IOPS per disk         120     120     120     120     240     500
Provisioned throughput per disk   25      25      25      25      50      100
(MiB/sec)
Max burst IOPS per disk           3,500   3,500   3,500   3,500   3,500   3,500
Max burst throughput per disk     170     170     170     170     170     170
(MiB/sec)
Max burst duration                30 min  30 min  30 min  30 min  30 min  30 min
Eligible for reservation          No      No      No      No      No      No

• High disk throughput and IO, ideal for Big Data, SQL and NoSQL databases,
data warehousing, and large transactional databases, is offered with Storage
Optimized VMs.

o The LSv2 series offers high-throughput, low-latency, directly mapped
NVMe storage running on the AMD EPYC platform with an all-core max
boost of 3.0 GHz. Sizes range from 8 to 80 vCPUs in a multithreaded
configuration, with 8 GiB of RAM per vCPU and one 1.92 TB NVMe SSD
M.2 device per 8 vCPUs, up to a maximum of 19.2 TB.

Note: Managed Disks do not have an SLA themselves; their availability is aggregated
from the SLA of the underlying storage and the attached VMs.

High Availability

Azure Managed Disks are always replicated three times in the primary region based on
the following:

• Locally Redundant Storage (LRS)


o LRS data is synchronously copied three times within a single physical
location in the primary region; while it is the least expensive replication
option, it is not recommended for applications requiring HA.

Managed Disks should be paired with VMs configured with Availability Sets to provide
better reliability by ensuring sufficient isolation between VMs to avoid single points of
failure. This is achieved by automatically placing disks in separate storage fault
domains, which are aligned with the VM fault domain.

Additionally, Recovery Services Vaults and replicated-items configurations can be
leveraged to create replicas of VMs and their Managed Disks in a secondary Azure
region, mirroring the source VM's Managed Disks. Please refer to the Azure Site
Recovery service section for additional details.

Disaster Recovery

While unplanned service outages do occur, Microsoft's goal is to ensure the Azure
Service Fabric is always available. If an application or service requires resiliency,
Microsoft recommends utilizing, at a minimum, GRS to replicate data to a second region.
Paired with a DR plan to handle regional service outages, failover to a secondary
endpoint in the event the primary becomes unavailable is also an important facet of
application and service continuity. Currently, managed disks are not supported with the
GZRS and RA-GZRS replication options.

Azure Storage account failover for ARM-supported GRS and RA-GRS accounts is now
GA. With account failover, you can initiate the failover process for an account if the
primary endpoint becomes unavailable. Once a failover is completed, the secondary
endpoint becomes the active primary endpoint for the account and clients can resume
writing. One item of note is the risk of data loss: replication to the secondary region is
asynchronous, so there is an inherent delay between a write in the primary region and
its replication to the secondary region. This applies to standard geo redundant storage
replication as well as during failover operations, so it is important to understand the
implications of initiating an account failover.

Due to the nature of the Managed Disk service offering, it is intimately tied to VM DR
features and strategies. In addition to IaaS resiliency, Azure Backup enables
redundancy and recovery from major disaster incidents. For disk DR, Azure Backup
should be configured to store in a different geographic region from the primary site. This
methodology ensures backed up data is not affected by the same events affecting the
disks and VMs to which they are attached.
DR considerations should include the following aspects:

• HA of an application enables the continuity of a healthy running state, minimizing
significant downtime. This means the application remains responsive and
accessible even when there are failures in the Azure Service Fabric; for
mission-critical applications, design redundancy into the application along with
the data itself.

• Data durability ensures preservation in the event of a disaster, so additional
measures such as data backup to a different site should be considered. Each
application should be analyzed to help define the overall strategy.

Backup

Azure Managed Disks have multiple avenues of backup/restore and data preservation.
The following are methods for disk backup:

• A disk snapshot is a full, read-only copy of a VM's VHD that can be used on OS
or data disks for backup or for troubleshooting VM issues. If restoring a VM from
a snapshot is part of the backup/restore strategy, the VM should be cleanly shut
down to clear out running processes before the snapshot is taken.

• Incremental snapshots are point-in-time backups that consist only of the changes
made since the previous snapshot; the full VHD is materialized when the snapshot
is downloaded or used (see the sketch after this list). One difference between
regular and incremental snapshots is that incremental snapshots always use
standard HDD storage, regardless of the managed disk storage type, while
standard snapshots can use premium SSDs. The following restrictions should
be noted:

o Incremental snapshots cannot currently be moved between subscriptions.

o SAS URIs can only be generated for up to five snapshots at a time.

o Snapshots cannot be created outside of a disk’s subscription.

o A max of seven incremental snapshots per disk can be created every five
minutes.

o A max of 200 incremental snapshots can be created for a single disk.

• A disk export directly from the VM configuration generates a SAS URI for the disk
object, which can be downloaded and stored in a storage account in a different
region for restore operations. In the event of a catastrophic VM failure, the VHD
can be utilized to deploy a new system from the exported disk.

• Azure Backup can be leveraged to protect a VM and, subsequently, the attached
disks to a Recovery Services Vault, providing application-consistent backups by
utilizing the Volume Shadow Copy Service (VSS) to ensure data is correctly
written to storage. Please refer to the Virtual Machine service section for further
information.
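
The following minimal sketch creates an incremental snapshot of an existing managed
disk, assuming the azure-mgmt-compute and azure-identity packages; the subscription,
resource group, and resource names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

disk = compute.disks.get("<resource-group>", "<disk-name>")

poller = compute.snapshots.begin_create_or_update(
    "<resource-group>",
    "<disk-name>-snap-001",
    {
        "location": disk.location,
        "incremental": True,  # store only changes since the last snapshot
        "creation_data": {
            "create_option": "Copy",
            "source_resource_id": disk.id,
        },
    },
)
print(poller.result().provisioning_state)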

Availability Sets

Azure Managed Disks are intimately tied to Azure VM Availability Sets to provide VM
availability and redundancy by update domain and fault domain assignment, which is
handled by the Azure Service Fabric. Utilization of Managed Disks is highly
recommended when configuring Availability Sets to ensure the disks are sufficiently
isolated to prevent single points of failure. Please refer to the Azure Virtual Machine
service section for further information.

Availability Zones

Azure Managed Disks are intimately tied to Azure VM Availability Zones and expand the
level of control to maintain the availability of applications and data. Each zone comprises
one or more datacenters within an Azure Region, with independent critical resources
(power, cooling, and networking) maintaining the Azure Service Fabric. To ensure
resiliency, a minimum of three separate zones are configured within enabled regions.
The physical separation of the zones within a region protects applications and data from
datacenter failures by utilizing zone redundant services to replicate across Availability
Zones. Please refer to the Azure Virtual Machine service section for further information.

Regions

Please refer to the Appendix for regional availability.

Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Managed Disks:
• Utilize a combination of all BCDR service offerings to tailor a strategy that meets
all business requirements across all IaaS tiers.

• If a service or application can be clustered across VMs, consider adding
additional HA and resiliency by leveraging Azure Shared Disks to enable support
for Cluster Shared Volumes (CSV) and the standard disk reservations required
for failover clustering of disks and services.

• Utilize managed snapshots as a simple option for backing up Managed Disks
attached to VMs. Snapshots exist independently of the source disks and can be
leveraged to create new Managed Disks when rebuilding VMs across all
development and production tiers. Charges are incurred only for the data actually
used, not the provisioned disk size; e.g., a 64 GB disk using only 10 GB will
create a 10 GB snapshot.

• If unmanaged disks are currently deployed and attached to VMs, consider
migrating to Managed Disks to take advantage of features specific to and
dependent on the service offering, such as removing separate storage
management requirements and enhanced reliability when utilizing Availability
Sets.
6.6. Azure ExpressRoute
6.6.1. Overview
Azure ExpressRoute utilizes a connection facilitated by a connectivity provider to extend
on-premises networks to the Azure Service Fabric. ExpressRoute connections do not
traverse the public internet and can establish connectivity via IPVPN (MPLS VPN), P2P
Ethernet, or a virtual Cross-Connect Circuit (CCC) through an ISP at a co-location
facility. This network architecture gives ExpressRoute services faster speeds, more
consistent latency, and higher reliability and security than common connections
traversing the public internet.

Key benefits and Features:

• Access between on-premises networks and Cloud Services (Azure Service
Fabric and M365) via Layer 3 access provided by a connectivity provider, with
flexible configuration options.

• Geopolitical region connectivity to all Microsoft Cloud regional datacenters via
peering locations.

• Global connectivity to Microsoft services across all regions with the
ExpressRoute Premium add-on, facilitating access to all cloud services located
in any datacenter.

• BGP-enabled dynamic routing between Microsoft and on-premises networks,
establishing multiple BGP sessions to support differing traffic profile
requirements.

• Two connections to two Microsoft Enterprise Edge routers (MSEEs) at an
ExpressRoute location, per circuit, are established from the connectivity provider
to the network edge. Microsoft requires dual BGP connections, one to each
MSEE, from the connectivity provider/network edge to ensure connections are
handed off in a redundant manner and the SLA is validated. Redundant
devices/circuits are not required of the consumer; however, connectivity
providers utilize redundant devices to facilitate redundant Layer 3 connectivity
in support of the SLA.

• ExpressRoute Global Reach enables private datacenter connectivity over dual
ExpressRoute circuits by allowing private cross-datacenter traffic to traverse the
Microsoft network.

• Dynamic scaling with multiple bandwidth options allows circuit bandwidth to be
increased from 50 Mbps up to 10 Gbps without connection re-provisioning.

6.6.2. Deployment Options

Azure ExpressRoute offers numerous deployment options when provisioning
ExpressRoute circuits to meet organizational and business requirements. The following
items should be considered when planning for deployment:

• Local/Standard/Premium Circuit and SKU metrics:

Circuit      Local Circuit   Standard Circuit   Premium Circuit   Inbound Data   Outbound Data
Bandwidth    Price/month     Price/month        Price/month       Transfer       Transfer
                                                                  Included       Included
50 Mbps      N/A             $300               $375              Unlimited      Unlimited
100 Mbps     N/A             $575               $675              Unlimited      Unlimited
200 Mbps     N/A             $1,150             $1,300            Unlimited      Unlimited
500 Mbps     N/A             $2,750             $3,150            Unlimited      Unlimited
1 Gbps       $1,200          $5,700             $6,450            Unlimited      Unlimited
2 Gbps       $2,200          $11,400            $12,900           Unlimited      Unlimited
5 Gbps       $3,600          $26,650            $28,650           Unlimited      Unlimited
10 Gbps      $5,500          $51,300            $54,300           Unlimited      Unlimited
6.6.3. High Availability
Designed for HA, ExpressRoute provides carrier-grade private network connectivity to
Microsoft Cloud resources, eliminating single points of failure in the ExpressRoute path.
The service provider's and your own segments of the ExpressRoute circuit should also
be appropriately architected for HA. Redundancy must be maintained within the
on-premises network and must not compromise the redundancy within the service
provider network, to ensure HA.

The following items are key considerations for ensuring HA principles on ExpressRoute
connections:

• ExpressRoute circuits, in Microsoft networks, are configured to operate the
primary and secondary connections in active-active mode. Route advertisements
can be configured to force the redundant connections in the ExpressRoute circuit
to function in active-passive mode, by advertising more specific routes and using
BGP autonomous system (AS) path prepending to make one path preferred over
the other. However, operating both connections in active-active mode is highly
recommended to improve HA: running the primary and secondary connections
active-active means only half of the flows fail and are rerouted following a
connection failure, significantly improving the Mean Time To Recover (MTTR).

• Designed for communication between public endpoints, Microsoft peering is
commonly utilized where on-premises private endpoints are NATed with a public
IP on the customer or partner network before communicating over Microsoft
peering. To ensure HA in the event of a primary or secondary ExpressRoute
connection failure, a common NAT pool is employed before the traffic is split
between the connections. The common NAT pool does not introduce a single
point of failure, because traffic is split from the pool before traversing the primary
or secondary connection; the network layer itself reroutes the packets, enabling
faster recovery following a connection failure.

• When opting for zone-redundant Azure IaaS deployments, configure
zone-redundant VNet Gateways that terminate ExpressRoute private peering.
Combining the protections of a fault domain and an update domain, this enables
Availability Zone aware ExpressRoute VNet Gateways.
6.6.4. Disaster Recovery

While ExpressRoute is designed for HA using the above principles, additional methods
can maintain connectivity when outages cannot be addressed with a single
ExpressRoute circuit. The following items should be considered when designing the
overall BCDR strategy to build out robust backend network connectivity, addressing DR
by utilizing geo-redundant ExpressRoute circuits.

• Multiple ExpressRoute circuits interconnecting the same set of networks
introduce parallel paths between the networks; when improperly architected, this
can lead to asymmetrical routing. Stateful entities in the path, such as NAT
devices and firewalls, could block traffic flows that route asymmetrically;
however, ExpressRoute private peering traffic does not typically traverse stateful
entities, so asymmetrical routing over private peering does not necessarily block
the flow. Nonetheless, if traffic is load balanced across geo-redundant parallel
paths, inconsistent network performance will be experienced regardless of
whether stateful entities are present.

• When designing DR for ExpressRoute connectivity, the following should be
considered:

o Leveraging geo-redundant ExpressRoute circuits.

o Employing multiple service provider networks to delineate ExpressRoute
circuit connectivity.

o Designing HA for each ExpressRoute circuit.

o Terminating each ExpressRoute circuit at a different location on your
network.

• The following techniques can be employed to influence Azure to prefer one
ExpressRoute circuit over another:

o Advertising a more specific route over the preferred ExpressRoute circuit
compared to the other ExpressRoute circuit(s).

o Configuring a higher Connection Weight on the connection linking the
virtual network to the preferred ExpressRoute circuit.

o Advertising routes with a longer AS Path prepend over the less preferred
ExpressRoute circuits.

• VPN can also be leveraged to provide site-to-site (S2S) connectivity as a backup
to ExpressRoute private peering.
Further discussions with Microsoft and network architects will assist in determining
current configurations (network devices, IP space, route tables, physical locations, and
service providers, for example) and requirements to develop the network connectivity
BCDR strategy.

6.6.5. Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure ExpressRoute service, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Azure ExpressRoute object.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.

6.6.6. Availability Sets

Azure ExpressRoute does not offer the Availability Set feature as an available
function of the service offering.

6.6.7. Availability Zones

VPN and ExpressRoute gateways deployed in Azure Availability Zones bring resiliency,
scalability, and higher availability to virtual network gateways; physically and logically
separating gateways within a region protects on-premises network connectivity against
Azure zone-level outages.

6.6.8. Regions

Please refer to the Appendix for regional availability.


6.6.9. Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure ExpressRoute:

• Utilize the unique BGP community values assigned to each Azure region to
optimize routing over ExpressRoute circuits. BGP Local Preference can be used
to influence routing by assigning a higher local preference value to the
corresponding regional network prefix, ensuring that when multiple paths to
Microsoft are available, users prefer the route to the Azure region closest to their
current location.

• Utilize Autonomous System (AS) Path prepending by advertising regional
prefixes on both ExpressRoute circuits relative to the on-premises location. For
example, lengthening the AS Path for the network prefix in US East makes the
ExpressRoute circuit in US West preferred, because the network determines that
the path to the prefix is shorter in the west; conversely, lengthening the AS Path
for the US West prefix makes the network prefer the ExpressRoute circuit in US
East, optimizing traffic for offices in both regions. With this design, Azure
resources can reach, and be reached, even if one ExpressRoute circuit is
experiencing an outage.

• If enabling VNet-to-VNet communications via ExpressRoute circuits, you can
assign a higher weight to the local connection than to the remote connection;
when a VNet receives the prefix of the other VNet on multiple connections, it
prefers the connection with the highest weight for traffic destined for that prefix
(see the sketch below).
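
As one possible illustration of adjusting connection weight, the following minimal sketch
raises the routing weight on the gateway connection to the preferred circuit; it assumes
the azure-mgmt-network and azure-identity packages, and all names are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

conn = network.virtual_network_gateway_connections.get(
    "<resource-group>", "<connection-to-preferred-circuit>"
)
conn.routing_weight = 100  # the higher weight wins when a prefix is learned twice

network.virtual_network_gateway_connections.begin_create_or_update(
    "<resource-group>", conn.name, conn
).result()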

One Platform's implementation of ExpressRoute establishes connectivity through IPVPN
(MPLS VPN). One Platform hosting instances leverage two ExpressRoute circuits
terminated at Microsoft peering locations near the Microsoft Azure paired regions used
for One Platform hosting.

The two ExpressRoute circuits are configured in an active-active setup, whereby each
circuit carries the traffic for its particular hosting location (Azure region). BGP AS path
prepending and BGP weight are configured to allow for failover in the scenario where
one ExpressRoute circuit completely fails.

The ExpressRoute circuits are configured for Azure Private Peering for Virtual Networks
(optional Microsoft Peering and Public Peering are not used).
6.7. Azure Key Vault
6.7.1. Overview
Azure Key Vault centralizes application secrets, stores the corresponding keys and
secrets, enables access and usage monitoring, and natively integrates with the Azure
Service Fabric by providing the following solutions:

• Access tokens, passwords, certificates, API keys, and other secrets are closely
controlled and stored securely in Key Vault.

• The Key Vault can serve as a Key Management System, creating and controlling
the encryption keys used for data encryption.

• Provisioning, managing, and deploying public/private TLS key pairs for use
with Azure.

• Software or FIPS 140-2 Level 2 Hardware Security Modules (HSMs) can protect
secrets and keys.

Applications and users can leverage Key Vault to store and use several types of
secret/key data:

• Multiple key types and algorithms are supported, and high-value objects can be
protected by HSMs.

• Passwords, database connection strings, and application secrets.

• PKI/X.509 certificates.

• Storage account key management.

• Key Vault keys to encrypt data at rest for PaaS services such as Azure SQL,
Storage, and Service Bus.
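
The following minimal sketch stores and retrieves an application secret, assuming the
azure-keyvault-secrets and azure-identity packages; the vault URL, secret name, and
value are placeholders.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)

# Store an application secret (e.g., a database connection string).
client.set_secret("sql-connection-string", "<connection-string>")

# Retrieve it at runtime instead of embedding it in configuration files.
secret = client.get_secret("sql-connection-string")
print(secret.name, secret.properties.version)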

6.7.2. Deployment Options

Azure Key Vault deployments are deeply integrated into the Azure Service Fabric. Due
to this operational integration, it natively supports and fuses with Azure Storage
Accounts, Event Hubs, and Log Analytics.

The Key Vault provides the following guarantees:

• Key Vault transactions are executed within 5 seconds at least 99.9% of the time.

• The following service levels are applicable:

Monthly Uptime Percentage     Service Credit
< 99.9%                       10%
< 99%                         25%

6.7.3. High Availability


Featuring multiple layers of redundancy to ensure key and secret availability, Key Vault
contents are replicated within the region and to a secondary region at least 150 miles
away but within the same geography, maintaining high durability of these objects.

In the event service components within the Key Vault fail, alternate regional
components are initiated to service requests, ensuring minimal to no degradation of
functionality. These activities are autonomous and do not require manual intervention.
In the rare event of an Azure region failure, Key Vault requests are automatically routed
(failed over) to an available secondary region; when primary region access is restored,
the requests are routed back (failed back). Again, this process is autonomous, so no
manual intervention is required.

Key Vault does not require downtime for maintenance due to the HA design, but the
following caveats should be noted:

• During a regional failover, Key Vault requests could fail for a few minutes while
service endpoints are updated.

• A Key Vault is available in read-only mode post-failover and supports the
following request types:

o List Key Vaults

o Get properties of Key Vaults

o List/Get secrets

o List keys

o Get properties of keys

o Encrypt/Decrypt
o Wrap/Unwrap

o Verify

o Sign

o Backup

• A Key Vault becomes available in read and write mode post failback.

6.7.4. Disaster Recovery

The Azure Key Vault offering has DR natively built in via the Azure Service Fabric.
Additionally, Key Vault includes a soft-delete feature supporting the recovery of deleted
vaults and objects, encompassing the following:

• Key Vault deletion recovery.

• Keys, Secrets, and Certificates (Key Vault objects) deletion recovery.

After soft-delete is enabled, Key Vault objects marked as deleted are retained for a
customizable timeframe of up to 90 days (the default configuration), providing a simple
object recovery mechanism. One item of note: an object name cannot be reused until
the soft-delete retention period has expired.

Purge Protection is an optional Key Vault behavior, disabled by default, that can only be
enabled once soft-delete is functional. When Purge Protection is enabled, a vault or
vault object in the deleted state cannot be purged until the retention period has expired,
ensuring deleted vaults and vault objects remain recoverable by enforcing the retention
period configuration.

The following items should be considered when leveraging soft-delete and purge
protection:

• Purging a Key Vault is possible via POST operations by a subscription owner,
which triggers the immediate and non-recoverable deletion of the vault/vault
object, with the following exceptions:

o If an Azure subscription is marked as 'undeletable', only the Key Vault
service may perform the deletion, as a scheduled process.

o If the vault is tagged with the '--enable-purge-protection' flag, the Key
Vault will adhere to the retention policy.

• When a vault is deleted, the Key Vault service creates a proxy resource under
the subscription with the metadata needed for recovery; the proxy is a stored
object available in the same location as the deleted vault.

• The Key Vault service places a deleted vault object in a deleted state, rendering
it inaccessible to retrieval operations; while in this state it can only be listed,
recovered, or permanently/forcefully deleted. In parallel, the underlying data
corresponding to the deleted vault or vault object is scheduled for deletion in
accordance with the retention policy.

6.7.5. Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Key Vault service, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Azure Key Vault.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities/processes in the Dev environment (using non-PROD keys)
to validate export fidelity.

However, Key Vault objects themselves (Keys/Secrets/Certificates) can be directly
downloaded to support restore activities, in addition to the soft-delete functionality. To
ensure DR operational activities are included in the overall BCDR strategy, the
downloaded vault objects should be stored with GRS or higher replication to ensure
the highest availability levels of these objects (see the sketch below).
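
A minimal sketch of downloading a protected backup blob of a secret for archival,
assuming the azure-keyvault-secrets and azure-identity packages; names and file paths
are placeholders.

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)

# backup_secret returns an opaque, encrypted byte blob that can only be
# restored into a vault within the same Azure geography.
backup_bytes = client.backup_secret("sql-connection-string")
with open("sql-connection-string.kvbackup", "wb") as f:
    f.write(backup_bytes)

# Later, restore with: client.restore_secret_backup(backup_bytes)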

6.7.6. Availability Sets

Azure Key Vault does not offer the Availability Set feature as an available function
of the service offering.

6.7.7. Availability Zones

Due to the standard replication operations within the Azure Service Fabric, the
HA configuration options referenced above should be leveraged to further
expand data replication and availability.
6.7.8. Regions

Please refer to the Appendix for regional availability.

6.7.9. Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Key Vault:

• Due to the nature of a Key Vault, access to vaults and vault objects should be
strictly controlled, with objects closely monitored and managed. The following
items are highly recommended to secure access to a vault:

o Deploy an RBAC model to lock down access to the subscription, resource
group, and Key Vault hierarchy.

o Generate and utilize access policies for each vault and vault object.

o Grant access by employing the least-privilege principle.

o Restrict network-level access by configuring Firewall and VNet Service
Endpoints.

• Ensure soft-delete and Purge Protection are enabled to strictly control delete
operations and support quick restores.

• Delineate Key Vaults per application and per environment (Dev/Test, Production)
to avoid sharing secrets across enclaves and reduce breach vector footprints.

• Integrate the vault object (Keys, Secrets, Certificates) backup features into the
BCDR strategy for archival and restore operations.

• The staging and production Key Vaults should be deployed with the Premium
tier, which grants the ability to create RSA-HSM keys.
6.8. Azure Network Security Group
6.8.1. Overview
Azure Network Security Groups contain security rules that allow or deny inbound and
outbound network traffic to and from Azure resources in an Azure VNet. Source and
destination ports and protocols can be specified per rule for multiple Azure resource
types, utilizing default and custom rules to create and customize secured access.

Network security group security rules are evaluated by priority using the 5-tuple
information (source, source port, destination, destination port, and protocol) to allow or
deny the traffic. A flow record is created for existing connections. Communication is
allowed or denied based on the connection state of the flow record.

A network security group contains zero, or as many rules as desired, within Azure
subscription limits. Each rule specifies the following properties:

Property Explanation
Name A unique name within the network security group.
Priority A number between 100 and 4096. Rules are processed in
priority order, with lower numbers processed before higher
numbers, because lower numbers have higher priority.
Once traffic matches a rule, processing stops. As a result,
any rules that exist with lower priorities (higher numbers)
that have the same attributes as rules with higher priorities
are not processed.
Source or destination Any, or an individual IP address, classless inter-domain
routing (CIDR) block (10.0.0.0/24, for example), service
tag, or application security group. If you specify an address
for an Azure resource, specify the private IP address
assigned to the resource. Network security groups are
processed after Azure translates a public IP address to a
private IP address for inbound traffic, and before Azure
translates a private IP address to a public IP address for
outbound traffic. Specifying a range, a service tag, or an
application security group enables you to create fewer
security rules. The ability to specify multiple individual IP
addresses and ranges (you cannot specify multiple service
tags or application groups) in a rule is referred to as
augmented security rules. Augmented security rules can
only be created in network security groups created through
the Resource Manager deployment model. You cannot
specify multiple IP addresses and IP address ranges in
network security groups created through the classic
deployment model.
Protocol TCP, UDP, ICMP or Any.
Direction Whether the rule applies to inbound, or outbound traffic.
Port range You can specify an individual or range of ports. For
example, you could specify 80 or 10000-10005. Specifying
ranges enables you to create fewer security rules.
Augmented security rules can only be created in network
security groups created through the Resource Manager
deployment model. You cannot specify multiple ports or
port ranges in the same security rule in network security
groups created through the classic deployment model.
Action Allow or deny.
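
To make the rule properties above concrete, the following minimal sketch adds an
inbound allow rule to an existing NSG; it assumes the azure-mgmt-network and
azure-identity packages, and all names and address ranges are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = network.security_rules.begin_create_or_update(
    "<resource-group>",
    "<nsg-name>",
    "allow-https-inbound",
    {
        "priority": 310,  # 100-4096; lower numbers are processed first
        "direction": "Inbound",
        "access": "Allow",
        "protocol": "Tcp",
        "source_address_prefix": "10.0.0.0/24",
        "source_port_range": "*",
        "destination_address_prefix": "*",
        "destination_port_range": "443",
    },
)
print(poller.result().provisioning_state)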

6.8.2. One Platform NSG Standards


The new NSG standards for One Platform are summarized in the tables below; all newly
created NSGs will follow these standards:
6.8.3. Deployment Options
Azure Network Security Groups are associated with Subnets or Network Interfaces and,
unless specific requirements exist, should be associated with only one of these objects
at a time, as potentially conflicting rules can cause unexpected communication issues.

All rules will be evaluated based on their priority using these following five types of
information: source, source port, destination, destination port, and protocol.

When a new VM is provisioned, the corresponding Network Interface allows the
provisioning of a default Network Security Group rule to allow inbound access; if no
rule is selected, the following default rules are provisioned:

6.8.4. High Availability

The Azure Network Security Group service is intimately tied to Azure VMs and Azure
Virtual Networks so the HA principles of those service offerings should be reviewed for
further information.
6.8.5. Disaster Recovery

The Azure Network Security Group service is intimately tied to Azure VMs and Azure
Virtual Networks so the DR principles of those service offerings must be reviewed for
further information.

Note: Ensure site recovery configurations are in place, or that explicit NSGs for
secondary regions are created and configured.

6.8.6. Backup

Due to the nature of the service offering and lack of traditional backup operations for the
Azure Network Security Group, the following method can be leveraged to store existing
configuration templates for redeployment:

1. Export templates of each Azure Network Security Group object.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in dev environment to validate export fidelity.

However, due to the nature of the service, Network Security Groups are backed up with
Azure VM backups and replicated when utilizing Azure Site Recovery replication
policies; please refer to both sections for further information.

6.8.7. Availability Sets

Azure Network Security Group does not offer the Availability Set feature as an
available function of the service offering.

6.8.8. Availability Zones

The Azure Network Security group service is intimately tied to Azure VMs and Azure
Virtual Networks so Availability Zone configuration options for those service offerings
should be reviewed for further information.

6.8.9. Regions

Please refer to the Appendix for regional availability.


6.8.10. Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Network Security Groups:

• The One Platform NSG standard should be followed as described above.

• Understand and choose an appropriate scope of association to avoid unintended
access conflicts and to leverage simplicity and predictability in the Network
Security Group design.

• Understand application requirements and design access security models to best
meet them, while also considering service scaling requirements.

• Enable Diagnostic logs and Network Security Group flow logs to monitor
connection attempts and to also provide additional insight for both
troubleshooting and service performance.
6.9. Azure Network Watcher
6.9.1. Overview
Azure Network Watcher provides tools to monitor, diagnose, view metrics, and manage
logs for infrastructure objects in an Azure Virtual Network, helping maintain the network
health of Azure Infrastructure products including Virtual Machines, Virtual Networks,
Application Gateways, Load Balancers, etc.

Note: It is not intended for, and will not work for, PaaS monitoring or web analytics.

Monitoring

• Monitor communication between a virtual machine and an endpoint

• View resources in a virtual network and their relationships

Diagnostics

• Diagnose network traffic filtering problems to or from a VM

• Diagnose network routing problems from a VM

• Diagnose outbound connections from a VM

• Capture packets to and from a VM

• Diagnose problems with an Azure Virtual network gateway and connections

• Determine relative latencies between Azure regions and internet service providers

• View security rules for a network interface

Metrics, Logs

• Analyze traffic to or from a network security group – currently in use in One Platform

• View diagnostic logs for network resources

6.9.2. Deployment Options

Enabling network diagnostic and visualization tools, Network Watcher assists with
diagnosing, gaining insight into, and understanding an Azure network. Network Watcher
is enabled for all subscriptions that contain Virtual Networks. The following are options
for deploying and integrating Network Watcher into an environment:
• Network Watcher can be utilized to assist with diagnosing VM networking issues
by installing the VM agent, which is a requirement for Network Watcher
functionality such as on-demand traffic capture and diagnosis of VM routing
issues. The Next Hop feature retrieves the next-hop type and IP address to help
determine whether traffic is directed to the intended destination or is being
dropped (see the sketch after this list).

• Network Security Group flow logging allows IP traffic flow information to be
logged to an Azure Storage account as well as to SIEM or IDS platforms. This
feature captures and transmits a source of truth for network activity that
ingresses/egresses the VM's NIC or the VNet's subnet, depending on the
association.

• VPN gateway connectivity can be reviewed to troubleshoot and diagnose VPN
gateways and connections by utilizing the corresponding feature in Network
Watcher. The information is logged to an Azure Storage account for further
analysis and retention of IPsec communication and negotiation, a critical
component of VPN connectivity and security.

• Network topology visualization can be generated per VNet via the Network
Watcher Topology feature. This information can be utilized to build network
architecture documentation while also understanding the relationships between
resources within, and between, VNets in a subscription and resource group. The
visualization includes Subnets, VMs, NICs, and NSGs in the hierarchy of a VNet
and can be a valuable tool for both troubleshooting and future architectural
re-alignment and interoperability.
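
The following minimal sketch runs the Next Hop check described in the first bullet,
assuming the azure-mgmt-network and azure-identity packages; names, resource IDs,
and IP addresses are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = network.network_watchers.begin_get_next_hop(
    "<network-watcher-resource-group>",
    "<network-watcher-name>",  # must reside in the same region as the VM
    {
        "target_resource_id": "<vm-resource-id>",
        "source_ip_address": "10.0.0.4",
        "destination_ip_address": "10.1.0.4",
    },
)
result = poller.result()
print(result.next_hop_type, result.next_hop_ip_address)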

6.9.3. High Availability

Azure Network Watcher is a region-specific service enabling monitoring and diagnosis
of network-related conditions, at a scenario level, across all Azure pathing methods.
However, due to this region specificity, HA principles are not a feature of Network
Watcher: should a region experience a service outage, a secondary region's Network
Watcher can only monitor the active region in which it resides. When the primary
region's service is restored, the Network Watcher in that region becomes active again
and resumes operations. Microsoft guarantees that Network Diagnostic Tools will
successfully execute and return a response 99.9% of the time.

6.9.4. Disaster Recovery

The Azure Network Watcher offering has DR natively built in via the Azure Service
Fabric.
6.9.5. Backup

Due to the nature of the service offering and the lack of traditional backup operations
for Azure Network Watcher, there are no methods to export configuration templates, as
the feature set is mostly ephemeral in its scope of execution.

6.9.6. Availability Sets

Azure Network Watcher does not offer the Availability Set feature as an available
function of the service offering.

6.9.7. Availability Zones

Due to the standard replication operations within the Azure Service Fabric,
Availability Zone configurations are not available as a function of the service
offering.

6.9.8. Regions

Please refer to the Appendix for regional availability.

6.9.9. Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Network Watcher:

• Understand the scope of a Network Watcher instance when leveraging for


diagnostic purposes. The region specificity of the Network Watcher service
prohibits the use of diagnostic tools for a service in a different region.

• Utilize the Network Watcher Topology builder feature to both analyze and
troubleshoot network related configurations and to also develop a visual
architecture for both documentation and potential rebuild of high level VNet
hierarchy.
6.10. Azure Virtual Machines
6.10.1. Overview
Azure Virtual Machines (VMs) are one of several types of on-demand, scalable
computing resources that Azure offers. Typically, you choose a VM when you need
more control over the computing environment than the other choices offer. This section
provides information about what to consider before creating a VM, how to create it, and
how to manage it.

An Azure VM gives you the flexibility of virtualization without having to buy and maintain
the physical hardware that runs it. However, you still need to maintain the VM by
performing tasks, such as configuring, patching, and installing the software that runs on
it.

The number of VMs that your application uses can scale up and out to whatever is
required to meet your needs.

6.10.2. Deployment Options

Azure Virtual Machine deployment options cover numerous use cases to meet an
organization’s application and workload requirements for both Windows and Linux VMs.
Please refer to the Appendix for a link to the current SKU availability.

Gen2 VMs are now supported, leveraging UEFI boot architecture versus the traditional
BIOS architecture utilized by Gen1 VMs. Gen2 VMs also have improved boot and
installation times.

Additionally, Azure Virtual Machines can leverage the Virtual Machine Scale Set
(VMSS) feature to create and manage a group of identical, load-balanced VMs. VMSS
enables autoscaling to increase and decrease VM instances in response to demand or
schedule-based configurations, providing high availability to applications and
facilitating central management, configuration, and updating of the scaled instances.
The following table identifies scenario-based benefits of deploying VMSS versus
individual VMs:
Scenario: Add additional VM instances
  Manual group of VMs: Manual process to create, configure, and ensure compliance.
  VMSS: Instances automatically spawn from a central configuration.

Scenario: Traffic balancing and distribution
  Manual group of VMs: Manual process to create and configure Azure Load Balancer or Application Gateway.
  VMSS: Automatically create and integrate with Azure Load Balancer or Application Gateway.

Scenario: High availability and redundancy
  Manual group of VMs: Manually create an Availability Set, or distribute and track VMs across Availability Zones.
  VMSS: Automatic distribution of VM instances across Availability Zones or Availability Sets.

Scenario: Scaling of VMs
  Manual group of VMs: Manual monitoring and Automation.
  VMSS: Azure Autoscale based on host or in-guest metrics, Application Insights, or schedule.

Note: No additional cost is incurred for scale sets themselves; underlying compute
resources such as VM instances, load balancers, and Managed Disk storage are billed at
standard rates.
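
As a concrete illustration of the autoscale behavior summarized in the table above, the following is a minimal sketch using the Azure SDK for Python (azure-mgmt-monitor); the resource group, scale set name, region, and thresholds are placeholder assumptions, not values prescribed by this document.

    # Minimal sketch: CPU-based autoscale rule for a VMSS. Assumes
    # azure-identity and azure-mgmt-monitor; demo-rg, demo-vmss, and the
    # thresholds below are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.monitor import MonitorManagementClient

    sub_id = "<subscription-id>"
    vmss_id = (
        f"/subscriptions/{sub_id}/resourceGroups/demo-rg/providers"
        "/Microsoft.Compute/virtualMachineScaleSets/demo-vmss"
    )
    monitor = MonitorManagementClient(DefaultAzureCredential(), sub_id)

    monitor.autoscale_settings.create_or_update(
        "demo-rg",
        "demo-vmss-autoscale",
        {
            "location": "westeurope",
            "target_resource_uri": vmss_id,
            "enabled": True,
            "profiles": [{
                "name": "cpu-scale",
                "capacity": {"minimum": "2", "maximum": "10", "default": "2"},
                "rules": [{
                    # Scale out by one instance when average CPU > 70%.
                    "metric_trigger": {
                        "metric_name": "Percentage CPU",
                        "metric_resource_uri": vmss_id,
                        "time_grain": "PT1M",
                        "statistic": "Average",
                        "time_window": "PT5M",
                        "time_aggregation": "Average",
                        "operator": "GreaterThan",
                        "threshold": 70,
                    },
                    "scale_action": {
                        "direction": "Increase",
                        "type": "ChangeCount",
                        "value": "1",
                        "cooldown": "PT5M",
                    },
                }],
            }],
        },
    )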

6.10.3. High Availability

Azure Virtual Machines employ multiple methods to design HA into the service offering
by spreading workloads across different VMs to gain high throughput and performance
and to create redundancy in the event a VM instance is negatively impacted.

The following options outline Azure Virtual Machine HA constructs:

• Availability Zones are physically separate zones within an Azure region. Azure
supports three zones per region, each with separate power, network connectivity,
and cooling systems, so if one zone is compromised, replicated apps and data are
instantly available in unaffected zones.

• A Fault Domain is a logical grouping of hardware that shares a common power source and network switch.

• An Update Domain is a logical grouping of hardware that can undergo maintenance
or be rebooted at the same time; distributing instances across Update Domains
ensures at least one instance of an application always remains running while the
Azure Service Fabric undergoes periodic maintenance.

• As mentioned in the Deployment Options section, Virtual Machine Scale Sets
allow the creation and management of a load-balanced VM group, automatically
increasing or decreasing the instance count based on demand or a defined
schedule. VMs in a scale set can be deployed across multiple Update and Fault
Domains and across single or multiple Availability Zones to maximize availability
and resiliency against datacenter outages and planned or unplanned maintenance
events.

6.10.4. Disaster Recovery

Azure Virtual Machines can also leverage multiple avenues for DR operations as part of
the overall strategy to keep applications and workloads online when planned and
unplanned outages occur. Even though entire-region outages caused by widespread
service interruption or a major natural disaster are rare, we must prepare for them by
designing DR into applications and services, as well as into the underlying and
supporting features.

The following features of the Azure Recovery Services vault are key pieces of the DR
strategy:

• Azure Site Recovery ensures business continuity by keeping business
applications and workloads running during outages, replicating workloads
from a primary site to a secondary location. During an outage at the primary
location, a failover can be invoked to move active service endpoints to the
secondary location for access and consumption; the workloads can be failed
back when services at the primary location are restored.

• Azure Backup keeps copies of your data safe for future recovery.

Azure Site Recovery provides the following features to supplement the organization’s
BCDR strategy:

Feature: Simple BCDR solution
  Details: Azure Site Recovery enables replication, failover, and failback from a single management point.

Feature: Azure VM replication
  Details: Azure VM DR can be configured from a primary to a secondary region.

Feature: On-prem VM replication
  Details: Both physical machines and VMs hosted on-prem can be replicated to an Azure region.

Feature: Workload replication
  Details: Supported Azure VMs can have their workloads replicated.

Feature: Data resilience
  Details: Azure Site Recovery orchestrates replication without intercepting application data, storing data in Azure Storage, which provides resilience; when failover occurs, Azure VMs are provisioned from the replicated data.

Feature: RTO and RPO targets
  Details: Maintain organizational recovery time objective (RTO) and recovery point objective (RPO) limits through continuous replication via Site Recovery, and further reduce RTO by integrating with Azure Traffic Manager.

Feature: Keep applications consistent during failover
  Details: Recovery points with application-consistent snapshots can be leveraged for replication, capturing disk data, data in memory, and in-process transactions.

Feature: Testing without disruption
  Details: Quick and nonintrusive DR drills can be executed without affecting production replication.

Feature: Flexible failovers
  Details: Execute planned failovers for expected outages with zero data loss, or unplanned failovers with minimal data loss (depending on replication frequency) for unexpected outages.

Feature: Customized recovery plans
  Details: Customize and sequence failover and recovery plans for multi-tier applications spanning multiple VM instances by grouping machines in a plan and optionally adding scripts and manual actions, which can be integrated with Azure Automation runbooks.

Feature: BCDR integration
  Details: Integrating with other technologies, Site Recovery can supplement SQL AlwaysOn services by protecting SQL Server backends supporting corporate workloads.

Feature: Azure Automation integration
  Details: The Azure Automation library provides application-specific, production-ready scripts that can be downloaded and integrated into Site Recovery.

Feature: Network integration
  Details: Application network management is integrated directly with Site Recovery to reserve IPs, configure load balancers, and leverage Traffic Manager for efficient transitions.
Examples of supported workloads that can be replicated with Site Recovery include:

• AD DS and DNS

• Web apps (IIS and SQL)

• SharePoint Farms

• Windows File Servers

• Citrix XenApp and XenDesktop

• Linux OS and Apps

The Azure Backup and Azure Site Recovery service sections should also be
referenced for additional information and constraints regarding those service offerings.

6.10.5. Backup

Azure Backup is another key feature of the overarching Virtual Machine BCDR strategy,
creating recovery points stored in geo-redundant recovery vaults from which entire VMs
or specific files can be restored. When a backup job is triggered by the Azure Backup
service, the VM extension takes a point-in-time snapshot if the VM is running; if the VM
is not running at the time of backup, it snapshots the underlying storage, since no
application writes are occurring. The Azure Backup service coordinates with the Volume
Shadow Copy Service (VSS) to take consistent snapshots of the VM's disks and then
transfers the data to the vault; to maximize efficiency, the service identifies and transfers
only the blocks of data changed since the previous backup. After the transfer is
complete, the recovery point is created and the snapshot is removed.

The following are supported backup scenarios when utilizing the Azure Backup service:
Scenario: Direct backup of Azure VMs
  Backup: Back up the entire VM.
  Agent: No additional agent is needed on the Azure VM; Azure Backup installs and utilizes the Backup extension via the existing VM agent.
  Restore: Restore as follows:
    - Create a basic VM – useful if the VM has no special configurations, such as multiple IPs.
    - Restore VM disk – restore the disk, then attach it to an existing VM or create a new VM from the disk via PowerShell.
    - Replace VM disk – restore an unencrypted managed disk to an existing VM to replace an existing disk.
    - Restore specific folders – restore files and folders from a VM instead of the entire VM.

Scenario: Direct backup of Azure VM (Windows only)
  Backup: Back up specific files/folders/volumes.
  Agent: Install the Azure Recovery Services (MARS) agent; it can run alongside the VM backup extension to back up the VM at file/folder level.
  Restore: Restore specific files/folders.

Scenario: Back up Azure VM to backup server
  Backup: Back up files/folders/volumes, system state/bare-metal files, and app data to DPM or Microsoft Azure Backup Server (MABS).
  Agent: The DPM/MABS protection agent is installed on the VM; the MARS agent is installed on DPM/MABS.
  Restore: Restore files/folders/volumes, system state/bare-metal files, and app data.
In addition, Azure Managed Disk snapshots can protect data stored on VM disks and
VM configuration templates can be exported to impart additional layers of protection in
the event of outages and catastrophic failures.

6.10.6. Availability Sets

Availability Sets provide a logical grouping capability that isolates VM resources from
each other when deployed, ensuring the VMs span multiple physical servers, compute
racks, storage units, and network switches. This reduces the impact of hardware or
software failures on both individual VMs and the overall solution, and it is essential to
building reliable cloud-based solutions.

For example, for an Azure VM-based solution containing web front-end and back-end
VMs, you should define two Availability Sets before deploying the VMs: one for the web
tier and one for the back-end tier. When creating a new VM, specify the Availability Set
as a parameter to ensure the Azure Service Fabric isolates the VMs across the
spanning parameters, preventing interruption of services in the event of a subsystem
error. Additionally, Azure Advisor can provide recommendations for improving VM
availability by analyzing configuration and usage telemetry. A minimal sketch of
creating such an Availability Set follows.
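
The sketch below uses the Azure SDK for Python; the resource names (demo-rg, web-tier-avset), region, and domain counts are placeholder assumptions.

    # Minimal sketch: create an Availability Set for the web tier. Assumes
    # azure-identity and azure-mgmt-compute; names and region are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

    avset = compute.availability_sets.create_or_update(
        "demo-rg",
        "web-tier-avset",
        {
            "location": "westeurope",
            "platform_fault_domain_count": 2,    # spread across 2 fault domains
            "platform_update_domain_count": 5,   # and 5 update domains
            "sku": {"name": "Aligned"},          # required when using Managed Disks
        },
    )
    # New VMs then reference avset.id in their availability_set parameter.
    print(avset.id)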

When implementing Availability Sets it is highly recommended to use Managed Disks;
if you are currently using unmanaged disks, convert them to Managed Disks to gain
better reliability through aligned Update and Fault Domains for storage and compute
clusters.

Lastly, when implementing multi-tier applications, ensure each tier is deployed using
separate Availability Sets or Availability Zones to guarantee at least one machine per
tier is available during planned or unplanned outages. Leverage Azure Load Balancers
along with Availability Sets or Zones to distribute traffic between the VMs, providing
additional application resiliency.

6.10.7. Availability Zones


You can maintain the level of control over, and availability of, applications and data on
VMs by utilizing the unique physical separation of Availability Zones within an Azure
region, each zone being made up of one or more datacenters with independent critical
services. As previously mentioned, Update and Fault Domains are key features that
complement Availability Zones by effectively distributing VMs across three fault and
update domains, ensuring they are not updated at the same time or affected by a site
outage. Architecting solutions with VMs replicated across zones protects applications
and data from the loss of a datacenter; if one zone is compromised, replicated apps and
data are instantly available in another zone, providing a 99.99% VM uptime SLA. A
minimal sketch of deploying a VM pinned to a zone follows.
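
The sketch below uses the Azure SDK for Python; all names (demo-rg, web-vm-z1, the NIC resource ID), the image, and the size are placeholder assumptions, and a recent azure-mgmt-compute (with begin_create_or_update) is assumed.

    # Minimal sketch: create a zonal VM (zone 1). Assumes azure-identity and
    # azure-mgmt-compute; names, image, and size are hypothetical, and a NIC
    # must already exist at nic_id.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
    nic_id = "<resource-id-of-existing-network-interface>"

    poller = compute.virtual_machines.begin_create_or_update(
        "demo-rg",
        "web-vm-z1",
        {
            "location": "westeurope",
            "zones": ["1"],  # pin the VM (and its managed disks) to zone 1
            "hardware_profile": {"vm_size": "Standard_D2s_v3"},
            "storage_profile": {
                "image_reference": {
                    "publisher": "Canonical",
                    "offer": "UbuntuServer",
                    "sku": "18.04-LTS",
                    "version": "latest",
                },
            },
            "os_profile": {
                "computer_name": "web-vm-z1",
                "admin_username": "azureuser",
                "admin_password": "<strong-password>",
            },
            "network_profile": {"network_interfaces": [{"id": nic_id}]},
        },
    )
    vm = poller.result()  # block until provisioning completes
    print(vm.name, vm.zones)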

6.10.8. Regions

Please refer to the Appendix for regional availability.

6.10.9. Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Virtual Machines:
• Deploy all VM sizes in a single template. To avoid landing on hardware that does
not support all the VM SKUs and sizes you require, include all of the application
tiers in a single template so that they deploy at the same time.

• Separate and configure each application tier into separate Availability Zones
and/or Availability Sets along with load balancer services to provide the highest
resiliency and availability.

• When reusing an existing placement group from which VMs were deleted, wait
for the deletion to fully complete before adding VMs to it.

• If latency is a priority, put VMs in a proximity placement group and the entire
solution in an availability zone. However, if resiliency is a higher priority, spread
your instances across multiple availability zones (a single proximity placement
group cannot span zones).

• Use Availability Zones and Fault Domains together for even greater fault isolation
by specifying the zone and fault domain count if utilizing Dedicated Hosts.

• Utilize Scheduled Events to receive proactive notification and preparation time
before a maintenance window, and to be notified when imminent hardware failure
could impact VMs, allowing you to decide when healing activities are performed
(see the sketch below).
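
As an illustration, the following is a minimal sketch of polling the Azure Instance Metadata Service Scheduled Events endpoint from inside a VM; the endpoint, API version, and Metadata header are the documented IMDS values, while the handling logic and the requests dependency are placeholder assumptions.

    # Minimal sketch: poll Scheduled Events from inside an Azure VM. Assumes
    # the requests package; the IMDS endpoint below is only reachable from
    # within the VM. What you do per event (drain, checkpoint) is up to you.
    import requests

    ENDPOINT = "http://169.254.169.254/metadata/scheduledevents"
    PARAMS = {"api-version": "2019-08-01"}
    HEADERS = {"Metadata": "true"}  # required by the metadata service

    doc = requests.get(ENDPOINT, params=PARAMS, headers=HEADERS).json()
    for event in doc.get("Events", []):
        # EventType is e.g. Reboot, Redeploy, Freeze, Preempt, Terminate.
        print(event["EventId"], event["EventType"], event.get("NotBefore"))
        # After preparing (e.g., draining the node), acknowledge the event
        # to start the maintenance immediately instead of waiting:
        # requests.post(ENDPOINT, params=PARAMS, headers=HEADERS,
        #               json={"StartRequests": [{"EventId": event["EventId"]}]})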
6.11. Azure Virtual Network – VNet
6.11.1. Overview
Azure Virtual Network (VNet) is the fundamental building block for your private network
in Azure. VNet enables many types of Azure resources, such as Azure Virtual Machines
(VM), to securely communicate with each other, the internet, and on-premises networks.
VNet is similar to a traditional network that you'd operate in your own data center, but
brings with it additional benefits of Azure's infrastructure such as scale, availability, and
isolation.

VNet Concepts:

• Address space: When creating a VNet, you must specify a custom private IP
address space using public and private (RFC 1918) addresses. Azure assigns
resources in a virtual network a private IP address from the address space that you
assign.

• Subnets: Subnets enable you to segment the virtual network into one or more sub-
networks and allocate a portion of the virtual network's address space to each
subnet. You can then deploy Azure resources in a specific subnet. Just like in a
traditional network, subnets allow you to segment your VNet address space into
segments that are appropriate for the organization's internal network. This also
improves address allocation efficiency. You can secure resources within subnets
using Network Security Groups. For more information, see Security groups.

• Regions: A VNet is scoped to a single region/location; however, multiple virtual
networks from different regions can be connected together using Virtual Network
Peering.

• Subscription: A VNet is scoped to a subscription. You can implement multiple virtual
networks within each Azure subscription and Azure region.

6.11.2. Deployment Options

With planning, deployment of Azure Virtual Networks and connectivity can be effectively
and efficiently managed to support production requirements. The following deployment
options and recommendations should be considered for Virtual Network deployment
and management (a minimal VNet creation sketch follows the list):
• Virtual Network names must be unique within a resource group but can be
duplicated within a subscription or Azure region, so ensure a standardized
naming convention is developed and adhered to; this assists with overall
management and aligns with an overall governance framework.

• A resource connected to a Virtual Network must exist in the same region and
subscription as the Virtual Network; however, Virtual Networks can be
connected, or peered, so the following considerations should be reviewed to
effectively architect your Virtual Network infrastructure:

o Determine relative latencies between a specified location and the Azure
region to design for the lowest available latency, improving application and
service performance and response.

o If data sovereignty, residency, compliance, or resiliency are requirements,
ensure the Azure region selected for Virtual Network deployment satisfies
any stipulations.

o Not all Azure regions support Availability Zones, so ensure current and
future resources are deployed in regions whose availability zone
capabilities are addressed by your resiliency designs.

• Up to 1000 Virtual Networks can be provisioned within a subscription, so
implementing and adhering to a governance framework will assist with
maintaining structure and logic within your organization and footprint.

• When creating and deploying Virtual Networks the following design questions
should be reviewed:

o Are there any organizational security requirements for traffic isolation into
separate Virtual Networks? When connecting Virtual Networks, Network
Virtual Appliances can control the flow of traffic between the networks.

o Are there any organizational requirements to isolate Virtual Networks into
separate subscriptions or regions?

o How many network interfaces and private IPs are required in a Virtual
Network? Review the limitations of each object per-Virtual Network to
understand and design effective segmentation.

o Do you need to connect Virtual Networks together or to an on-prem
network? Network addressing constraints and limitations should be
reviewed for effective design.
• Virtual Networks can be segmented up to the highest supported number of
subnet objects.

• If an Azure VM requires higher throughput or optimization, enable features such
as Receive Side Scaling (RSS) to reach higher maximum throughput than VMs
without the feature enabled.
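
The following is a minimal sketch of creating a Virtual Network with a single subnet via the Azure SDK for Python; the address space, names, and region are placeholder assumptions, not values prescribed by this document.

    # Minimal sketch: create a VNet with one subnet. Assumes azure-identity
    # and azure-mgmt-network; all names and address ranges are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

    poller = network.virtual_networks.begin_create_or_update(
        "demo-rg",
        "demo-vnet",
        {
            "location": "westeurope",
            "address_space": {"address_prefixes": ["10.10.0.0/16"]},
            "subnets": [
                # Segment the address space; NSGs can be attached per subnet.
                {"name": "app-subnet", "address_prefix": "10.10.1.0/24"},
            ],
        },
    )
    vnet = poller.result()
    print(vnet.name, [s.name for s in vnet.subnets])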

6.11.3. High Availability

Azure Virtual Network is a region-specific service, enabling connectivity for resources
attached in the same region. However, due to this region specificity, HA principles are
not a feature of Azure Virtual Networks. Should a region experience a service outage,
a secondary region can only utilize its own Virtual Network during failover activities.

6.11.4. Disaster Recovery

As a logical representation of your network in the cloud, a Virtual Network lets you
define an IP address space and segment it into subnets while acting as a trust boundary
for compute resources such as Azure VMs and Azure App Services. The Virtual Network
allows direct, private IP communication between attached resources, but it is created
within the scope of a single region; while you can create Virtual Networks with the same
address space in two separate regions, you cannot connect them, because connected
networks require non-overlapping address spaces.

In the event of an Azure regional outage, you should utilize Azure Site Recovery
services both to replicate primary-region resource configurations (VMs, Disks, VNets,
NSGs) to a secondary region and to execute failover and recovery plans, ensuring
continuity of operations during the primary-region outage. This failover is supported by
the underlying replication of resources, so the testing functionality should be employed
to validate replicated-object fidelity as part of your BCDR strategy.

6.11.5. Backups

Due to the nature of the service offering and the lack of traditional backup operations for
Azure Virtual Network, the following method can be leveraged to store existing
configuration templates for redeployment (a minimal sketch follows the list):

1. Export templates of each Virtual Network object.

2. Store exported templates in a Zone/Geo-redundant Blob/File account.

3. Test restore activities in a dev environment to validate export fidelity.
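
The following is a minimal sketch of steps 1 and 2 using the Azure SDK for Python (a recent azure-mgmt-resource and azure-storage-blob); the resource group, storage connection string, and container name are placeholder assumptions.

    # Minimal sketch: export a resource group's ARM template and store it in
    # a geo-redundant Blob container. Names below are hypothetical.
    import json
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient
    from azure.storage.blob import BlobClient

    resources = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Step 1: export every resource in the network resource group as an ARM
    # template (filter the list to VNet resource IDs if preferred).
    result = resources.resource_groups.begin_export_template(
        "demo-network-rg",
        {"resources": ["*"], "options": "IncludeParameterDefaultValue"},
    ).result()

    # Step 2: upload the template JSON to a (geo/zone-redundant) Blob account.
    blob = BlobClient.from_connection_string(
        "<storage-connection-string>",        # GRS/ZRS account recommended
        container_name="vnet-templates",
        blob_name="demo-network-rg.json",
    )
    blob.upload_blob(json.dumps(result.template, indent=2), overwrite=True)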

6.11.6. Availability Sets

Azure Virtual Network does not offer the Availability Set feature as an available
function of the service offering.

6.11.7. Availability Zones

Due to the standard replication operations within the Azure Service Fabric,
Availability Zone configurations are not available as a function of the service
offering.

6.11.8. Regions

Please refer to the Appendix for regional availability.

6.11.9. Deployment Guidelines and Recommendations

The following are some guidelines and best practices to consider when leveraging
Azure Virtual Networks:

• Develop, implement, and test replication utilizing Azure Site Recovery to validate
your BCDR strategy.

• Logically segment subnets within a Virtual Network into smaller, more
manageable sizes for both management and service-endpoint design.

• Restrict broad range assignment for access rules when attaching Network
Security Groups to subnet objects and Network Interfaces.

• Implement network access controls between subnets to protect against
unsolicited traffic.

• Avoid small Virtual Networks to ensure simplicity and flexibility for growth.

• Integrate Application Gateways, Load Balancers, and Traffic Manager to
augment Virtual Networks and connected resources, designing HA and
resiliency into your applications and service endpoints alongside Azure Site
Recovery replication.

• Utilize Azure Virtual Network service endpoints to extend the identity of the
Virtual Network and its private address space to Azure services. You can secure
critical Azure service resources by utilizing endpoints, keeping traffic on the
Azure backbone network (see the sketch below).
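
The following is a minimal sketch of enabling a Microsoft.Storage service endpoint on an existing subnet via the Azure SDK for Python; the names and address prefix are placeholder assumptions and must match the existing subnet, since the update restates the subnet's properties.

    # Minimal sketch: add a Microsoft.Storage service endpoint to a subnet.
    # Assumes azure-identity and azure-mgmt-network; the update must restate
    # the existing address_prefix, and all names are hypothetical.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

    subnet = network.subnets.begin_create_or_update(
        "demo-rg",
        "demo-vnet",
        "app-subnet",
        {
            "address_prefix": "10.10.1.0/24",
            # Traffic to Azure Storage now stays on the Azure backbone.
            "service_endpoints": [{"service": "Microsoft.Storage"}],
        },
    ).result()
    print([e.service for e in subnet.service_endpoints])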
6.12. Non-Azure Services
6.12.1. Palo Alto – KNET-Facing

Overview
The KNET-facing Palo Altos have been deployed in all OPCS Regional and Satellite
instances. The Regional and Satellite instances are based on the concept of Hub and
Spoke topologies and built in specific Azure regions, whereby the Hub VNET is part of
the Core Services subscription and the Spoke VNETs are part of the Customer
subscriptions. Placement of the KNET-facing Palo Altos is in the Core Services
subscription, the Hub of the topology.

Regional and Satellite instances often consist of two hosting locations, referred to as the
Primary and the Secondary (or DR) hosting location. These hosting locations leverage
the Azure paired-regions concept: one Hub and Spoke topology exists in the primary
location/region, while the other exists in the secondary location/region. Connectivity
between the two locations/regions is facilitated through the Express Route layer.

The typical deployment pattern for KNET-facing Palo Altos is referred to as the “Active-
Active One-armed model”: the firewalls are placed behind an Azure ILB (Internal Load
Balancer) for resiliency and horizontal scaling. This setup allows multiple active firewalls
to run in a single pool of VMs without the need to configure SNAT (Source Network
Address Translation).

The following traffic patterns are inspected by the firewalls:

• Traffic between KNET and Azure.

• Traffic between the Hub and the Spoke VNets (within a single hosting
location/region).

• Traffic between the Spoke VNets (within a single hosting location/region).

• Traffic between Primary and Secondary hosting locations (within a single
Instance).

Configuration of UDRs (User Defined Routes) is the mechanism that places the Palo
Alto firewalls in the data path. This configuration is required in both directions of the
traffic flows; a minimal sketch follows.
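
The following is a minimal sketch of a UDR that steers traffic through a firewall, using the Azure SDK for Python; the resource group and route table names, the 0.0.0.0/0 prefix, and the firewall ILB frontend IP (10.0.0.4) are placeholder assumptions.

    # Minimal sketch: route table sending default traffic to the firewall ILB.
    # Assumes azure-identity and azure-mgmt-network; names and the next-hop IP
    # are hypothetical. Associate the table with each subnet whose traffic
    # must be inspected (both directions of a flow need UDRs).
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    network = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

    network.route_tables.begin_create_or_update(
        "demo-hub-rg",
        "spoke-to-firewall-rt",
        {
            "location": "westeurope",
            "routes": [{
                "name": "default-via-firewall",
                "address_prefix": "0.0.0.0/0",
                "next_hop_type": "VirtualAppliance",
                "next_hop_ip_address": "10.0.0.4",  # firewall ILB frontend
            }],
        },
    ).result()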
The diagram below shows a logical representation of the deployment in a single hosting
location/region:

Deployment Guidelines and Recommendations

The following guidelines should be used for deployment:

• Premium and Managed disks.

• BYOL (Bring Your Own license) instead of PAYG (Pay As You Go).

• PANOS version 9.1 (or later) allowing for Accelerated Networking (= higher
throughput).

• One-Armed setup (single data Interface and Zone) for resiliency and horizontal
scaling.

• Standard ILB to load balance HA-ports (all TCP and UDP ports).

• Availability Set for resiliency (different fault and update domains).

• DR pre-provisioned (Secondary / DR location is actively running traffic).


• GSOC managed Panorama for firewall management and backup.

• Highly available Express Route connectivity in-between Primary, Secondary and


On-premises locations.
6.12.2. Palo Alto – Internet-Facing

Overview
The Internet-facing Palo Altos have been deployed in all OPCS Regional and Satellite
instances. The Regional and Satellite instances are based on the concept of Hub and
Spoke topologies and are built in specific Azure regions, whereby the Hub VNET is part
of the Core Services subscription and the Spoke VNETs are part of the Customer
subscriptions. Placement of the Internet-facing Palo Altos is in the Core Services
subscription, the Hub of the topology.

Regional and Satellite instances often consist of two hosting locations, referred to as the
Primary and the Secondary (or DR) hosting location. These hosting locations leverage
the Azure paired-regions concept: one Hub and Spoke topology exists in the primary
location/region, while the other exists in the secondary location/region. Connectivity
between the two locations/regions is facilitated through the Express Route layer.

The typical deployment pattern for Internet-facing Palo Altos is referred to as the “Two-
armed model”: the firewalls are placed behind an Azure ELB (External Load Balancer)
for resiliency and horizontal scaling. This setup allows multiple active firewalls to run in
a single pool of VMs, and it requires the configuration of SNAT (Source Network
Address Translation) for inbound traffic so that the return traffic for individual sessions
flows through the same firewall (symmetric traffic patterns).

The following traffic patterns are inspected by the firewalls:

• Inbound traffic from the Internet.

• Outbound initiated traffic to the Internet.

DNS (Domain Name Services) in combination with Public IP addresses on the ELB
(External Load Balancer) is the mechanism that places the Palo Alto firewalls in the data
path for inbound traffic. Configuration of UDRs (User Defined Routes) is the mechanism
that places the Palo Alto firewalls in the data path for outbound-initiated sessions.

The diagram below shows a logical representation of the deployment in a single hosting
location/region for Inbound traffic:
Deployment Guidelines and Recommendations

The following guidelines should be used for deployment:

• Premium and Managed disks

• BYOL (Bring Your Own license) instead of PAYG (Pay As You Go)

• PANOS version 9.1 (or later) allowing for Accelerated Networking (= higher
throughput)

• Two-Armed setup (two data Interfaces and Zones) for resiliency and horizontal
scaling

• Standard ELB (External Load Balancer) to load balance Inbound traffic

• Standard ILB (Internal Load Balancer) to load balance Outbound initiated traffic

• Availability Set for resiliency (different fault and update domains)


• DR pre-provisioned (Secondary / DR location is actively running traffic)

• GSOC managed Panorama for firewall management and backup


6.12.3. Palo Alto – Network (Panorama) Firewall Management

Overview

All OPCS Palo Altos are managed through GSOC Panorama. GSOC Panorama is
located in on-premises DXC data centers in Amsterdam and Atlanta.

GSOC is in the process of migrating its Panorama service from on-premises to Azure.
The Azure-hosted Global Panorama consists of two Linux-based VMs working in High
Availability mode (primary-secondary topology) with a configuration-synchronization
and heartbeat-exchange mechanism over the network. By default, the VM holding the
primary role is hosted in the West Europe region and the secondary is hosted in the
East US region.

The geographical-redundancy architecture includes an automatic DNS-based failover
mechanism for the Global Panorama Graphical User Interface (GUI) URL
(https://imsspanorama.kworld.kpmg.com/), currently based on the global DNS service
(the F5 Global Traffic Manager (GTM) platform from ITS Global). The remote PAN
firewalls are configured with the IP addresses of both the Primary and Secondary
Global Panorama VMs for redundancy purposes.

Deployment Guidelines and Recommendations

The following guidelines should be used for deployment:

• Premium and Managed disks.

• Availability Set for resiliency (different fault and update domains).

• DR pre-provisioned (Secondary / DR location is actively running traffic).

• Panorama PANOS version must be higher than or equal to the version[s] running
on the connected firewalls.
High-level design (diagram summary): the primary Panorama VM, go2azrapp074
(imsspanorama01.kworld.kpmg.com), is hosted in West Europe (GO-M-GSOC EMA)
and the secondary, go2azrapp072 (imsspanorama02.kworld.kpmg.com), in East US
(GO-M-GSOC AMR). Configuration sync and heartbeat run between the pair over
TCP 28/SSH, with ICMP keepalives. Both VMs connect to KNet via ExpressRoute
(subnets 10.82.100.0/25 and 10.178.26.160/27); remote PAN firewalls at KPMG
Member Firms reach Panorama for device management over TCP port 3978/TLS, and
ITS Firewall Admins access the GUI at https://imsspanorama.kworld.kpmg.com/
(global DNS-based automatic failover).
6.12.4. Layer 7 - CA API Gateways

Overview
The CA API Gateway (Layer7) is an enterprise security management solution that
provides centralized management and access control over web applications and web
services, as well as related resources. The gateway is designed to protect web traffic
and mediate communications between Service Oriented Architecture (SOA) clients and
endpoints. KPMG is using CA API Gateways for web application proxy and single sign
on support to secure application access from the internet and intranet. In KPMG, Layer7
gateway provides services including but not limited to: Global Single Sign On, Web
Application Firewall, Multi-Factor Authentication, secure API and API management,
authentication and authorization, secure token service, secure session management
and other related services.

API Gateway engineers have implemented many applications on the gateway, and
most of them are very similar except for minor differences in authentication and
authorization controls, application security controls, etc. This consumes a lot of time
and effort and creates a huge backlog for application delivery. The team has designed
a policy framework for the KPMG Enterprise Application Platform that supports
application registration and enforces security on the gateways, with some caveats
around supporting additional features on the gateways.

With these challenges in view, and to reuse the existing framework, the Dynamic
Application Security Framework was designed to support many different use cases and
applications, expanding as needed for new applications. It allows applications to define
pre-configured policy sets, assign them to services, and control behavior on the fly.
This policy framework is useful for both application proxy and federation services on
the API Gateways.

Layer7 is being deprecated at KPMG and is being replaced by Azure Application Gateway.


[Architecture diagram summary: KPMG managed laptops and mobile devices, external
users, and cloud service providers reach the Layer 7 Gateway over the Internet.
Frontend authentication uses certificates (with CRL checks against the KPMG Global
Certificate Authority), Kerberos, RSA PIN+Token, or network credentials, backed by an
SSO token store and LDAP binds to KWORLD Active Directory in the KPMG
datacenter. Backend authentication to applications such as SharePoint 2013, MySite,
JIRA, Confluence, GMC, eAudit, eMobile, Alex, Account Central, Mobile Central, and an
ADFS server uses Kerberos, SAML, username headers, RSA, or RADIUS.]

Deployment Options

• Web Application Firewall.

• Authentication and Authorization Policy Enforcement.

• Threat detection and mitigation.

• Authentication Protocol Transition for Inbound and Outbound Communications.

• Security Token Service Provider for Federated SSO.

• OAuth 2.0 Authorization Service.

• OpenID Connect for Mobile SSO.


Deployment Guidelines and Recommendations

Factors to consider when integrating Web Application Firewall with SSO:

• Web applications should conform to protections against the OWASP Top 10 threats.

• URL- or Base64-encode any special characters that are considered unsafe.

• Consider hosting static content in CDN and share across multiple applications to
reduce initial load time of the pages.

• Do not block any web request for more than 10 seconds; start responding to the
request within 10 seconds (see the sketch after this list).

o Note: If the application requires more than 10 seconds to respond to a
request, it is recommended to switch the function to an async mechanism to
avoid long-pending requests. Reducing long-pending connections is good
practice in general for all applications in the shared infrastructure.

• Consider using Asynchronous API for long running transactions rather than
blocking the threads.

• Consider using SAMAccountName / Email HTTP headers with Mutual TLS to
identify and authorize users on the web servers while still securing access to the
application.

• Use Kerberos Constrained Delegation when there is an absolute need to access
downstream resources with Kerberos authentication.

• Leverage ready-to-integrate services as much as possible to increase speed of
delivery and save time and money.
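
To illustrate the async recommendation above, the following is a minimal sketch of the 202 Accepted / status-polling pattern using Flask; the framework choice, route names, and in-memory job store are placeholder assumptions, not part of the gateway's requirements.

    # Minimal sketch: respond within the gateway's 10-second budget by
    # accepting long-running work asynchronously (202 + status polling).
    # Flask and the in-memory job store here are illustrative choices.
    import threading, time, uuid
    from flask import Flask, jsonify

    app = Flask(__name__)
    jobs = {}  # job_id -> status (a real app would use a durable store)

    def long_running_task(job_id):
        time.sleep(30)  # stands in for work exceeding 10 seconds
        jobs[job_id] = "done"

    @app.route("/reports", methods=["POST"])
    def start_report():
        job_id = str(uuid.uuid4())
        jobs[job_id] = "running"
        threading.Thread(target=long_running_task, args=(job_id,)).start()
        # Return immediately; the client polls the Location URL.
        return jsonify({"job": job_id}), 202, {"Location": f"/reports/{job_id}"}

    @app.route("/reports/<job_id>")
    def report_status(job_id):
        return jsonify({"status": jobs.get(job_id, "unknown")})

    if __name__ == "__main__":
        app.run()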

Factors to consider for Federated SSO integration:

• Web applications should conform to protections against the OWASP Top 10 threats

• Document Service Provider metadata clearly

• Document SAML Claim requirements clearly

• Consider using ready-to-integrate services as much as possible to increase
speed of delivery and save time and money
• Consider using standard claims as much as possible to avoid custom
integrations

• Test the application as early as possible in the DEV environment to avoid
integration issues and save time delivering the application

Factors to consider when accessing APIs hosted within KPMG:

• In addition to the WAF recommendations, the following are additional
recommendations for Web APIs:

o Consider using OAuth 2.0 Framework for API Security.

o Document any CORS requirements to support API integration.

o Define proper error codes and descriptions for different failure conditions
for better user experience and automation.

Factors to consider when accessing APIs hosted outside KPMG:

• Access APIs through API Gateways for better security and central API
management.

• Use Kerberos or Basic authentication over TLS for access control to simplify API
access from the internal network.

• Let API gateways handle and mediate authentication with third-party API
servers.
Appendix
References
• Overview of the resiliency pillar: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/overview
• Backup and disaster recovery for Azure applications: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/backup-and-recovery
• Using business metrics to design resilient Azure applications: https://docs.microsoft.com/en-us/azure/architecture/framework/Resiliency/business-metrics#workload-availability-targets
• Resiliency checklist for specific Azure services: https://docs.microsoft.com/en-us/azure/architecture/checklist/resiliency-per-service
• SLA for Virtual Machines: https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/v1_9/
• Manage the availability of Windows virtual machines in Azure: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/manage-availability
• What are Availability Zones in Azure?: https://docs.microsoft.com/en-us/azure/availability-zones/az-overview
• Set up disaster recovery to a secondary Azure region for an Azure VM: https://docs.microsoft.com/en-us/azure/site-recovery/azure-to-azure-quickstart
• Business continuity and disaster recovery (BCDR): Azure Paired Regions: https://docs.microsoft.com/en-us/azure/best-practices-availability-paired-regions
• Azure Global Infrastructure: https://azure.microsoft.com/en-us/global-infrastructure/regions/
• Azure Application Gateway: https://docs.microsoft.com/en-us/azure/application-gateway/overview
• Azure Monitor: https://docs.microsoft.com/en-us/azure/azure-monitor/
• Azure Automation: https://docs.microsoft.com/en-us/azure/automation/automation-intro
• Azure Backup: https://docs.microsoft.com/en-us/azure/backup/backup-overview
• Azure Site Recovery: https://docs.microsoft.com/en-us/azure/site-recovery/site-recovery-overview
• Core Azure Storage services: https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction
• Azure Blob Storage: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction
• Azure Table Storage: https://docs.microsoft.com/en-us/azure/storage/tables/table-storage-overview
• Azure Queue Storage: https://docs.microsoft.com/en-us/azure/storage/queues/storage-queues-introduction
• Azure Managed Disk Storage: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/managed-disks-overview
• Azure File Storage: https://docs.microsoft.com/en-us/azure/storage/files/storage-files-introduction
• Azure ExpressRoute: https://docs.microsoft.com/en-us/azure/expressroute/expressroute-introduction
• Azure Key Vault: https://docs.microsoft.com/en-us/azure/key-vault/general/overview
• Azure Load Balancer: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview
• Azure Network Security Group: https://docs.microsoft.com/en-us/azure/virtual-network/security-overview
• Azure Network Watcher: https://docs.microsoft.com/en-us/azure/network-watcher/network-watcher-monitoring-overview
• Azure Traffic Manager: https://docs.microsoft.com/en-us/azure/traffic-manager/traffic-manager-overview
• Azure Virtual Machines – VM and available SKUs: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/overview and https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
• Azure Virtual Network – VNet: https://docs.microsoft.com/en-us/azure/virtual-network/virtual-networks-overview
• Azure Web Apps Services: https://docs.microsoft.com/en-us/azure/app-service/
• Azure SLA: https://azure.microsoft.com/en-us/support/legal/sla/summary/
• Azure Product by Region: https://azure.microsoft.com/en-us/global-infrastructure/services/
© 2020 KPMG International Cooperative (“KPMG International”), a Swiss entity.
Member firms of the KPMG network of independent firms are affiliated with KPMG
International. KPMG International provides no client services. No member firm has any
authority to obligate or bind KPMG International or any other member firm vis-à-vis
third parties, nor does KPMG International have any such authority to obligate or bind
any member firm. All rights reserved.
