
https://docs.microsoft.com/en-us/azure/architecture/
Cloud Computing
Estimated Reading Time: 3 minutes

You have heard lots of things about the cloud. On this page, let's understand each component of the cloud, and how organizations implement the various cloud models. Before going further, let's understand what cloud computing is all about. The generally cited definition (from NIST) reads:

“Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
The various characteristics of cloud computing are as follows:

 On-demand self-service:
The consumer has to be able to provision the service themselves, without any human intervention. The service is provisioned almost instantly. So, an infrastructure using server virtualization that needs an administrator to manually provision a new virtual machine is not cloud. Having to wait days to make a service available to the requester is not cloud.

 Resource pooling:
The resources of the cloud provider are pooled and can be consumed by multiple customers. The
subset of the pool that consists of storage, processing, and networking is assigned to the
consumer and can be
configured when needed/requested.
 Rapid elasticity:
The capacity delivered by the cloud service must easily and quickly be scaled up or scaled down
to meet the changes in demand.

 Measured Service (with pay-per-use characteristics):


The usage of the cloud services must be measured and reported on so that the customer and
the cloud provider have insight into the usage. It must provide reports that can be used for
billing. The pay-per-use characteristic is not a NIST characteristic but seen by Microsoft as
essential. In practice, not all cloud providers have a pay-per-use model.
 Broad network access:
The cloud service must be accessible over the network (Internet) and can be accessed using
different types of clients (like PC, smartphone, or tablet).

Cloud computing services can be categorized into three service delivery models:
• Software as a service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)

The above image represents each component of the Azure public cloud and is easy to understand.

Cloud deployment models


The most commonly used cloud deployment model is public. A public cloud means the service is
run by an organization that is not a part of the organization to which the consumer belongs. The
business objective of a public cloud provider, in most cases, is to make money. Another
characteristic of a public cloud is that it is open to
multiple consumers. This so-called multitenant usage is offered in data centers that are only
accessible to employees working for the operator of the service.
A private cloud is the opposite of a public cloud. Services offered in a private cloud are typically
consumed by a single organization. The infrastructure can be located either on premises or in a
data center owned and operated by a service provider. The provider of the private cloud service
is the IT department. It is also possible that the cloud management is outsourced to a vendor
while the IT department handles the governance. A private cloud, in most cases, exists in large
organizations that have frequent demands for new IT services. Organizations with a lot of
software developers are use cases for private cloud, as developers have frequent requests for
new virtual machines.

Confused with charges on your Azure VM – Let's see how it's calculated
By Aavisek Choudhury Azure Compute  0 Comments

Estimated Reading Time: 1 minute

Recently I have been asked by a few people about Azure VM charges. Let's see how Azure bills for a VM.

Please note that when a new VM spins up, Azure starts charging for the following meters.
 Compute hours
 IP Address hours (if the Public IP is static)
 Data Transfer Out (the data that goes out of the datacenter)
 Standard IO-Block Blob Read, Write, Delete
 Standard Managed Disk Operations
For any VM already built in the portal, please note the following rules of thumb (see the sketch after this list).
 When the VM is up and running, Azure charges for all of the above.
 When the VM is deallocated, Azure charges only for storage (used capacity; e.g. if you have a 1 TB unmanaged disk with 1 GB in use, you are charged for 1 GB. Note this is not applicable to managed disks: if you have a 1 TB managed disk and the VM is deallocated, Azure will still charge for the full 1 TB.)
 When the VM is shut down from inside the OS but not deallocated, you are charged for both storage and compute resources.
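
The difference between "stopped" and "deallocated" is easy to demonstrate with the AzureRM PowerShell module. This is a minimal sketch; the resource group and VM names are placeholders.

# Stop the VM but keep it provisioned ("Stopped" state): compute is still billed
Stop-AzureRmVM -ResourceGroupName "MyRG" -Name "MyVM" -StayProvisioned -Force

# Stop and deallocate the VM ("Deallocated" state): compute billing stops
Stop-AzureRmVM -ResourceGroupName "MyRG" -Name "MyVM" -Force

# Check the current power state (PowerState/stopped vs PowerState/deallocated)
(Get-AzureRmVM -ResourceGroupName "MyRG" -Name "MyVM" -Status).Statuses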
Where to check the billing information?
In the Azure Portal, you can check resource costs from the Cost Management + Billing service.

Step by step monitoring of the ExpressRoute Circuit with Network Performance Monitor – Part I
By Aavisek Choudhury Azure Networking  0 Comments

Estimated Reading Time: 6 minutes

As many of you know, Azure ExpressRoute lets us extend our on-premises infrastructure to the Azure cloud over a private, dedicated connection. With ExpressRoute, you can establish connections to the Microsoft Azure network.
If you are from an Azure pre-sales background or an Azure Architect like myself, you must be aware that in most solutions/designs we generally recommend that customers use an Azure ExpressRoute circuit to connect their on-premises environment to Azure. However, post-deployment there are many challenges customers may face: they may not be able to utilize the complete network bandwidth, latency-related issues, slowness, and many more.

The network utilization below is an example of such a scenario: although the customer has purchased a 500 Mbps pipe, it's underutilized. So monitoring of the ExpressRoute circuit becomes very important post-deployment.

Fig: Express route utilization.

And the best way to monitor the ExpressRoute circuit is to deploy NPM (Network Performance Monitor) from the Azure Marketplace.

A diagram of a typical ExpressRoute Circuit is shown below.

Fig: Express Route Circuit.


The main advantage of ExpressRoute circuits is that they don't go over the public Internet. This
lets ExpressRoute connections offer more reliability, faster speeds, lower latencies, and higher
security than typical connections over the Internet.

Network Performance Monitor (NPM) is a cloud-based hybrid network monitoring solution that
helps you monitor network performance between various points in your network infrastructure,
monitor network connectivity to applications and monitor the performance of your Azure
ExpressRoute.

As per Microsoft, NPM can do the following things.

NETWORK MONITORING FOR CLOUD AND ON-PREMISES ENVIRONMENTS


Monitor network connectivity across cloud deployments and on-premises locations, remote branch
& field offices, store locations and data centers. With NPM’s Performance Monitor, you can:
• Monitor loss and latency across various subnets and set alerts
• Monitor all paths (including redundant paths) on the network
• Troubleshoot transient & point-in-time network issues, which are difficult to replicate
• Isolate slowdowns by identifying problem spots along the network path, with latency data on each
hop
MONITOR NETWORK CONNECTIVITY TO APPLICATIONS
Monitor network connectivity from your users to the applications you care about, determine what
infrastructure is in the path, and where network bottlenecks are occurring. Key advantages of
NPM’s Service Endpoint Monitor are:
• Monitor the network connectivity to your applications and network services from multiple branch
offices/locations
• Determine whether poor application performance is because of the network or because of the
application
To monitor the ExpressRoute circuit, first you need to install and configure the Azure Network Performance Monitor.
First go to the Azure Marketplace and search for Network Performance Monitor.
Click on Network Performance Monitor and click on the Create button.

In the next step you need to choose an OMS workspace.

* OMS is the Operations Management Suite.


Please note that at the time of writing this blog post, you can monitor ExpressRoute circuits in any part of the world by using a workspace that is hosted in one of the following regions:

 West Europe
 West Central US
 East US
 South East Asia
 South East Australia
Also, don't think that you can't monitor ExpressRoute circuits in other regions; the regions above refer only to the workspace location and have nothing to do with ExpressRoute monitoring between other regions.

I have a workspace created in East US, which I am choosing now.

Since I already have an OMS workspace created, I have linked the existing OMS workspace.
In the next step it will submit the deployment.

In the next step you can see that the deployment is successful.

In the last screen you can view the NPM solution which was just deployed in East US.
After the Workspace has been deployed, navigate to the NetworkMonitoring(name) resource
that we created. Validate the settings, then click Solution requires additional configuration.
The next step is to download and configure the agent setup file.
Go to the Common Settings tab of the Network Performance Monitor Configuration page
for your resource. Click the agent that corresponds to your server’s processor from the Install
OMS Agents section and download the setup file.
Next, copy the Workspace ID and Primary Key to Notepad.
From the Configure OMS Agents for monitoring using TCP protocol section, download the PowerShell script. This firewall script will add rules to the Windows Firewall.
The agent must be installed on a Windows Server (2008 SP1 or later).

I have chosen to install it on a Windows 2008 R2 computer. This is an on-premises server which is a database server for Oracle.

Here are the steps to install the NPM agent on the server.
The first step is to run the setup. Once you click on the setup, you will see the following screen.
On the welcome page, click Next.

On the License Terms page, read the license, and then click I Agree.


On the Destination Folder page, change or keep the default installation folder, and then
click Next.

On the Agent Setup Options page, you can choose to connect the agent to Azure Log Analytics
or Operations Manager. Or, you can leave the choices blank if you want to configure the agent
later. After making your selection(s), click Next.
If you chose to connect to Azure Log Analytics, paste the Workspace ID and Workspace Key (Primary Key) that you copied into Notepad in the previous section. Then, click Next.

On the Agent Action Account page, choose either the Local System account, or a Domain or Local Computer account. Then, click Next.
On the Ready to Install page, review your choices, and then click Install.

On the Configuration completed successfully page, click Finish.
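
As an aside, if you need to roll the agent out to many servers, an unattended install is also possible. The command pattern below follows Microsoft's documentation for the Microsoft Monitoring Agent at the time; the exact flags can vary by agent version, and the workspace ID and key are the placeholders you copied earlier.

# Hypothetical unattended install of the Microsoft Monitoring Agent (run from an elevated prompt)
.\MMASetup-AMD64.exe /Q:A /R:N /C:'setup.exe /qn ADD_OPINSIGHTS_WORKSPACE=1 OPINSIGHTS_WORKSPACE_ID=<YourWorkspaceId> OPINSIGHTS_WORKSPACE_KEY=<YourWorkspaceKey> AcceptEndUserLicenseAgreement=1'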


If you want to repair the program, you can rerun the setup.

Choose the Repair button and click Next.


How to verify the agent connectivity?
You can easily verify whether the agents are communicating.

Please go to the Windows Control Panel and open the Microsoft Monitoring Agent.

Click on the Azure Log Analytics tab. In the Status column, you should see that the agent connected successfully to Log Analytics. (A quick PowerShell check is sketched below.)
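
You can also verify from PowerShell itself. A minimal sketch; the service name and the agent's configuration COM object are the standard ones for the Microsoft Monitoring Agent, but verify them on your build.

# Verify the Microsoft Monitoring Agent service is running
Get-Service -Name HealthService

# List the Log Analytics (OMS) workspaces the agent is connected to
$cfg = New-Object -ComObject 'AgentConfigManager.MgmtSvcCfg'
$cfg.GetCloudWorkspaces() | Format-List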
Now open PowerShell on the agent computer and run the script which you downloaded earlier.

The command below will create the firewall rules in the Windows Firewall.

As you can see, a few rules are created.
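
For reference, rules like the ones the script creates can also be added by hand with the built-in NetSecurity cmdlets. This is a minimal sketch assuming NPM's default TCP port of 8084; check the downloaded script for the actual port it uses.

# Hypothetical equivalent of the NPM firewall script: allow the NPM TCP port in both directions
New-NetFirewallRule -DisplayName "NPM Monitoring (Inbound)" -Direction Inbound -Protocol TCP -LocalPort 8084 -Action Allow
New-NetFirewallRule -DisplayName "NPM Monitoring (Outbound)" -Direction Outbound -Protocol TCP -LocalPort 8084 -Action Allow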


The next step is to configure the rule in the Network Security Group (NSG) and to continue with a number of other steps, which I will discuss in my next post, Part II. That's all for today; I hope you like this post and will eagerly await Part II.

Thanks for your time, and have a great day ahead.

Step by step installation of the On-Premises Gateway for the Azure Analysis Service
By Aavisek Choudhury SQL Server  0 Comments
Estimated Reading Time: 3 minutes
Dear friends, in my last two posts (Part I and Part II), I have shown you how to install Azure Analysis Services, and today I will show you the step-by-step installation of the On-Premises Gateway for Azure Analysis Services. To start with, first you need to download the Gateway installer setup from the Azure Portal.
Once downloaded, copy the installation setup to the VM where you are planning to install the Gateway. Please remember the gateway should be in the same VNET where the SQL Server source database resides. Also, the gateway should be installed in a VM which will always be on.

Please click on the installer.


In the next screen you will see the following message.

We have installed the Gateway in the same SQL Server where the source data resides.
We have chosen to install the binaries in the C drive. The next step is to click on the install button.
The next screen will show the following message.
Once the installation is completed you need to register the gateway.

In the next step, authenticate with a domain user ID and password.


The next step is very important: you need to configure the gateway, give it a name, and provide the recovery key.
Please note down the recovery key of the gateway, which you may need to use at a later stage.
You need to configure the gateway in the next step.
Once you click on Done, you will see the following screen.
You can go to Add and Remove Programs and check whether the Gateway is installed or not (or check from PowerShell, as sketched below).
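
A quick way to confirm the gateway is present and running is to check its Windows service. A minimal sketch, assuming the usual service name for the on-premises data gateway (PBIEgwService); verify the exact name in services.msc.

# Check the on-premises data gateway service (service name assumed)
Get-Service -Name PBIEgwService | Select-Object Status, Name, DisplayName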
In case you need to uninstall the gateway, click on the uninstall button.

Once you click on Next, the Gateway will be uninstalled.


Once the uninstallation is completed, you can see the following message.
I hope you will like this post. If you read my earlier posts and this one, you will easily be able to install Azure Analysis Services in your environment. In my next posts I will bring more material on Azure VM low-cost recovery and posts related to ExpressRoute monitoring. Stay tuned till then.

Step by step guide to create and configure Analysis Services in Azure (PaaS) – Part I
By Aavisek Choudhury Azure App Services, SQL Server  0 Comments

Estimated Reading Time: 4 minutes

This post is targeted at BI developers and system admins who are interested in configuring and working with SQL Server Analysis Services in Azure (called Analysis Services in Azure).

What is SQL Server Analysis Services?


SQL Analysis Services is the PaaS instance of SQL Server Analysis Services. It's an analytical data engine which supports business analytics and helps in business decision making. It provides enterprise-grade semantic data models for business reports. You can view the reports in the following client applications:

1. MS Excel
2. MS Power BI
3. Tableau and other data visualization tools.
How are the data models developed?
Data model development is generally carried out in SSDT (SQL Server Data Tools for Visual Studio), which is available as part of the Visual Studio add-on installations. Developers generally build a tabular or multidimensional data model project in Visual Studio, deploy the model as a database to a server instance, set up recurring data processing, and assign permissions to allow data access by end users. When it's ready to go, your semantic data model can be accessed by client applications that support Analysis Services as a data source.

The following tools are also used.

1. SSMS (SQL Server Management Studio)


2. PowerShell
What is Azure Analysis Service?
Azure Analysis Services provides enterprise-grade data modeling in the cloud. It is a fully
managed platform as a service (PaaS), integrated with Azure data platform services.

What is the advantage of choosing Azure Analysis Service instead of SQL Server Analysis
Services?
Azure Analysis Services has many advantages within Azure. As per Microsoft, Azure Analysis Services integrates with many Azure services, enabling you to build sophisticated analytics solutions. Integration with Azure Active Directory provides secure, role-based access to your critical data. Integrate with Azure Data Factory pipelines by including an activity that loads data into the model. Azure Automation and Azure Functions can be used for lightweight orchestration of models using custom code. It's also a complete PaaS solution offered by MS, so it's super easy to deploy and can be scaled out and in.
Will my data be secure with Azure Analysis Services?
As per Microsoft, Azure Analysis Services utilizes Azure Blob storage to persist storage and metadata for Analysis Services databases. Data files within Blob storage are encrypted using Azure Blob Server-Side Encryption (SSE). When using DirectQuery mode, only metadata is stored; the actual data is accessed from the data source at query time.

Let's get our hands dirty and see how we have configured Azure Analysis Services.
If you go to All Services and start typing 'analysis', it will show Analysis Services as you can see below.
Once you click on Analysis Services you can see the following screen.

You can click on the + Add button above to configure Analysis Services.
The next step is to click on Create.
Please remember, if you don't want to use an existing storage account, you can create a new storage account and should add a container in it for the backup.

As you can see, for the blob storage container which I created, I have kept the name as 'backup'.

Once the Analysis Services instance is ready, you can view the following screen.
The next step is to view what has been created.

The above screen will show the Analysis Services instance which has been created. The server name is the one required to connect to this Analysis Services instance from VS SSDT or Power BI.
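
For reference, Azure Analysis Services server names follow a URI scheme. A hypothetical example (the region and server name are placeholders):

asazure://westus.asazure.windows.net/myanalysisserver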

For connecting from SSDT, you need to download and install SSDT from the internet. Here is the download URL for SSDT. In my next post I will show how we can connect to this server from SSDT and import the data. In the next post we also need to bypass an error related to connecting to a SQL Server data source located in an IaaS VM in Azure, which will lead to the installation of the unified gateway. The step-by-step installation of the gateway I will also cover in my next post. Stay tuned till then.

Step by step guide to create and configure Analysis Services in Azure (PaaS) – Part II
By Aavisek Choudhury SQL Server  0 Comments
Estimated Reading Time: 6 minutes
In my last post I showed you how to create and configure Analysis Services in Azure. Today I will show you how to connect to the Analysis Services instance from Visual Studio SSDT (SQL Server Data Tools).
In my lab environment, the initial architecture was as follows.

Fig: Azure Analysis Service Initial Architecture.

The steps to connect to Azure Analysis Services are shown below.

Open SSDT (SQL Server Data Tools) from your program files and create a new project.
You need to select the third option, Analysis Services Tabular Project.

In the next step you need to provide the URL of the Analysis Services instance which we created in my last post.
Once you click on Test Connection, it should show that the test connection succeeded.

Visual Studio will create the Tabular Project.

In the next step we need to connect a SQL Server data source from which we will fetch the data from a test table for the Analysis Services. In our case we have a data source in SQL Server which resides in an IaaS VM in Azure.
In the next step you need to provide the connection name and the SQL Server instance name to connect.
The next step is where you need to provide the impersonation information.
The next step will be as below, where it will show the list of tables from which you can choose to import the data.
In the next step, the table import wizard will show the table name.
And when you have reached the last step, thinking it will be successful,
you will get this error, which I mentioned in my last post.
The above error is confusing, since it indicates 'On-Premise Gateway is required to access the data source'. Since our SQL database is located in an Azure VM, we were confused about why it's complaining. We searched Google and didn't find an answer to this question. Later we decided to deploy the Gateway based on the hypothesis below, which was on our mind.
“SQL Analysis Services treats any IaaS-based SQL data source as an on-premises data source”
The next step is to install the on-premises data gateway. To learn more about the enterprise data gateway, I am going to write a separate post on how to create an on-premises gateway for SQL Server Analysis Services in my next post on this blog.

Assuming the gateway is created and installed in an IaaS or on-premises VM, you have to create the same on-premises data gateway in Azure as well.

Please follow the below steps to add the gateway in the Azure Portal

Go to Home-> New -> Marketplace -> Enterprise Integration


The next step is to create the gateway; click on the Create button.
Here is the screen you will find once you click on the Create button.
You need to provide the gateway name in the resource name field, and you should choose the same resource group where the SQL Server VM is located.
The next step you can see below.

Once the Gateway is created in the Azure Portal, you need to go to the Analysis Services instance and connect this gateway as shown below. In the Analysis Services instance, please choose the On-Premises Data Gateway, and from the drop-down list you can choose the gateway name.
Once it's configured, we went to SSDT and tried to import the table again. This time we used the VS 2017 data source from the drop-down list, so the UI will be a little different, but it will work with VS 2015 also.

Assuming you have already created the Analysis Services Tabular project by following the steps I showed at the beginning of this article, and you are at the stage where you need to import the data from a table, here is what you need to do.

Select the SQL Server database.


Select the SQL Server instance name.
It will show as below.

The next step is to select the table, and it will show the table data. (For security reasons I can't show the table data.)

In the next step you need to click on the Load button.


In the next step it will normalize the query.

And it will show this screen, where we got stuck last time.


This time, instead of an error, you will get the success message.
In the next step you can see the data model in Visual Studio.

So it looks like the statement below is true:

“SQL Analysis Services treats any IaaS-based SQL data source as an on-premises data source”
So there is a change in the architecture when you need to connect a SQL Server IaaS data source or an on-premises data source. The architecture will look like the below.
Fig: Analysis Services with Gateway to connect SQL Database in an IaaS VM

Conclusion: Azure Analysis Services is a very nice PaaS offering, and very fast and easy to configure. For connecting an on-premises data source, as well as SQL Server data stored in any IaaS VM in Azure, you need the on-premises data gateway. For connecting to the PaaS instance of SQL Server, the Gateway is not a requirement.
I hope you have liked this post, stay tuned for my next post on the gateway installation.

Monitor and Analyze Azure Resources with Azure Monitor and Log Analytics (Real-time examples with a few use cases)
By Aavisek Choudhury Operations Management Suite  0 Comments

Estimated Reading Time: 4 minutes


Dear friends, ensuring your application workloads and data are secure is essential, but it is not enough. Monitoring plays an important role in any infrastructure. For IaaS workloads, Azure monitoring plays a very important role, and this requires almost the same effort as monitoring your on-premises infrastructure.

We need continuous monitoring and analysis to ensure performance and stability aren't negatively impacted by poor network connections or server issues. Here are the three important criteria you should keep in mind while understanding the need for monitoring the Azure infrastructure.

 Getting insight into the health of your VMs


 Correlating and mapping VM dependencies
 Monitoring and troubleshooting applications
Capabilities or components which can be used for monitoring of the Azure resources are as
follows.
 Azure Monitor
 Azure Log Analytics
 Azure Application Insights
 Service Map
In today's post I will discuss Azure Monitor and show a few use cases. The last use case is related to Azure Log Analytics.
What is Azure Monitor?
Azure Monitor collects host-level metrics like CPU, disk, and network usage for all virtual
machines without the need to install or configure any additional agents. With Azure Monitor, you
can visualize, query, route, archive, and take action on the metrics and logs coming from
resources in Azure.

For more insight into a virtual machine, you can collect guest-level metrics, logs, and other
diagnostic data using the Azure Diagnostics agent. You can also send diagnostic data to other
services like Application Insights.

Now let's see a few use cases where I have used Azure Monitor. These use cases are very simple and used only for demo purposes. In a real production environment you may need to work with many other metrics, based on your requirements.

Use Case 1: We need to know whether any DDoS attack recently happened on a public gateway IP.
Please go to the Monitor > Metrics tab.
Select the resource group and select the IP address of the VM.
Select the metric 'Under DDoS attack or not'.

For this particular VM above, we haven't seen any DDoS attack.

Use Case 2: We will check what the incoming bandwidth on a VM network is.
Again we will go to the Monitor > Metrics tab.
We will select the VM and select the metric Network In.
You can also export the data to Excel to examine it at a later time. The Excel file will look like this.

Use Case 3: Monitor the Activity Log

The screenshot below shows how to monitor each operation in all the resource groups.
Use Case 4: Enable diagnostics and guest-level monitoring.
This use case will show how to enable guest-level monitoring of Azure VMs. (A PowerShell alternative is sketched below.)

On the VM blade, go to Monitoring and click on Diagnostics settings.
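
The same setting can be scripted with the AzureRM module. A minimal sketch, assuming you already have a diagnostics configuration XML file and a storage account for the collected data (all names here are placeholders):

# Enable guest-level diagnostics on a VM from a diagnostics config file (hypothetical names)
Set-AzureRmVMDiagnosticsExtension -ResourceGroupName "MyRG" -VMName "MyVM" -DiagnosticsConfigurationPath "C:\configs\diagnostics.xml" -StorageAccountName "mydiagstorage"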


Use Case 5: Use Log Search to chart the average free memory reported for each instance every hour.
This is a sample taken from Azure Log Analytics; you can open Log Analytics from Azure services as shown below.

Go to Log Search.

In the Log Search, run this query.

This is a sample query to chart the average free memory reported for each instance every hour:

Perf
| where CounterName == "Free Megabytes"
| summarize avg(CounterValue) by bin(TimeGenerated, 1h), InstanceName
| render timechart

If you want to know more about how to write analytics queries, you can refer to the query language cheat sheet here.
Conclusion: For the existing infrastructure monitoring folks who were thinking that their jobs are at risk due to massive workloads moving to Azure, I think this is wrong thinking. Azure gives us many new ways to monitor the infrastructure, and the manual effort to monitor the infrastructure will always be there. What we need to do is learn all the aspects of monitoring in the Azure cloud so that we can get the best out of monitoring. I will write more posts on Azure Monitoring and Log Analytics in the future. Stay tuned. Wish you a great day/night ahead.

Disaster Recovery of Azure VM – Step by step configuration guide
By Aavisek Choudhury Azure Compute  3 Comments
Estimated Reading Time: 6 minutes
I think if you are an old-school infrastructure management techie, you must have been part of many DR exercises during your various job roles. If you are a pre-sales techie, in many pre-sales discussions you may have tried to convince your customers that the Microsoft Azure environment is highly reliable and available, so you don't need to set up any DR environment. However, that is hard for the customer to digest because of compliance needs. Compliance requirements such as ISO 27001 still require that you have a provable disaster recovery solution in place as part of a business continuity plan (BCP). For a long time, the questions related to setting up a DR site in another Azure region didn't have any concrete answer, until May 2017, when Microsoft released Disaster Recovery (Preview) of Azure VMs. However, I will say this functionality was still not fully functional at the time, since there was no support for managed disks.

Edit: Managed disks are now fully supported in ASR. Please refer to the article below.
Article for the support of managed disks.
Today we will see how we can configure the disaster recovery step by step.

Configure Azure VM disaster recovery step by step for VMs which have unmanaged disks
I have selected a VM in my lab; the VM is located in West US 2 and has the Windows 2016 operating system.
It's a Windows 2016 Datacenter Server VM; please find the OS version below.

The next step is to go to the Disaster Recovery (Preview) tab, as you can see below.
In the next step you need to configure disaster recovery for this VM.
Select the resource group under which the replicated VM will be created when the VM is failed over.

Select the virtual network in the target region with which the failed-over VM will be associated.

Select the cache storage account; the cache storage account is located in the source region. It is used as a temporary data store before replicating the changes to the target region. By default, one cache storage account is created per vault and reused. You can select a different cache storage account if you intend to customize the cache storage account used for this VM.

Data being replicated from the source VM is stored in replica managed disks in the target region. For each managed disk in the source VM, one replica managed disk is created and used in the target region.
The Recovery Services vault contains the target VM configuration settings and orchestrates replication. In the event of a disruption where your source VM is not available, you can fail over from the Recovery Services vault.

The vault resource group is the resource group of the Recovery Services vault. The replication policy defines the settings for the recovery point retention history and app-consistent snapshot frequency.

The world map below shows the Azure datacenters which we have chosen for the replication. We have chosen to replicate the VM from West US 2 to East US 2.
The next step is to create the Azure resource.

When you check the progress, you can see the deployment is in progress.

In the next step it will show the replication going on for the VM.

Since this is part of ASR (Azure Site Recovery), it will perform the same jobs which are generally done during VM migration. You can find below the jobs which are triggered.

Note: For more details on Azure Site Recovery you can click here.

After some time you can see that Enable Replication has completed.
The replication may take from 15 minutes to a few hours, depending on the size of the VM.

As you can see below, in my case 98% had been completed after 20 minutes.

Since the VM was small, it completed after 25 minutes.


What is RPO? RPO is the Recovery Point Objective.

Recovery Point Objective (RPO) describes the interval of time that might pass during a disruption
before the quantity of data lost during that period exceeds the Business Continuity Plan’s
maximum allowable threshold or “tolerance.”

Example: If the last available good copy of data upon an outage is from 16 hours ago, and the RPO for this business is 20 hours, then we are still within the parameters of the Business Continuity Plan's RPO. In other words, it answers the question: "Up to what point in time could the business process's recovery proceed tolerably, given the volume of data lost during that interval?"

Now I shut down the primary VM just to check the RPO status after two days. After two days the RPO was showing 2 days, as you can see below.
And there is an error saying that replication was halted.
After I started the VM, the replication completed, the data from the primary site and DR site was synced, and the RPO came down.
Run a disaster recovery drill for Azure VMs to a secondary Azure region
To test, I decided to do a test failover.

A test failover configuration is shown below.


The test failover took some time, but it was not very long.
After a few minutes, the failover completed successfully.

Now I can see both VMs, in the primary site and the DR site, running in two different Azure regions.

The next step is to clean up the test failover.

You can add some notes below.


Once you click on OK, it will start the task to delete the VM.
After some time the task will be completed, as you can see below.

That's all for today. I think you will like my post on Azure Disaster Recovery (Preview); I will bring more on BCP and DR in Azure in my future posts. For more details on each replication step, you can click here.
Enjoy the rest of your day!

What to do when you need more public IPs in your Azure Subscription
By Aavisek Choudhury Azure Networking  0 Comments
Estimated Reading Time: 1 minute
Sometimes you need more public IPs for your subscription, and the ARM resource limits are set on a per-region, per-subscription basis. The default limit is 20 public IPs per region in a subscription, and that default applies to each region respectively.

Like any other quota increase request, you need to open a ticket with Azure support and request the additional IPs (you can first check your current usage, as sketched below).
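
Before raising the ticket, it helps to see how close you are to the limit. A minimal sketch with the AzureRM.Network module (the region name is a placeholder):

# Show current public IP usage vs. the per-region limit
Get-AzureRmNetworkUsage -Location "eastus" | Where-Object { $_.Name.Value -eq "PublicIPAddresses" } | Select-Object CurrentValue, Limit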

Azure VM Backup failing: It's time to switch to private preview of disks greater than 1023 GB
By Aavisek Choudhury Azure Storage  1 Comment
Estimated Reading Time: 2 minutes
Recently we faced issues related to the backup of an Azure VM; the VM is a newly built Azure IaaS server. While taking the snapshot backup of the Azure VM, we found that we were unable to take the backup of this VM, and this is the error message that was generated on the screen.
Photo Credit: Royalty free photos from www.twenty20.com
Error:
UserErrorUnsupportedDiskSize – Azure Backup does not support disk sizes greater than
1023GB.
Resolution
In this case you can join the private preview for large disks (>1 TB). The same preview can be used to support faster recovery point creation, and to use those recovery points during restore to improve restore times.

Reference: https://gallery.technet.microsoft.com/Instant-recovery-point-and-25fe398a
How can you enable this feature?
You need to execute the following cmdlets from an elevated PowerShell session:

# 1) Log in to your Azure account
Login-AzureRmAccount

# 2) Select the subscription which you want to register for the preview
Get-AzureRmSubscription -SubscriptionName "Subscription Name" | Select-AzureRmSubscription

# 3) Register this subscription for the private preview
Register-AzureRmProviderFeature -FeatureName "InstantBackupandRecovery" -ProviderNamespace Microsoft.RecoveryServices

It will take around two hours to complete the registration process. You can use the following cmdlet to check the registration status. You need not change anything in your schedule or policy for this to take effect.

# Check the registration status of the preview feature
Get-AzureRmProviderFeature -FeatureName "InstantBackupandRecovery" -ProviderNamespace Microsoft.RecoveryServices
This will not impact your infrastructure. You just cannot perform any change on the backup policy or schedule, as mentioned above.

The cost will be the same for the policies you already have in place.

FAQ

What are the additional costs incurred when signed up for the private preview for support of Azure Backup with large disks (greater than 1 TB)?

Since MS stores snapshots to boost recovery point creation and also to speed up restores, you will see storage costs corresponding to snapshots for a period of 7 days (currently snapshots will be kept for 7 days; this is fixed, and there are plans to make this configurable in coming releases).

Please refer to the doc below, shared by the Azure Backup product team:

https://gallery.technet.microsoft.com/Instant-recovery-point-and-25fe398a
Is there any impact on the production server upon signing up for the private preview for large disk backup on an Azure VM?

There is no impact on the production server.

Any ETA on GA?

It hasn't been announced publicly yet, but as per the product group, GA is tentatively at the end of March 2018.

Conclusion:
In case you have large disks, it's a good idea to switch to the private preview. It's a seamless move and doesn't affect your production environment.

Top 12 ways to optimize the cost in Azure, with a detailed explanation of Azure Reserved VM Instances
By Aavisek Choudhury Azure Compute  0 Comments

Estimated Reading Time: 6 minutes


As an Azure pre-sales or delivery architect, there must be lots of pressure on you to find various ways to save or reduce cost/spend in the cloud. Your KPIs and KRAs may also be tied to the cost savings which you can show at the end of the year. How to save on operational cost in Azure is going to be one of the main headaches of the CIO organization going forward, since for the last couple of years we have seen many organizations move a significant number of their on-premises workloads to Azure/AWS without much planning on the cost-saving part. It may be because of the well-known joke in the Azure world where someone asked the CIO why they were moving to the public cloud and the CIO answered: since everyone is moving, so are we.

Since the original estimates were failing and spend against the Azure budget is overshooting month by month, most public cloud architect jobs will need this important skill as a top in-demand skill in the role. With the introduction of Azure Advisor and Cost Management you can definitely get some insight, but there are many things you should plan well in advance before your next deal which can provide a significant lead against your competitors.

There are many cost-saving measures which you can take, and I have listed the top 12 initiatives below. All of them can be planned well in advance during the planning stage of the deal. Once you win the deal and the project is in the delivery stage, there will be ongoing initiatives to bring down the Azure spend as well.
1. RIs (Reserved Instances), pre-pay, on-demand, Dedicated Hosts, BYOL, etc.
2. Usage duration of the Azure resources. (For example, if you pause Azure Analysis Services, you will not be billed; see the sketch after this list.)
3. Selecting the right storage and using storage pools and striped volumes. (IOPS calculation plays a big role here; the selection of a storage policy, such as when data will move to cool and archive storage, is also important.)
4. Right instance types (B-series VMs, etc.)
5. Estimation of data volumes for the client proposal. (Network capacity planning, ExpressRoute or Site-to-Site VPN, etc.)
6. Turn on/off. (Deallocating the VM when not in use.)
7. Resize instances/change instance types.
8. Scale up/scale down.
9. Conversion to PaaS services.
10. Public cloud waste management. (Rightly tagging resources with the owning department, which introduces accountability for overspending.)
11. Low-priority VMs. (I have already explained this in detail in my older blog post.)
12. HA planning with a single instance with limited allowed downtime.
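
For item 2, pausing and resuming Azure Analysis Services can be scripted with the AzureRM.AnalysisServices module. A minimal sketch; the server and resource group names are placeholders.

# Pause an Azure Analysis Services server to stop compute billing (hypothetical names)
Suspend-AzureRmAnalysisServicesServer -Name "myanalysisserver" -ResourceGroupName "MyRG"

# Resume it when it is needed again
Resume-AzureRmAnalysisServicesServer -Name "myanalysisserver" -ResourceGroupName "MyRG"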
There are other areas also where you can plan very well in advance and they are related to BCP
and DR and the areas related to backup and recovery. In some of my upcoming blog post, I will
discuss more on the Azure.

Based on my experience, one of the major cost reductions can be achieved if we implement Azure Reserved Instances for production and QA workloads. Today I am going to discuss that in detail; in my later blogs I will write more on other cost-saving measures.

What are Azure Reservations (Reserved Instances)? (One of the best options to reduce cost.)
An Azure reservation is a way to pre-purchase your virtual machines (compute usage) for a duration of 1 to 3 years. If you are using a VM 24x7, up to 365 days a year, you can save up to 60-80% of the Azure cost. That is a huge amount of savings, so how can you achieve this, and why is MS giving this discount? There are really two reasons why you get this discount.

1. You are committing for a longer period of time.


2. Capacity planning becomes easier for MS, because they know this many VMs will be needed for a longer commitment.
In this case, you pay upfront and Azure stops charging on an on-demand basis for the instances.

How does it work?


You need to buy the reservations for a region and for a particular VM size. When you prepay, the reservation applies to VMs matching that size in that region. Please note that in this case, even if you switch off the VM, you still need to pay for it. Unlike low-priority VMs, about which I have already written a detailed post here (they are for compute workloads that are bursty), these VM instances should be considered for your 24x7 production or QA workloads which you don't plan to shut down at any time other than a patching window or a major upgrade.
How to find the right candidates for the reserved instances?
Once you migrate your QA and production workloads to Azure, start looking at your usage, see which VM sizes and regions you are using consistently, and determine the right candidates for reserved instances.
What will happen if you need to change the instance during the tenure?
Now there may be a question in your mind: what will happen if you buy reserved instances for 3 years and after 18 months you want to change the size of the VM? In this case, you can exchange the reservations. Unlimited exchanges are possible during the tenure. But there is a catch: the new value of the reservation should be greater than what you are currently paying.
So basically, only an upgrade exchange is possible, not a downgrade.
What will happen if you want to terminate the lease in advance? That is, if you want to terminate the contract before the end of the tenure.
In case you want to cancel early, a 12% early termination fee will be deducted, and there is a limit of up to USD 50K of cancellations in a year. That means you can cancel only up to USD 50K in a year (see the worked example below).
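
As a worked example with hypothetical numbers: suppose the unused remainder of a reservation you cancel is worth USD 10,000. The early termination fee would be 10,000 x 0.12 = USD 1,200, leaving a refund of USD 8,800, and the USD 10,000 cancellation would count against the USD 50K yearly limit.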

What will happen to the VM after the committed term?


Suppose you have opted for a three-year term and have not renewed your VM reserved instances after the 3 years; Azure will again start charging you on the pay-as-you-go model. In other words, it will go back to regular billing.

Where can I find the reserved instances in the Azure Portal?


You can go to All Services in the Azure Portal and click on Reservations.

As you can see, I have already created a reservation.

To create a new reserved VM instance, you can click on Add, and you will see the screen below. Please note that if you select the scope as shared, it will be applicable to all your subscriptions, but if you select a particular subscription, it will apply only to that subscription. I have also noticed that, based on my usage, Azure recommends a VM size, but it's up to me which VM I choose.
The cost will vary from instance to instance; if you choose a larger instance, the cost savings can be 70%, as you can see below.
As you can see above, the operating system will not be discounted by reservations; the discount applies only to the compute usage. Your OS cost will be charged separately. But if you have a Software Assurance agreement, you are an enterprise customer, and you have already paid for the licenses, you can combine the reservations with Azure HUB benefits (the license will be registered via Azure HUB; for more details you can read here).
What is Azure HUB?
The Hybrid Use Benefit (HUB) is available to customers with Enterprise Agreements and Software Assurance. It enables on-premises Windows Server licenses to be leveraged in Azure, which results in Windows VMs costing the same as Linux VMs (since there is no charge for the Windows Server license).

In combination with HUB, you can save up to 80% of the total running cost. To learn about Azure Reserved Instances costs, kindly check the Azure Pricing Calculator here.
That's all for today. I hope you had a good time reading this blog and learned something useful. Have a very good rest of your day.

In my next few blog posts, I will discuss how we can optimize cost in Azure in various other ways. Stay tuned for more.
Top 20 most helpful information/checklist
that any Azure Pre Sales Architect should
keep handy in 2017.
By Aavisek Choudhury Azure Pre-Sales  2 Comments
Estimated Reading Time: 8 minutes
If you are planning to meet your customer for a large transformation and migration deal with an Azure offering, and you are done with all your homework and presentations and ready to crack the deal, hold on… before you make any promises, please spend some time checking these 20 most helpful webpages or URLs, which may make you and your customer happy and your delivery team's life a lot easier in the future. I have compiled this list based on my personal experience, and I hope it will make a big difference during any RFP/RFI, HLD/LLD, or SoW preparation on Azure.

1. Azure Pricing Calculator.

The Azure pricing calculator is something you need at every step of your engagement with the customer; here is the link for the Azure Pricing Calculator.
For Azure CSP, the pricing calculator is available in the CSP portal.
2. Azure Subscription Limits and Quotas.
A must-have URL to know the available quota per subscription, which will help ensure a smooth design during the HLD phase; here is the link: Azure subscription and service limits, quotas, and constraints.
3. Cost control in Azure.
When you are in a deep discussion with the customer, one of the basic questions the customer may ask is what to do if their budget overshoots in Azure, and you should be capable enough to answer this tricky question. Although there are a few 3rd-party products like Cloud Cruiser available in the Azure Marketplace for cost control, they don't have support for Azure CSP. The URL below covers one of the native features in Azure that will serve the purpose without much effort: Setup Billing Alerts in Azure.
4. Running non-supported Windows OS in Azure.
What will happen to my legacy applications running on Windows 2003; can I move them to Azure? This is one of the frequently asked questions you may face during your sessions with your customer, and you should be ready with the answer. First of all, you should know that Windows 2003 VMs are no longer officially supported in Azure; however, you may run them as long as you want, and more details can be found here: Windows 2003 VMs in Azure. A second option is to inform the customer about running them on a designated Hyper-V host in Azure, which can be easily built with the new nested virtualization introduced in Azure.
5. Azure Site Recovery supported scenarios.
Azure Site Recovery is very successful in all types of migration activities to Azure, except for a few areas where it may become a pain at a later stage for the delivery team: when they are in the middle of a migration process, they may discover that a VM or physical machine can't be moved to Azure with the help of ASR due to one or another unsupported scenario. In one of my earlier articles I mentioned the same thing, which you can find here. (Azure ASR limitations which are difficult to bypass.)
In this type of situation, the customer may lose trust in your delivery team, and conflicts may arise between the delivery and pre-sales teams regarding who promised this deliverable to the customer. So it's always recommended and advisable to learn the different scenarios which are supported by the ASR process. Please find below the URLs which can help here.

 Azure to Azure
 On premise to Azure
 ASR FAQ
6. Running Oracle Database in Azure.
Can I run my Oracle databases in Azure, and how can I move large Oracle databases to the cloud? This is also a common question if the enterprise has lots of Oracle databases in their environment. ASR may be used for Oracle databases, but if the Oracle VMs or physical machines are not supported by ASR, it's better to use Oracle Data Guard for the migration. Here is an article which can help you answer some basic questions on Oracle migration to Azure: Supported scenarios and migration options for the Oracle database in Azure.
7. Site connectivity in Azure
Can I connect my existing on-premises sites to Azure, and do I need to invest in new VPN routers and gateways? This is one of the common questions you should be ready to answer for your customer. MS provides a list of the supported VPN routers; however, this list may not cover all the routers available in the market. For example, the TP-LINK router which I am using for my home office is not covered in this list, yet I was able to set up VPN connectivity with Azure. To know more, please click here.
Please find the supported routers here: Supported VPN Routers in Azure.
8. Comparison with AWS.
Please expect a set of questions when you meet your customer about similar offerings from Amazon Web Services, so I suggest you prepare yourself with a high-level product comparison between AWS and Azure. I have recently compiled a head-to-head comparison between the Azure and AWS offerings, and I am sure this comparison is definitely going to help you.
Please find my post below: Azure vs. AWS Head to Head Comparison Q3 2017.
9. Moving resources from one subscription to another.
Now this is an important question if the customer already has some footprint in Azure and there is a chance that you can onboard them to your CSP subscription, or maybe you are advising them on an EA option. The question regarding the movement of resources from one subscription to another is an important one you should be capable of answering in the first place.
Here is a post for that: Move resources from one Subscription to another.
10. Life Cycle Policy of Azure Resources.
Although this question may not be important for some customers, I have seen many customers who wanted to know whether there is any impact on their applications if Microsoft changes the underlying hardware.
A detailed explanation of the Azure life cycle policy can be found in this article: Life Cycle Policy for Azure Resources.
11. Total cost of ownership (TCO) in Azure and in AWS.
This is one of the most discussed topics during the estimation and proposal preparation phase; generally, a Microsoft pre-sales consultant will have already completed this process before the release of the RFP or bid documents, but you should also know about it. And I believe the two URLs below should help you answer any quick question on TCO during your discussion with the customer.
Total cost of Ownership for Azure.
Total cost of ownership for AWS.
12. Azure Stencils.
As an Azure pre-sales architect you will need the Azure Visio and PowerPoint stencils and icon sets, and they are available for download from the Microsoft site, which will help you a lot. This is a must-have tool for successful presentations and for the high-level and low-level designs, and you will need it throughout the bid process and in every new deal in which you participate. Please download the Azure stencils here:
Microsoft Azure, Cloud and Enterprise Symbol / Icon Set – Visio stencil, PowerPoint, PNG, SVG
13. Azure data centre compliance.
When the security folks from the customer ask you many compliance-related questions about Azure, you can point them directly to this URL, and they will get the answers to all their questions. So this URL should be a handy one for you; otherwise there is a big chance that the security team will pour cold water on your presentation and may switch to a different vendor who can convince them better on the security part, and no doubt the security team has an important role in all your deals.
Here is a list of the Compliance of the Azure Data Center.
14. Azure Product Availability by region.
Not all Azure products are available in all Azure regions, so before you promise anything about any particular Azure datacenter, please take a quick look at the URL mentioned below:
Product availability by Regions.
15. Azure Backup – Supported Scenarios.
This is an important area which has to be addressed correctly during the pre-sales bid; otherwise it may again become a pain for the delivery team. For example, recently in one project I found that the pre-sales team had promised an ASR move of Windows 2008 R2 SP1 VMs to Azure because they are very well supported by ASR; however, after the first wave the delivery team found that they couldn't install the Azure Backup agent on the Windows 2008 VMs which are 32-bit, and that resulted in a complete back-out of the ASR move. This kind of situation can give you a bad name during the execution part, so be very careful, and you must add these URLs to your checklist.
Azure Backup-FAQ
Azure VM Backup-FAQ
16. Monitoring – Azure Log Analytics-Supported Data Sources.
And here comes monitoring; this is going to be part of most of your deals, and if you have chosen to prescribe the Azure monitoring solution in your offering, please don't forget to take a quick look at the supported data sources. You should keep in mind that you can't monitor everything with Azure Log Analytics. For example, if the customer wants a monitoring solution for their web applications, you may need to direct them to 3rd-party solutions available in the Azure Marketplace, like AppDynamics etc. For the data sources which are currently supported, you can take a look at the URL below.
Azure Log Analytics Supported Data Sources
17. Azure Reference Architecture.
Whether you are a novice or an expert in on-premises architecture design, this is the time you should spend a few days understanding Azure application architecture. You have to understand that most architecture in the Azure cloud is based on the SRH guidelines, which are nothing but scalability, resiliency, and high availability. The two URLs below should be enough to understand and master the likely target architectures in Azure for your customers.
Azure Architecture Center.
Azure Reference Architecture.
18. Azure Express Route.
Azure ExpressRoute is always a point of discussion in many customer engagements, and many would like to put it in the network team's kitty, but you should be ready with some of the FAQs on Azure ExpressRoute, and here is the URL for that.
FAQ-Azure Express Route
19. Business Continuity and Disaster Recovery in Azure.
Azure BCP or DR is something like the elephant in the room. This is something you need to plan well before the final commitment during the engagement with the customer. If required, please set up a small POC with a small set of applications to validate your concept before finalizing the SoW.
You should also be aware of the common terms used in any DR process, as shown below; these have to be agreed upon by your customer or the application owners. Some of them are listed below. You should know what needs to be recovered in case of DR.

RTO: The recovery time objective (RTO), which is the maximum acceptable length of time that
your application can be offline.

RPO: A recovery point objective (RPO), which is the maximum acceptable length of time during
which data might be lost due to a major incident. Note that this metric describes the length of
time only; it does not address the amount or quality of the data lost.

Here is a list of URLs which are going to help you in this process.

Business Continuity and Disaster Recovery in Azure in the Azure Paired Regions.
Disaster Recovery for the Azure Applications.
High Availability of the Azure Applications.
Designing resilient applications for Azure.
20. What is there in Azure Stack?
This is a question which many consultants have been facing from customers for the last few months, and as an Azure pre-sales architect you should be aware of what is in Microsoft Azure Stack and how you can position it against the other hyper-converged vendors available in the market. Here is an article which will definitely increase your knowledge of Azure Stack:
Key features and concepts in Azure Stack.
That makes the final list of 20, but this is of course not the end; being a player in tough competition, you should constantly stay informed of innovations, new releases, and product reviews in the Azure world to get ahead of others. Hope you will like this post.

Azure regions
Azure has more global regions than any other cloud provider—offering the scale
needed to bring applications closer to users around the world, preserving data
residency, and offering comprehensive compliance and resiliency options for
customers.

 54 regions worldwide
 Available in 140 countries
Azure Site Recovery (ASR) limitations which are difficult to bypass
By Aavisek Choudhury, Azure Site Recovery
Estimated Reading Time: 2 minutes
Recently I have seen issues with the limitations of Azure Site Recovery and tried multiple ways to get past them; eventually they became a roadblock to using ASR for the migration of large Oracle database servers. Here are the two major limitations of ASR that I have faced.
1. ASR doesn't support clustered disks.
2. ASR doesn't support GPT/UEFI disks (as OS disks).
Now, most enterprises will have clustered disks, so the workaround for the above problem is to disable cluster services, create stand-alone VMs, and consider standalone servers for the migration, which may not work for many lift-and-shift migration strategies for large databases and other workloads.

Also, converting a GPT disk to MBR will not work here, because these disks are mostly OS disks and there is no supported way to convert them to an MBR partition.

We considered a workaround: create a VHDX using Disk2VHD (from the GPT disk) and then create a Gen2 VM on Hyper-V, thinking that ASR would work in this case. However, we found that Windows Server 2008 R2 isn't supported as a Generation 2 VM, so we were not able to proceed further unless we upgraded the OS, which was a no-go for the application owners.

So if you are planning to move physical boxes or VMs that have GPT partitions on the C drive, or that have clustered disks, it may be better to look for a different migration tool instead of using ASR at this point in time.
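Before committing to ASR for a migration wave, it helps to inventory disk layouts up front. Here is a minimal sketch of such a check, assuming the servers run Windows Server 2012 or later (the built-in Storage module provides Get-Disk); the messages are illustrative:

```powershell
# Sketch: flag a server whose OS disk is GPT/UEFI, since ASR (at the time
# of writing) can't replicate GPT OS disks. Assumes the built-in Storage
# module (Get-Disk), i.e. Windows Server 2012 or later.
$osDisk = Get-Disk | Where-Object { $_.IsBoot }

if ($osDisk.PartitionStyle -eq 'GPT') {
    Write-Warning "OS disk is GPT/UEFI - not supported by ASR; plan a different tool."
} else {
    Write-Output "OS disk is $($osDisk.PartitionStyle) - passes this ASR check."
}

# Clustered disks are the other blocker; on a cluster node you could list
# physical disk resources with the FailoverClusters module, if installed:
# Get-ClusterResource | Where-Object ResourceType -eq 'Physical Disk'
```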
While looking at third-party options, I came across a vendor called DoubleTake. When we checked their cloud migration user guide, we did not see any comments specific to UEFI-partition-based Windows machines.
Here is the DoubleTake user guide for your reference.

https://migrate.doubletake.com/docs/CMCUsersGuide.pdf

It says it does not support UEFI disks on Linux machines; nothing on Windows.

For more details on the Azure Site Recovery support matrix, please check this link:

The bottom line is that ASR may be a very good tool, but it has a few limitations, and if you are planning a large-scale migration with different workloads, you need to plan your large workload sprints in advance and decide the strategy on a case-by-case basis.

https://docs.microsoft.com/en-us/azure/site-recovery/vmware-physical-azure-support-matrix

Azure Site Recovery: frequently asked questions (FAQ)
 12/27/2018
 10 minutes to read

This article includes frequently asked questions about Azure Site Recovery. If you
have questions after reading this article, post them on the Azure Recovery Services
Forum.
General
What does Site Recovery do?

Site Recovery contributes to your business continuity and disaster recovery (BCDR)
strategy, by orchestrating and automating replication of Azure VMs between regions,
on-premises virtual machines and physical servers to Azure, and on-premises
machines to a secondary datacenter. Learn more.
What can Site Recovery protect?

 Azure VMs: Site Recovery can replicate any workload running on a supported Azure VM.
 Hyper-V virtual machines: Site Recovery can protect any workload running on a
Hyper-V VM.
 Physical servers: Site Recovery can protect physical servers running Windows or
Linux.
 VMware virtual machines: Site Recovery can protect any workload running in a
VMware VM.

Can I replicate Azure VMs?

Yes, you can replicate supported Azure VMs between Azure regions. Learn more.
What do I need in Hyper-V to orchestrate replication with Site Recovery?

For the Hyper-V host server what you need depends on the deployment scenario.
Check out the Hyper-V prerequisites in:

 Replicating Hyper-V VMs (without VMM) to Azure
 Replicating Hyper-V VMs (with VMM) to Azure
 Replicating Hyper-V VMs to a secondary datacenter
 If you're replicating to a secondary datacenter read about Supported guest operating
systems for Hyper-V VMs.
 If you're replicating to Azure, Site Recovery supports all the guest operating systems
that are supported by Azure.

Can I protect VMs when Hyper-V is running on a client operating system?

No, VMs must be located on a Hyper-V host server that's running on a supported
Windows server machine. If you need to protect a client computer you could
replicate it as a physical machine to Azure or a secondary datacenter.
What workloads can I protect with Site Recovery?

You can use Site Recovery to protect most workloads running on a supported VM or
physical server. Site Recovery provides support for application-aware replication, so
that apps can be recovered to an intelligent state. It integrates with Microsoft
applications such as SharePoint, Exchange, Dynamics, SQL Server and Active
Directory, and works closely with leading vendors, including Oracle, SAP, IBM and
Red Hat. Learn more about workload protection.
Do Hyper-V hosts need to be in VMM clouds?

If you want to replicate to a secondary datacenter, then Hyper-V VMs must be on Hyper-V host servers located in a VMM cloud. If you want to replicate to Azure, then you can replicate VMs with or without VMM clouds. Read more about Hyper-V replication to Azure.
Can I deploy Site Recovery with VMM if I only have one VMM server?

Yes. You can either replicate VMs in Hyper-V servers in the VMM cloud to Azure, or
you can replicate between VMM clouds on the same server. For on-premises to on-
premises replication, we recommend that you have a VMM server in both the
primary and secondary sites.
What physical servers can I protect?

You can replicate physical servers running Windows and Linux to Azure or to a
secondary site. Learn about requirements for replication to Azure, and replication to
a secondary site.

Note that physical servers will run as VMs in Azure if your on-premises server goes down. Failback to an on-premises physical server isn't currently supported. For a machine protected as physical, you can only fail back to a VMware virtual machine.
What VMware VMs can I protect?

To protect VMware VMs you'll need a vSphere hypervisor, and virtual machines
running VMware tools. We also recommend that you have a VMware vCenter server
to manage the hypervisors. Learn more about requirements for replication to Azure,
or replication to a secondary site.
Can I manage disaster recovery for my branch offices with Site Recovery?

Yes. When you use Site Recovery to orchestrate replication and failover in your
branch offices, you'll get a unified orchestration and view of all your branch office
workloads in a central location. You can easily run failovers and administer disaster
recovery of all branches from your head office, without visiting the branches.
Pricing

For pricing related questions, please refer to the FAQ at Azure Site Recovery pricing.
Security
Is replication data sent to the Site Recovery service?

No, Site Recovery doesn't intercept replicated data, and doesn't have any
information about what's running on your virtual machines or physical servers.
Replication data is exchanged between on-premises Hyper-V hosts, VMware
hypervisors, or physical servers and Azure storage or your secondary site. Site
Recovery has no ability to intercept that data. Only the metadata needed to
orchestrate replication and failover is sent to the Site Recovery service.

Site Recovery is ISO 27001:2013, 27018, HIPAA, DPA certified, and is in the process of
SOC2 and FedRAMP JAB assessments.
For compliance reasons, even our on-premises metadata must remain within the
same geographic region. Can Site Recovery help us?

Yes. When you create a Site Recovery vault in a region, we ensure that all metadata
that we need to enable and orchestrate replication and failover remains within that
region's geographic boundary.
Does Site Recovery encrypt replication?

For virtual machines and physical servers replicating between on-premises sites, encryption-in-transit is supported. For virtual machines and physical servers replicating to Azure, both encryption-in-transit and encryption-at-rest (in Azure) are supported.
Replication
Can I replicate over a site-to-site VPN to Azure?

Azure Site Recovery replicates data to an Azure storage account, over a public
endpoint. Replication isn't over a site-to-site VPN. You can create a site-to-site VPN,
with an Azure virtual network. This doesn't interfere with Site Recovery replication.
Can I use ExpressRoute to replicate virtual machines to Azure?

Yes, ExpressRoute can be used to replicate on-premises virtual machines to Azure. Azure Site Recovery replicates data to an Azure Storage account over a public
endpoint. You need to set up public peering or Microsoft peering to use
ExpressRoute for Site Recovery replication. Microsoft peering is the recommended
routing domain for replication. After the virtual machines have been failed over to an
Azure virtual network you can access them using the private peering setup with the
Azure virtual network. Replication is not supported over private peering. In case you
are protecting VMware machines or physical machines, ensure that the Networking
Requirements are also met for replication.
Are there any prerequisites for replicating virtual machines to Azure?

VMware VMs and Hyper-V VMs that you want to replicate to Azure should comply with Azure requirements.

Your Azure user account needs to have certain permissions to enable replication of a new virtual machine to Azure.
Can I replicate Hyper-V generation 2 virtual machines to Azure?

Yes. Site Recovery converts from generation 2 to generation 1 during failover. At failback, the machine is converted back to generation 2. Read more.
If I replicate to Azure how do I pay for Azure VMs?

During regular replication, data is replicated to geo-redundant Azure storage and you don't need to pay any Azure IaaS virtual machine charges, providing a significant advantage. When you run a failover to Azure, Site Recovery automatically creates Azure IaaS virtual machines, and after that you'll be billed for the compute resources that you consume in Azure.
Can I automate Site Recovery scenarios with an SDK?

Yes. You can automate Site Recovery workflows using the Rest API, PowerShell, or
the Azure SDK. Currently supported scenarios for deploying Site Recovery using
PowerShell:

 Replicate Hyper-V VMs in VMM clouds to Azure (PowerShell, Resource Manager)
 Replicate Hyper-V VMs without VMM to Azure (PowerShell, Resource Manager)
 Replicate VMware VMs to Azure (PowerShell, Resource Manager)
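As a rough sketch of what that automation can look like with the current Az PowerShell module (cmdlet names differ slightly in the older AzureRM module; the vault and resource group names below are placeholders):

```powershell
# Sketch: set the Site Recovery vault context and list protected items.
# "ContosoVault" and "ContosoRG" are placeholder names.
Connect-AzAccount

$vault = Get-AzRecoveryServicesVault -Name "ContosoVault" -ResourceGroupName "ContosoRG"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault

# Enumerate fabrics, protection containers, and replicated items in the vault.
foreach ($fabric in Get-AzRecoveryServicesAsrFabric) {
    foreach ($container in Get-AzRecoveryServicesAsrProtectionContainer -Fabric $fabric) {
        Get-AzRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $container |
            Select-Object FriendlyName, ProtectionState, ReplicationHealth
    }
}
```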

If I replicate to Azure what kind of storage account do I need?

You need an LRS or GRS storage account. We recommend GRS so that data is
resilient if a regional outage occurs, or if the primary region can't be recovered. The
account must be in the same region as the Recovery Services vault. Premium storage
is supported for VMware VM, Hyper-V VM, and physical server replication, when you
deploy Site Recovery in the Azure portal.
How often can I replicate data?

 Hyper-V: Hyper-V VMs can be replicated every 30 seconds (except for premium storage), every 5 minutes, or every 15 minutes. If you've set up SAN replication, then replication is synchronous.
 Azure VMs, VMware and physical servers: A replication frequency isn't relevant
here. Replication is continuous.

Can I extend replication from existing recovery site to another tertiary site?

Extended or chained replication isn't supported. Request this feature in the feedback forum.
Can I do an offline replication the first time I replicate to Azure?

This isn't supported. Request this feature in the feedback forum.


Can I exclude specific disks from replication?

This is supported when you're replicating VMware VMs and Hyper-V VMs to Azure,
using the Azure portal.
Can I replicate virtual machines with dynamic disks?

Dynamic disks are supported when replicating Hyper-V virtual machines. They are
also supported when replicating VMware VMs and physical machines to Azure. The
operating system disk must be a basic disk.
Can I add a new machine to an existing replication group?

Adding new machines to existing replication groups is supported. To do so, select the replication group (from the 'Replicated items' blade), right-click or open the context menu on the replication group, and select the appropriate option.

Can I throttle bandwidth allotted for Hyper-V replication traffic?

Yes. You can read more about throttling bandwidth in the deployment articles:

 Capacity planning for replicating VMware VMs and physical servers
 Capacity planning for replicating Hyper-V VMs to Azure

Failover
If I'm failing over to Azure, how do I access the Azure virtual machines after failover?

You can access the Azure VMs over a secure Internet connection, over a site-to-site
VPN, or over Azure ExpressRoute. You'll need to prepare a number of things in order
to connect. Learn more
If I fail over to Azure how does Azure make sure my data is resilient?

Azure is designed for resilience. Site Recovery is already engineered for failover to a
secondary Azure datacenter, in accordance with the Azure SLA if the need arises. If
this happens, we make sure your metadata and vaults remain within the same
geographic region that you chose for your vault.
If I'm replicating between two datacenters what happens if my primary datacenter
experiences an unexpected outage?

You can trigger an unplanned failover from the secondary site. Site Recovery doesn't
need connectivity from the primary site to perform the failover.
Is failover automatic?

Failover isn't automatic. You initiate failovers with a single click in the portal, or you can use Site Recovery PowerShell to trigger a failover. Failing back is a simple action in the Site Recovery portal.

To automate failover, you could use on-premises Orchestrator or Operations Manager to detect a virtual machine failure, and then trigger the failover using the SDK.
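For illustration, a scripted unplanned failover might look like the following sketch. It assumes the Az module with the vault context already set (see the earlier sketch); the VM name is a placeholder, and the cmdlets should be verified against the current Site Recovery PowerShell documentation:

```powershell
# Sketch: trigger an unplanned failover for one protected item.
# Assumes Set-AzRecoveryServicesAsrVaultContext has already been called;
# "SQLVM01" is a placeholder name.
$container = Get-AzRecoveryServicesAsrProtectionContainer -Fabric (Get-AzRecoveryServicesAsrFabric)[0]
$item = Get-AzRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $container |
    Where-Object FriendlyName -eq "SQLVM01"

$job = Start-AzRecoveryServicesAsrUnplannedFailoverJob `
    -ReplicationProtectedItem $item `
    -Direction PrimaryToRecovery

# Poll until the failover job completes.
while (($job = Get-AzRecoveryServicesAsrJob -Job $job).State -eq "InProgress") {
    Start-Sleep -Seconds 30
}
$job.State
```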

 Read more about recovery plans.
 Read more about failover.
 Read more about failing back VMware VMs and physical servers.
If my on-premises host is not responding or crashed, can I failover back to a different
host?

Yes, you can use alternate location recovery to fail back to a different host from Azure. Read more about the options in the links below for VMware and Hyper-V virtual machines.

 For VMware virtual machines
 For Hyper-V virtual machines

Service providers
I'm a service provider. Does Site Recovery work for dedicated and shared
infrastructure models?

Yes, Site Recovery supports both dedicated and shared infrastructure models.
For a service provider, is the identity of my tenant shared with the Site Recovery
service?

No. Tenant identity remains anonymous. Your tenants don't need access to the Site
Recovery portal. Only the service provider administrator interacts with the portal.
Will tenant application data ever go to Azure?

When replicating between service provider-owned sites, application data never goes
to Azure. Data is encrypted in-transit, and replicated directly between the service
provider sites.

If you're replicating to Azure, application data is sent to Azure storage but not to the
Site Recovery service. Data is encrypted in-transit, and remains encrypted in Azure.
Will my tenants receive a bill for any Azure services?

No. Azure's billing relationship is directly with the service provider. Service providers
are responsible for generating specific bills for their tenants.
If I'm replicating to Azure, do we need to run virtual machines in Azure at all times?

No. Data is replicated to an Azure storage account in your subscription. When you
perform a test failover (DR drill) or an actual failover, Site Recovery automatically
creates virtual machines in your subscription.
Do you ensure tenant-level isolation when I replicate to Azure?

Yes.
What platforms do you currently support?

We support Azure Pack, Cloud Platform System, and System Center based (2012 and
higher) deployments. Learn more about Azure Pack and Site Recovery integration.
Do you support single Azure Pack and single VMM server deployments?

Yes, you can replicate Hyper-V virtual machines to Azure, or between service provider
sites. Note that if you replicate between service provider sites, Azure runbook
integration isn't available.

About Site Recovery
 12/27/2018
 3 minutes to read

Welcome to the Azure Site Recovery service! This article provides a quick service
overview.

As an organization, you need to adopt a business continuity and disaster recovery (BCDR) strategy that keeps your data safe, and your apps and workloads up and running, when planned and unplanned outages occur.

Azure Recovery Services contribute to your BCDR strategy:

 Site Recovery service: Site Recovery helps ensure business continuity by keeping
business apps and workloads running during outages. Site Recovery replicates
workloads running on physical and virtual machines (VMs) from a primary site to a
secondary location. When an outage occurs at your primary site, you fail over to
secondary location, and access apps from there. After the primary location is running
again, you can fail back to it.
 Backup service: The Azure Backup service keeps your data safe and recoverable by
backing it up to Azure.

Site Recovery can manage replication for:

 Azure VMs replicating between Azure regions.
 On-premises VMs, Azure Stack VMs, and physical servers.

What does Site Recovery provide?


Simple BCDR solution: Using Site Recovery, you can set up and manage replication, failover, and failback from a single location in the Azure portal.

Azure VM replication: You can set up disaster recovery of Azure VMs from a primary region to a secondary region.

On-premises VM replication: You can replicate on-premises VMs and physical servers to Azure, or to a secondary on-premises datacenter. Replication to Azure eliminates the cost and complexity of maintaining a secondary datacenter.

Workload replication: You can replicate any workload running on supported Azure VMs, on-premises Hyper-V and VMware VMs, and Windows/Linux physical servers.

Data resilience: Site Recovery orchestrates replication without intercepting application data. When you replicate to Azure, data is stored in Azure storage, with the resilience that provides. When failover occurs, Azure VMs are created based on the replicated data.

RTO and RPO targets: Keep recovery time objectives (RTO) and recovery point objectives (RPO) within organizational limits. Site Recovery provides continuous replication for Azure VMs and VMware VMs, and replication frequency as low as 30 seconds for Hyper-V. You can reduce RTO further by integrating with Azure Traffic Manager.

Keep apps consistent over failover: You can replicate using recovery points with application-consistent snapshots. These snapshots capture disk data, all data in memory, and all transactions in process.

Testing without disruption: You can easily run disaster recovery drills, without affecting ongoing replication.

Flexible failovers: You can run planned failovers for expected outages with zero data loss, or unplanned failovers with minimal data loss (depending on replication frequency) for unexpected disasters. You can easily fail back to your primary site when it's available again.

Customized recovery plans: Using recovery plans, you can customize and sequence the failover and recovery of multi-tier applications running on multiple VMs. You group machines together in a recovery plan, and optionally add scripts and manual actions. Recovery plans can be integrated with Azure Automation runbooks.

BCDR integration: Site Recovery integrates with other BCDR technologies. For example, you can use Site Recovery to protect the SQL Server back end of corporate workloads, with native support for SQL Server AlwaysOn, to manage the failover of availability groups.

Azure Automation integration: A rich Azure Automation library provides production-ready, application-specific scripts that can be downloaded and integrated with Site Recovery.

Network integration: Site Recovery integrates with Azure for simple application network management, including reserving IP addresses, configuring load balancers, and integrating Azure Traffic Manager for efficient network switchovers.

What can I replicate?


Replication scenarios: Replicate Azure VMs from one Azure region to another. Replicate on-premises VMware VMs, Hyper-V VMs, physical servers (Windows and Linux), and Azure Stack VMs to Azure. Replicate on-premises VMware VMs, Hyper-V VMs managed by System Center VMM, and physical servers to a secondary site.

Regions: Review supported regions for Site Recovery.

Replicated machines: Review the replication requirements for Azure VM replication, on-premises VMware VMs and physical servers, and on-premises Hyper-V VMs.

Workloads: You can replicate any workload running on a machine that's supported for replication. In addition, the Site Recovery team has performed app-specific testing for a number of apps.

What workloads can you protect with Azure Site Recovery?
 12/31/2018
 8 minutes to read

This article describes workloads and applications you can protect for disaster
recovery with the Azure Site Recovery service.
Overview

Organizations need a business continuity and disaster recovery (BCDR) strategy to keep workloads and data safe and available during planned and unplanned downtime, and to recover to regular working conditions as soon as possible.

Site Recovery is an Azure service that contributes to your BCDR strategy. Using Site
Recovery, you can deploy application-aware replication to the cloud, or to a
secondary site. Whether your apps are Windows or Linux-based, running on physical
servers, VMware or Hyper-V, you can use Site Recovery to orchestrate replication,
perform disaster recovery testing, and run failovers and failback.

Site Recovery integrates with Microsoft applications, including SharePoint, Exchange, Dynamics, SQL Server, and Active Directory. Microsoft also works closely with leading vendors including Oracle, SAP, and Red Hat. You can customize replication solutions on an app-by-app basis.
Why use Site Recovery for application replication?

Site Recovery contributes to application-level protection and recovery as follows:

 App-agnostic, providing replication for any workload running on a supported machine.
 Near-synchronous replication, with RPOs as low as 30 seconds to meet the needs of
most critical business apps.
 App-consistent snapshots, for single or multi-tier applications.
 Integration with SQL Server AlwaysOn, and partnership with other application-level
replication technologies, including AD replication, SQL AlwaysOn, Exchange Database
Availability Groups (DAGs) and Oracle Data Guard.
 Flexible recovery plans, that enable you to recover an entire application stack with a
single click, and to include external scripts and manual actions in the plan.
 Advanced network management in Site Recovery and Azure to simplify app network
requirements, including the ability to reserve IP addresses, configure load-balancing,
and integration with Azure Traffic Manager, for low RTO network switchovers.
 A rich automation library that provides production-ready, application-specific scripts
that can be downloaded and integrated with recovery plans.

Workload summary

Site Recovery can replicate any app running on a supported machine. In addition, we've partnered with product teams to carry out additional app-specific testing. The matrix below covers five scenarios: replicating Azure VMs to Azure, Hyper-V VMs to a secondary site, Hyper-V VMs to Azure, VMware VMs to a secondary site, and VMware VMs to Azure.

 Active Directory, DNS: supported in all five scenarios.
 Web apps (IIS, SQL): supported in all five scenarios.
 System Center Operations Manager: supported in all five scenarios.
 SharePoint: supported in all five scenarios.
 SAP (replicate SAP site to Azure for non-cluster): supported in all five scenarios, tested by Microsoft.
 Exchange (non-DAG): supported in all five scenarios.
 Remote Desktop/VDI: supported in all five scenarios.
 Linux (operating system and apps): supported in all five scenarios, tested by Microsoft.
 Dynamics AX: supported in all five scenarios.
 Windows File Server: supported in all five scenarios.
 Citrix XenApp and XenDesktop: supported for replication to Azure (Azure VMs, Hyper-V VMs, VMware VMs); N/A for replication to a secondary site.

Replicate Active Directory and DNS

An Active Directory and DNS infrastructure is essential to most enterprise apps. During disaster recovery, you'll need to protect and recover these infrastructure components before recovering your workloads and apps.

You can use Site Recovery to create a complete automated disaster recovery plan for
Active Directory and DNS. For example, if you want to fail over SharePoint and SAP
from a primary to a secondary site, you can set up a recovery plan that fails over
Active Directory first, and then an additional app-specific recovery plan to fail over
the other apps that rely on Active Directory.

Learn more about protecting Active Directory and DNS.


Protect SQL Server

SQL Server provides the data services foundation for many business apps in an on-premises datacenter. Site Recovery can be used together with SQL Server HA/DR technologies to protect multi-tiered enterprise apps that use SQL Server. Site Recovery provides:

 A simple and cost-effective disaster recovery solution for SQL Server. Replicate
multiple versions and editions of SQL Server standalone servers and clusters, to Azure
or to a secondary site.
 Integration with SQL AlwaysOn Availability Groups, to manage failover and failback
with Azure Site Recovery recovery plans.
 End-to-end recovery plans for all tiers in an application, including the SQL Server databases.
 Scaling of SQL Server for peak loads with Site Recovery, by “bursting” them into
larger IaaS virtual machine sizes in Azure.
 Easy testing of SQL Server disaster recovery. You can run test failovers to analyze data
and run compliance checks, without impacting your production environment.

Learn more about protecting SQL server.


Protect SharePoint

Azure Site Recovery helps protect SharePoint deployments, as follows:

 Eliminates the need and associated infrastructure costs for a stand-by farm for
disaster recovery. Use Site Recovery to replicate an entire farm (Web, app and
database tiers) to Azure or to a secondary site.
 Simplifies application deployment and management. Updates deployed to the
primary site are automatically replicated, and are thus available after failover and
recovery of a farm in a secondary site. Also lowers the management complexity and
costs associated with keeping a stand-by farm up-to-date.
 Simplifies SharePoint application development and testing by creating a production-like, on-demand replica environment for testing and debugging.
 Simplifies transition to the cloud by using Site Recovery to migrate SharePoint
deployments to Azure.

Learn more about protecting SharePoint.


Protect Dynamics AX

Azure Site Recovery helps protect your Dynamics AX ERP solution, by:

 Orchestrating replication of your entire Dynamics AX environment (Web and AOS tiers, database tiers, SharePoint) to Azure, or to a secondary site.
 Simplifying migration of Dynamics AX deployments to the cloud (Azure).
 Simplifying Dynamics AX application development and testing by creating a
production-like copy on-demand, for testing and debugging.

Learn more about protecting Dynamics AX.


Protect RDS

Remote Desktop Services (RDS) enables virtual desktop infrastructure (VDI), session-
based desktops, and applications, allowing users to work anywhere. With Azure Site
Recovery you can:

 Replicate managed or unmanaged pooled virtual desktops to a secondary site, and remote applications and sessions to a secondary site or Azure.

 Here's what you can replicate, across seven scenarios: Azure VMs to Azure; Hyper-V VMs to a secondary site; Hyper-V VMs to Azure; VMware VMs to a secondary site; VMware VMs to Azure; physical servers to a secondary site; physical servers to Azure.

 Pooled Virtual Desktop (unmanaged): replication to a secondary site is supported (Hyper-V VMs, VMware VMs, physical servers); replication to Azure is not supported.
 Pooled Virtual Desktop (managed and without UPD): replication to a secondary site is supported (Hyper-V VMs, VMware VMs, physical servers); replication to Azure is not supported.
 Remote applications and desktop sessions (without UPD): supported in all scenarios.

Set up disaster recovery for RDS using Azure Site Recovery.

Learn more about protecting RDS.


Protect Exchange

Site Recovery helps protect Exchange, as follows:

 For small Exchange deployments, such as a single or standalone server, Site Recovery
can replicate and fail over to Azure or to a secondary site.
 For larger deployments, Site Recovery integrates with Exchange DAGs.
 Exchange DAGs are the recommended solution for Exchange disaster recovery in an
enterprise. Site Recovery recovery plans can include DAGs, to orchestrate DAG failover
across sites.

Learn more about protecting Exchange.


Protect SAP

Use Site Recovery to protect your SAP deployment, as follows:

 Enable protection of SAP NetWeaver and non-NetWeaver production applications running on-premises, by replicating components to Azure.
 Enable protection of SAP NetWeaver and non-NetWeaver production applications running in Azure, by replicating components to another Azure datacenter.
 Simplify cloud migration, by using Site Recovery to migrate your SAP deployment to
Azure.
 Simplify SAP project upgrades, testing, and prototyping, by creating a production
clone on-demand for testing SAP applications.

Learn more about protecting SAP.


Protect IIS

Use Site Recovery to protect your IIS deployment, as follows:

Azure Site Recovery provides disaster recovery by replicating the critical components in your environment to a cold remote site or a public cloud like Microsoft Azure. Since the virtual machines with the web server and the database are replicated to the recovery site, there is no requirement to back up configuration files or certificates separately. The application mappings and bindings that depend on environment variables and change post-failover can be updated through scripts integrated into the disaster recovery plans. Virtual machines are brought up on the recovery site only in the event of a failover. In addition, Azure Site Recovery helps you orchestrate the end-to-end failover by providing the following capabilities:

 Sequencing the shutdown and startup of virtual machines in the various tiers.
 Adding scripts to allow update of application dependencies and bindings on the
virtual machines after they have been started up. The scripts can also be used to
update the DNS server to point to the recovery site.
 Allocating IP addresses to virtual machines pre-failover by mapping the primary and recovery networks, so that scripts do not need to be updated post-failover.
 Ability for a one-click failover for multiple web applications on the web servers, thus
eliminating the scope for confusion in the event of a disaster.
 Ability to test the recovery plans in an isolated environment for DR drills.

Learn more about protecting IIS web farm.


Protect Citrix XenApp and XenDesktop

Use Site Recovery to protect your Citrix XenApp and XenDesktop deployments, as
follows:
 Enable protection of the Citrix XenApp and XenDesktop deployment, by replicating
different deployment layers including (AD DNS server, SQL database server, Citrix
Delivery Controller, StoreFront server, XenApp Master (VDA), Citrix XenApp License
Server) to Azure.
 Simplify cloud migration, by using Site Recovery to migrate your Citrix XenApp and
XenDesktop deployment to Azure.
 Simplify Citrix XenApp/XenDesktop testing, by creating a production-like copy on-
demand for testing and debugging.
 This solution is only applicable to Windows Server operating system virtual desktops, not client virtual desktops, as client virtual desktops are not yet supported for licensing in Azure. Learn more about licensing for client/server desktops in Azure.

Learn more about protecting Citrix XenApp and XenDesktop deployments.


Alternatively, you can refer to the whitepaper from Citrix detailing the same.

Set up disaster recovery to a secondary Azure region for an Azure VM
 12/27/2018
 2 minutes to read

The Azure Site Recovery service contributes to your business continuity and disaster recovery (BCDR) strategy by keeping your business apps up and running during planned and unplanned outages. Site Recovery manages and orchestrates disaster recovery of on-premises machines and Azure virtual machines (VMs), including replication, failover, and recovery.

This quickstart describes how to replicate an Azure VM to a different Azure region.

If you don't have an Azure subscription, create a free account before you begin.
 Note

This article is intended to guide a new user through the Azure Site Recovery experience with the default options and minimum customization. If you want to know more about the various settings that can be customized, refer to the tutorial for enabling replication for Azure VMs.
Log in to Azure

Log in to the Azure portal at http://portal.azure.com.


Enable replication for the Azure VM

1. In the Azure portal, click Virtual machines, and select the VM you want to replicate.
2. In Operations, click Disaster recovery.
3. In Configure disaster recovery > Target region select the target region to which
you'll replicate.
4. For this Quickstart, accept the other default settings.

5. Click Enable replication. This starts a job to enable replication for the VM.

Verify settings

After the replication job has finished, you can check the replication status, modify
replication settings, and test the deployment.
1. In the VM menu, click Disaster recovery.

2. You can verify replication health, the recovery points that have been created, and the source and target regions on the map.
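The same checks can be scripted. Below is a minimal sketch with the Az module, assuming the vault context has been set as in the earlier SDK sketch; "MyVM" is a placeholder:

```powershell
# Sketch: check replication health and list recovery points for one VM.
# Assumes the Site Recovery vault context is already set; "MyVM" is a placeholder.
$item = Get-AzRecoveryServicesAsrFabric |
    ForEach-Object { Get-AzRecoveryServicesAsrProtectionContainer -Fabric $_ } |
    ForEach-Object { Get-AzRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $_ } |
    Where-Object FriendlyName -eq "MyVM"

$item | Select-Object FriendlyName, ProtectionState, ReplicationHealth

# Recovery points currently available for failover:
Get-AzRecoveryServicesAsrRecoveryPoint -ReplicationProtectedItem $item |
    Select-Object Name, RecoveryPointType, RecoveryPointTime
```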

Clean up resources

The VM in the primary region stops replicating when you disable replication for it:

 The source replication settings are cleaned up automatically. Please note that the Site
Recovery extension that is installed as part of the replication isn't removed and needs
to be removed manually.
 Site Recovery billing for the VM also stops.
Stop replication as follows

1. Select the VM.

2. In Disaster recovery, click Disable Replication.

Next steps

In this quickstart, you replicated a single VM to a secondary region. You can now
explore more options and try replicating a set of Azure VMs using a recovery plan.
Azure to Azure disaster recovery
architecture
 12/31/2018
 8 minutes to read
 Contributors
o

This article describes the architecture, components, and processes used when you deploy disaster recovery for Azure virtual machines (VMs) using the Azure Site Recovery service. With disaster recovery set up, Azure VMs continuously replicate to a different target region. If an outage occurs, you can fail over VMs to the secondary region and access them from there. When everything's running normally again, you can fail back and continue working in the primary location.
Architectural components

The components involved in disaster recovery for Azure VMs are summarized in the
following table.
VMs in source region: One or more Azure VMs in a supported source region. VMs can be running any supported operating system.

Source VM storage: Azure VMs can have managed disks, or non-managed disks spread across storage accounts. Learn about supported Azure storage.

Source VM networks: VMs can be located in one or more subnets in a virtual network (VNet) in the source region. Learn more about networking requirements.

Cache storage account: You need a cache storage account in the source network. During replication, VM changes are stored in the cache before being sent to target storage. Using a cache ensures minimal impact on production applications that are running on a VM. Learn more about cache storage requirements.

Target resources: Target resources are used during replication, and when a failover occurs. Site Recovery can set up target resources by default, or you can create/customize them. In the target region, check that you're able to create VMs, and that your subscription has enough resources to support the VM sizes that will be needed in the target region.
Target resources

When you enable replication for a VM, Site Recovery gives you the option of creating
target resources automatically.
Target subscription: Same as the source subscription.

Target resource group: The resource group to which VMs belong after failover. It can be in any Azure region except the source region. Site Recovery creates a new resource group in the target region, with an "asr" suffix.

Target VNet: The virtual network (VNet) in which replicated VMs are located after failover. A network mapping is created between source and target virtual networks, and vice versa. Site Recovery creates a new VNet and subnet, with the "asr" suffix.

Target storage account: If the VM doesn't use a managed disk, this is the storage account to which data is replicated. Site Recovery creates a new storage account in the target region, to mirror the source storage account.

Replica managed disks: If the VM uses managed disks, these are the managed disks to which data is replicated. Site Recovery creates replica managed disks in the target region, to mirror the source.

Target availability sets: The availability set in which replicating VMs are located after failover. Site Recovery creates an availability set in the target region with the suffix "asr", for VMs that are located in an availability set in the source location. If an availability set already exists, it's used and a new one isn't created.

Target availability zones: If the target region supports availability zones, Site Recovery assigns the same zone number as that used in the source region.
Managing target resources

You can manage target resources as follows:

 You can modify target settings as you enable replication.
 You can modify target settings after replication is already working. The exception is the availability type (single instance, set, or zone); to change this setting, you need to disable replication, modify the setting, and then re-enable it.

Replication policy

When you enable Azure VM replication, by default Site Recovery creates a new replication policy with the default settings summarized below.

 Recovery point retention: specifies how long Site Recovery keeps recovery points. Default: 24 hours.
 App-consistent snapshot frequency: how often Site Recovery takes an app-consistent snapshot. Default: every 60 minutes.

Managing replication policies

You can manage and modify the default replication policies settings as follows:

 You can modify the settings as you enable replication.
 You can create a replication policy at any time, and then apply it when you enable replication.
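As an illustration, creating a custom Azure-to-Azure policy from PowerShell might look like this sketch; the parameter names are per the Az module at the time of writing, and the policy name and values are placeholders:

```powershell
# Sketch: create a custom Azure-to-Azure replication policy that keeps
# recovery points for 48 hours and takes app-consistent snapshots every
# 2 hours, instead of the 24-hour / 60-minute defaults described above.
# Assumes the vault context is already set.
New-AzRecoveryServicesAsrPolicy -AzureToAzure `
    -Name "A2APolicy-48h" `
    -RecoveryPointRetentionInHours 48 `
    -ApplicationConsistentSnapshotFrequencyInHours 2
```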

Multi-VM consistency

If you want VMs to replicate together, and have shared crash-consistent and app-
consistent recovery points at failover, you can gather them together into a replication
group. Multi-VM consistency impacts workload performance, and should only be
used for VMs running workloads that need consistency across all machines.
Snapshots and recovery points

Recovery points are created from snapshots of VM disks taken at a specific point in
time. When you fail over a VM, you use a recovery point to restore the VM in the
target location.
When failing over, we generally want to ensure that the VM starts with no corruption
or data loss, and that the VM data is consistent for the operating system, and for
apps that run on the VM. This depends on the type of snapshots taken.

Site Recovery takes snapshots as follows:

1. Site Recovery takes crash-consistent snapshots of data by default, and app-consistent snapshots if you specify a frequency for them.
2. Recovery points are created from the snapshots, and stored in accordance with
retention settings in the replication policy.

Consistency

The following table explains different types of consistency.


Crash-consistent

Description: A crash-consistent snapshot captures the data that was on the disk when the snapshot was taken. It doesn't include anything in memory. It contains the equivalent of the on-disk data that would be present if the VM crashed, or the power cord was pulled from the server, at the instant the snapshot was taken. A crash-consistent snapshot doesn't guarantee data consistency for the operating system, or for apps on the VM.

Details: Site Recovery creates crash-consistent recovery points every five minutes by default. This setting can't be modified.

Recommendation: Today, most apps can recover well from crash-consistent points. Crash-consistent recovery points are usually sufficient for the replication of operating systems, and apps such as DHCP servers and print servers.

App-consistent

Description: App-consistent recovery points are created from app-consistent snapshots. An app-consistent snapshot contains all the information in a crash-consistent snapshot, plus all the data in memory and transactions in progress.

Details: App-consistent snapshots use the Volume Shadow Copy Service (VSS). 1) When a snapshot is initiated, VSS performs a copy-on-write (COW) operation on the volume. 2) Before it performs the COW, VSS informs every app on the machine that it needs to flush its memory-resident data to disk. 3) VSS then allows the backup/disaster recovery app (in this case, Site Recovery) to read the snapshot data and proceed.

Recommendation: App-consistent snapshots are taken in accordance with the frequency you specify. This frequency should always be less than the frequency you set for retaining recovery points. For example, if you retain recovery points using the default setting of 24 hours, you should set the frequency at less than 24 hours. App-consistent snapshots are more complex and take longer to complete than crash-consistent snapshots, and they affect the performance of apps running on a VM enabled for replication.

Replication process

When you enable replication for an Azure VM, the following happens:

1. The Site Recovery Mobility service extension is automatically installed on the VM.
2. The extension registers the VM with Site Recovery.
3. Continuous replication begins for the VM. Disk writes are immediately transferred to
the cache storage account in the source location.
4. Site Recovery processes the data in the cache, and sends it to the target storage
account, or to the replica managed disks.
5. After the data is processed, crash-consistent recovery points are generated every five
minutes. App-consistent recovery points are generated according to the setting
specified in the replication policy.
Connectivity requirements

The Azure VMs you replicate need outbound connectivity. Site Recovery never needs
inbound connectivity to the VM.
Outbound connectivity (URLs)

If outbound access for VMs is controlled with URLs, allow these URLs.
 *.blob.core.windows.net: allows data to be written from the VM to the cache storage account in the source region.
 login.microsoftonline.com: provides authorization and authentication to Site Recovery service URLs.
 *.hypervrecoverymanager.windowsazure.com: allows the VM to communicate with the Site Recovery service.
 *.servicebus.windows.net: allows the VM to write Site Recovery monitoring and diagnostics data.

Outbound connectivity for IP address ranges

To control outbound connectivity for VMs using IP addresses, allow these addresses.
Source region rules

 Allow HTTPS outbound, port 443: allow ranges that correspond to storage accounts in the source region (service tag: Storage).
 Allow HTTPS outbound, port 443: allow ranges that correspond to Azure Active Directory (Azure AD) (service tag: AzureActiveDirectory). If Azure AD addresses are added in the future, you need to create new Network Security Group (NSG) rules.
 Allow HTTPS outbound, port 443: allow access to Site Recovery endpoints that correspond to the target location.
Target region rules

 Allow HTTPS outbound, port 443: allow ranges that correspond to storage accounts in the target region (service tag: Storage).
 Allow HTTPS outbound, port 443: allow ranges that correspond to Azure AD (service tag: AzureActiveDirectory). If Azure AD addresses are added in the future, you need to create new NSG rules.
 Allow HTTPS outbound, port 443: allow access to Site Recovery endpoints that correspond to the source location.

Control access with NSG rules

If you control VM connectivity by filtering network traffic to and from Azure networks/subnets using NSG rules, note the following requirements:

 NSG rules for the source Azure region should allow outbound access for replication
traffic.
 We recommend you create rules in a test environment before you put them into
production.
 Use service tags instead of allowing individual IP addresses.
o Service tags represent a group of IP address prefixes gathered together to
minimize complexity when creating security rules.
o Microsoft automatically updates service tags over time.

Learn more about outbound connectivity for Site Recovery, and about controlling connectivity with NSGs.
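For instance, an outbound rule built on a service tag rather than raw IP ranges could be sketched as follows; the NSG and resource group names are placeholders, and per the guidance above you should trial the rule in a test environment first:

```powershell
# Sketch: add an outbound HTTPS rule to an existing NSG using the "Storage"
# service tag, so replication traffic to source-region storage is allowed
# without maintaining raw IP ranges. Names are placeholders.
$nsg = Get-AzNetworkSecurityGroup -Name "SourceVMNSG" -ResourceGroupName "SourceRG"

$nsg | Add-AzNetworkSecurityRuleConfig `
    -Name "AllowStorageOutbound" `
    -Priority 110 `
    -Direction Outbound `
    -Access Allow `
    -Protocol Tcp `
    -SourceAddressPrefix "*" `
    -SourcePortRange "*" `
    -DestinationAddressPrefix "Storage" `
    -DestinationPortRange 443 |
    Set-AzNetworkSecurityGroup
```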
Connectivity for multi-VM consistency

If you enable multi-VM consistency, machines in the replication group communicate with each other over port 20004.

 Ensure that there is no firewall appliance blocking the internal communication between the VMs over port 20004.
 If you want Linux VMs to be part of a replication group, ensure the outbound traffic on port 20004 is manually opened per the guidance for the specific Linux version.
Failover process

When you initiate a failover, the VMs are created in the target resource group, target
virtual network, target subnet, and in the target availability set. During a failover, you
can use any recovery point.

Azure TCO

Define your workloads

Enter the details of your on-premises workloads. This information will be used to
understand your current TCO and recommended services in Azure.
Servers
Enter the details of your on-premises server infrastructure. After adding a workload,
select the workload type and enter the remaining details.
Add server workload

Databases
Enter the details of your on-premises database infrastructure. After adding a
database, enter the details of your on-premises database infrastructure in the Source
section. In the Destination section, select the Azure service you would like to use.
Add database

Storage
Enter the details of your on-premises storage infrastructure. After adding storage,
select the storage type and enter the remaining details.
Add storage

Networking
Enter the amount of network bandwidth you currently consume in your on-premises
environment.
Outbound bandwidth (1 - 2000)

Adjust assumptions

The following assumptions are being made as part of the TCO model. These key
assumptions usually vary among customers. We recommend reviewing these values
for accuracy.
Currency

Software Assurance coverage (provides Azure Hybrid Benefit)


Enable this if you have purchased this benefit for your on-premises Windows or SQL
Servers. If enabled, Azure Hybrid Benefit (AHB) will be applied to Azure estimates.
AHB helps you get more value from your on-premises licenses — save up to 40
percent on virtual machines and up to 82 percent with Azure Reserved Virtual
Machines (VM) instances.
Windows Server Software Assurance coverage

SQL Server Software Assurance coverage

Learn more about Software AssuranceLearn more about Azure Hybrid Benefit

Geo-redundant storage (GRS)


GRS replicates your data to a secondary region that is hundreds of miles away from the primary region.

Learn more about GRS


Virtual Machine costs
Enable this if you do not want the calculator to recommend B-series virtual machines.

Learn more about B-series virtual machines


Electricity costs
Price per KW hour
(USD)

Storage costs
Storage procurement cost/GB for local disk/SAN-SSD
(USD)

Storage procurement cost/GB for local disk/SAN-HDD


(USD)

Storage procurement cost/GB for NAS/file storage


(USD)

Storage procurement cost/GB for Blob storage


(USD)

Annual enterprise storage software support cost


(%)

Cost per tape drive


(USD)

IT labor costs
Number of physical servers that can be managed by a full time administrator
Number of virtual machines that can be managed by a full time administrator
Hourly rate for IT administrator
(USD)

Other assumptions

The following assumptions also affect the TCO model, but typically require less
adjustment by customers. You can come back to this section at any time and adjust
the assumptions.
Hardware costs
Software costs
Electricity costs
Virtualization costs
Data center costs
Networking costs
Database costs

Geo-redundant storage (GRS): Cross-regional replication for Azure Storage
Geo-redundant storage (GRS) is designed to provide at least 99.99999999999999%
(16 9's) durability of objects over a given year by replicating your data to a secondary
region that is hundreds of miles away from the primary region. If your storage
account has GRS enabled, then your data is durable even in the case of a complete
regional outage or a disaster in which the primary region isn't recoverable.

If you opt for GRS, you have two related options to choose from:

 GRS replicates your data to another data center in a secondary region, but that data
is available to be read only if Microsoft initiates a failover from the primary to
secondary region.
 Read-access geo-redundant storage (RA-GRS) is based on GRS. RA-GRS replicates
your data to another data center in a secondary region, and also provides you with the
option to read from the secondary region. With RA-GRS, you can read from the
secondary region regardless of whether Microsoft initiates a failover from the primary
to secondary region.

For a storage account with GRS or RA-GRS enabled, all data is first replicated with
locally redundant storage (LRS). An update is first committed to the primary location
and replicated using LRS. The update is then replicated asynchronously to the
secondary region using GRS. When data is written to the secondary location, it's also
replicated within that location using LRS.

Both the primary and secondary regions manage replicas across separate fault
domains and upgrade domains within a storage scale unit. The storage scale unit is
the basic replication unit within the datacenter. Replication at this level is provided
by LRS; for more information, see Locally redundant storage (LRS): Low-cost data
redundancy for Azure Storage.

Keep these points in mind when deciding which replication option to use:
 Zone-redundant storage (ZRS) provides high availability with synchronous replication and may be a better choice for some scenarios than GRS or RA-GRS. For more information on ZRS, see ZRS.
 Asynchronous replication involves a delay from the time that data is written to the
primary region, to when it is replicated to the secondary region. In the event of a
regional disaster, changes that haven't yet been replicated to the secondary region
may be lost if that data can't be recovered from the primary region.
 With GRS, the replica isn't available for read or write access unless Microsoft initiates
a failover to the secondary region. In the case of a failover, you'll have read and write
access to that data after the failover has completed. For more information, please
see Disaster recovery guidance.
 If your application needs to read from the secondary region, enable RA-GRS.

Read-access geo-redundant storage

Read-access geo-redundant storage (RA-GRS) maximizes availability for your storage account. RA-GRS provides read-only access to the data in the secondary location, in addition to geo-replication across two regions.

When you enable read-only access to your data in the secondary region, your data is
available on a secondary endpoint as well as on the primary endpoint for your
storage account. The secondary endpoint is similar to the primary endpoint, but
appends the suffix –secondary to the account name. For example, if your primary
endpoint for the Blob service is myaccount.blob.core.windows.net, then your
secondary endpoint is myaccount-secondary.blob.core.windows.net. The access
keys for your storage account are the same for both the primary and secondary
endpoints.

Some considerations to keep in mind when using RA-GRS:

 Your application has to manage which endpoint it is interacting with when using RA-
GRS.
 Since asynchronous replication involves a delay, changes that haven't yet been
replicated to the secondary region may be lost if data can't be recovered from the
primary region.
 You can check the Last Sync Time of your storage account. Last Sync Time is a GMT
date/time value. All primary writes before the Last Sync Time have been successfully
written to the secondary location, meaning that they are available to be read from the
secondary location. Primary writes after the Last Sync Time may or may not be
available for reads yet. You can query this value using the Azure portal, Azure
PowerShell, or from one of the Azure Storage client libraries.
 If you initiate an account failover (preview) of a GRS or RA-GRS account to the
secondary region, write access to that account is restored after the failover has
completed. For more information, see Disaster recovery and storage account failover
(preview).
 RA-GRS is intended for high-availability purposes. For scalability guidance, review
the performance checklist.
 For suggestions on how to design for high availability with RA-GRS, see Designing
Highly Available Applications using RA-GRS storage.
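As an example, checking the Last Sync Time and deriving the secondary Blob endpoint from PowerShell might look like this sketch; the account and resource group names are placeholders, and the -IncludeGeoReplicationStats switch is per the Az.Storage module at the time of writing:

```powershell
# Sketch: query the Last Sync Time for an RA-GRS account and show the
# secondary Blob endpoint naming convention. Names are placeholders.
$account = Get-AzStorageAccount -ResourceGroupName "ContosoRG" `
    -Name "contosostorage" -IncludeGeoReplicationStats

# All primary writes before this GMT timestamp are readable from the secondary.
$account.GeoReplicationStats.LastSyncTime

# Secondary endpoint: the account name with a "-secondary" suffix.
"https://contosostorage-secondary.blob.core.windows.net"
```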

What is the RPO and RTO with GRS?

Recovery Point Objective (RPO): In GRS and RA-GRS, the storage service
asynchronously geo-replicates the data from the primary to the secondary location.
In the event that the primary region becomes unavailable, you can perform an
account failover (preview) to the secondary region. When you initiate a failover,
recent changes that haven't yet been geo-replicated may be lost. The number of
minutes of potential data that's lost is known as the RPO. The RPO indicates the
point in time to which data can be recovered. Azure Storage typically has an RPO of
less than 15 minutes, although there's currently no SLA on how long geo-replication
takes.

Recovery Time Objective (RTO): The RTO is a measure of how long it takes to perform the failover and get the storage account back online. The time to perform the failover includes the following actions:

 The time until the customer initiates the failover of the storage account from the
primary to the secondary region.
 The time required by Azure to perform the failover by changing the primary DNS
entries to point to the secondary location.

Paired regions

When you create a storage account, you select the primary region for the account.
The paired secondary region is determined based on the primary region, and can't
be changed. For up-to-date information about regions supported by Azure,
see Business continuity and disaster recovery (BCDR): Azure paired regions.

Azure Storage redundancy


The data in your Microsoft Azure storage account is always replicated to ensure
durability and high availability. Azure Storage copies your data so that it is protected
from planned and unplanned events, including transient hardware failures, network
or power outages, and massive natural disasters. You can choose to replicate your
data within the same data center, across zonal data centers within the same region,
or across geographically separated regions.
Replication ensures that your storage account meets the Service-Level Agreement
(SLA) for Storage even in the face of failures. See the SLA for information about
Azure Storage guarantees for durability and availability.
Choosing a redundancy option

When you create a storage account, you can select one of the following redundancy
options:

 Locally redundant storage (LRS)
 Zone-redundant storage (ZRS)
 Geo-redundant storage (GRS)
 Read-access geo-redundant storage (RA-GRS)
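When you create an account programmatically, the redundancy option is simply the SKU name. A minimal sketch with the Az module follows; the resource group, account name, and region are placeholders:

```powershell
# Sketch: create a storage account with a chosen redundancy option by
# picking the SKU: Standard_LRS, Standard_ZRS, Standard_GRS, or Standard_RAGRS.
New-AzStorageAccount -ResourceGroupName "ContosoRG" `
    -Name "contosogrsdata" `
    -Location "eastus2" `
    -SkuName "Standard_RAGRS" `
    -Kind "StorageV2"
```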

The following table provides a quick overview of the scope of durability and
availability that each replication strategy provides for a given type of event (or an
event of similar impact).
| Scenario | LRS | ZRS | GRS | RA-GRS |
| --- | --- | --- | --- | --- |
| Node unavailability within a data center | Yes | Yes | Yes | Yes |
| An entire data center (zonal or non-zonal) becomes unavailable | No | Yes | Yes | Yes |
| A region-wide outage | No | No | Yes | Yes |
| Read access to your data (in a remote, geo-replicated region) in the event of region-wide unavailability | No | No | No | Yes |
| Designed to provide __ durability of objects over a given year | at least 99.999999999% (11 9's) | at least 99.9999999999% (12 9's) | at least 99.99999999999999% (16 9's) | at least 99.99999999999999% (16 9's) |
| Supported storage account types | GPv2, GPv1, Blob | GPv2 | GPv2, GPv1, Blob | GPv2, GPv1, Blob |
| Availability SLA for read requests | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool access tier) | At least 99.99% (99.9% for cool access tier) |
| Availability SLA for write requests | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool access tier) |

For pricing information for each redundancy option, see Azure Storage Pricing.

For information about Azure Storage guarantees for durability and availability, see
the Azure Storage SLA.
 Note

Premium Storage supports only locally redundant storage (LRS).


Changing replication strategy

You can change your storage account's replication strategy by using the Azure
portal, Azure PowerShell, the Azure CLI, or one of the many Azure client libraries.
Changing the replication type of your storage account does not result in downtime.
 Note

Currently, you cannot use the Portal or API to convert your account to ZRS. If you
want to convert your account's replication to ZRS, see Zone-redundant storage
(ZRS) for details.
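As a sketch of such a change with Azure PowerShell (names hypothetical; note the ZRS restriction above):

```powershell
# Converts an existing account's replication type to GRS in place.
Set-AzStorageAccount -ResourceGroupName "myResourceGroup" `
    -Name "mystorageaccount" -SkuName "Standard_GRS"
```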
Are there any costs to changing my account's replication strategy?

It depends on your conversion path. Ordering from cheapest to most expensive, the
redundancy offerings are LRS, ZRS, GRS, and RA-GRS. For example, going from LRS
to any other option incurs additional charges because you are moving to a more
sophisticated redundancy level. Going to GRS or RA-GRS incurs an egress bandwidth
charge because your data (in your primary region) is replicated to your remote
secondary region. This is a one-time charge at initial setup. After the data is copied,
there are no further conversion charges; you are charged only for replicating new
data or updates to existing data. For details on bandwidth charges, see the Azure
Storage Pricing page.

If you change from GRS to LRS, there is no additional cost, but your replicated data is
deleted from the secondary location.

Locally redundant storage (LRS): Low-cost data redundancy for Azure Storage
Locally redundant storage (LRS) provides at least 99.999999999% (11 nines)
durability of objects over a given year. LRS provides this object durability by
replicating your data to a storage scale unit. A datacenter, located in the region
where you created your storage account, hosts the storage scale unit. A write request
to an LRS storage account returns successfully only after the data is written to all
replicas. Each replica resides in separate fault domains and upgrade domains within a
storage scale unit.

A storage scale unit is a collection of racks of storage nodes. A fault domain (FD) is a
group of nodes that represent a physical unit of failure. Think of a fault domain as
nodes belonging to the same physical rack. An upgrade domain (UD) is a group of
nodes that are upgraded together during the process of a service upgrade (rollout).
The replicas are spread across UDs and FDs within one storage scale unit. This
architecture ensures your data is available if a hardware failure affects a single rack or
when nodes are upgraded during a service upgrade.

LRS is the lowest-cost replication option and offers the least durability compared to
other options. If a datacenter-level disaster (for example, fire or flooding) occurs, all
replicas may be lost or unrecoverable. To mitigate this risk, Microsoft recommends
using either zone-redundant storage (ZRS) or geo-redundant storage (GRS).

 If your application stores data that can be easily reconstructed if data loss occurs, you
may opt for LRS.
 Some applications are restricted to replicating data only within a country due to data
governance requirements. In some cases, the paired regions across which the data is
replicated for GRS accounts may be in another country. For more information on
paired regions, see Azure regions.

Zone-redundant storage (ZRS): Highly available Azure Storage applications
Zone-redundant storage (ZRS) replicates your data synchronously across three
storage clusters in a single region. Each storage cluster is physically separated from
the others and is located in its own availability zone (AZ). Each availability zone—and
the ZRS cluster within it—is autonomous and includes separate utilities and
networking features.

When you store your data in a storage account using ZRS replication, you can
continue to access and manage your data if an availability zone becomes unavailable.
ZRS provides excellent performance and low latency. ZRS offers the same scalability
targets as locally redundant storage (LRS).

Consider ZRS for scenarios that require consistency, durability, and high availability.
Even if an outage or natural disaster renders an availability zone unavailable, ZRS
offers durability for storage objects of at least 99.9999999999% (12 9's) over a given
year.

For more information about availability zones, see Availability Zones overview.


Support coverage and regional availability

ZRS currently supports standard general-purpose v2 account types. For more
information about storage account types, see Azure storage account overview.

ZRS is available for block blobs, non-disk page blobs, files, tables, and queues.

ZRS is generally available in the following regions:

 Asia Southeast
 Europe West
 Europe North
 France Central
 Japan East
 UK South
 US East
 US East 2
 US West 2
 US Central

Microsoft continues to enable ZRS in additional Azure regions. Check the Azure
Service Updates page regularly for information about new regions.
What happens when a zone becomes unavailable?

Your data is still accessible for both read and write operations even if a zone
becomes unavailable. Microsoft recommends that you continue to follow practices
for transient fault handling. These practices include implementing retry policies with
exponential back-off.

When a zone is unavailable, Azure undertakes networking updates, such as DNS
repointing. These updates may affect your application if you are accessing your data
before the updates have completed.

ZRS may not protect your data against a regional disaster where multiple zones are
permanently affected. Instead, ZRS offers resiliency for your data if it becomes
temporarily unavailable. For protection against regional disasters, Microsoft
recommends using geo-redundant storage (GRS). For more information about GRS,
see Geo-redundant storage (GRS): Cross-regional replication for Azure Storage.
Converting to ZRS replication

Migrating to or from LRS, GRS, and RA-GRS is straightforward. Use the Azure portal
or the Storage Resource Provider API to change your account's redundancy type.
Azure will then replicate your data accordingly.

Migrating data to or from ZRS requires a different strategy. ZRS migration involves
the physical movement of data from a single storage stamp to multiple stamps
within a region.

There are two primary options for migration to or from ZRS:

 Manually copy or move data to a new ZRS account from an existing account.
 Request a live migration.

Microsoft strongly recommends that you perform a manual migration. A manual
migration provides more flexibility than a live migration. With a manual migration,
you're in control of the timing.

To perform a manual migration, you have options:

 Use existing tooling like AzCopy, one of the Azure Storage client libraries, or reliable
third-party tools (a PowerShell sketch follows this list).
 If you're familiar with Hadoop or HDInsight, attach both source and destination (ZRS)
account to your cluster. Then, parallelize the data copy process with a tool like DistCp.
 Build your own tooling using one of the Azure Storage client libraries.
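For the first option, a minimal sketch of a server-side blob copy between two accounts using the Azure PowerShell storage cmdlets (all names and keys are hypothetical):

```powershell
# Contexts for the source (existing) and destination (ZRS) accounts.
$srcCtx  = New-AzStorageContext -StorageAccountName "srcaccount" -StorageAccountKey "<src-key>"
$destCtx = New-AzStorageContext -StorageAccountName "zrsaccount" -StorageAccountKey "<dest-key>"

# Copy every blob in a container; repeat per container to migrate the account.
Get-AzStorageBlob -Container "mycontainer" -Context $srcCtx |
    Start-AzStorageBlobCopy -DestContainer "mycontainer" -DestContext $destCtx
```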

A manual migration can result in application downtime. If your application requires
high availability, Microsoft also provides a live migration option. A live migration is
an in-place migration.

During a live migration, you can use your storage account while your data is
migrated between source and destination storage stamps. During the migration
process, you have the same level of durability and availability SLA as you do
normally.

Keep in mind the following restrictions on live migration:

 While Microsoft handles your request for live migration promptly, there's no
guarantee as to when a live migration will complete. If you need your data migrated to
ZRS by a certain date, then Microsoft recommends that you perform a manual
migration instead. Generally, the more data you have in your account, the longer it
takes to migrate that data.
 Live migration is supported only for storage accounts that use LRS or GRS replication.
If your account uses RA-GRS, then you need to first change your account's replication
type to either LRS or GRS before proceeding. This intermediary step removes the
secondary read-only endpoint provided by RA-GRS before migration.
 Your account must contain data.
 You can only migrate data within the same region. If you want to migrate your data
into a ZRS account located in a region different than the source account, then you
must perform a manual migration.
 Only standard storage account types support live migration. Premium storage
accounts must be migrated manually.

You can request live migration through the Azure Support portal. From the portal,
select the storage account you want to convert to ZRS.

1. Select New Support Request.
2. Complete the Basics based on your account information. In the Service section,
select Storage Account Management and the resource you want to convert to ZRS.
3. Select Next.
4. Specify the following values in the Problem section:

o Severity: Leave the default value as-is.
o Problem Type: Select Data Migration.
o Category: Select Migrate to ZRS within a region.
o Title: Type a descriptive title, for example, ZRS account migration.
o Details: Type additional details in the Details box, for example, I would like to
migrate to ZRS from [LRS, GRS] in the __ region.

5. Select Next.
6. Verify that the contact information is correct on the Contact information blade.
7. Select Create.

A support person will contact you and provide any assistance you need.

ZRS Classic: A legacy option for block blob redundancy


 Note

Microsoft will deprecate and migrate ZRS Classic accounts on March 31, 2021. More
details will be provided to ZRS Classic customers before deprecation.

Once ZRS becomes generally available in a region, customers won't be able to
create ZRS Classic accounts from the portal in that region. Using Azure PowerShell
and the Azure CLI to create ZRS Classic accounts remains an option until ZRS Classic
is deprecated.

ZRS Classic asynchronously replicates data across data centers within one to two
regions. Replicated data may not be available unless Microsoft initiates failover to the
secondary. A ZRS Classic account can't be converted to or from LRS, GRS, or RA-GRS.
ZRS Classic accounts also don't support metrics or logging.

ZRS Classic is available only for block blobs in general-purpose V1 (GPv1) storage
accounts. For more information about storage accounts, see Azure storage account
overview.

To manually migrate ZRS account data to or from an LRS, ZRS Classic, GRS, or RA-
GRS account, use one of the following tools: AzCopy, Azure Storage Explorer, Azure
PowerShell, or Azure CLI. You can also build your own migration solution with one of
the Azure Storage client libraries.

 AWS is 5 times more expensive than Azure for Windows Server and SQL Server

 Migrate your Windows Server and SQL Server workloads to Azure with your on-premises
licenses

 Go at your own pace—move a few workloads or entire datacenters

AWS is 5 times more expensive than Azure for Windows Server and SQL Server
Achieve the lowest cost of ownership when you combine the Azure Hybrid
Benefit, reservation pricing, and extended security updates.

 Receive free extended security updates when you migrate your Windows Server and SQL
Server 2008 and 2008 R2 workloads to Azure virtual machines.

 Save when you prepay your compute capacity on a one- or three-year term with reservation
pricing, which not only improves budget forecasting but also provides flexibility to exchange or
cancel should business needs change.

Azure Hybrid Benefit Savings Calculator

The Azure Hybrid Benefit Savings Calculator (covering Windows Server VMs, SQL
Server VMs, and SQL Database) estimates your savings based on the number of core
licenses you own with active Software Assurance or Windows Server Subscriptions
and your planned Azure deployment (region, instance size, and hours per month).
For example, for eligible D4 v2 instances (8 cores, 28 GB RAM, 400 GB SSD, $1.008
per hour), the calculator estimates $3,679.20 per month without Azure Hybrid
Benefit versus $2,135.25 with it, a saving of $1,543.95 per month (42.0% savings), or
an estimated $18,527.40 annually across all virtual machines. The calculator only
estimates a savings range for Windows Server licenses that include Software
Assurance; your actual savings may vary.
Activate your Azure Hybrid Benefit

Windows Server VMs:

1. Deploy a virtual machine within minutes from the Azure Portal or through the Azure
Marketplace.

2. Upload a custom virtual machine

3. Migrate existing virtual machines with Azure Site Recovery

SQL Database:

1. Try SQL Database Managed Instance from the Azure Portal and migrate your SQL
Server databases without changing your apps.

2. Try SQL Database Single Database or Elastic Pool from the Azure Portal and build
data-driven applications and websites in the programming language of your choice.

B-series burstable virtual machine sizes

The B-series VM family allows you to choose which VM size provides the necessary
base-level performance for your workload, with the ability to burst CPU performance
up to 100% of an Intel® Broadwell E5-2673 v4 2.3 GHz or an Intel® Haswell
E5-2673 v3 2.4 GHz processor vCPU.

The B-series VMs are ideal for workloads that do not need the full performance of
the CPU continuously, like web servers, proof of concepts, small databases and
development build environments. These workloads typically have burstable
performance requirements. The B-series provides you with the ability to purchase a
VM size with baseline performance and the VM instance builds up credits when it is
using less than its baseline. When the VM has accumulated credit, the VM can burst
above the baseline using up to 100% of the vCPU when your application requires
higher CPU performance.

The B-series comes in the following six VM sizes:

| Size | vCPUs | Memory (GiB) | Temp storage (SSD) GiB | Base CPU perf of VM | Max CPU perf of VM | Credits banked / hour | Max banked credits |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_B1s | 1 | 1 | 4 | 10% | 100% | 6 | 144 |
| Standard_B1ms | 1 | 2 | 4 | 20% | 100% | 12 | 288 |
| Standard_B2s | 2 | 4 | 8 | 40% | 200% | 24 | 576 |
| Standard_B2ms | 2 | 8 | 16 | 60% | 200% | 36 | 864 |
| Standard_B4ms | 4 | 16 | 32 | 90% | 400% | 54 | 1296 |
| Standard_B8ms | 8 | 32 | 64 | 135% | 800% | 81 | 1944 |

Q&A
Q: How do you get 135% baseline performance from a VM?

A: The 135% is shared amongst the 8 vCPUs that make up the VM size. For example,
if your application uses 4 of the 8 cores working on batch processing and each of
those 4 vCPUs is running at 30% utilization, the total VM CPU performance would
equal 120%. That means your VM would be building credit time based on the 15%
delta below your 135% baseline performance. It also means that when you have
credits available, that same VM can use 100% of all 8 vCPUs, giving the VM a max
CPU performance of 800%.
Q: How can I monitor my credit balance and consumption?

A: We will be introducing 2 new metrics in the coming weeks. The Credit metric will
allow you to view how many credits your VM has banked, and the ConsumedCredit
metric will show how many CPU credits your VM has consumed from the bank. You
will be able to view these metrics from the metrics pane in the portal or
programmatically through the Azure Monitor APIs.
For more information on how to access the metrics data for Azure, see Overview of
metrics in Microsoft Azure.
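A sketch of querying such metrics with Azure PowerShell once they are available (VM name hypothetical; the metric names shown, CPU Credits Remaining and CPU Credits Consumed, are assumptions and may differ from the names that ultimately ship):

```powershell
$vm = Get-AzVM -ResourceGroupName "myResourceGroup" -Name "myBSeriesVM"

# Banked credits over the last hour at 5-minute granularity.
Get-AzMetric -ResourceId $vm.Id -MetricName "CPU Credits Remaining" `
    -TimeGrain 00:05:00 -StartTime (Get-Date).AddHours(-1) -EndTime (Get-Date)
```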
Q: How are credits accumulated?

A: The VM accumulation and consumption rates are set such that a VM running at
exactly its base performance level will have neither a net accumulation nor a net
consumption of bursting credits. A VM has a net increase in credits whenever it is
running below its base performance level, and a net decrease in credits whenever it
is utilizing the CPU above its base performance level.

Example: I deploy a VM using the B1ms size for my small time and attendance
database application. This size allows my application to use up to 20% of a vCPU as
my baseline, which is 0.2 credits per minute I can use or bank.

My application is busy at the beginning and end of my employees' work day,
between 7:00 and 9:00 AM and between 4:00 and 6:00 PM. During the other 20
hours of the day, my application is typically idle, using only 10% of the vCPU. For the
non-peak hours, I earn 0.2 credits per minute but consume only 0.1 credits per
minute, so my VM banks 0.1 x 60 = 6 credits per hour. Across the 20 off-peak hours,
I bank 120 credits.

During peak hours my application averages 60% vCPU utilization, I still earn 0.2
credits per minute but I consume 0.6 credits per minute, for a net cost of 0.4 credits a
minute or 0.4 x 60 = 24 credits per hour. I have 4 hours per day of peak usage, so it
costs 4 x 24 = 96 credits for my peak usage.

If I take the 120 credits I earned off-peak and subtract the 96 credits I used for my
peak times, I bank an additional 24 credits per day that I can use for other bursts of
activity.
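The arithmetic in this example can be checked with a few lines of PowerShell (the rates come straight from the B1ms example above):

```powershell
# Off-peak: earn 0.2 credits/min, burn 0.1 credits/min, for 20 hours.
$banked = (0.2 - 0.1) * 60 * 20    # 120 credits banked
# Peak: burn 0.6 credits/min against 0.2 earned, for 4 hours.
$spent  = (0.6 - 0.2) * 60 * 4     # 96 credits spent
"Net credits banked per day: $($banked - $spent)"   # 24
```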
Q: Does the B-Series support Premium Storage data disks?

A: Yes, all B-Series sizes support Premium Storage data disks.


Q: Why is my remaining credit set to 0 after a redeploy or a stop/start?

A: When a VM is redeployed and the VM moves to another node, the accumulated
credit is lost. If the VM is stopped/started but remains on the same node, the VM
retains the accumulated credit. Whenever the VM starts fresh on a node, it gets an
initial credit; for Standard_B8ms it is 240 minutes.
Other sizes

 General purpose
 Compute optimized
 Memory optimized
 Storage optimized
 GPU optimized
 High performance compute

Agent data sources in Azure Monitor


The data that Azure Monitor collects from agents is defined by the data sources that
you configure. The data from agents is stored as log data with a set of records. Each
data source creates records of a particular type with each type having its own set of
properties.

Summary of data sources

The following table lists the agent data sources that are currently available in Azure
Monitor. Each has a link to a separate article providing detail for that data source. It
also provides information on their method and frequency of collection.
| Data source | Platform | Microsoft monitoring agent | Operations Manager agent | Azure storage | Operations Manager required? | Operations Manager agent data sent via management group |
| --- | --- | --- | --- | --- | --- | --- |
| Custom logs | Windows | • | | | | |
| Custom logs | Linux | • | | | | |
| IIS logs | Windows | • | • | • | | |
| Performance counters | Windows | • | • | | | |
| Performance counters | Linux | • | | | | |
| Syslog | Linux | • | | | | |
| Windows Event logs | Windows | • | • | • | | • |

Configuring data sources

You configure data sources from the Data menu in Advanced Settings for the
workspace. Any configuration is delivered to all connected sources in your
workspace. You cannot currently exclude any agents from this configuration.

1. In the Azure portal, select Log Analytics workspaces > your workspace > Advanced
Settings.
2. Select Data.
3. Click on the data source you want to configure.
4. Follow the link to the documentation for each data source in the above table for
details on its configuration (a PowerShell sketch follows these steps).
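Data sources can also be created programmatically. A minimal sketch using the Az.OperationalInsights PowerShell cmdlets to collect the Windows System event log (workspace and resource group names are hypothetical):

```powershell
# Collect Error and Warning events from the System log for all
# Windows agents connected to the workspace.
New-AzOperationalInsightsWindowsEventDataSource `
    -ResourceGroupName "myResourceGroup" -WorkspaceName "myWorkspace" `
    -Name "System-EventLog" -EventLogName "System" `
    -CollectErrors -CollectWarnings
```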

Data collection

Data source configurations are delivered to agents that are directly connected to
Azure Monitor within a few minutes. The specified data is collected from the agent
and delivered directly to Azure Monitor at intervals specific to each data source. See
the documentation for each data source for these specifics.
For System Center Operations Manager agents in a connected management group,
data source configurations are translated into management packs and delivered to
the management group every 5 minutes by default. The agent downloads the
management pack like any other and collects the specified data. Depending on the
data source, the data is either sent to a management server, which forwards it to
Azure Monitor, or sent by the agent to Azure Monitor without going through the
management server. See Data collection details for monitoring solutions in Azure for
details. For details on connecting Operations Manager and Azure Monitor and on
modifying the frequency at which configuration is delivered, see Configure
Integration with System Center Operations Manager.

If the agent is unable to connect to Azure Monitor or Operations Manager, it will
continue to collect data that it will deliver when it establishes a connection. Data can
be lost if the amount of data reaches the maximum cache size for the client, or if the
agent is not able to establish a connection within 24 hours.
Log records

All log data collected by Azure Monitor is stored in the workspace as records.
Records collected by different data sources will have their own set of properties and
be identified by their Type property. See the documentation for each data source
and solution for details on each record type.

ExpressRoute FAQ
What is ExpressRoute?

ExpressRoute is an Azure service that lets you create private connections between
Microsoft datacenters and infrastructure that’s on your premises or in a colocation
facility. ExpressRoute connections do not go over the public Internet, and offer
higher security, reliability, and speeds with lower latencies than typical connections
over the Internet.
What are the benefits of using ExpressRoute and private network connections?

ExpressRoute connections do not go over the public Internet. They offer higher
security, reliability, and speeds, with lower and consistent latencies than typical
connections over the Internet. In some cases, using ExpressRoute connections to
transfer data between on-premises devices and Azure can yield significant cost
benefits.
Where is the service available?

See this page for service location and availability: ExpressRoute partners and
locations.
How can I use ExpressRoute to connect to Microsoft if I don’t have partnerships with
one of the ExpressRoute-carrier partners?

You can select a regional carrier and land Ethernet connections to one of the
supported exchange provider locations. You can then peer with Microsoft at the
provider location. Check the last section of ExpressRoute partners and locations to
see if your service provider is present in any of the exchange locations. You can then
order an ExpressRoute circuit through the service provider to connect to Azure.
How much does ExpressRoute cost?

Check pricing details for pricing information.


If I pay for an ExpressRoute circuit of a given bandwidth, does the VPN connection I
purchase from my network service provider have to be the same speed?

No. You can purchase a VPN connection of any speed from your service provider.
However, your connection to Azure is limited to the ExpressRoute circuit bandwidth
that you purchase.
If I pay for an ExpressRoute circuit of a given bandwidth, do I have the ability to burst
up to higher speeds if necessary?

Yes. ExpressRoute circuits are configured to allow you to burst up to two times the
bandwidth limit you procured for no additional cost. Check with your service
provider to see if they support this capability.
Can I use the same private network connection with virtual network and other Azure
services simultaneously?

Yes. An ExpressRoute circuit, once set up, allows you to access services within a
virtual network and other Azure services simultaneously. You connect to virtual
networks over the private peering path, and to other services over the Microsoft
peering path.
Does ExpressRoute offer a Service Level Agreement (SLA)?

For information, see the ExpressRoute SLA page.


Supported services

ExpressRoute supports three routing domains for various types of services.


Private peering

 Virtual networks, including all virtual machines and cloud services

Public peering
 Note
Public peering has been disabled on new ExpressRoute circuits. Azure services are
available on Microsoft peering.

 Power BI
 Dynamics 365 for Finance and Operations (formerly known as Dynamics AX Online)
 Most of the Azure services are supported. Please check directly with the service that
you want to use to verify support.

The following services are NOT supported:


o CDN
o Multi-factor Authentication
o Traffic Manager

Microsoft peering

 Office 365
 Dynamics 365
 Power BI
 Azure Active Directory
 Azure DevOps (Azure Global Services community)
 Most of the Azure services are supported. Please check directly with the service that
you want to use to verify support.

The following services are NOT supported:


o CDN
o Multi-factor Authentication
o Traffic Manager

Data and connections


Are there limits on the amount of data that I can transfer using ExpressRoute?

We do not set a limit on the amount of data transfer. Refer to pricing details for
information on bandwidth rates.
What connection speeds are supported by ExpressRoute?

Supported bandwidth offers:

50 Mbps, 100 Mbps, 200 Mbps, 500 Mbps, 1 Gbps, 2 Gbps, 5 Gbps, 10 Gbps
Which service providers are available?

See ExpressRoute partners and locations for the list of service providers and
locations.
Technical details
What are the technical requirements for connecting my on-premises location to
Azure?

See ExpressRoute prerequisites page for requirements.


Are connections to ExpressRoute redundant?

Yes. Each ExpressRoute circuit has a redundant pair of cross connections configured
to provide high availability.
Will I lose connectivity if one of my ExpressRoute links fails?

You will not lose connectivity if one of the cross connections fails. A redundant
connection is available to support the load of your network and provide high
availability of your ExpressRoute circuit. You can additionally create a circuit in a
different peering location to achieve circuit-level resilience.
How do I ensure high availability on a virtual network connected to ExpressRoute?

You can achieve high availability by connecting ExpressRoute circuits in different
peering locations (for example, Singapore, Singapore2) to your virtual network. If one
ExpressRoute circuit goes down, connectivity will fail over to another ExpressRoute
circuit. By default, traffic leaving your virtual network is routed based on Equal Cost
Multi-path Routing (ECMP). You can use Connection Weight to prefer one circuit to
another. For more information, see Optimizing ExpressRoute Routing.
If I'm not co-located at a cloud exchange and my service provider offers point-to-
point connection, do I need to order two physical connections between my on-
premises network and Microsoft?

If your service provider can establish two Ethernet virtual circuits over the physical
connection, you only need one physical connection. The physical connection (for
example, an optical fiber) is terminated on a layer 1 (L1) device. The two Ethernet
virtual circuits are tagged with different VLAN IDs, one for the primary circuit and
one for the secondary. Those VLAN IDs are in the outer 802.1Q Ethernet header. The
inner 802.1Q Ethernet header is mapped to a specific ExpressRoute routing domain.

Can I extend one of my VLANs to Azure using ExpressRoute?

No. We do not support layer 2 connectivity extensions into Azure.


Can I have more than one ExpressRoute circuit in my subscription?

Yes. You can have more than one ExpressRoute circuit in your subscription. The
default limit is set to 10. You can contact Microsoft Support to increase the limit, if
needed.
Can I have ExpressRoute circuits from different service providers?

Yes. You can have ExpressRoute circuits with many service providers. Each
ExpressRoute circuit is associated with one service provider only.
I see two ExpressRoute peering locations in the same metro, for example, Singapore
and Singapore2. Which peering location should I choose to create my ExpressRoute
circuit?

If your service provider offers ExpressRoute at both sites, you can work with your
provider and pick either site to set up ExpressRoute.
Can I have multiple ExpressRoute circuits in the same metro? Can I link them to the
same virtual network?

Yes. You can have multiple ExpressRoute circuits with the same or different service
providers. If the metro has multiple ExpressRoute peering locations and the circuits
are created at different peering locations, you can link them to the same virtual
network. If the circuits are created at the same peering location, you can’t link them
to the same virtual network. Each location name in Azure portal or in PowerShell/CLI
API represents one peering location. For example, you can select the peering
locations "Singapore" and "Singapore2" and connect circuits from each to the same
virtual network.
How do I connect my virtual networks to an ExpressRoute circuit?

The basic steps are:

 Establish an ExpressRoute circuit and have the service provider enable it.
 You, or the provider, must configure the BGP peering(s).
 Link the virtual network to the ExpressRoute circuit (see the sketch below).

For more information, see ExpressRoute workflows for circuit provisioning and circuit
states.
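A minimal Azure PowerShell sketch of the last step, assuming an ExpressRoute gateway already exists in the virtual network (all names hypothetical):

```powershell
$circuit = Get-AzExpressRouteCircuit -Name "MyCircuit" -ResourceGroupName "myResourceGroup"
$gateway = Get-AzVirtualNetworkGateway -Name "MyErGateway" -ResourceGroupName "myResourceGroup"

# The connection links the virtual network (via its gateway) to the circuit.
New-AzVirtualNetworkGatewayConnection -Name "MyConnection" `
    -ResourceGroupName "myResourceGroup" -Location "westus" `
    -VirtualNetworkGateway1 $gateway -PeerId $circuit.Id `
    -ConnectionType ExpressRoute
```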
Are there connectivity boundaries for my ExpressRoute circuit?

Yes. The ExpressRoute partners and locations article provides an overview of the
connectivity boundaries for an ExpressRoute circuit. Connectivity for an ExpressRoute
circuit is limited to a single geopolitical region. Connectivity can be expanded to
cross geopolitical regions by enabling the ExpressRoute premium feature.
Can I link more than one virtual network to an ExpressRoute circuit?

Yes. You can have up to 10 virtual network connections on a standard ExpressRoute
circuit, and up to 100 on a premium ExpressRoute circuit.
I have multiple Azure subscriptions that contain virtual networks. Can I connect
virtual networks that are in separate subscriptions to a single ExpressRoute circuit?

Yes. You can link up to 10 virtual networks in the same subscription as the circuit or
in different subscriptions using a single ExpressRoute circuit. This limit can be
increased by enabling the ExpressRoute premium feature.

For more information, see Sharing an ExpressRoute circuit across multiple
subscriptions.
I have multiple Azure subscriptions associated to different Azure Active Directory
tenants or Enterprise Agreement enrollments. Can I connect virtual networks that are
in separate tenants and enrollments to a single ExpressRoute circuit not in the same
tenant or enrollment?

Yes. ExpressRoute authorizations can span subscription, tenant, and enrollment
boundaries with no additional configuration required.

For more information, see Sharing an ExpressRoute circuit across multiple
subscriptions.
Are virtual networks connected to the same circuit isolated from each other?

No. From a routing perspective, all virtual networks linked to the same ExpressRoute
circuit are part of the same routing domain and are not isolated from each other. If
you need route isolation, you need to create a separate ExpressRoute circuit.
Can I have one virtual network connected to more than one ExpressRoute circuit?

Yes. You can link a single virtual network with up to four ExpressRoute circuits. They
must be ordered through four different ExpressRoute locations.
Can I access the Internet from my virtual networks connected to ExpressRoute
circuits?

Yes. If you have not advertised default routes (0.0.0.0/0) or Internet route prefixes
through the BGP session, you can connect to the Internet from a virtual network
linked to an ExpressRoute circuit.
Can I block Internet connectivity to virtual networks connected to ExpressRoute
circuits?

Yes. You can advertise default routes (0.0.0.0/0) to block all Internet connectivity to
virtual machines deployed within a virtual network and route all traffic out through
the ExpressRoute circuit.

If you advertise default routes, we force traffic to services offered over Microsoft
peering (such as Azure storage and SQL DB) back to your premises. You will have to
configure your routers to return traffic to Azure through the Microsoft peering path
or over the Internet. If you've enabled a service endpoint for the service, the traffic to
the service is not forced to your premises. The traffic remains within the Azure
backbone network. To learn more about service endpoints, see Virtual network
service endpoints.
Can virtual networks linked to the same ExpressRoute circuit talk to each other?

Yes. Virtual machines deployed in virtual networks connected to the same
ExpressRoute circuit can communicate with each other.
Can I use site-to-site connectivity for virtual networks in conjunction with
ExpressRoute?

Yes. ExpressRoute can coexist with site-to-site VPNs. See Configure ExpressRoute
and site-to-site coexisting connections.
Why is there a public IP address associated with the ExpressRoute gateway on a
virtual network?

The public IP address is used for internal management only, and does not constitute
a security exposure of your virtual network.
Are there limits on the number of routes I can advertise?

Yes. We accept up to 4000 route prefixes for private peering and 200 for Microsoft
peering. You can increase this to 10,000 routes for private peering if you enable the
ExpressRoute premium feature.
Are there restrictions on IP ranges I can advertise over the BGP session?

We do not accept private prefixes (RFC1918) for the Microsoft peering BGP session.
What happens if I exceed the BGP limits?

BGP sessions will be dropped. They will be reset once the prefix count goes below
the limit.
What is the ExpressRoute BGP hold time? Can it be adjusted?

The hold time is 180 seconds. The keep-alive messages are sent every 60 seconds. These are
fixed settings on the Microsoft side that cannot be changed. It is possible for you to
configure different timers, and the BGP session parameters will be negotiated
accordingly.
Can I change the bandwidth of an ExpressRoute circuit?

Yes, you can attempt to increase the bandwidth of your ExpressRoute circuit in the
Azure portal, or by using PowerShell. If there is capacity available on the physical port
on which your circuit was created, your change succeeds.
If your change fails, it means either there isn’t enough capacity left on the current
port and you need to create a new ExpressRoute circuit with the higher bandwidth,
or that there is no additional capacity at that location, in which case you won't be
able to increase the bandwidth.

You will also have to follow up with your connectivity provider to ensure that they
update the throttles within their networks to support the bandwidth increase. You
cannot, however, reduce the bandwidth of your ExpressRoute circuit. You have to
create a new ExpressRoute circuit with lower bandwidth and delete the old circuit.
How do I change the bandwidth of an ExpressRoute circuit?

You can update the bandwidth of the ExpressRoute circuit using the REST API or
PowerShell cmdlet.
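A minimal Azure PowerShell sketch of a bandwidth increase (circuit name hypothetical); remember that decreases are not supported:

```powershell
$circuit = Get-AzExpressRouteCircuit -Name "MyCircuit" -ResourceGroupName "myResourceGroup"

# Request a higher bandwidth; this succeeds only if the physical port has capacity.
$circuit.ServiceProviderProperties.BandwidthInMbps = 1000
Set-AzExpressRouteCircuit -ExpressRouteCircuit $circuit
```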
ExpressRoute premium
What is ExpressRoute premium?

ExpressRoute premium is a collection of the following features:

 Increased routing table limit from 4000 routes to 10,000 routes for private peering.
 Increased number of VNets and ExpressRoute Global Reach connections that can be
enabled on an ExpressRoute circuit (default is 10). For more information, see
the ExpressRoute Limits table.
 Connectivity to Office 365 and Dynamics 365.

 Global connectivity over the Microsoft core network. You can now link a VNet
in one geopolitical region with an ExpressRoute circuit in another region.
Examples:
o You can link a VNet created in Europe West to an ExpressRoute circuit created
in Silicon Valley.
o On the Microsoft peering, prefixes from other geopolitical regions are
advertised such that you can connect to, for example, SQL Azure in Europe West
from a circuit in Silicon Valley.

How many VNets and ExpressRoute Global Reach connections can I enable on an
ExpressRoute circuit if I enabled ExpressRoute premium?

The following tables show the ExpressRoute limits and the number of VNets and
ExpressRoute Global Reach connections per ExpressRoute circuit:
ExpressRoute Limits

The following limits apply to ExpressRoute resources per subscription.


| Resource | Default/Max limit |
| --- | --- |
| ExpressRoute circuits per subscription | 10 |
| ExpressRoute circuits per region per subscription (Azure Resource Manager) | 10 |
| Maximum number of routes for Azure private peering with ExpressRoute standard | 4,000 |
| Maximum number of routes for Azure private peering with ExpressRoute premium add-on | 10,000 |
| Maximum number of routes for Azure Microsoft peering with ExpressRoute standard | 200 |
| Maximum number of routes for Azure Microsoft peering with ExpressRoute premium add-on | 200 |
| Maximum number of ExpressRoute circuits linked to the same virtual network in different peering locations | 4 |
| Number of virtual network links allowed per ExpressRoute circuit | see table below |

Number of Virtual Networks per ExpressRoute circuit

| Circuit size | Number of VNet links for standard | Number of VNet links with Premium add-on |
| --- | --- | --- |
| 50 Mbps | 10 | 20 |
| 100 Mbps | 10 | 25 |
| 200 Mbps | 10 | 25 |
| 500 Mbps | 10 | 40 |
| 1 Gbps | 10 | 50 |
| 2 Gbps | 10 | 60 |
| 5 Gbps | 10 | 75 |
| 10 Gbps | 10 | 100 |

How do I enable ExpressRoute premium?

You can enable ExpressRoute premium at circuit creation time, or enable it on an
existing circuit by calling the REST API or a PowerShell cmdlet; the add-on can later
be shut down by updating the circuit.
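A sketch of enabling the premium add-on on an existing circuit with Azure PowerShell (names hypothetical):

```powershell
$circuit = Get-AzExpressRouteCircuit -Name "MyCircuit" -ResourceGroupName "myResourceGroup"

# Move the SKU from Standard to Premium; metered vs. unlimited data stays as-is.
$circuit.Sku.Tier = "Premium"
$circuit.Sku.Name = "Premium_MeteredData"
Set-AzExpressRouteCircuit -ExpressRouteCircuit $circuit
```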
How do I disable ExpressRoute premium?

You can disable ExpressRoute premium by calling the REST API or PowerShell cmdlet.
You must make sure that you have scaled your connectivity needs to meet the
default limits before you disable ExpressRoute premium. If your utilization scales
beyond the default limits, the request to disable ExpressRoute premium fails.
Can I pick and choose the features I want from the premium feature set?

No. You can't pick the features. We enable all features when you turn on
ExpressRoute premium.
How much does ExpressRoute premium cost?

Refer to pricing details for cost.


Do I pay for ExpressRoute premium in addition to standard ExpressRoute charges?

Yes. ExpressRoute premium charges apply on top of ExpressRoute circuit charges
and charges required by the connectivity provider.
ExpressRoute for Office 365

Office 365 was created to be accessed securely and reliably via the Internet. Because
of this, we recommend ExpressRoute for specific scenarios. For information about
using ExpressRoute to access Office 365, visit Azure ExpressRoute for Office 365.
How do I create an ExpressRoute circuit to connect to Office 365 services?

1. Review the ExpressRoute prerequisites page to make sure you meet the
requirements.
2. To ensure that your connectivity needs are met, review the list of service providers
and locations in the ExpressRoute partners and locations article.
3. Plan your capacity requirements by reviewing Network planning and performance
tuning for Office 365.
4. To set up connectivity, follow the steps in ExpressRoute workflows for circuit
provisioning and circuit states.

 Important

Make sure that you have enabled ExpressRoute premium add-on when configuring
connectivity to Office 365 services.
Can my existing ExpressRoute circuits support connectivity to Office 365 services and
Dynamics 365?

Yes. Your existing ExpressRoute circuit can be configured to support connectivity to
Office 365 services. Make sure that you have sufficient capacity to connect to Office
365 services and that you have enabled the premium add-on. Network planning and
performance tuning for Office 365 helps you plan your connectivity needs. Also,
see Create and modify an ExpressRoute circuit.
What Office 365 services can be accessed over an ExpressRoute connection?

Refer to the Office 365 URLs and IP address ranges page for an up-to-date list of
services supported over ExpressRoute.
How much does ExpressRoute for Office 365 services cost?

Office 365 services require premium add-on to be enabled. See the pricing details
page for costs.
What regions is ExpressRoute for Office 365 supported in?

See ExpressRoute partners and locations for information.


Can I access Office 365 over the Internet, even if ExpressRoute was configured for my
organization?

Yes. Office 365 service endpoints are reachable through the Internet, even though
ExpressRoute has been configured for your network. Check with your organization's
networking team to find out whether the network at your location is configured to
connect to Office 365 services through ExpressRoute.
How can I plan for high availability for Office 365 network traffic on Azure
ExpressRoute?

See the recommendation for High availability and failover with Azure ExpressRoute.
Can I access Office 365 US Government Community (GCC) services over an Azure US
Government ExpressRoute circuit?

Yes. Office 365 GCC service endpoints are reachable through the Azure US
Government ExpressRoute. However, you first need to open a support ticket on the
Azure portal to provide the prefixes you intend to advertise to Microsoft. Your
connectivity to Office 365 GCC services will be established after the support ticket is
resolved.
Route filters for Microsoft peering
I am turning on Microsoft peering for the first time, what routes will I see?

You will not see any routes. You have to attach a route filter to your circuit to start
prefix advertisements. For instructions, see Configure route filters for Microsoft
peering.
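A minimal sketch of creating a route filter with Azure PowerShell (names are hypothetical, and the BGP community values shown are illustrative only; look up the current values for the services you need):

```powershell
# A rule that allows the BGP communities for the services to advertise.
$rule = New-AzRouteFilterRuleConfig -Name "AllowServices" -Access Allow `
    -RouteFilterRuleType Community -CommunityList "12076:5010", "12076:5040"

# Create the filter; it must then be attached to the circuit's Microsoft peering.
New-AzRouteFilter -Name "MyRouteFilter" -ResourceGroupName "myResourceGroup" `
    -Location "westus" -Rule $rule
```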
I turned on Microsoft peering and now I am trying to select Exchange Online, but it is
giving me an error that I am not authorized to do it.

When using route filters, any customer can turn on Microsoft peering. However, for
consuming Office 365 services, you still need to get authorized by Office 365.
Do I need to get authorization for turning on Dynamics 365 over Microsoft peering?

No, you do not need authorization for Dynamics 365. You can create a rule and
select Dynamics 365 community without authorization.
I enabled Microsoft peering prior to August 1, 2017, how can I take advantage of
route filters?

Your existing circuit will continue advertising the prefixes for Office 365 and
Dynamics 365. If you want to add Azure public prefixes advertisements over the
same Microsoft peering, you can create a route filter, select the services you need
advertised (including the Office 365 service(s) you need and Dynamics 365), and
attach the filter to your Microsoft peering. For instructions, see Configure route filters
for Microsoft peering.
I have Microsoft peering at one location, now I am trying to enable it at another
location and I am not seeing any prefixes.

 Microsoft peering of ExpressRoute circuits that were configured prior to
August 1, 2017 will have all service prefixes advertised through Microsoft
peering, even if route filters are not defined.

 Microsoft peering of ExpressRoute circuits that are configured on or after
August 1, 2017 will not have any prefixes advertised until a route filter is
attached to the circuit. You will see no prefixes by default.
ExpressRoute Direct (Preview)
What is ExpressRoute Direct?

ExpressRoute Direct provides customers with the ability to connect directly into
Microsoft’s global network at peering locations strategically distributed across the
world. ExpressRoute Direct provides dual 100 Gbps connectivity, which supports
Active/Active connectivity at scale.
How do customers connect to ExpressRoute Direct?

Customers will need to work with their local carriers and co-location providers to get
connectivity to ExpressRoute routers to take advantage of ExpressRoute Direct.
What locations currently support ExpressRoute Direct?

Port availability is dynamic; you can view available capacity by using PowerShell.
Locations currently include the following and are subject to change based on availability:

 Amsterdam
 Canberra
 Chicago
 Washington DC
 Dallas
 Hong Kong
 Los Angeles
 New York City
 Paris
 San Antonio
 Silicon Valley
 Singapore

What is the SLA for ExpressRoute Direct?

ExpressRoute Direct utilizes the same enterprise-grade SLA as ExpressRoute.


What scenarios should customers consider with ExpressRoute Direct?

ExpressRoute Direct provides customers with direct 100 Gbps port pairs into the
Microsoft global backbone. The scenarios that provide customers with the greatest
benefits include massive data ingestion, physical isolation for regulated markets,
and dedicated capacity for burst scenarios, like rendering.
What is the billing model for ExpressRoute Direct?

ExpressRoute Direct is billed for the port pair at a fixed amount. Standard circuits are
included at no additional charge, and premium carries a slight add-on charge. Egress
is billed on a per-circuit basis, based on the zone of the peering location.
When does billing start for the ExpressRoute Direct port pairs?

ExpressRoute Direct's port pairs are billed 45 days after the creation of the
ExpressRoute Direct resource, or when one or both of the links are enabled,
whichever comes first. The 45-day grace period is granted to allow customers to
complete the cross-connection process with the colocation provider.
Global Reach (Preview)
What is ExpressRoute Global Reach?

ExpressRoute Global Reach is an Azure service that connects your on-premises
networks via the ExpressRoute service through Microsoft's global network. For
example, if you have a private data center in California connected to ExpressRoute in
Silicon Valley and another private data center in Texas connected to ExpressRoute in
Dallas, with ExpressRoute Global Reach you can connect your private data centers
together through the two ExpressRoute connections, and your cross-data-center
traffic will traverse Microsoft's network backbone.
How do I enable or disable ExpressRoute Global Reach?

You enable ExpressRoute Global Reach by connecting your ExpressRoute circuits
together; you disable the feature by disconnecting the circuits. For steps, see the
configuration guide. A PowerShell sketch follows.
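A minimal sketch of connecting two circuits with Azure PowerShell (circuit names and the /29 prefix are hypothetical; the address prefix must not overlap with your networks):

```powershell
$ckt1 = Get-AzExpressRouteCircuit -Name "Circuit-SiliconValley" -ResourceGroupName "myResourceGroup"
$ckt2 = Get-AzExpressRouteCircuit -Name "Circuit-Dallas" -ResourceGroupName "myResourceGroup"

# Connect circuit 1's private peering to circuit 2's; the /29 is used
# for the point-to-point link between the two circuits.
Add-AzExpressRouteCircuitConnectionConfig -Name "GlobalReach" `
    -ExpressRouteCircuit $ckt1 `
    -PeerExpressRouteCircuitPeering $ckt2.Peerings[0].Id `
    -AddressPrefix "10.0.0.0/29"
Set-AzExpressRouteCircuit -ExpressRouteCircuit $ckt1
```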
Do I need ExpressRoute Premium for ExpressRoute Global Reach?

If your ExpressRoute circuits are in the same geopolitical region, you don't need
ExpressRoute Premium to connect them together. If two ExpressRoute circuits are in
different geopolitical regions, you need ExpressRoute Premium for both circuits in
order to enable connectivity between them.
How will I be charged for ExpressRoute Global Reach?

ExpressRoute enables connectivity from your on-premises network to Microsoft
cloud services. ExpressRoute Global Reach enables connectivity between your own
on-premises networks via your existing ExpressRoute circuits, leveraging Microsoft's
global network. ExpressRoute Global Reach is billed separately from the existing
ExpressRoute service. There is an add-on fee for enabling this feature on each
ExpressRoute circuit. Traffic between your on-premises networks enabled by
ExpressRoute Global Reach is billed at an egress rate at the source and an ingress
rate at the destination. The rates are based on the zone in which the circuits are
located. See the pricing details page.
Where is ExpressRoute Global Reach supported?

ExpressRoute Global Reach is supported in select countries or places. The
ExpressRoute circuits must be created at the peering locations in those countries or
places.
I have more than two on-premises networks, each connected to an ExpressRoute
circuit. Can I enable ExpressRoute Global Reach to connect all of my on-premises
networks together?

Yes, you can, as long as the circuits are in the supported countries. You need to
connect two ExpressRoute circuits at a time. To create a fully meshed network, you
need to enumerate all circuit pairs and repeat the configuration.
Can I enable ExpressRoute Global Reach between two ExpressRoute circuits at the
same peering location?

No. The two circuits must be from different peering locations. If a metro in a
supported country has more than one ExpressRoute peering location, you can
connect together the ExpressRoute circuits created at different peering locations in
that metro.
If ExpressRoute Global Reach is enabled between circuit X and circuit Y, and between
circuit Y and circuit Z, will my on-premises networks connected to circuit X and circuit
Z talk to each other via Microsoft's network?

No. To enable connectivity between any two of your on-premises networks, you must
connect the corresponding ExpressRoute circuits explicitly. In the above example, you
must connect circuit X and circuit Z.
What is the network throughput I can expect between my on-premises networks
after I enable ExpressRoute Global Reach?

The network throughput between your on-premises networks, enabled by
ExpressRoute Global Reach, is capped by the smaller of the two ExpressRoute
circuits.
With ExpressRoute Global Reach, what are the limits on the number of routes I can
advertise and the number of routes I will receive?

The number of routes you can advertise to Microsoft on Azure private peering
remains at 4,000 on a Standard circuit or 10,000 on a Premium circuit. The number of
routes you will receive from Microsoft on Azure private peering is the sum of the
routes of your Azure virtual networks and the routes from your other on-premises
networks connected via ExpressRoute Global Reach. Make sure you set an
appropriate maximum prefix limit on your on-premises router.
What is the SLA for ExpressRoute Global Reach?

ExpressRoute Global Reach will provide the same availability SLA as the regular
ExpressRoute service.

Business continuity and disaster recovery (BCDR): Azure Paired Regions
What are paired regions?

Azure operates in multiple geographies around the world. An Azure geography is a
defined area of the world that contains at least one Azure region. An Azure region is
an area within a geography, containing one or more datacenters.

Each Azure region is paired with another region within the same geography, together
making a regional pair. The exception is Brazil South, which is paired with a region
outside its geography. Across the region pairs Azure serializes platform updates
(planned maintenance), so that only one paired region is updated at a time. In the
event of an outage affecting multiple regions, at least one region in each pair will be
prioritized for recovery.

Figure 1 – Azure regional pairs


| Geography | Paired region A | Paired region B |
| --- | --- | --- |
| Asia | East Asia | Southeast Asia |
| Australia | Australia East | Australia Southeast |
| Australia | Australia Central | Australia Central 2 |
| Brazil | Brazil South | South Central US |
| Canada | Canada Central | Canada East |
| China | China North | China East |
| China | China North 2 | China East 2 |
| Europe | North Europe | West Europe |
| France | France Central | France South |
| Germany | Germany Central | Germany Northeast |
| India | Central India | South India |
| India | West India | South India |
| Japan | Japan East | Japan West |
| Korea | Korea Central | Korea South |
| North America | East US | West US |
| North America | East US 2 | Central US |
| North America | North Central US | South Central US |
| North America | West US 2 | West Central US |
| UK | UK West | UK South |
| US Department of Defense | US DoD East | US DoD Central |
| US Government | US Gov Arizona | US Gov Texas |
| US Government | US Gov Iowa | US Gov Virginia |
| US Government | US Gov Virginia | US Gov Texas |

Table 1 - Mapping of Azure regional pairs

 West India is different because it is paired with another region in one direction only.
West India's secondary region is South India, but South India's secondary region is
Central India.
 Brazil South is unique because it is paired with a region outside of its own geography.
Brazil South’s secondary region is South Central US, but South Central US’s secondary
region is not Brazil South.
 US Gov Iowa's secondary region is US Gov Virginia, but US Gov Virginia's secondary
region is not US Gov Iowa.
 US Gov Virginia's secondary region is US Gov Texas, but US Gov Texas' secondary
region is not US Gov Virginia.

We recommend that you configure business continuity disaster recovery (BCDR)
across regional pairs to benefit from Azure's isolation and availability policies. For
applications which support multiple active regions, we recommend using both
regions in a region pair where possible. This will ensure optimal availability for
applications and minimized recovery time in the event of a disaster.
An example of paired regions

Figure 2 below shows a hypothetical application which uses the regional pair for
disaster recovery. The green numbers highlight the cross-region activities of three
Azure services (Azure compute, storage, and database) and how they are configured
to replicate across regions. The unique benefits of deploying across paired regions
are highlighted by the orange numbers.

Figure 2 – Hypothetical Azure regional pair


Cross-region activities

As referred to in figure 2.

 Azure Compute (IaaS) – You must provision additional compute resources in
advance to ensure resources are available in another region during a disaster. For
more information, see Azure resiliency technical guidance.

 Azure Storage - Geo-Redundant storage (GRS) is configured by default when
an Azure Storage account is created. With GRS, your data is automatically replicated
three times within the primary region, and three times in the paired region. For more
information, see Azure Storage Redundancy Options.

 Azure SQL Database – With Azure SQL Database Geo-Replication, you can
configure asynchronous replication of transactions to any region in the world;
however, we recommend you deploy these resources in a paired region for most
disaster recovery scenarios. For more information, see Geo-Replication in Azure SQL
Database.

 Azure Resource Manager - Resource Manager inherently provides logical
isolation of components across regions. This means logical failures in one region are
less likely to impact another.
Benefits of paired regions

As referred to in figure 2.

 Physical isolation – When possible, Azure prefers at least 300 miles of
separation between datacenters in a regional pair, although this isn't practical or
possible in all geographies. Physical datacenter separation reduces the likelihood of
natural disasters, civil unrest, power outages, or physical network outages affecting
both regions at once. Isolation is subject to the constraints within the geography
(geography size, power/network infrastructure availability, regulations, etc.).

 Platform-provided replication - Some services such as Geo-Redundant
Storage provide automatic replication to the paired region.

 Region recovery order – In the event of a broad outage, recovery of one
region is prioritized out of every pair. Applications that are deployed across paired
regions are guaranteed to have one of the regions recovered with priority. If an
application is deployed across regions that are not paired, recovery might be
delayed; in the worst case, the chosen regions may be the last two to be recovered.

 Sequential updates – Planned Azure system updates are rolled out to paired
regions sequentially (not at the same time) to minimize downtime, the effect of bugs,
and logical failures in the rare event of a bad update.

 Data residency – A region resides within the same geography as its pair (with
the exception of Brazil South) in order to meet data residency requirements for tax
and law enforcement jurisdiction purposes.

Disaster recovery for Azure applications


Disaster recovery (DR) is focused on recovering from a catastrophic loss of application
functionality. For example, if an Azure region hosting your application becomes unavailable,
you need a plan for running your application or accessing your data in another region.

Business and technology owners must determine how much functionality is required during a
disaster. This level of functionality can take a few forms: completely unavailable, partially
available via reduced functionality or delayed processing, or fully available.

Resiliency and high availability strategies are intended for handling temporary failure
conditions; disaster recovery addresses catastrophic loss. Executing a DR plan involves
people, processes, and supporting applications that allow the system to continue
functioning. Your plan should include rehearsing failures and
testing the recovery of databases to ensure the plan is sound.
Azure disaster recovery features

As with availability considerations, Azure provides resiliency technical guidance designed
to support disaster recovery. There is also a relationship between availability features of Azure
and disaster recovery. For example, the management of roles across fault domains increases
the availability of an application. Without that management, an unhandled hardware failure
would become a “disaster” scenario. Leveraging these availability features and strategies is
an important part of disaster-proofing your application. However, this article goes beyond
general availability issues to more serious (and rarer) disaster events.
Multiple datacenter regions

Azure maintains datacenters in many regions around the world. This infrastructure supports
several disaster recovery scenarios, such as system-provided geo-replication of Azure Storage
to secondary regions. You can also easily and inexpensively deploy a cloud service to
multiple locations around the world. Compare this with the cost and difficulty of building and
maintaining your own datacenters in multiple regions. Deploying data and services to
multiple regions helps protect your application from a major outage in a single region. As you
design your disaster recovery plan, it’s important to understand the concept of paired regions.
For more information, see Business continuity and disaster recovery (BCDR): Azure Paired
Regions.
Azure Site Recovery

Azure Site Recovery provides a simple way to replicate Azure VMs between regions. It has
minimal management overhead, because you don't need to provision any additional resources
in the secondary region. When you enable replication, Site Recovery automatically creates
the required resources in the target region, based on the source VM settings. It provides
automated continuous replication, and enables you to perform application failover with a
single click. You can also run disaster recovery drills by testing failover, without affecting
your production workloads or ongoing replication.
Azure Traffic Manager

When a region-specific failure occurs, you must redirect traffic to services or deployments in
another region. It is most effective to handle this via services such as Azure Traffic Manager,
which automates the failover of user traffic to another region if the primary region fails.
Understanding the fundamentals of Traffic Manager is important when designing an effective
DR strategy.

Traffic Manager uses the Domain Name System (DNS) to direct client requests to the most
appropriate endpoint based on a traffic-routing method and the health of the endpoints. In the
following diagram, users connect to a Traffic Manager URL
(http://myATMURL.trafficmanager.net) which abstracts the actual site URLs
(http://app1URL.cloudapp.net and http://app2URL.cloudapp.net). User requests are
routed to the proper underlying URL based on your configured Traffic Manager routing
method. For the sake of this article, we will be concerned with only the failover option.

When configuring Traffic Manager, you provide a new Traffic Manager DNS prefix, which
users will use to access your service. Traffic Manager now abstracts load balancing one level
higher than the regional level. The Traffic Manager DNS maps to a CNAME for all the
deployments that it manages.

Within Traffic Manager, you specify a prioritized list of deployments that users will be
routed to when failure occurs. Traffic Manager monitors the deployment endpoints. If the
primary deployment becomes unavailable, Traffic Manager routes users to the next
deployment on the priority list.
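
As a hedged sketch of this configuration (Az PowerShell module; the profile name, DNS
prefix, and resource group are hypothetical, and the endpoint targets echo the example URLs
above):

# Create a profile that uses the Priority (failover) routing method.
$tmProfile = New-AzTrafficManagerProfile -Name "myATMProfile" `
    -ResourceGroupName "ContosoDR" -TrafficRoutingMethod Priority `
    -RelativeDnsName "myatmurl" -Ttl 30 `
    -MonitorProtocol HTTP -MonitorPort 80 -MonitorPath "/"

# Priority 1 receives all traffic while healthy; priority 2 is the failover.
Add-AzTrafficManagerEndpointConfig -EndpointName "primary" `
    -TrafficManagerProfile $tmProfile -Type ExternalEndpoints `
    -Target "app1URL.cloudapp.net" -EndpointStatus Enabled -Priority 1
Add-AzTrafficManagerEndpointConfig -EndpointName "failover" `
    -TrafficManagerProfile $tmProfile -Type ExternalEndpoints `
    -Target "app2URL.cloudapp.net" -EndpointStatus Enabled -Priority 2
Set-AzTrafficManagerProfile -TrafficManagerProfile $tmProfile
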

Although Traffic Manager decides where to go during a failover, you can decide whether
your failover domain is dormant or active while you're not in failover mode (which is
unrelated to Traffic Manager). Traffic Manager detects a failure in the primary site and rolls
over to the failover site, regardless of whether that site is currently serving users.

For more information on how Azure Traffic Manager works, refer to:

 Traffic Manager overview
 Traffic Manager routing methods
 Configure failover routing method

Azure disaster scenarios

The following sections cover several different types of disaster scenarios. Region-wide
service disruptions are not the only cause of application-wide failures. Poor design and
administrative errors can also lead to outages. It's important to consider the possible causes of
a failure during both the design and testing phases of your recovery plan. A good plan takes
advantage of Azure features and augments them with application-specific strategies. The
chosen response is determined by the importance of the application, the recovery point
objective (RPO), and the recovery time objective (RTO).
Application failure

Azure automatically handles failures that result from the underlying
hardware or operating system software in the host virtual machine. Azure creates a new role
instance and adds it to the available pool. If more than one role instance was already running,
Azure shifts processing to the other running role instances while replacing the failed node.

Serious application errors can occur without any underlying failure of the hardware or
operating system. The application might fail due to catastrophic exceptions caused by bad
logic or data integrity issues. You must include sufficient telemetry in the application code so
that a monitoring system can detect failure conditions and notify an application administrator.
An administrator who has full knowledge of the disaster recovery processes can decide
whether to trigger a failover process or accept an availability outage while resolving the
critical errors.
Data corruption

Azure automatically stores Azure SQL Database and Azure Storage data three times
redundantly within different fault domains in the same region. If you use geo-replication, the
data is stored three additional times in a different region. However, if your users or your
application corrupts that data in the primary copy, the data quickly replicates to the other
copies. Unfortunately, this results in multiple copies of corrupt data.
To manage potential corruption of your data, you have two options. First, you can manage a
custom backup strategy. You can store your backups in Azure or on-premises, depending on
your business requirements or governance regulations. Another option is to use the point-in-
time restore option to recover a SQL database. For more information, see the data strategies
for disaster recovery section below.
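
As a minimal sketch of the second option (Az PowerShell module; the names and timestamp are
placeholders), a point-in-time restore creates a new database from a restore point taken
before the corruption occurred:

# Restore the database to a new database on the same server, using a
# pre-corruption restore point.
$db = Get-AzSqlDatabase -ResourceGroupName "ContosoDR" `
    -ServerName "contoso-primary-sql" -DatabaseName "OrdersDb"
Restore-AzSqlDatabase -FromPointInTimeBackup `
    -PointInTime "2018-11-26T02:00:00Z" `
    -ResourceGroupName $db.ResourceGroupName -ServerName $db.ServerName `
    -TargetDatabaseName "OrdersDb-restored" -ResourceId $db.ResourceId
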
Network outage

When parts of the Azure network are inaccessible, you may be unable to access your
application or data. If one or more role instances are unavailable due to network issues, Azure
uses the remaining available instances of your application. If your application cannot access
its data because of an Azure network outage, you can potentially run with reduced application
functionality locally by using cached data. You need to design the disaster recovery strategy
to run with reduced functionality in your application. For some applications, this might not be
practical.

Another option is to store data in an alternate location until connectivity is restored. If
reducing functionality is not an option, the remaining options are application downtime or
failover to an alternate region. The design of an application running with reduced
functionality is as much a business decision as a technical one. This is discussed further in
the section on reduced application functionality.
Failure of a dependent service

Azure provides many services that can experience periodic downtime. For example, Azure
Redis Cache is a multi-tenant service which provides caching capabilities to your application.
It's important to consider what happens in your application if the dependent service is
unavailable. In many ways, this scenario is similar to the network outage scenario. However,
considering each service independently results in potential improvements to your overall
plan.

Azure Redis Cache provides caching to your application from within your cloud service
deployment, which provides disaster recovery benefits. First, the service now runs on roles
that are local to your deployment. Therefore, you're better able to monitor and manage the
status of the cache as part of your overall management processes for the cloud service. This
type of caching also exposes new features such as high availability for cached data, which
preserves cached data if a single node fails by maintaining duplicate copies on other nodes.

Note that high availability decreases throughput and increases latency because write
operations must also update any secondary copies. The amount of memory required to store
the cached data is effectively doubled, which must be taken into account during capacity
planning. This example demonstrates that each dependent service might have capabilities that
improve your overall availability and resistance to catastrophic failures.

With each dependent service, you should understand the implications of a service disruption.
In the caching example, it might be possible to access the data directly from a database until
you restore your cache. This would result in reduced performance while providing full access
to application data.
Region-wide service disruption

The previous failures have primarily been failures that can be managed within the same
Azure region. However, you must also prepare for the possibility that there is a service
disruption of the entire region. If a region-wide service disruption occurs, the locally
redundant copies of your data are not available. If you have enabled geo-replication, there are
three additional copies of your blobs and tables in a different region. If Microsoft declares the
region lost, Azure remaps all of the DNS entries to the geo-replicated region.
 Note

Be aware that you don't have any control over this process, and it will occur only for region-
wide service disruption. Consider using Azure Site Recovery to achieve better RPO and
RTO. Site Recovery lets you decide what an acceptable outage is and when to fail over to
the replicated VMs.
Azure-wide service disruption

In disaster planning, you must consider the entire range of possible disasters. One of the most
severe service disruptions would involve all Azure regions simultaneously. As with other
service disruptions, you might decide to accept the risk of temporary downtime in that event.
Widespread service disruptions that span regions are much rarer than isolated service
disruptions involving dependent services or single regions.

However, you may decide that certain mission-critical applications require a backup plan for
a multi-region service disruption. This plan might include failing over to services in
an alternative cloud or a hybrid on-premises and cloud solution.
Reduced application functionality

A well-designed application typically uses services that communicate with each other through
the implementation of loosely coupled information-interchange patterns. A DR-friendly
application requires separation of responsibilities at the service level. This prevents the
disruption of a dependent service from bringing down the entire application. For example,
consider a web commerce application for Company Y. The following modules might
constitute the application:

 Product Catalog allows users to browse products.
 Shopping Cart allows users to add/remove products in their shopping cart.
 Order Status shows the shipping status of user orders.
 Order Submission finalizes the shopping session by submitting the order with payment.
 Order Processing validates the order for data integrity and performs a quantity
availability check.

When a service dependency in this application becomes unavailable, how does the service
function until the dependency recovers? A well-designed system implements isolation
boundaries through separation of responsibilities, both at design time and at runtime. You can
categorize every failure as recoverable and non-recoverable. Non-recoverable errors will
bring down the service, but you can mitigate a recoverable error through alternatives. Some
problems can be handled transparently to the user by automatically managing faults and
taking alternate actions. During a more serious service disruption, the application might be
completely unavailable. A third option is to continue handling user requests with reduced
functionality.

For instance, if the database for hosting orders goes down, the Order Processing service loses
its ability to process sales transactions. Depending on the architecture, it might be difficult or
impossible for the Order Submission and Order Processing services of the application to
continue. If the application is not designed to handle this scenario, the entire application
might go offline. However, if the product data is stored in a different location, then the
Product Catalog module can still be used for viewing products. Other parts of the
application, such as ordering or inventory queries, remain unavailable.

Deciding what reduced application functionality is available is both a business decision and a
technical decision. You must decide how the application will inform the users of any
temporary problems. In the example above, the application might allow viewing products and
adding them to a shopping cart. However, when the user attempts to make a purchase, the
application notifies the user that the ordering functionality is temporarily unavailable. This
isn't ideal for the customer, but it does prevent an application-wide service disruption.
Data strategies for disaster recovery

Proper data handling is a challenging aspect of a disaster recovery plan. During the recovery
process, data restoration typically takes the most time. Different choices for reducing
functionality present difficult challenges for recovering data after a failure and for
maintaining consistency afterward.

One consideration is the need to restore or maintain a copy of the application’s data. You will
use this data for reference and transactional purposes at a secondary site. An on-premises
deployment requires an expensive and lengthy planning process to implement a multiple-
region disaster recovery strategy. Conveniently, most cloud providers, including Azure,
readily allow the deployment of applications to multiple regions. These regions are
geographically distributed in such a way that multiple-region service disruption should be
extremely rare. The strategy for handling data across regions is one of the contributing factors
for the success of any disaster recovery plan.

The following sections discuss disaster recovery techniques related to data backups, reference
data, and transactional data.
Backup and restore

Regular backups of application data can support some disaster recovery scenarios. Different
storage resources require different techniques.
SQL Database

For the Basic, Standard, and Premium SQL Database tiers, you can take advantage of point-
in-time restore to recover your database. For more information, see Overview: Cloud
business continuity and database disaster recovery with SQL Database. Another option is to
use Active Geo-Replication for SQL Database. This automatically replicates database
changes to secondary databases in the same Azure region or even in a different Azure region.
This provides a potential alternative to some of the more manual data synchronization
techniques presented in this article. For more information, see Overview: SQL Database
Active Geo-Replication.

You can also use a more manual approach for backup and restore. Use the DATABASE
COPY command to create a backup copy of the database with transactional consistency. You
can also use the import/export service of Azure SQL Database, which supports exporting
databases to BACPAC files (compressed files containing your database schema and
associated data) that are stored in Azure Blob storage.
The built-in redundancy of Azure Storage creates two replicas of the backup file in the same
region. However, the frequency of running the backup process determines your RPO, which
is the amount of data you might lose in disaster scenarios. For example, imagine that you
perform a backup at the top of each hour, and a disaster occurs two minutes before the top of
the hour. You lose 58 minutes of data recorded after the last backup was performed. Also, to
protect against a region-wide service disruption, you should copy the BACPAC files to an
alternate region. You then have the option of restoring those backups in the alternate region.
For more details, see Overview: Cloud business continuity and database disaster recovery
with SQL Database.
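
A minimal sketch of the export-and-copy approach (Az PowerShell module; all names, storage
contexts, and credentials are placeholders, and $adminPassword is a SecureString):

# Export the database to a BACPAC file in Blob storage.
New-AzSqlDatabaseExport -ResourceGroupName "ContosoDR" `
    -ServerName "contoso-primary-sql" -DatabaseName "OrdersDb" `
    -StorageKeyType "StorageAccessKey" -StorageKey $primaryKey `
    -StorageUri "https://contosodrdata.blob.core.windows.net/backups/OrdersDb.bacpac" `
    -AdministratorLogin $adminUser -AdministratorLoginPassword $adminPassword

# Copy the BACPAC to a storage account in the alternate region, so it can be
# restored there during a region-wide service disruption.
Start-AzStorageBlobCopy -SrcContainer "backups" -SrcBlob "OrdersDb.bacpac" `
    -Context $primaryCtx -DestContainer "backups" -DestContext $secondaryCtx
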
SQL Data Warehouse

For SQL Data Warehouse, use geo-backups to restore to a paired region for disaster recovery.
These backups are taken every 24 hours and can be restored within 20 minutes in the paired
region. This feature is on by default for all SQL data warehouses. For more information on
how to restore your data warehouse, see Restore from an Azure geographical region using
PowerShell.
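
For illustration, a geo-restore of a data warehouse might look like the following sketch
(Az PowerShell module; the server and database names are placeholders):

# Locate the latest geo-backup, then restore it to a server in the paired region.
$geoBackup = Get-AzSqlDatabaseGeoBackup -ResourceGroupName "ContosoDR" `
    -ServerName "contoso-primary-sql" -DatabaseName "ContosoDW"
Restore-AzSqlDatabase -FromGeoBackup -ResourceGroupName "ContosoDR-West" `
    -ServerName "contoso-secondary-sql" -TargetDatabaseName "ContosoDW" `
    -ResourceId $geoBackup.ResourceId
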
Azure Storage

For Azure Storage, you can develop a custom backup process or use one of many third-party
backup tools. Note that most application designs have additional complexities where storage
resources reference each other. For example, consider a SQL database that has a column that
links to a blob in Azure Storage. If the backups do not happen simultaneously, the database
might have a pointer to a blob that was not backed up before the failure. The application or
disaster recovery plan must implement processes to handle this inconsistency after a
recovery.
Other data platforms

Other infrastructure-as-a-service (IaaS) hosted data platforms, such as Elasticsearch or
MongoDB, have their own capabilities and considerations when creating an integrated
backup and restore process. For these data platforms, the general recommendation is to use
any native or available integration-based replication or snapshotting capabilities. If those
capabilities do not exist or are not suitable, then consider using Azure Backup Service or
managed/unmanaged disk snapshots to create a point-in-time copy of application data. In all
cases, it’s important to determine how to achieve consistent backups, especially when
application data spans multiple file systems, or when multiple drives are combined into a
single file system using volume managers or software-based RAID.
Reference data pattern for disaster recovery

Reference data is read-only data that supports application functionality. It typically does not
change frequently. Although backup and restore is one method to handle region-wide service
disruptions, the RTO is relatively long. When you deploy the application to a secondary
region, some strategies can improve the RTO for reference data.

Because reference data changes infrequently, you can improve the RTO by maintaining a
permanent copy of the reference data in the secondary region. This eliminates the time
required to restore backups in the event of a disaster. To meet the multiple-region disaster
recovery requirements, you must deploy the application and the reference data together in
multiple regions. You can deploy reference data to the role itself, to external storage, or to a
combination of both.
The reference data deployment model within compute nodes implicitly satisfies the disaster
recovery requirements. Reference data deployment to SQL Database requires that you deploy
a copy of the reference data to each region. The same strategy applies to Azure Storage. You
must deploy a copy of any reference data that's stored in Azure Storage to the primary and
secondary regions.

You must implement your own application-specific backup routines for all data, including
reference data. Geo-replicated copies across regions are used only in a region-wide service
disruption. To prevent extended downtime, deploy the mission-critical parts of the
application’s data to the secondary region. For an example of this topology, see the active-
passive model.
Transactional data pattern for disaster recovery

Implementation of a fully functional disaster mode strategy requires asynchronous replication
of the transactional data to the secondary region. The practical time windows within which
the replication can occur will determine the RPO characteristics of the application. You might
the replication can occur will determine the RPO characteristics of the application. You might
still recover the data that was lost from the primary region during the replication window.
You might also be able to merge with the secondary region later.

The following architecture examples provide some ideas on different ways of handling
transactional data in a failover scenario. It's important to note that these examples are not
exhaustive. For example, intermediate storage locations such as queues might be replaced
with Azure SQL Database. The queues themselves might be either Azure Storage or Azure
Service Bus queues (see Azure queues and Service Bus queues - compared and contrasted).
Server storage destinations might also vary, such as Azure tables instead of SQL Database. In
addition, worker roles might be inserted as intermediaries in various steps. The intent is not to
emulate these architectures exactly, but to consider various alternatives in the recovery of
transactional data and related modules.
Replication of transactional data in preparation for disaster recovery

Consider an application that uses Azure Storage queues to hold transactional data. This
allows worker roles to process the transactional data to the server database in a decoupled
architecture. This requires the transactions to use some form of temporary caching if the
front-end roles require the immediate query of that data. Depending on the level of data-loss
tolerance, you might choose to replicate the queues, the database, or all of the storage
resources. With only database replication, if the primary region goes down, you can still
recover the data in the queues when the primary region comes back.

The following diagram shows an architecture where the server database is synchronized
across regions.

The biggest challenge to implementing this architecture is the replication strategy between
regions. The Azure SQL Data Sync service enables this type of replication. As of this writing,
the service is in preview and is not yet recommended for production environments. For more
information, see Overview: Cloud business continuity and database disaster recovery with
SQL Database. For production applications, you must invest in a third-party solution or
create your own replication logic in code. Depending on the architecture, the replication
might be bidirectional, which is more complex.

One potential implementation might use the intermediate queue in the previous example. The
worker role that processes the data to the final storage destination might make the change in
both the primary region and the secondary region. These are not trivial tasks, and complete
guidance for replication code is beyond the scope of this article. Invest significant time and
testing into the approach for replicating data to the secondary region. Additional processing
and testing can help ensure that the failover and recovery processes correctly handle any
possible data inconsistencies or duplicate transactions.
 Note

Most of this paper focuses on platform as a service (PaaS). However, additional replication
and availability options are available for hybrid applications that use Azure Virtual
Machines. These hybrid
applications use infrastructure as a service (IaaS) to host SQL Server on virtual machines in
Azure. This allows traditional availability approaches in SQL Server, such as AlwaysOn
Availability Groups or Log Shipping. Some techniques, such as AlwaysOn, work only
between on-premises SQL Server instances and Azure virtual machines. For more
information, see High availability and disaster recovery for SQL Server in Azure Virtual
Machines.
Reduced application functionality for transaction capture

Consider a second architecture that operates with reduced functionality. The application in
the secondary region deactivates nonessential functionality, such as reporting, business
intelligence (BI), or queue draining. It accepts only the most important types of transactional workflows,
as defined by business requirements. The system captures the transactions and writes them to
queues. The system might postpone processing the data during the initial stage of the service
disruption. If the system on the primary region is reactivated within the expected time
window, the worker roles in the primary region can drain the queues. This process eliminates
the need for database merging. If the primary region service disruption goes beyond the
tolerable window, the application can start processing the queues.

In this scenario, the database in the secondary region contains incremental transactional data
that must be merged after the primary is reactivated. The following diagram shows this
strategy for temporarily storing transactional data until the primary region is restored.

For more discussion of data management techniques for resilient Azure applications,
see Failsafe: Guidance for Resilient Cloud Architectures.
Deployment topologies for disaster recovery

You must prepare mission-critical applications to handle region-wide service disruptions.
Incorporate a multiple-region deployment strategy into the operational planning.

Multiple-region deployments might involve IT processes to publish the application and
reference data to the secondary region after a disaster. If the application requires instant
failover, the deployment process might involve an active/passive setup or an active/active
setup. This type of deployment has existing instances of the application running in the
alternate region. A routing service such as Azure Traffic Manager provides load-balancing
services at the DNS level. It can detect service disruptions and route the users to different
regions when needed.

A successful Azure disaster recovery strategy includes building recovery into the solution from
the start. The cloud provides additional options for recovering from failures during a disaster
that are not available in a traditional hosting provider. Specifically, you can dynamically and
quickly allocate resources in a different region, avoiding the cost of idle resources prior to a
failure.

The following sections cover different deployment topologies for disaster recovery.
Typically, there's a tradeoff in increased cost or complexity for additional availability.
Single-region deployment

A single-region deployment is not really a disaster recovery topology, but is meant to contrast
with the other architectures. Single-region deployments are common for applications in
Azure; however, they do not meet the requirements of a disaster recovery topology.

The following diagram depicts an application running in a single Azure region. Azure Traffic
Manager and the use of fault and upgrade domains increase availability of the application
within the region.

In this scenario, the database is a single point of failure. Though Azure replicates the data
across different fault domains to internal replicas, this replication occurs only within the same
region. The application cannot withstand a catastrophic failure. If the region becomes
unavailable, then so do the fault domains, including all service instances and storage
resources.

For all but the least critical applications, you must devise a plan to deploy your applications
across multiple regions. You should also weigh RTO and cost constraints when choosing
which deployment topology to use.

Let's take a look now at specific approaches to supporting failover across different regions.
These examples all use two regions to describe the process.
Failover using Azure Site Recovery

When you enable Azure VM replication using Azure Site Recovery, it creates several
resources in the secondary region:

 Resource group.
 Virtual network (VNet).
 Storage account.
 Availability sets to hold VMs after failover.

Data writes on the VM disks in the primary region are continuously transferred to the storage
account in the secondary region. Recovery points are generated in the target storage account
every few minutes. When you initiate a failover, the recovered VMs are created in the target
resource group, VNet, and availability set. During a failover, you can choose any available
recovery point.
Redeployment to a secondary Azure region

For the approach of redeployment to a secondary region, only the primary region has
applications and databases running. The secondary region is not set up for an automatic
failover. So when a disaster occurs, you must spin up all the parts of the service in the new
region. This includes uploading a cloud service to Azure, deploying the cloud service,
restoring the data, and changing DNS to reroute the traffic.

Although this is the most affordable of the multiple-region options, it has the worst RTO
characteristics. In this model, the service package and database backups are stored either on-
premises or in the Azure Blob storage instance of the secondary region. However, you must
deploy a new service and restore the data before it resumes operation. Even with full
automation of the data transfer from backup storage, provisioning a new database
environment consumes a lot of time. Moving data from the backup disk storage to the empty
database on the secondary region is the most expensive part of the restore process. You must
do this, however, to bring the new database to an operational state because it isn't replicated.

The best approach is to store the service packages in Blob storage in the secondary region.
This eliminates the need to upload the package to Azure, which is what happens when you
deploy from an on-premises development machine. You can quickly deploy the service
packages to a new cloud service from Blob storage by using PowerShell scripts.
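
As a rough sketch (this uses the classic service management PowerShell cmdlets that cloud
services rely on; the service name, package URL, and configuration file path are
hypothetical):

# Deploy a previously staged package directly from Blob storage in the
# secondary region to a new cloud service.
New-AzureDeployment -ServiceName "contoso-dr-service" `
    -Package "https://contosodrwest.blob.core.windows.net/packages/app.cspkg" `
    -Configuration ".\ServiceConfiguration.Cloud.cscfg" `
    -Slot "Production"
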

This option is practical only for non-critical applications that can tolerate a high RTO. For
instance, this might work for an application that can be down for several hours but is required
to be available within 24 hours.

Active-passive

An active-passive topology is the choice that many companies favor. This topology provides
improvements to the RTO with a relatively small increase in cost over the redeployment
approach. In this scenario, there is again a primary and a secondary Azure region. All of the
traffic goes to the active deployment on the primary region. The secondary region is better
prepared for disaster recovery because the database is running on both regions. Additionally,
a synchronization mechanism is in place between them. This standby approach can involve
two variations: a database-only approach or a complete deployment in the secondary region.
Database only

In the first variation of the active-passive topology, only the primary region has a deployed
cloud service application. However, unlike the redeployment approach, both regions are
synchronized with the contents of the database. (For more information, see the section
on transactional data pattern for disaster recovery.) When a disaster occurs, there are fewer
activation requirements. You start the application in the secondary region, change connection
strings to the new database, and change the DNS entries to reroute traffic.

Like the redeployment approach, you should have already stored the service packages in
Azure Blob storage in the secondary region for faster deployment. However, you don’t incur
the majority of the overhead that a database restore operation requires, because the database is
ready and running. This saves a significant amount of time, making this an affordable DR
pattern (and the one most frequently used).
Full replica

In the second variation of the active-passive topology, both the primary region and the
secondary region have a full deployment. This deployment includes the cloud services and a
synchronized database. However, only the primary region is actively handling network
requests from the users. The secondary region becomes active only when the primary region
experiences a service disruption. In that case, all new network requests route to the secondary
region. Azure Traffic Manager can manage this failover automatically.

Failover occurs faster than the database-only variation because the services are already
deployed. This topology provides a very low RTO. The secondary failover region must be
ready to go immediately after failure of the primary region.

Along with a quicker response time, this topology pre-allocates and deploys backup services,
avoiding the possibility of a lack of space to allocate new instances during a disaster. This is
important if your secondary Azure region is nearing capacity. No service-level agreement
(SLA) guarantees that you can instantly deploy one or more new cloud services in any region.

For the fastest response time with this model, you must have similar scale (number of role
instances) in the primary and secondary regions. Despite the advantages, paying for unused
compute instances is costly, and this might not be the most prudent financial choice. Because
of this, it's more common to use a slightly scaled-down version of cloud services on the
secondary region. Then you can quickly fail over and scale out the secondary deployment if
necessary. You should automate the failover process so that after the primary region is
inaccessible, you activate additional instances, depending on the load. This might involve the
use of an autoscaling mechanism like virtual machine scale sets.
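
A minimal sketch of such a scale-out step (Az PowerShell module; the resource group, scale
set name, and target capacity are placeholders):

# Double the capacity of the secondary region's scale set after failover.
$vmss = Get-AzVmss -ResourceGroupName "ContosoDR-West" -VMScaleSetName "webTier"
$vmss.Sku.Capacity = 10
Update-AzVmss -ResourceGroupName "ContosoDR-West" -VMScaleSetName "webTier" `
    -VirtualMachineScaleSet $vmss
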

The following diagram shows the model where the primary and secondary regions contain a
fully deployed cloud service in an active-passive topology.

Active-active

In an active-active topology, the cloud services and database are fully deployed in both
regions. Unlike the active-passive model, both regions receive user traffic. This option yields
the quickest recovery time. The services are already scaled to handle a portion of the load at
each region. DNS is already enabled to use the secondary region. There's additional
complexity in determining how to route users to the appropriate region. Round-robin
scheduling might be possible. It's more likely that certain users would use a specific region
where the primary copy of their data resides.

In case of failover, simply disable DNS to the primary region. This routes all traffic to the
secondary region.

Even in this model, there are some variations. For example, the following diagram depicts a
primary region which owns the master copy of the database. The cloud services in both
regions write to that primary database. The secondary deployment can read from the primary
or replicated database. Replication in this example is one-way.
There is a downside to the active-active architecture in the preceding diagram. The second
region must access the database in the first region because the master copy resides there.
Performance significantly drops off when you access data from outside a region. In cross-
region database calls, you should consider some type of batching strategy to improve the
performance of these calls. For more information, see How to use batching to improve SQL
Database application performance.

An alternative architecture might involve each region accessing its own database directly. In
that model, some type of bidirectional replication is required to synchronize the databases in
each region.

With the previous topologies, decreasing RTO generally increases costs and complexity. The
active-active topology deviates from this cost pattern. In the active-active topology, you
might not need as many instances on the primary region as you would in the active-passive
topology. If you have 10 instances on the primary region in an active-passive architecture,
you might need only 5 in each region in an active-active architecture. Both regions now share
the load. This might be a cost savings over the active-passive topology if you keep a warm
standby on the passive region with 10 instances waiting for failover.

Realize that until you restore the primary region, the secondary region might receive a sudden
surge of new users. If there are 10,000 users in each region when the primary region
experiences a service disruption, the secondary region suddenly has to handle 20,000 users.
Monitoring rules on the secondary region must detect this increase and double the instances
in the secondary region. For more information on this, see the section on failure detection.
Hybrid on-premises and cloud solution

One additional strategy for disaster recovery is to architect a hybrid application that runs on-
premises and in the cloud. Depending on the application, the primary region might be either
location. Consider the previous architectures and imagine the primary or secondary region as
an on-premises location.

There are some challenges in these hybrid architectures. First, most of this article has
addressed PaaS architecture patterns. Typical PaaS applications in Azure rely on Azure-
specific constructs such as roles, cloud services, and Traffic Manager. Creating an on-
premises solution for this type of PaaS application would require a significantly different
architecture. This might not be feasible from a management or cost perspective.

However, a hybrid solution for disaster recovery has fewer challenges for traditional
architectures that have been migrated to the cloud, such as IaaS-based architectures. IaaS
applications use virtual machines in the cloud that can have direct on-premises equivalents.
You can also use virtual networks to connect machines in the cloud with on-premises
network resources. This allows several possibilities that are not possible with PaaS-only
applications. For example, SQL Server can take advantage of disaster recovery solutions such
as AlwaysOn Availability Groups and database mirroring. For details, see High availability
and disaster recovery for SQL Server in Azure virtual machines.
IaaS solutions also provide an easier path for on-premises applications to use Azure as the
failover option. You might have a fully functioning application in an existing on-premises
region. However, what if you lack the resources to maintain a geographically separate region
for failover? You might decide to use virtual machines and virtual networks to get your
application running in Azure. In that case, define processes that synchronize data to the
cloud. The Azure deployment then becomes the secondary region to use for failover. The
primary region remains the on-premises application. For more information about IaaS
architectures and capabilities, see the Virtual Machines documentation.
Alternative cloud

There are situations where the broad capabilities of Microsoft Azure still may not meet
internal compliance rules or policies required by your organization. Even the best preparation
and design to implement backup systems during a disaster are inadequate during a global
service disruption of a cloud service provider.

You should compare availability requirements with the cost and complexity of increased
availability. Perform a risk analysis, and define the RTO and RPO for your solution. If your
application cannot tolerate any downtime, you might consider using an additional cloud
solution. Unless the entire Internet goes down, another cloud solution might still be available
if Azure becomes globally inaccessible.

As with the hybrid scenario, the failover deployments in the previous disaster recovery
architectures can also exist within another cloud solution. Alternative cloud DR sites should
be used only for solutions whose RTO allows very little, if any, downtime. Note that a
solution that uses a DR site outside Azure will require more work to configure, develop,
deploy, and maintain. It's also more difficult to implement proven practices in a cross-cloud
architecture. Although cloud platforms have similar high-level concepts, the APIs and
architectures are different.

If your DR strategy relies upon multiple cloud platforms, it's valuable to include abstraction
layers in the design of the solution. This eliminates the need to develop and maintain two
different versions of the same application for different cloud platforms in case of disaster. As
with the hybrid scenario, the use of Azure Virtual Machines or Azure Container Service
might be easier in these cases than the use of cloud-specific PaaS designs.
Automation

Some of the patterns that we just discussed require quick activation of offline deployments as
well as restoration of specific parts of a system. Automation scripts can activate resources on
demand and deploy solutions rapidly. The DR-related automation examples below use Azure
PowerShell, but the Azure CLI and the Service Management REST API are also good options.

Automation scripts manage aspects of DR not transparently handled by Azure. This produces
consistent and repeatable results, minimizing human error. Predefined DR scripts also reduce
the time to rebuild a system and its constituent parts during a disaster. You don’t want to try
to manually figure out how to restore your site while it's down and losing money every
minute.
Test your scripts repeatedly from start to finish. After verifying their basic functionality,
make sure to test them in a disaster simulation. This helps uncover defects in the scripts or
processes.

A best practice with automation is to create a repository of PowerShell scripts or
command-line interface (CLI) scripts for Azure disaster recovery. Clearly mark and categorize them for
quick access. Designate a primary person to manage the repository and versioning of the
scripts. Document them well with explanations of parameters and examples of script use.
Also ensure that you keep this documentation in sync with your Azure deployments. This
underscores the purpose of having a primary person in charge of all parts of the repository.
Failure detection

To correctly handle problems with availability and disaster recovery, you must be able to
detect and diagnose failures. Perform advanced server and deployment monitoring to quickly
recognize when a system or its components suddenly become unavailable. Monitoring tools
that assess the overall health of the cloud service and its dependencies can perform part of
this work. One suitable Microsoft tool is System Center 2016. Third-party tools can also
provide monitoring capabilities. Most monitoring solutions track key performance counters
and service availability.

Although these tools are vital, you must plan for fault detection and reporting within a cloud
service. You must also plan to properly use Azure Diagnostics. Custom performance counters
or event-log entries can also be part of the overall strategy. This provides more data during
failures to quickly diagnose the problem and restore full capabilities. It also provides
additional metrics that the monitoring tools can use to determine application health. For more
information, see Enabling Azure Diagnostics in Azure Cloud Services. For a discussion of
how to plan for an overall “health model,” see Failsafe: Guidance for Resilient Cloud
Architectures.
Disaster simulation

Simulation testing involves creating small real-life situations on the work floor to observe
how the team members react. Simulations also show how effective the solutions are in the
recovery plan. Execute simulations so that the created scenarios don't disrupt actual business,
while still feeling like real situations.

Consider architecting a type of “switchboard” in the application to manually simulate
availability issues. For instance, through a soft switch, trigger database access exceptions for
an ordering module by causing it to malfunction. You can take similar lightweight
approaches for other modules at the network interface level.

The simulation highlights any issues that were inadequately addressed. The simulated
scenarios must be completely controllable. This means that, even if the recovery plan seems
to be failing, you can restore the situation back to normal without causing any significant
damage. It’s also important that you inform higher-level management about when and how
the simulation exercises will be executed. This plan should detail the time or resources
affected during the simulation. Also define the measures of success when testing your
disaster recovery plan.

If you are using Azure Site Recovery, you can execute a test failover to Azure, to validate
your replication strategy or perform a disaster recovery drill without any data loss or
downtime. A test failover does not affect the ongoing VM replication or your production
environment.
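
A hedged sketch of a scripted test failover (Az PowerShell module; the vault, fabric, VM
name, and test network ID are placeholders):

# Select the Recovery Services vault and find the protected VM.
$vault = Get-AzRecoveryServicesVault -Name "ContosoVault"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault
$fabric = Get-AzRecoveryServicesAsrFabric -FriendlyName "East US"
$container = Get-AzRecoveryServicesAsrProtectionContainer -Fabric $fabric
$rpi = Get-AzRecoveryServicesAsrReplicationProtectedItem `
    -ProtectionContainer $container -FriendlyName "web-vm-01"

# Run the test failover into an isolated test network, then clean up the
# test VMs without touching ongoing replication.
Start-AzRecoveryServicesAsrTestFailoverJob -ReplicationProtectedItem $rpi `
    -Direction PrimaryToRecovery -AzureVMNetworkId $testVnetId
Start-AzRecoveryServicesAsrTestFailoverCleanupJob -ReplicationProtectedItem $rpi
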

Several other techniques can test disaster recovery plans. However, most of them are simply
variations of these basic techniques. The intent of this testing is to evaluate the feasibility of
the recovery plan. Disaster recovery testing focuses on the details to discover gaps in the
basic recovery plan.
Service-specific guidance

The following topics describe disaster recovery guidance for specific Azure services:


 Azure Database for MySQL – Overview of business continuity with Azure Database for MySQL
 Azure Database for PostgreSQL – Overview of business continuity with Azure Database for PostgreSQL
 Cloud Services – What to do in the event of an Azure service disruption that impacts Azure Cloud Services
 Cosmos DB – Automatic regional failover for business continuity in Azure Cosmos DB
 Key Vault – Azure Key Vault availability and redundancy
 Storage – What to do if an Azure Storage outage occurs
 SQL Database – Restore an Azure SQL Database or failover to a secondary
 Virtual machines – What to do in the event that an Azure service disruption impacts Azure virtual machines
 Virtual networks – Virtual Network – Business Continuity

Availability checklist
 11/26/2018
 11 minutes to read
 Contributors
o

o all

Availability is the proportion of time that a system is functional and working, and is
one of the pillars of software quality. Use this checklist to review your application
architecture from an availability standpoint.
Application design

Avoid any single point of failure. All components, services, resources, and compute
instances should be deployed as multiple instances to prevent a single point of
failure from affecting availability. This includes authentication mechanisms. Design
the application to be configurable to use multiple instances, and to automatically
detect failures and redirect requests to non-failed instances where the platform does
not do this automatically.

Decompose workloads by service-level objective. If a service is composed of critical and
less-critical workloads, manage them differently and specify the service features and number
of instances to meet their availability requirements.

Minimize and understand service dependencies. Minimize the number of different services
used where possible, and ensure you understand all of the feature and service dependencies
that exist in the system. This includes the nature of these dependencies, and the impact of
failure or reduced performance in each one on the overall application.

Design tasks and messages to be idempotent where possible. An operation is idempotent if it
can be repeated multiple times and produce the same result. Idempotency can ensure that
duplicated requests don't cause problems. Message consumers and the operations they carry
out should be idempotent so that repeating a previously executed operation does not render
the results invalid. This may mean detecting duplicated messages, or ensuring consistency by
using an optimistic approach to handling conflicts.

Use a message broker that implements high availability for critical transactions. Many
cloud applications use messaging to initiate tasks that are performed asynchronously. To
guarantee delivery of messages, the messaging system should provide high availability. Azure
Service Bus Messaging implements at-least-once semantics. This means that a message posted
to a queue will not be lost, although duplicate copies may be delivered under certain
circumstances. If message processing is idempotent (see the previous item), repeated
delivery should not be a problem.

Design applications to gracefully degrade. The load on an application may exceed the
capacity of one or more parts, causing reduced availability and failed connections. Scaling
can help to alleviate this, but it may reach a limit imposed by other factors, such as
resource availability or cost. When an application reaches a resource limit, it should take
appropriate action to minimize the impact for the user. For example, in an ecommerce system,
if the order-processing subsystem is under strain or fails, it can be temporarily disabled
while allowing other functionality, such as browsing the product catalog. It might be
appropriate to postpone requests to a failing subsystem, for example still enabling
customers to submit orders but saving them for later processing, when the orders subsystem
is available again.

Gracefully handle rapid burst events. Most applications need to handle varying workloads
over time. Auto-scaling can help to handle the load, but it may take some time for
additional instances to come online and handle requests. Prevent sudden and unexpected
bursts of activity from overwhelming the application: design it to queue requests to the
services it uses and degrade gracefully when queues are near to full capacity. Ensure that
there is sufficient performance and capacity available under non-burst conditions to drain
the queues and handle outstanding requests. For more information, see the Queue-Based Load
Leveling pattern.
Deployment and maintenance

Deploy multiple instances of services. If your application depends on a single instance of
a service, it creates a single point of failure. Provisioning multiple instances improves
both resiliency and scalability. For Azure App Service, select an App Service Plan that
offers multiple instances. For Azure Cloud Services, configure each of your roles to use
multiple instances. For Azure Virtual Machines (VMs), ensure that your VM architecture
includes more than one VM and that each VM is included in an availability set.

Consider deploying your application across multiple regions. If your application is
deployed to a single region, in the rare event the entire region becomes unavailable, your
application will also be unavailable. This may be unacceptable under the terms of your
application's SLA. If so, consider deploying your application and its services across
multiple regions.

Automate and test deployment and maintenance tasks. Distributed applications consist of
multiple parts that must work together. Deployment should be automated, using tested and
proven mechanisms such as scripts. These can update and validate configuration, and automate
the deployment process. Use Azure Resource Manager templates to provision Azure resources.
Also use automated techniques to perform application updates. It is vital to test all of
these processes fully to ensure that errors do not cause additional downtime. All deployment
tools must have suitable security restrictions to protect the deployed application; define
and enforce deployment policies carefully and minimize the need for human intervention.

Use staging and production features of the platform. For example, Azure App
Service supports deployment slots, which you can use to stage a deployment before
swapping it to production. Azure Service Fabric supports rolling upgrades to
application services.

Place virtual machines (VMs) in an availability set. To maximize availability, create
multiple instances of each VM role and place these instances in the same availability set.
If you have multiple VMs that serve different roles, such as different application tiers,
create an availability set for each VM role. For example, create an availability set for the
web tier and another for the data tier.
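
A minimal sketch of this layout (Az PowerShell module; the names, image, and location are
placeholders):

# An availability set for the web tier; repeat per tier.
New-AzAvailabilitySet -ResourceGroupName "ContosoDR" -Name "webTierAvSet" `
    -Location "eastus" -Sku "Aligned" `
    -PlatformFaultDomainCount 2 -PlatformUpdateDomainCount 5

# Place each web-tier VM into the set as it is created.
New-AzVM -ResourceGroupName "ContosoDR" -Name "web-vm-01" -Location "eastus" `
    -AvailabilitySetName "webTierAvSet" -Image "Win2016Datacenter" `
    -Credential (Get-Credential)
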

Replicate VMs using Azure Site Recovery. To maximize availability, replicate all
your virtual machines into another Azure region using Site Recovery. Ensure that all
the VMs across all the tiers of your application are replicated. If there is a disruption
in the source region, you can fail over the VMs into the other region within minutes.
Data management

Geo-replicate data in Azure Storage. Data in Azure Storage is automatically replicated
within a datacenter. For even higher availability, use read-access geo-redundant storage
(RA-GRS), which replicates your data to a secondary region and provides read-only access to
the data in the secondary location. The data is durable even in the case of a complete
regional outage or a disaster. For more information, see Azure Storage replication.
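
For example (Az PowerShell module; the account name is a placeholder), upgrading an existing
account to RA-GRS is a one-line change, and the read-only secondary endpoint follows the
<account>-secondary naming convention:

# Switch the account's replication to read-access geo-redundant storage.
Set-AzStorageAccount -ResourceGroupName "ContosoDR" -Name "contosodrdata" `
    -SkuName "Standard_RAGRS"
# Read-only secondary endpoint (for the hypothetical account above):
# https://contosodrdata-secondary.blob.core.windows.net
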

Geo-replicate databases. Azure SQL Database and Cosmos DB both support geo-
replication, which enables you to configure secondary database replicas in other
regions. Secondary databases are available for querying and for failover in the case
of a data center outage or the inability to connect to the primary database. For more
information, see Failover groups and active geo-replication (SQL Database) and How
to distribute data globally with Azure Cosmos DB.

Use optimistic concurrency and eventual consistency. Transactions that block access to
resources through locking (pessimistic concurrency) can cause poor performance and
considerably reduce availability. These problems can become especially acute in distributed
systems. In many cases, careful design and techniques such as partitioning can minimize the
chances of conflicting updates occurring. Where data is replicated, or is read from a
separately updated store, the data will only be eventually consistent. But the advantages of
this approach usually far outweigh the availability impact of using transactions to ensure
immediate consistency.

Use periodic backup and point-in-time restore. Regularly and automatically back
up data that is not preserved elsewhere, and verify you can reliably restore both the
data and the application itself should a failure occur. Ensure that backups meet your
Recovery Point Objective (RPO). Data replication is not a backup feature, because
human error or malicious operations can corrupt data across all the replicas. The
backup process must be secure to protect the data in transit and in storage.
Databases or parts of a data store can usually be recovered to a previous point in
time by using transaction logs. For more information, see Recover from data
corruption or accidental deletion.

Replicate VM disks using Azure Site Recovery. When you replicate Azure VMs
using Site Recovery, all the VM disks are continuously replicated to the target region
asynchronously. The recovery points are created every few minutes. This gives you an
RPO in the order of minutes.
Errors and failures

Configure request timeouts. Services and resources may become unavailable,


causing requests to fail. Ensure that the timeouts you apply are appropriate for each
service or resource as well as the client that is accessing them. In some cases, you
might allow a longer timeout for a particular instance of a client, depending on the
context and other actions that the client is performing. Very short timeouts may
cause excessive retry operations for services and resources that have considerable
latency. Very long timeouts can cause blocking if a large number of requests are
queued, waiting for a service or resource to respond.
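A minimal sketch of per-service timeouts, assuming the requests library; the service names and values are illustrative only:

```python
import requests

TIMEOUTS = {
    # (connect seconds, read seconds) -- illustrative values only
    "catalog-api": (2, 5),      # fast internal service
    "report-api": (2, 30),      # known to have considerable latency
}

def call(service: str, url: str) -> requests.Response:
    """Issue a GET with a timeout tuned to the target service."""
    try:
        return requests.get(url, timeout=TIMEOUTS[service])
    except requests.Timeout:
        # Surface timeouts explicitly so the caller can retry or fall back.
        raise
```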
Retry failed operations caused by transient faults. Design a retry strategy for
access to all services and resources where they do not inherently support automatic
connection retry. Use a strategy that includes an increasing delay between retries as
the number of failures increases, to prevent overloading of the resource and to allow
it to gracefully recover and handle queued requests. Continual retries with very short
delays are likely to exacerbate the problem. For more information, see Retry
guidance for specific services.
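A minimal sketch of such a strategy (exponential backoff with jitter); the TransientError type is a placeholder for whatever transient fault the target service actually raises:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for the target service's transient fault type."""

def retry(operation, retries=5, base_delay=0.5, max_delay=30.0):
    """Run operation(), retrying transient failures with growing delays."""
    for attempt in range(retries):
        try:
            return operation()
        except TransientError:
            if attempt == retries - 1:
                raise                      # out of attempts; propagate
            # Exponential backoff, capped, with jitter so that many clients
            # do not retry in lockstep and overload the recovering resource.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))
```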

Implement circuit breaking to avoid cascading failures. There may be situations
in which transient or other faults, ranging in severity from a partial loss of
connectivity to the complete failure of a service, take much longer than expected to
return to normal. In addition, if a service is very busy, failure in one part of the
system may lead to cascading failures, and result in many operations becoming
blocked while holding onto critical system resources such as memory, threads, and
database connections.
Instead of continually retrying an operation that is unlikely to succeed, the
application should quickly accept that the operation has failed, and gracefully handle
this failure. Use the Circuit Breaker pattern to reject requests for specific operations
for defined periods. For more information, see the Circuit Breaker pattern.
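A minimal sketch of the pattern; the thresholds are illustrative, and a production implementation would typically add per-operation state, half-open trial limits, and logging:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: reject immediately instead of retrying a call
                # that is unlikely to succeed.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

While the circuit is open, callers fail fast and can take the fallback paths described below instead of holding threads and connections.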

Compose or fall back to multiple components. Design applications to use multiple
instances without affecting operation and existing connections where possible.
Distribute requests between the instances, and detect and avoid sending requests to
failed instances, in order to maximize availability.
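As a sketch, a simple rotation that skips instances which recently failed; the endpoints and cool-down period are hypothetical:

```python
import itertools
import time

ENDPOINTS = ["https://app-1.example.com", "https://app-2.example.com"]
_unhealthy: dict[str, float] = {}   # endpoint -> time it was marked failed
COOL_DOWN = 60.0                    # seconds before retrying an instance

_rotation = itertools.cycle(ENDPOINTS)

def next_endpoint() -> str:
    """Return the next instance, skipping recently failed ones."""
    for _ in range(len(ENDPOINTS)):
        candidate = next(_rotation)
        failed_at = _unhealthy.get(candidate)
        if failed_at is None or time.monotonic() - failed_at > COOL_DOWN:
            return candidate
    raise RuntimeError("no healthy instances available")

def mark_failed(endpoint: str) -> None:
    """Record a failure so the instance is avoided for a while."""
    _unhealthy[endpoint] = time.monotonic()
```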

Fall back to a different service or workflow. For example, if writing to SQL
Database fails, temporarily store data in blob storage or Redis Cache. Provide a way
to replay the writes to SQL Database when the service becomes available. In some
cases, a failed operation may have an alternative action that allows the application to
continue to work even when a component or service fails. If possible, detect failures
and redirect requests to other services that can offer suitable alternative
functionality, or to backup instances with reduced functionality that can maintain
core operations while the primary service is offline.
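A minimal sketch of the SQL-to-blob fallback described above, assuming the azure-storage-blob package; save_to_sql, the connection string, and the container name are hypothetical placeholders:

```python
import json
import uuid

from azure.storage.blob import ContainerClient

pending = ContainerClient.from_connection_string(
    "<connection-string>", "pending-writes")  # hypothetical names

def save_to_sql(order: dict) -> None:
    """Hypothetical primary write path to SQL Database."""
    ...

def save_order(order: dict) -> None:
    try:
        save_to_sql(order)
    except Exception:
        # Fall back: park the write durably in blob storage for replay.
        pending.upload_blob(f"{uuid.uuid4()}.json", json.dumps(order))

def replay_pending() -> None:
    """Run periodically once SQL Database is reachable again."""
    for blob in pending.list_blobs():
        order = json.loads(pending.download_blob(blob.name).readall())
        save_to_sql(order)
        pending.delete_blob(blob.name)
```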
Monitoring and disaster recovery

Provide rich instrumentation for likely failures and failure events to report the
situation to operations staff. For failures that are likely but have not yet occurred,
provide sufficient data to enable operations staff to determine the cause, mitigate
the situation, and ensure that the system remains available. For failures that have
already occurred, the application should return an appropriate error message to the
user but attempt to continue running, albeit with reduced functionality. In all cases,
the monitoring system should capture comprehensive details to enable operations
staff to effect a quick recovery, and if necessary, for designers and developers to
modify the system to prevent the situation from arising again.
Monitor system health by implementing checking functions. The health and
performance of an application can degrade over time, without being noticeable until
it fails. Implement probes or check functions that are executed regularly from outside
the application. These checks can be as simple as measuring response time for the
application as a whole, for individual parts of the application, for individual services
that the application uses, or for individual components. Check functions can execute
processes to ensure they produce valid results, measure latency and check
availability, and extract information from the system.
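For example, a minimal external probe that measures response time; the /health endpoint is an assumption about the application being checked:

```python
import time

import requests

def probe(url: str = "https://myapp.example.com/health", timeout: float = 5.0):
    """Return (healthy, latency_seconds) for one external check."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout)
        healthy = response.status_code == 200
    except requests.RequestException:
        healthy = False
    return healthy, time.monotonic() - start

# Run this on a schedule from outside the application (e.g. a cron job),
# and alert when the check fails repeatedly or latency degrades.
```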

Regularly test all failover and fallback systems. Changes to systems and
operations may affect failover and fallback functions, but the impact may not be
detected until the main system fails or becomes overloaded. Test these functions
before they are required to compensate for a live problem. If you are using Azure Site
Recovery to replicate VMs, run disaster recovery drills periodically by doing a test
failover. For more information, see Run a disaster recovery drill to Azure.

Test the monitoring systems. Automated failover and fallback systems, and manual
visualization of system health and performance by using dashboards, all depend on
monitoring and instrumentation functioning correctly. If these elements fail, miss
critical information, or report inaccurate data, an operator might not realize that the
system is unhealthy or failing.

Track the progress of long-running workflows and retry on failure. Long-
running workflows are often composed of multiple steps. Ensure that each step is
independent and can be retried to minimize the chance that the entire workflow will
need to be rolled back, or that multiple compensating transactions need to be
executed. Monitor and manage the progress of long-running workflows by
implementing a pattern such as the Scheduler Agent Supervisor pattern.
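As a sketch of the step-tracking idea (not the full Scheduler Agent Supervisor pattern), with an in-memory dict standing in for a durable state store and no-op step functions as placeholders:

```python
completed: dict[str, set[str]] = {}   # workflow id -> finished step names

def run_workflow(workflow_id: str, steps: list) -> None:
    """Run each step once; on retry, skip steps that already finished."""
    done = completed.setdefault(workflow_id, set())
    for step in steps:
        if step.__name__ in done:
            continue                   # already finished; skip on retry
        step(workflow_id)              # each step must be idempotent
        done.add(step.__name__)        # persist progress after each step

# Hypothetical workflow steps:
def reserve_stock(workflow_id): ...
def charge_payment(workflow_id): ...
def ship_order(workflow_id): ...

# On failure, rerun with the same id; completed steps are not repeated,
# so the whole workflow never has to be rolled back and restarted.
run_workflow("order-42", [reserve_stock, charge_payment, ship_order])
```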

Plan for disaster recovery. Create an accepted, fully tested plan for recovery from
any type of failure that may affect system availability. Choose a multi-site disaster
recovery architecture for any mission-critical applications. Identify a specific owner of
the disaster recovery plan, including automation and testing. Ensure the plan is well
documented, and automate the process as much as possible. Establish a backup
strategy for all reference and transactional data, and test the restoration of these
backups regularly. Train operations staff to execute the plan, and perform regular
disaster simulations to validate and improve the plan. If you are using Azure Site
Recovery to replicate VMs, create a fully automated recovery plan to fail over the
entire application within minutes.
