Cloud Computing
Estimated Reading Time: 3 minutes
You have probably heard a lot about the cloud. Let's understand each component of the cloud on this page, and also look at how organizations implement the various cloud models. Before going further, let's clarify what cloud computing is all about. As per the general dictionary definition:
“Cloud computing is a model for enabling convenient, on-demand network access to a shared pool
of configurable computing resources (e.g., networks, servers, storage, applications and services) that
can be rapidly provisioned and released with minimal management effort or service provider
interaction.”
The main characteristics of cloud computing are as follows:
On-demand self-service:
The consumer must be able to provision the service themselves, without any human intervention, and the service is provisioned almost instantly. So an infrastructure using server virtualization that needs an administrator to manually provision a new virtual machine is not cloud; having to wait days for a service to be made available to the requester is not cloud either.
Resource pooling:
The resources of the cloud provider are pooled and can be consumed by multiple customers. A subset of the pool, consisting of storage, processing, and networking, is assigned to the consumer and can be configured when needed or requested.
Rapid elasticity:
The capacity delivered by the cloud service must easily and quickly be scaled up or scaled down
to meet the changes in demand.
Cloud computing services can be categorized into three service delivery models:
• Software as a service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
The image above represents each component of the Azure public cloud, and it is easy to understand.
Recently a few people have asked me about Azure VM charges, so let's see how Azure bills for a VM.
Please note that when a new VM spins up, Azure starts charging for the following meters:
• Compute hours
• IP address hours (if the public IP is static)
• Data Transfer Out (data leaving the datacenter)
• Standard IO: Block Blob Read, Write, Delete
• Standard Managed Disk Operations
For any VM that is already built in the portal, please note the following rules of thumb:
• When the VM is up and running, Azure charges for all of the above.
• When the VM is deallocated, Azure charges only for storage, based on used capacity for unmanaged disks: if you have a 1 TB disk of which only 1 GB is in use, you are charged for 1 GB. Note that this does not apply to managed disks: if you have a 1 TB managed disk and the VM is deallocated, Azure will still charge for the full 1 TB.
• When the VM is shut down but not deallocated, you are charged for both storage and compute resources.
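As a quick sanity check, the thumb rule above can be modeled in a few lines. This is a simplified sketch with hypothetical rates, not actual Azure prices, and real billing involves many more meters:

```python
# Simplified model of the VM billing thumb rule described above.
# The rates are hypothetical placeholders, not real Azure prices.
COMPUTE_RATE = 0.10   # $ per compute hour (illustrative)
STORAGE_RATE = 0.02   # $ per GB per hour (illustrative)

def hourly_charge(state, disk_gb, used_gb, managed_disk):
    """Charge for one hour, given the VM power state.

    Unmanaged disks bill on used capacity; managed disks bill on
    provisioned size even while the VM is deallocated.
    """
    storage = STORAGE_RATE * (disk_gb if managed_disk else used_gb)
    if state == "running":
        return COMPUTE_RATE + storage   # compute + storage
    if state == "deallocated":
        return storage                  # storage only
    if state == "stopped":              # shut down from the OS, not deallocated
        return COMPUTE_RATE + storage   # still billed for compute
    raise ValueError(f"unknown state: {state}")

# 1 TB unmanaged disk with 1 GB used, VM deallocated: billed on 1 GB only.
print(hourly_charge("deallocated", disk_gb=1024, used_gb=1, managed_disk=False))  # → 0.02
```

Swapping `managed_disk=True` in the last call shows the managed-disk behavior: the full provisioned 1024 GB is billed even while deallocated.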
Where can you check the billing information? In the Azure Portal, you can check resource costs from the Cost Management + Billing service.
As many of you know, Azure ExpressRoute lets us extend our on-premises infrastructure to the Azure cloud over a private, dedicated connection. With ExpressRoute, you can establish connections to the Microsoft Azure network.
If you come from an Azure pre-sales background, or are an Azure architect like myself, you will be aware that in most solutions/designs we generally recommend that customers use an Azure ExpressRoute circuit to connect their on-premises environment to Azure. However, post-deployment there are many challenges customers may face: they may not be able to utilize the complete network bandwidth, or they may hit latency issues, slowness, and more.
The network utilization below is an example of such a scenario: although the customer has purchased a 500 Mbps pipe, it is underutilized. So monitoring the ExpressRoute circuit becomes very important post-deployment.
The best way to monitor the ExpressRoute circuit is to deploy NPM from the Azure Marketplace.
Network Performance Monitor (NPM) is a cloud-based hybrid network monitoring solution that
helps you monitor network performance between various points in your network infrastructure,
monitor network connectivity to applications and monitor the performance of your Azure
ExpressRoute.
At the time of writing, the workspace can be created in the following regions:
West Europe
West Central US
East US
Southeast Asia
Australia Southeast
Don't think that you can't monitor ExpressRoute circuits in other regions: the regions above refer only to the workspace location and have nothing to do with ExpressRoute monitoring between other regions.
Since I already had an OMS workspace created, I linked the existing OMS workspace. In the next step it will submit the deployment, and then you can see that the deployment succeeded. In the last screen you can view the NPM solution just deployed in East US.
After the workspace has been deployed, navigate to the NetworkMonitoring(name) resource that we created. Validate the settings, then click Solution requires additional configuration.
The next step is to download and configure the agent setup file.
Go to the Common Settings tab of the Network Performance Monitor Configuration page
for your resource. Click the agent that corresponds to your server’s processor from the Install
OMS Agents section and download the setup file.
Next, copy the Workspace ID and Primary Key to Notepad.
From the Configure OMS Agents for monitoring using TCP protocol section, download the PowerShell script. This firewall script will add rules to the Windows Firewall.
The agent must be installed on a Windows Server (2008 SP1 or later).
I have chosen to install it on a Windows 2008 R2 computer; this is an on-premises server that acts as an Oracle database server.
Here are the steps to install the NPM agent on the server.
The first step is to run the setup; once you launch it, you will see the following screen. On the welcome page, click Next.
On the Agent Setup Options page, you can choose to connect the agent to Azure Log Analytics
or Operations Manager. Or, you can leave the choices blank if you want to configure the agent
later. After making your selection(s), click Next.
If you chose to connect to Azure Log Analytics, paste the Workspace ID and Workspace Key (Primary Key) that you copied into Notepad in the previous section. Then click Next.
Go to the Windows Control Panel and open the Microsoft Monitoring Agent.
Click on the Azure Log Analytics tab. In the Status column, you should see that the agent connected successfully to Log Analytics.
Now open PowerShell on the agent computer and run the script you downloaded earlier; the command below will create the firewall rules in Windows Firewall.
Thanks for your time, and have a great day ahead.
We have installed the gateway on the same SQL Server where the source data resides. We chose to install the binaries on the C drive; the next step is to click the Install button. The next screen will show the following message. Once the installation is complete, you need to register the gateway.
This post is targeted at BI developers and system admins who are interested in configuring and working with SQL Server Analysis Services in Azure (called Azure Analysis Services). You can connect to it from:
1. MS Excel
2. MS Power BI
3. Tableau and other data visualization tools
How are the data models developed?
Data model development is generally carried out in SSDT (SQL Server Data Tools for Visual Studio), which is available as part of the Visual Studio add-on installations. Developers generally build a tabular or multidimensional data model project in Visual Studio, deploy the model as a database to a server instance, set up recurring data processing, and assign permissions to allow data access by end users. When it's ready to go, your semantic data model can be accessed by client applications that support Analysis Services as a data source.
What is the advantage of choosing Azure Analysis Services instead of SQL Server Analysis Services?
Azure Analysis Services has many advantages within Azure. As per Microsoft, it integrates with many Azure services, enabling you to build sophisticated analytics solutions. Integration with Azure Active Directory provides secure, role-based access to your critical data. You can integrate with Azure Data Factory pipelines by including an activity that loads data into the model, and Azure Automation and Azure Functions can be used for lightweight orchestration of models using custom code. It's also a complete PaaS solution offered by Microsoft, so it's super easy to deploy and can scale out and in.
Will my data be secure with Azure Analysis Services?
As per Microsoft, Azure Analysis Services uses Azure Blob storage to persist storage and metadata for Analysis Services databases. Data files within Blob storage are encrypted using Azure Blob Server-Side Encryption (SSE). When using DirectQuery mode, only metadata is stored; the actual data is accessed from the data source at query time.
Let's get our hands dirty and see how we configured Azure Analysis Services.
If you go to All services and type "anal", it will show Analysis Services, as you can see below. Once you click on Analysis Services you will see the following screen; you can click the + Add button to configure Analysis Services. The next step is to click Create.
Please remember, if you don't want to use an existing storage account, you can create a new storage account, and you should add a container in it for the backup. As you can see, I named the blob storage container "backup".
Once Analysis Services is ready, you can view the following screen. The next step is to view what was created.
The screen above shows the Analysis Services instance that has been created. The server name is what is required to connect to this Analysis Services instance from Visual Studio SSDT or Power BI.
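For reference, Azure Analysis Services server names follow the asazure:// scheme, so the connection from SSDT or Power BI typically uses a name of this form (the region and server name below are placeholders, not the actual instance from this post):

```text
asazure://<region>.asazure.windows.net/<servername>
e.g. asazure://eastus.asazure.windows.net/myanalysisserver
```

You can copy the exact value from the server's overview blade in the portal.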
To connect from SSDT, you need to download and install SSDT from the internet; here is the download URL for SSDT. In my next post I will show how we can connect to this URL from SSDT and import the data. In the next post we will also need to bypass an error related to connecting to a SQL Server data source located in an IaaS VM in Azure, which will lead to the installation of the unified gateway. I will also cover the step-by-step installation of the gateway in my next post. Stay tuned till then.
Open SSDT (SQL Server Data Tools) from your program files and create a new project. You need to select the third option, Analysis Services Tabular Project.
In the next step you need to provide the URL of the Analysis Services instance that we created in my last post.
Once you click Test Connection, it should show that the test connection succeeded.
In the next step we need to connect a SQL Server data source from which we will fetch the data from a test table for Analysis Services. In our case the data source is a SQL Server that resides in an IaaS VM in Azure.
In the next step you need to provide the connection name and the SQL Server instance name to connect to.
The next step is where you need to provide the impersonation information.
The next step shows the list of tables from which you can choose to import data.
In the next step, the table import wizard will show the table name, and when you reach the last step, expecting it to succeed, you will instead get the error I mentioned in my last post.
The error is confusing, since it indicates "On-Premise Gateway is required to access the data source". Since our SQL database is located in an Azure VM, we were confused about why it was complaining. We searched Google and didn't find an answer to this question. Later we decided to deploy the gateway, based on the statement below, which was in our mind:
"SQL Analysis Services treats any IaaS-based SQL data source as an on-premises data source."
The next step is to install the on-premises data gateway. To cover the enterprise data gateway in more depth, I am going to write a separate post on how to create an on-premises gateway for SQL Server Analysis Services next on this blog.
Assuming the gateway is created and installed in an IaaS or on-premises VM, you also have to create the same on-premises data gateway in Azure.
Please follow the steps below to add the gateway in the Azure Portal.
Once the gateway is created in the Azure Portal, you need to go to the Analysis Services instance and connect this gateway as shown below: in Analysis Services, choose the On-Premises Data Gateway and pick the gateway name from the drop-down list.
Once it's configured, we went back to SSDT and tried to import the table again. This time we used the VS 2017 data source from the drop-down list, so the UI will look a little different, but it works with VS 2015 as well.
Assuming you have already created the Analysis Services Tabular project by following the steps I showed at the beginning of this article, and you are at the stage where you need to import the data from a table, here is what you need to do.
The next step is to select the table, and it will show the table data. (For security reasons I can't show the table data.)
"SQL Analysis Services treats any IaaS-based SQL data source as an on-premises data source."
So there is a change in the architecture when you need to connect to a SQL Server IaaS data source or an on-premises data source; the architecture will look as below.
Fig: Analysis Services with a gateway to connect to a SQL database in an IaaS VM
Conclusion: Azure Analysis Services is a very nice PaaS offering, and it is very fast and easy to configure. To connect an on-premises data source, as well as SQL Server data stored in an IaaS VM in Azure, you need the on-premises data gateway. To connect a PaaS instance of SQL Server, the gateway is not a requirement.
I hope you liked this post; stay tuned for my next post on the gateway installation.
We need continuous monitoring and analysis to ensure performance and stability aren't negatively impacted by poor network connections or server issues. Here are the three important criteria you should keep in mind when weighing the need to monitor your Azure infrastructure.
For more insight into a virtual machine, you can collect guest-level metrics, logs, and other
diagnostic data using the Azure Diagnostics agent. You can also send diagnostic data to other
services like Application Insights.
Now let's look at a few use cases where I have used Azure Monitor. These use cases are very simple and are used only for demo purposes; in a real production environment you may need to work with many other metrics, based on your requirements.
Use Case 1: We need to know whether any DDoS attack has recently happened on a public gateway IP.
Go to the Monitor > Metrics tab.
Select the resource group and select the IP address of the VM.
Select the metric Under DDoS attack or not.
Use Case 2: We will check the incoming bandwidth on a VM's network.
Again, go to the Monitor > Metrics tab.
Select the VM and select the metric Network In.
You can also export the data to Excel to examine it at a later time; the Excel file will look like this.
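Once exported, the samples are easy to summarize offline. Here is a minimal sketch using only the standard library; the column names are assumptions modeled on a typical metrics export, so adjust them to match your actual file:

```python
import csv
import io

# Inline stand-in for the exported file; a real export would be read
# with open("metrics.csv") instead. Column names here are assumed.
sample = """TIMESTAMP,Network In (bytes)
2018-05-01T10:00:00Z,1048576
2018-05-01T10:01:00Z,2097152
2018-05-01T10:02:00Z,3145728
"""

rows = list(csv.DictReader(io.StringIO(sample)))
values = [float(r["Network In (bytes)"]) for r in rows]
avg_mib = sum(values) / len(values) / (1024 * 1024)
print(f"average Network In: {avg_mib:.2f} MiB per sample")  # → 2.00 MiB per sample
```

The same pattern extends to peak detection or per-hour grouping if you need more than an average.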
Go to Log Search.
Here is a sample query to chart the average free memory reported by each instance every hour:

Perf
| where CounterName == "Available MBytes"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| render timechart

If you want to know more about how to write analytics queries, you can refer to the query language cheat sheet here.
Conclusion: For existing infrastructure monitoring folks who think their jobs are at risk due to massive workloads moving to Azure, I think this is the wrong way to look at it. Azure gives us many new ways to monitor infrastructure, and the manual effort to monitor it will always be there. What we need to do is learn all the aspects of monitoring in the Azure cloud so that we can get the best out of it. I will write more posts on Azure Monitoring and Log Analytics in the future. Stay tuned, and wish you a great day/night ahead.
Edit: Managed disks are now fully supported in ASR; please refer to the article below.
Article for the support of managed disks.
Today we will see how we can configure disaster recovery step by step.
Configure Azure VM disaster recovery step by step for VMs that have unmanaged disks
I have selected a VM in my lab; it is located in West US 2 and runs the Windows 2016 operating system.
It's a Windows 2016 Datacenter Server VM; please find the OS version below.
The next step is to go to the Disaster Recovery (Preview) tab, as you can see below.
In the next step you need to configure disaster recovery for this VM.
Select the resource group under which the replicated VM will be created when the VM is failed over.
Select the virtual network in the target region to which the failed-over VM will be associated.
Select the cache storage account. The cache storage account is located in the source region and is used as a temporary data store before changes are replicated to the target region. By default, one cache storage account is created per vault and reused; you can select a different cache storage account if you intend to customize which one is used for this VM.
Data replicated from the source VM is stored in replica managed disks in the target region; for each managed disk in the source VM, one replica managed disk is created and used in the target region.
The Recovery Services vault contains the target VM configuration settings and orchestrates replication. In the event of a disruption where your source VM is not available, you can fail over from the Recovery Services vault.
The vault resource group is the resource group of the Recovery Services vault. The replication policy defines the settings for recovery point retention history and app-consistent snapshot frequency.
The world map below shows the Azure datacenters we have chosen for the replication; we are replicating the VM from West US 2 to East US 2.
The next step is to create the Azure resources. When you check the progress you can see that the deployment is running, and in the next step it will show the replication in progress for the VM.
Since this is part of ASR (Azure Site Recovery), it performs the same jobs that are generally done during a VM migration; you can find the triggered jobs below.
Note: For more details on Azure Site Recovery you can click here.
After some time you will find that Enable Replication has completed. The replication may take 15 minutes to a few hours depending on the size of the VM; as you can see below, in my case 98% had completed after 20 minutes.
Recovery Point Objective (RPO) describes the interval of time that might pass during a disruption before the quantity of data lost during that period exceeds the Business Continuity Plan's maximum allowable threshold, or "tolerance".
Example: if the last available good copy of data upon an outage is from 16 hours ago, and the RPO for this business is 20 hours, then we are still within the parameters of the Business Continuity Plan's RPO. In other words, it answers the question: "Up to what point in time could the business process's recovery proceed tolerably, given the volume of data lost during that interval?"
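The example above reduces to a simple time comparison, which can be sketched as follows (the dates are arbitrary; only the 16-hour gap and 20-hour RPO from the example matter):

```python
from datetime import datetime, timedelta

def within_rpo(last_good_copy, outage_time, rpo_hours):
    """True if the data-loss window still fits inside the RPO tolerance."""
    return (outage_time - last_good_copy) <= timedelta(hours=rpo_hours)

# The figures from the example: last good copy 16 hours old, 20-hour RPO.
outage = datetime(2018, 1, 2, 12, 0)
last_copy = outage - timedelta(hours=16)
print(within_rpo(last_copy, outage, rpo_hours=20))  # → True
```

A copy 22 hours old against the same 20-hour RPO would return False, i.e. the loss window exceeds the plan's tolerance.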
Then I shut down the primary VM just to check the RPO status after two days; after two days the RPO was showing 2 days, as you can see below, and there was an error stating that replication was halted.
After I started the VM, the replication completed, the data from the primary site and the DR site was synced, and the RPO came back down.
Run a disaster recovery drill for Azure VMs to a secondary Azure region
To test, I decided to do a test failover. Now I can see both VMs, the one in the primary site and the one in the DR site, running in two different Azure regions.
That's all for today. I think you will like my post on Azure Disaster Recovery (Preview); I will bring more on BCP and DR on Azure in future posts. For more details on each replication step you can click here.
Enjoy the rest of your day!
Like any other quota increase request, you need to open a ticket with Azure support and request the additional IPs.
Reference: https://gallery.technet.microsoft.com/Instant-recovery-point-and-25fe398a
How can you enable this feature?
You need to execute the following cmdlets from an elevated PowerShell prompt:
2) Select the subscription which you want to register for the preview:
Get-AzureRmSubscription -SubscriptionName "Subscription Name" | Select-AzureRmSubscription
It will take around two hours to complete the registration process. You can use the following cmdlet to check the registration status; you need not change anything in your schedule or policy for this to take effect.
Cmdlet:
Get-AzureRmProviderFeature -FeatureName "InstantBackupandRecovery" -ProviderNamespace Microsoft.RecoveryServices
This will not impact your infrastructure; you just cannot perform any change on the backup policy or schedule, as mentioned above. The cost will be the same as what you already have in place.
FAQ
What additional costs are incurred when signing up for the private preview for support of Azure Backup with large disks (greater than 1 TB)?
Since Microsoft stores snapshots to boost recovery point creation and to speed up restores, you will see storage costs corresponding to snapshots for a period of 7 days. (Currently snapshots are kept for 7 days; this is fixed, and there are plans to make it configurable in coming releases.)
Please refer to the doc below, shared by the Azure Backup product team:
https://gallery.technet.microsoft.com/Instant-recovery-point-and-25fe398a
Is there any impact on the production server upon signing up to the private preview for large disk backup on an Azure VM?
It isn't announced publicly yet, but as per the product group it tentatively reaches GA at the end of March 2018.
Conclusion:
In case you have larger disks, it's a good idea to switch to the private preview. It's a seamless move and doesn't affect your production environment.
Since original estimates keep failing and spend on the Azure budget overshoots month by month, most public cloud architect roles will list cost optimization as a top in-demand skill. With the introduction of Azure Advisor and Cost Management you can definitely get some insight, but there are many things you should plan well in advance of your next deal that can provide a significant lead over your competitors.
There are many cost-saving measures you can take, and I have listed the top 12 initiatives below. All of them can be planned well in advance during the planning stage of the deal. Once you win the deal and the project is in the delivery stage, there will be ongoing initiatives to bring down Azure spend as well.
1. RIs (Reserved Instances), pre-pay, on-demand, Dedicated Hosts, BYOL, etc.
2. Usage duration of the Azure resources. (For example, if you pause Azure Analysis Services, you will not be billed.)
3. Selecting the right storage and using Storage Pools and Striped Volumes. (IOPS calculation plays a big role here; the storage policy, such as when data moves to cool and archive storage, is also important.)
4. Right instance types (B-series VMs, etc.)
5. Estimation of data volumes for the client proposal (network capacity planning, ExpressRoute or site-to-site VPN, etc.)
6. Turn on/off (deallocating the VM when not in use).
7. Resize instances/change instance types.
8. Scale up/scale down.
9. Conversion to PaaS services.
10. Public cloud waste management. (Right-tagging resources with the owning department introduces accountability against overspending.)
11. Low priority VM’s (Already I have explained this in detail in my older blog post.)
12. HA planning with a single instance with limited allowed downtime.
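To illustrate item 6 in the list above, here is a back-of-the-envelope comparison of running a VM 24x7 versus deallocating it outside a weekday 12-hour window. The hourly rate is a made-up placeholder, not an Azure price:

```python
# Hypothetical pay-as-you-go compute rate; substitute your VM's real price.
RATE_PER_HOUR = 0.20
HOURS_ALWAYS_ON = 24 * 30    # roughly one month, running 24x7
HOURS_SCHEDULED = 12 * 22    # 12 hours/day on ~22 weekdays

always_on = RATE_PER_HOUR * HOURS_ALWAYS_ON
scheduled = RATE_PER_HOUR * HOURS_SCHEDULED
saving_pct = 100 * (1 - scheduled / always_on)
print(f"24x7: ${always_on:.2f}  scheduled: ${scheduled:.2f}  compute saving: {saving_pct:.0f}%")
```

Note that this covers only the compute meter: storage is still billed while a VM is deallocated, so the overall saving is somewhat lower.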
There are other areas you can also plan well in advance; they relate to BCP and DR, and to backup and recovery. In some of my upcoming blog posts, I will discuss more on Azure.
Based on my experience, one of the major cost reductions can be achieved by implementing Azure Reserved Instances for production and QA workloads. Today I am going to discuss that in detail; in later blogs I will write more on other cost-saving measures.
What are Azure Reservations (Reserved Instances)? (One of the best options to reduce cost.)
An Azure reservation is a way to pre-purchase your virtual machines (compute usage) for a duration of 1 to 3 years. If you are running a VM 24x7, 365 days a year, you can save on the order of 60-80% of the Azure compute cost. That is a huge amount of savings; how can you achieve this, and why is Microsoft giving this discount? There are really two reasons why you get it.
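The arithmetic behind those percentages is straightforward. The sketch below uses an illustrative pay-as-you-go rate and an assumed 60% reservation discount; actual discounts vary by VM size, region, and term:

```python
PAYG_RATE = 0.20      # hypothetical pay-as-you-go $/hour, not a real price
DISCOUNT = 0.60       # assumed 3-year reservation discount (illustrative)
HOURS_3Y = 24 * 365 * 3

payg_total = PAYG_RATE * HOURS_3Y
reserved_total = payg_total * (1 - DISCOUNT)
print(f"3-year pay-as-you-go: ${payg_total:,.0f}")
print(f"3-year reserved:      ${reserved_total:,.0f}")
```

The saving only materializes if the VM genuinely runs around the clock; a lightly used VM may be cheaper on pay-as-you-go with a shutdown schedule.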
What will happen if you need to change the instance during the tenure?
There may now be a question in your mind: what happens if you buy reserved instances for 3 years and after 18 months you want to change the size of the VM? In this case you can exchange the reservations; unlimited exchanges are possible during a tenure. But there is a catch: the value of the new reservation must be greater than what you are currently paying. So basically only an upgrade exchange is possible, not a downgrade.
What will happen if you want to terminate the lease in advance, that is, cancel the contract before the end of the tenure?
In case you want to cancel early, a 12% early termination fee will be deducted, and there is a limit of up to USD 50K per year; that means you can cancel only up to USD 50K of reservations in a year.
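As I read the policy above, the refund on an early cancellation works roughly like this (the remaining-balance figure is a made-up example):

```python
EARLY_TERMINATION_FEE = 0.12   # 12% fee, per the policy described above
ANNUAL_CANCEL_CAP_USD = 50_000 # at most USD 50K can be cancelled per year

def refund_on_cancel(remaining_balance):
    """Refund after the 12% early-termination fee, within the yearly cap."""
    if remaining_balance > ANNUAL_CANCEL_CAP_USD:
        raise ValueError("exceeds the USD 50K yearly cancellation limit")
    return round(remaining_balance * (1 - EARLY_TERMINATION_FEE), 2)

print(refund_on_cancel(10_000))  # → 8800.0
```

Check the current Azure reservation terms before relying on these numbers; the fee and cap are as described in this post and may change.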
To create new reserved VM instances, you can click Add and you will see the screen below. Please note that if you select the scope as Shared, it will be applicable to all your subscriptions, but if you select a particular subscription it will apply only to that subscription. I have also noticed that, based on my usage, Azure recommends a VM size, but it is up to me which VM I choose.
The cost will vary from instance to instance; if you choose a larger instance the cost savings can be 70%, as you can see below.
As you can see above, the operating system will not be discounted by reservations; the discount is applied only to compute usage, and your OS cost will be charged separately. But if you have a Software Assurance agreement, are an enterprise customer, and have already paid for the licenses, you can combine the reservations with Azure HUB benefits (the license is stored in Azure HUB; for more details you can read here).
What is Azure HUB?
The Hybrid Use Benefit (HUB) is available to customers with Enterprise Agreements and Software Assurance; it enables Windows Server licenses held on-premises to be leveraged in Azure, which results in Windows VMs costing the same as Linux VMs (since there is no charge for the Windows Server license).
In combination with HUB you can save up to 80% of the total running cost. To learn about Azure Reserved Instance pricing, kindly check the Azure pricing calculator here.
That's all for today. I hope you had a good time reading this blog and learned something useful. Have a very good rest of your day.
In my next few blog posts, I will discuss how we can optimize cost in Azure and various other ways to do so. Stay tuned for more.
Top 20 most helpful pieces of information/checklist items that any Azure Pre-Sales Architect should keep handy in 2017
By Aavisek Choudhury, Azure Pre-Sales
Estimated Reading Time: 8 minutes
If you are planning to meet your customer for a large transformation and migration deal with an Azure offering, and you are done with all your homework and presentations and ready to crack the deal, hold on... before you make any promises, please spend some time checking these 20 most helpful webpages/URLs, which may make you and your customer happy and your delivery team's life a lot easier in the future. I have compiled this list based on my personal experience, and I hope it will make a big difference during any RFP/RFI, HLD/LLD, or SoW preparation on Azure.
1. Azure Pricing Calculator.
The Azure pricing calculator is something you need at every step of your engagement with the customer; here is the link for the Azure Pricing Calculator. For Azure CSP, the pricing calculator is available in the CSP portal.
2. Azure Subscription Limits and Quotas.
A must-have URL to know the available quota per subscription, which will help with a smooth design during the HLD phase; here is the link: Azure subscription and service limits, quotas, and constraints.
3. Cost control in Azure.
When you are in deep discussion with the customer, one of the basic questions the customer may ask is what to do if the budget overshoots in Azure, and you should be capable of answering this tricky question. Although there are a few third-party products like Cloud Cruiser available in the Azure Marketplace for cost control, they don't have support for Azure CSP. The URL below is a native feature in Azure that will serve the purpose without much effort: Set up Billing Alerts in Azure.
4. Running non-supported Windows OS in Azure.
What will happen to my legacy applications running on Windows 2003; can I move them to Azure? This is one of the frequently asked questions you may face during sessions with your customer, and you should be ready with the answer. First of all, you should know that Windows 2003 VMs are no longer officially supported in Azure; however, you may run them as long as you want, and more details can be found here: Windows 2003 VMs in Azure. A second option is to inform the customer about running them on a designated Hyper-V host in Azure, which can easily be built with the new nested virtualization introduced in Azure.
5. Azure Site Recovery supported scenarios.
Azure Site Recovery is very successful in all types of migration activities to Azure, except for a few areas where it may become a pain at a later stage for the delivery team: they may be in the middle of a migration and discover that a VM or physical machine can't be moved to Azure with the help of ASR, due to one or another unsupported scenario. In one of my earlier articles I mentioned the same thing, which you can find here (Azure ASR limitations which are difficult to bypass).
In this type of situation the customer may lose trust in your delivery team, and conflict may arise between the delivery and pre-sales teams over who promised this deliverable to the customer. So it's always recommended and advisable to learn the different scenarios supported by the ASR process. Please find the URLs below, which can help here:
Azure to Azure
On-premises to Azure
ASR FAQ
6. Running Oracle Database in Azure.
Can I run my Oracle databases in Azure, and how can I move large Oracle databases to the cloud? This is also a common question if the enterprise has lots of Oracle databases in its environment. ASR may be used for Oracle databases, but if the Oracle VMs or physical machines are not supported by ASR, it's better to use Oracle Data Guard for the migration. Here is an article that can help you answer some basic questions on Oracle migration to Azure: Supported scenarios and migration options for the Oracle database in Azure.
7. Site connectivity in Azure.
Can I connect my existing on-premises sites to Azure, and do I need to invest in new VPN routers and gateways? This is a common question you should be ready to answer for your customer. Microsoft provides a list of supported VPN routers; however, this list may not cover all the routers available in the market. For example, the TP-LINK router I use for my home office is not covered in this list, yet I was able to set up VPN connectivity with Azure; to know more, please click here.
Please find the supported routers here: Supported VPN Routers in Azure.
8. Comparison with AWS.
Please expect a set of questions when you meet your customer about similar offerings from Amazon Web Services, so I suggest you prepare yourself with a high-level product comparison between AWS and Azure. I have recently compiled a head-to-head comparison between the Azure and AWS offerings, and I am sure it is going to help you.
Please find my post below: Azure vs. AWS Head to Head Comparison Q3 2017.
9. Moving resources from one subscription to another.
This is an important question if the customer already has some footprint in Azure and there is
a chance that you can onboard them to your CSP subscription, or maybe you are advising them
on an EA option. You should be able to answer questions about moving resources from one
subscription to another in the first place.
Here is a post for that: Move resources from one Subscription to another.
10. Life Cycle Policy of Azure Resources.
Although this question may not be important for some customers, I have seen many
customers who wanted to know whether there is any impact on their applications if Microsoft
changes the underlying hardware.
A detailed explanation of the Azure life cycle policy can be found in this article: Life Cycle
Policy for Azure Resources.
11. Total cost of ownership (TCO) in Azure and in AWS.
This is one of the most discussed topics during the estimation and proposal preparation phase.
Generally, the Microsoft pre-sales consultant will have completed this exercise before the
release of the RFP or bid documents, but you should know about it too. The two URLs below
should help you answer any quick question on TCO during your discussion with the customer.
Total cost of Ownership for Azure.
Total cost of ownership for AWS.
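The calculators above do the heavy lifting, but the underlying arithmetic is worth being able to sanity-check on a whiteboard. Here is a minimal sketch in Python; every number in it (hardware cost, licensing, admin cost, per-hour VM rate) is hypothetical and purely illustrative, not real Azure or AWS pricing:

```python
def on_prem_tco(hardware, licensing, power_cooling_per_year, admin_per_year, years=3):
    """Total cost of running a workload on-premises over `years`:
    up-front capital costs plus recurring operational costs."""
    return hardware + licensing + (power_cooling_per_year + admin_per_year) * years

def cloud_tco(vm_rate_per_hour, vm_count, years=3, hours_per_year=8760):
    """Total cost of equivalent IaaS VMs billed per hour, running 24x7."""
    return vm_rate_per_hour * vm_count * hours_per_year * years

# Hypothetical 3-year comparison for a 10-VM workload.
on_prem = on_prem_tco(hardware=80_000, licensing=30_000,
                      power_cooling_per_year=6_000, admin_per_year=20_000)
cloud = cloud_tco(vm_rate_per_hour=0.20, vm_count=10)

print(f"On-premises 3-year TCO: ${on_prem:,.0f}")
print(f"Cloud 3-year TCO:       ${cloud:,.0f}")
```

A real TCO exercise adds many more line items (networking, storage, facilities, staff retraining, reserved-instance discounts), which is exactly what the calculators linked above model for you.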
12. Azure Stencils.
As an Azure pre-sales architect you will need the Azure Visio and PowerPoint stencils and icon
sets, which are available for download from the Microsoft site and will help you a lot. They are
a must-have tool for successful presentations and for high-level and low-level designs, and
you will need them throughout the bid process and in every new deal you participate in.
Please download the Azure stencils here:
Microsoft Azure, Cloud and Enterprise Symbol / Icon Set – Visio stencil, PowerPoint, PNG, SVG
13. Azure data centre compliance.
The security folks on the customer side will ask many compliance-related questions about
Azure. You can point them directly to this URL, where they will find answers to all their
questions, so keep it handy. Otherwise there is a good chance the security team will pour cold
water on your presentation and switch to a different vendor who can convince them better on
the security part, and no doubt the security team has an important role in all your deals.
Here is a list of the Compliance of the Azure Data Center.
14. Azure Product Availability by region.
Not all Azure products are available in all Azure regions, so before you promise anything
about a particular Azure data center, please take a quick look at the URL below:
Product availability by Regions.
15. Azure Backup – Supported Scenarios.
This is an important area that has to be addressed correctly during the pre-sales bid;
otherwise it may again become a pain for the delivery team. For example, in one recent
project I found that the pre-sales team had promised an ASR move of Windows 2008 R2 SP1
VMs to Azure because they are well supported by ASR. However, after the first wave the
delivery team found that they couldn't install the Azure Backup agent on the Windows 2008
VMs that were 32-bit, which resulted in a complete back-out of the ASR move. This kind of
situation can give you a bad name during execution, so be very careful and add these URLs
to your checklist.
Azure Backup-FAQ
Azure VM Backup-FAQ
16. Monitoring – Azure Log Analytics-Supported Data Sources.
Monitoring is going to be part of most of your deals, and if you have chosen to prescribe an
Azure monitoring solution in your offering, don't forget to take a quick look at the supported
data sources. Keep in mind that you can't monitor everything with Azure Log Analytics. For
example, if a customer wants a monitoring solution for their web applications, you may need
to direct them to third-party solutions available in the Azure Marketplace, such as
AppDynamics. For the currently supported data sources, take a look at the URL below.
Azure Log Analytics Supported Data Sources
17. Azure Reference Architecture.
Whether you are a novice or an expert in on-premises architecture design, this is the time to
spend a few days understanding Azure application architecture. You have to understand that
most architectures in the Azure cloud are built on three principles: scalability, resiliency, and
high availability. The two URLs below should be enough to understand and master the likely
target architectures in Azure for your customers.
Azure Architecture Center.
Azure Reference Architecture.
18. Azure Express Route.
Azure ExpressRoute is always a point of discussion in many customer engagements, and many
customers would like to hand it to the network team, but you should be ready with some of
the Azure ExpressRoute FAQs. Here is the URL for that:
FAQ-Azure Express Route
19. Business Continuity and Disaster Recovery in Azure.
Azure BCP/DR is something of an elephant in the room. You need to plan it well before the
final commitment during the engagement with the customer. If required, set up a small POC
with a few applications to validate your concept before finalizing the SoW.
You should also be aware of the common terms used in any DR process, as shown below;
these have to be agreed with your customer or the application owners. You should know
what needs to be recovered in case of DR.
RTO: The recovery time objective (RTO) is the maximum acceptable length of time that
your application can be offline.
RPO: The recovery point objective (RPO) is the maximum acceptable length of time during
which data might be lost due to a major incident. Note that this metric describes the length of
time only; it does not address the amount or quality of the data lost.
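These two targets are easy to formalize, which helps when validating a DR drill result against what the application owners signed off. Here is a minimal sketch, assuming RTO and RPO are treated as hard upper bounds; the tier name and numbers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DrTargets:
    rto_minutes: float  # max acceptable downtime
    rpo_minutes: float  # max acceptable window of data loss

def meets_targets(targets, downtime_minutes, data_loss_window_minutes):
    """Check an incident (or DR drill result) against the agreed targets."""
    return (downtime_minutes <= targets.rto_minutes
            and data_loss_window_minutes <= targets.rpo_minutes)

# Hypothetical tier-1 application: one hour of downtime, 15 minutes of loss.
tier1 = DrTargets(rto_minutes=60, rpo_minutes=15)
print(meets_targets(tier1, downtime_minutes=45, data_loss_window_minutes=10))  # True
print(meets_targets(tier1, downtime_minutes=45, data_loss_window_minutes=30))  # False
```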
Here is a list of URLs that will help you in this process.
Business Continuity and Disaster Recovery in Azure in the Azure Paired Regions.
Disaster Recovery for the Azure Applications.
High Availability of the Azure Applications.
Designing resilient applications for Azure.
20. What is there in Azure Stack?
This is a question many consultants have been facing from customers in the last few months,
and as an Azure pre-sales architect you should be aware of what is in Microsoft Azure Stack
and how you can position it against the other hyper-converged vendors available in the
market. Here is an article that will definitely increase your knowledge of Azure Stack.
Key features and concepts in Azure stack.
That makes the final list of 20, but this is of course not the end. As a player in a tough
competition, you should constantly stay informed about innovations, new releases, and
product reviews in the Azure world to get ahead of others. I hope you like this post.
Azure regions
Azure has more global regions than any other cloud provider—offering the scale
needed to bring applications closer to users around the world, preserving data
residency, and offering comprehensive compliance and resiliency options for
customers.
54 regions worldwide, available in 140 countries.
The conversion of the GPT disk to MBR will also not work here, because these disks are mostly
OS disks and there is no supported way to convert them to an MBR partition.
We thought of a workaround: create a VHDX using Disk2vhd (GPT disk) and then create a
Gen2 VM on Hyper-V, thinking that ASR would work in this case. However, we found that 2008
R2 isn't supported as a Generation 2 VM, so we were not able to proceed further unless we
upgraded the OS, which was a no-go for the application owners.
So if you are planning to move physical boxes or VMs that have GPT partitions on the C drive
or have clustered disks, it may be better to look for a different migration tool instead of ASR
at this point in time.
When we looked for a third-party option, I came across a vendor called DoubleTake, and when
we checked their cloud migration user guide we did not see any comments specific to
UEFI-partition-based Windows machines.
Here is the DoubleTake user guide for your reference.
https://migrate.doubletake.com/docs/CMCUsersGuide.pdf
It says it does not support UEFI disks on Linux machines; it says nothing about Windows.
For more details on the Azure Site Recovery support matrix, please check the link below.
The bottom line is that ASR may be a very good tool, but it has a few limitations. If you are
planning a large-scale migration with different workloads, you need to plan your large
workload sprints in advance and decide the strategy on a case-by-case basis.
https://docs.microsoft.com/en-us/azure/site-recovery/vmware-physical-azure-support-matrix
This article includes frequently asked questions about Azure Site Recovery. If you
have questions after reading this article, post them on the Azure Recovery Services
Forum.
General
What does Site Recovery do?
Site Recovery contributes to your business continuity and disaster recovery (BCDR)
strategy, by orchestrating and automating replication of Azure VMs between regions,
on-premises virtual machines and physical servers to Azure, and on-premises
machines to a secondary datacenter. Learn more.
What can Site Recovery protect?
Azure VMs: Site Recovery can replicate any workload running on a supported Azure
VM
Hyper-V virtual machines: Site Recovery can protect any workload running on a
Hyper-V VM.
Physical servers: Site Recovery can protect physical servers running Windows or
Linux.
VMware virtual machines: Site Recovery can protect any workload running in a
VMware VM.
Can I replicate Azure VMs?
Yes, you can replicate supported Azure VMs between Azure regions. Learn more.
What do I need in Hyper-V to orchestrate replication with Site Recovery?
For the Hyper-V host server what you need depends on the deployment scenario.
Check out the Hyper-V prerequisites in:
No, VMs must be located on a Hyper-V host server that's running on a supported
Windows server machine. If you need to protect a client computer you could
replicate it as a physical machine to Azure or a secondary datacenter.
What workloads can I protect with Site Recovery?
You can use Site Recovery to protect most workloads running on a supported VM or
physical server. Site Recovery provides support for application-aware replication, so
that apps can be recovered to an intelligent state. It integrates with Microsoft
applications such as SharePoint, Exchange, Dynamics, SQL Server and Active
Directory, and works closely with leading vendors, including Oracle, SAP, IBM and
Red Hat. Learn more about workload protection.
Do Hyper-V hosts need to be in VMM clouds?
Yes. You can either replicate VMs in Hyper-V servers in the VMM cloud to Azure, or
you can replicate between VMM clouds on the same server. For on-premises to on-
premises replication, we recommend that you have a VMM server in both the
primary and secondary sites.
What physical servers can I protect?
You can replicate physical servers running Windows and Linux to Azure or to a
secondary site. Learn about requirements for replication to Azure, and replication to
a secondary site.
Note that physical servers will run as VMs in Azure if your on-premises server goes
down. Failback to an on-premises physical server isn't currently supported. For a
machine protected as physical, you can only failback to a VMware virtual machine.
What VMware VMs can I protect?
To protect VMware VMs you'll need a vSphere hypervisor, and virtual machines
running VMware tools. We also recommend that you have a VMware vCenter server
to manage the hypervisors. Learn more about requirements for replication to Azure,
or replication to a secondary site.
Can I manage disaster recovery for my branch offices with Site Recovery?
Yes. When you use Site Recovery to orchestrate replication and failover in your
branch offices, you'll get a unified orchestration and view of all your branch office
workloads in a central location. You can easily run failovers and administer disaster
recovery of all branches from your head office, without visiting the branches.
Pricing
For pricing related questions, please refer to the FAQ at Azure Site Recovery pricing.
Security
Is replication data sent to the Site Recovery service?
No, Site Recovery doesn't intercept replicated data, and doesn't have any
information about what's running on your virtual machines or physical servers.
Replication data is exchanged between on-premises Hyper-V hosts, VMware
hypervisors, or physical servers and Azure storage or your secondary site. Site
Recovery has no ability to intercept that data. Only the metadata needed to
orchestrate replication and failover is sent to the Site Recovery service.
Site Recovery is ISO 27001:2013, 27018, HIPAA, DPA certified, and is in the process of
SOC2 and FedRAMP JAB assessments.
For compliance reasons, even our on-premises metadata must remain within the
same geographic region. Can Site Recovery help us?
Yes. When you create a Site Recovery vault in a region, we ensure that all metadata
that we need to enable and orchestrate replication and failover remains within that
region's geographic boundary.
Does Site Recovery encrypt replication?
For virtual machines and physical servers replicating between on-premises sites,
encryption in transit is supported. For virtual machines and physical servers
replicating to Azure, both encryption in transit and encryption at rest (in Azure) are
supported.
Replication
Can I replicate over a site-to-site VPN to Azure?
Azure Site Recovery replicates data to an Azure storage account, over a public
endpoint. Replication isn't over a site-to-site VPN. You can create a site-to-site VPN,
with an Azure virtual network. This doesn't interfere with Site Recovery replication.
Can I use ExpressRoute to replicate virtual machines to Azure?
Yes.
Can I automate Site Recovery scenarios?
Yes. You can automate Site Recovery workflows using the REST API, PowerShell, or
the Azure SDK. Currently supported scenarios for deploying Site Recovery using
PowerShell:
You need an LRS or GRS storage account. We recommend GRS so that data is
resilient if a regional outage occurs, or if the primary region can't be recovered. The
account must be in the same region as the Recovery Services vault. Premium storage
is supported for VMware VM, Hyper-V VM, and physical server replication, when you
deploy Site Recovery in the Azure portal.
How often can I replicate data?
Replication is continuous for Azure VMs and VMware VMs. For Hyper-V, replication
frequency can be as low as every 30 seconds.
Can I extend replication from existing recovery site to another tertiary site?
This is supported when you're replicating VMware VMs and Hyper-V VMs to Azure,
using the Azure portal.
Can I replicate virtual machines with dynamic disks?
Dynamic disks are supported when replicating Hyper-V virtual machines. They are
also supported when replicating VMware VMs and physical machines to Azure. The
operating system disk must be a basic disk.
Can I add a new machine to an existing replication group?
Yes.
Can I throttle replication bandwidth?
Yes. You can read more about throttling bandwidth in the deployment articles:
Failover
If I'm failing over to Azure, how do I access the Azure virtual machines after failover?
You can access the Azure VMs over a secure Internet connection, over a site-to-site
VPN, or over Azure ExpressRoute. You'll need to prepare a number of things in order
to connect. Learn more
If I fail over to Azure how does Azure make sure my data is resilient?
Azure is designed for resilience. Site Recovery is already engineered for failover to a
secondary Azure datacenter, in accordance with the Azure SLA if the need arises. If
this happens, we make sure your metadata and vaults remain within the same
geographic region that you chose for your vault.
If I'm replicating between two datacenters what happens if my primary datacenter
experiences an unexpected outage?
You can trigger an unplanned failover from the secondary site. Site Recovery doesn't
need connectivity from the primary site to perform the failover.
Is failover automatic?
Failover isn't automatic. You initiate failovers with a single click in the portal, or you can
use Site Recovery PowerShell to trigger a failover. Failing back is a simple action in
the Site Recovery portal.
Yes, you can use alternate location recovery to fail back to a different host from
Azure. Read more about the options in the links below for VMware and Hyper-V
virtual machines.
Service providers
I'm a service provider. Does Site Recovery work for dedicated and shared
infrastructure models?
Yes, Site Recovery supports both dedicated and shared infrastructure models.
For a service provider, is the identity of my tenant shared with the Site Recovery
service?
No. Tenant identity remains anonymous. Your tenants don't need access to the Site
Recovery portal. Only the service provider administrator interacts with the portal.
Will tenant application data ever go to Azure?
When replicating between service provider-owned sites, application data never goes
to Azure. Data is encrypted in-transit, and replicated directly between the service
provider sites.
If you're replicating to Azure, application data is sent to Azure storage but not to the
Site Recovery service. Data is encrypted in-transit, and remains encrypted in Azure.
Will my tenants receive a bill for any Azure services?
No. Azure's billing relationship is directly with the service provider. Service providers
are responsible for generating specific bills for their tenants.
If I'm replicating to Azure, do we need to run virtual machines in Azure at all times?
No. Data is replicated to an Azure storage account in your subscription. When you
perform a test failover (DR drill) or an actual failover, Site Recovery automatically
creates virtual machines in your subscription.
Do you ensure tenant-level isolation when I replicate to Azure?
Yes.
What platforms do you currently support?
We support Azure Pack, Cloud Platform System, and System Center based (2012 and
higher) deployments. Learn more about Azure Pack and Site Recovery integration.
Do you support single Azure Pack and single VMM server deployments?
Yes, you can replicate Hyper-V virtual machines to Azure, or between service provider
sites. Note that if you replicate between service provider sites, Azure runbook
integration isn't available.
Welcome to the Azure Site Recovery service! This article provides a quick service
overview.
Site Recovery service: Site Recovery helps ensure business continuity by keeping
business apps and workloads running during outages. Site Recovery replicates
workloads running on physical and virtual machines (VMs) from a primary site to a
secondary location. When an outage occurs at your primary site, you fail over to
secondary location, and access apps from there. After the primary location is running
again, you can fail back to it.
Backup service: The Azure Backup service keeps your data safe and recoverable by
backing it up to Azure.
Feature: Details
Simple BCDR solution: Using Site Recovery, you can set up and manage replication, failover,
and failback from a single location in the Azure portal.
Azure VM replication: You can set up disaster recovery of Azure VMs from a primary region
to a secondary region.
On-premises VM replication: You can replicate on-premises VMs and physical servers to Azure,
or to a secondary on-premises datacenter. Replication to Azure eliminates the cost and
complexity of maintaining a secondary datacenter.
Workload replication: Replicate any workload running on supported Azure VMs, on-premises
Hyper-V and VMware VMs, and Windows/Linux physical servers.
Data resilience: Site Recovery orchestrates replication without intercepting application data.
When you replicate to Azure, data is stored in Azure storage, with the resilience that
provides. When failover occurs, Azure VMs are created based on the replicated data.
RTO and RPO targets: Keep recovery time objectives (RTO) and recovery point objectives (RPO)
within organizational limits. Site Recovery provides continuous replication for Azure VMs and
VMware VMs, and replication frequency as low as 30 seconds for Hyper-V. You can reduce RTO
further by integrating with Azure Traffic Manager.
Keep apps consistent over failover: You can replicate using recovery points with
application-consistent snapshots. These snapshots capture disk data, all data in memory, and
all transactions in process.
Testing without disruption: You can easily run disaster recovery drills without affecting
ongoing replication.
Flexible failovers: You can run planned failovers for expected outages with zero data loss, or
unplanned failovers with minimal data loss (depending on replication frequency) for
unexpected disasters. You can easily fail back to your primary site when it's available again.
Customized recovery plans: Using recovery plans, you can customize and sequence the failover
and recovery of multi-tier applications running on multiple VMs. You group machines together
in a recovery plan, and optionally add scripts and manual actions. Recovery plans can be
integrated with Azure Automation runbooks.
BCDR integration: Site Recovery integrates with other BCDR technologies. For example, you
can use Site Recovery to protect the SQL Server back end of corporate workloads, with native
support for SQL Server AlwaysOn, to manage the failover of availability groups.
Network integration: Site Recovery integrates with Azure for simple application network
management, including reserving IP addresses, configuring load balancers, and integrating
Azure Traffic Manager for efficient network switchovers.
Replicate on-premises VMware VMs, Hyper-V VMs managed by System Center VMM, and
physical servers to a secondary site.
Workloads: You can replicate any workload running on a machine that's supported for
replication. In addition, the Site Recovery team has performed app-specific testing for a
number of apps.
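The RPO you can commit to follows directly from the replication mechanism: in the worst case, everything written since the last replication cycle is lost, so you cannot promise an RPO tighter than the replication interval. A rough sketch of that feasibility check (the interval values mirror the frequencies mentioned above, 0 stands in for continuous replication, and the function name is my own):

```python
# Replication interval per platform, in seconds (0 = continuous replication).
REPLICATION_INTERVAL_SECONDS = {
    "azure-vm": 0,   # continuous
    "vmware": 0,     # continuous
    "hyper-v": 30,   # as low as every 30 seconds
}

def rpo_target_is_feasible(platform, rpo_target_seconds):
    """A promised RPO is only feasible if it is no tighter than the
    replication interval: data written since the last cycle can be lost."""
    return rpo_target_seconds >= REPLICATION_INTERVAL_SECONDS[platform]

print(rpo_target_is_feasible("hyper-v", 60))   # True
print(rpo_target_is_feasible("hyper-v", 10))   # False: tighter than the 30s interval
```

This ignores the cadence of application-consistent snapshots, which is a separate setting; the point is only that the replication interval puts a floor under any RPO you quote to a customer.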
This article describes workloads and applications you can protect for disaster
recovery with the Azure Site Recovery service.
Overview
Site Recovery is an Azure service that contributes to your BCDR strategy. Using Site
Recovery, you can deploy application-aware replication to the cloud, or to a
secondary site. Whether your apps are Windows or Linux-based, running on physical
servers, VMware or Hyper-V, you can use Site Recovery to orchestrate replication,
perform disaster recovery testing, and run failovers and failback.
Workload summary
Site Recovery can replicate any app running on a supported machine. In addition,
we've partnered with product teams to carry out additional app-specific testing.
Site Recovery supports the following workloads across all five replication scenarios
(Azure VMs to Azure, Hyper-V VMs to a secondary site, Hyper-V VMs to Azure,
VMware VMs to a secondary site, and VMware VMs to Azure):
Active Directory, DNS
System Center Operations Manager
SharePoint
Exchange (non-DAG)
Remote Desktop/VDI
Dynamics AX
Windows File Server
An Active Directory and DNS infrastructure is essential to most enterprise apps.
During disaster recovery, you'll need to protect and recover these infrastructure
components, before recovering your workloads and apps.
You can use Site Recovery to create a complete automated disaster recovery plan for
Active Directory and DNS. For example, if you want to fail over SharePoint and SAP
from a primary to a secondary site, you can set up a recovery plan that fails over
Active Directory first, and then an additional app-specific recovery plan to fail over
the other apps that rely on Active Directory.
SQL Server provides the data services foundation for many business
apps in an on-premises datacenter. Site Recovery can be used together with SQL
Server HA/DR technologies, to protect multi-tiered enterprise apps that use SQL
Server. Site Recovery provides:
A simple and cost-effective disaster recovery solution for SQL Server. Replicate
multiple versions and editions of SQL Server standalone servers and clusters, to Azure
or to a secondary site.
Integration with SQL AlwaysOn Availability Groups, to manage failover and failback
with Azure Site Recovery recovery plans.
End-to-end recovery plans for all tiers in an application, including the SQL Server
databases.
Scaling of SQL Server for peak loads with Site Recovery, by “bursting” them into
larger IaaS virtual machine sizes in Azure.
Easy testing of SQL Server disaster recovery. You can run test failovers to analyze data
and run compliance checks, without impacting your production environment.
Eliminates the need for a stand-by farm for disaster recovery, and the associated
infrastructure costs. Use Site Recovery to replicate an entire farm (web, app, and
database tiers) to Azure or to a secondary site.
Simplifies application deployment and management. Updates deployed to the
primary site are automatically replicated, and are thus available after failover and
recovery of a farm in a secondary site. Also lowers the management complexity and
costs associated with keeping a stand-by farm up-to-date.
Simplifies SharePoint application development and testing by creating a production-
like, on-demand replica environment for testing and debugging.
Simplifies transition to the cloud by using Site Recovery to migrate SharePoint
deployments to Azure.
Azure Site Recovery helps protect your Dynamics AX ERP solution, by:
Remote Desktop Services (RDS) enables virtual desktop infrastructure (VDI), session-
based desktops, and applications, allowing users to work anywhere. With Azure Site
Recovery you can:
For small Exchange deployments, such as a single or standalone server, Site Recovery
can replicate and fail over to Azure or to a secondary site.
For larger deployments, Site Recovery integrates with Exchange DAGS.
Exchange DAGs are the recommended solution for Exchange disaster recovery in an
enterprise. Site Recovery recovery plans can include DAGs, to orchestrate DAG failover
across sites.
Azure Site Recovery provides disaster recovery by replicating the critical components
in your environment to a cold remote site or a public cloud like Microsoft Azure.
Since the virtual machines with the web server and the database are being replicated
to the recovery site, there is no requirement to backup configuration files or
certificates separately. The application mappings and bindings dependent on
environment variables that are changed post failover can be updated through scripts
integrated into the disaster recovery plans. Virtual machines are brought up on the
recovery site only in the event of a failover. In addition, Azure Site Recovery also
helps you orchestrate the end to end failover by providing you the following
capabilities:
Sequencing the shutdown and startup of virtual machines in the various tiers.
Adding scripts to allow update of application dependencies and bindings on the
virtual machines after they have been started up. The scripts can also be used to
update the DNS server to point to the recovery site.
Allocate IP addresses to virtual machines pre-failover by mapping the primary and
recovery networks and hence use scripts that do not need to be updated post failover.
Ability for a one-click failover for multiple web applications on the web servers, thus
eliminating the scope for confusion in the event of a disaster.
Ability to test the recovery plans in an isolated environment for DR drills.
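The capabilities above boil down to an ordered plan: bring up tiers in sequence, and run scripts once each tier is up. Here is a toy model of that sequencing in Python. This is not the Site Recovery API; the group layout and script names are hypothetical, chosen to mirror the DNS/bindings example above:

```python
def run_recovery_plan(plan):
    """Simulate the failover order: groups start in sequence, and each
    group's scripts run after its VMs are up (e.g. to update DNS or
    application bindings post-failover)."""
    actions = []
    for group in plan:
        for vm in group["vms"]:
            actions.append(f"start {vm}")
        for script in group.get("post_scripts", []):
            actions.append(f"run {script}")
    return actions

# Database tier first, then the web tier that depends on it.
plan = [
    {"vms": ["sql-01"], "post_scripts": ["update-connection-strings"]},
    {"vms": ["web-01", "web-02"], "post_scripts": ["update-dns"]},
]
for action in run_recovery_plan(plan):
    print(action)
```

The real service adds manual actions and Azure Automation runbooks to the same sequencing idea, but the ordering logic is the essence of a recovery plan.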
Use Site Recovery to protect your Citrix XenApp and XenDesktop deployments, as
follows:
Enable protection of the Citrix XenApp and XenDesktop deployment, by replicating
different deployment layers including (AD DNS server, SQL database server, Citrix
Delivery Controller, StoreFront server, XenApp Master (VDA), Citrix XenApp License
Server) to Azure.
Simplify cloud migration, by using Site Recovery to migrate your Citrix XenApp and
XenDesktop deployment to Azure.
Simplify Citrix XenApp/XenDesktop testing, by creating a production-like copy on-
demand for testing and debugging.
This solution is only applicable to Windows Server operating system virtual desktops,
not client virtual desktops, as client virtual desktops are not yet supported for
licensing in Azure. Learn more about licensing for client/server desktops in Azure.
If you don't have an Azure subscription, create a free account before you begin.
Note
This article is intended to guide a new user through the Azure Site Recovery
experience with the default options and minimum customization. If you want to
know more about the various settings that can be customized, refer to the tutorial
for enabling replication for Azure VMs.
Log in to Azure
1. In the Azure portal, click Virtual machines, and select the VM you want to replicate.
2. In Operations, click Disaster recovery.
3. In Configure disaster recovery > Target region select the target region to which
you'll replicate.
4. For this Quickstart, accept the other default settings.
5. Click Enable replication. This starts a job to enable replication for the VM.
Verify settings
After the replication job has finished, you can check the replication status, modify
replication settings, and test the deployment.
1. In the VM menu, click Disaster recovery.
2. You can verify replication health, the recovery points that have been created,
and source, target regions on the map.
Clean up resources
The VM in the primary region stops replicating when you disable replication for it:
The source replication settings are cleaned up automatically. Note that the Site
Recovery extension installed as part of replication isn't removed, and needs to be
removed manually.
Site Recovery billing for the VM also stops.
Stop replication as follows
Next steps
In this quickstart, you replicated a single VM to a secondary region. You can now
explore more options and try replicating a set of Azure VMs using a recovery plan.
Azure to Azure disaster recovery
architecture
12/31/2018
This article describes the architecture, components, and processes used when you
deploy disaster recovery for Azure virtual machines (VMs) using the Azure Site
Recovery service. With disaster recovery set up, Azure VMs continuously replicate
to a different target region. If an outage occurs, you can fail over VMs to the
secondary region, and access them from there. When everything's running normally
again, you can fail back and continue working in the primary location.
Architectural components
The components involved in disaster recovery for Azure VMs are summarized in the
following table.
Component: Requirements
Source VM storage: Azure VMs can be managed, or have non-managed disks spread across
storage accounts. Learn about supported Azure storage.
Source VM networks: VMs can be located in one or more subnets in a virtual network (VNet)
in the source region. Learn more about networking requirements.
Cache storage account: You need a cache storage account in the source network. During
replication, VM changes are stored in the cache before being sent to target storage. Using a
cache ensures minimal impact on production applications that are running on a VM.
Target resources: Target resources are used during replication, and when a failover occurs.
Site Recovery can set up target resources by default, or you can create/customize them. In
the target region, check that you're able to create VMs, and that your subscription has
enough resources to support the VM sizes that will be needed in the target region.
Target resources
When you enable replication for a VM, Site Recovery gives you the option of creating
target resources automatically.
Target resource: Default setting

Target resource group: The resource group to which VMs belong after failover.
It can be in any Azure region except the source region. Site Recovery creates
a new resource group in the target region, with an "asr" suffix.

Target VNet: The virtual network (VNet) in which replicated VMs are located
after failover. A network mapping is created between the source and target
virtual networks, and vice versa. Site Recovery creates a new VNet and subnet,
with the "asr" suffix.

Target storage account: If the VM doesn't use a managed disk, this is the
storage account to which data is replicated. Site Recovery creates a new
storage account in the target region, to mirror the source storage account.

Replica managed disks: If the VM uses managed disks, these are the managed
disks to which data is replicated. Site Recovery creates replica managed disks
in the target region to mirror the source.

Target availability sets: The availability set in which replicating VMs are
located after failover. Site Recovery creates an availability set in the
target region with the suffix "asr" for VMs that are located in an
availability set in the source location. If a matching availability set
already exists, it's used and a new one isn't created.

Target availability zones: If the target region supports availability zones,
Site Recovery assigns the same zone number as that used in the source region.
Managing target resources
Replication policy
When you enable Azure VM replication, by default Site Recovery creates a new
replication policy with the default settings summarized in the table.
Policy setting: Recovery point retention
Details: Specifies how long Site Recovery keeps recovery points.
Default: 24 hours
You can manage and modify the default replication policy settings as follows:
Multi-VM consistency
If you want VMs to replicate together, and have shared crash-consistent and app-
consistent recovery points at failover, you can gather them together into a replication
group. Multi-VM consistency impacts workload performance, and should only be
used for VMs running workloads that need consistency across all machines.
Snapshots and recovery points
Recovery points are created from snapshots of VM disks taken at a specific point in
time. When you fail over a VM, you use a recovery point to restore the VM in the
target location.
When failing over, we generally want to ensure that the VM starts with no corruption
or data loss, and that the VM data is consistent for the operating system, and for
apps that run on the VM. This depends on the type of snapshots taken.
Consistency
Crash-consistent: A crash-consistent snapshot captures the data that was on
the disk when the snapshot was taken. It doesn't include anything in memory.
It contains the equivalent of the on-disk data that would be present if the VM
crashed or the power cord was pulled from the server at the instant the
snapshot was taken. A crash-consistent snapshot doesn't guarantee data
consistency for the operating system, or for apps on the VM.
Site Recovery creates crash-consistent recovery points every five minutes by
default. This setting can't be modified.
Today, most apps can recover well from crash-consistent points.
Crash-consistent recovery points are usually sufficient for the replication of
operating systems, and for apps such as DHCP servers and print servers.
App-consistent
Replication process
When you enable replication for an Azure VM, the following happens:
1. The Site Recovery Mobility service extension is automatically installed on the VM.
2. The extension registers the VM with Site Recovery.
3. Continuous replication begins for the VM. Disk writes are immediately transferred to
the cache storage account in the source location.
4. Site Recovery processes the data in the cache, and sends it to the target storage
account, or to the replica managed disks.
5. After the data is processed, crash-consistent recovery points are generated every five
minutes. App-consistent recovery points are generated according to the setting
specified in the replication policy.
Replication process
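As a rough illustration of the defaults above (a crash-consistent point every five minutes, and the replication policy's default 24-hour retention), the number of crash-consistent points retained at steady state is easy to estimate:

```python
# Back-of-the-envelope count of crash-consistent recovery points retained,
# using the defaults stated above: one point every 5 minutes, kept for the
# default 24-hour retention window.
RETENTION_HOURS = 24
INTERVAL_MINUTES = 5

points_retained = RETENTION_HOURS * 60 // INTERVAL_MINUTES
print(points_retained)  # 288
```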
Connectivity requirements
The Azure VMs you replicate need outbound connectivity. Site Recovery never needs
inbound connectivity to the VM.
Outbound connectivity (URLs)
If outbound access for VMs is controlled with URLs, allow these URLs.
URL Details
To control outbound connectivity for VMs using IP addresses, allow these addresses.
Source region rules:
Allow HTTPS outbound, port 443, for ranges that correspond to storage accounts
in the source region (service tag: Storage).
Allow HTTPS outbound, port 443, for ranges that correspond to Azure Active
Directory (Azure AD) (service tag: AzureActiveDirectory).
Target region rules:
Allow HTTPS outbound, port 443, for ranges that correspond to storage accounts
in the target region (service tag: Storage).
Allow HTTPS outbound, port 443, for ranges that correspond to Azure AD
(service tag: AzureActiveDirectory). If Azure AD addresses are added in the
future, you need to create new NSG rules.
NSG rules for the source Azure region should allow outbound access for replication
traffic.
We recommend you create rules in a test environment before you put them into
production.
Use service tags instead of allowing individual IP addresses.
o Service tags represent a group of IP address prefixes gathered together to
minimize complexity when creating security rules.
o Microsoft automatically updates service tags over time.
When you initiate a failover, the VMs are created in the target resource group, target
virtual network, target subnet, and in the target availability set. During a failover, you
can use any recovery point.
Azure TCO
Enter the details of your on-premises workloads. This information is used to
understand your current TCO and to recommend services in Azure.
Servers
Enter the details of your on-premises server infrastructure. After adding a workload,
select the workload type and enter the remaining details.
Add server workload
Databases
Enter the details of your on-premises database infrastructure. After adding a
database, enter the details of your on-premises database infrastructure in the Source
section. In the Destination section, select the Azure service you would like to use.
Add database
Storage
Enter the details of your on-premises storage infrastructure. After adding storage,
select the storage type and enter the remaining details.
Add storage
Networking
Enter the amount of network bandwidth you currently consume in your on-premises
environment.
Outbound bandwidth
(1 - 2000)
Adjust assumptions
The following assumptions are being made as part of the TCO model. These key
assumptions usually vary among customers. We recommend reviewing these values
for accuracy.
Currency
Learn more about Software Assurance
Learn more about Azure Hybrid Benefit
Storage costs
Storage procurement cost/GB for local disk/SAN-SSD
(USD)
IT labor costs
Number of physical servers that can be managed by a full time administrator
Number of virtual machines that can be managed by a full time administrator
Hourly rate for IT administrator
(USD)
Other assumptions
The following assumptions also affect the TCO model, but typically require less
adjustment by customers. You can come back to this section at any time and adjust
the assumptions.
Hardware costs
Software costs
Electricity costs
Virtualization costs
Data center costs
Networking costs
Database costs
If you opt for GRS, you have two related options to choose from:
GRS replicates your data to another data center in a secondary region, but that data
is available to be read only if Microsoft initiates a failover from the primary to
secondary region.
Read-access geo-redundant storage (RA-GRS) is based on GRS. RA-GRS replicates
your data to another data center in a secondary region, and also provides you with the
option to read from the secondary region. With RA-GRS, you can read from the
secondary region regardless of whether Microsoft initiates a failover from the primary
to secondary region.
For a storage account with GRS or RA-GRS enabled, all data is first replicated with
locally redundant storage (LRS). An update is first committed to the primary location
and replicated using LRS. The update is then replicated asynchronously to the
secondary region using GRS. When data is written to the secondary location, it's also
replicated within that location using LRS.
Both the primary and secondary regions manage replicas across separate fault
domains and upgrade domains within a storage scale unit. The storage scale unit is
the basic replication unit within the datacenter. Replication at this level is provided
by LRS; for more information, see Locally redundant storage (LRS): Low-cost data
redundancy for Azure Storage.
Keep these points in mind when deciding which replication option to use:
Zone-redundant storage (ZRS) provides high availability with synchronous
replication and may be a better choice for some scenarios than GRS or RA-GRS. For
more information on ZRS, see ZRS.
Asynchronous replication involves a delay from the time that data is written to the
primary region, to when it is replicated to the secondary region. In the event of a
regional disaster, changes that haven't yet been replicated to the secondary region
may be lost if that data can't be recovered from the primary region.
With GRS, the replica isn't available for read or write access unless Microsoft initiates
a failover to the secondary region. In the case of a failover, you'll have read and write
access to that data after the failover has completed. For more information, please
see Disaster recovery guidance.
If your application needs to read from the secondary region, enable RA-GRS.
When you enable read-only access to your data in the secondary region, your data is
available on a secondary endpoint as well as on the primary endpoint for your
storage account. The secondary endpoint is similar to the primary endpoint, but
appends the suffix –secondary to the account name. For example, if your primary
endpoint for the Blob service is myaccount.blob.core.windows.net, then your
secondary endpoint is myaccount-secondary.blob.core.windows.net. The access
keys for your storage account are the same for both the primary and secondary
endpoints.
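The endpoint naming rule above is mechanical, so it can be sketched in a few lines. This helper is purely illustrative (it isn't part of any Azure SDK) and assumes a well-formed primary endpoint that starts with the account name:

```python
# Illustrative helper: derive the RA-GRS secondary endpoint from a primary
# endpoint by appending "-secondary" to the account name. Not an Azure SDK
# API; assumes the endpoint begins with the storage account name.
def secondary_endpoint(primary_endpoint: str) -> str:
    account, _, rest = primary_endpoint.partition(".")
    return f"{account}-secondary.{rest}"

print(secondary_endpoint("myaccount.blob.core.windows.net"))
# myaccount-secondary.blob.core.windows.net
```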
Your application has to manage which endpoint it is interacting with when using RA-
GRS.
Since asynchronous replication involves a delay, changes that haven't yet been
replicated to the secondary region may be lost if data can't be recovered from the
primary region.
You can check the Last Sync Time of your storage account. Last Sync Time is a GMT
date/time value. All primary writes before the Last Sync Time have been successfully
written to the secondary location, meaning that they are available to be read from the
secondary location. Primary writes after the Last Sync Time may or may not be
available for reads yet. You can query this value using the Azure portal, Azure
PowerShell, or from one of the Azure Storage client libraries.
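In application code, the Last Sync Time guarantee reduces to a timestamp comparison. The sketch below assumes the Last Sync Time has already been retrieved (for example, through the portal, PowerShell, or a client library); the function name is hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical helper: a primary write is guaranteed to be readable from the
# secondary only if it happened at or before the account's Last Sync Time.
# A False result means "not guaranteed yet", not "definitely not replicated".
def safely_replicated(write_time: datetime, last_sync_time: datetime) -> bool:
    return write_time <= last_sync_time

last_sync = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
print(safely_replicated(datetime(2019, 1, 1, 11, 59, tzinfo=timezone.utc), last_sync))  # True
print(safely_replicated(datetime(2019, 1, 1, 12, 5, tzinfo=timezone.utc), last_sync))   # False
```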
If you initiate an account failover (preview) of a GRS or RA-GRS account to the
secondary region, write access to that account is restored after the failover has
completed. For more information, see Disaster recovery and storage account failover
(preview).
RA-GRS is intended for high-availability purposes. For scalability guidance, review
the performance checklist.
For suggestions on how to design for high availability with RA-GRS, see Designing
Highly Available Applications using RA-GRS storage.
Recovery Point Objective (RPO): In GRS and RA-GRS, the storage service
asynchronously geo-replicates the data from the primary to the secondary location.
In the event that the primary region becomes unavailable, you can perform an
account failover (preview) to the secondary region. When you initiate a failover,
recent changes that haven't yet been geo-replicated may be lost. The number of
minutes of potential data that's lost is known as the RPO. The RPO indicates the
point in time to which data can be recovered. Azure Storage typically has an RPO of
less than 15 minutes, although there's currently no SLA on how long geo-replication
takes.
Recovery Time Objective (RTO): The time it takes to restore write access is a
combination of:
The time until the customer initiates the failover of the storage account from the
primary to the secondary region.
The time required by Azure to perform the failover by changing the primary DNS
entries to point to the secondary location.
Paired regions
When you create a storage account, you select the primary region for the account.
The paired secondary region is determined based on the primary region, and can't
be changed. For up-to-date information about regions supported by Azure,
see Business continuity and disaster recovery (BCDR): Azure paired regions.
When you create a storage account, you can select one of the following redundancy
options:
The following table provides a quick overview of the scope of durability and
availability that each replication strategy will provide you for a given type of event (or
event of similar impact).
Supported storage account types:
LRS: GPv2, GPv1, Blob. ZRS: GPv2. GRS: GPv2, GPv1, Blob. RA-GRS: GPv2, GPv1, Blob.

Availability SLA for read requests:
LRS, ZRS, and GRS: at least 99.9% (99% for cool access tier). RA-GRS: at least
99.99% (99.9% for cool access tier).

Availability SLA for write requests:
All options: at least 99.9% (99% for cool access tier).
For pricing information for each redundancy option, see Azure Storage Pricing.
For information about Azure Storage guarantees for durability and availability, see
the Azure Storage SLA.
Note
Currently, you cannot use the Portal or API to convert your account to ZRS. If you
want to convert your account's replication to ZRS, see Zone-redundant storage
(ZRS) for details.
Are there any costs to changing my account's replication strategy?
It depends on your conversion path. Ordered from cheapest to most expensive,
the redundancy offerings are LRS, ZRS, GRS, and RA-GRS. For example, going
from LRS to anything else incurs additional charges because you are moving to
a more sophisticated redundancy level. Going to GRS or RA-GRS incurs an egress
bandwidth charge because your data (in your primary region) is being
replicated to your remote secondary region. This is a one-time charge at
initial setup. After the data is copied, there are no further conversion
charges; you are only charged for replicating new data or updates to existing
data. For details on bandwidth charges, see the Azure Storage Pricing page.
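As a sketch of the one-time egress charge described above: converting to GRS or RA-GRS copies the existing data once to the secondary region, so the charge is roughly data size times the bandwidth rate. The $/GB rate below is a made-up placeholder, not a published price; see the Azure Storage Pricing page for real figures:

```python
# Rough sketch of the one-time conversion charge when moving to GRS/RA-GRS.
# The egress rate per GB is a placeholder assumption for illustration only.
def one_time_conversion_charge(data_gb: float, egress_rate_per_gb: float) -> float:
    return data_gb * egress_rate_per_gb

print(one_time_conversion_charge(1000, 0.02))  # e.g. 1,000 GB at an assumed $0.02/GB
```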
If you change from GRS to LRS, there is no additional cost, but your replicated data is
deleted from the secondary location.
A storage scale unit is a collection of racks of storage nodes. A fault domain (FD) is a
group of nodes that represent a physical unit of failure. Think of a fault domain as
nodes belonging to the same physical rack. An upgrade domain (UD) is a group of
nodes that are upgraded together during the process of a service upgrade (rollout).
The replicas are spread across UDs and FDs within one storage scale unit. This
architecture ensures your data is available if a hardware failure affects a single rack or
when nodes are upgraded during a service upgrade.
LRS is the lowest-cost replication option and offers the least durability compared to
other options. If a datacenter-level disaster (for example, fire or flooding) occurs, all
replicas may be lost or unrecoverable. To mitigate this risk, Microsoft recommends
using either zone-redundant storage (ZRS) or geo-redundant storage (GRS).
If your application stores data that can be easily reconstructed if data loss occurs, you
may opt for LRS.
Some applications are restricted to replicating data only within a country due to data
governance requirements. In some cases, the paired regions across which the data is
replicated for GRS accounts may be in another country. For more information on
paired regions, see Azure regions.
When you store your data in a storage account using ZRS replication, you can
continue to access and manage your data if an availability zone becomes unavailable.
ZRS provides excellent performance and low latency. ZRS offers the same scalability
targets as locally redundant storage (LRS).
Consider ZRS for scenarios that require consistency, durability, and high availability.
Even if an outage or natural disaster renders an availability zone unavailable, ZRS
offers durability for storage objects of at least 99.9999999999% (12 9's) over a given
year.
ZRS is available for block blobs, non-disk page blobs, files, tables, and queues in the following regions:
Asia Southeast
Europe West
Europe North
France Central
Japan East
UK South
US East
US East 2
US West 2
US Central
Your data is still accessible for both read and write operations even if a zone
becomes unavailable. Microsoft recommends that you continue to follow practices
for transient fault handling. These practices include implementing retry policies with
exponential back-off.
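The transient-fault practice recommended above, retries with exponential back-off, can be sketched generically. This helper is illustrative only and is not an Azure Storage SDK API; the client libraries ship their own configurable retry policies:

```python
import random
import time

# Generic retry with exponential back-off and jitter, a common pattern for
# transient faults such as a briefly unavailable zone. Illustrative sketch.
def retry_with_backoff(op, max_attempts=5, base_delay=0.5, max_delay=8.0):
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```

In practice, catch only the exception types your client raises for transient errors, rather than the blanket `Exception` used here for brevity.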
ZRS may not protect your data against a regional disaster where multiple zones are
permanently affected. Instead, ZRS offers resiliency for your data if it becomes
temporarily unavailable. For protection against regional disasters, Microsoft
recommends using geo-redundant storage (GRS). For more information about GRS,
see Geo-redundant storage (GRS): Cross-regional replication for Azure Storage.
Converting to ZRS replication
Migrating to or from LRS, GRS, and RA-GRS is straightforward. Use the Azure portal
or the Storage Resource Provider API to change your account's redundancy type.
Azure will then replicate your data accordingly.
Migrating data to or from ZRS requires a different strategy. ZRS migration involves
the physical movement of data from a single storage stamp to multiple stamps
within a region.
Manually copy or move data to a new ZRS account from an existing account.
Request a live migration.
Use existing tooling like AzCopy, one of the Azure Storage client libraries, or reliable
third-party tools.
If you're familiar with Hadoop or HDInsight, attach both the source and destination
(ZRS) accounts to your cluster. Then, parallelize the data copy process with a tool
like DistCp.
Build your own tooling using one of the Azure Storage client libraries.
During a live migration, you can use your storage account while your data is
migrated between source and destination storage stamps. During the migration
process, you have the same level of durability and availability SLA as you do
normally.
While Microsoft handles your request for live migration promptly, there's no
guarantee as to when a live migration will complete. If you need your data migrated to
ZRS by a certain date, then Microsoft recommends that you perform a manual
migration instead. Generally, the more data you have in your account, the longer it
takes to migrate that data.
Live migration is supported only for storage accounts that use LRS or GRS replication.
If your account uses RA-GRS, then you need to first change your account's replication
type to either LRS or GRS before proceeding. This intermediary step removes the
secondary read-only endpoint provided by RA-GRS before migration.
Your account must contain data.
You can only migrate data within the same region. If you want to migrate your data
into a ZRS account located in a region different than the source account, then you
must perform a manual migration.
Only standard storage account types support live migration. Premium storage
accounts must be migrated manually.
You can request live migration through the Azure Support portal. From the portal,
select the storage account you want to convert to ZRS.
A support person will contact you and provide any assistance you need.
Microsoft will deprecate and migrate ZRS Classic accounts on March 31, 2021. More
details will be provided to ZRS Classic customers before deprecation.
ZRS Classic asynchronously replicates data across data centers within one to two
regions. Replicated data may not be available unless Microsoft initiates failover to the
secondary. A ZRS Classic account can't be converted to or from LRS, GRS, or RA-GRS.
ZRS Classic accounts also don't support metrics or logging.
To manually migrate ZRS account data to or from an LRS, ZRS Classic, GRS, or RA-
GRS account, use one of the following tools: AzCopy, Azure Storage Explorer, Azure
PowerShell, or Azure CLI. You can also build your own migration solution with one of
the Azure Storage client libraries.
AWS is 5 times more expensive than Azure for Windows Server and SQL Server*
Migrate your Windows Server and SQL Server workloads to Azure with your on-premises
licenses
Achieve the lowest cost of ownership when you combine the Azure Hybrid
Benefit, reservation pricing, and extended security updates.
Receive free extended security updates when you migrate your Windows Server and SQL
Server 2008 and 2008 R2 workloads to Azure virtual machines.
Save when you prepay your compute capacity on a one- or three-year term with reservation
pricing, which not only improves budget forecasting but also provides flexibility to exchange or
cancel should business needs change.
Learn more
Example estimate:
Instance size: D4 v2 (8 cores, 28 GB RAM, 400 GB SSD), $1.008 per hour
Monthly estimate without Azure Hybrid Benefit: $3,679.20
Estimated annual savings on Azure across all virtual machines: $18,527.40
The calculator helps estimate the savings range when using the Azure Hybrid
Benefit for Windows Server licenses that include Software Assurance. Your
actual savings may vary.
Activate your Azure Hybrid Benefit
1. Deploy a virtual machine within minutes from the Azure Portal or through the Azure
Marketplace.
SQL Database:
1. Try SQL Database Managed Instance from the Azure Portal and migrate your SQL
Server databases without changing your apps.
2. Try SQL Database Single Database or Elastic Pool from the Azure Portal and build
data-driven applications and websites in the programming language of your choice.
The B-series VMs are ideal for workloads that do not need the full performance of
the CPU continuously, like web servers, proof of concepts, small databases and
development build environments. These workloads typically have burstable
performance requirements. The B-series provides you with the ability to purchase a
VM size with baseline performance and the VM instance builds up credits when it is
using less than its baseline. When the VM has accumulated credit, the VM can burst
above the baseline using up to 100% of the vCPU when your application requires
higher CPU performance.
Q&A
Q: How do you get 135% baseline performance from a VM?
A: The 135% baseline is shared among the 8 vCPUs that make up the VM size. For
example, if your application uses 4 of the 8 vCPUs for batch processing, and each
of those 4 vCPUs runs at 30% utilization, the total VM CPU usage is 120%. That
means your VM is banking credit based on the 15% delta below its baseline
performance. It also means that when credits are available, the same VM can use
100% of all 8 vCPUs, giving it a maximum CPU performance of 800%.
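The arithmetic in that answer can be written out directly. All the figures below come from the example above (an 8-vCPU size with a 135% aggregate baseline):

```python
# The shared-baseline arithmetic from the Q&A answer above, written out.
BASELINE_PCT = 135          # aggregate baseline across all vCPUs
VCPUS = 8

used_pct = 4 * 30           # 4 vCPUs each at 30% utilization
delta_pct = BASELINE_PCT - used_pct  # headroom below baseline, banking credit
max_burst_pct = VCPUS * 100          # full burst uses 100% of every vCPU

print(used_pct, delta_pct, max_burst_pct)  # 120 15 800
```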
Q: How are credits accumulated and consumed?
A: The VM accumulation and consumption rates are set such that a VM running at
exactly its base performance level will have neither a net accumulation or
consumption of bursting credits. A VM will have a net increase in credits whenever it
is running below its base performance level and will have a net decrease in credits
whenever the VM is utilizing the CPU more than its base performance level.
Example: I deploy a VM using the B1ms size for my small time and attendance
database application. This size allows my application to use up to 20% of a vCPU as
my baseline, which is 0.2 credits per minute I can use or bank.
During peak hours my application averages 60% vCPU utilization, I still earn 0.2
credits per minute but I consume 0.6 credits per minute, for a net cost of 0.4 credits a
minute or 0.4 x 60 = 24 credits per hour. I have 4 hours per day of peak usage, so it
costs 4 x 24 = 96 credits for my peak usage.
If I take the 120 credits I earned off-peak and subtract the 96 credits I used for my
peak times, I bank an additional 24 credits per day that I can use for other bursts of
activity.
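The daily credit budget in this example can be checked with a few lines. The 10% off-peak utilization below is an assumption made to match the 120 off-peak credits quoted above; the 0.2 credits per minute baseline is the B1ms figure from the example:

```python
# Daily B1ms credit budget from the worked example above.
BASELINE_CREDITS_PER_MIN = 0.2  # B1ms earns/banks up to 0.2 credits per minute

def net_credits(vcpu_utilization: float, minutes: int) -> float:
    """Net credits accrued (positive) or spent (negative) over a period.
    vcpu_utilization is the fraction of one vCPU in use (0.6 == 60%)."""
    return (BASELINE_CREDITS_PER_MIN - vcpu_utilization) * minutes

peak = net_credits(0.60, 4 * 60)       # 4 peak hours at 60%: spends credits
off_peak = net_credits(0.10, 20 * 60)  # assumed 10% off-peak: earns credits
print(round(peak), round(off_peak), round(peak + off_peak))  # -96 120 24
```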
Q: Does the B-Series support Premium Storage data disks?
General purpose
Compute optimized
Memory optimized
Storage optimized
GPU optimized
High performance compute
The following table lists the agent data sources that are currently available in Azure
Monitor. Each has a link to a separate article providing detail for that data source. It
also provides information on their method and frequency of collection.
Data source           Platform  Microsoft   Operations  Azure    Operations  Sent via
                                monitoring  Manager     storage  Manager     management
                                agent       agent                required?   group
Performance counters  Windows   •           •
Performance counters  Linux     •
Syslog                Linux     •
Windows Event logs    Windows   •           •           •                    •
Data collection
Data source configurations are delivered to agents that are directly connected to
Azure Monitor within a few minutes. The specified data is collected from the agent
and delivered directly to Azure Monitor at intervals specific to each data source. See
the documentation for each data source for these specifics.
For System Center Operations Manager agents in a connected management group,
data source configurations are translated into management packs and delivered to
the management group every 5 minutes by default. The agent downloads the
management pack like any other and collects the specified data. Depending on the
data source, the data will either be sent to a management server, which forwards the
data to Azure Monitor, or the agent will send the data to Azure Monitor without
going through the management server. See Data collection details for monitoring
solutions in Azure for details. You can read about details of connecting Operations
Manager and Azure Monitor and modifying the frequency that configuration is
delivered at Configure Integration with System Center Operations Manager.
All log data collected by Azure Monitor is stored in the workspace as records.
Records collected by different data sources will have their own set of properties and
be identified by their Type property. See the documentation for each data source
and solution for details on each record type.
ExpressRoute FAQ
What is ExpressRoute?
ExpressRoute is an Azure service that lets you create private connections between
Microsoft datacenters and infrastructure that’s on your premises or in a colocation
facility. ExpressRoute connections do not go over the public Internet, and offer
higher security, reliability, and speeds with lower latencies than typical connections
over the Internet.
What are the benefits of using ExpressRoute and private network connections?
ExpressRoute connections do not go over the public Internet. They offer higher
security, reliability, and speeds, with lower and consistent latencies than typical
connections over the Internet. In some cases, using ExpressRoute connections to
transfer data between on-premises devices and Azure can yield significant cost
benefits.
Where is the service available?
See this page for service location and availability: ExpressRoute partners and
locations.
How can I use ExpressRoute to connect to Microsoft if I don’t have partnerships with
one of the ExpressRoute-carrier partners?
You can select a regional carrier and land Ethernet connections to one of the
supported exchange provider locations. You can then peer with Microsoft at the
provider location. Check the last section of ExpressRoute partners and locations to
see if your service provider is present in any of the exchange locations. You can then
order an ExpressRoute circuit through the service provider to connect to Azure.
If I purchase an ExpressRoute circuit of a given bandwidth, does the VPN connection I
purchase from my network service provider have to be the same speed?
No. You can purchase a VPN connection of any speed from your service provider.
However, your connection to Azure is limited to the ExpressRoute circuit bandwidth
that you purchase.
If I pay for an ExpressRoute circuit of a given bandwidth, do I have the ability to burst
up to higher speeds if necessary?
Yes. ExpressRoute circuits are configured to allow you to burst up to two times the
bandwidth limit you procured for no additional cost. Check with your service
provider to see if they support this capability.
Can I use the same private network connection with virtual network and other Azure
services simultaneously?
Yes. An ExpressRoute circuit, once set up, allows you to access services within a
virtual network and other Azure services simultaneously. You connect to virtual
networks over the private peering path, and to other services over the Microsoft
peering path.
Does ExpressRoute offer a Service Level Agreement (SLA)?
Public peering
Note
Public peering has been disabled on new ExpressRoute circuits. Azure services are
available on Microsoft peering.
Power BI
Dynamics 365 for Finance and Operations (formerly known as Dynamics AX Online)
Most of the Azure services are supported. Please check directly with the service that
you want to use to verify support.
Microsoft peering
Office 365
Dynamics 365
Power BI
Azure Active Directory
Azure DevOps (Azure Global Services community)
Most of the Azure services are supported. Please check directly with the service that
you want to use to verify support.
Is there a limit on the amount of data I can transfer using ExpressRoute?
We do not set a limit on the amount of data transfer. Refer to pricing details for
information on bandwidth rates.
What connection speeds are supported by ExpressRoute?
50 Mbps, 100 Mbps, 200 Mbps, 500 Mbps, 1 Gbps, 2 Gbps, 5 Gbps, 10 Gbps
Is connectivity to Microsoft redundant?
Yes. Each ExpressRoute circuit has a redundant pair of cross connections configured
to provide high availability.
Will I lose connectivity if one of my ExpressRoute links fails?
You will not lose connectivity if one of the cross connections fails. A redundant
connection is available to support the load of your network and provide high
availability of your ExpressRoute circuit. You can additionally create a circuit in a
different peering location to achieve circuit-level resilience.
How do I ensure high availability on a virtual network connected to ExpressRoute?
If your service provider can establish two Ethernet virtual circuits over the physical
connection, you only need one physical connection. The physical connection (for
example, an optical fiber) is terminated on a layer 1 (L1) device (see the image). The
two Ethernet virtual circuits are tagged with different VLAN IDs, one for the primary
circuit, and one for the secondary. Those VLAN IDs are in the outer 802.1Q Ethernet
header. The inner 802.1Q Ethernet header (not shown) is mapped to a
specific ExpressRoute routing domain.
Can I have more than one ExpressRoute circuit in my subscription?
Yes. You can have more than one ExpressRoute circuit in your subscription. The
default limit is set to 10. You can contact Microsoft Support to increase the limit,
if needed.
Can I have ExpressRoute circuits from different service providers?
Yes. You can have ExpressRoute circuits with many service providers. Each
ExpressRoute circuit is associated with one service provider only.
I see two ExpressRoute peering locations in the same metro, for example, Singapore
and Singapore2. Which peering location should I choose to create my ExpressRoute
circuit?
If your service provider offers ExpressRoute at both sites, you can work with your
provider and pick either site to set up ExpressRoute.
Can I have multiple ExpressRoute circuits in the same metro? Can I link them to the
same virtual network?
Yes. You can have multiple ExpressRoute circuits with the same or different service
providers. If the metro has multiple ExpressRoute peering locations and the circuits
are created at different peering locations, you can link them to the same virtual
network. If the circuits are created at the same peering location, you can’t link them
to the same virtual network. Each location name in Azure portal or in PowerShell/CLI
API represents one peering location. For example, you can select the peering
locations "Singapore" and "Singapore2" and connect circuits from each to the same
virtual network.
How do I connect my virtual networks to an ExpressRoute circuit?
• Establish an ExpressRoute circuit and have the service provider enable it.
• You, or the provider, must configure the BGP peering(s).
• Link the virtual network to the ExpressRoute circuit.
For more information, see ExpressRoute workflows for circuit provisioning and circuit
states.
Are there connectivity boundaries for my ExpressRoute circuit?
Yes. You can link up to 10 virtual networks in the same subscription as the circuit or
different subscriptions using a single ExpressRoute circuit. This limit can be increased
by enabling the ExpressRoute premium feature.
Are multiple virtual networks connected to the same ExpressRoute circuit isolated from each other?
No. From a routing perspective, all virtual networks linked to the same ExpressRoute
circuit are part of the same routing domain and are not isolated from each other. If
you need route isolation, you need to create a separate ExpressRoute circuit.
Can I have one virtual network connected to more than one ExpressRoute circuit?
Yes. You can link a single virtual network with up to four ExpressRoute circuits. They
must be ordered through four different ExpressRoute locations.
Can I access the Internet from my virtual networks connected to ExpressRoute
circuits?
Yes. If you have not advertised default routes (0.0.0.0/0) or Internet route prefixes
through the BGP session, you can connect to the Internet from a virtual network
linked to an ExpressRoute circuit.
Can I block Internet connectivity to virtual networks connected to ExpressRoute
circuits?
Yes. You can advertise default routes (0.0.0.0/0) to block all Internet connectivity to
virtual machines deployed within a virtual network and route all traffic out through
the ExpressRoute circuit.
If you advertise default routes, we force traffic to services offered over Microsoft
peering (such as Azure storage and SQL DB) back to your premises. You will have to
configure your routers to return traffic to Azure through the Microsoft peering path
or over the Internet. If you've enabled a service endpoint for the service, the traffic to
the service is not forced to your premises. The traffic remains within the Azure
backbone network. To learn more about service endpoints, see Virtual network
service endpoints.
Can virtual networks linked to the same ExpressRoute circuit talk to each other?
Yes. Virtual networks linked to the same ExpressRoute circuit are part of the same
routing domain and can communicate with each other.
Why is a public IP address associated with the ExpressRoute gateway on my virtual
network?
The public IP address is used for internal management only, and does not constitute
a security exposure of your virtual network.
Are there limits on the number of routes I can advertise?
Yes. We accept up to 4000 route prefixes for private peering and 200 for Microsoft
peering. You can increase this to 10,000 routes for private peering if you enable the
ExpressRoute premium feature.
Are there restrictions on IP ranges I can advertise over the BGP session?
We do not accept private prefixes (RFC1918) for the Microsoft peering BGP session.
What happens if I exceed the BGP limits?
BGP sessions will be dropped. They will be reset once the prefix count goes below
the limit.
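As a rough sketch of the limit behavior described above (illustrative only, not Azure code; the route limits are the ones stated in this FAQ):

```python
# Illustrative model of ExpressRoute BGP prefix limits. Sessions are dropped
# above the limit and re-established once the prefix count falls below it.
PREFIX_LIMITS = {
    ("private", False): 4000,    # private peering, standard
    ("private", True): 10000,    # private peering, premium add-on
    ("microsoft", False): 200,   # Microsoft peering, standard
    ("microsoft", True): 200,    # Microsoft peering, premium add-on
}

def session_state(peering: str, premium: bool, advertised_prefixes: int) -> str:
    """Return 'established' or 'dropped' for a BGP session."""
    limit = PREFIX_LIMITS[(peering, premium)]
    return "established" if advertised_prefixes <= limit else "dropped"

print(session_state("private", False, 3500))  # established
print(session_state("private", False, 4500))  # dropped
print(session_state("private", True, 4500))   # established: premium raises the limit
```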
What is the ExpressRoute BGP hold time? Can it be adjusted?
The hold time is 180 seconds. The keep-alive messages are sent every 60 seconds. These are
fixed settings on the Microsoft side that cannot be changed. It is possible for you to
configure different timers, and the BGP session parameters will be negotiated
accordingly.
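The timer negotiation mentioned above can be sketched as follows. This models standard BGP behavior (per RFC 4271, where the smaller of the two proposed hold times wins and keepalives are typically one third of it); it is not an Azure-specific API:

```python
# Sketch of BGP timer negotiation against Microsoft's fixed proposal.
MS_HOLD_SECONDS = 180  # fixed on the Microsoft side, per the FAQ

def negotiate_timers(peer_hold_seconds: int):
    """Return (hold, keepalive): the session uses the smaller proposed hold
    time, and keepalives are sent at one third of the negotiated hold time."""
    hold = min(MS_HOLD_SECONDS, peer_hold_seconds)
    keepalive = hold // 3
    return hold, keepalive

print(negotiate_timers(180))  # (180, 60): the defaults described above
print(negotiate_timers(90))   # (90, 30): your router can negotiate shorter timers
```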
Can I change the bandwidth of an ExpressRoute circuit?
Yes, you can attempt to increase the bandwidth of your ExpressRoute circuit in the
Azure portal, or by using PowerShell. If there is capacity available on the physical port
on which your circuit was created, your change succeeds.
If your change fails, it means either there isn’t enough capacity left on the current
port and you need to create a new ExpressRoute circuit with the higher bandwidth,
or that there is no additional capacity at that location, in which case you won't be
able to increase the bandwidth.
You will also have to follow up with your connectivity provider to ensure that they
update the throttles within their networks to support the bandwidth increase. You
cannot, however, reduce the bandwidth of your ExpressRoute circuit. You have to
create a new ExpressRoute circuit with lower bandwidth and delete the old circuit.
How do I change the bandwidth of an ExpressRoute circuit?
You can update the bandwidth of the ExpressRoute circuit using the REST API or
PowerShell cmdlet.
ExpressRoute premium
What is ExpressRoute premium?
ExpressRoute premium is a collection of features that provides the following:
• Increased routing table limit from 4,000 routes to 10,000 routes for private peering.
• Increased number of VNets and ExpressRoute Global Reach connections that can be
enabled on an ExpressRoute circuit (default is 10). For more information, see
the ExpressRoute Limits table.
• Connectivity to Office 365 and Dynamics 365.
• Global connectivity over the Microsoft core network. You can now link a VNet
in one geopolitical region with an ExpressRoute circuit in another region.
Examples:
o You can link a VNet created in Europe West to an ExpressRoute circuit created
in Silicon Valley.
o On the Microsoft peering, prefixes from other geopolitical regions are
advertised such that you can connect to, for example, SQL Azure in Europe West
from a circuit in Silicon Valley.
How many VNets and ExpressRoute Global Reach connections can I enable on an
ExpressRoute circuit if I enabled ExpressRoute premium?
The following tables show the ExpressRoute limits and the number of VNets and
ExpressRoute Global Reach connections per ExpressRoute circuit:
ExpressRoute Limits
Maximum number of routes for Azure private peering with ExpressRoute standard: 4,000
Maximum number of routes for Azure private peering with ExpressRoute premium add-on: 10,000
Maximum number of routes for Azure Microsoft peering with ExpressRoute standard: 200
Maximum number of routes for Azure Microsoft peering with ExpressRoute premium add-on: 200
Number of virtual network links allowed per ExpressRoute circuit: see table below
Circuit Size | Number of VNet links for standard | Number of VNet links with Premium add-on
50 Mbps      | 10                                | 20
100 Mbps     | 10                                | 25
200 Mbps     | 10                                | 25
500 Mbps     | 10                                | 40
1 Gbps       | 10                                | 50
2 Gbps       | 10                                | 60
5 Gbps       | 10                                | 75
10 Gbps      | 10                                | 100
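The table above can be encoded as a simple lookup for capacity-planning scripts. This is an illustrative helper, not part of the Azure SDK:

```python
# VNet-link limits per circuit size, as listed in the table above.
VNET_LINK_LIMITS = {  # circuit size -> (standard, premium add-on)
    "50 Mbps": (10, 20),
    "100 Mbps": (10, 25),
    "200 Mbps": (10, 25),
    "500 Mbps": (10, 40),
    "1 Gbps": (10, 50),
    "2 Gbps": (10, 60),
    "5 Gbps": (10, 75),
    "10 Gbps": (10, 100),
}

def max_vnet_links(circuit_size: str, premium: bool) -> int:
    """Return the maximum number of VNet links for a circuit size/SKU."""
    standard, with_premium = VNET_LINK_LIMITS[circuit_size]
    return with_premium if premium else standard

print(max_vnet_links("1 Gbps", premium=False))   # 10
print(max_vnet_links("10 Gbps", premium=True))   # 100
```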
How do I enable ExpressRoute premium?
You can enable ExpressRoute premium at circuit creation time, or enable it later by
calling the REST API or PowerShell cmdlet. You can shut it down by updating the
circuit state.
How do I disable ExpressRoute premium?
You can disable ExpressRoute premium by calling the REST API or PowerShell cmdlet.
You must make sure that you have scaled your connectivity needs to meet the
default limits before you disable ExpressRoute premium. If your utilization scales
beyond the default limits, the request to disable ExpressRoute premium fails.
Can I pick and choose the features I want from the premium feature set?
No. You can't pick the features. We enable all features when you turn on
ExpressRoute premium.
How much does ExpressRoute premium cost?
Refer to the pricing details page for costs.
ExpressRoute and Office 365
Office 365 was created to be accessed securely and reliably via the Internet. Because
of this, we recommend ExpressRoute for specific scenarios. For information about
using ExpressRoute to access Office 365, visit Azure ExpressRoute for Office 365.
How do I create an ExpressRoute circuit to connect to Office 365 services?
Important
Make sure that you have enabled ExpressRoute premium add-on when configuring
connectivity to Office 365 services.
Can my existing ExpressRoute circuits support connectivity to Office 365 services and
Dynamics 365?
Refer to Office 365 URLs and IP address ranges page for an up-to-date list of services
supported over ExpressRoute.
How much does ExpressRoute for Office 365 services cost?
Office 365 services require premium add-on to be enabled. See the pricing details
page for costs.
What regions is ExpressRoute for Office 365 supported in?
Can I access Office 365 services over the Internet, even though ExpressRoute is
configured for my organization?
Yes. Office 365 service endpoints are reachable through the Internet, even though
ExpressRoute has been configured for your network. Please check with your
organization's networking team if the network at your location is configured to
connect to Office 365 services through ExpressRoute.
How can I plan for high availability for Office 365 network traffic on Azure
ExpressRoute?
See the recommendation for High availability and failover with Azure ExpressRoute
Can I access Office 365 US Government Community (GCC) services over an Azure US
Government ExpressRoute circuit?
Yes. Office 365 GCC service endpoints are reachable through the Azure US
Government ExpressRoute. However, you first need to open a support ticket on the
Azure portal to provide the prefixes you intend to advertise to Microsoft. Your
connectivity to Office 365 GCC services will be established after the support ticket is
resolved.
Route filters for Microsoft peering
I am turning on Microsoft peering for the first time, what routes will I see?
You will not see any routes. You have to attach a route filter to your circuit to start
prefix advertisements. For instructions, see Configure route filters for Microsoft
peering.
I turned on Microsoft peering and now I am trying to select Exchange Online, but it is
giving me an error that I am not authorized to do it.
When using route filters, any customer can turn on Microsoft peering. However, for
consuming Office 365 services, you still need to get authorized by Office 365.
Do I need to get authorization for turning on Dynamics 365 over Microsoft peering?
No, you do not need authorization for Dynamics 365. You can create a rule and
select Dynamics 365 community without authorization.
I enabled Microsoft peering prior to August 1, 2017, how can I take advantage of
route filters?
Your existing circuit will continue advertising the prefixes for Office 365 and
Dynamics 365. If you want to add Azure public prefixes advertisements over the
same Microsoft peering, you can create a route filter, select the services you need
advertised (including the Office 365 service(s) you need and Dynamics 365), and
attach the filter to your Microsoft peering. For instructions, see Configure route filters
for Microsoft peering.
I have Microsoft peering at one location, now I am trying to enable it at another
location and I am not seeing any prefixes.
ExpressRoute Direct
What is ExpressRoute Direct?
ExpressRoute Direct provides customers with the ability to connect directly into
Microsoft’s global network at peering locations strategically distributed across the
world. ExpressRoute Direct provides dual 100 Gbps connectivity, which supports
Active/Active connectivity at scale.
How do customers connect to ExpressRoute Direct?
Customers will need to work with their local carriers and co-location providers to get
connectivity to ExpressRoute routers to take advantage of ExpressRoute Direct.
What locations currently support ExpressRoute Direct?
Port availability is dynamic; you can view current capacity by using PowerShell.
Locations include the following and are subject to change based on availability:
Amsterdam
Canberra
Chicago
Washington DC
Dallas
Hong Kong
Los Angeles
New York City
Paris
San Antonio
Silicon Valley
Singapore
What scenarios should customers consider with ExpressRoute Direct?
ExpressRoute Direct provides customers with direct 100 Gbps port pairs into the
Microsoft global backbone. The scenarios that will provide customers with the
greatest benefits include: massive data ingestion, physical isolation for regulated
markets, and dedicated capacity for burst scenarios, like rendering.
What is the billing model for ExpressRoute Direct?
ExpressRoute Direct is billed for the port pair at a fixed amount. Standard circuits
are included at no additional charge; premium carries a slight add-on charge. Egress
is billed on a per-circuit basis based on the zone of the peering location.
When does billing start for the ExpressRoute Direct port pairs?
ExpressRoute Direct's port pairs are billed 45 days after the creation of the
ExpressRoute Direct resource, or when one or both of the links are enabled,
whichever comes first. The 45-day grace period is granted to allow customers to
complete the cross-connection process with the colocation provider.
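The billing-start rule can be sketched as follows (illustrative arithmetic only; the helper name and dates are hypothetical):

```python
# Billing starts at the earlier of: creation + 45-day grace period, or the
# date the first link is enabled.
from datetime import date, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=45)

def billing_start(created: date, first_link_enabled: Optional[date]) -> date:
    deadline = created + GRACE_PERIOD
    if first_link_enabled is None:
        return deadline           # no link enabled: billing starts when grace ends
    return min(deadline, first_link_enabled)

created = date(2024, 1, 1)
print(billing_start(created, None))               # 2024-02-15: grace period expires
print(billing_start(created, date(2024, 1, 10)))  # 2024-01-10: link enabled first
```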
Global Reach (Preview)
What is ExpressRoute Global Reach?
ExpressRoute Global Reach links your ExpressRoute circuits together, so that your
on-premises networks connected to those circuits can communicate with each other
through Microsoft's global network.
Do I need ExpressRoute Premium for ExpressRoute Global Reach?
If your ExpressRoute circuits are in the same geopolitical region, you don't need
ExpressRoute Premium to connect them together. If two ExpressRoute circuits are in
different geopolitical regions, you need ExpressRoute Premium for both circuits in
order to enable connectivity between them.
How will I be charged for ExpressRoute Global Reach?
I have more than two on-premises networks. Can I connect them all via ExpressRoute
Global Reach?
Yes, you can, as long as the circuits are in the supported countries. You need to
connect two ExpressRoute circuits at a time. To create a fully meshed network, you
need to enumerate all circuit pairs and repeat the configuration.
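Because Global Reach connects two circuits at a time, a full mesh means configuring every circuit pair, which scales as n(n-1)/2. The enumeration can be sketched like so (circuit names are hypothetical):

```python
# Enumerate all circuit pairs needed for a fully meshed Global Reach topology.
from itertools import combinations

circuits = ["circuit-US", "circuit-EU", "circuit-APAC"]  # hypothetical names
pairs = list(combinations(circuits, 2))
for a, b in pairs:
    print(f"enable Global Reach between {a} and {b}")
print(len(pairs))  # 3 circuits -> 3 pair configurations
```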
Can I enable ExpressRoute Global Reach between two ExpressRoute circuits at the
same peering location?
No. The two circuits must be from different peering locations. If a metro in a
supported country has more than one ExpressRoute peering location, you can
connect together the ExpressRoute circuits created at different peering locations in
that metro.
If ExpressRoute Global Reach is enabled between circuit X and circuit Y, and between
circuit Y and circuit Z, will my on-premises networks connected to circuit X and circuit
Z talk to each other via Microsoft's network?
No. To enable connectivity between any two of your on-premises networks, you must
connect the corresponding ExpressRoute circuits explicitly. In the above example, you
must connect circuit X and circuit Z.
What is the network throughput I can expect between my on-premises networks
after I enable ExpressRoute Global Reach?
Do the route limits on my ExpressRoute circuit change after I enable ExpressRoute
Global Reach?
The number of routes you can advertise to Microsoft on Azure private peering
remains at 4000 on a Standard circuit or 10000 on a Premium circuit. The number of
routes you will receive from Microsoft on Azure private peering will be the sum of
the routes of your Azure virtual networks and the routes from your other on-
premises networks connected via ExpressRoute Global Reach. Please make sure you
set an appropriate maximum prefix limit on your on-premises router.
What is the SLA for ExpressRoute Global Reach?
ExpressRoute Global Reach will provide the same availability SLA as the regular
ExpressRoute service.
Business continuity and disaster recovery (BCDR): Azure paired regions
Each Azure region is paired with another region within the same geography, together
making a regional pair. The exception is Brazil South, which is paired with a region
outside its geography. Across the region pairs Azure serializes platform updates
(planned maintenance), so that only one paired region is updated at a time. In the
event of an outage affecting multiple regions, at least one region in each pair will be
prioritized for recovery.
For example, in the UK geography, UK West and UK South form a regional pair.
West India is different because it is paired with another region in one direction only.
West India's secondary region is South India, but South India's secondary region is
Central India.
Brazil South is unique because it is paired with a region outside of its own geography.
Brazil South’s secondary region is South Central US, but South Central US’s secondary
region is not Brazil South.
US Gov Iowa's secondary region is US Gov Virginia, but US Gov Virginia's secondary
region is not US Gov Iowa.
US Gov Virginia's secondary region is US Gov Texas, but US Gov Texas' secondary
region is not US Gov Virginia.
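The secondary-region relationships listed above are not always symmetric. A plain mapping makes the one-directional cases easy to spot; this is an illustrative sketch built only from the examples in the text:

```python
# Secondary-region mapping for the one-directional pairings described above.
SECONDARY_REGION = {
    "West India": "South India",
    "South India": "Central India",
    "Brazil South": "South Central US",
    "US Gov Iowa": "US Gov Virginia",
    "US Gov Virginia": "US Gov Texas",
}

def is_symmetric(region: str) -> bool:
    """True only if region's secondary also lists region as its secondary."""
    secondary = SECONDARY_REGION.get(region)
    return secondary is not None and SECONDARY_REGION.get(secondary) == region

# West India's secondary is South India, but South India's secondary is
# Central India, so the pairing is one-directional:
print(is_symmetric("West India"))  # False
```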
Figure 2 below shows a hypothetical application which uses the regional pair for
disaster recovery. The green numbers highlight the cross-region activities of three
Azure services (Azure compute, storage, and database) and how they are configured
to replicate across regions. The unique benefits of deploying across paired regions
are highlighted by the orange numbers.
As referred to in figure 2.
Azure SQL Database – With Azure SQL Database Geo-Replication, you can
configure asynchronous replication of transactions to any region in the world;
however, we recommend you deploy these resources in a paired region for most
disaster recovery scenarios. For more information, see Geo-Replication in Azure SQL
Database.
As referred to in figure 2.
Sequential updates – Planned Azure system updates are rolled out to paired
regions sequentially (not at the same time) to minimize downtime, the effect of bugs,
and logical failures in the rare event of a bad update.
Data residency – A region resides within the same geography as its pair (with
the exception of Brazil South) in order to meet data residency requirements for tax
and law enforcement jurisdiction purposes.
Business and technology owners must determine how much functionality is required during a
disaster. This level of functionality can take a few forms: completely unavailable, partially
available via reduced functionality or delayed processing, or fully available.
Resiliency and high availability strategies are intended for handling temporary failure
conditions. Executing this plan involves people, processes, and supporting applications that
allow the system to continue functioning. Your plan should include rehearsing failures and
testing the recovery of databases to ensure the plan is sound.
Azure disaster recovery features
Azure maintains datacenters in many regions around the world. This infrastructure supports
several disaster recovery scenarios, such as system-provided geo-replication of Azure Storage
to secondary regions. You can also easily and inexpensively deploy a cloud service to
multiple locations around the world. Compare this with the cost and difficulty of building and
maintaining your own datacenters in multiple regions. Deploying data and services to
multiple regions helps protect your application from a major outage in a single region. As you
design your disaster recovery plan, it’s important to understand the concept of paired regions.
For more information, see Business continuity and disaster recovery (BCDR): Azure Paired
Regions.
Azure Site Recovery
Azure Site Recovery provides a simple way to replicate Azure VMs between regions. It has
minimal management overhead, because you don't need to provision any additional resources
in the secondary region. When you enable replication, Site Recovery automatically creates
the required resources in the target region, based on the source VM settings. It provides
automated continuous replication, and enables you to perform application failover with a
single click. You can also run disaster recovery drills by testing failover, without affecting
your production workloads or ongoing replication.
Azure Traffic Manager
When a region-specific failure occurs, you must redirect traffic to services or deployments in
another region. It is most effective to handle this via services such as Azure Traffic Manager,
which automates the failover of user traffic to another region if the primary region fails.
Understanding the fundamentals of Traffic Manager is important when designing an effective
DR strategy.
Traffic Manager uses the Domain Name System (DNS) to direct client requests to the most
appropriate endpoint based on a traffic-routing method and the health of the endpoints. In the
following diagram, users connect to a Traffic Manager URL
(http://myATMURL.trafficmanager.net) which abstracts the actual site URLs
(http://app1URL.cloudapp.net and http://app2URL.cloudapp.net). User requests are
routed to the proper underlying URL based on your configured Traffic Manager routing
method. For the sake of this article, we will be concerned with only the failover option.
When configuring Traffic Manager, you provide a new Traffic Manager DNS prefix, which
users will use to access your service. Traffic Manager now abstracts load balancing one level
higher than the regional level. The Traffic Manager DNS maps to a CNAME for all the
deployments that it manages.
Within Traffic Manager, you specify a prioritized list of deployments that users will be
routed to when failure occurs. Traffic Manager monitors the deployment endpoints. If the
primary deployment becomes unavailable, Traffic Manager routes users to the next
deployment on the priority list.
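Priority-based failover as described above can be sketched as follows. This models Traffic Manager's behavior for illustration only; it is not the Traffic Manager API, and the endpoint names are hypothetical:

```python
# Direct users to the first healthy endpoint in the priority-ordered list.
def pick_endpoint(endpoints, health):
    """endpoints: list ordered by priority; health: endpoint name -> bool."""
    for name in endpoints:
        if health.get(name, False):
            return name
    return None  # no healthy endpoint available

priority = ["primary.cloudapp.net", "secondary.cloudapp.net"]  # hypothetical
print(pick_endpoint(priority, {"primary.cloudapp.net": True,
                               "secondary.cloudapp.net": True}))
# primary serves while healthy
print(pick_endpoint(priority, {"primary.cloudapp.net": False,
                               "secondary.cloudapp.net": True}))
# failover: traffic moves to the secondary
```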
Although Traffic Manager decides where to go during a failover, you can decide whether
your failover domain is dormant or active while you're not in failover mode (which is
unrelated to Traffic Manager). Traffic Manager detects a failure in the primary site and rolls
over to the failover site, regardless of whether that site is currently serving users.
For more information on how Azure Traffic Manager works, refer to the Traffic
Manager documentation.
The following sections cover several different types of disaster scenarios. Region-wide
service disruptions are not the only cause of application-wide failures. Poor design and
administrative errors can also lead to outages. It's important to consider the possible causes of
a failure during both the design and testing phases of your recovery plan. A good plan takes
advantage of Azure features and augments them with application-specific strategies. The
chosen response is determined by the importance of the application, the recovery point
objective (RPO), and the recovery time objective (RTO).
Application failure
Azure Traffic Manager automatically handles failures that result from the underlying
hardware or operating system software in the host virtual machine. Azure creates a new role
instance and adds it to the available pool. If more than one role instance was already running,
Azure shifts processing to the other running role instances while replacing the failed node.
Serious application errors can occur without any underlying failure of the hardware or
operating system. The application might fail due to catastrophic exceptions caused by bad
logic or data integrity issues. You must include sufficient telemetry in the application code so
that a monitoring system can detect failure conditions and notify an application administrator.
An administrator who has full knowledge of the disaster recovery processes can decide
whether to trigger a failover process or accept an availability outage while resolving the
critical errors.
Data corruption
Azure automatically stores Azure SQL Database and Azure Storage data three times
redundantly within different fault domains in the same region. If you use geo-replication, the
data is stored three additional times in a different region. However, if your users or your
application corrupts that data in the primary copy, the data quickly replicates to the other
copies. Unfortunately, this results in multiple copies of corrupt data.
To manage potential corruption of your data, you have two options. First, you can manage a
custom backup strategy. You can store your backups in Azure or on-premises, depending on
your business requirements or governance regulations. Another option is to use the point-in-
time restore option to recover a SQL database. For more information, see the data strategies
for disaster recovery section below.
Network outage
When parts of the Azure network are inaccessible, you may be unable to access your
application or data. If one or more role instances are unavailable due to network issues, Azure
uses the remaining available instances of your application. If your application cannot access
its data because of an Azure network outage, you can potentially run with reduced application
functionality locally by using cached data. You need to design the disaster recovery strategy
to run with reduced functionality in your application. For some applications, this might not be
practical.
Azure provides many services that can experience periodic downtime. For example, Azure
Redis Cache is a multi-tenant service which provides caching capabilities to your application.
It's important to consider what happens in your application if the dependent service is
unavailable. In many ways, this scenario is similar to the network outage scenario. However,
considering each service independently results in potential improvements to your overall
plan.
Azure Redis Cache provides caching to your application from within your cloud service
deployment, which provides disaster recovery benefits. First, the service now runs on roles
that are local to your deployment. Therefore, you're better able to monitor and manage the
status of the cache as part of your overall management processes for the cloud service. This
type of caching also exposes new features such as high availability for cached data, which
preserves cached data if a single node fails by maintaining duplicate copies on other nodes.
Note that high availability decreases throughput and increases latency because write
operations must also update any secondary copies. The amount of memory required to store
the cached data is effectively doubled, which must be taken into account during capacity
planning. This example demonstrates that each dependent service might have capabilities that
improve your overall availability and resistance to catastrophic failures.
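The capacity-planning point about doubled memory can be expressed as simple arithmetic (an illustrative sketch, not an Azure Cache sizing tool):

```python
# With high availability, each cached item is duplicated on another node, so
# plan for roughly twice the memory of the raw cached data.
def required_cache_memory_gb(cached_data_gb: float, high_availability: bool) -> float:
    return cached_data_gb * (2 if high_availability else 1)

print(required_cache_memory_gb(8, high_availability=False))  # 8
print(required_cache_memory_gb(8, high_availability=True))   # 16
```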
With each dependent service, you should understand the implications of a service disruption.
In the caching example, it might be possible to access the data directly from a database until
you restore your cache. This would result in reduced performance while providing full access
to application data.
Region-wide service disruption
The previous failures have primarily been failures that can be managed within the same
Azure region. However, you must also prepare for the possibility that there is a service
disruption of the entire region. If a region-wide service disruption occurs, the locally
redundant copies of your data are not available. If you have enabled geo-replication, there are
three additional copies of your blobs and tables in a different region. If Microsoft declares the
region lost, Azure remaps all of the DNS entries to the geo-replicated region.
Note
Be aware that you don't have any control over this process, and it will occur only for region-
wide service disruption. Consider using Azure Site Recovery to achieve better RPO and
RTO. Site Recovery allows you to decide what an acceptable outage is, and when to
fail over to the replicated VMs.
Azure-wide service disruption
In disaster planning, you must consider the entire range of possible disasters. One of the most
severe service disruptions would involve all Azure regions simultaneously. As with other
service disruptions, you might decide to accept the risk of temporary downtime in that event.
Widespread service disruptions that span regions are much rarer than isolated service
disruptions involving dependent services or single regions.
However, you may decide that certain mission-critical applications require a backup plan for
a multi-region service disruption. This plan might include failing over to services in
an alternative cloud or a hybrid on-premises and cloud solution.
Reduced application functionality
A well-designed application typically uses services that communicate with each other through
the implementation of loosely coupled information-interchange patterns. A DR-friendly
application requires separation of responsibilities at the service level. This prevents the
disruption of a dependent service from bringing down the entire application. For example,
consider a web commerce application for Company Y. The following modules might
constitute the application:
• Product Catalog
• Shopping Cart
• Order Submission
• Order Processing
When a service dependency in this application becomes unavailable, how does the service
function until the dependency recovers? A well-designed system implements isolation
boundaries through separation of responsibilities, both at design time and at runtime. You can
categorize every failure as recoverable and non-recoverable. Non-recoverable errors will
bring down the service, but you can mitigate a recoverable error through alternatives. Certain
problems addressed by automatically handling faults and taking alternate actions are
transparent to the user. During a more serious service disruption, the application might be
completely unavailable. A third option is to continue handling user requests with reduced
functionality.
For instance, if the database for hosting orders goes down, the Order Processing service loses
its ability to process sales transactions. Depending on the architecture, it might be difficult or
impossible for the Order Submission and Order Processing services of the application to
continue. If the application is not designed to handle this scenario, the entire application
might go offline. However, if the product data is stored in a different location, then the
Product Catalog module can still be used for viewing products. However, other parts of the
application are unavailable, such as ordering or inventory queries.
Deciding what reduced application functionality is available is both a business decision and a
technical decision. You must decide how the application will inform the users of any
temporary problems. In the example above, the application might allow viewing products and
adding them to a shopping cart. However, when the user attempts to make a purchase, the
application notifies the user that the ordering functionality is temporarily unavailable. This
isn't ideal for the customer, but it does prevent an application-wide service disruption.
Data strategies for disaster recovery
Proper data handling is a challenging aspect of a disaster recovery plan. During the recovery
process, data restoration typically takes the most time. Different choices for reducing
functionality result in difficult challenges for data recovery from failure and consistency after
failure.
One consideration is the need to restore or maintain a copy of the application’s data. You will
use this data for reference and transactional purposes at a secondary site. An on-premises
deployment requires an expensive and lengthy planning process to implement a multiple-
region disaster recovery strategy. Conveniently, most cloud providers, including Azure,
readily allow the deployment of applications to multiple regions. These regions are
geographically distributed in such a way that multiple-region service disruption should be
extremely rare. The strategy for handling data across regions is one of the contributing factors
for the success of any disaster recovery plan.
The following sections discuss disaster recovery techniques related to data backups, reference
data, and transactional data.
Backup and restore
Regular backups of application data can support some disaster recovery scenarios. Different
storage resources require different techniques.
SQL Database
For the Basic, Standard, and Premium SQL Database tiers, you can take advantage of point-
in-time restore to recover your database. For more information, see Overview: Cloud
business continuity and database disaster recovery with SQL Database. Another option is to
use Active Geo-Replication for SQL Database. This automatically replicates database
changes to secondary databases in the same Azure region or even in a different Azure region.
This provides a potential alternative to some of the more manual data synchronization
techniques presented in this article. For more information, see Overview: SQL Database
Active Geo-Replication.
You can also use a more manual approach for backup and restore. Use the DATABASE
COPY command to create a backup copy of the database with transactional consistency. You
can also use the import/export service of Azure SQL Database, which supports exporting
databases to BACPAC files (compressed files containing your database schema and
associated data) that are stored in Azure Blob storage.
The built-in redundancy of Azure Storage creates two replicas of the backup file in the same
region. However, the frequency of running the backup process determines your RPO, which
is the amount of data you might lose in disaster scenarios. For example, imagine that you
perform a backup at the top of each hour, and a disaster occurs two minutes before the top of
the hour. You lose 58 minutes of data recorded after the last backup was performed. Also, to
protect against a region-wide service disruption, you should copy the BACPAC files to an
alternate region. You then have the option of restoring those backups in the alternate region.
For more details, see Overview: Cloud business continuity and database disaster recovery
with SQL Database.
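The arithmetic behind this example is simple enough to sketch. The Python snippet below (dates chosen purely for illustration) shows why the backup interval effectively sets your RPO:

```python
from datetime import datetime, timedelta

def data_loss(last_backup: datetime, disaster: datetime) -> timedelta:
    """Everything recorded after the last completed backup is lost."""
    return disaster - last_backup

# Hourly backups; the disaster strikes two minutes before the next run.
last_backup = datetime(2018, 11, 26, 9, 0)
disaster = datetime(2018, 11, 26, 9, 58)
print(data_loss(last_backup, disaster))  # 0:58:00
# In the worst case, the loss approaches the full backup interval,
# so the backup frequency is a floor on your achievable RPO.
```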
SQL Data Warehouse
For SQL Data Warehouse, use geo-backups to restore to a paired region for disaster recovery.
These backups are taken every 24 hours and can be restored within 20 minutes in the paired
region. This feature is on by default for all SQL data warehouses. For more information on
how to restore your data warehouse, see Restore from an Azure geographical region using
PowerShell.
Azure Storage
For Azure Storage, you can develop a custom backup process or use one of many third-party
backup tools. Note that most application designs have additional complexities where storage
resources reference each other. For example, consider a SQL database that has a column that
links to a blob in Azure Storage. If the backups do not happen simultaneously, the database
might have a pointer to a blob that was not backed up before the failure. The application or
disaster recovery plan must implement processes to handle this inconsistency after a
recovery.
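This kind of inconsistency can be detected after a restore with a simple reconciliation pass. The sketch below uses in-memory stand-ins for the restored database rows and the restored blob container, but the same pattern applies to any pair of storage resources that reference each other:

```python
def find_dangling_blob_refs(restored_rows, restored_blob_names):
    """Return restored DB rows whose blob pointer has no matching restored blob."""
    blobs = set(restored_blob_names)
    return [row for row in restored_rows if row["blob_name"] not in blobs]

# The database backup ran after the blob backup, so the database
# references a blob that never made it into the backup set.
rows = [{"id": 1, "blob_name": "invoice-001.pdf"},
        {"id": 2, "blob_name": "invoice-002.pdf"}]
blobs = ["invoice-001.pdf"]

for row in find_dangling_blob_refs(rows, blobs):
    print(f"row {row['id']} points at missing blob {row['blob_name']}")
```

The recovery plan then decides what to do with the dangling rows: quarantine them, regenerate the blobs, or surface them to operators.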
Reference data pattern for disaster recovery
Reference data is read-only data that supports application functionality. It typically does not
change frequently. Although backup and restore is one method to handle region-wide service
disruptions, the RTO is relatively long. When you deploy the application to a secondary
region, some strategies can improve the RTO for reference data.
Because reference data changes infrequently, you can improve the RTO by maintaining a
permanent copy of the reference data in the secondary region. This eliminates the time
required to restore backups in the event of a disaster. To meet the multiple-region disaster
recovery requirements, you must deploy the application and the reference data together in
multiple regions. You can deploy reference data to the role itself, to external storage, or to a
combination of both.
The reference data deployment model within compute nodes implicitly satisfies the disaster
recovery requirements. Reference data deployment to SQL Database requires that you deploy
a copy of the reference data to each region. The same strategy applies to Azure Storage. You
must deploy a copy of any reference data that's stored in Azure Storage to the primary and
secondary regions.
You must implement your own application-specific backup routines for all data, including
reference data. Geo-replicated copies across regions are used only in a region-wide service
disruption. To prevent extended downtime, deploy the mission-critical parts of the
application’s data to the secondary region. For an example of this topology, see the active-
passive model.
Transactional data pattern for disaster recovery
The following architecture examples provide some ideas on different ways of handling
transactional data in a failover scenario. It's important to note that these examples are not
exhaustive. For example, intermediate storage locations such as queues might be replaced
with Azure SQL Database. The queues themselves might be either Azure Storage or Azure
Service Bus queues (see Azure queues and Service Bus queues - compared and contrasted).
Server storage destinations might also vary, such as Azure tables instead of SQL Database. In
addition, worker roles might be inserted as intermediaries in various steps. The intent is not to
emulate these architectures exactly, but to consider various alternatives in the recovery of
transactional data and related modules.
Replication of transactional data in preparation for disaster recovery
Consider an application that uses Azure Storage queues to hold transactional data. This
allows worker roles to process the transactional data to the server database in a decoupled
architecture. This requires the transactions to use some form of temporary caching if the
front-end roles require the immediate query of that data. Depending on the level of data-loss
tolerance, you might choose to replicate the queues, the database, or all of the storage
resources. With only database replication, if the primary region goes down, you can still
recover the data in the queues when the primary region comes back.
The following diagram shows an architecture where the server database is synchronized
across regions.
The biggest challenge to implementing this architecture is the replication strategy between
regions. The Azure SQL Data Sync service enables this type of replication. As of this writing,
the service is in preview and is not yet recommended for production environments. For more
information, see Overview: Cloud business continuity and database disaster recovery with
SQL Database. For production applications, you must invest in a third-party solution or
create your own replication logic in code. Depending on the architecture, the replication
might be bidirectional, which is more complex.
One potential implementation might use the intermediate queue in the previous example. The
worker role that processes the data to the final storage destination might make the change in
both the primary region and the secondary region. These are not trivial tasks, and complete
guidance for replication code is beyond the scope of this article. Invest significant time and
testing into the approach for replicating data to the secondary region. Additional processing
and testing can help ensure that the failover and recovery processes correctly handle any
possible data inconsistencies or duplicate transactions.
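One common way to make such dual writes safe against duplicate deliveries is to key every transaction with an idempotency id. The sketch below is an in-memory illustration of that idea, not a production replication implementation; the store class is a hypothetical stand-in for the regional databases:

```python
class IdempotentStore:
    """Applies each transaction at most once, keyed by transaction id."""
    def __init__(self):
        self.applied = {}

    def apply(self, txn_id, payload):
        if txn_id in self.applied:   # duplicate delivery: ignore it
            return False
        self.applied[txn_id] = payload
        return True

def replicate(txn_id, payload, primary, secondary):
    """Worker role writes the change to both regions."""
    primary.apply(txn_id, payload)
    secondary.apply(txn_id, payload)

primary, secondary = IdempotentStore(), IdempotentStore()
replicate("txn-42", {"amount": 100}, primary, secondary)
replicate("txn-42", {"amount": 100}, primary, secondary)  # retried delivery
print(len(primary.applied), len(secondary.applied))  # 1 1
```

Because retries are absorbed rather than reapplied, failover and recovery can safely replay queued transactions without creating duplicates.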
Note
Most of this paper focuses on platform as a service (PaaS). However, additional replication
and availability options for hybrid applications use Azure Virtual Machines. These hybrid
applications use infrastructure as a service (IaaS) to host SQL Server on virtual machines in
Azure. This allows traditional availability approaches in SQL Server, such as AlwaysOn
Availability Groups or Log Shipping. Some techniques, such as AlwaysOn, can also work
between on-premises SQL Server instances and Azure virtual machines. For more
information, see High availability and disaster recovery for SQL Server in Azure Virtual
Machines.
Reduced application functionality for transaction capture
Consider a second architecture that operates with reduced functionality. The application in
the secondary region deactivates nonessential functionality, such as reporting, business
intelligence (BI), or draining queues. It accepts only the most important types of transactional workflows,
as defined by business requirements. The system captures the transactions and writes them to
queues. The system might postpone processing the data during the initial stage of the service
disruption. If the system on the primary region is reactivated within the expected time
window, the worker roles in the primary region can drain the queues. This process eliminates
the need for database merging. If the primary region service disruption goes beyond the
tolerable window, the application can start processing the queues.
In this scenario, the database in the secondary region contains incremental transactional data
that must be merged after the primary is reactivated. The following diagram shows this
strategy for temporarily storing transactional data until the primary region is restored.
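The decision logic described above can be sketched as follows. The tolerable-outage window and the transaction payloads are hypothetical placeholders; a real implementation would capture writes in Azure Storage or Service Bus queues:

```python
import collections

TOLERABLE_OUTAGE_MINUTES = 60  # hypothetical business requirement

def handle_outage(transactions, outage_minutes):
    """Capture critical transactions in a queue and decide who drains it."""
    queue = collections.deque(transactions)  # secondary region captures writes
    if outage_minutes <= TOLERABLE_OUTAGE_MINUTES:
        drained_by = "primary"    # primary came back in time: its workers drain the queue
    else:
        drained_by = "secondary"  # secondary starts processing; merge databases later
    processed = [queue.popleft() for _ in range(len(queue))]
    return drained_by, processed

who, done = handle_outage(["t1", "t2", "t3"], outage_minutes=20)
print(who, done)  # primary ['t1', 't2', 't3']
```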
For more discussion of data management techniques for resilient Azure applications,
see Failsafe: Guidance for Resilient Cloud Architectures.
Deployment topologies for disaster recovery
A successful Azure disaster recovery includes building that recovery into the solution from
the start. The cloud provides additional options for recovering from failures during a disaster
that are not available in a traditional hosting provider. Specifically, you can dynamically and
quickly allocate resources in a different region, avoiding the cost of idle resources prior to a
failure.
The following sections cover different deployment topologies for disaster recovery.
Typically, there's a tradeoff in increased cost or complexity for additional availability.
Single-region deployment
A single-region deployment is not really a disaster recovery topology, but is meant to contrast
with the other architectures. Single-region deployments are common for applications in
Azure; however, they do not meet the requirements of a disaster recovery topology.
The following diagram depicts an application running in a single Azure region. Azure Traffic
Manager and the use of fault and upgrade domains increase availability of the application
within the region.
In this scenario, the database is a single point of failure. Though Azure replicates the data
across different fault domains to internal replicas, this replication occurs only within the same
region. The application cannot withstand a catastrophic failure. If the region becomes
unavailable, then so do the fault domains, including all service instances and storage
resources.
For all but the least critical applications, you must devise a plan to deploy your applications
across multiple regions. You should also weigh RTO and cost constraints when choosing
a deployment topology.
Let's take a look now at specific approaches to supporting failover across different regions.
These examples all use two regions to describe the process.
Failover using Azure Site Recovery
When you enable Azure VM replication using Azure Site Recovery, it creates several
resources in the secondary region:
• A resource group
• A virtual network (VNet)
• A storage account
• Availability sets to hold VMs after failover
Data writes on the VM disks in the primary region are continuously transferred to the storage
account in the secondary region. Recovery points are generated in the target storage account
every few minutes. When you initiate a failover, the recovered VMs are created in the target
resource group, VNet, and availability set. During a failover, you can choose any available
recovery point.
Redeployment to a secondary Azure region
For the approach of redeployment to a secondary region, only the primary region has
applications and databases running. The secondary region is not set up for an automatic
failover. So when a disaster occurs, you must spin up all the parts of the service in the new
region. This includes uploading a cloud service to Azure, deploying the cloud service,
restoring the data, and changing DNS to reroute the traffic.
Although this is the most affordable of the multiple-region options, it has the worst RTO
characteristics. In this model, the service package and database backups are stored either on-
premises or in the Azure Blob storage instance of the secondary region. However, you must
deploy a new service and restore the data before it resumes operation. Even with full
automation of the data transfer from backup storage, provisioning a new database
environment consumes a lot of time. Moving data from the backup disk storage to the empty
database on the secondary region is the most expensive part of the restore process. You must
do this, however, to bring the new database to an operational state because it isn't replicated.
The best approach is to store the service packages in Blob storage in the secondary region.
This eliminates the need to upload the package to Azure, which is what happens when you
deploy from an on-premises development machine. You can quickly deploy the service
packages to a new cloud service from Blob storage by using PowerShell scripts.
This option is practical only for non-critical applications that can tolerate a high RTO. For
instance, this might work for an application that can be down for several hours but is required
to be available within 24 hours.
Active-passive
An active-passive topology is the choice that many companies favor. This topology provides
improvements to the RTO with a relatively small increase in cost over the redeployment
approach. In this scenario, there is again a primary and a secondary Azure region. All of the
traffic goes to the active deployment on the primary region. The secondary region is better
prepared for disaster recovery because the database is running on both regions. Additionally,
a synchronization mechanism is in place between them. This standby approach can involve
two variations: a database-only approach or a complete deployment in the secondary region.
Database only
In the first variation of the active-passive topology, only the primary region has a deployed
cloud service application. However, unlike the redeployment approach, both regions are
synchronized with the contents of the database. (For more information, see the section
on transactional data pattern for disaster recovery.) When a disaster occurs, there are fewer
activation requirements. You start the application in the secondary region, change connection
strings to the new database, and change the DNS entries to reroute traffic.
Like the redeployment approach, you should have already stored the service packages in
Azure Blob storage in the secondary region for faster deployment. However, you don’t incur
the majority of the overhead that database restore operation requires, because the database is
ready and running. This saves a significant amount of time, making this an affordable DR
pattern (and the one most frequently used).
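The activation steps for this database-only variation can be sketched as a small orchestration script. Every name here (config keys, region names, package name) is a hypothetical placeholder; a real failover would drive the deployment, connection-string, and DNS changes through your management tooling:

```python
def fail_over(config):
    """Activate the secondary region: deploy the app, repoint DB and DNS."""
    steps = []
    # 1. Start the application from packages pre-staged in Blob storage.
    steps.append(f"deploy {config['package']} to {config['secondary_region']}")
    # 2. Point the app at the already-synchronized secondary database.
    config["connection_string"] = config["secondary_db"]
    steps.append("update connection strings")
    # 3. Reroute traffic by changing DNS entries.
    config["dns_target"] = config["secondary_region"]
    steps.append("update DNS entries")
    return steps

config = {"package": "app.cspkg", "secondary_region": "northeurope",
          "secondary_db": "Server=secondary;...", "dns_target": "westeurope"}
for step in fail_over(config):
    print(step)
```

Because the database is already running and synchronized, only the deployment and routing steps remain on the critical path.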
Full replica
In the second variation of the active-passive topology, both the primary region and the
secondary region have a full deployment. This deployment includes the cloud services and a
synchronized database. However, only the primary region is actively handling network
requests from the users. The secondary region becomes active only when the primary region
experiences a service disruption. In that case, all new network requests route to the secondary
region. Azure Traffic Manager can manage this failover automatically.
Failover occurs faster than the database-only variation because the services are already
deployed. This topology provides a very low RTO. The secondary failover region must be
ready to go immediately after failure of the primary region.
Along with a quicker response time, this topology pre-allocates and deploys backup services,
avoiding the possibility of a lack of space to allocate new instances during a disaster. This is
important if your secondary Azure region is nearing capacity. No service-level agreement
(SLA) guarantees that you can instantly deploy one or more new cloud services in any region.
For the fastest response time with this model, you must have similar scale (number of role
instances) in the primary and secondary regions. Despite the advantages, paying for unused
compute instances is costly, and this might not be the most prudent financial choice. Because
of this, it's more common to use a slightly scaled-down version of cloud services on the
secondary region. Then you can quickly fail over and scale out the secondary deployment if
necessary. You should automate the failover process so that after the primary region is
inaccessible, you activate additional instances, depending on the load. This might involve the
use of an autoscaling mechanism like virtual machine scale sets.
The following diagram shows the model where the primary and secondary regions contain a
fully deployed cloud service in an active-passive topology.
Active-active
In an active-active topology, the cloud services and database are fully deployed in both
regions. Unlike the active-passive model, both regions receive user traffic. This option yields
the quickest recovery time. The services are already scaled to handle a portion of the load at
each region. DNS is already enabled to use the secondary region. There's additional
complexity in determining how to route users to the appropriate region. Round-robin
scheduling might be possible. It's more likely that certain users would use a specific region
where the primary copy of their data resides.
In case of failover, simply disable DNS to the primary region. This routes all traffic to the
secondary region.
Even in this model, there are some variations. For example, the following diagram depicts a
primary region which owns the master copy of the database. The cloud services in both
regions write to that primary database. The secondary deployment can read from the primary
or replicated database. Replication in this example is one-way.
There is a downside to the active-active architecture in the preceding diagram. The second
region must access the database in the first region because the master copy resides there.
Performance significantly drops off when you access data from outside a region. In cross-
region database calls, you should consider some type of batching strategy to improve the
performance of these calls. For more information, see How to use batching to improve SQL
Database application performance.
An alternative architecture might involve each region accessing its own database directly. In
that model, some type of bidirectional replication is required to synchronize the databases in
each region.
With the previous topologies, decreasing RTO generally increases costs and complexity. The
active-active topology deviates from this cost pattern. In the active-active topology, you
might not need as many instances on the primary region as you would in the active-passive
topology. If you have 10 instances on the primary region in an active-passive architecture,
you might need only 5 in each region in an active-active architecture. Both regions now share
the load. This might be a cost savings over the active-passive topology if you keep a warm
standby on the passive region with 10 instances waiting for failover.
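The instance arithmetic behind this comparison can be written out explicitly. The per-instance price below is a made-up unit value; only the ratios matter:

```python
def monthly_cost(instances, price_per_instance=100):  # hypothetical unit price
    return instances * price_per_instance

# Active-passive with a warm standby: 10 active + 10 idle instances.
active_passive = monthly_cost(10) + monthly_cost(10)
# Active-active: both regions share the load with 5 instances each.
active_active = monthly_cost(5) + monthly_cost(5)
print(active_passive, active_active)  # 2000 1000
```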
Realize that until you restore the primary region, the secondary region might receive a sudden
surge of new users. If there are 10,000 users on each server when the primary region
experiences a service disruption, the secondary region suddenly has to handle 20,000 users.
Monitoring rules on the secondary region must detect this increase and double the instances
in the secondary region. For more information on this, see the section on failure detection.
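A monitoring rule of this kind can compute its scale-out target proportionally. This is a minimal sketch, assuming load scales linearly with user count:

```python
import math

def target_instances(current, users_before, users_after):
    """Scale the surviving region in proportion to its new user load."""
    return math.ceil(current * users_after / users_before)

# 10,000 users per region; the primary fails and its users arrive.
print(target_instances(5, 10_000, 20_000))  # 10
```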
Hybrid on-premises and cloud solution
One additional strategy for disaster recovery is to architect a hybrid application that runs on-
premises and in the cloud. Depending on the application, the primary region might be either
location. Consider the previous architectures and imagine the primary or secondary region as
an on-premises location.
There are some challenges in these hybrid architectures. First, most of this article has
addressed PaaS architecture patterns. Typical PaaS applications in Azure rely on Azure-
specific constructs such as roles, cloud services, and Traffic Manager. Creating an on-
premises solution for this type of PaaS application would require a significantly different
architecture. This might not be feasible from a management or cost perspective.
However, a hybrid solution for disaster recovery has fewer challenges for traditional
architectures that have been migrated to the cloud, such as IaaS-based architectures. IaaS
applications use virtual machines in the cloud that can have direct on-premises equivalents.
You can also use virtual networks to connect machines in the cloud with on-premises
network resources. This allows several possibilities that are not possible with PaaS-only
applications. For example, SQL Server can take advantage of disaster recovery solutions such
as AlwaysOn Availability Groups and database mirroring. For details, see High availability
and disaster recovery for SQL Server in Azure virtual machines.
IaaS solutions also provide an easier path for on-premises applications to use Azure as the
failover option. You might have a fully functioning application in an existing on-premises
region. However, what if you lack the resources to maintain a geographically separate region
for failover? You might decide to use virtual machines and virtual networks to get your
application running in Azure. In that case, define processes that synchronize data to the
cloud. The Azure deployment then becomes the secondary region to use for failover. The
primary region remains the on-premises application. For more information about IaaS
architectures and capabilities, see the Virtual Machines documentation.
Alternative cloud
There are situations where the broad capabilities of Microsoft Azure still may not meet
internal compliance rules or policies required by your organization. Even the best-designed
backup systems can prove inadequate during a global service disruption of a single cloud
service provider.
You should compare availability requirements with the cost and complexity of increased
availability. Perform a risk analysis, and define the RTO and RPO for your solution. If your
application cannot tolerate any downtime, you might consider using an additional cloud
solution. Unless the entire Internet goes down, another cloud solution might still be available
if Azure becomes globally inaccessible.
As with the hybrid scenario, the failover deployments in the previous disaster recovery
architectures can also exist within another cloud solution. Alternative cloud DR sites should
be used only for solutions whose RTO allows very little, if any, downtime. Note that a
solution that uses a DR site outside Azure will require more work to configure, develop,
deploy, and maintain. It's also more difficult to implement proven practices in a cross-cloud
architecture. Although cloud platforms have similar high-level concepts, the APIs and
architectures are different.
If your DR strategy relies upon multiple cloud platforms, it's valuable to include abstraction
layers in the design of the solution. This eliminates the need to develop and maintain two
different versions of the same application for different cloud platforms in case of disaster. As
with the hybrid scenario, the use of Azure Virtual Machines or Azure Container Service
might be easier in these cases than the use of cloud-specific PaaS designs.
Automation
Some of the patterns that we just discussed require quick activation of offline deployments as
well as restoration of specific parts of a system. Automation scripts can activate resources on
demand and deploy solutions rapidly. The DR-related automation examples below use Azure
PowerShell, but using the Azure CLI or the Service Management REST API are also good
options.
Automation scripts manage aspects of DR not transparently handled by Azure. This produces
consistent and repeatable results, minimizing human error. Predefined DR scripts also reduce
the time to rebuild a system and its constituent parts during a disaster. You don’t want to try
to manually figure out how to restore your site while it's down and losing money every
minute.
Test your scripts repeatedly from start to finish. After verifying their basic functionality,
make sure to test them in disaster simulation. This helps uncover defects in the scripts or
processes.
Failure detection
To correctly handle problems with availability and disaster recovery, you must be able to
detect and diagnose failures. Perform advanced server and deployment monitoring to quickly
recognize when a system or its components suddenly become unavailable. Monitoring tools
that assess the overall health of the cloud service and its dependencies can perform part of
this work. One suitable Microsoft tool is System Center 2016. Third-party tools can also
provide monitoring capabilities. Most monitoring solutions track key performance counters
and service availability.
Although these tools are vital, you must plan for fault detection and reporting within a cloud
service. You must also plan to properly use Azure Diagnostics. Custom performance counters
or event-log entries can also be part of the overall strategy. This provides more data during
failures to quickly diagnose the problem and restore full capabilities. It also provides
additional metrics that the monitoring tools can use to determine application health. For more
information, see Enabling Azure Diagnostics in Azure Cloud Services. For a discussion of
how to plan for an overall “health model,” see Failsafe: Guidance for Resilient Cloud
Architectures.
Disaster simulation
Simulation testing involves creating small real-life situations on the work floor to observe
how the team members react. Simulations also show how effective the solutions are in the
recovery plan. Execute simulations so that the created scenarios don't disrupt actual business,
while still feeling like real situations.
The simulation highlights any issues that were inadequately addressed. The simulated
scenarios must be completely controllable. This means that, even if the recovery plan seems
to be failing, you can restore the situation back to normal without causing any significant
damage. It’s also important that you inform higher-level management about when and how
the simulation exercises will be executed. This plan should detail the time or resources
affected during the simulation. Also define the measures of success when testing your
disaster recovery plan.
If you are using Azure Site Recovery, you can execute a test failover to Azure, to validate
your replication strategy or perform a disaster recovery drill without any data loss or
downtime. A test failover does not affect the ongoing VM replication or your production
environment.
Several other techniques can test disaster recovery plans. However, most of them are simply
variations of these basic techniques. The intent of this testing is to evaluate the feasibility of
the recovery plan. Disaster recovery testing focuses on the details to discover gaps in the
basic recovery plan.
Service-specific guidance
Azure Database for MySQL: Overview of business continuity with Azure Database for MySQL
Azure Database for PostgreSQL: Overview of business continuity with Azure Database for PostgreSQL
Cloud Services: What to do in the event of an Azure service disruption that impacts Azure Cloud Services
Virtual machines: What to do in the event that an Azure service disruption impacts Azure virtual machines
Availability checklist
11/26/2018
11 minutes to read
Availability is the proportion of time that a system is functional and working, and is
one of the pillars of software quality. Use this checklist to review your application
architecture from an availability standpoint.
Application design
Avoid any single point of failure. All components, services, resources, and compute
instances should be deployed as multiple instances to prevent a single point of
failure from affecting availability. This includes authentication mechanisms. Design
the application to be configurable to use multiple instances, and to automatically
detect failures and redirect requests to non-failed instances where the platform does
not do this automatically.
Use staging and production features of the platform. For example, Azure App
Service supports deployment slots, which you can use to stage a deployment before
swapping it to production. Azure Service Fabric supports rolling upgrades to
application services.
Replicate VMs using Azure Site Recovery. To maximize availability, replicate all
your virtual machines into another Azure region using Site Recovery. Ensure that all
the VMs across all the tiers of your application are replicated. If there is a disruption
in the source region, you can fail over the VMs into the other region within minutes.
Data management
Geo-replicate databases. Azure SQL Database and Cosmos DB both support geo-
replication, which enables you to configure secondary database replicas in other
regions. Secondary databases are available for querying and for failover in the case
of a data center outage or the inability to connect to the primary database. For more
information, see Failover groups and active geo-replication (SQL Database) and How
to distribute data globally with Azure Cosmos DB.
Use periodic backup and point-in-time restore. Regularly and automatically back
up data that is not preserved elsewhere, and verify you can reliably restore both the
data and the application itself should a failure occur. Ensure that backups meet your
Recovery Point Objective (RPO). Data replication is not a backup feature, because
human error or malicious operations can corrupt data across all the replicas. The
backup process must be secure to protect the data in transit and in storage.
Databases or parts of a data store can usually be recovered to a previous point in
time by using transaction logs. For more information, see Recover from data
corruption or accidental deletion.
Replicate VM disks using Azure Site Recovery. When you replicate Azure VMs
using Site Recovery, all the VM disks are continuously replicated to the target region
asynchronously. The recovery points are created every few minutes. This gives you an
RPO in the order of minutes.
Errors and failures
Provide rich instrumentation for likely failures and failure events to report the
situation to operations staff. For failures that are likely but have not yet occurred,
provide sufficient data to enable operations staff to determine the cause, mitigate
the situation, and ensure that the system remains available. For failures that have
already occurred, the application should return an appropriate error message to the
user but attempt to continue running, albeit with reduced functionality. In all cases,
the monitoring system should capture comprehensive details to enable operations
staff to effect a quick recovery, and if necessary, for designers and developers to
modify the system to prevent the situation from arising again.
Monitor system health by implementing checking functions. The health and
performance of an application can degrade over time, without being noticeable until
it fails. Implement probes or check functions that are executed regularly from outside
the application. These checks can be as simple as measuring response time for the
application as a whole, for individual parts of the application, for individual services
that the application uses, or for individual components. Check functions can execute
processes to ensure they produce valid results, measure latency and check
availability, and extract information from the system.
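A check function of the kind described can be very small. The probe below calls an injected function rather than a real endpoint so the sketch stays self-contained; in practice it would issue an HTTP request to the application and its dependencies. The latency thresholds are illustrative assumptions:

```python
import time

def probe(check, warn_seconds=0.5, fail_seconds=2.0):
    """Run one health check and classify the result by outcome and latency."""
    start = time.monotonic()
    try:
        check()                      # e.g. an HTTP GET against the app
    except Exception:
        return "unhealthy"           # the check itself failed
    elapsed = time.monotonic() - start
    if elapsed > fail_seconds:
        return "unhealthy"           # responding, but unacceptably slow
    if elapsed > warn_seconds:
        return "degraded"            # responding, but slower than normal
    return "healthy"

def failing_check():
    raise RuntimeError("connection refused")

print(probe(lambda: None))   # healthy
print(probe(failing_check))  # unhealthy
```

Results from probes like this feed both the failure-detection alarms and the dashboards mentioned in the next item.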
Test the monitoring systems. Automated failover and fallback systems, and manual
visualization of system health and performance by using dashboards, all depend on
monitoring and instrumentation functioning correctly. If these elements fail, miss
critical information, or report inaccurate data, an operator might not realize that the
system is unhealthy or failing.
Plan for disaster recovery. Create an accepted, fully-tested plan for recovery from
any type of failure that may affect system availability. Choose a multi-site disaster
recovery architecture for any mission-critical applications. Identify a specific owner of
the disaster recovery plan, including automation and testing. Ensure the plan is well-
documented, and automate the process as much as possible. Establish a backup
strategy for all reference and transactional data, and test the restoration of these
backups regularly. Train operations staff to execute the plan, and perform regular
disaster simulations to validate and improve the plan. If you are using Azure Site
Recovery to replicate VMs, create a fully automated recovery plan to failover the
entire application within minutes.