All content following this page was uploaded by Claus Pahl on 12 October 2017.
1 Introduction
The effectiveness of BC can be controlled via the key features recovery time
objective (RTO) and recovery point objective (RPO) [1]. RTO is the duration
of time needed until the service is usable again after a major outage, and RPO
refers to the point in time from which data will be restored to be usable. To
reduce the downtime (RTO) and data loss (RPO) of a system after an outage,
it is common to replicate data (e.g., database, application, or system meta-data)
to another standby site or another provider. Recently, virtualized cloud platforms
have been used to provide scalable HADR solutions. Moreover, the pay-as-you-go
pricing model can reduce the operating cost of infrequently used IT resources
over the Internet [4].
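The two metrics can be made concrete with a minimal Python sketch that derives RTO and RPO from an outage timeline; the timestamps below are hypothetical, chosen only for illustration.

```python
from datetime import datetime

def rto_rpo(last_backup, outage_start, service_restored):
    """RTO: from the outage start until the service is usable again.
    RPO: from the latest restorable data point back to the outage start."""
    rto = service_restored - outage_start
    rpo = outage_start - last_backup
    return rto, rpo

# Hypothetical outage timeline
last_backup      = datetime(2015, 4, 1, 11, 45)  # last replicated state
outage_start     = datetime(2015, 4, 1, 12, 0)
service_restored = datetime(2015, 4, 1, 12, 30)

rto, rpo = rto_rpo(last_backup, outage_start, service_restored)
print(f"RTO = {rto}, RPO = {rpo}")  # RTO = 0:30:00, RPO = 0:15:00
```

Replication shortens the RPO by moving the last restorable data point closer to the outage; standby capacity shortens the RTO.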
A challenge for HADR in the cloud is to implement system recovery and
business continuity across different cloud platforms or providers. A typical use
case would be a business that has its own private cloud infrastructure for providing
services, but does not want to invest in additional infrastructure that is rarely
used (e.g., for disaster recovery). Using a public cloud for disaster recovery can
help to reduce capital expenditure. In addition, HADR can happen at different
levels in terms of the IT resources involved, like VMs, storage, or databases.
While currently most cross-cloud DR solutions focus on VM and image
management, a HADR solution for relational databases across multiple clouds
is still a significant problem, which we address here.
We present an architectural pattern describing the integration of high avail-
ability and disaster recovery into a database cluster architecture between private
cloud and public cloud environments – addressing a need to look at the cloud
from a software architecture perspective [20]. As a case study, we deployed two
MySQL database clusters, on an OpenStack private cloud as the primary site and
on the Windows Azure public cloud as the backup site, respectively. We used load
balancing to monitor and manage the traffic between the two clouds. The pattern
generalizes our specific experience, implementing synchronous replication inside
of each database cluster to achieve high availability, as well as asynchronous
replication between the clusters to achieve disaster recovery.
An Architectural Pattern for Multi-Cloud HADR 3
Master-slave replication:
– Synchronous: No updates are made at the primary (master) node until all
  secondary (slave) nodes have also received the update.
– Asynchronous: When the primary node receives an update transaction, it is
  executed and committed; the primary then propagates a new copy to the
  secondary nodes.
Multi-master replication:
– Synchronous: The same data may be updated at different nodes at the same
  time; locking techniques are used to avoid concurrent transactions on the
  same data, or atomic broadcast determines the total order when conflicting
  transactions happen.
– Asynchronous: The same data could have been updated at different nodes at
  the same time, but a problem during synchronization might arise, e.g., some
  decision might need to be made on which transaction to keep and which to undo.
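The master-slave variants can be illustrated with a small Python sketch (an in-memory simulation, not MySQL itself): in synchronous mode the commit is only made once every slave has applied the update, while in asynchronous mode the master commits first and propagates later, opening a window in which a slave lags behind.

```python
class MasterSlave:
    """Toy master-slave replication; dicts stand in for database nodes."""

    def __init__(self, n_slaves, synchronous):
        self.master = {}
        self.slaves = [{} for _ in range(n_slaves)]
        self.synchronous = synchronous
        self.pending = []  # updates committed at the master but not yet propagated

    def update(self, key, value):
        if self.synchronous:
            # apply at all slaves before the update is committed
            for s in self.slaves:
                s[key] = value
            self.master[key] = value
        else:
            # commit locally first, propagate later
            self.master[key] = value
            self.pending.append((key, value))

    def propagate(self):
        for key, value in self.pending:
            for s in self.slaves:
                s[key] = value
        self.pending.clear()

sync_db = MasterSlave(2, synchronous=True)
sync_db.update("x", 1)
print(sync_db.slaves[0])   # {'x': 1} -- slaves never lag

async_db = MasterSlave(2, synchronous=False)
async_db.update("x", 1)
print(async_db.slaves[0])  # {} -- committed at the master, not yet at the slave
async_db.propagate()
print(async_db.slaves[0])  # {'x': 1}
```

The empty slave state in the asynchronous case is exactly the potential data loss (RPO) that a disaster during the propagation window would cause.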
A central load balancer can become a bottleneck and a single point of failure of
the system. Thus, virtual IP technology [33] can be used between load balancers
for high availability and fault tolerance. Remus [5] is a project
that provides a virtualization-based HA solution by allowing a group of VMs
to be replicated across different data centers over WAN. SecondSite [23] extends
high availability to disaster tolerance, allowing the backup VM to take over
in a completely transparent manner to implement both fail-over recovery and
seamless fail-back. For HADR in cloud environments, Wood et al. [31] analyze
the economic benefits and deployment challenges of disaster recovery as a cloud
service, exploring how traditional DR functionality can be combined with current
cloud platforms in order to minimize cost, data loss and recovery time in
cloud-based DR services.
3.2 Solution
Structure. In our setup, cf. Fig. 2, two virtual server clusters are used to
provide HADR for the databases. The primary site could host an OpenStack
private cloud, which handles all client requests during normal operation. The
standby site, hosted in the Azure public cloud, provides the database service
when a failure happens at the primary site. In addition, we use one load balancer
to manage the workload between these two sites. Note that OpenStack and Azure
are used for illustration, but are not part of the actual pattern.
Technically, the HADR pattern architecture includes site-to-site VPN and
MySQL cluster replication (synchronous and asynchronous). Here, the site-to-site
VPN is a virtual private network overlaying the Internet to connect the different
cloud infrastructures; it is the networking precondition for the database cluster
replication. The cluster provides shared-nothing clustering and auto-sharding
for the MySQL database system, combining multi-master synchronous replication
and master-slave asynchronous replication techniques.
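The load balancer's role between the two sites reduces to a health-checked failover rule, sketched below in Python; the site names and the boolean health flag are illustrative stand-ins for a real health-check probe.

```python
def route(request, sites):
    """Send the request to the first healthy site, in priority order
    (primary site first, standby site second)."""
    for site in sites:
        if site["healthy"]:
            return f"{site['name']} handled {request}"
    raise RuntimeError("no site available")

primary = {"name": "openstack-primary", "healthy": True}
standby = {"name": "azure-standby", "healthy": True}

print(route("query-1", [primary, standby]))  # openstack-primary handled query-1
primary["healthy"] = False                   # disaster at the primary site
print(route("query-2", [primary, standby]))  # azure-standby handled query-2
```

In the pattern, the same priority ordering also supports fail-back: once the primary site reports healthy again, traffic returns to it without client-side changes.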
3.3 Strategies
– Hot standby. This is a duplicate of the primary site, with full computer
systems as well as near-complete backups of user data. Real-time synchro-
nization between the two sites may be used to completely mirror the data
environment of the primary site via virtual private network (VPN).
– Warm standby. This is a compromise between hot and cold, where the basic
infrastructures have already been established, though on a smaller scale than
the primary production site, e.g., keeping one VM running in the backup site
and synchronizing data in real time between the two sites via VPN.
8 H. Xiong, F. Fowley, C. Pahl
– Cold standby. This is the least expensive type of backup site for an orga-
nization to operate, which is designed to provide the basic infrastructure
needed to run a data center for long-term outages of the primary one. In
cold standby, there is no VM running in the backup site; only VM snapshots
are transferred from the primary site to the backup site on a regular basis.
5 Evaluation
For the evaluation of the HADR pattern, we measure availability (for HA),
RPO and RTO (for DR), cost (a financial constraint) and response time (for
system performance), reflecting the most common business concerns.
The evaluation results vary across the different DR standby strategies.
Selecting an appropriate DR strategy is one of the most important decisions
made during the disaster recovery planning (DRP). It is important to carefully
consider the specific business needs along with the costs and capabilities (such
as RPO and RTO) when identifying the best approach for business continuity.
5.2 Experiment
– Hot standby. Two VMs run in Azure as the backup, VPN runs and connects
with OpenStack, data is asynchronously replicated between two sites and
synchronously replicated in the backup site.
– Warm standby. One VM runs in Azure as backup, VPN runs and connects
with OpenStack, data is asynchronously replicated between two sites.
– Cold standby. No VM runs in Azure; VPN is created, but not running.
5.3 Results
Table 2: Performance and business impact analysis results with three DR strategies
A = E[uptime] / (E[uptime] + E[downtime])    (1)
Here, we assume the total time of a functional unit (uptime + downtime) is
a year (a typical SLA value), which is equal to 365×24×60 minutes, while the
downtime varies over the different DR strategies. The availability results of hot,
warm and cold DR strategies are presented in Table 2.
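Plugging the yearly time budget into Eq. (1), availability for a given yearly downtime is straightforward to compute. The sketch below does this in Python; the per-strategy downtime values are illustrative placeholders, since the measured numbers are in Table 2.

```python
TOTAL_MINUTES = 365 * 24 * 60  # one year, as assumed in the evaluation

def availability(downtime_minutes):
    """Eq. (1): A = uptime / (uptime + downtime), over a one-year window."""
    uptime = TOTAL_MINUTES - downtime_minutes
    return uptime / (uptime + downtime_minutes)

# Hypothetical yearly downtimes per DR strategy (minutes)
for strategy, downtime in {"hot": 5, "warm": 30, "cold": 1440}.items():
    print(f"{strategy}: {availability(downtime):.6f}")
```

Because the denominator is fixed at one year, availability differences between strategies translate directly into downtime differences.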
RTO and RPO. RTO is the period from the disaster happening (the service
becoming unavailable) to the service being available again, which is the same
as the downtime in availability. RPO is the point in time from which data will
be restored, which indicates the period from the latest backup version until the
disaster happens. The results of RTO and RPO with hot, warm and cold DR
strategies are presented in Table 2.
Cost. To give an indicative costing (according to Windows Azure pricing as
of April 2015), we assume the following VM and storage costs for DR resources
(cost summaries for the three DR strategies are also shown in Table 2):
– Virtual machine for computing (small-size VM): 6.70 cents per hour or 49
euros per month
– Networking: virtual network (gateway hours): 2.68 cents per hour or 15.75
euros per month
– Storage: 5 cents per GB per month, or 50 euros per TB per month
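Using these rates, the monthly DR cost per strategy can be tallied as in the Python sketch below. The VM counts mirror the setup in Sect. 5.2; the 100 GB storage volume and the assumption that the cold standby's inactive VPN gateway incurs no gateway-hour charge are our own illustrative assumptions, not figures from Table 2.

```python
VM_PER_MONTH = 49.0        # euros, small-size VM
GATEWAY_PER_MONTH = 15.75  # euros, virtual network gateway hours
STORAGE_PER_GB = 0.05      # euros per GB per month

def monthly_cost(n_vms, gateway_running, storage_gb):
    """Monthly DR cost from the Azure rates quoted above (April 2015)."""
    cost = n_vms * VM_PER_MONTH + storage_gb * STORAGE_PER_GB
    if gateway_running:
        cost += GATEWAY_PER_MONTH
    return cost

storage_gb = 100  # assumed backup data volume
print("hot: ", monthly_cost(2, True, storage_gb))   # 118.75
print("warm:", monthly_cost(1, True, storage_gb))   # 69.75
print("cold:", monthly_cost(0, False, storage_gb))  # 5.0
```

With only a couple of VMs the gap is modest, but the VM term dominates, so the hot/warm difference grows linearly with the number of standby VMs.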
Response time. For web applications, the response time is the time it takes
servers to respond when a user sends a query through a client to the servers.
We used JMeter to simulate the workload for the primary site with different DR
strategies as well as the response time when the backup site takes over. The
response time results of cold, warm and hot DR strategies are shown in Fig. 5.
Here, the top three graphs present the response time (in milliseconds) from
the primary site with three common operations of our WordPress application:
browse home page, login and comment, while the last one shows the response
time (in milliseconds) from the backup site with the same operations. As we have
seen in Figure 4, the average response times of the different DR strategies from
the primary site (see Graphs a, b, c) are similar, which indicates that database
replication has little impact on response time and user experience. However, the
average and maximum response time from the backup site (see Graph d) are
slightly higher than the values from the primary site, which is actually expected
due to the network latency and cross-cloud load balancing.
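The JMeter measurement boils down to timing each request round trip and aggregating the samples. A minimal Python analogue is sketched below; the request function is a placeholder stand-in for a real HTTP call to one of the WordPress operations.

```python
import time

def measure(request_fn, repetitions=100):
    """Return the average and maximum response time (ms) of request_fn."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        request_fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return sum(samples) / len(samples), max(samples)

def browse_home():
    # placeholder work instead of a real HTTP request to the home page
    time.sleep(0.001)

avg_ms, max_ms = measure(browse_home, repetitions=10)
print(f"avg {avg_ms:.1f} ms, max {max_ms:.1f} ms")
```

Run against the backup site instead of the primary, the extra network latency and cross-cloud load balancing show up directly in these averages and maxima, as in Graph d.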
Guidelines. For the experimental evaluation, we only ran 2 VMs as full-size
applications, which makes the differences in recovery time and cost between the
three DR strategies less substantial. However, if we were to run hundreds of
VMs for the service, and each VM costs 49 euros per month, the hot standby
would cost significantly more than the warm standby solution. Considering
RPO and RTO, warm standby has the same RPO as the hot standby solution,
but the RTO of warm standby depends on the capacity (e.g., cloud bursting
and auto scaling) of the cloud platforms, ranging from a few minutes to dozens
of minutes. In fact, we would not recommend using cold standby for emergency
disaster recovery in real business scenarios: although it is quite inexpensive
compared to the other two strategies, it is extremely inefficient regarding
recovery time and data loss. Therefore, warm standby would be an ideal DR
strategy for most organizations in terms of the combination of RPO, RTO and
financial concerns.
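The guidelines above can be condensed into a simple selection rule, sketched in Python; the RTO thresholds are illustrative choices of ours, not prescriptions from the evaluation.

```python
def choose_strategy(max_rto_minutes, budget_sensitive):
    """Pick a DR standby strategy from a required RTO and cost sensitivity.
    Thresholds are illustrative."""
    if max_rto_minutes < 5 and not budget_sensitive:
        return "hot"   # near-zero RTO, highest cost
    if max_rto_minutes <= 60:
        return "warm"  # RTO of minutes to dozens of minutes, same RPO as hot
    return "cold"      # long-term outages only; poor RTO and RPO

print(choose_strategy(2, budget_sensitive=False))   # hot
print(choose_strategy(30, budget_sensitive=True))   # warm
print(choose_strategy(600, budget_sensitive=True))  # cold
```

In practice such a rule would be one input to disaster recovery planning alongside the measured RPO, RTO and cost figures of Table 2.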
6 Conclusions
References
1. Benton, D.: Disaster recovery: A pragmatist’s viewpoint. Disaster Recovery Journal
20(1), 79–81 (2007)
2. Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency control and recovery
in database systems. Addison-Wesley, New York, US (1987)
3. Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., Stal, M.:
Pattern-oriented software architecture (Volume 1): A system of patterns.
John Wiley and Sons (1996)
4. Creeger, M.: Cloud computing: An overview. ACM Queue 7(5), 2 (2009)
5. Cully, B., Lefebvre, G., Meyer, D., Feeley, M., Hutchinson, N., Warfield, A.: Remus:
High availability via asynchronous virtual machine replication. In: Proc USENIX
Symp on Networked Systems Design and Implementation. pp. 161–174 (2008)
6. Fu, S.: Failure-aware resource management for high-availability computing clusters
with distributed virtual machines. Journal of Parallel and Distributed Computing
70(4), 384–393 (2010)
7. Robinson, G., Narin, A., Elleman, C.: Using Amazon Web Services for dis-
aster recovery. http://media.amazonwebservices.com/AWS_Disaster_Recovery.
pdf, [updated October 2014]
8. Gray, J., Siewiorek, D.P.: High-availability computer systems. Computer 24(9),
39–48 (1991)
9. Jamshidi, P., Ahmad, A., Pahl, C.: Cloud migration research: a systematic review.
IEEE Transactions on Cloud Computing 1(2), 142–157 (2013)
10. Kandukuri, B.R., Paturi, V.R., Rakshit, A.: Cloud security issues. In: IEEE
International Conference on Services Computing SCC'09. pp. 517–520 (2009)
11. King, R.P., Halim, N., Garcia-Molina, H., Polyzois, C.A.: Management of a re-
mote backup copy for disaster recovery. ACM Transactions on Database Systems
(TODS) 16(2), 338–368 (1991)
12. Ladin, R., Liskov, B., Shrira, L., Ghemawat, S.: Providing high availability using
lazy replication. ACM Trans on Computer Systems (TOCS) 10(4), 360–391 (1992)
13. Lewis, P.: A high-availability cluster for Linux. Linux Journal 64 (1999)
14. Lumpp, T., Schneider, J., Holtz, J., Mueller, M., Lenz, N., Biazetti, A., Petersen,
D.: From high availability and disaster recovery to business continuity solutions.
IBM Systems Journal 47(4), 605–619 (2008)
15. Microsoft: High availability and disaster recovery for SQL Server in Azure vir-
tual machines. http://msdn.microsoft.com/en-us/library/azure/jj870962.
aspx (2014), [updated June 12, 2014]
16. MySQL: Replication with global transaction identifiers. http://dev.mysql.com/
doc/refman/5.6/en/replication-gtids.html
17. OpenStack Wiki: OpenStack disaster recovery solution. https://wiki.openstack.
org/wiki/DisasterRecovery
18. Openswan: Openswan official website. https://www.openswan.org/
19. Pacitti, E., Özsu, M.T., Coulon, C.: Preventive multi-master replication in a cluster
of autonomous databases. In: Euro-Par 2003 parallel processing, pp. 318–327 (2003)
20. Pahl, C., Jamshidi, P.: Software architecture for the cloud - a roadmap towards
control-theoretic, model-based cloud architecture. In: European Conference on
Software Architecture ECSA’2015 (2015)
21. Pahl, C., Xiong, H.: Migration to PaaS clouds - migration process and architectural
concerns. In: IEEE International Symposium on the Maintenance and Evolution
of Service-Oriented and Cloud-Based Systems (MESOCA’2013). pp. 86–91 (2013)
22. Pahl, C., Xiong, H., Walshe, R.: A comparison of on-premise to cloud migration ap-
proaches. In: Proceedings European Conference Service-Oriented and Cloud Com-
puting ESOCC’13, pp. 212–226 (2013)
23. Rajagopalan, S., Cully, B., O’Connor, R., Warfield, A.: Secondsite: disaster toler-
ance as a service. In: ACM SIGPLAN Notices. vol. 47, pp. 97–108. ACM (2012)
24. Sapate, S., Ramteke, M.: Survey on comparative analysis of database replication
techniques. International Journal of IT, Engineering and Applied Sciences Research
(IJIEASR) 2(3), 72–80 (2013)
25. Schmidt, K.: High availability and disaster recovery. Springer (2006)
26. Severalnines: ClusterControl for MySQL Galera tutorial. http://www.severalnines.
com/clustercontrol-mysql-galera-tutorial
27. VMware: vCenter Site Recovery Manager 5.5. http://www.vmware.com/files/pdf/
products/SRM/VMware_vCenter_Site_Recovery_Manager_5.5.pdf (2014)
28. Wiesmann, M., Pedone, F., Schiper, A., Kemme, B., Alonso, G.: Database repli-
cation techniques: A three parameter classification. In: Proceedings 19th IEEE
Symposium on Reliable Distributed Systems SRDS’2000. pp. 206–215 (2000)
29. Wikipedia: Business continuity planning. http://en.wikipedia.org/wiki/
Business_continuity_planning#Business_impact_analysis_.28BIA.29
30. Wikipedia: Virtual private network. http://en.wikipedia.org/wiki/Virtual_
private_network
31. Wood, T., Cecchet, E., Ramakrishnan, K., Shenoy, P., Van Der Merwe, J.,
Venkataramani, A.: Disaster recovery as a cloud service: Economic benefits & de-
ployment challenges. In: Proceedings 2nd USENIX Conference on Hot Topics in
Cloud Computing. pp. 8–8 (2010)
32. Zerto: Zerto virtual replication. http://www.zerto.com/
33. Zhang, W.: Linux virtual server for scalable network services. In: Ottawa Linux
Symposium (2000)