
Hybrid Web Cluster Whitepaper Version 2.0

Hybrid Web Cluster


Whitepaper: High Availability, Scalable 'Cloud Sites' Deployments with No Single Point of Failure
Version 2.0

Luke Marsden, CTO
Hybrid Logic Ltd.
+1-415-449-1165 (US) / +44-203-384-6649 (UK)
sales@hybrid-cluster.com
http://www.hybrid-cluster.com/

Wednesday 2 November 2011


Table of Contents
Hybrid Web Cluster
Abstract
Keeping Your Websites Online
  Understanding Business Continuity Planning
  Scaling Websites When There Are Spikes In Traffic
  Protect Against User Error with Continuous Data Protection
  Choosing Between Shared and Direct Attached Storage
Technical Details
  Hybrid Web Cluster: A Paradigm Shift in Web Hosting
  More Than Just LAMP Web Hosting
  Infrastructure-as-a-Service Integration
  Analysis of a Typical Web Request
  Keeping you online in a disaster: Practical choices
  Tunable Parameters
Integration Options
  Control Panel
  API
  WHMCS & Parallels APS
Conclusion
  Derive Competitive Advantage with Hybrid Web Cluster

Abstract
Providers of hosting solutions today are faced with a myriad of challenges in transitioning to the cloud. On this journey, many issues remain constant. In this paper we offer solutions to four key problems:

1. Business continuity planning and disaster recovery.
2. Scaling websites when there are spikes in traffic.
3. Protecting against user error with continuous data protection.
4. Choosing between shared and direct-attached storage.

The revolutionary technology in Hybrid Web Cluster solves these problems for you, enabling you to deliver the next-generation cloud web hosting that your customers are demanding.


Keeping Your Websites Online


Hybrid Web Cluster allows previously unseen resilience in the face of hardware, network and system failures, completely eliminating single points of failure.

Understanding Business Continuity Planning


There are two key metrics used by industry to evaluate available Disaster Recovery (DR) solutions. These are called Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Typically one has a primary site and a DR site, where data is replicated from the primary to the DR site at a certain interval:

As you can see from the diagram above, RPO is the amount of data lost in a disaster (such as the failure of a server or data center). This depends on the backup or replication frequency, since the worst case is that the disaster occurs just before the next scheduled replication.

RPO = data loss measured in time

RTO is the amount of time it takes an organization to react to a disaster (whether automatically or manually; typically there will be at least some manual element, such as changing IP addresses) and perform the reconfiguration necessary to recreate the primary site at the DR site. For example, if there is a fire at the primary site, you would need to order new hardware and re-provision your servers from backups. For most web hosting companies, retaining an exact replica of every primary-site server at the DR site is not economically viable. For example, a hosting company recently interviewed described how the purchase of an additional NetApp storage appliance at the DR site was financially infeasible, so if the primary were to fail permanently, the RTO could be measured in weeks.

RTO = downtime

Hybrid Web Cluster makes enterprise standards of RPO and RTO available to all web hosts without the additional cost of expensive shared storage hardware. We guarantee RPO = 5 minutes and RTO = 2 minutes. This can be adjusted according to your requirements, as a trade-off between disk and network I/O and the acceptable amount of data loss in a disaster scenario (see the Tunable Parameters section).

                    Conventional backup cycle    Hybrid Web Cluster
RPO = data loss     24 hours                     5 minutes
RTO = downtime      48 hours                     2 minutes

Compare our RPO and RTO to your current web hosting solution. If you have nightly backups to an offsite storage server, your RPO is 24 hours and your RTO is however long it would take your technicians to provision and reconfigure all the new hardware at your DR site. By automating continuous data protection and failover with Hybrid Web Cluster, you can significantly improve the guarantees you offer to your customers, even in the worst-case scenario. In the context of cloud infrastructure, our solution can cope with the failure of an entire region [1].

[1] http://aws.amazon.com/message/65648/
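To make the comparison concrete, the following sketch computes how much data would be lost for a given disaster time and replication interval. It is an illustration only: the function name `data_loss_seconds` and the simplifying assumption that replication completes instantly at fixed multiples of the interval are made up for this example and do not describe how Hybrid Web Cluster is implemented.

```python
def data_loss_seconds(disaster_time: float, replication_interval: float) -> float:
    """Seconds of writes lost if a disaster strikes at `disaster_time`.

    Assumes (for illustration) that replication completes instantly at
    t = 0, interval, 2 * interval, ... so everything written since the
    last multiple of the interval is lost.
    """
    if replication_interval <= 0:
        raise ValueError("replication_interval must be positive")
    return disaster_time % replication_interval


NIGHTLY_BACKUP = 24 * 60 * 60  # conventional backup cycle, in seconds
FIVE_MINUTES = 5 * 60          # the 5-minute RPO quoted above, in seconds

# A disaster one second before the next scheduled run is the worst case:
# the loss approaches the full interval, which is why RPO tracks the interval.
worst_nightly = data_loss_seconds(NIGHTLY_BACKUP - 1, NIGHTLY_BACKUP)
worst_five_minutes = data_loss_seconds(FIVE_MINUTES - 1, FIVE_MINUTES)
```

Under these assumptions the worst case is just under 24 hours of lost data for nightly backups and just under 5 minutes for a 5-minute replication interval.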

Scaling Websites When There Are Spikes In Traffic


In this section we compare the Hybrid Web Cluster scalability model to both the common approach of installing many websites on a single server and the CloudLinux model, the current industry leader.

In the common shared hosting model, a web hosting company will simply install a lot of websites on a single server without any High Availability (HA) or redundancy, and set up a nightly backup via rsync. In this model, when a website gets very popular, the server hosting it is also busy serving requests for many other websites and becomes overloaded. Typically the server starts to respond very slowly as the required number of I/O operations per second exceeds its capacity. It soon runs out of memory as the web requests stack up, starts swapping to disk, and thrashes itself to death. This results in everyone's websites going offline.

The technique advocated by CloudLinux is to contain the spike of traffic by imposing OS-level restrictions on the site which is experiencing heavy traffic. This is clearly an improvement because the other sites on the server stay online. However, the disadvantage of this approach is that the website which is gaining the traffic is necessarily slowed down or stopped completely. If the server were to try to fully service all incoming requests for that site, it would crash, as above.

This is where the Hybrid Web Cluster model really wins. The moment a big spike in traffic happens is not when your users want to be worrying about migrating to a dedicated server! Rather than strangling the site which is experiencing the spike, Hybrid Web Cluster dynamically live-migrates the other websites on that server to other servers in the cluster, with no downtime for any sites, so your users get the full benefit of automatic scalability. Site Juggler Live Migration delivers three orders of magnitude [2] greater scalability than shared hosting solutions, allowing websites to scale by intelligently and transparently migrating them between hosts.

[2] Assuming just 500 websites per server, you can burst to 2 dedicated servers, or 1,000x scalability.
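The 1,000x figure follows from simple arithmetic, sketched below. The function name `burst_factor` is hypothetical, and the sketch assumes that each website normally receives an equal share of its server, which is a simplification made for this example.

```python
from fractions import Fraction


def burst_factor(sites_per_server: int, burst_servers: int) -> Fraction:
    """Capacity after bursting, relative to a site's normal fair share."""
    # Normal share: one site among `sites_per_server` on a single server.
    normal_share = Fraction(1, sites_per_server)
    # After the other sites are migrated away, the busy site has whole servers.
    return Fraction(burst_servers) / normal_share


# 500 websites per server, bursting to 2 dedicated servers.
factor = burst_factor(sites_per_server=500, burst_servers=2)
```

With 500 sites per server and a burst to 2 dedicated servers, the factor is 2 / (1/500) = 1,000, which is three orders of magnitude.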

Protect Against User Error with Continuous Data Protection


When considering data protection systems, it's important to distinguish between systems that guard against hardware failure and systems that guard against user error. RAID and synchronous replication protect you against the failure of hardware, but if a user accidentally deletes some data, such systems will replicate the deletion to the other device and the data will be permanently lost. A better solution is Continuous Data Protection, or as we call it, our Point-in-Time Restore feature, which takes continual point-in-time snapshots of all the data stored on the system and exposes them to the end user via a friendly web user interface, so that users can undo their mistakes without administrator intervention.

Each circle represents a snapshot the user can roll back to.
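Choosing which snapshot to roll back to can be sketched as follows. This is not the product's implementation: the function name `snapshot_to_restore` and the representation of snapshots as a sorted list of timestamps are assumptions made for illustration.

```python
from bisect import bisect_left
from datetime import datetime
from typing import Optional, Sequence


def snapshot_to_restore(
    snapshots: Sequence[datetime], mistake_at: datetime
) -> Optional[datetime]:
    """Return the newest snapshot taken strictly before `mistake_at`.

    `snapshots` must be sorted oldest-first. Returns None when no
    snapshot predates the mistake.
    """
    # bisect_left gives the first position whose timestamp is >= mistake_at,
    # so the entry just before it is the latest snapshot older than the mistake.
    index = bisect_left(snapshots, mistake_at)
    return snapshots[index - 1] if index > 0 else None


SNAPSHOTS = [
    datetime(2011, 11, 2, 10, 0),
    datetime(2011, 11, 2, 10, 5),
    datetime(2011, 11, 2, 10, 10),
]

# A file deleted at 10:07 can be recovered from the 10:05 snapshot.
restore_point = snapshot_to_restore(SNAPSHOTS, datetime(2011, 11, 2, 10, 7))
```

The more frequent the snapshots, the less work is lost between the chosen restore point and the moment of the mistake.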

Choosing Between Shared and Direct Attached Storage


Our HCFS Data Replication allows you to take advantage of the performance and cost savings of direct-attached storage.

At his presentation at HostingCon, Siena Fath-Azam of Storm on Demand described the dichotomy between shared storage (Storage Area Network, or SAN) and Direct Attached Storage (DAS). A SAN is a storage device, typically from a vendor such as NetApp or EMC, which provides a central location for your servers to keep their data. DAS just means connecting disks directly to your servers. The following table summarises his talk.

Storage Area Network
  Pros:
    Ease of movement of applications between servers (data is never stored on a specific server).
    Reliability, typically because of more expensive hardware.
    Easier to do traditional High Availability: if an application fails on one server, you can start it on another server.
  Cons:
    Performance: network-based storage is always slower than direct-attached storage because the data has to travel further.
    Cost: always more expensive, by a factor of 2-3x.
    Failures are horrifying (everything fails!). Examples: VPS.net, Amazon EBS, MediaTemple.

Direct Attached Storage
  Pros:
    Performance is always better.
    Cost is always lower.
    Easier to customize.
  Cons:
    Difficult to deploy traditional HA, because if a server fails then it had the data stored on it. You need something else.
    It's difficult to move applications between instances.

Hybrid Web Cluster's HCFS data replication is the missing piece of the puzzle, making it trivial to migrate instances between servers with direct-attached storage (there's a button in the Control Panel for it), thereby solving the data management issues normally associated with direct-attached storage. It simultaneously adds fault tolerance to otherwise vulnerable servers without relying on a similarly fallible central system, resulting in a more reliable, higher-performance, and 2-3x less expensive solution. Furthermore, our replication system works across Wide Area Networks such as the Internet, allowing you to migrate websites, databases and mailboxes between data centers, and to fail over even if entire data centers fail.


Technical Details
Hybrid Web Cluster: A Paradigm Shift in Web Hosting
Hybrid Web Cluster represents a fundamental shift in the way you are able to provision and deploy web hosting accounts across globally distributed physical infrastructure. In a nutshell, we keep your lights on in the face of hardware and network failures and automatically scale your websites in the face of quickly-changing and sometimes significant traffic levels. The following key innovations provide our licensees with an unparalleled feature set:

1. Our pure-software HCFS data replication allows web clusters to run across geographically diverse regions, on inexpensive commodity hardware with high-performance directly attached storage, or on public cloud infrastructure. Our replication system provides continuous data protection and automatic disaster recovery even if an entire region fails.

2. Our distributed protocol handler AwesomeProxy provides distributed and highly-available implementations of all the protocols you need: HTTP, HTTPS, FTP, MySQL, POP, IMAP, SMTP & SSH.

3. Our live migration technology controls the HCFS and AwesomeProxy systems in tandem to provide two orders of magnitude of scalability beyond shared hosting via seamless and near-instant migration of websites and databases between servers.

4. Our platform is compatible with every LAMP website and web application. Within each protected cluster instance, we run standard installations of Apache, MySQL, Exim and Dovecot, and applications do not need to be modified to run in this context.

5. We provide a feature-complete, white-label brandable and reseller-compatible Control Panel which can fully replace cPanel or Plesk, which integrates with industry standard domain and SSL certificate providers and billing systems such as WHMCS, and has a complete API to allow you to integrate your existing billing & provisioning systems with your own cluster deployment.

6. By leveraging OS-level multi-tenancy technology we offer customer densities orders of magnitude greater than IaaS-based solutions (up to 2,000 customers per server, rather than 30) while offering revolutionary levels of dynamic scalability (bursting from multi-tenancy to dedicated hardware).

In total, the technology provides never-before-seen resilience in the face of hardware, network and system failures, completely eliminating single points of failure, while allowing you to use your existing investment in commodity hardware to compete with a feature set usually reserved for enterprise-cost and highly complex SAN-based solutions.

More Than Just LAMP Web Hosting


We will shortly be adding support for Python, Ruby, Node.js (via the emerging PaaS standard CloudFoundry) and NoSQL data stores CouchDB, MongoDB, Redis and Memcache.


Analysis of a Typical Web Request


The following diagram gives a high-level overview of the Hybrid Web Cluster system in terms of a typical scenario where a user uploads a new photo to their WordPress blog.

Note that the blue and orange boxes refer to logical, not physical, entities. The only physical hardware required is the cluster nodes themselves. Note, therefore, the absence of expensive specialized hardware: in particular, no hardware load balancer and no centralized shared storage.

AwesomeProxy replaces load balancers and HCFS replaces SANs.

This approach delivers a cost saving of 60-70% compared to classical clusters and cloud infrastructure solutions based on shared storage.

This is what happens when the user uploads a new photo to their blog:

1. The user's browser looks up the website address in DNS and is returned a list of live nodes which are geographically local to the current master for that site. The browser connects to one of them.

2. AwesomeProxy discovers which website is being requested and passes on the request to the correct server.

3. Apache writes the new photo to disk on the current master. Within a few seconds, HCFS detects the write to the filesystem and makes a consistent point-in-time snapshot of the new data. Moments later, the change has been replicated to the slaves for that filesystem: in a typical cluster configuration (see the Tunable Parameters section), the data is replicated to another machine in the same data center and to one machine in a remote data center.

The direct consequences of these two additional layers (the distributed protocol handler above and the replication system below) are that any server or entire data center can fail and the cluster will automatically reconfigure itself so that your websites stay online. The distributed protocol handler ensures that requests are always routed to a server which is online and able to serve your site, and the HCFS replication engine ensures that your data is always safe.
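The routing behaviour described above can be illustrated with a small sketch. It is a deliberately simplified model, not AwesomeProxy's actual logic: the function `route_request`, the dictionaries `MASTERS` and `REPLICAS`, the node names and the rule that the first live slave takes over are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical cluster state: which node is master for each site, and which
# nodes hold replicated copies of its filesystem.
MASTERS: Dict[str, str] = {"blog.example.com": "node-a"}
REPLICAS: Dict[str, List[str]] = {"blog.example.com": ["node-b", "node-c"]}


def route_request(
    host: str,
    masters: Dict[str, str],
    replicas: Dict[str, List[str]],
    live_nodes: Set[str],
) -> str:
    """Pick a live node that holds the data for the requested site."""
    master = masters[host]
    if master in live_nodes:
        return master
    # Master is down: fall back to the first live node holding a replica.
    for node in replicas.get(host, []):
        if node in live_nodes:
            return node
    raise LookupError(f"no live node holds a copy of {host}")


normal = route_request("blog.example.com", MASTERS, REPLICAS, {"node-a", "node-b"})
failover = route_request("blog.example.com", MASTERS, REPLICAS, {"node-b", "node-c"})
```

In this model a request only fails when every node holding a copy of the site is offline, which is why the number of replicas determines how many failures a site can survive.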


Keeping you online in a disaster: Practical choices


When building distributed (cloud) systems you can pick at most two of the following features:

1. Consistency: if a system is consistent, then queries to different nodes for the same data will always result in the same answer.

2. Availability: the system always responds to requests with a valid response.

3. Partition tolerance: if the parts of a distributed system become disconnected from each other, they can continue to operate.

Hybrid Web Cluster chooses Availability and Partition tolerance over Consistency because, given the choice, website owners would rather their website stay online in a disaster scenario. When a cluster becomes partitioned, for example if an undersea cable is cut and the European and US components of a cluster can no longer communicate with each other, the cluster elects new masters for all the sites on both sides of the partition in order to keep the websites online on both sides of the Atlantic. When traffic is re-routed or the undersea cable is repaired, the cluster rejoins and the masters negotiate which version of each website is more valuable, based on how many changes have been made on each side of the partition. This keeps your websites online all the time, everywhere in the world.
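The negotiation after a partition heals might be pictured as in the sketch below. The text only states that the decision is based on how many changes were made on each side; the class `SideOfPartition`, the function `choose_winner`, the node names and the tie-break on master name are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SideOfPartition:
    master: str               # node elected master on this side of the split
    changes_since_split: int  # changes made to the site while partitioned


def choose_winner(a: SideOfPartition, b: SideOfPartition) -> SideOfPartition:
    """Return the side whose version of the site is kept after rejoining."""
    # Prefer the side with more changes. Ties are broken by master name
    # (an assumption) so both sides reach the same answer independently.
    return max(a, b, key=lambda side: (side.changes_since_split, side.master))


europe = SideOfPartition(master="eu-node-1", changes_since_split=12)
us = SideOfPartition(master="us-node-1", changes_since_split=3)
winner = choose_winner(europe, us)
```

Because the rule is deterministic and symmetric, it does not matter which master evaluates it: both arrive at the same winner.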

Tunable Parameters
Hybrid Web Cluster allows customization of the topology and timing values for the replication and failover aspects of the software. The following variables can be adjusted on a per-cluster basis, giving you the benefit of being able to customise the cluster to your specific requirements:
Variable: Check Timeout
Explanation: How many seconds a server may appear to be offline before recovery actions are taken. This plus the length of time to effect an automatic recovery (about a minute) corresponds directly to RTO.
Default: 60 seconds

Variable: Snapshot Quick Timer
Explanation: How many seconds to wait before snapshotting and replicating a filesystem which has been modified once. A lower number results in faster replication and smaller RPO but uses proportionally more disk and network I/O.
Default: 120 seconds

Variable: Snapshot Interval Timer
Explanation: As above, but applying to filesystems which are being constantly modified.
Default: 300 seconds

Variable: Local Redundancy
Explanation: The number of slaves to set up for each filesystem in the data center local to the current master. Determines the failure-tolerance of the local data center in terms of how many servers may fail locally before any data is lost.
Default: 1

Variable: Global Redundancy
Explanation: The number of other remote (relative to the current master) data centers in which to set up slaves for each filesystem. Determines the failure-tolerance in terms of how many data centers may fail before any data is lost.
Default: 1

Variable: Slaves per Remote Locality
Explanation: The number of servers to add as slaves in each remote data center.
Default: 1

Variable: Replication Concurrency
Explanation: How many concurrent replication events to allow. This should be adjusted to match approximately the number of spindles (disks) you have in each cluster node, because disks perform better when operations are serialized.
Default: 2
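The way these parameters combine can be sketched in a few lines. The class `ClusterTunables`, its field names and its three helper methods are hypothetical; in particular, treating the worst-case data loss as the larger of the two snapshot timers ignores transfer time, and the copy count is one reading of the redundancy settings rather than a documented formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterTunables:
    # Defaults taken from the table above; all times are in seconds.
    check_timeout: int = 60
    snapshot_quick_timer: int = 120
    snapshot_interval_timer: int = 300
    local_redundancy: int = 1
    global_redundancy: int = 1
    slaves_per_remote_locality: int = 1
    replication_concurrency: int = 2

    def expected_downtime_seconds(self, recovery_seconds: int = 60) -> int:
        """Detection time plus automatic recovery time (about a minute)."""
        return self.check_timeout + recovery_seconds

    def worst_case_data_loss_seconds(self) -> int:
        """Longest wait before a modified filesystem is snapshotted (assumed)."""
        return max(self.snapshot_quick_timer, self.snapshot_interval_timer)

    def copies_per_filesystem(self) -> int:
        """The master plus local slaves plus slaves in each remote data center."""
        remote = self.global_redundancy * self.slaves_per_remote_locality
        return 1 + self.local_redundancy + remote


defaults = ClusterTunables()
```

With the default values this gives 120 seconds of expected downtime, 300 seconds of worst-case data loss and three copies of each filesystem, in line with the 2-minute RTO and 5-minute RPO quoted earlier.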


Integration Options
Control Panel
We have developed an advanced, easy-to-use, AJAX-based and white-label brandable Control Panel. Here is a brief tour of the Control Panel. It has multiple levels of users: Cluster Administrators, Resellers and Web Hosting Users. This is the dashboard, showing a Cluster Administrator's view:

This shows a user adding a WordPress blog on an external domain:

Adding the blog takes just a few seconds, after which the blog is safely replicated across multiple data centers, immediately ready for automated fail-over and scalability if a server or data center fails or the website gets a spike in traffic.


The Control Panel supports advanced features such as a full custom DNS editor with user-friendly wizards, and also allows complete white-label branding on a per-reseller level, as the following screenshots show. This allows you to completely change the look and feel of the entire Control Panel with just a few clicks. Here we have used a popular cloud infrastructure brand just as an example:

This is just a small taste of what our Control Panel can do. Sign up for a trial to explore it for yourself!

API
In addition to the powerful, flexible and modern Control Panel, we also provide a comprehensive JSON/XML API with over 100 commands, which allows you to perform a complete integration with any billing and provisioning system and add a Cloud Sites option to your existing systems. The API documentation can be found at: http://www.hybrid-cluster.com/api

The API allows you to control every aspect of scalable, redundant website, database and mailbox deployments.

WHMCS & Parallels APS


We are also working on plugins for the above systems. Please contact us if you wish to test them out.


Conclusion
Derive Competitive Advantage with Hybrid Web Cluster
Clearly, the hosting industry is changing. Hybrid Web Cluster gives you the tools your business needs to survive in a highly competitive landscape and, with 60-70% cost savings over SAN-based products and 40-60x better densities than virtualisation-only solutions, can deliver the margins you need to thrive. Start your 30-day free trial today and derive competitive advantage.

To obtain access to a Hybrid Web Cluster installation, contact:

Luke Marsden, CTO
Hybrid Logic Ltd.
+44-203-384-6649 (UK) / +1-415-449-1165 (US)
sales@hybrid-cluster.com
http://www.hybrid-cluster.com/
