VMware High Availability and Fault Tolerance FAQ: must-haves for IT shops that demand 100% uptime
First introduced in VMware Infrastructure 3, VMware High Availability (HA) provides failover protection against hardware and software malfunctions. If HA detects a failure, it automatically restarts a virtual machine (VM) without the need for manual intervention. HA was a major step for highly available technologies, but some users require continuous uptime for virtual machines. With the release of vSphere, VMware introduced the Fault Tolerance (FT) utility, which provides uninterrupted availability by eliminating the need for VMs to restart. Together, HA and FT supply the feature sets and capabilities for virtual environments to run at nearly 100% uptime. But implementing and maintaining an HA- and FT-enabled infrastructure is challenging. For users that are unsure about the risk-reward proposition of these high-availability tools, the answers to these frequently asked questions should provide some guidance.

How does VMware High Availability work?
VMware High Availability is a vSphere component that's configured in the vSphere Client. It eliminates the need for dedicated standby equipment by performing the following tasks:
• monitoring physical servers and VMs;
• detecting server failures; and
• migrating and restarting VMs from offline hosts.
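If you manage clusters from the command line rather than the vSphere Client, HA can also be switched on with PowerCLI. The following is a minimal sketch; the vCenter address and cluster name are hypothetical placeholders, and it assumes the cluster already exists:

# Connect to vCenter (server name is a placeholder)
Connect-VIServer -Server vcenter.example.com

# Turn on VMware HA for an existing cluster
Get-Cluster -Name "Prod-Cluster" | Set-Cluster -HAEnabled:$true -Confirm:$false

# Verify the change
Get-Cluster -Name "Prod-Cluster" | Select-Object Name, HAEnabled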
In the update to vSphere 4.1, High Availability was upgraded. The new, 64-bit vCenter increased HA's theoretical maximums to 320 VMs per host and 3,000 VMs per cluster. HA also sports a health-status menu, where users can view alarms and alerts in a centralized location.

How does VMware Fault Tolerance work?
VMware Fault Tolerance is based on VMware Workstation's record/replay technology. FT copies a functional VM to another ESX host, then transfers CPU and virtual device inputs from the primary VM (record) to the secondary VM (replay) through a network interface card. This process ensures that both VMs are synchronized and that, during a failure, the secondary VM can take over. Additionally, the hypervisor suppresses the secondary VM until the primary VM malfunctions.

What are the benefits and drawbacks to VMware High Availability?
HA provides business continuity and disaster recovery protection for businesses. With HA enabled, virtualization administrators have one less task to directly oversee. The technology runs 24/7, so it can restart failed, mission-critical servers and virtual machines during off-peak hours, without the assistance of IT staff. VMware HA minimizes downtime but cannot prevent it completely. During reboots, VMs will remain offline. Also, HA is not available for every vSphere licensing tier, and vCenter is required.

Why should I use VMware Fault Tolerance?
FT provides continuous VM availability through its record/replay functionality. Unlike other high-availability technologies, FT is operating system-agnostic and doesn't require licenses for each server. This utility is easy to deploy in vCenter: Right-click on a VM, and select Fault Tolerance. FT creates the secondary virtual machine, and then you can begin the synchronization process.

When should I use VMware Fault Tolerance?
It's probably not feasible to protect every VM with FT because of host limitations and hardware compatibility issues. But there are ideal use cases for VMware Fault Tolerance:
• VMs protected by HA. Generally, these VMs are the most critical. If you have enough computing and storage resources, this option benefits users. If a server fails, they don't have to wait for VMs to restart.
• On-demand coverage. Certain applications and VMs become mission critical during particular times of the month or year (e.g., payroll and accounting VMs). Because FT is simple to initiate, you can provide failover protection during high-volume periods and deactivate it during nonpertinent times to conserve resources.
• Servers with a single point of failure. High-availability options for certain application servers can be costly and complex. FT provides a simple alternative.
• Expensive clustering. Sometimes it's hard to justify clustering solutions for branch offices or medium-sized databases. In these scenarios, FT is a cost-effective option.
What are the requirements for VMware Fault Tolerance?
FT has specific and restrictive hardware requirements. It doesn't run on every vSphere-equipped server because of special CPU requirements, for example. FT calls for Intel 31xx or more recent processors and AMD 13xx processors or greater. It also doesn't support multiprocessor VMs (only single-CPU VMs) or vSphere's hot-add RAM and hot-plug CPU features. Because of FT's particular hardware requirements, VMware published the SiteSurvey utility, which checks an infrastructure's compatibility with FT. SiteSurvey connects to vCenter Server and generates a host compatibility report. Clicking the report links provides detailed information and charts with FT's requirements.

Understanding VMware Fault Tolerance benefits and requirements
VMware Fault Tolerance (FT) is a new high-availability feature in VMware's vSphere 4. With Fault Tolerance, your virtual guest machine runs on a primary ESX host server while the memory of that virtual machine (VM) is mirrored (using vLockstep) over to a secondary (ghost) ESX host server. If the primary ESX Server fails, the virtual machine immediately resumes operation on the secondary ESX Server with zero downtime to the VM, application or end user using that VM.

The benefits of VMware Fault Tolerance
Unlike traditional high-availability (HA) technologies, VMware's Fault Tolerance works regardless of the operating system, and you aren't charged for every server that uses this HA feature. FT is based on VMware Workstation's Record/Replay features that can play back what happened in a VM. Fault Tolerance is not used for load balancing - it is strictly for high availability of a VM in the event that an ESX Server goes down. Unlike VMware HA clusters (VMHA), with FT, the VM that was protected doesn't have to be rebooted. Thus, with FT, unlike VMHA, there is no downtime for end users. The best thing about VMware FT is that to enable it, all you need to do is to right-click on the primary VM and enable Fault Tolerance. At that point, FT takes over, creating the secondary VM that will protect the primary VM if the ESX server running the primary VM has a failure.
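The same action can be scripted. Older PowerCLI releases have no dedicated FT cmdlet, so the sketch below drops to the raw vSphere API view of the VM and calls CreateSecondaryVM_Task; the vCenter address and VM name are hypothetical placeholders, and the call assumes the host and VM already meet the FT prerequisites described later in this guide:

Connect-VIServer -Server vcenter.example.com        # placeholder vCenter name
$vm = Get-VM -Name "payroll-01"                     # placeholder VM name

# Get the raw vSphere API object for the VM and ask vCenter to create the
# secondary VM; passing $null lets vCenter choose the secondary host
$vmView = Get-View -Id $vm.Id
$vmView.CreateSecondaryVM_Task($null)

# Refresh and report the resulting Fault Tolerance state
$vmView.UpdateViewData()
$vmView.Runtime.FaultToleranceState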
The hardware requirements for VMware Fault Tolerance
VMware FT requires Intel 31xx, 33xx, 52xx, 54xx, 55xx, 74xx or AMD 13xx, 23xx, 83xx series of processors (or greater). Moreover, you can't run FT on just any vSphere-compatible server because FT uses special features of the CPU. Today, FT is also supported only on VMs that have a single CPU (no multiprocessor VMs can use FT). High availability, in general, is an excellent part of any large-scale disaster recovery strategy. With Fault Tolerance you have the added protection for smaller-scale disasters such as the loss of a single ESX host server running 50+ virtual machines. VMware's Fault Tolerance (FT) is offered in 3 of the 6 vSphere Editions - Advanced, Enterprise and Enterprise Plus. Additionally, VMware's vCenter is required. For more information, read all about Fault Tolerance at VMware's Fault Tolerance (FT) product page or watch a video demonstrating the power of FT over at VMware's YouTube FT Demo.

VMware vSphere products and features
VMware Inc. offers several flavors of its vSphere server virtualization infrastructure suite. In this guide to vSphere editions and features, we break down the differences between vSphere versions, review vSphere installation concerns, discuss vSphere pricing and, finally, discuss vSphere's role in cloud computing. Whether server consolidation, server management or moving to the cloud is your goal, this vSphere guide helps determine which vSphere edition makes the most sense for your virtualization environment and business needs.

What is vSphere?
VSphere is VMware's flagship server and infrastructure virtualization platform. VMware vSphere is sold as a single product but, in fact, is a suite of products made up of multiple pieces. VSphere is sold in various editions and the features offered are based on the edition you purchase. VMware's vSphere editions differ from Microsoft Windows Server virtualization in that vSphere has many editions of the same product but each has a different feature set. On the other hand, Microsoft's Hyper-V is an add-on to Windows Server 2008, which offers various editions but the virtualization features don't vary. To add to the confusion, vSphere was previously named the VMware Infrastructure suite.

What are vSphere 4 features?
Each edition of vSphere includes ESX or ESXi as its virtualization hypervisor, which is loaded on each physical server. Just as critical are the vSphere 4 features that make ESX and ESXi so powerful. For example, some of the new vSphere 4 features are the following:
• VMotion: moves running virtual machines (VMs) from one server to another
• Storage VMotion (SVMotion): moves the virtual disks of a running virtual machine from one data store to another
• VMware High Availability (VMware HA or VMHA): reboots running virtual machines on another ESX server if an ESX host goes down
• Fault Tolerance (FT): moves a running VM from one ESX server to another if the server it is running on goes down
• Distributed Power Management (DPM): when demand is low on the virtual infrastructure, running VMs are consolidated onto fewer servers and the unused servers are powered off
• Consolidated Backup (or VCB): a VMware backup tool that enables you to back up running virtual machines using your existing backup application
• vShield Zones: creates a virtual firewall within your virtual infrastructure
Which vSphere editions are available? In addition to the ESXi free edition of vSphere, vSphere is sold in six editions:
• vSphere Essentials
• vSphere Essentials Plus
• vSphere Standard
• vSphere Advanced
• vSphere Enterprise
• vSphere Enterprise Plus
Is installing vSphere difficult?
VSphere installation is quite simple. You can install ESXi on just about any server, but consult the VMware Hardware Compatibility List (HCL) to determine which server hardware is compatible with your chosen virtualization platform and edition. The typical installation steps are to install (1) ESX or ESXi on a server and (2) vCenter on a Windows member server. For more information, check out the vSphere Installation Guide.

How much does vSphere cost?
VSphere pricing varies based on the edition. The vSphere Essentials edition starts at $995 for three hosts (maximum two processors per host and six cores per processor) and one year of software subscription. The vSphere Essentials bundle includes the ESX or ESXi hypervisor, Virtual Machine File System, four-way SMP (or symmetric multiprocessing), VMware Consolidated Backup, Update Manager, and vCenter Server Essentials.

Which new features are available in vSphere 4?
VSphere 4 offers various feature improvements over the previous version, VMware Infrastructure. They are the following:
• Server Linked Mode
• Host Profiles
• vApps
• centralized licensing
• Thin provisioning
• vCenter Orchestrator
• Fault Tolerance
• Data Recovery
• vShield Zones
• Virtual Machine hot add
• vNetwork Distributed Switch (vDS)
• performance and scalability enhancements
Read What's New in VMware vSphere 4 to learn more.

VSphere and cloud computing
VMware says that vSphere 4 is an operating system designed for building an internal or external cloud infrastructure. This "cloud OS," says VMware, is a foundation for treating virtualized workloads like an internal cloud by aggregating and managing large pools of infrastructure -- processors, storage and networking -- as a flexible and dynamic operating environment. To learn more about the cloud computing-enabling features of VMware vSphere, read VMware's Cloud-OS and VMware's Cloud Computing pages.
Understanding VMware ESX features
VMware's ESX Server is the industry's most widely deployed virtualization option for data center consolidation and server consolidation. In this guide to VMware ESX, we outline the core features of VMware ESX and explain the key differences between ESX and ESXi. The VMware ESX hypervisor is installed on a physical server and enables that server to run multiple guest virtual machines (VMs) inside it. Each of these VMs then runs a guest OS, such as Windows Server 2008 or Linux. In this guide to VMware's ESX Server, we explain the core features of VMware ESX Server. VMware virtualization's hypervisor was recently upgraded from ESX Server 3.5 to ESX Server 4.0. In doing so, VMware's enterprise virtualization platform was renamed from VMware Infrastructure to vSphere. Now, if you want to use ESX Server, you don't buy ESX Server; you buy the vSphere suite (for more on vSphere pricing check out our guide), the exception being the free edition of VMware ESXi. Today, vSphere offers two flavors of ESX: ESX (full or classic) or the thin ESX, "ESXi" (installable or embedded). ESXi Server offers the same performance as ESX Server (full or classic editions), but it lacks the service console for command line management. ESXi comes as a free edition and as an installable or embedded commercial edition included in vSphere. Unlike VMware Server or Workstation, VMware ESX is installed directly on a physical server. Because ESX Server is the operating system and it has a limited set of drivers, you must ensure that your hardware is compatible with ESX. To do so, check the VMware Hardware Compatibility Guide before installing ESX Server.

Core features of VMware ESX:
• Memory overcommitment and deduplication, which allow for higher consolidation ratios;
• Huge scalability with up to 64 logical processing cores, 256 virtual CPUs, and 1 TB of RAM per host, enabling higher consolidation ratios;
• Memory ballooning;
• Network traffic shaping;
• Network interface card teaming (or NIC teaming);
• VMware vSphere Client, which allows for easy graphical user interface management;
• VMware Power command-line interface (or PowerCLI) and vCLI; and
• many more.
See VMware ESX Server features for the complete list of features.
Understanding VMware ESXi features
VMware ESXi Server is VMware's free bare-metal hypervisor. It is the same hypervisor used in VMware's enterprise-class virtualization solution, vSphere, but in a scaled-down, free version. Unlike VMware Server or Workstation, you install VMware ESXi directly on your hardware, and because it doesn't have an underlying host operating system, it offers much more performance than those options.
ESXi comes in three editions: free ESXi, ESXi Installable, and ESXi Embedded. The latter two editions are part of VMware's vSphere suite and are enterprise solutions that support the same vSphere features as ESX full or classic. ESXi Server comes with the vSphere client for virtualization management. This makes the configuration and management of ESXi virtual machines quick and easy. ESXi Server offers the same performance as ESX Server (full or classic) but lacks the service console that is used for command line management. To solve this issue, VMware offers the free vSphere Management Assistant (vMA) and vSphere command-line interface (vCLI), which are both compatible with the free edition of VMware ESXi Server.

Core ESXi features
ESXi's core features are the following:
• Memory overcommitment and deduplication, which allow for higher consolidation ratios;
• Huge scalability with up to 64 logical processing cores, 256 virtual CPUs, and 1 TB of RAM per host, enabling higher consolidation ratios;
• Memory ballooning;
• Network traffic shaping;
• Network interface card teaming (or NIC teaming);
• VMware vSphere Client, which allows for easy graphical user interface management;
• VMware Power command-line interface (or PowerCLI) and vCLI; and
• many more.
See VMware ESXi Server features for the complete list of ESXi Server features.
Understanding how VMware VMotion live migration works VMware Inc.'s VMotion feature performs live migration, which enables the movement of running virtual machines (VMs) from one physical ESX host to another. VMotion is one of the most sought-after and powerful capabilities in VMware vSphere because it enables so many critical infrastructure features while creating zero downtime for end users. How VMotion works and capabilities
• Load balancing: With VMware's Distributed Resource Scheduler (DRS), you can load-balance virtual infrastructure resources between ESX servers. If one of the hosts nears overutilization, guest VMs can be migrated from one ESX Server to another while in use by end users (using VMotion).
• Distributed Power Management (DPM): DPM moves running VMs from one ESX server to another using VMotion so that ESX Servers can be powered off when the load on the virtual infrastructure is low. This can tremendously reduce a company's power and cooling costs.
• Maintenance of ESX servers: With VMotion, VMware administrators can move running virtual machines from one ESX server to another to perform hardware or software maintenance, apply software patches and so on. In fact, VMware's Update Manager (VUM) uses VMotion to apply patches.
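For administrators who script these moves, the same live migration can be triggered from PowerCLI with a single cmdlet. This is a minimal sketch with hypothetical VM and host names, and it assumes the source and destination hosts already meet the usual VMotion requirements (shared storage, compatible CPUs, a VMotion-enabled VMkernel port):

# Live-migrate (VMotion) a running VM to another ESX host in the same cluster
$vm = Get-VM -Name "web-01"                           # placeholder VM name
$destination = Get-VMHost -Name "esx02.example.com"   # placeholder host name
Move-VM -VM $vm -Destination $destination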
VMware's VMotion is offered in three of the six vSphere editions: Advanced, Enterprise and Enterprise Plus. Additionally, VMware's vCenter is required.
Until recently, VMware had no competitors that offered anything similar to VMotion. With the release of Microsoft's Windows 2008 R2, Microsoft now has a similar feature to offer called "live migration." As of September 2009, long-distance VMotion is officially supported by VMware, but a tremendous amount of hardware and bandwidth is required to make this work. See Joep Piscaer's post on long-distance VMotion (TA3105) to learn the requirements and details. For more information on VMotion go to VMware's VMotion product page.

How VMware Storage VMotion works -- and its benefits
Part of VMware Inc.'s vStorage offering, Storage VMotion can move the virtual disk files of a running virtual machine from one data store to another, with zero downtime for the virtual machine or end users. These data stores can be on different storage area networks (SANs) or on the local storage of an ESX Server. VMware's virtual machine storage migration does this while maintaining full data integrity. Additionally, the guest VMs can be running any guest operating system, and the virtual machine file system (VMFS) that is moved can be stored on any type of supported ESX Server storage (such as Fibre Channel SAN, iSCSI SAN, Network File System, or local ESX server storage).

The uses and benefits of Storage VMotion
Storage VMotion is useful in several scenarios, including the following:
• Moving virtual disks off of a SAN volume that has reached capacity
• Moving virtual disks from one SAN to another to balance load, to replace a SAN or to take a SAN down for maintenance
• Moving virtual disks from local ESX Server storage to a SAN
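In PowerCLI terms, any of these moves comes down to pointing Move-VM at a new datastore rather than a new host. The sketch below uses hypothetical VM and datastore names and assumes the target datastore is visible to the host the VM is running on:

# Storage VMotion: relocate a running VM's virtual disks to another datastore
$vm = Get-VM -Name "db-01"                            # placeholder VM name
$targetDatastore = Get-Datastore -Name "SAN-LUN-02"   # placeholder datastore name
Move-VM -VM $vm -Datastore $targetDatastore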
For example, say that you have two or three older and slower SAN/NAS systems storing virtual machine disks. You purchase a new SAN and want to perform data consolidation. Storage VMotion would be used to move those virtual disks to the new SAN with no downtime for the virtual machines, applications, or end users. I have used Storage VMotion when creating a VMware failover cluster such as a Distributed Resource Scheduler (DRS)/VMware HA cluster. In this case, I had to migrate virtual machine disks stored on the local ESX Server storage to a central SAN to create the cluster. We performed this move in the middle of the day with zero downtime for the end users. Two of the six vSphere editions -- Enterprise and Enterprise Plus -- offer VMware's Storage VMotion. Additionally, VMware's vCenter is required. For more information on SVMotion go to VMware's Storage VMotion product page.

The benefits -- and limitations -- of VMware High Availability
VMware's High Availability (VMHA) provides high availability to any guest operating system at a potentially much lower cost than other HA options (as you don't have to pay per virtual machine [VM] or per server; VMHA is included in the price of vSphere).

The pros of VMHA. VMware HA works by moving all virtual machines that run on a failed ESX server to another ESX server in the same failover cluster and restarting those VMs. A new feature of VMHA enables server virtualization administrators to monitor VMs that run in a virtualized environment for guest operating system
failures. If a guest OS fails, VMHA can even restart that VM on another ESX server in the same high-availability cluster. Not all disaster recovery needs involve large, data center-wide scenarios. In many cases, the physical hardware on a server can fail, causing critical company applications to be unavailable. VMHA can provide business continuity by quickly restarting VMs that run on a failed ESX server before you are able to get up out of bed and find out why you got a text message about an application outage.

The limits of VMHA. Even though VMHA makes all your virtual machines high-availability servers, those guest VMs still have to be rebooted after they have moved. And VMHA works on the assumption that either the guest OS hung or the physical ESX Server crashed. If you want a technology that doesn't require a VM in the server cluster to be rebooted, check out VMware Fault Tolerance. VMware's High Availability (VMHA) is offered in 5 of the 6 vSphere Editions - Essentials Plus, Standard, Advanced, Enterprise and Enterprise Plus. Additionally, VMware's vCenter is required. For more information, read VMware's High Availability product page.

VMware Consolidated Backup: When should you use it?
What is VMware Consolidated Backup? VMware Consolidated Backup offers the ability to take fast and efficient backups of virtual machines (VMs). But VMware Consolidated Backup (VCB) isn't what you might envision as the typical backup application. Instead, VCB is a handful of Windows-executable programs for data protection that allow you to back up VMware guest machines from the command line. Most VCB users use VCB in conjunction with their existing enterprise data backup application. This is done with a series of "integration scripts" that VMware offers for most of the popular backup programs. The latest version of VCB is version 1.5 Update 1, which is compatible with the VMware Infrastructure (VI) suite (or ESX 3.5) and vSphere 4, the latest version of VMware virtualization. VCB is installed on a Windows server that acts as a backup proxy server to the VMware infrastructure. Backup software then goes to the Windows server running VCB to back up virtual machines that are located in the Virtual Machine File System (VMFS). VCB provides access to the VMware VMFS file system, which typical backup applications would not understand.

Using VCB for disaster recovery
If you need to perform disaster recovery to ensure business continuity, the virtual machines that were backed up with VCB can run on any restored ESX host server. For those who are interested in vSphere 4, it is not advisable to use VCB. Instead, you should use one of these two options:
1. VMware Data Recovery (or vDR): a graphical application to back up and restore virtual machines
2. VMware vStorage APIs for Data Protection (or VADP): a set of interfaces used by third-party backup vendors to enable backup applications to efficiently interface with vSphere
VMware vStorage APIs for Data Protection is included with all commercial editions of the vSphere suite, starting with Essentials. VMware Data Recovery is included with the Essentials Plus, Advanced, Enterprise, and Enterprise Plus vSphere Editions. For more information, read about Consolidated Backup at VMware's Consolidated Backup product page and, for more technical information, read the VMware Consolidated Backup Documentation page. Finally, for specific information on how to back up virtual machines through a variety of methods, read the VMware virtual machine backup guide.

How VMware vShield Zones aids VM security, monitoring
What is vShield Zones? With VMware's vShield Zones, you can monitor network traffic in a virtualized environment and ensure regulatory compliance by segmenting users and sensitive data on a network. VMware's vShield Zones is VMware's virtualization security offering that is based on technology that VMware bought from Blue Lane Technologies in October 2008. Just as many companies need to create demilitarized zones (DMZs) for their physical servers, vShield Zones lets them create security zones for virtual servers. An added benefit of vShield Zones is that companies can receive a tremendous amount of network traffic flow-monitoring, analysis, and reporting.

How vShield Zones works
VShield performs Stateful Packet Inspection (SPI) and tracks dynamic connections such as FTP. Better yet, vShield understands your virtual infrastructure and works with vCenter to track traffic between virtual machines and even VMotion-associated traffic. With vShield, you can create various levels of administrative permission and assign them to your hierarchy of network and VMware administrators. VShield Zones works by having a single virtual machine (VM) act as the vShield management station. vShield monitoring VMs are then deployed to monitor each virtual switch (vSwitch) on each ESX Server. To do so, each vSwitch to be monitored is actually cloned, and the vShield monitor is connected between the cloned vSwitch (with the VMs) and the original vSwitch. The data collected is sent back to the vShield management station, where it is logged and analyzed. You can create policies on the management station to police your virtual infrastructure network traffic and report on both allowed and denied network traffic. VMware's vShield Zones is offered in three of the six vSphere Editions: Advanced, Enterprise and Enterprise Plus. Additionally, VMware's vCenter is required. And check out this writeup on vShield Zones on SearchVMware.com. For more information, read all about vShield Zones at VMware's vShield Zones product page and the vShield Zones 1.0 FAQ.

Using VMware Distributed Power Management
VMware's Distributed Power Management (DPM) is one feature that demonstrates the potential of VMotion -- not just visually, but in hard cash. DPM, which is actually a piece of VMware's Distributed Resource Scheduler (DRS), and DRS don't get the exposure that they deserve. If every company understood how well DRS and DPM
reduce power consumption, these virtualization management tools would be considered must-haves for server infrastructure. The biggest selling points combine energy efficiency and resource allocation. Envision this scenario: During the night, when demand is low, all but two of your data center servers power off automatically but leave all data center services still available. If end users try to use an application at 3 a.m., they don't notice any difference in performance. In the morning, when they come to work, powered-off servers are magically powered on and virtual machines moved onto these servers. Now consider how much electricity your company could save on powering those servers and on cooling a data center every day of the year, all without your having to do a thing. DPM pays for itself in reduced power consumption right away and keeps doing so over time.

Using VMware Distributed Power Management
To use DPM, first create a Distributed Resource Scheduler (DRS)-enabled cluster where VM load is distributed across the physical servers in that cluster. Using VMotion, VMs are moved from server to server automatically to balance the load. The functionality of DRS and DPM can also easily be joined with VMware's High Availability solution, VMHA, which improves resource allocation and energy efficiency and prevents server downtime. DPM takes DRS a step further by performing server consolidation of VMs onto fewer hosts (using VMotion) when the load on the virtual infrastructure is low, and then powering off ESX Servers that aren't needed. When the load increases, the network adapter feature - Wake on LAN (WOL) - awakens the powered-off servers and the VMs are moved back. Thus, in addition to the typical vSphere requirements, a network interface card with WOL is also required. VMware's Distributed Power Management is offered in two of the six vSphere Editions: Enterprise and Enterprise Plus. Additionally, VMware's vCenter is required. For more information, read all about DPM at VMware's DRS / Distributed Power Management (DPM) product page or watch a video demonstrating the power of DPM at the YouTube DPM Video page.

When should you enable VMware Fault Tolerance?
By now you know what VMware Fault Tolerance (FT) is -- high availability for any supported vSphere guest OS between two vSphere servers. More specifically, FT is protection from a failed server (unplanned downtime). VMware offers other levels of protection for other levels of failures (server component, storage, data, and site), but FT protects your critical applications from a server failure. It sounds nice, but how can this help you in the real world, in a real data center? In my tip "Understanding VMware Fault Tolerance (FT) benefits and requirements," I covered how FT can help you and what the hardware requirements are. In this tip, I will discuss the use cases for FT and the impact its stringent hardware requirements make on your future server hardware purchases.

Fault Tolerance (FT) use cases
While it is easy to enable VMware Fault Tolerance (FT), there is no need to use it on every virtual machine. While it doesn't cost you extra to enable it, trying to do so will quickly deplete the resources on your ESX hosts. This is because when you enable FT, you are creating a second virtual machine with the same amount of RAM in use and some additional CPU overhead, effectively doubling the number of VMs in your inventory. So, if you don't enable it across the board, where do you enable it? Here are the most common uses for VMware Fault Tolerance:
Anywhere you use VMware High Availability
If you are already using VMware High Availability (VMHA) on any virtual machines, enabling FT may be a good move for those VMs. While FT will take up double the resources for those VMs and may require a vSphere upgrade, by using FT those protected virtual machines won't have to reboot and the end users' applications will never miss a beat when and if an ESX server fails.

Fault Tolerance on demand
At many organizations there are certain virtual machines that become critical at a specific time of the month or year. For example, the chief financial officer's virtual desktops at year end or the payroll virtual server at the end of each month. FT is easy enough to enable that you can just right-click on these virtual machines when they become critical and turn it on.
While those two uses are valid, the next three use cases are more plausible, in my opinion.
Any application server with a single point of failure (SPOF)
Many times application servers start off small and grow larger and much more critical than expected. For example, a Blackberry server might have just started out for a handful of users and, today, fulfills the Blackberry needs for the entire company. Usually there aren't HA options offered for these types of servers or, if there are, they are complex or cost a great deal as compared to the value that the application brings. These applications have a single point of failure (SPOF), and FT is the ideal option to provide HA for these application servers.

Any application that requires clustering but where clustering is cost prohibitive
Certainly if you have a large group of enterprise email servers providing email for 5,000 users, you can justify an HA option costing $10,000. However, for small and medium-sized database servers, Exchange messaging servers supporting fewer than 1,000 users, or critical remote branch office servers like point of sale, FT can provide HA at a much lower cost when compared to dedicated HA clustering options.

Use FT in the future
Today, FT syncs the memory between two ESX hosts sharing the same storage (where the virtual machine disk is located). In the future, the HP/Lefthand P4000 storage appliance (and offerings from other companies) will be able to offer FT between sites that have high-speed connectivity to one another.
Hardware purchase implications of Fault Tolerance
As I discussed in my previous article, you can't run FT on just any vSphere-compatible server because FT uses special features of the CPU. That's why VMware FT requires Intel 31xx, 33xx, 52xx, 54xx, 55xx, 74xx or AMD 13xx, 23xx, 83xx series of processors (or greater). While multi-processor support will likely be available in the future, today FT is only supported on VMs that have a single CPU. Besides the requirements for particular CPUs, for a VM to be protected using fault tolerance, both the primary server and secondary server CPUs must be the same or from the same category. (These CPU compatibility categories are broken down in VMware KB article 1008027, "Processors and guest operating systems that support VMware Fault Tolerance.") That means that you could have two servers that are compatible with FT, but if one is a Xeon 7400 series and the other is a Xeon 5500 series, you still can't use FT. Even if you go to buy brand new servers today, not all CPUs will be compatible: one could be FT-compatible and another not. So how do you avoid hardware incompatibilities with Fault Tolerance?
• Don't assume that a vSphere-compatible server is also FT-compatible. Check the FT compatibility list and make sure that your servers are not only on it, but that they are also from the same CPU category.
• Run the free VMware SiteSurvey tool on your existing servers to see which are compatible and if they are from the same CPU category.
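As a quick first pass before (or alongside) SiteSurvey, you can list the CPU model each host reports and eyeball whether they fall in the same family. This PowerCLI sketch reads the hardware summary exposed by the vSphere API; it is informational only and does not replace the FT compatibility check itself:

# List the CPU model and clock speed reported by each ESX/ESXi host
Get-VMHost | ForEach-Object {
    $hw = (Get-View -Id $_.Id).Summary.Hardware
    "{0}: {1} ({2} MHz)" -f $_.Name, $hw.CpuModel, $hw.CpuMhz
}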
Eric Siebert covers the SiteSurvey tool and everything that it checks for in his Master's Guide to VMware Fault Tolerance. In the past, admins tried to keep their desktops and servers from the same respective families in order to make support and OS images compatible. Now, admins have a new reason to try to make sure that new servers are from the same CPU category as existing servers - Fault Tolerance. Unfortunately, as new server models come out, it isn't always possible to get the older model with the same category of CPU. Admins have already run into this problem with VMware VMotion. While Enhanced VMotion Compatibility (EVC) partially solves this, there isn't such a solution for FT. One more point about compatibility: FT can't protect VMs running every operating system. VMware KB article 1008027 also points out that specific operating systems may not be supported with certain CPU architectures, or you may be required to reboot your OS when enabling FT. In summary, for new hardware purchases, the solution is simple: make sure servers are vSphere compatible, FT compatible, and from the same CPU category as your existing servers (if possible).

VSphere Fault Tolerance requirements and FT logging
Fault Tolerance (FT)
Fault Tolerance (FT) was introduced as a new feature in vSphere to provide something that was missing in VI3: continuous availability for a VM in case of a host failure. HA was introduced in VI3 to protect against host failures, but it caused the VM to be down for a short period of time while it was restarted on another host. FT takes that to the next level and guarantees that the VM stays operational during a host failure by keeping a secondary copy of it running on another host server; in case of a host failure, that VM then becomes the primary VM and a new secondary is created on another functional host. The primary VM and secondary VM stay in sync with each other by using a technology called Record/Replay that was first introduced with VMware Workstation. Record/Replay works by recording the computer execution on a VM and saving it into a logfile; it can then take that recorded information and replay it on another VM to have a copy that is a duplicate of the original VM. The technology behind the Record/Replay functionality is built into certain models of Intel and AMD processors, and is called vLockstep by VMware. This technology required Intel and AMD to make changes to both the performance counter architecture and virtualization hardware assists (Intel VT and AMD-V) that are inside their physical processors. Because of this, only newer processors support the FT feature; this includes the third-generation AMD Opteron based on the AMD Barcelona, Budapest, and Shanghai processor families, and Intel Xeon processors based on the Core 2 and Core i7 micro architectures and their successors. VMware has published a Knowledge Base article (http://kb.vmware.com/kb/1008027) that provides more details on this.

How FT works
FT works by creating a secondary VM on another ESX host that shares the same virtual disk file as the primary VM, and then transferring the CPU and virtual device inputs from the primary VM (record) to the secondary VM (replay) via an FT logging NIC so that it is in sync with the primary and ready to take over in case of a failure. Although both the primary and secondary VMs receive the same inputs, only the primary VM produces output such as disk writes and network transmits.
The secondary VM's output is suppressed by the hypervisor and is not on the network until it becomes a primary VM, so essentially both VMs function as a single VM. It's important to note that not everything that happens on the primary VM is copied to the secondary; certain actions and instructions are not relevant to the secondary VM, and to record everything would take up a huge amount of disk space and processing power. Instead, only nondeterministic events which include inputs to the VM (disk reads,
received network traffic, keystrokes, mouse clicks, etc.) and certain CPU events (RDTSC, interrupts, etc.) are recorded. Inputs are then fed to the secondary VM at the same execution point so that it is in exactly the same state as the primary VM. The information from the primary VM is copied to the secondary VM using a special logging network that is configured on each host server. It is highly recommended that you use a dedicated gigabit or higher NIC for the FT logging traffic; using slower-speed NICs is not recommended. You could use a shared NIC for FT logging for small or dev/test environments and for testing the feature. The information that is sent over the FT logging network between the two hosts can be very intensive depending on the operation of the VM. VMware has a formula that you can use to determine the FT logging bandwidth requirements:

VMware FT logging bandwidth (Mbps) = (Avg disk reads (MB/s) × 8 + Avg network input (Mbps)) × 1.2 [20% headroom]

To get the VM statistics needed for this formula you must use the performance metrics that are supplied in the vSphere Client. The 20% headroom is to allow for CPU events that also need to be transmitted and are not included in the formula. Note that disk or network writes are not used by FT, as these do not factor into the state of the VM. As you can see, disk reads will typically take up the most bandwidth, and if you have a VM that does a lot of disk reading, you can reduce the amount of disk read traffic across the FT logging network by adding a special VM parameter, replay.logReadData = checksum, to the VMX file of the VM; this will cause the secondary VM to read data directly from the shared disk instead of having it transmitted over the FT logging network. For more information on this, see the Knowledge Base article at http://kb.vmware.com/kb/1011965. It is important to note that if you experience an OS failure on the primary VM, such as a Windows BSOD, the secondary VM will also experience the failure, as it is an identical copy of the primary. However, the HA VM monitor feature will detect this, and will restart the primary VM and then respawn a new secondary VM. Also note that FT does not protect against a storage failure; since the VMs on both hosts use the same storage and virtual disk file, it is a single point of failure. Therefore, it's important to have as much redundancy as possible, such as dual storage adapters in your host servers attached to separate switches (multipathing), to prevent this. If a path to the SAN fails on the primary host, the FT feature will detect this and switch over to the secondary VM, but this is not a desirable situation. Furthermore, if there was a complete SAN failure or problem with the LUN that the VM was on, the FT feature would not protect against this. Because of the high overhead and limitations of FT, you will want to use it sparingly. FT could be used in some cases to replace existing Microsoft Cluster Server (MSCS) implementations, but it's important to note what FT does not do, which is to protect against application failure on a VM; it only protects against a host failure. If protection for application failure is something you need, a solution such as MSCS would be better for you. FT is only meant to keep a VM running if there is a problem with the underlying host hardware. If you want to protect against an operating system failure, the VMware HA feature can provide this also, as it can detect unresponsive VMs and restart them on the same host server.
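To make the arithmetic concrete, the sketch below pulls recent average disk-read and network-receive rates for a VM with PowerCLI and plugs them into the formula above. The counter names and their KBps units are assumptions based on the standard vCenter performance counters, and the VM name is a placeholder; verify both in your own environment:

$vm = Get-VM -Name "db-01"    # placeholder VM name

# Recent realtime averages; vCenter reports these counters in KBps.
# The empty Instance filter keeps only the VM-wide aggregate values.
$diskKBps = (Get-Stat -Entity $vm -Stat "disk.read.average" -Realtime -MaxSamples 60 |
             Where-Object { $_.Instance -eq "" } |
             Measure-Object -Property Value -Average).Average
$netKBps  = (Get-Stat -Entity $vm -Stat "net.received.average" -Realtime -MaxSamples 60 |
             Where-Object { $_.Instance -eq "" } |
             Measure-Object -Property Value -Average).Average

$diskMBps = $diskKBps / 1024        # KBps -> MB/s
$netMbps  = $netKBps * 8 / 1024     # KBps -> Mbps

# FT logging bandwidth = (avg disk reads MB/s x 8 + avg network input Mbps) x 1.2
$ftMbps = ($diskMBps * 8 + $netMbps) * 1.2
"Estimated FT logging bandwidth: {0:N1} Mbps" -f $ftMbps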
You can use FT and HA together to provide maximum protection; if both the primary and secondary hosts failed at the same time, HA would restart the VM on another operable host and respawn a new secondary VM. Configuring FT Although FT is a great feature, it does have many requirements and limitations that you should be aware of. Perhaps the biggest is that it currently only supports single vCPU VMs, which is unfortunate, as many big enterprise applications that would benefit from FT usually need multiple vCPUs (e.g., vSMP). But don't let this discourage you from running FT, as you may find that some applications will run just fine with one vCPU on some of the newer, faster processors that are available. VMware has mentioned that support for vSMP will come in a future release. Trying to keep a single vCPU in lockstep between hosts is no easy task, and VMware needs more time to develop methods to try to keep multiple vCPUs in lockstep between hosts.
Here are the requirements for the host.
• The vLockstep technology used by FT requires the physical processor extensions added to the latest processors from Intel and AMD. In order to run FT, a host must have an FT-capable processor, and both hosts running an FT VM pair must be in the same processor family.
• CPU clock speeds between the two hosts must be within 400MHz of each other to ensure that the hosts can stay in sync.
• All hosts must be running the same build of ESX or ESXi and be licensed for FT, which is only included in the Advanced, Enterprise, and Enterprise Plus editions of vSphere.
• Hosts used together as an FT cluster must share storage for the protected VMs (FC, iSCSI, or NAS).
• Hosts must be in an HA-enabled cluster.
• Network and storage redundancy is recommended to improve reliability; use NIC teaming and storage multipathing for maximum reliability.
• Each host must have a dedicated NIC for FT logging and one for VMotion, with speeds of at least 1Gbps. Each NIC must also be on the same network.
• Host certificate checking must be enabled in vCenter Server (configured in vCenter Server Settings → SSL Settings).
Here are the requirements for the VMs.
• The VMs must be single-processor (no vSMPs).
• All VM disks must be "thick" (fully allocated) and not "thin." If a VM has a thin disk, it will be converted to thick when FT is enabled.
• There can be no nonreplayable devices (USB devices, serial/parallel ports, sound cards, a physical CD-ROM, a physical floppy drive, physical RDMs) on the VM.
• Most guest OSs are supported, with the following exceptions that apply only to hosts with third-generation AMD Opteron processors (i.e., Barcelona, Budapest, Shanghai): Windows XP (32-bit), Windows 2000, and Solaris 10 (32-bit). See VMware Knowledge Base article 1008027 (http://kb.vmware.com/kb/1008027) for more details.
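A quick PowerCLI pass like the one below can flag VMs that already satisfy the single-vCPU requirement and show which disks are thin-provisioned (and would therefore be converted to thick when FT is turned on). The property names assume PowerCLI 4.x, so treat this as a rough starting point rather than a complete prerequisite check:

# List powered-on, single-vCPU VMs and count their thin-provisioned disks
Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" -and $_.NumCpu -eq 1 } |
    ForEach-Object {
        $thin = @(Get-HardDisk -VM $_ | Where-Object { $_.StorageFormat -eq "Thin" })
        "{0}: {1} thin disk(s) would be converted to thick if FT were enabled" -f $_.Name, $thin.Count
    }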
In addition to these requirements, there are also many limitations when using FT, and they are as follows.
• Snapshots must be removed before FT can be enabled on a VM. In addition, it is not possible to take snapshots of VMs on which FT is enabled.
• N_Port ID Virtualization (NPIV) is not supported with FT. To use FT with a VM you must disable the NPIV configuration.
• Paravirtualized adapters are not supported with FT.
• Physical RDM is not supported with FT. You may only use virtual RDMs.
• FT is not supported with VMs that have CD-ROM or floppy virtual devices connected to a physical or remote device. To use FT with a VM with this issue, remove the CD-ROM or floppy virtual device or reconfigure the backing with an ISO installed on shared storage.
• The hot-plug feature is automatically disabled for fault tolerant VMs. To hot-plug devices (when either adding or removing them), you must momentarily turn off FT, perform the hot plug, and then turn FT back on.
• EPT/RVI is automatically disabled for VMs with FT turned on.
• IPv6 is not supported; you must use IPv4 addresses with FT.
• You can only use FT on a vCenter Server running as a VM if it is running with a single vCPU.
• VMotion is supported on FT-enabled VMs, but you cannot VMotion both the primary and secondary VMs at the same time. SVMotion is not supported on FT-enabled VMs.
• In vSphere 4.0, FT was compatible with DRS, but the automation level was disabled for FT-enabled VMs. Starting in vSphere 4.1, you can use FT with DRS when the EVC feature is enabled. DRS will perform initial placement on FT-enabled VMs and also will include them in the cluster's load-balancing
calculations. If EVC in the cluster is disabled, the FT-enabled VMs are given a DRS automation level of "disabled". When a primary VM is powered on, its secondary VM is automatically placed, and neither VM is moved for load-balancing purposes.

You might be wondering whether you meet the many requirements to use FT in your own environment. Fortunately, VMware has made this easy for you to determine by providing a utility called SiteSurvey (www.vmware.com/download/shared_utilities.html) that will look at your infrastructure and see if it is capable of running FT. It is available as either a Windows or a Linux download, and once you install and run it, you will be prompted to connect to a vCenter Server. Once it connects to the vCenter Server, you can choose from your available clusters to generate a SiteSurvey report that shows whether your hosts support FT and if the hosts and VMs meet the individual prerequisites to use the feature. You can also click on links in the report that will give you detailed information about all the prerequisites along with compatible CPU charts. These links go to VMware's website and display the help document for the SiteSurvey utility, which is full of great information about the prerequisites for FT. In vSphere 4.1, you can also click the blue caption icon next to the Host Configured for FT field on the Host Summary tab to see a list of FT requirements that the host does not meet. If you do this in vSphere 4.0, it shows general requirements that are not specific to the host. Another method for checking to see if your hosts meet the FT requirements is to use the vCenter Server Profile Compliance tool. To check using this method, just select your cluster in the left pane of the vSphere Client, and then in the right pane select the Profile Compliance tab. Click the Check Compliance Now link and it will check your hosts for compliance, including FT.

Before you enable FT, be aware of one important limitation: VMware currently recommends that you do not use FT in a cluster that consists of a mix of ESX and ESXi hosts. This is because ESX hosts might become incompatible with ESXi hosts for FT purposes after they are patched, even when patched to the same level. This is a result of the patching process and will be resolved in a future release so that compatible ESX and ESXi versions are able to interoperate with FT even though patch numbers do not match exactly. Until this is resolved, you will need to take this into consideration if you plan to use FT, and make sure you adjust your clusters that will have FT-enabled VMs so that they consist of only ESX or ESXi hosts and not both. See VMware Knowledge Base article 1013637 (http://kb.vmware.com/kb/1013637) for more information on this.

Implementing FT is fairly simple and straightforward once you meet the requirements for using it. The first step is to configure the networking needed for FT on the host servers. You must configure two separate vSwitches on each host: one for VMotion and one for FT logging. Each vSwitch must have at least one 1Gbps NIC, but at least two are recommended for redundancy. The VMotion and FT logging NICs must be on different network subnets. You can do this by creating a VMkernel interface on each vSwitch, and selecting "Use this port group for VMotion" on one of them and "Use this port group for Fault Tolerance logging" on the other.
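The same port group roles can be toggled from PowerCLI on existing VMkernel interfaces. The sketch below assumes a host named esx01.example.com with VMkernel ports vmk1 and vmk2 already created on the appropriate vSwitches (all names are placeholders), and that the -VMotionEnabled and -FaultToleranceLoggingEnabled parameters are available in your PowerCLI build:

$esx = Get-VMHost -Name "esx01.example.com"     # placeholder host name

# Tag one VMkernel interface for VMotion...
Get-VMHostNetworkAdapter -VMHost $esx -VMKernel -Name "vmk1" |
    Set-VMHostNetworkAdapter -VMotionEnabled:$true -Confirm:$false

# ...and a second one, on a different subnet, for FT logging
Get-VMHostNetworkAdapter -VMHost $esx -VMKernel -Name "vmk2" |
    Set-VMHostNetworkAdapter -FaultToleranceLoggingEnabled:$true -Confirm:$false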
You can confirm that the networking is configured by selecting the Summary tab for the host; the VMotion Enabled and Fault Tolerance Enabled fields should both say Yes. Once the networking is configured, you can enable FT on a VM by right-clicking on it and choosing the Fault Tolerance item, and then Turn On Fault Tolerance. Once enabled, a secondary VM will be created on another host; at that point, you will see a new Fault Tolerance section on the Summary tab of the VM that will display information including the FT status, secondary VM location (host), CPU and memory in use by the secondary VM, secondary VM lag time (how far behind it is from the primary, in seconds), and bandwidth in use for FT logging. Once you have enabled FT, alarms are available that you can use to check for specific conditions such as FT state, latency, secondary VM status, and more. FT considerations Here is some additional information that will help you understand and implement FT.
VMware spent a lot of time working with Intel and AMD to refine their physical processors so that VMware could implement its vLockstep technology, which replicates nondeterministic transactions between the processors by reproducing their CPU instructions. All data is synchronized, so there is no loss
of data or transactions between the two systems. In the event of a hardware failure, you may have an IP packet retransmitted, but there is no interruption in service or data loss, as the secondary VM can always reproduce execution of the primary VM up to its last output. FT does not use a specific CPU feature, but requires specific CPU families to function. vLockstep is more of a software solution that relies on some of the underlying functionality of the processors. The software level records the CPU instructions at the VM level and relies on the processor to do so; it has to be very accurate in terms of timing, and VMware needed the processors to be modified by Intel and AMD to ensure complete accuracy. The SiteSurvey utility simply looks for certain CPU models and families, but not specific CPU features, to determine whether a CPU is compatible with FT. In the future, VMware may update its CPU ID utility to also report whether a CPU is FT-capable. In the case of split-brain scenarios (i.e., loss of network connectivity between hosts), the secondary VM may try to become the primary, resulting in two primary VMs running at the same time. This is prevented by using a lock on a special FT file; once a failure is detected, both VMs will try to rename this file, and if the secondary succeeds it becomes the primary and spawns a new secondary. If the secondary fails because the primary is still running and already has the file locked, the secondary VM is killed and a new secondary is spawned on another host. There is no limit to the number of FT-enabled hosts in a cluster, but you cannot have FT-enabled VMs span clusters. A future release may support FT-enabled VMs spanning clusters. There is an API for FT that provides the ability to script certain actions, such as disabling/enabling FT using PowerShell. There is a limit of four FT-enabled VMs per host (not per cluster); this is not a hard limit, but is recommended for optimal performance. The current version of FT is designed to be used between hosts in the same datacenter, and is not designed to work over WAN links between datacenters due to latency issues and failover complications between sites. Future versions may be engineered to allow for FT usage between external datacenters. Be aware that the secondary VM can slow down the primary VM if it is not getting enough CPU resources to keep up. This is noticeable by a lag time of several seconds or more. To resolve this, try setting a CPU reservation on the primary VM which will also be applied to the secondary VM and will ensure that both VMs will run at the same CPU speed. If the secondary VM slows down to the point that it is severely impacting the performance of the primary VM, FT between the two will cease and a new secondary will be created on another host. Patching hosts can be tricky when using the FT feature because of the requirement that the hosts have the same build level, but it is doable, and you can choose between two methods to accomplish this. The simplest method is to temporarily disable FT on any VMs that are using it, update all the hosts in the cluster to the same build level, and then reenable FT on the VMs. This method requires FT to be disabled for a longer period of time; a workaround if you have four or more hosts in your cluster is to VMotion your FT-enabled VMs so that they are all on half of your ESX hosts. 
Then update the hosts without the FT VMs so that they are the same build levels; once that is complete, disable FT on the VMs, VMotion the primary VMs to one of the updated hosts, reenable FT, and a new secondary will be spawned on one of the updated hosts that has the same build level. Once all the FT VMs are moved and reenabled, update the remaining hosts so that they are the same build level and then VMotion the VMs around so that they are balanced among all the hosts. FT can be enabled and disabled easily at any time; often this is necessary when you need to do something that is not supported when using FT, such as an SVMotion, snapshot, or hot-add of hardware to the VM. In addition, if there are specific time periods when VM availability is critical, such as when a monthly process is running, you can enable it for that time frame to ensure that it stays up while the process is running, and disable it afterward. When FT is enabled, any memory limits on the primary VM will be removed and a memory reservation will be set equal to the amount of RAM assigned to the VM. You will be unable to change memory limits, shares, or reservations on the primary VM while FT is enabled.
For more information on FT, check out VMware's Availability Guide that is included as part of the vSphere documentation (http://vmware.com/pdf/vsphere4/r40_u1/vsp_40_u1_availability.pdf).

Summary
In this chapter, we covered some of the more popular advanced features in vSphere. There is a lot to learn about these features, so make sure you read through the documentation and get as much hands-on experience with them as you can before implementing them in a production environment. VMware's Knowledge Base has a great many articles specifically about these features, so make sure you look there for any gotchas or compatibility issues as well as tips for troubleshooting problems.

Eric Siebert is a 25-year IT veteran whose primary focus is VMware virtualization and Windows server administration. He is one of the 300 vExperts named by VMware Inc. for 2009. He is the author of the book VI3 Implementation and Administration and a frequent TechTarget contributor. In addition, he maintains vSphereland.com, a VMware information site.

High-availability guidelines and VMware HA best practices
VMware High Availability (HA) is a utility that eliminates the need for dedicated standby hardware and software in a virtualized environment. VMware HA is often used to improve reliability, decrease downtime in virtual environments and improve disaster recovery/business continuity. This chapter excerpt from VCP4 Exam Cram: VMware Certified Professional, 2nd Edition by Elias Khnaser explores VMware HA best practices. Read the excerpt below, and then download the entire chapter on backup and high availability. VMware High Availability deals primarily with ESX/ESXi host failure and what happens to the virtual machines (VMs) that are running on this host. HA can also monitor and restart a VM by checking whether the VMware Tools are still running. When an ESX/ESXi host fails for any reason, all the running VMs also fail. VMware HA ensures that the VMs from the failed host are capable of being restarted on other ESX/ESXi hosts. Many people mistakenly confuse VMware HA with fault tolerance. VMware HA is not fault tolerant in that if a host fails, the VMs on it also fail. HA deals only with restarting those VMs on other ESX/ESXi hosts with enough resources. Fault tolerance, on the other hand, provides uninterruptible access to resources in the event of a host failure. VMware HA maintains a communication channel with all the other ESX/ESXi hosts that are members of the same cluster by using a heartbeat that it sends out every 1 second in vSphere 4.0 or every 10 seconds in vSphere 4.1 by default. When an ESX server misses a heartbeat, the other hosts wait 15 seconds for the other host to respond again. After 15 seconds, the cluster initiates the restart of the VMs on the failing ESX/ESXi host on the remaining ESX/ESXi hosts in the cluster. VMware HA also constantly monitors the ESX/ESXi hosts that are members of the cluster and ensures that resources are always available to satisfy requirements in the event of a host failure.

Virtual Machine Failure Monitoring
Virtual Machine Failure Monitoring is a technology that is disabled by default. Its function is to monitor virtual machines, which it queries every 20 seconds via a heartbeat, using the VMware Tools that are installed inside the VM. When a VM misses a heartbeat, VMware HA deems the VM failed and attempts to reset it. Think of Virtual Machine Failure Monitoring as a sort of High Availability for VMs. Virtual Machine Failure Monitoring can detect whether a virtual machine was manually powered off, suspended, or migrated, in which case it does not attempt to restart it.

VMware HA configuration prerequisites

HA requires the following configuration prerequisites before it can function properly (a quick name-resolution check sketch follows the list):
• vCenter: Because VMware HA is an enterprise-class feature, it requires vCenter before it can be enabled.
• DNS resolution: All ESX/ESXi hosts that are members of the HA cluster must be able to resolve one another using DNS.
• Access to shared storage: All hosts in the HA cluster must have access and visibility to the same shared storage; otherwise, they would have no access to the VMs.
• Access to same network: All ESX/ESXi hosts must have the same networks configured on all hosts so that when a VM is restarted on any host, it again has access to the correct network.
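Because failed name resolution is one of the most common reasons HA configuration breaks, a quick scripted check before enabling HA can save time. The short Python sketch below simply forward-resolves a list of host names from the machine it runs on; the host names are examples only, and it does not replace checking resolution from each ESX/ESXi host itself.

import socket

# Example host names -- replace with the members of your HA cluster.
cluster_hosts = ["esx01.example.com", "esx02.example.com", "esx03.example.com"]

def check_dns(hosts):
    """Forward-resolve each host name and report any failures."""
    all_ok = True
    for name in hosts:
        try:
            print(f"{name} -> {socket.gethostbyname(name)}")
        except socket.gaierror as err:
            print(f"{name} -> RESOLUTION FAILED ({err})")
            all_ok = False
    return all_ok

if __name__ == "__main__":
    if not check_dns(cluster_hosts):
        print("Fix DNS (or /etc/hosts) before enabling VMware HA.")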
Service Console redundancy

Recommended practice dictates that the Service Console (SC) have redundancy. VMware HA complains and issues a warning if it detects that the Service Console is configured on a vSwitch with only one vmnic. As Figure 1 below shows, you can configure Service Console redundancy in one of two ways:
• Create two Service Console port groups, each on a different vSwitch.
• Assign two physical network interface cards (NICs) in the form of a NIC team to the Service Console vSwitch.
Figure 1: Service Console Redundancy

In both cases, you need to configure the entire IP stack with IP address, subnet mask, and gateway. The Service Console vSwitches are used for heartbeats and state synchronization and use the following ports (a simple probe sketch follows the list):
• Incoming TCP port 8042
• Incoming UDP port 8045
• Outgoing TCP port 2050
• Outgoing UDP port 2250
• Incoming TCP ports 8042–8045
• Incoming UDP ports 8042–8045
• Outgoing TCP ports 2050–2250
• Outgoing UDP ports 2050–2250
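If a firewall sits between Service Console networks, it can help to capture these openings as data and sanity-check them. The Python sketch below is only illustrative: it probes the TCP ports with a plain connect (UDP cannot be verified this way), the Service Console address is a placeholder, and the HA agent only listens on these ports once HA has actually been configured on the host.

import socket

# Heartbeat and state-synchronization ports required by HA, as listed above.
HA_PORTS = {
    "incoming_tcp": list(range(8042, 8046)),
    "incoming_udp": list(range(8042, 8046)),
    "outgoing_tcp": list(range(2050, 2251)),
    "outgoing_udp": list(range(2050, 2251)),
}

def probe_tcp(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    sc_address = "esx01.example.com"  # placeholder Service Console address
    for port in HA_PORTS["incoming_tcp"]:
        state = "open" if probe_tcp(sc_address, port) else "closed/filtered"
        print(f"TCP {port} on {sc_address}: {state}")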
Failure to configure SC redundancy results in a warning message when you enable HA. So, to avoid this warning and to adhere to best practice, configure the SC to be redundant.

Host failover capacity planning
When configuring HA, you have to manually configure the maximum host failure tolerance. This is a task that you should thoughtfully consider during the hardware sizing and planning phase of your deployment; it assumes that you have built your ESX/ESXi hosts with enough resources to run more VMs than planned so that they can accommodate HA failovers. For example, in Figure 2 below, notice that the HA cluster has four ESX hosts and that all four of these hosts have enough capacity to run at least three more VMs. Because they are all already running three VMs, this cluster can afford the loss of two ESX/ESXi hosts: the remaining two hosts can power on the six failed VMs with no problem.
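Before relying on a capacity assumption like the one in the example above, it helps to work through the arithmetic. The Python sketch below is a deliberately simplified model that assumes every host has the same capacity and every VM the same footprint (real HA admission control uses slot sizes derived from reservations); the numbers mirror the four-host example.

def host_failures_tolerated(num_hosts, vms_per_host, capacity_per_host):
    """Simplified model: how many host failures can the cluster absorb?

    num_hosts         -- hosts in the cluster
    vms_per_host      -- VMs currently running on each host
    capacity_per_host -- total VMs each host could run
    """
    total_vms = num_hosts * vms_per_host
    tolerated = 0
    for failed in range(1, num_hosts):
        surviving_capacity = (num_hosts - failed) * capacity_per_host
        if surviving_capacity >= total_vms:
            tolerated = failed
        else:
            break
    return tolerated

# The example above: four hosts, each running three VMs, each able to run six VMs in total.
print(host_failures_tolerated(num_hosts=4, vms_per_host=3, capacity_per_host=6))  # -> 2

# A 25% "percentage of cluster resources" reservation in a four-host cluster
# sets aside the equivalent of one full host: 0.25 * 4 hosts = 1 host.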
Figure 2: HA capacity planning

During the configuration phase of the HA cluster, you are presented with a screen similar to that shown in Figure 3 below that prompts you to define the following clusterwide configurations:
• Host Monitoring Status:
  o Enable Host Monitoring: This setting enables you to control whether the HA cluster should monitor the hosts for a heartbeat. This is the cluster's way of determining whether a host is still active. In some cases, when you are running maintenance tasks on ESX/ESXi hosts, it might be desirable to disable this option to avoid isolating a host.
• Admission Control:
  o Enable: Do not power on VMs that violate availability constraints: Selecting this option indicates that if no resources are available to satisfy a VM, it should not be powered on.
  o Disable: Power on VMs that violate availability constraints: Selecting this option indicates that you should power on a VM even if you have to overcommit resources.
• Admission Control Policy:
  o Host failures cluster tolerates: This setting enables you to configure how many host failures you want to tolerate. The allowed settings are 1 through 4.
  o Percentage of cluster resources reserved as failover spare capacity: Selecting this option indicates that you are reserving a percentage of the total cluster resources as spare capacity for failover. In a four-host cluster, a 25% reservation indicates that you are setting aside a full host for failover. If you want to set aside less, you can choose 10% of the cluster resources instead.
  o Specify a failover host: Selecting this option indicates that you are selecting a particular host as the failover host in the cluster. This might be the case if you have a spare host or have a particular host that has significantly more compute and memory resources available.
Figure 3: HA clusterwide policies

Host isolation

A network phenomenon known as split-brain occurs when an ESX/ESXi host stops receiving heartbeats from the rest of the cluster. The heartbeat is sent every second in vSphere 4.0 or every 10 seconds in vSphere 4.1; if a response is not received, the cluster assumes the ESX/ESXi host has failed. When this occurs, the ESX/ESXi host has lost its network connectivity on its management interface. The host might still be up and running, and the VMs might not even be affected, considering they might be using a different network interface that has not been affected. However, vSphere needs to take action when this happens because it believes a host has failed. For that reason, the host isolation response was created.

Host isolation response is HA's way of dealing with an ESX/ESXi host that has lost its network connection. You can control what happens to VMs in the event of a host isolation. To get to the VM Isolation Response screen, right-click the cluster in question and click Edit Settings. You can then click Virtual Machine Options under the VMware HA banner in the left pane. You can control options clusterwide by setting the host isolation response option accordingly; this is applied to all the VMs on the affected host. That being said, you can always override the cluster settings by defining a different response at the VM level. As shown in Figure 4 below, your Isolation Response options are as follows (a small sketch of the override behavior follows Figure 4):
• Leave Powered On: As the label implies, this setting means that in the event of host isolation, the VM remains powered on.
• Power Off: This setting defines that in the event of an isolation, the VM is powered off. This is a hard power off.
• Shut down: This setting defines that in the event of an isolation, the VM is shut down gracefully using VMware Tools. If this task is not successfully completed within five minutes, a power off is immediately executed. If VMware Tools is not installed, a power off is executed instead.
• Use Cluster Setting: This setting forwards the task to the clusterwide setting defined previously.
Figure 4: VM-specific isolation policy
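The relationship between the clusterwide setting and the per-VM override is easy to model. The Python sketch below only illustrates the lookup order described above (it is not how the HA agent itself is implemented), and the VM names and settings are hypothetical.

# Illustrative model of how a per-VM isolation response overrides the cluster default.
CLUSTER_ISOLATION_RESPONSE = "Leave Powered On"   # clusterwide default

vm_overrides = {                                   # hypothetical per-VM settings
    "sql01": "Shut down",
    "web01": "Power Off",
    "test01": "Use Cluster Setting",
}

def isolation_response(vm_name):
    """Return the action HA applies to this VM if its host becomes isolated."""
    setting = vm_overrides.get(vm_name, "Use Cluster Setting")
    return CLUSTER_ISOLATION_RESPONSE if setting == "Use Cluster Setting" else setting

for vm in ("sql01", "web01", "test01", "dc01"):
    print(f"{vm}: {isolation_response(vm)}")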
An isolation event does not necessarily mean that the host is down. Because the VMs might be configured with different physical NICs and connected to different networks, they might continue to function properly; you therefore have to consider this when setting the isolation response. When a host is isolated, this simply means that its Service Console cannot communicate with the rest of the ESX/ESXi hosts in the cluster.

Virtual machine recovery priority

Should your HA cluster not be able to accommodate all the VMs in the event of a failure, you have the ability to prioritize VMs. The priorities dictate which VMs are restarted first and which VMs are less important in the event of an emergency. These options are configured on the same screen as the Isolation Response covered in the preceding section. You can configure clusterwide settings that are applied to all VMs on the affected host, or you can override the cluster settings at the VM level. You can set a VM's restart priority to one of the following (a small ordering sketch follows the list):
• High: VMs with a high priority are restarted first.
• Medium: This is the default setting.
• Low: VMs with a low priority are restarted last.
• Use Cluster Setting: VMs are restarted based on the setting defined at the cluster level.
• Disabled: The VM does not power on.
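To see how these settings translate into a restart order, here is a small illustrative sketch (a model only, not the actual HA restart algorithm): High VMs are restarted before Medium and Low, Disabled VMs are skipped, and "Use Cluster Setting" falls back to the cluster default. The VM names are hypothetical.

CLUSTER_RESTART_PRIORITY = "Medium"                 # clusterwide default
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}

vm_priorities = {                                   # hypothetical per-VM settings
    "dc01": "High",        # domain controller: restart first
    "sql01": "High",
    "app01": "Use Cluster Setting",
    "print01": "Disabled", # can stay down in an emergency
    "test01": "Low",
}

def restart_order(priorities):
    """Return the VMs HA would restart, ordered by effective priority."""
    effective = {}
    for vm, priority in priorities.items():
        if priority == "Use Cluster Setting":
            priority = CLUSTER_RESTART_PRIORITY
        if priority != "Disabled":
            effective[vm] = priority
    return sorted(effective, key=lambda vm: PRIORITY_ORDER[effective[vm]])

print(restart_order(vm_priorities))  # ['dc01', 'sql01', 'app01', 'test01']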
The priority should be set based on the importance of the VMs. In other words, you might want to restart domain controllers and not restart print servers. The higher-priority virtual machines are restarted first. VMs that can tolerate remaining powered off in the event of an emergency should be configured to remain powered off to conserve resources.

MSCS clustering

The main purpose of a cluster is to ensure that critical systems remain online at any cost and at all times. Similar to physical machines that can be clustered, virtual machines can also be clustered with ESX using three different scenarios:
• Cluster-in-a-box: In this scenario, all the VMs that are part of the cluster reside on the same ESX/ESXi host. As you might have guessed, this immediately creates a single point of failure: the ESX/ESXi host. As far as shared storage is concerned, you can use virtual disks as shared storage in this scenario, or you can use Raw Device Mapping (RDM) in virtual compatibility mode.
• Cluster-across-boxes: In this scenario, the cluster nodes (VMs that are members of the cluster) reside on multiple ESX/ESXi hosts, whereby each of the nodes that make up the cluster can access the same storage, so that if one VM fails, the other can continue to function and access the same data. This scenario creates an ideal cluster environment by eliminating a single point of failure. Shared storage is a prerequisite in this scenario and must reside on a Fibre Channel SAN. You also must use an RDM in Physical or Virtual Compatibility Mode, as virtual disks are not a supported configuration for shared storage.
• Physical-to-virtual cluster: In this scenario, one member of the cluster is a virtual machine, whereas the other member is a physical machine. Shared storage is a prerequisite in this scenario and must be configured as an RDM in Physical Compatibility Mode.
Whenever you are designing a clustering solution, you need to address the issue of shared storage, which allows multiple hosts or VMs access to the same data. vSphere offers several methods by which you can provision shared storage (summarized in the sketch after the list), as follows:
• Virtual disks: You can use a virtual disk as a shared storage area only if you are doing clustering in a box—in other words, only if both VMs reside on the same ESX/ESXi host.
• RDM in Physical Compatibility Mode: This mode enables you to attach a physical LUN directly to a VM or physical machine. This mode prevents you from using functionality such as snapshots and is ideally used when one member of the cluster is a physical machine while the other is a VM.
• RDM in Virtual Compatibility Mode: This mode enables you to attach a physical LUN directly to a VM. This mode gives you all the benefits of virtual disks running on VMFS, including snapshots and advanced file locking. The disk is accessed via the hypervisor and is ideal when configuring a cluster-across-boxes scenario where you need to give both VMs access to shared storage.
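These storage rules lend themselves to a simple lookup table. The Python sketch below just encodes the combinations described in this section so that a proposed design can be sanity-checked; it is a summary of the text above, not an authoritative compatibility matrix.

# Shared-storage options per MSCS clustering scenario, as described above.
SUPPORTED_STORAGE = {
    "cluster-in-a-box": {"virtual disk", "RDM (virtual compatibility)"},
    "cluster-across-boxes": {"RDM (physical compatibility)", "RDM (virtual compatibility)"},
    "physical-to-virtual": {"RDM (physical compatibility)"},
}

def storage_supported(scenario, storage):
    """Return True if the storage type is valid for the clustering scenario."""
    return storage in SUPPORTED_STORAGE.get(scenario, set())

print(storage_supported("cluster-across-boxes", "virtual disk"))                 # False
print(storage_supported("physical-to-virtual", "RDM (physical compatibility)"))  # True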
At the time of this writing, the only VMware-supported clustering service is Microsoft Clustering Services (MSCS). You can consult the VMware white paper "Setup for Failover Clustering and Microsoft Cluster Service."

VMware Fault Tolerance

VMware Fault Tolerance (FT) is another form of VM clustering developed by VMware for systems that require extreme uptime. One of the most compelling features of FT is its ease of setup. FT is simply a check box that can be enabled. Compared to traditional clustering, which requires specific configurations and in some instances cabling, FT is simple but powerful.

How does it work? When you protect a VM with FT, a secondary VM is created in lockstep with the protected (primary) VM. FT works by applying every input to the primary VM and the secondary VM at the same time; every task is performed twice. If you click on the Start menu on the first VM, the Start menu on the second VM will also be clicked. The power of FT is its capability to keep both VMs in sync. If the protected VM goes down for any reason, the secondary VM immediately takes its place, seizing its identity and its IP address and continuing to service users without an interruption. The newly promoted protected VM then creates a secondary for itself on another host, and the cycle restarts.

To clarify, let's see an example. If you wanted to protect an Exchange server, you could enable FT. If for any reason the ESX/ESXi host that is carrying the protected VM fails, the secondary VM kicks in and assumes its duties without an interruption in service. The table below outlines the different High Availability and clustering technologies that you have access to with vSphere and highlights the limitations of each.
vSphere HA and clustering support matrix

       Availability type    Downtime   Supported OS                  Supported hardware
HA     High Availability    Some       All supported OS              All supported ESX hardware
FT     Fault Tolerance      None       All supported OS              All supported ESX hardware with CPUs that support FT
MSCS   Fault Tolerance      Some       Only Microsoft-supported OS   Hardware supported by Microsoft

Ideal use: HA for all VMs; FT for critical VMs; FT for critical applications.
Fault Tolerance requirements

Fault Tolerance is no different from any other enterprise feature in that certain prerequisites must be met before the technology can function properly and efficiently. These requirements are outlined in the following list, broken down into categories (a simple preflight sketch follows the list):
• Host requirements:
  o FT-compatible CPU. Check this VMware KB article for more information.
  o Hardware virtualization must be enabled in the BIOS.
  o Hosts' CPU clock speeds must be within 400 MHz of each other.
• VM requirements:
  o VMs must reside on supported shared storage (FC, iSCSI and NFS).
  o VMs must run a supported OS.
  o VMs must be stored in either a VMDK or a virtual RDM.
  o VMs cannot have a thinly provisioned VMDK and must use an eager-zeroed thick virtual disk.
  o VMs cannot have more than one vCPU configured.
• Cluster requirements:
  o All ESX/ESXi hosts must be at the same version and patch level.
  o All ESX/ESXi hosts must have access to the VM datastores and networks.
  o VMware HA must be enabled on the cluster.
  o Each host must have a vMotion and an FT logging NIC configured.
  o Host certificate checking must also be enabled.
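Several of the VM-level requirements above can be checked mechanically before you try to turn FT on. The Python sketch below is a simplified, self-contained preflight check over a hand-built description of a VM; in practice you would pull these properties from the vSphere API, and the field names used here are illustrative only.

def ft_preflight(vm):
    """Return a list of reasons this VM (described as a simple dict) can't be FT-protected."""
    problems = []
    if vm["num_vcpus"] > 1:
        problems.append("more than one vCPU")
    if vm["disk_type"] not in ("eagerzeroedthick", "virtual RDM"):
        problems.append("disk must be an eager-zeroed thick VMDK or a virtual RDM")
    if vm["thin_provisioned"]:
        problems.append("thin-provisioned disks are not supported")
    if not vm["on_shared_storage"]:
        problems.append("VM must reside on supported shared storage")
    return problems

candidate = {                       # hypothetical VM description
    "name": "exch01",
    "num_vcpus": 2,
    "disk_type": "eagerzeroedthick",
    "thin_provisioned": False,
    "on_shared_storage": True,
}
issues = ft_preflight(candidate)
print(f"{candidate['name']}: {'OK for FT' if not issues else ', '.join(issues)}")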
It is highly advisable that in addition to checking processor compatibility with FT, you check your server’s make and model compatibility with FT against the VMware Hardware Compatibility List (HCL). While FT is a great clustering solution, it is important to note that it also has certain limitations. For example, FT VMs cannot be snapshotted, and they cannot be Storage vMotioned. As a matter of fact, these VMs will automatically be flagged DRS-Disabled and will not participate in any dynamic resource load balancing.
How To Enable FT
Enabling FT is not difficult, but it does involve configuring a few different settings. The following settings need to be properly configured for FT to work:
• Enable Host Certificate Checking: To enable this setting, log on to your vCenter Server, click Administration in the menu bar, and then click vCenter Server Settings. In the left pane, click SSL Settings and check the vCenter Requires Verified Host SSL Certificates box.
• Configure Host Networking: The networking configuration for FT is easy and follows the same steps and procedures as vMotion, except that instead of checking the vMotion box, you check the Fault Tolerance Logging box, as shown in Figure 5 below.
• Turning FT On and Off: Once you have met the preceding requirements, you can turn FT on and off for VMs. This process is also straightforward: Find the VM you want to protect, right-click it, and select Fault Tolerance > Turn On Fault Tolerance.
Figure 5: FT port group settings

While FT is a first-generation clustering technology, it works impressively well and simplifies the overcomplicated traditional methods of building, configuring, and maintaining clusters. FT is impressive both from an uptime standpoint and from a seamless-failover standpoint.
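If you prefer to script host preparation, the FT Logging step can also be done through the vSphere API rather than the client. The pyVmomi sketch below is a rough outline under a few assumptions: a VMkernel interface named vmk1 already exists on the host, the vCenter address and credentials are placeholders, SmartConnectNoSSL may need to be replaced with SmartConnect plus an SSL context in newer pyVmomi releases, and the SelectVnicForNicType call should be confirmed against the API reference for your vSphere version.

# Sketch: tag an existing VMkernel NIC for Fault Tolerance logging via the vSphere API.
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVmomi import vim

si = SmartConnectNoSSL(host="vcenter.example.com", user="administrator", pwd="secret")  # placeholders
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        if host.name != "esx01.example.com":          # placeholder host name
            continue
        nic_mgr = host.configManager.virtualNicManager
        # "faultToleranceLogging" is the nicType behind the Fault Tolerance Logging check box.
        nic_mgr.SelectVnicForNicType("faultToleranceLogging", "vmk1")
        print(f"FT logging enabled on vmk1 of {host.name}")
    view.DestroyView()
finally:
    Disconnect(si)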
VMware HA implementation and ESX/ESXi host addition

Implementing VMware High Availability

High availability has been an industry buzzword that has stood the test of time. The need and/or desire for high availability is often a significant component of any infrastructure design. Within the scope of an ESX/ESXi host, VMware High Availability (HA) is a component of the vSphere 4 product that provides for the automatic failover of virtual machines. But—and it's a big but at this point in time—HA does not provide high availability in the traditional sense of the term. Commonly, high availability means the automatic failover of a service or application to another server.

Understanding HA

The VMware HA feature provides an automatic restart of the virtual machines that were running on an ESX/ESXi host at the time it became unavailable, as shown in Figure 11.15.
Figure 11.15 VMware HA provides an automatic restart of virtual machines that were running on an ESX/ESXi host when it failed.

In the case of VMware HA, there is still a period of downtime when a server fails. Unfortunately, the duration of the downtime is not a value that can be calculated, because it is unknown ahead of time how long it will take to boot a series of virtual machines. From this you can gather that, at this point in time, VMware HA does not provide the same level of high availability as found in a Microsoft server cluster solution. When a failover occurs between ESX/ESXi hosts as a result of the HA feature, there is potential for data loss because the virtual machine was effectively powered off when the server failed and then brought back up minutes later on another server.
HA Experience in the Field

With that said, I want to mention my own personal experience with HA and the results I encountered. Your mileage might vary, but this should give you a reasonable expectation of what to expect. I had a VMware ESX/ESXi host that was a member of a five-node cluster. This node crashed sometime during the night, and when the host went down, it took anywhere from 15 to 20 virtual machines with it. HA kicked in and restarted all the virtual machines as expected. What made this an interesting experience is that the crash must have happened right after the polling of the monitoring and alerting server. All the virtual machines that were on the general alerting schedule were restarted without triggering any alerts. Some of those virtual machines had more aggressive monitoring that did trip alerts, but they were recovered before anyone was able to log on to the system and investigate. I tried to argue the point that if an alert never fired, did the downtime really happen? I did not get too far with that argument but was pleased with the results.

In another case, during testing I had a virtual machine running on a two-node cluster. I pulled the power cords on the host that the virtual machine was running on to create the failure. My time to recovery from pull to ping was between five and six minutes. That's not too bad for general use but not good enough for everything. VMware Fault Tolerance can now fill that gap for even the most important and critical servers in your environment. I'll talk more about FT in a bit.

In the VMware HA scenario, two or more ESX/ESXi hosts are configured in a cluster. Remember, a VMware cluster represents a logical aggregation of CPU and memory resources, as shown in Figure 11.16. By editing the cluster settings, you can enable the VMware HA feature for a cluster. The HA cluster then determines the number of host failures it must support.
Figure 11.16 A VMware ESX/ESXi host cluster logically aggregates the CPU and memory resources from all nodes in the cluster.

HA: Within, but Not Between, Sites

A requisite of HA is that each node in the HA cluster must have access to the same SAN LUNs. This requirement prevents HA from being able to fail over between ESX/ESXi hosts in different locations unless both locations have been configured to have access to the same storage devices. It is not enough for the data in the LUNs to be the same by way of SAN replication software. Mirroring data from a LUN on a SAN in one location to a LUN on a SAN in a hot site is not conducive to allowing HA (or VMotion or DRS).

When ESX/ESXi hosts are configured into a VMware HA cluster, they receive all the cluster information. vCenter Server informs each node in the HA cluster about the cluster configuration.

HA and vCenter Server

Although vCenter Server is most certainly required to enable and manage VMware HA, it is not required to execute HA. vCenter Server is the tool that notifies each VMware HA cluster node about the HA configuration. After the nodes have been updated with the information about the cluster, vCenter Server no longer maintains a persistent connection with each node. Each node continues to function as a member of the HA cluster independent of its communication status with vCenter Server.

When an ESX/ESXi host is added to a VMware HA cluster, a set of HA-specific components is installed on the ESX/ESXi host. These components, shown in Figure 11.17, include the following:
• Automatic Availability Manager (AAM)
• Vmap
• vpxa
Figure 11.17 Adding an ESX/ESXi host to an HA cluster automatically installs the AAM, Vmap, and possibly the vpxa components on the host.

The AAM, effectively the engine or service for HA, is a Legato-based component that keeps an internal database of the other nodes in the cluster. The AAM is responsible for the intracluster heartbeat used to identify available and unavailable nodes. Each node in the cluster establishes a heartbeat with each of the other nodes over the Service Console network, or you can define another VMkernel port group for the HA heartbeat. As a best practice, you should provide redundancy to the AAM heartbeat by establishing the Service Console port group on a virtual switch with an underlying NIC team. Though the Service Console could be multihomed and have an AAM heartbeat over two different networks, this configuration is not as reliable as the NIC team.

The AAM is extremely sensitive to hostname resolution; the inability to resolve names will most certainly result in an inability to execute HA. When problems arise with HA functionality, look first at hostname resolution. Having said that, during HA troubleshooting, you should identify the answers to questions such as these (a rough script of these checks follows the list):
• Is the DNS server configuration correct?
• Is the DNS server available?
• If DNS is on a remote subnet, is the default gateway correct and functional?
• Does the /etc/hosts file have bad entries in it?
• Does the /etc/resolv.conf have the right search suffix?
• Does the /etc/resolv.conf have the right DNS server?
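Most of these questions can be answered from the Service Console itself. The Python sketch below is a rough version of those checks: it prints the resolver configuration, flags /etc/hosts entries that disagree with what the resolver returns, and resolves a list of cluster members. The file paths are the standard Service Console (Linux) locations, and the host names are placeholders.

import socket

CLUSTER_HOSTS = ["esx01.example.com", "esx02.example.com", "vcenter.example.com"]  # placeholders

def show_resolver_config():
    """Print the DNS servers and search suffixes from /etc/resolv.conf."""
    with open("/etc/resolv.conf") as f:
        for line in f:
            if line.startswith(("nameserver", "search", "domain")):
                print(line.rstrip())

def check_hosts_file():
    """Flag /etc/hosts entries whose address disagrees with what the resolver returns."""
    with open("/etc/hosts") as f:
        for line in f:
            line = line.split("#")[0].strip()
            if not line:
                continue
            addr, *names = line.split()
            for name in names:
                try:
                    resolved = socket.gethostbyname(name)
                    if resolved != addr:
                        print(f"possible bad entry: {name} is {addr} in /etc/hosts but resolves to {resolved}")
                except socket.gaierror:
                    pass  # name only exists in /etc/hosts; that may be intentional

if __name__ == "__main__":
    show_resolver_config()
    check_hosts_file()
    for host in CLUSTER_HOSTS:
        try:
            print(f"{host} -> {socket.gethostbyname(host)}")
        except socket.gaierror:
            print(f"{host} -> cannot resolve")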
Adding a Host to vCenter Server

When a new host is added into the vCenter Server inventory, the host must be added by its hostname, or HA will not function properly. As just noted, HA is heavily reliant on successful name resolution. ESX/ESXi hosts should not be added to the vCenter Server inventory using IP addresses.

The AAM on each ESX/ESXi host keeps an internal database of the other hosts belonging to the cluster. All hosts in a cluster are considered either a primary host or a secondary host. However, only one ESX/ESXi host in the cluster is considered the primary host at a given time, with all others considered secondary hosts. The primary host functions as the source of information for all new hosts and defaults to the first host added to the cluster. If the primary host experiences failure, the HA cluster will continue to function: one of the secondary hosts will move up to the status of primary host. The process of promoting secondary hosts to primary host is limited to four other hosts, so only five hosts can assume the role of primary host in an HA cluster.

While AAM is busy managing the intranode communications, the vpxa service (or vCenter Server agent) manages the HA components. The vpxa service communicates with the AAM through a third component called the Vmap.

Name Resolution Tip

If DNS is set up and configured correctly, then you should not need anything else for name resolution. However, as a method of redundancy, consider adding the other VMware ESX and vCenter Server information to the local hosts file (/etc/hosts). If there is a failure and the ESX/ESXi host is unable to talk to DNS, this setup will ensure that HA still works as designed.

Ensuring High Availability and Business Continuity
• Using Microsoft Cluster Services for virtual machine clustering
• VMware HA implementation and ESX/ESXi host addition
• HA cluster configuration: Requirements and steps

Printed with permission from Wiley Publishing Inc. Copyright 2009. Mastering VMware vSphere 4 by Scott Lowe. For more information about this title and other similar books, please visit Wiley Publishing.
Using fault-tolerant systems for more resilient data centers

Creating a true high-availability architecture with redundant networks and storage pools can do wonders for your data center, but live migration and fault-tolerant systems can bring even more business continuity benefits.

Live migration

Live migrations and true fault tolerance require a shared storage architecture. Both allow virtual machines (VMs) to be moved from one host server to another on the fly. Although there are a lot of similarities between these two features, they are used for entirely different purposes.
Live migrations are made possible by VMware's vMotion feature, and a similar feature is available in Microsoft Hyper-V R2. This feature treats the host servers as a pool of resources that can be allocated to virtual servers. You can move a virtual server from one host to another almost instantly. The live migration feature is useful if a virtual host becomes overloaded and you need to offload some of the virtual servers or take a host server down for maintenance. One thing to remember is that vMotion does not create fault-tolerant systems.

Fault-tolerant systems

VMware does, however, include a fault-tolerance feature called VMware Fault Tolerance (FT) with vSphere 4. Unlike vMotion, VMware FT is designed to rapidly detect and respond to hardware failure so that virtual servers can instantly be moved to an alternate host. This is made possible by vLockstep technology. The basic premise of vLockstep is that a primary VM and a secondary VM are kept in perfect sync. That way, if the primary VM fails, the secondary VM is ready to take over in an instant.

VLockstep technology creates fault-tolerant systems by ensuring that both the primary and the secondary VMs execute the same instructions in the same sequence. The primary virtual server's instruction set is passed to the secondary VM using a dedicated server backbone network. The backbone network is also used to transmit heartbeats between the primary and secondary VMs so that failures can be quickly detected. The interesting thing about vLockstep technology is that, because the primary and secondary virtual servers are both executing the same instruction sets, both VMs initiate disk writes. But because both VMs are connected to the same storage pool, VMware FT suppresses write operations on the secondary VM. This ensures that only one VM is making changes to the data on the virtual hard drive.

VMware FT can be used within a VMware High Availability cluster. This allows multiple failovers to occur. If the primary VM fails, then failover occurs, and the secondary VM becomes the primary. VMware HA will automatically create a new secondary VM on another cluster node. This allows the VM to remain fault tolerant in spite of the failure that has occurred on the original host server.

Although creating a resilient data center does not necessarily require you to create traditional server clusters, using redundant hardware is still a must. To make VM migrations and fault tolerance possible, your data center must provide centralized storage that is accessible to all host servers but without creating a single point of failure.

Maintaining high availability of SQL Server virtual machines

Since the release of Hyper-V, Microsoft has continued its commitment to server virtualization by releasing new software products that are optimized for just that. This is the case for SQL Server 2008, among other Microsoft products. As discussed previously for fault-tolerant virtual installations, the Microsoft SQL Server support team has published specific strategies for virtualizing SQL Server installations. These strategies include different guidelines for SQL Server virtualization, the most interesting of which are those used to achieve fault tolerance.

The SQL Server support team does not support the creation of a cluster at the virtual machine level. This means you cannot create a fault-tolerant virtual machine by creating a SQL Server cluster inside the VMs. You can, however, create a fault-tolerant VM by creating a cluster at the host server level. It basically works like this.
Since each host server in a server virtualization resource pool will run several virtual machines at once, most organizations will create fault-tolerant host server configurations to protect these VMs. Once the hosts are made redundant through a host cluster, each and every one of the virtual machines running on these hosts becomes a protected application and thus gains a certain level of fault tolerance. In the event of a host failure, the VMs running on this host will also fail but will automatically be restarted on another host in the cluster. This is one strategy that can be used to create SQL Server virtual machines and ensure high availability. The process is simple:

1. Prepare the physical server nodes as well as the shared storage component they require to join a cluster.
2. Install the hypervisor. For example, with Windows Server 2008, you must first install the operating system and then enable the Hyper-V role.
3. Create a host cluster. This means installing the Failover Clustering feature on both nodes in Windows Server 2008. In Hyper-V, you need to perform two additional actions:
   a. Create a virtual network. This is performed in the Hyper-V Manager through the Virtual Network Manager. You must add a new external network adapter linked to a physical adapter, and this action must be performed on all nodes of the cluster. In addition, the name of the new virtual adapter must be identical on each cluster node in order for VM failover to work.
   b. Validate the cluster configuration and create the cluster. This will ensure that all of the components required for cluster operation are in place before you actually create the cluster.
4. Once the cluster is created, you can create a VM that will host SQL Server and set it up for high availability. First create or copy the VM to the cluster, then use the Failover Clustering Management console to make the VM highly available.

You will then have a fault-tolerant SQL Server virtual machine. When and if the host node running the SQL Server VM fails, the VM will automatically be restarted on another node of the cluster. While this does not make SQL Server aware of the failover, it does ensure that the virtual machine is always running (see Figure 1).

Figure 1
SQL SERVER AND MICROSOFT HYPER-V
• Part 1: Creating fault tolerant installations
• Part 2: Maintaining high availability
• Part 3: Protecting virtual databases
• Part 4: Creating virtual appliances
• Part 5: Deploying virtual appliances
High-availability and clustering solutions for vSphere VMs
Clustering applications running on physical servers can be complicated and costly. For VARs looking to provide customers with clustering solutions, virtualization can help.
A great way for solution providers to create new business opportunities is to implement the advanced features in VMware vSphere for your customers. Advanced high-availability and fault-tolerance (FT) features in vSphere use the virtualization architecture to provide an easy clustering solution for virtual machines (VMs). Before you can help customers decide which of these features best fits their environment, you need to define clustering at different layers in the computing stack, where it can protect against different types of failures:

• Application clustering — This is typically built into applications that handle replication and failover to another server on their own without involving the operating system (OS). By installing the application on two different servers, the clustering feature can be configured in the application. An example of this type of clustering is Lotus Domino, which allows administrators to cluster multiple Domino servers.
• OS clustering — This type of clustering is handled by the OS, which is responsible for syncing and cutting over an application from one server to another. The typical architecture for this is two servers that use a shared disk the application resides on. An example of this type of clustering is Microsoft Cluster Server (MSCS).
• Hardware clustering — This type of clustering is done at the hardware layer and can consist of different hardware components inside and outside a server that prevent a single hardware component failure from crashing a server. Examples include RAID, redundant power supplies, multiple NICs, CPUs and memory dual in-line memory modules.

HA clustering for vSphere

vSphere is able to provide both HA and clustering at each of these layers using built-in features. Although these features require some of the more expensive vSphere licenses, they provide an easy way to implement inexpensive, simple clustering and HA solutions for your customers. Let's take a look at how the vSphere features can help provide HA and clustering at the different computing layers.

• Application clustering — vSphere cannot provide true application clustering, but it can provide HA for applications. HA Application Monitoring is a feature introduced in vSphere 4.1. It enables the HA feature to monitor the heartbeat of applications that have been modified to transmit a heartbeat that vSphere can detect, and to restart a VM if the application is unresponsive. This adds another layer of the stack for which HA can monitor uptime (host, OS and application). There are currently no applications that support this, but there will probably be some in the future.
• OS clustering — vSphere cannot provide true OS clustering, but it can provide HA for operating systems.
A feature that was introduced in vCenter Server 2.5 called Virtual Machine Monitoring (VMM) extends the HA feature to detect guest OS failures by monitoring a heartbeat provided by VMware Tools. If a guest OS failure is detected—such as a Windows blue screen—the heartbeat stops being received, and the VM is restarted on the same host. VMware went a step further to try to prevent false restarts and enabled VMM to check for VM disk and network activity to be certain that the OS is truly unresponsive. This feature works with any OS as long as VMware Tools is installed.
• Hardware clustering — Although the underlying physical hardware of a vSphere host may be redundant to protect against a single hardware component failure, vSphere also provides some virtual hardware redundancy at the networking and storage layers. Virtual Switches, or vSwitches, can be configured with multiple physical NICs so VMs don't lose network connectivity if a failure with the connectivity of a physical NIC occurs. With storage adapters, vSphere supports multipathing—except for NFS devices—so if a single path to a storage device fails, alternate paths can be used.

FT clustering in vSphere

The true clustering feature in vSphere is FT, which maintains an identical copy of a VM on a second host. This clustering is done at the virtualization layer, and the guest OS is unaware of it. FT is designed to protect only against a host failure, where you would normally lose all the VMs on a host until the host was brought back up or the VMs were started on other hosts with the HA feature. It does not protect against an OS failure, other hardware failures, or failures of a shared storage device.

The primary VM on one host and the secondary VM on another host stay in sync by using a technology called Record/Replay. FT works by creating a secondary VM on another host that shares the same virtual disk file as the primary VM and then transferring the CPU and virtual device inputs from the primary VM (record) to the secondary VM (replay) through an FT logging NIC, so the secondary is in sync with the primary and ready to take over in case of a failure. Although both the primary and secondary VMs receive the same inputs, only the primary VM produces output such as disk writes and network transmits. Because the secondary VM's output is suppressed by the hypervisor and is not on the network until it becomes a primary VM, both VMs function as a single VM. Because the primary and secondary VMs are identical copies, if a failure such as a Windows BSOD occurs in the primary VM, it will also occur in the secondary VM. So even though FT does provide additional protection for a VM, it doesn't provide total protection.

Using MSCS in vSphere

To achieve maximum protection, solution providers need to use a clustering solution such as MSCS. Virtualization makes implementing MSCS easy and affordable because solution providers create just two VMs without the need to purchase an additional physical server. There are two methods for implementing MSCS in vSphere: putting both VMs on the same host (cluster in a box) or putting the VMs on separate hosts (cluster across boxes). The cluster in a box protects against application and OS failures, and the cluster across boxes provides the additional protection against host hardware failure. Both solutions require the VMs to access the same virtual disk file. With the cluster in a box, the disk can reside either on local disk or on storage area network (SAN) disk. But with the cluster across boxes, it must reside on SAN disk.
The requirements for using MSCS on vSphere are fairly straightforward and mostly storage related. For a cluster in a box, solution providers can use standard virtual disks—which is recommended—or a Raw Device Mapping (RDM) to a SAN disk in virtual compatibility mode. For a cluster across boxes, solution providers cannot use standard virtual disks; they must use RDMs in either physical—which is recommended—or virtual compatibility mode.
Only Fibre Channel SAN disk is supported for use with the RDM disks; iSCSI, network file system (NFS) and Fibre Channel over Ethernet disks are not supported because of the latency that may occur with those protocols. Although MSCS can provide maximum protection for critical applications running on VMs, it does have some limitations. The FT, Distributed Resource Scheduler, VMotion and HA features are not supported on VMs using MSCS. The loss of these features isn't a huge deal because they are mostly used to provide availability, which MSCS is already providing.

As you can see, VARs have many availability options they can offer their customers for VMs running on vSphere. Your selection will depend on your customers' requirements. Some may need the maximum protection that MSCS offers, and others might be OK with the more limited protection that FT and HA provide. One thing to consider when choosing a solution for your customers is that the FT feature has some strict requirements and limitations that may not be a good fit for everyone; make sure you are aware of them before implementing it. Also, not all of the advanced features are included in every vSphere edition, so your customers may have to upgrade their vSphere licenses to make use of them. Additionally, FT and HA require shared storage, so you might have to make some architecture changes to implement them properly. If you plan on implementing MSCS, be sure to read the VMware guide on how to properly implement it. No matter what availability solution you choose for your customers, make sure you understand its capabilities and limitations. Don't forget to test it thoroughly to ensure that it performs as expected.
Upgrading from ESXi 5.x to ESXi 5.1 using VMware Update Manager
In-place upgrades of ESXi hosts are quick and simple and well suited for small environments. However, if you want to upgrade a large number of ESXi hosts without any downtime, you can use VMware Update Manager, which is provided with the vCenter Server installation ISO, to do the job. Before you can upgrade your ESXi hosts, make sure that you upgrade the vCenter Server and VMware Update Manager to version 5.1 first. Also, make sure you have a backup of your hosts in case something goes wrong and you have to restore the hosts to their original state. VMware Update Manager is not available in the vSphere Web Client yet, so you will have to use the good, old vSphere Client. Before you begin, download the latest ESXi ISO image from VMware.
1. First, you need to import the new image to VMware Update Manager.
2. Log in to the vCenter Server and, from the Home pane, navigate to Update Manager under Solutions and Applications.
3. Go to the ESXi Images tab and click Import ESXi Image.
4. In the Import ESXi Image window, click Browse and select the ESXi ISO image.
5. With the image loaded, click Next to continue.
6. Ignore the security warning and install the certificate. Wait until the image is processed. Click Next and create a new baseline.
7. Click Finish. The image and all software packages in it will now be visible.
8. Go back to the Home page and navigate to Hosts and Clusters.
9. Select your cluster and go to the Update Manager tab. Click Attach.
10. Select the upgrade baseline created earlier and click Attach.
11. When the baseline is loaded, click Scan to scan the hosts in the cluster against the baseline. Make sure you select the Upgrades check box and unselect Patches and Extensions. Click Scan to start the process.
12. When the scan completes, you should see how many hosts are compliant and non-compliant with the baseline. In my case, two hosts were non-compliant.
13. Select the non-compliant hosts and click Remediate.
14. In the Remediate wizard, verify that the correct hosts are selected and click Next.
15. Accept the EULA and click Next.
16. If you have incompatible third-party software running on your ESXi hosts, you can remove it at this point. Click Next.
17. Under the Schedule step, type the task name and set the desired remediation time. Click Next.
18. In the Maintenance Mode Options, select the appropriate settings for your environment and click Next.
19. Do the same in the Cluster Remediation Options and click Next.
20. Review the settings in the Ready to Complete window and click Finish to start the upgrade process.
21. Watch the progress; VMs will be migrated to other hosts. The host that is being upgraded should enter maintenance mode, disconnect from the vCenter Server, upgrade to ESXi 5.1, reboot and reconnect to vCenter Server again.
When all hosts are upgraded, they should be 100% compliant with the baseline, and you have successfully migrated your hosts to ESXi 5.1.
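Once remediation finishes, it is worth confirming from a script that every host really reports the new build. The pyVmomi sketch below lists each host's product string; the vCenter address and credentials are placeholders, and as with the earlier sketches the connect call may differ slightly between pyVmomi releases.

# Sketch: list the ESXi version reported by every host after the upgrade.
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVmomi import vim

si = SmartConnectNoSSL(host="vcenter.example.com", user="administrator", pwd="secret")  # placeholders
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        product = host.summary.config.product  # AboutInfo, e.g. "VMware ESXi 5.1.0 build-799733"
        print(f"{host.name}: {product.fullName}")
    view.DestroyView()
finally:
    Disconnect(si)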